Survey/Pulse: Beam on Flink/Spark/Samza/etc

2024-04-29 Thread Austin Bennett
Curious who all is using Beam on runners other than Dataflow.

Please respond either on-list or to me directly ...  Mostly just curious
about the extent to which Beam is fulfilling its promise of runner
agnosticism.  Getting good data on that is hard, so anecdotes would be very
welcome!


[Question] Apache Beam Spark Runner Support - Spark 3.5, Scala 2.13 Environment

2024-03-22 Thread Sri Ganesh Venkataraman
Hello,

Does the latest Apache Beam version support a Spark 3.5, Scala 2.13
environment for the Spark runner?
Based on initial analysis, the Spark runner supports Spark 3.5.0.
*Reference*:
https://github.com/apache/beam/blob/f660f49b6cf5d241285b156ccc4a5f52d82cfa63/runners/spark/3/build.gradle#L36

I would like to know about compatibility with Scala 2.13.

Based on the references below, support is extended only to Scala 2.12
(the libraries are compiled with Scala 2.12):
https://github.com/apache/beam/blob/0dc330e6d53cce51b13bcb652e80859c1b3a5975/runners/spark/3/build.gradle#L24
https://mvnrepository.com/artifact/org.apache.beam/beam-runners-spark-3/2.54.0

Any insight/future roadmap plans on Scala 2.13 support would be very helpful.

Thanks,
Sri Ganesh V


Beam on spark yarn could only start one worker

2024-01-04 Thread Huiming Jiang



Hello,
  We are using Beam with the Python SDK and want to deploy to Spark on a YARN cluster.
  But we found that it could only start one Spark worker, and the Beam SDK Docker container is started only on that worker.
  - The printed Beam pipeline options include:
{key:"beam:option:spark_master:v1" value:{string_value:"local[4]"}}  
{key:"beam:option:direct_runner_use_stacked_bundle:v1" value:{bool_value:true}}  

  Our steps:
First, we generate the jar file from the Python pipeline:
python3.7 -m  beam_demo \
--spark_version=3 \
--runner=SparkRunner \
--spark_job_server_jar=/home/hadoop/beam-runners-spark-3-job-server-2.52.0.jar \
--sdk_container_image=apache/beam_python3.7_sdk_boto3:2.45.0   \
--sdk_harness_container_image=apache/beam_python3.7_sdk_boto3:2.45.0 \
--worker_harness_container_image=apache/beam_python3.7_sdk_boto3:2.45.0 \
--environment_type=DOCKER --environment_config=apache/beam_python3.7_sdk_boto3:2.45.0 \
--sdk_location=container  \
--num_workers=2 \
--output_executable_path=jars/beam_demo.jar

  Second, we submit it to Spark:
 spark-submit   --master yarn --deploy-mode cluster jars/beam_demo.jar

  
Our questions are:
1. We configured the SparkRunner. Why does the printed output show the direct_runner_use_stacked_bundle option?
2. We submitted the job to Spark on YARN. Why is the printed spark_master local[4]?
3. Can Python Beam on a Spark YARN cluster start multiple workers?

Could you please help check these questions? Maybe we are missing something. Thank you very much for your help.
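
A side note on the two printed values above: the serialized pipeline options include the defaults of every registered option group, not only those of the runner actually in use. A minimal sketch (assuming a recent Beam Python SDK) that makes this visible locally:

```
from apache_beam.options.pipeline_options import PipelineOptions

# Build options roughly the way the jar-generating command above does
# (most flags omitted for brevity).
opts = PipelineOptions(['--runner=SparkRunner', '--spark_version=3'])

# get_all_options() also returns defaults of option groups that are not in
# effect, e.g. direct_runner_use_stacked_bundle=True (a DirectRunner default)
# and spark_master_url='local[4]' (a Spark option default); defaults like
# these can show up in the printed options even when they are not used by
# the chosen deployment.
for name, value in sorted(opts.get_all_options().items()):
    print(name, '=', value)
```

This does not by itself explain the single-worker behaviour, but it separates the option-printing questions (1 and 2) from the parallelism question (3).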
  


 



SparkRunner / PortableRunner Spark config besides Spark Master

2023-12-08 Thread Janek Bevendorff

Hi,

I'm struggling to figure out the best way to make Python Beam jobs 
execute on a Spark cluster running on Kubernetes. Unfortunately, the 
available documentation is incomplete and confusing at best.


The most flexible way I found was to compile a JAR from my Python job 
and submit that via spark-submit. Unfortunately, this seems to be 
extremely buggy and I cannot get logs from the SDK containers fed through 
the Spark executors back to the driver. See: 
https://github.com/apache/beam/issues/29683


The other way would be to use a Beam job server, but here I cannot find 
a sensible way to set any Spark config options besides the master URL. I 
have a spark-defaults.conf with vital configuration, which needs to be 
passed to the job. I see two ways forward here:


1) I could let users run the job server locally in a Docker container. 
This way they could potentially mount their spark-defaults.conf 
somewhere, but I don't really see where (pointers here?). They would 
also need to mount their Kubernetes access credentials somehow, 
otherwise the job server cannot access the cluster.


2) I could run the Job server in the Kubernetes cluster, which would 
resolve the Kubernetes credential issue but not the Spark config issue. 
Though, even if that were solved, I would now force all users to use the 
same Spark config (not ideal).


Is there a better way? From what I can see, the compiled JAR is the only 
viable option, but the log issue is a deal breaker.


Thanks
Janek
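
For context, the job-server path mentioned above looks roughly like this from the Python side; the endpoint and SDK image below are placeholders, and notably there is no pipeline option for arbitrary spark.* settings, which is exactly the gap described here:

```
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder values: a Beam Spark job server reachable at localhost:8099 and
# a stock Python SDK harness image. Spark settings such as
# spark.executor.memory have no counterpart among these options.
options = PipelineOptions([
    '--runner=PortableRunner',
    '--job_endpoint=localhost:8099',
    '--environment_type=DOCKER',
    '--environment_config=apache/beam_python3.10_sdk:2.52.0',
])

with beam.Pipeline(options=options) as pipeline:
    _ = pipeline | beam.Create([1, 2, 3]) | beam.Map(print)
```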





Re: [Question] Apache Beam Spark Runner Support - Spark 3.5 Environment

2023-11-09 Thread Alexey Romanenko
I already added Spark 3.5.0 to the Beam Spark version tests [1] and I 
didn’t notice any regression. 

The next Beam release (2.53.0) should be available in a couple of months, 
depending on the release preparation process.

—
Alexey

[1] https://github.com/apache/beam/pull/29327

> On 9 Nov 2023, at 06:37, Giridhar Addepalli  wrote:
> 
> Thank you Alexey for sharing the details.
> 
> Can you please let us know if you are planning to add Spark 3.5.0 
> compatibility test as part of Beam 2.53.0 or not.
> If so, approximately what is the timeline we are looking at for Beam 2.53.0 
> release.
> https://github.com/apache/beam/milestone/17
> 
> Thanks,
> Giridhar.
> 
> On Tue, Nov 7, 2023 at 6:24 PM Alexey Romanenko  <mailto:aromanenko@gmail.com>> wrote:
>> Hi Giridhar,
>> 
>>> On 4 Nov 2023, at 08:04, Giridhar Addepalli >> <mailto:giridhar1...@gmail.com>> wrote:
>>> 
>>>  Thank you Alexey for the response.
>>> 
>>> We are using Beam 2.41.0 with Spark 3.3.0 cluster. 
>>> We did not run into any issues. 
>>> is it because in Beam 2.41.0, compatibility tests were run against spark 
>>> 3.3.0 ?
>>> https://github.com/apache/beam/blob/release-2.41.0/runners/spark/3/build.gradle
>> 
>> Correct. 
>> There are some incompatibilities between Spark 3.1/3.2/3.3 versions and we 
>> fixed this for Spark runner in Beam 2.41 to make it possible to compile and 
>> run with different Spark versions. That was a goal of these compatibility 
>> tests.
>> 
>> Fixes #22156: Fix Spark3 runner to compile against Spark 3.2/3.3 by mosche · 
>> Pull Request #22157 · apache/beam
>> <https://github.com/apache/beam/pull/22157>
>> 
>>> If so, since compatibility tests were not run against Spark 3.5.0 even in 
>>> latest release of Beam 2.52.0, is it not advised to use Beam 2.52.0 with 
>>> Spark 3.5.0 cluster ?
>> 
>> I’d say, for now it's up to user to test and run it since it was not tested 
>> on Beam CI. 
>> I’m going to add this version for future testing.
>> 
>> —
>> Alexey
>> 
>>> 
>>> Thanks,
>>> Giridhar.
>>> 
>>> On 2023/11/03 13:05:45 Alexey Romanenko wrote:
>>> > AFAICT, the latest tested (compatibility tests) version for now is 3.4.1 
>>> > [1] We may try to add 3.5.x version there.
>>> >  
>>> > I believe that ValidateRunners tests are run only against default Spark 
>>> > 3.2.2 version.
>>> >
>>> > —
>>> > Alexey
>>> >
>>> > [1] 
>>> > https://github.com/apache/beam/blob/2aaf09c0eb6928390d861ba228447338b8ca92d3/runners/spark/3/build.gradle#L36
>>> >
>>> >
>>> > > On 3 Nov 2023, at 05:06, Sri Ganesh Venkataraman >> > > <mailto:sr...@gmail.com>> wrote:
>>> > >
>>> > > Does Apache Beam version (2.41.0)  or latest (2.51.0) support Spark 3.5 
>>> > > environment for spark runner ?
>>> > >
>>> > > Apache Beam - Spark Runner Documentation states -
>>> > > The Spark runner currently supports Spark’s 3.2.x branch
>>> > >
>>> > > Thanks
>>> > > Sri Ganesh V
>>> >
>>> > 
>> 



RE: Re: [Question] Apache Beam Spark Runner Support - Spark 3.5 Environment

2023-11-04 Thread Giridhar Addepalli
 Thank you Alexey for the response.


We are using Beam 2.41.0 with Spark 3.3.0 cluster.
We did not run into any issues.

is it because in Beam 2.41.0, compatibility tests were run against spark
3.3.0 ?
https://github.com/apache/beam/blob/release-2.41.0/runners/spark/3/build.gradle

If so, since compatibility tests were not run against Spark 3.5.0 even in
latest release of Beam 2.52.0, is it not advised to use Beam 2.52.0 with
Spark 3.5.0 cluster ?

Thanks,

Giridhar.


On 2023/11/03 13:05:45 Alexey Romanenko wrote:

> AFAICT, the latest tested (compatibility tests) version for now is 3.4.1
> [1] We may try to add 3.5.x version there.
>
> I believe that ValidateRunners tests are run only against default Spark
> 3.2.2 version.
>
> —
> Alexey
>
> [1]
> https://github.com/apache/beam/blob/2aaf09c0eb6928390d861ba228447338b8ca92d3/runners/spark/3/build.gradle#L36
>
>
> > On 3 Nov 2023, at 05:06, Sri Ganesh Venkataraman wrote:
> >
> > Does Apache Beam version (2.41.0) or latest (2.51.0) support Spark 3.5
> > environment for spark runner ?
> >
> > Apache Beam - Spark Runner Documentation states -
> > The Spark runner currently supports Spark’s 3.2.x branch
> >
> > Thanks
> > Sri Ganesh V


Re: [Question] Apache Beam Spark Runner Support - Spark 3.5 Environment

2023-11-03 Thread Alexey Romanenko
AFAICT, the latest tested (compatibility tests) version for now is 3.4.1 [1]. We 
may try to add a 3.5.x version there.

I believe that ValidatesRunner tests are run only against the default Spark 3.2.2 
version.

—
Alexey

[1] 
https://github.com/apache/beam/blob/2aaf09c0eb6928390d861ba228447338b8ca92d3/runners/spark/3/build.gradle#L36


> On 3 Nov 2023, at 05:06, Sri Ganesh Venkataraman 
>  wrote:
> 
> Does Apache Beam version (2.41.0)  or latest (2.51.0) support Spark 3.5 
> environment for spark runner ?
> 
> Apache Beam - Spark Runner Documentation states -
> The Spark runner currently supports Spark’s 3.2.x branch
> 
> Thanks
> Sri Ganesh V



[Question] Apache Beam Spark Runner Support - Spark 3.5 Environment

2023-11-02 Thread Sri Ganesh Venkataraman
Does Apache Beam version 2.41.0 or the latest (2.51.0) support a Spark 3.5
environment for the Spark runner?

Apache Beam - Spark Runner Documentation states -
The Spark runner currently supports Spark’s 3.2.x branch

Thanks
Sri Ganesh V


Re: Passing "conf" arguments using a portable runner in Java (spark job runner)

2023-07-20 Thread Moritz Mack
Hi Jon,

sorry for the late reply.

A while ago I was struggling with this as well. Unfortunately, there’s no direct way 
to do this per pipeline.
However, you can set default arguments by passing them to the job service 
container using the environment variable _JAVA_OPTIONS.

I hope this still helps!
Cheers,
Moritz



On 30.06.23, 20:29, "Jon Molle via user"  wrote:


Hi,

I'm having trouble trying to figure out where to pass "conf" spark-submit 
arguments to a spark job service. I don't particularly care at this point 
whether the job service uses the same set of args for all jobs passed to the 
service, I'm just having trouble finding where I can send these args where they 
will get picked up by the service.

For reference, I'm using the portable pipeline options (IE: the portable 
runner, job endpoint, and DOCKER environment type).

Has anyone tried this, specifically in Java? I'm actually writing in Kotlin, 
but all the examples of the portable runner are in Python, which is 
significantly different from the Java design.

Thanks!



Re: Does beam spark runner support yarn-cluster mode

2023-07-12 Thread Kai Jiang
Yeah, I think you could check out this document:
https://beam.apache.org/documentation/runners/spark/#running-on-dataproc-cluster-yarn-backed
The Spark runner should support running in yarn-cluster mode.

On Tue, Jul 11, 2023 at 7:58 PM Jeff Zhang  wrote:

> Hi all,
>
> I didn't find much material about spark runner, just wondering
> whether beam spark runner support yarn-cluster mode. Thanks.
>
>
> --
> Best Regards
>
> Jeff Zhang
>


Does beam spark runner support yarn-cluster mode

2023-07-11 Thread Jeff Zhang
Hi all,

I didn't find much material about the Spark runner; just wondering whether the Beam
Spark runner supports yarn-cluster mode. Thanks.


-- 
Best Regards

Jeff Zhang


Passing "conf" arguments using a portable runner in Java (spark job runner)

2023-06-30 Thread Jon Molle via user
Hi,

I'm having trouble trying to figure out where to pass "conf" spark-submit
arguments to a spark job service. I don't particularly care at this point
whether the job service uses the same set of args for all jobs passed to
the service, I'm just having trouble finding where I can send these args
where they will get picked up by the service.

For reference, I'm using the portable pipeline options (i.e., the portable
runner, job endpoint, and DOCKER environment type).

Has anyone tried this, specifically in Java? I'm actually writing in
Kotlin, but all the examples of the portable runner are in Python, which is
significantly different from the Java design.

Thanks!


[Question]dataproc(gce)+spark+beam(python) seems only use one worker

2023-03-06 Thread 赵 毓文
Hi Beam Community,

I followed this
doc (https://beam.apache.org/documentation/runners/spark/#running-on-dataproc-cluster-yarn-backed)
and tried to use Dataproc to run my Beam pipeline.

I created a Dataproc cluster (1 master, 2 workers), but I could only see one 
worker working (I used `docker ps` to check the worker nodes).
I also increased the cluster to 4 workers and got the same result.

Are there some parameters that I could use, or am I doing something wrong?
I also tried dataproc (GCE) + Flink + Beam (Python) with the `parallelism` option, 
which did not seem to work either.


This is my code.
```
import apache_beam as beam
import logging
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    '--runner=SparkRunner',
    '--output_executable_path=job.jar',
    '--spark_version=3',
])

class SplitWords(beam.DoFn):
  def __init__(self, delimiter=','):
    self.delimiter = delimiter

  def process(self, text):
    import time
    time.sleep(10)  # simulate slow per-element work

    for word in text.split(self.delimiter):
      yield word

logging.getLogger().setLevel(logging.INFO)
data = ['Strawberry,Carrot,Eggplant', 'Tomato,Potato'] * 800


with beam.Pipeline(options=options) as pipeline:
  plants = (
      pipeline
      | 'Gardening plants' >> beam.Create(data)
      | 'Split words' >> beam.ParDo(SplitWords(','))
      | beam.Map(print))
```
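
One thing often worth trying with a pipeline like the one above: beam.Create() on a small in-memory list tends to produce very few bundles, so the slow ParDo can end up on a single worker. A minimal, unverified sketch that inserts a Reshuffle to redistribute the elements first (the original SplitWords DoFn would work the same way as the FlatMap used here):

```
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    '--runner=SparkRunner',
    '--output_executable_path=job.jar',
    '--spark_version=3',
])

data = ['Strawberry,Carrot,Eggplant', 'Tomato,Potato'] * 800

with beam.Pipeline(options=options) as pipeline:
  _ = (
      pipeline
      | 'Gardening plants' >> beam.Create(data)
      # Reshuffle breaks fusion with Create and rebalances elements across
      # workers before the slow per-element processing runs.
      | 'Spread work' >> beam.Reshuffle()
      | 'Split words' >> beam.FlatMap(lambda line: line.split(','))
      | beam.Map(print))
```

Whether this actually spreads work across the Dataproc workers also depends on the Spark-side parallelism (for example spark.default.parallelism and the number of executors requested), so it is only a starting point.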

Thanks
Yuwen Zhao


[Announcement] Planned removal of Spark 2 runner support in 2.46.0

2023-01-23 Thread Moritz Mack
Dear All,

The runner for Spark 2 was deprecated quite a while back in August 2022 with 
the release of Beam 2.41.0 [1]. We’re planning to move ahead with this and 
finally remove support for Spark 2 (beam-runners-spark) to only maintain 
support for Spark 3 (beam-runners-spark-3) going forward.

Note, Spark 2.4.8 is the latest release of Spark 2. It was released in May 2021 
and no further releases of 2.4.x should be expected, even for bug fixes [2].

If you are still relying on the Spark 2 runner and there are good reasons not to 
migrate to Spark 3, please reach out and we might reconsider this.

Kind regards,
Moritz


[1] https://github.com/apache/beam/blob/master/CHANGES.md#2410---2022-08-23
[2] https://spark.apache.org/versioning-policy.html







Re: Metrics in Beam+Spark

2022-07-18 Thread Moritz Mack
Yes, that’s exactly what I was referring to.

A - hopefully - easy way to avoid this problem might be to change the Spark 
configuration to use the following:

--conf "spark.metrics.conf.driver.sink.jmx.class"="com.salesforce.einstein.data.platform.connectors.JmxSink"
--conf "spark.metrics.conf.executor.sink.jmx.class"="org.apache.spark.metrics.sink.JmxSink"

Beam metrics are only exposed on the driver, so it’s enough to use your custom 
Sink on the driver. Those classpath issues are limited to executor nodes if I 
remember right (and I’d be surprised to see the same on the driver). Though, I 
haven’t tested …

By distributing such code by other means, I meant adding the relevant classes 
to the system classpath of the executor ($SPARK_HOME/jars). How to do that 
depends heavily on your infrastructure. For example, on Kubernetes it could be 
a base image that already contains the custom sink; on EMR it could be a 
bootstrap action that copies the sink from S3 … Of course, that’s not great 
because it affects everything running on the cluster with all the potential 
problems this creates.

/ Moritz



On 16.07.22, 07:54, "Yushu Yao"  wrote:

Hi Moritz, When dealing with custom sinks, it’s fairly common to run into 
classpath issues :⁠​/ The metric system is loaded when an executor starts up, 
by then the application classpath isn’t available yet.⁠​ Typically, that means 
distributing
ZjQcmQRYFpfptBannerStart
This Message Is From an External Sender
This message came from outside your organization.
Exercise caution when opening attachments or clicking any links.
ZjQcmQRYFpfptBannerEnd
Hi Moritz,


When dealing with custom sinks, it’s fairly common to run into classpath issues 
:/ The metric system is loaded when an executor starts up, by then the 
application classpath isn’t available yet. Typically, that means distributing 
such code by other means.


Could you also elaborate a bit about "distributing such code by other means"?

I wrapped JmxSink (just like CsvSink for beam). And now gets the following 
error.

ERROR MetricsSystem: Sink class 
com.salesforce.einstein.data.platform.connectors.JmxSink cannot be instantiated
 Caused by: java.lang.ClassNotFoundException: 
com.salesforce.einstein.data.platform.connectors.JmxSink

I guess this is the classpath issue you were referring to above. Any hints on 
how to fix it will be greatly appreciated.
Thanks!
-Yushu

/Moritz


On 14.07.22, 21:26, "Yushu Yao" 
mailto:yao.yu...@gmail.com>> wrote:

Hi Team,
Does anyone have a working example of a beam job running on top of spark? So 
that I can use the beam metric syntax and the metrics will be shipped out via 
spark's infra?

The only thing I achieved is to be able to queryMetrics() every half second and 
copy all the metrics into the spark metrics.
Wondering if there is a better way?

Thanks!
-Yushu


MetricQueryResults metrics =
pipelineResult
.metrics()
.queryMetrics(
MetricsFilter.builder()
.addNameFilter(MetricNameFilter.inNamespace(namespace))
.build());

for (MetricResult cc : metrics.getGauges()) {
LOGGER.info("Adding Gauge: {} : {}", cc.getName(), cc.getAttempted());
com.codahale.metrics.Gauge gauge =
new SimpleBeamGauge(cc.getAttempted().getValue());
try {
String name = metricName(cc.getKey(), addStepName, addNamespace);
if (registry.getNames().contains(name)) {
LOGGER.info("Removing metric {}", name);
registry.remove(name);
}
registry.register(name, gauge);
} catch (IllegalArgumentException e) {
LOGGER.warn("Duplicated metrics found. Try turning on 
addStepName=true.", e);
}
}





Re: Metrics in Beam+Spark

2022-07-15 Thread Yushu Yao
Hi Moritz,


>
> When dealing with custom sinks, it’s fairly common to run into classpath
> issues :/ The metric system is loaded when an executor starts up, by then
> the application classpath isn’t available yet. Typically, that means
> distributing such code by other means.
>
>
>

Could you also elaborate a bit about "distributing such code by other
means"?

I wrapped JmxSink (just like the CsvSink for Beam) and now get the following
error:

ERROR MetricsSystem: Sink class
com.salesforce.einstein.data.platform.connectors.JmxSink cannot be
instantiated
 Caused by: java.lang.ClassNotFoundException:
com.salesforce.einstein.data.platform.connectors.JmxSink

I guess this is the classpath issue you were referring to above. Any hints
on how to fix it will be greatly appreciated.
Thanks!
-Yushu

/Moritz
>
>
>
>
>
> On 14.07.22, 21:26, "Yushu Yao"  wrote:
>
>
>
>
> Hi Team,
>
> Does anyone have a working example of a beam job running on top of spark?
> So that I can use the beam metric syntax and the metrics will be shipped
> out via spark's infra?
>
>
>
> The only thing I achieved is to be able to queryMetrics() every half
> second and copy all the metrics into the spark metrics.
>
> Wondering if there is a better way?
>
>
>
> Thanks!
>
> -Yushu
>
>
>
> MetricQueryResults metrics =
> pipelineResult
> .metrics()
> .queryMetrics(
> MetricsFilter.*builder*()
> .addNameFilter(MetricNameFilter.*inNamespace*(namespace))
> .build());
>
> for (MetricResult cc : metrics.getGauges()) {
> *LOGGER*.info("Adding Gauge: {} : {}", cc.getName(), cc.getAttempted());
> com.codahale.metrics.Gauge gauge =
> new SimpleBeamGauge(cc.getAttempted().getValue());
> try {
> String name = metricName(cc.getKey(), addStepName, addNamespace);
> if (registry.getNames().contains(name)) {
> *LOGGER*.info("Removing metric {}", name);
> registry.remove(name);
> }
> registry.register(name, gauge);
> } catch (IllegalArgumentException e) {
> *LOGGER*.warn("Duplicated metrics found. Try turning on 
> addStepName=true.", e);
> }
> }
>
>
>
>


Re: Metrics in Beam+Spark

2022-07-15 Thread Moritz Mack
Yes, you would have to wrap the Spark JmxSink the same way it’s done for the 
CSV or Graphite one, see [1].
This is necessary to expose Gauges provided by Beam to the sinks.

However, if you are on Spark 3 it’s possible to use the new plugin framework of 
Spark (see [2]). That’s something I was planning to work on but haven’t found time 
yet. Using a driver plugin, respective gauges could be registered in Spark’s 
internal metrics registry (available from PluginContext [3]) without using 
custom sinks.

Btw, all of this is happening on the driver. So actually, there’s no trouble to 
expect with the executor classpath.

/ Moritz


[1] 
https://github.com/apache/beam/blob/master/runners/spark/src/main/java/org/apache/beam/runners/spark/metrics/sink/GraphiteSink.java#L35
[2] 
https://spark.apache.org/docs/3.1.2/api/java/org/apache/spark/api/plugin/SparkPlugin.html
[3] 
https://spark.apache.org/docs/3.1.2/api/java/org/apache/spark/api/plugin/PluginContext.html#metricRegistry--


On 15.07.22, 07:47, "Yushu Yao"  wrote:

Thanks Mortiz!

We are using jmx to ship the metrics out of spark. Most of the spark built-in 
driver and executor metrics are going out fine.
Does this require us to make another sink?

-Yushu


On Thu, Jul 14, 2022 at 1:05 PM Moritz Mack 
mailto:mm...@talend.com>> wrote:
Hi Yushu,

Wondering, how did you configure your Spark metrics sink? And what version of 
Spark are you using?

Key is to configure Spark to use one of the sinks provided by Beam, e.g.:
"spark.metrics.conf.*.sink.csv.class"="org.apache.beam.runners.spark.metrics.sink.CsvSink"

Currently there’s support for CSV and Graphite, but it’s simple to support 
others. The sinks typically wrap the corresponding Spark sinks and integrate 
with the Metrics registry of the Spark metric system.

https://beam.apache.org/releases/javadoc/2.40.0/org/apache/beam/runners/spark/metrics/sink/package-summary.html

When dealing with custom sinks, it’s fairly common to run into classpath issues 
:/ The metric system is loaded when an executor starts up, by then the 
application classpath isn’t available yet. Typically, that means distributing 
such code by other means.

/Moritz


On 14.07.22, 21:26, "Yushu Yao" 
mailto:yao.yu...@gmail.com>> wrote:

Hi Team,
Does anyone have a working example of a beam job running on top of spark? So 
that I can use the beam metric syntax and the metrics will be shipped out via 
spark's infra?

The only thing I achieved is to be able to queryMetrics() every half second and 
copy all the metrics into the spark metrics.
Wondering if there is a better way?

Thanks!
-Yushu


MetricQueryResults metrics =
pipelineResult
.metrics()
.queryMetrics(
MetricsFilter.builder()
.addNameFilter(MetricNameFilter.inNamespace(namespace))
.build());

for (MetricResult cc : metrics.getGauges()) {
LOGGER.info("Adding Gauge: {} : {}", cc.getName(), cc.getAttempted());
com.codahale.metrics.Gauge gauge =
new SimpleBeamGauge(cc.getAttempted().getValue());
try {
String name = metricName(cc.getKey(), addStepName, addNamespace);
if (registry.getNames().contains(name)) {
LOGGER.info("Removing metric {}", name);
registry.remove(name);
}
registry.register(name, gauge);
} catch (IllegalArgumentException e) {
LOGGER.warn("Duplicated metrics found. Try turning on 
addStepName=true.", e);
}
}





Re: Metrics in Beam+Spark

2022-07-14 Thread Yushu Yao
Thanks Moritz!

We are using JMX to ship the metrics out of Spark. Most of the Spark
built-in driver and executor metrics are going out fine.
Does this require us to make another sink?

-Yushu


On Thu, Jul 14, 2022 at 1:05 PM Moritz Mack  wrote:

> Hi Yushu,
>
>
>
> Wondering, how did you configure your Spark metrics sink? And what version
> of Spark are you using?
>
>
>
> Key is to configure Spark to use one of the sinks provided by Beam, e.g.:
>
>
> "spark.metrics.conf.*.sink.csv.class"="org.apache.beam.runners.spark.metrics.sink.CsvSink"
>
>
>
> Currently there’s support for CSV and Graphite, but it’s simple to support
> others. The sinks typically wrap the corresponding Spark sinks and
> integrate with the Metrics registry of the Spark metric system.
>
>
>
>
> https://beam.apache.org/releases/javadoc/2.40.0/org/apache/beam/runners/spark/metrics/sink/package-summary.html
>
>
>
> When dealing with custom sinks, it’s fairly common to run into classpath
> issues :/ The metric system is loaded when an executor starts up, by then
> the application classpath isn’t available yet. Typically, that means
> distributing such code by other means.
>
>
>
> /Moritz
>
>
>
>
>
> On 14.07.22, 21:26, "Yushu Yao"  wrote:
>
>
>
>
> Hi Team,
>
> Does anyone have a working example of a beam job running on top of spark?
> So that I can use the beam metric syntax and the metrics will be shipped
> out via spark's infra?
>
>
>
> The only thing I achieved is to be able to queryMetrics() every half
> second and copy all the metrics into the spark metrics.
>
> Wondering if there is a better way?
>
>
>
> Thanks!
>
> -Yushu
>
>
>
> MetricQueryResults metrics =
> pipelineResult
> .metrics()
> .queryMetrics(
> MetricsFilter.*builder*()
> .addNameFilter(MetricNameFilter.*inNamespace*(namespace))
> .build());
>
> for (MetricResult cc : metrics.getGauges()) {
> *LOGGER*.info("Adding Gauge: {} : {}", cc.getName(), cc.getAttempted());
> com.codahale.metrics.Gauge gauge =
> new SimpleBeamGauge(cc.getAttempted().getValue());
> try {
> String name = metricName(cc.getKey(), addStepName, addNamespace);
> if (registry.getNames().contains(name)) {
> *LOGGER*.info("Removing metric {}", name);
> registry.remove(name);
> }
> registry.register(name, gauge);
> } catch (IllegalArgumentException e) {
> *LOGGER*.warn("Duplicated metrics found. Try turning on 
> addStepName=true.", e);
> }
> }
>
>
>
>


Re: Metrics in Beam+Spark

2022-07-14 Thread Moritz Mack
Hi Yushu,

Wondering, how did you configure your Spark metrics sink? And what version of 
Spark are you using?

Key is to configure Spark to use one of the sinks provided by Beam, e.g.:
"spark.metrics.conf.*.sink.csv.class"="org.apache.beam.runners.spark.metrics.sink.CsvSink"

Currently there’s support for CSV and Graphite, but it’s simple to support 
others. The sinks typically wrap the corresponding Spark sinks and integrate 
with the Metrics registry of the Spark metric system.

https://beam.apache.org/releases/javadoc/2.40.0/org/apache/beam/runners/spark/metrics/sink/package-summary.html

When dealing with custom sinks, it’s fairly common to run into classpath issues 
:/ The metric system is loaded when an executor starts up, by then the 
application classpath isn’t available yet. Typically, that means distributing 
such code by other means.

/Moritz


On 14.07.22, 21:26, "Yushu Yao"  wrote:

Hi Team,
Does anyone have a working example of a beam job running on top of spark? So 
that I can use the beam metric syntax and the metrics will be shipped out via 
spark's infra?

The only thing I achieved is to be able to queryMetrics() every half second and 
copy all the metrics into the spark metrics.
Wondering if there is a better way?

Thanks!
-Yushu


MetricQueryResults metrics =
pipelineResult
.metrics()
.queryMetrics(
MetricsFilter.builder()
.addNameFilter(MetricNameFilter.inNamespace(namespace))
.build());

for (MetricResult cc : metrics.getGauges()) {
LOGGER.info("Adding Gauge: {} : {}", cc.getName(), cc.getAttempted());
com.codahale.metrics.Gauge gauge =
new SimpleBeamGauge(cc.getAttempted().getValue());
try {
String name = metricName(cc.getKey(), addStepName, addNamespace);
if (registry.getNames().contains(name)) {
LOGGER.info("Removing metric {}", name);
registry.remove(name);
}
registry.register(name, gauge);
} catch (IllegalArgumentException e) {
LOGGER.warn("Duplicated metrics found. Try turning on 
addStepName=true.", e);
}
}





Metrics in Beam+Spark

2022-07-14 Thread Yushu Yao
Hi Team,
Does anyone have a working example of a Beam job running on top of Spark,
so that I can use the Beam metric syntax and have the metrics shipped
out via Spark's infra?

The only thing I achieved is to be able to queryMetrics() every half second
and copy all the metrics into the spark metrics.
Wondering if there is a better way?

Thanks!
-Yushu

// Query all Beam gauges in the given namespace from the PipelineResult...
MetricQueryResults metrics =
    pipelineResult
        .metrics()
        .queryMetrics(
            MetricsFilter.builder()
                .addNameFilter(MetricNameFilter.inNamespace(namespace))
                .build());

// ...and mirror each gauge into the Dropwizard registry that the Spark metrics system reads from.
for (MetricResult<GaugeResult> cc : metrics.getGauges()) {
  LOGGER.info("Adding Gauge: {} : {}", cc.getName(), cc.getAttempted());
  com.codahale.metrics.Gauge gauge =
      new SimpleBeamGauge(cc.getAttempted().getValue());
  try {
    String name = metricName(cc.getKey(), addStepName, addNamespace);
    if (registry.getNames().contains(name)) {
      LOGGER.info("Removing metric {}", name);
      registry.remove(name);
    }
    registry.register(name, gauge);
  } catch (IllegalArgumentException e) {
    LOGGER.warn("Duplicated metrics found. Try turning on addStepName=true.", e);
  }
}


Re: RDD (Spark dataframe) into a PCollection?

2022-05-24 Thread Moritz Mack
Hi Yushu,

Have a look at org.apache.beam.runners.spark.translation.EvaluationContext in 
the Spark runner. It maintains that mapping between PCollections and RDDs 
(wrapped in the BoundedDataset helper). As Reuven just pointed out, values are 
timestamped (and windowed) in Beam, therefore BoundedDataset expects a 
JavaRDD<WindowedValue<T>>.
The idea is to map your external RDD to a new PCollection 
(PCollection.createPrimitiveOutputInternal) in the EvaluationContext (and vice 
versa). You can then apply Beam transforms to that PCollection (and with that 
effectively to the mapped RDD) as you are used to. Obviously, there’s a few 
steps necessary as EvaluationContext isn’t easily accessible from the outside.

Just having a quick look, creating a RddSource for Spark RDDs seems also not 
too bad. That would allow you to do something like this:
pipeline.apply(Read.from(new RddSource<>(javaRdd.rdd(), coder)));

Though I haven’t done much testing beyond a quick experiment. One notable 
disadvantage of that approach is that all RDD partition data must be 
broadcast to all workers to then pick the right partition. This should mostly 
be fine, but some types of partitions carry data as well …
https://gist.github.com/mosche/5c1ef8ba281a9a08df1ec67fac700d03

/Moritz

On 24.05.22, 16:46, "Yushu Yao"  wrote:

Looks like it's a valid use case.
Wondering anyone can give some high level guidelines on how to implement this?
I can give it a try.

-Yushu


On Tue, May 24, 2022 at 2:42 AM Jan Lukavský 
mailto:je...@seznam.cz>> wrote:

+dev@beam<mailto:d...@beam.apache.org>
On 5/24/22 11:40, Jan Lukavský wrote:

Hi,
I think this feature is valid. Every runner for which Beam is not a 'native' 
SDK uses some form of translation context, which maps PCollection to internal 
representation of the particular SDK of the runner (RDD in this case). It 
should be possible to "import" an RDD into the specific runner via something 
like

  SparkRunner runner = ;
  PCollection<...> pCollection = runner.importRDD(rdd);

and similarly

  RDD<...> rdd = runner.exportRDD(pCollection);

Yes, apparently this would be runner specific, but that is the point, actually. 
This would enable using features and libraries, that Beam does not have, or 
micro-optimize some particular step using runner-specific features, that we 
don't have in Beam. We actually had this feature (at least in a prototype) many 
years ago when Euphoria was a separate project.

 Jan
On 5/23/22 20:58, Alexey Romanenko wrote:



On 23 May 2022, at 20:40, Brian Hulette 
mailto:bhule...@google.com>> wrote:

Yeah I'm not sure of any simple way to do this. I wonder if it's worth 
considering building some Spark runner-specific feature around this, or at 
least packaging up Robert's proposed solution?

I’m not sure that a runner specific feature is a good way to do this since the 
other runners won’t be able to support it or I’m missing something?

There could be other interesting integrations in this space too, e.g. using 
Spark RDDs as a cache for Interactive Beam.

Another option could be to add something like SparkIO (or FlinkIO/whatever) to 
read/write data from/to Spark data structures for such cases (Spark schema to 
Beam schema convention also could be supported). And dreaming a bit more, for 
those who need to have a mixed pipeline (e.g. Spark + Beam) such connectors 
could support the push-downs of pure Spark pipelines and then use the result 
downstream in Beam.

—
Alexey




Brian

On Mon, May 23, 2022 at 11:35 AM Robert Bradshaw 
mailto:rober...@google.com>> wrote:
The easiest way to do this would be to write the RDD somewhere then
read it from Beam.

On Mon, May 23, 2022 at 9:39 AM Yushu Yao 
mailto:yao.yu...@gmail.com>> wrote:
>
> Hi Folks,
>
> I know this is not the optimal way to use beam :-) But assume I only use the 
> spark runner.
>
> I have a spark library (very complex) that emits a spark dataframe (or RDD).
> I also have an existing complex beam pipeline that can do post processing on 
> the data inside the dataframe.
>
> However, the beam part needs a pcollection to start with. The question is, 
> how can I convert a spark RDD into a pcollection?
>
> Thanks
> -Yushu
>






Re: RDD (Spark dataframe) into a PCollection?

2022-05-24 Thread Jan Lukavský
Yes, I suppose it might be more complex than the code snippet, that was 
just to demonstrate the idea. Also the "exportRDD" would probably return 
WindowedValue instead of plain T.


On 5/24/22 17:23, Reuven Lax wrote:
Something like this seems reasonable. Beam PCollections also have a 
timestamp associated with every element, so the importRDD function 
probably needs a way to specify the timestamp (could be an attribute 
name for dataframes or a timestamp extraction function for regular RDDs).


On Tue, May 24, 2022 at 2:40 AM Jan Lukavský  wrote:

Hi,
I think this feature is valid. Every runner for which Beam is not
a 'native' SDK uses some form of translation context, which maps
PCollection to internal representation of the particular SDK of
the runner (RDD in this case). It should be possible to "import"
an RDD into the specific runner via something like

  SparkRunner runner = ;
  PCollection<...> pCollection = runner.importRDD(rdd);

and similarly

  RDD<...> rdd = runner.exportRDD(pCollection);

Yes, apparently this would be runner specific, but that is the
point, actually. This would enable using features and libraries,
that Beam does not have, or micro-optimize some particular step
using runner-specific features, that we don't have in Beam. We
actually had this feature (at least in a prototype) many years ago
when Euphoria was a separate project.

 Jan

On 5/23/22 20:58, Alexey Romanenko wrote:




On 23 May 2022, at 20:40, Brian Hulette  wrote:

Yeah I'm not sure of any simple way to do this. I wonder if it's
worth considering building some Spark runner-specific feature
around this, or at least packaging up Robert's proposed solution?


I’m not sure that a runner specific feature is a good way to do
this since the other runners won’t be able to support it or I’m
missing something?


There could be other interesting integrations in this space too,
e.g. using Spark RDDs as a cache for Interactive Beam.


Another option could be to add something like SparkIO (or
FlinkIO/whatever) to read/write data from/to Spark data
structures for such cases (Spark schema to Beam schema convention
also could be supported). And dreaming a bit more, for those who
need to have a mixed pipeline (e.g. Spark + Beam) such connectors
could support the push-downs of pure Spark pipelines and then use
the result downstream in Beam.

—
Alexey




Brian

On Mon, May 23, 2022 at 11:35 AM Robert Bradshaw
 wrote:

The easiest way to do this would be to write the RDD
somewhere then
read it from Beam.

On Mon, May 23, 2022 at 9:39 AM Yushu Yao
 wrote:
>
> Hi Folks,
>
> I know this is not the optimal way to use beam :-) But
assume I only use the spark runner.
>
> I have a spark library (very complex) that emits a spark
dataframe (or RDD).
> I also have an existing complex beam pipeline that can do
post processing on the data inside the dataframe.
>
> However, the beam part needs a pcollection to start with.
The question is, how can I convert a spark RDD into a
pcollection?
>
> Thanks
> -Yushu
>



Re: RDD (Spark dataframe) into a PCollection?

2022-05-24 Thread Yushu Yao
Looks like it's a valid use case.
Wondering if anyone can give some high-level guidelines on how to implement
this?
I can give it a try.

-Yushu


On Tue, May 24, 2022 at 2:42 AM Jan Lukavský  wrote:

> +dev@beam 
> On 5/24/22 11:40, Jan Lukavský wrote:
>
> Hi,
> I think this feature is valid. Every runner for which Beam is not a
> 'native' SDK uses some form of translation context, which maps PCollection
> to internal representation of the particular SDK of the runner (RDD in this
> case). It should be possible to "import" an RDD into the specific runner
> via something like
>
>   SparkRunner runner = ;
>   PCollection<...> pCollection = runner.importRDD(rdd);
>
> and similarly
>
>   RDD<...> rdd = runner.exportRDD(pCollection);
>
> Yes, apparently this would be runner specific, but that is the point,
> actually. This would enable using features and libraries, that Beam does
> not have, or micro-optimize some particular step using runner-specific
> features, that we don't have in Beam. We actually had this feature (at
> least in a prototype) many years ago when Euphoria was a separate project.
>
>  Jan
> On 5/23/22 20:58, Alexey Romanenko wrote:
>
>
>
> On 23 May 2022, at 20:40, Brian Hulette  wrote:
>
> Yeah I'm not sure of any simple way to do this. I wonder if it's worth
> considering building some Spark runner-specific feature around this, or at
> least packaging up Robert's proposed solution?
>
>
> I’m not sure that a runner specific feature is a good way to do this since
> the other runners won’t be able to support it or I’m missing something?
>
> There could be other interesting integrations in this space too, e.g.
> using Spark RDDs as a cache for Interactive Beam.
>
>
> Another option could be to add something like SparkIO (or
> FlinkIO/whatever) to read/write data from/to Spark data structures for such
> cases (Spark schema to Beam schema convention also could be supported). And
> dreaming a bit more, for those who need to have a mixed pipeline (e.g.
> Spark + Beam) such connectors could support the push-downs of pure Spark
> pipelines and then use the result downstream in Beam.
>
> —
> Alexey
>
>
>
> Brian
>
> On Mon, May 23, 2022 at 11:35 AM Robert Bradshaw 
> wrote:
>
>> The easiest way to do this would be to write the RDD somewhere then
>> read it from Beam.
>>
>> On Mon, May 23, 2022 at 9:39 AM Yushu Yao  wrote:
>> >
>> > Hi Folks,
>> >
>> > I know this is not the optimal way to use beam :-) But assume I only
>> use the spark runner.
>> >
>> > I have a spark library (very complex) that emits a spark dataframe (or
>> RDD).
>> > I also have an existing complex beam pipeline that can do post
>> processing on the data inside the dataframe.
>> >
>> > However, the beam part needs a pcollection to start with. The question
>> is, how can I convert a spark RDD into a pcollection?
>> >
>> > Thanks
>> > -Yushu
>> >
>>
>
>


Re: RDD (Spark dataframe) into a PCollection?

2022-05-24 Thread Jan Lukavský

+dev@beam <mailto:d...@beam.apache.org>

On 5/24/22 11:40, Jan Lukavský wrote:


Hi,
I think this feature is valid. Every runner for which Beam is not a 
'native' SDK uses some form of translation context, which maps 
PCollection to internal representation of the particular SDK of the 
runner (RDD in this case). It should be possible to "import" an RDD 
into the specific runner via something like


  SparkRunner runner = ;
  PCollection<...> pCollection = runner.importRDD(rdd);

and similarly

  RDD<...> rdd = runner.exportRDD(pCollection);

Yes, apparently this would be runner specific, but that is the point, 
actually. This would enable using features and libraries, that Beam 
does not have, or micro-optimize some particular step using 
runner-specific features, that we don't have in Beam. We actually had 
this feature (at least in a prototype) many years ago when Euphoria 
was a separate project.


 Jan

On 5/23/22 20:58, Alexey Romanenko wrote:




On 23 May 2022, at 20:40, Brian Hulette  wrote:

Yeah I'm not sure of any simple way to do this. I wonder if it's 
worth considering building some Spark runner-specific feature around 
this, or at least packaging up Robert's proposed solution?


I’m not sure that a runner specific feature is a good way to do this 
since the other runners won’t be able to support it or I’m missing 
something?


There could be other interesting integrations in this space too, 
e.g. using Spark RDDs as a cache for Interactive Beam.


Another option could be to add something like SparkIO (or 
FlinkIO/whatever) to read/write data from/to Spark data structures 
for such cases (Spark schema to Beam schema convention also could be 
supported). And dreaming a bit more, for those who need to have a 
mixed pipeline (e.g. Spark + Beam) such connectors could support the 
push-downs of pure Spark pipelines and then use the result downstream 
in Beam.


—
Alexey




Brian

On Mon, May 23, 2022 at 11:35 AM Robert Bradshaw 
 wrote:


The easiest way to do this would be to write the RDD somewhere then
read it from Beam.

On Mon, May 23, 2022 at 9:39 AM Yushu Yao 
wrote:
>
> Hi Folks,
>
> I know this is not the optimal way to use beam :-) But assume
I only use the spark runner.
>
> I have a spark library (very complex) that emits a spark
dataframe (or RDD).
> I also have an existing complex beam pipeline that can do post
processing on the data inside the dataframe.
>
> However, the beam part needs a pcollection to start with. The
question is, how can I convert a spark RDD into a pcollection?
>
> Thanks
> -Yushu
>



Re: RDD (Spark dataframe) into a PCollection?

2022-05-24 Thread Jan Lukavský

Hi,
I think this feature is valid. Every runner for which Beam is not a 
'native' SDK uses some form of translation context, which maps 
PCollection to internal representation of the particular SDK of the 
runner (RDD in this case). It should be possible to "import" an RDD into 
the specific runner via something like


  SparkRunner runner = ;
  PCollection<...> pCollection = runner.importRDD(rdd);

and similarly

  RDD<...> rdd = runner.exportRDD(pCollection);

Yes, apparently this would be runner specific, but that is the point, 
actually. This would enable using features and libraries, that Beam does 
not have, or micro-optimize some particular step using runner-specific 
features, that we don't have in Beam. We actually had this feature (at 
least in a prototype) many years ago when Euphoria was a separate project.


 Jan

On 5/23/22 20:58, Alexey Romanenko wrote:




On 23 May 2022, at 20:40, Brian Hulette  wrote:

Yeah I'm not sure of any simple way to do this. I wonder if it's 
worth considering building some Spark runner-specific feature around 
this, or at least packaging up Robert's proposed solution?


I’m not sure that a runner specific feature is a good way to do this 
since the other runners won’t be able to support it or I’m missing 
something?


There could be other interesting integrations in this space too, e.g. 
using Spark RDDs as a cache for Interactive Beam.


Another option could be to add something like SparkIO (or 
FlinkIO/whatever) to read/write data from/to Spark data structures for 
such cases (Spark schema to Beam schema convention also could be 
supported). And dreaming a bit more, for those who need to have a 
mixed pipeline (e.g. Spark + Beam) such connectors could support the 
push-downs of pure Spark pipelines and then use the result downstream 
in Beam.


—
Alexey




Brian

On Mon, May 23, 2022 at 11:35 AM Robert Bradshaw 
 wrote:


The easiest way to do this would be to write the RDD somewhere then
read it from Beam.

On Mon, May 23, 2022 at 9:39 AM Yushu Yao 
wrote:
>
> Hi Folks,
>
> I know this is not the optimal way to use beam :-) But assume I
only use the spark runner.
>
> I have a spark library (very complex) that emits a spark
dataframe (or RDD).
> I also have an existing complex beam pipeline that can do post
processing on the data inside the dataframe.
>
> However, the beam part needs a pcollection to start with. The
question is, how can I convert a spark RDD into a pcollection?
>
> Thanks
> -Yushu
>



Re: RDD (Spark dataframe) into a PCollection?

2022-05-23 Thread Alexey Romanenko


> On 23 May 2022, at 20:40, Brian Hulette  wrote:
> 
> Yeah I'm not sure of any simple way to do this. I wonder if it's worth 
> considering building some Spark runner-specific feature around this, or at 
> least packaging up Robert's proposed solution? 

I’m not sure that a runner specific feature is a good way to do this since the 
other runners won’t be able to support it or I’m missing something?

> There could be other interesting integrations in this space too, e.g. using 
> Spark RDDs as a cache for Interactive Beam.

Another option could be to add something like SparkIO (or FlinkIO/whatever) to 
read/write data from/to Spark data structures for such cases (Spark schema to 
Beam schema convention also could be supported). And dreaming a bit more, for 
those who need to have a mixed pipeline (e.g. Spark + Beam) such connectors 
could support the push-downs of pure Spark pipelines and then use the result 
downstream in Beam.

—
Alexey


> 
> Brian
> 
> On Mon, May 23, 2022 at 11:35 AM Robert Bradshaw  <mailto:rober...@google.com>> wrote:
> The easiest way to do this would be to write the RDD somewhere then
> read it from Beam.
> 
> On Mon, May 23, 2022 at 9:39 AM Yushu Yao  <mailto:yao.yu...@gmail.com>> wrote:
> >
> > Hi Folks,
> >
> > I know this is not the optimal way to use beam :-) But assume I only use 
> > the spark runner.
> >
> > I have a spark library (very complex) that emits a spark dataframe (or RDD).
> > I also have an existing complex beam pipeline that can do post processing 
> > on the data inside the dataframe.
> >
> > However, the beam part needs a pcollection to start with. The question is, 
> > how can I convert a spark RDD into a pcollection?
> >
> > Thanks
> > -Yushu
> >



Re: RDD (Spark dataframe) into a PCollection?

2022-05-23 Thread Yushu Yao
Thanks Robert and Brian.
As for "writing the RDD somewhere", I can totally write a bunch of files on
disk/s3. Any other options?
-Yushu


On Mon, May 23, 2022 at 11:40 AM Brian Hulette  wrote:

> Yeah I'm not sure of any simple way to do this. I wonder if it's worth
> considering building some Spark runner-specific feature around this, or at
> least packaging up Robert's proposed solution?
>
> There could be other interesting integrations in this space too, e.g.
> using Spark RDDs as a cache for Interactive Beam.
>
> Brian
>
> On Mon, May 23, 2022 at 11:35 AM Robert Bradshaw 
> wrote:
>
>> The easiest way to do this would be to write the RDD somewhere then
>> read it from Beam.
>>
>> On Mon, May 23, 2022 at 9:39 AM Yushu Yao  wrote:
>> >
>> > Hi Folks,
>> >
>> > I know this is not the optimal way to use beam :-) But assume I only
>> use the spark runner.
>> >
>> > I have a spark library (very complex) that emits a spark dataframe (or
>> RDD).
>> > I also have an existing complex beam pipeline that can do post
>> processing on the data inside the dataframe.
>> >
>> > However, the beam part needs a pcollection to start with. The question
>> is, how can I convert a spark RDD into a pcollection?
>> >
>> > Thanks
>> > -Yushu
>> >
>>
>


Re: RDD (Spark dataframe) into a PCollection?

2022-05-23 Thread Alexey Romanenko
To add a bit more to what Robert suggested. Right, in general we can’t read 
Spark RDDs directly with Beam (the Spark runner uses RDDs under the hood, but that’s a 
different story), but you can write the results to any storage, in a data 
format that Beam supports, and then read them with a corresponding Beam IO 
connector.

—
Alexey
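
As a rough illustration of that hand-off, sketched with the Python SDK for brevity (the Java TextIO/ParquetIO connectors follow the same pattern); the bucket path and format here are placeholders:

```
import apache_beam as beam

# Spark side (PySpark), executed in the Spark job that produces the DataFrame:
#   df.write.parquet('s3://my-bucket/handoff/')

# Beam side: read the materialized files with a matching IO connector.
with beam.Pipeline() as p:
    _ = (
        p
        | 'Read handoff' >> beam.io.ReadFromParquet('s3://my-bucket/handoff/*')
        | beam.Map(print))  # each element is a dict keyed by column name
```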

> On 23 May 2022, at 20:35, Robert Bradshaw  wrote:
> 
> The easiest way to do this would be to write the RDD somewhere then
> read it from Beam.
> 
> On Mon, May 23, 2022 at 9:39 AM Yushu Yao  wrote:
>> 
>> Hi Folks,
>> 
>> I know this is not the optimal way to use beam :-) But assume I only use the 
>> spark runner.
>> 
>> I have a spark library (very complex) that emits a spark dataframe (or RDD).
>> I also have an existing complex beam pipeline that can do post processing on 
>> the data inside the dataframe.
>> 
>> However, the beam part needs a pcollection to start with. The question is, 
>> how can I convert a spark RDD into a pcollection?
>> 
>> Thanks
>> -Yushu
>> 



RDD (Spark dataframe) into a PCollection?

2022-05-23 Thread Yushu Yao
Hi Folks,

I know this is not the optimal way to use beam :-) But assume I only use
the spark runner.

I have a spark library (very complex) that emits a spark dataframe (or
RDD).
I also have an existing complex beam pipeline that can do post processing
on the data inside the dataframe.

However, the beam part needs a pcollection to start with. The question is,
how can I convert a spark RDD into a pcollection?

Thanks
-Yushu


Re: [PROPOSAL] Stop Spark 2 support in Spark Runner

2022-04-29 Thread Austin Bennett
https://spark.apache.org/releases/spark-release-3-0-0.html

Since Spark 3 has been out almost 2 years, this seems increasingly
reasonable.

On Fri, Apr 29, 2022 at 4:04 AM Jean-Baptiste Onofré 
wrote:

> +1, it makes sense to me. Users wanting "old" spark version can take
> previous Beam releases.
>
> Regards
> JB
>
> On Fri, Apr 29, 2022 at 12:39 PM Alexey Romanenko
>  wrote:
> >
> > Any objections or comments from Spark 2 users on this topic?
> >
> > —
> > Alexey
> >
> >
> > On 20 Apr 2022, at 19:17, Alexey Romanenko 
> wrote:
> >
> > Hi everyone,
> >
> > A while ago, we already discussed on dev@ that there are several
> reasons to stop provide a support of Spark2 in Spark Runner (in all its
> variants that we have for now - RDD, Dataset, Portable) [1]. In two words,
> it brings some burden to Spark runner support that we would like to avoid
> in the future.
> >
> > From the devs perspective I don’t see any objections about this. So, I’d
> like to know if there are users that still uses Spark2 for their Beam
> pipelines and it will be critical for them to keep using it.
> >
> > Please, share any your opinion on this!
> >
> > —
> > Alexey
> >
> > [1] https://lists.apache.org/thread/opfhg3xjb9nptv878sygwj9gjx38rmco
> >
> > > On 31 Mar 2022, at 17:51, Alexey Romanenko 
> wrote:
> > >
> > > Hi everyone,
> > >
> > > For the moment, Beam Spark Runner supports two versions of Spark - 2.x
> and 3.x.
> > >
> > > Taking into account the several things that:
> > > - almost all cloud providers already mostly moved to Spark 3.x as a
> main supported version;
> > > - the latest Spark 2.x release (Spark 2.4.8, maintenance release) was
> done almost a year ago;
> > > - Spark 3 is considered as a mainstream Spark version for development
> and bug fixing;
> > > - better to avoid the burden of maintenance (there are some
> incompatibilities between Spark 2 and 3) of two versions;
> > >
> > > I’d suggest to stop support Spark 2 for the Spark Runner in the one of
> the next Beam releases.
> > >
> > > What are your thoughts on this? Are there any principal objections or
> reasons for not doing this that I probably missed?
> > >
> > > —
> > > Alexey
> > >
> > >
>


Re: [PROPOSAL] Stop Spark 2 support in Spark Runner

2022-04-29 Thread Jean-Baptiste Onofré
+1, it makes sense to me. Users wanting "old" spark version can take
previous Beam releases.

Regards
JB

On Fri, Apr 29, 2022 at 12:39 PM Alexey Romanenko
 wrote:
>
> Any objections or comments from Spark 2 users on this topic?
>
> —
> Alexey
>
>
> On 20 Apr 2022, at 19:17, Alexey Romanenko  wrote:
>
> Hi everyone,
>
> A while ago, we already discussed on dev@ that there are several reasons to 
> stop provide a support of Spark2 in Spark Runner (in all its variants that we 
> have for now - RDD, Dataset, Portable) [1]. In two words, it brings some 
> burden to Spark runner support that we would like to avoid in the future.
>
> From the devs perspective I don’t see any objections about this. So, I’d like 
> to know if there are users that still uses Spark2 for their Beam pipelines 
> and it will be critical for them to keep using it.
>
> Please, share any your opinion on this!
>
> —
> Alexey
>
> [1] https://lists.apache.org/thread/opfhg3xjb9nptv878sygwj9gjx38rmco
>
> > On 31 Mar 2022, at 17:51, Alexey Romanenko  wrote:
> >
> > Hi everyone,
> >
> > For the moment, Beam Spark Runner supports two versions of Spark - 2.x and 
> > 3.x.
> >
> > Taking into account the several things that:
> > - almost all cloud providers already mostly moved to Spark 3.x as a main 
> > supported version;
> > - the latest Spark 2.x release (Spark 2.4.8, maintenance release) was done 
> > almost a year ago;
> > - Spark 3 is considered as a mainstream Spark version for development and 
> > bug fixing;
> > - better to avoid the burden of maintenance (there are some 
> > incompatibilities between Spark 2 and 3) of two versions;
> >
> > I’d suggest to stop support Spark 2 for the Spark Runner in the one of the 
> > next Beam releases.
> >
> > What are your thoughts on this? Are there any principal objections or 
> > reasons for not doing this that I probably missed?
> >
> > —
> > Alexey
> >
> >


[PROPOSAL] Stop Spark 2 support in Spark Runner

2022-04-29 Thread Alexey Romanenko
Any objections or comments from Spark 2 users on this topic?

—
Alexey


On 20 Apr 2022, at 19:17, Alexey Romanenko  wrote:

Hi everyone,

A while ago, we already discussed on dev@ that there are several reasons to 
stop providing support for Spark 2 in the Spark Runner (in all its variants that we 
have for now - RDD, Dataset, Portable) [1]. In short, it adds a maintenance burden 
to the Spark runner that we would like to avoid in the future.

From the dev perspective I don’t see any objections to this. So, I’d like 
to know whether there are users who still use Spark 2 for their Beam pipelines and 
for whom it would be critical to keep using it.

Please share your opinion on this!

—
Alexey

[1] https://lists.apache.org/thread/opfhg3xjb9nptv878sygwj9gjx38rmco

> On 31 Mar 2022, at 17:51, Alexey Romanenko  wrote:
> 
> Hi everyone,
> 
> For the moment, Beam Spark Runner supports two versions of Spark - 2.x and 
> 3.x. 
> 
> Taking into account the several things that:
> - almost all cloud providers already mostly moved to Spark 3.x as a main 
> supported version;
> - the latest Spark 2.x release (Spark 2.4.8, maintenance release) was done 
> almost a year ago;
> - Spark 3 is considered as a mainstream Spark version for development and bug 
> fixing;
> - better to avoid the burden of maintenance (there are some incompatibilities 
> between Spark 2 and 3) of two versions; 
> 
> I’d suggest to stop support Spark 2 for the Spark Runner in the one of the 
> next Beam releases. 
> 
> What are your thoughts on this? Are there any principal objections or reasons 
> for not doing this that I probably missed?
> 
> —
> Alexey 
> 
> 


Re: [PROPOSAL] Stop Spark2 support in Spark Runner

2022-04-20 Thread Alexey Romanenko
Hi everyone,

A while ago, we already discussed on dev@ that there are several reasons to 
stop providing support for Spark 2 in the Spark Runner (in all its variants that we 
have for now - RDD, Dataset, Portable) [1]. In short, it adds a maintenance burden 
to the Spark runner that we would like to avoid in the future.

From the dev perspective I don’t see any objections to this. So, I’d like 
to know whether there are users who still use Spark 2 for their Beam pipelines and 
for whom it would be critical to keep using it.

Please share your opinion on this!

—
Alexey

[1] https://lists.apache.org/thread/opfhg3xjb9nptv878sygwj9gjx38rmco

> On 31 Mar 2022, at 17:51, Alexey Romanenko  wrote:
> 
> Hi everyone,
> 
> For the moment, Beam Spark Runner supports two versions of Spark - 2.x and 
> 3.x. 
> 
> Taking into account the several things that:
> - almost all cloud providers already mostly moved to Spark 3.x as a main 
> supported version;
> - the latest Spark 2.x release (Spark 2.4.8, maintenance release) was done 
> almost a year ago;
> - Spark 3 is considered as a mainstream Spark version for development and bug 
> fixing;
> - better to avoid the burden of maintenance (there are some incompatibilities 
> between Spark 2 and 3) of two versions; 
> 
> I’d suggest to stop support Spark 2 for the Spark Runner in the one of the 
> next Beam releases. 
> 
> What are your thoughts on this? Are there any principal objections or reasons 
> for not doing this that I probably missed?
> 
> —
> Alexey 
> 
> 



Re: Re: [Question] Spark: standard setup to use beam-spark to parallelize python code

2022-04-04 Thread Florian Pinault
We made some progress parallelizing our Python code using beam-spark. 
Following your advice, we are using Spark 3.2.1.
The Spark server and worker are connected OK.
On a third machine, the client machine, I am running the Docker job server:
$ sudo docker run --net=host apache/beam_spark_job_server:latest 
--spark-master-url=spark://:7077

Then on the client:
$ python test_beam.py

If it matters, the code in test_beam.py has the following:
options = PipelineOptions([
    "--runner=PortableRunner",
    "--job_endpoint=localhost:8099",
    "--save_main_session",
    "--environment_type=DOCKER",
    "--environment_config=docker.io/apache/beam_python3.8_sdk:2.37.0"
])
with beam.Pipeline(options=options) as p:
    lines = (p
             | 'Create words' >> beam.Create(['this is working'])
             | 'Add hostname' >> beam.Map(lambda words: addhost(words))
             | 'Split words' >> beam.FlatMap(lambda words: words.split(' '))
             | 'Build byte array' >> beam.ParDo(ConvertToByteArray())
             | 'Group' >> beam.GroupBy()  # Do future batching here
             | 'print output' >> beam.Map(myprint)
             )


I think I got the versions wrong, because the server logs give the following 
(192.168.1.252 is the client IP):

22/04/04 11:51:57 DEBUG TransportServer: New connection accepted for remote 
address /192.168.1.252:53330.
22/04/04 11:51:57 ERROR TransportRequestHandler: Error while invoking 
RpcHandler#receive() for one-way message.
java.io.InvalidClassException: org.apache.spark.deploy.ApplicationDescription; 
local class incompatible: stream classdesc serialVersionUID = 
6543101073799644159, local class serialVersionUID = 1574364215946805297
….


On the client logs, I got:

22/04/04 11:51:56 INFO 
org.apache.beam.runners.fnexecution.artifact.ArtifactStagingService: Staging 
artifacts for job_9a9667dd-898f-4b9b-94b7-2c8b73f0ac27.
22/04/04 11:51:56 INFO 
org.apache.beam.runners.fnexecution.artifact.ArtifactStagingService: Resolving 
artifacts for 
job_9a9667dd-898f-4b9b-94b7-2c8b73f0ac27.ref_Environment_default_environment_1.
22/04/04 11:51:56 INFO 
org.apache.beam.runners.fnexecution.artifact.ArtifactStagingService: Getting 1 
artifacts for job_9a9667dd-898f-4b9b-94b7-2c8b73f0ac27.null.
22/04/04 11:51:56 INFO 
org.apache.beam.runners.fnexecution.artifact.ArtifactStagingService: Artifacts 
fully staged for job_9a9667dd-898f-4b9b-94b7-2c8b73f0ac27.
22/04/04 11:51:56 INFO org.apache.beam.runners.spark.SparkJobInvoker: Invoking 
job BeamApp-root-0404115156-3cc5f493_e0deca3a-6b67-40ad-bfe8-55ba7efd9038
22/04/04 11:51:56 INFO org.apache.beam.runners.jobsubmission.JobInvocation: 
Starting job invocation 
BeamApp-root-0404115156-3cc5f493_e0deca3a-6b67-40ad-bfe8-55ba7efd9038
22/04/04 11:51:56 INFO 
org.apache.beam.runners.core.construction.resources.PipelineResources: 
PipelineOptions.filesToStage was not specified. Defaulting to files from the 
classpath: will stage 6 files. Enable logging at DEBUG level to see which files 
will be staged.
22/04/04 11:51:56 INFO 
org.apache.beam.runners.spark.translation.SparkContextFactory: Creating a brand 
new Spark Context.
22/04/04 11:51:56 WARN org.apache.spark.util.Utils: Your hostname, 
spark-ml-client resolves to a loopback address: 127.0.0.1; using 192.168.1.252 
instead (on interface eth0)
22/04/04 11:51:56 WARN org.apache.spark.util.Utils: Set SPARK_LOCAL_IP if you 
need to bind to another address
22/04/04 11:51:57 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load 
native-hadoop library for your platform... using builtin-java classes where 
applicable
22/04/04 11:52:57 ERROR 
org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend: Application has 
been killed. Reason: All masters are unresponsive! Giving up.
22/04/04 11:52:57 WARN 
org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend: Application ID 
is not initialized yet.
22/04/04 11:52:57 WARN 
org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint: Drop 
UnregisterApplication(null) because has not yet connected to master
22/04/04 11:52:57 WARN org.apache.spark.metrics.MetricsSystem: Stopping a 
MetricsSystem that is not running
22/04/04 11:52:57 ERROR org.apache.spark.SparkContext: Error initializing 
SparkContext.
java.lang.NullPointerException
at 
org.apache.spark.storage.BlockManagerMaster.registerBlockManager(BlockManagerMaster.scala:64)
at 
org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:248)
at org.apache.spark.SparkContext.(SparkContext.scala:510)
at 
org.apache.spark.api.java.JavaSparkContext.(JavaSparkContext.scala:58)
at 
org.apache.beam.runners.spark.translation.SparkContextFactory.createSparkContext(SparkContextFactory.java:101)
at 
org.apache.beam.runners.spark.translation.SparkContextFactory.getSparkContext(SparkContextFactory.java:67)

Re: [Question] Spark: standard setup to use beam-spark to parallelize python code

2022-03-28 Thread Alexey Romanenko

> On 28 Mar 2022, at 20:58, Mihai Alexe  wrote:
> 
> the jackson runtime dependencies should be updated manually (at least to 
> 2.9.2) in case of using Spark 2.x
>  
> yes - that is exactly what we are looking to achieve, any hints about how to 
> do that? We’re not Java experts. Do you happen to have a CI recipe or binary 
> lis for this particular configuration? Thank you!

For our testing pipelines, which run on Spark 2.4.7, we just build them with a 
recent version of the Jackson libs [1]. Though, IIUC, you don’t build any Java 
code. So, what exactly is the issue that you are facing? 

> use Spark 3..x if possible since it already provides jackson jars of version 
> 2.10.0.
>  
> we tried this too but ran into other compatibility problems. Seems that the 
> Beam Spark runner (in v 2.37.0) only supports the Spark 2.x branch, as per 
> the Beam docs https://beam.apache.org/documentation/runners/spark/ 
> <https://beam.apache.org/documentation/runners/spark/>
Which exact Spark 3.x version did you try? AFAICT, Beam 2.37.0 supports and 
was tested with Spark 3.1.2 / Scala 2.12 artifacts.

—
Alexey

[1] 
https://github.com/Talend/beam-samples/blob/9288606495b9ba8f77383cd9709ed9b5783deeb8/pom.xml#L66

>  
> any ideas?
>  
> On 2022/03/28 17:38:13 Alexey Romanenko wrote:
> > Well, it’s caused by recent jackson's version update in Beam [1] - so, the 
> > jackson runtime dependencies should be updated manually (at least to 2.9.2) 
> > in case of using Spark 2.x.
> > 
> > Either, use Spark 3..x if possible since it already provides jackson jars 
> > of version 2.10.0.
> >  
> > [1] 
> > https://github.com/apache/beam/commit/9694f70df1447e96684b665279679edafec13a0c
> >  
> > <https://github.com/apache/beam/commit/9694f70df1447e96684b665279679edafec13a0c><https://github.com/apache/beam/commit/9694f70df1447e96684b665279679edafec13a0c>
> >  
> > <https://github.com/apache/beam/commit/9694f70df1447e96684b665279679edafec13a0c%3e>
> > 
> > —
> > Alexey
> > 
> > > On 28 Mar 2022, at 14:15, Florian Pinault  > > <mailto:fl...@ecmwf.int>> wrote:
> > > 
> > > Greetings,
> > >  
> > > We are setting up an Apache Beam cluster using Spark as a backend to run 
> > > python code. This is currently a toy example with 4 virtual machines 
> > > running Centos (a client, a spark main, and two spark-workers).
> > > We are running into version issues (detail below) and would need help on 
> > > which versions to set up.
> > > We currently are trying spark-2.4.8-bin-hadoop2.7, with the pip package 
> > > beam 2.37.0 on the client, and using a job-server to create docker image.
> > >  
> > > I saw here https://beam.apache.org/blog/beam-2.33.0/ 
> > > <https://beam.apache.org/blog/beam-2.33.0/> 
> > > <https://beam.apache.org/blog/beam-2.33.0/> 
> > > <https://beam.apache.org/blog/beam-2.33.0/%3e> that "Spark 2.x users will 
> > > need to update Spark's Jackson runtime dependencies 
> > > (spark.jackson.version) to at least version 2.9.2, due to Beam updating 
> > > its dependencies." 
> > >  But it looks like the jackson-core version in the job-server is 2.13.0 
> > > whereas the jars in spark-2.4.8-bin-hadoop2.7/jars are
> > > -. 1 mluser mluser 46986 May 8 2021 jackson-annotations-2.6.7.jar
> > > -. 1 mluser mluser 258919 May 8 2021 jackson-core-2.6.7.jar
> > > -. 1 mluser mluser 232248 May 8 2021 jackson-core-asl-1.9.13.jar
> > > -. 1 mluser mluser 1166637 May 8 2021 jackson-databind-2.6.7.3.jar
> > > -. 1 mluser mluser 320444 May 8 2021 jackson-dataformat-yaml-2.6.7.jar
> > > -. 1 mluser mluser 18336 May 8 2021 jackson-jaxrs-1.9.13.jar
> > > -. 1 mluser mluser 780664 May 8 2021 jackson-mapper-asl-1.9.13.jar
> > > -. 1 mluser mluser 32612 May 8 2021 
> > > jackson-module-jaxb-annotations-2.6.7.jar
> > > -. 1 mluser mluser 42858 May 8 2021 jackson-module-paranamer-2.7.9.jar
> > > -. 1 mluser mluser 515645 May 8 2021 jackson-module-scala_2.11-2.6.7.1.jar
> > >  
> > > There must be something to update, but I am not sure how to update these 
> > > jar files with their dependencies, and not sure if this would get us very 
> > > far.
> > >  
> > > Would you have a list of binaries that work together or some running CI 
> > > from the apache foundation similar to what we are trying to achieve?
> > 
> > 



RE: Re: [Question] Spark: standard setup to use beam-spark to parallelize python code

2022-03-28 Thread Mihai Alexe
  *   the jackson runtime dependencies should be updated manually (at least to 
2.9.2) in case of using Spark 2.x

yes - that is exactly what we are looking to achieve, any hints about how to do 
that? We’re not Java experts. Do you happen to have a CI recipe or binary list 
for this particular configuration? Thank you!


  *   use Spark 3..x if possible since it already provides jackson jars of 
version 2.10.0.

we tried this too but ran into other compatibility problems. Seems that the 
Beam Spark runner (in v 2.37.0) only supports the Spark 2.x branch, as per the 
Beam docs https://beam.apache.org/documentation/runners/spark/

any ideas?

On 2022/03/28 17:38:13 Alexey Romanenko wrote:
> Well, it’s caused by recent jackson's version update in Beam [1] - so, the 
> jackson runtime dependencies should be updated manually (at least to 2.9.2) 
> in case of using Spark 2.x.
>
> Either, use Spark 3..x if possible since it already provides jackson jars of 
> version 2.10.0.
>
> [1] 
> https://github.com/apache/beam/commit/9694f70df1447e96684b665279679edafec13a0c
>  
> <https://github.com/apache/beam/commit/9694f70df1447e96684b665279679edafec13a0c><https://github.com/apache/beam/commit/9694f70df1447e96684b665279679edafec13a0c%3e>
>
> —
> Alexey
>
> > On 28 Mar 2022, at 14:15, Florian Pinault 
> > mailto:fl...@ecmwf.int>> wrote:
> >
> > Greetings,
> >
> > We are setting up an Apache Beam cluster using Spark as a backend to run 
> > python code. This is currently a toy example with 4 virtual machines 
> > running Centos (a client, a spark main, and two spark-workers).
> > We are running into version issues (detail below) and would need help on 
> > which versions to set up.
> > We currently are trying spark-2.4.8-bin-hadoop2.7, with the pip package 
> > beam 2.37.0 on the client, and using a job-server to create docker image.
> >
> > I saw here https://beam.apache.org/blog/beam-2.33.0/ 
> > <https://beam.apache.org/blog/beam-2.33.0/><https://beam.apache.org/blog/beam-2.33.0/%3e>
> >  that "Spark 2.x users will need to update Spark's Jackson runtime 
> > dependencies (spark.jackson.version) to at least version 2.9.2, due to Beam 
> > updating its dependencies."
> >  But it looks like the jackson-core version in the job-server is 2.13.0 
> > whereas the jars in spark-2.4.8-bin-hadoop2.7/jars are
> > -. 1 mluser mluser 46986 May 8 2021 jackson-annotations-2.6.7.jar
> > -. 1 mluser mluser 258919 May 8 2021 jackson-core-2.6.7.jar
> > -. 1 mluser mluser 232248 May 8 2021 jackson-core-asl-1.9.13.jar
> > -. 1 mluser mluser 1166637 May 8 2021 jackson-databind-2.6.7.3.jar
> > -. 1 mluser mluser 320444 May 8 2021 jackson-dataformat-yaml-2.6.7.jar
> > -. 1 mluser mluser 18336 May 8 2021 jackson-jaxrs-1.9.13.jar
> > -. 1 mluser mluser 780664 May 8 2021 jackson-mapper-asl-1.9.13.jar
> > -. 1 mluser mluser 32612 May 8 2021 
> > jackson-module-jaxb-annotations-2.6.7.jar
> > -. 1 mluser mluser 42858 May 8 2021 jackson-module-paranamer-2.7.9.jar
> > -. 1 mluser mluser 515645 May 8 2021 jackson-module-scala_2.11-2.6.7.1.jar
> >
> > There must be something to update, but I am not sure how to update these 
> > jar files with their dependencies, and not sure if this would get us very 
> > far.
> >
> > Would you have a list of binaries that work together or some running CI 
> > from the apache foundation similar to what we are trying to achieve?
>
>


Re: [Question] Spark: standard setup to use beam-spark to parallelize python code

2022-03-28 Thread Alexey Romanenko
Well, it’s caused by the recent Jackson version update in Beam [1] - so the 
Jackson runtime dependencies should be updated manually (at least to 2.9.2) if 
you are using Spark 2.x. 

Alternatively, use Spark 3.x if possible, since it already provides Jackson jars of 
version 2.10.0.
 
[1] 
https://github.com/apache/beam/commit/9694f70df1447e96684b665279679edafec13a0c 
<https://github.com/apache/beam/commit/9694f70df1447e96684b665279679edafec13a0c>

—
Alexey

> On 28 Mar 2022, at 14:15, Florian Pinault  wrote:
> 
> Greetings,
>  
> We are setting up an Apache Beam cluster using Spark as a backend to run 
> python code. This is currently a toy example with 4 virtual machines running 
> Centos (a client, a spark main, and two spark-workers). 
> We are running into version issues (detail below) and would need help on 
> which versions to set up.
> We currently are trying spark-2.4.8-bin-hadoop2.7, with the pip package beam 
> 2.37.0 on the client, and using a job-server to create docker image.
>  
> I saw here https://beam.apache.org/blog/beam-2.33.0/ 
> <https://beam.apache.org/blog/beam-2.33.0/> that "Spark 2.x users will need 
> to update Spark's Jackson runtime dependencies (spark.jackson.version) to at 
> least version 2.9.2, due to Beam updating its dependencies." 
>  But it looks like the jackson-core version in the job-server is 2.13.0 
> whereas the jars in spark-2.4.8-bin-hadoop2.7/jars are
> -. 1 mluser mluser 46986 May 8 2021 jackson-annotations-2.6.7.jar
> -. 1 mluser mluser 258919 May 8 2021 jackson-core-2.6.7.jar
> -. 1 mluser mluser 232248 May 8 2021 jackson-core-asl-1.9.13.jar
> -. 1 mluser mluser 1166637 May 8 2021 jackson-databind-2.6.7.3.jar
> -. 1 mluser mluser 320444 May 8 2021 jackson-dataformat-yaml-2.6.7.jar
> -. 1 mluser mluser 18336 May 8 2021 jackson-jaxrs-1.9.13.jar
> -. 1 mluser mluser 780664 May 8 2021 jackson-mapper-asl-1.9.13.jar
> -. 1 mluser mluser 32612 May 8 2021 jackson-module-jaxb-annotations-2.6.7.jar
> -. 1 mluser mluser 42858 May 8 2021 jackson-module-paranamer-2.7.9.jar
> -. 1 mluser mluser 515645 May 8 2021 jackson-module-scala_2.11-2.6.7.1.jar
>  
> There must be something to update, but I am not sure how to update these jar 
> files with their dependencies, and not sure if this would get us very far.
>  
> Would you have a list of binaries that work together or some running CI from 
> the apache foundation similar to what we are trying to achieve?



[Question] Spark: standard setup to use beam-spark to parallelize python code

2022-03-28 Thread Florian Pinault
Greetings,

We are setting up an Apache Beam cluster using Spark as a backend to run python 
code. This is currently a toy example with 4 virtual machines running Centos (a 
client, a spark main, and two spark-workers).
We are running into version issues (detail below) and would need help on which 
versions to set up.
We currently are trying spark-2.4.8-bin-hadoop2.7, with the pip package beam 
2.37.0 on the client, and using a job-server to create docker image.


I saw here https://beam.apache.org/blog/beam-2.33.0/ that "Spark 2.x users will 
need to update Spark's Jackson runtime dependencies (spark.jackson.version) to 
at least version 2.9.2, due to Beam updating its dependencies."

 But it looks like the jackson-core version in the job-server is 2.13.0 whereas 
the jars in spark-2.4.8-bin-hadoop2.7/jars are

-. 1 mluser mluser 46986 May 8 2021 jackson-annotations-2.6.7.jar
-. 1 mluser mluser 258919 May 8 2021 jackson-core-2.6.7.jar
-. 1 mluser mluser 232248 May 8 2021 jackson-core-asl-1.9.13.jar
-. 1 mluser mluser 1166637 May 8 2021 jackson-databind-2.6.7.3.jar
-. 1 mluser mluser 320444 May 8 2021 jackson-dataformat-yaml-2.6.7.jar
-. 1 mluser mluser 18336 May 8 2021 jackson-jaxrs-1.9.13.jar
-. 1 mluser mluser 780664 May 8 2021 jackson-mapper-asl-1.9.13.jar
-. 1 mluser mluser 32612 May 8 2021 jackson-module-jaxb-annotations-2.6.7.jar
-. 1 mluser mluser 42858 May 8 2021 jackson-module-paranamer-2.7.9.jar
-. 1 mluser mluser 515645 May 8 2021 jackson-module-scala_2.11-2.6.7.1.jar

There must be something to update, but I am not sure how to update these jar 
files with their dependencies, and not sure if this would get us very far.

Would you have a list of binaries that work together or some running CI from 
the apache foundation similar to what we are trying to achieve?



Re: Running small app using Apache Beam, KafkaIO, Azure EventHuband Databricks Spark

2022-02-02 Thread Alexey Romanenko
Thank you for the quick answers, Utkarsh, but unfortunately I don’t see the real 
cause of this right now. It seems like it will require some remote debugging on 
your side to see what the workers are actually doing.


> On 1 Feb 2022, at 22:59, Utkarsh Parekh  wrote:
> 
> If you tested earlier with the same stack, which version did you use?
> 
> Can you enable debug logs to check what’s happening there? So far the 
> following warning was received from from log4j which I received from log4j on 
> Databricks (no errors other than that).
> 
> Can you make sure that there is no issue with firewall or something? No I 
> don't think so. Because it's working fine locally and databricks notebook.
> 
> Can you run this pipeline locally against a real Kafka server, not Azure 
> Event Hub, to make sure that it works fine? Yes it's working fine with both 
> Azure EventHub and Kafka
> 
> 
> org.springframework.core.convert.support.DefaultConversionService.getSharedInstance()'
> at 
> org.springframework.expression.spel.support.StandardTypeConverter.(StandardTypeConverter.java:46)
> at 
> org.springframework.expression.spel.support.StandardEvaluationContext.getTypeConverter(StandardEvaluationContext.java:197)
> at 
> org.springframework.expression.spel.support.ReflectiveMethodResolver.resolve(ReflectiveMethodResolver.java:115)
> at 
> org.springframework.expression.spel.ast.MethodReference.findAccessorForMethod(MethodReference.java:201)
> at 
> org.springframework.expression.spel.ast.MethodReference.getValueInternal(MethodReference.java:130)
> at 
> org.springframework.expression.spel.ast.MethodReference.access$000(MethodReference.java:52)
> at 
> org.springframework.expression.spel.ast.MethodReference$MethodValueRef.getValue(MethodReference.java:377)
> at 
> org.springframework.expression.spel.ast.CompoundExpression.getValueInternal(CompoundExpression.java:88)
> at 
> org.springframework.expression.spel.ast.SpelNodeImpl.getValue(SpelNodeImpl.java:121)
> at 
> org.springframework.expression.spel.standard.SpelExpression.getValue(SpelExpression.java:262)
> at 
> org.apache.beam.sdk.io.kafka.ConsumerSpEL.evaluateAssign(ConsumerSpEL.java:124)
> at 
> org.apache.beam.sdk.io.kafka.KafkaUnboundedReader.start(KafkaUnboundedReader.java:85)
> at 
> org.apache.beam.runners.spark.io.MicrobatchSource$Reader.startIfNeeded(MicrobatchSource.java:207)
> at 
> org.apache.beam.runners.spark.io.MicrobatchSource$Reader.start(MicrobatchSource.java:227)
> at 
> org.apache.beam.runners.spark.stateful.StateSpecFunctions$1.apply(StateSpecFunctions.java:172)
> at 
> org.apache.beam.runners.spark.stateful.StateSpecFunctions$1.apply(StateSpecFunctions.java:111)
> at 
> org.apache.spark.streaming.StateSpec$.$anonfun$function$1(StateSpec.scala:181)
> at 
> org.apache.spark.streaming.rdd.MapWithStateRDDRecord$.$anonfun$updateRecordWithData$3(MapWithStateRDD.scala:57)
> at scala.collection.Iterator.foreach(Iterator.scala:943)
> at scala.collection.Iterator.foreach$(Iterator.scala:943)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
> at 
> org.apache.spark.streaming.rdd.MapWithStateRDDRecord$.updateRecordWithData(MapWithStateRDD.scala:55)
> at 
> org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:159)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:380)
> at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:393)
> at 
> org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1486)
> at org.apache.spark.storage.BlockManager.org 
> <http://org.apache.spark.storage.blockmanager.org/>$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1413)
> at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1477)
> at 
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1296)
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:391)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:342)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:380)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:344)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:380)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:344)
> at 
> org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:75)
> at 
> com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
> at 
> org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:75)
> at 
> com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)

Re: Running small app using Apache Beam, KafkaIO, Azure EventHuband Databricks Spark

2022-02-01 Thread Utkarsh Parekh
And I also get this error occasionally when I execute a streaming pipeline
with a new cluster instead of an existing cluster.

https://issues.apache.org/jira/browse/BEAM-12032?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel

On Tue, Feb 1, 2022 at 1:59 PM Utkarsh Parekh 
wrote:

> If you tested earlier with the same stack, which version did you use?
>
> *Can you enable debug logs to check what’s happening there? *So far the
> following warning was received from from log4j which I received from log4j
> on Databricks (no errors other than that).
>
> *Can you make sure that there is no issue with firewall or something? *No
> I don't think so. Because it's working fine locally and databricks notebook.
>
> *Can you run this pipeline locally against a real Kafka server, not Azure
> Event Hub, to make sure that it works fine? *Yes it's working fine with
> both Azure EventHub and Kafka
>
>
>
> org.springframework.core.convert.support.DefaultConversionService.getSharedInstance()'
> at
> org.springframework.expression.spel.support.StandardTypeConverter.(StandardTypeConverter.java:46)
> at
> org.springframework.expression.spel.support.StandardEvaluationContext.getTypeConverter(StandardEvaluationContext.java:197)
> at
> org.springframework.expression.spel.support.ReflectiveMethodResolver.resolve(ReflectiveMethodResolver.java:115)
> at
> org.springframework.expression.spel.ast.MethodReference.findAccessorForMethod(MethodReference.java:201)
> at
> org.springframework.expression.spel.ast.MethodReference.getValueInternal(MethodReference.java:130)
> at
> org.springframework.expression.spel.ast.MethodReference.access$000(MethodReference.java:52)
> at
> org.springframework.expression.spel.ast.MethodReference$MethodValueRef.getValue(MethodReference.java:377)
> at
> org.springframework.expression.spel.ast.CompoundExpression.getValueInternal(CompoundExpression.java:88)
> at
> org.springframework.expression.spel.ast.SpelNodeImpl.getValue(SpelNodeImpl.java:121)
> at
> org.springframework.expression.spel.standard.SpelExpression.getValue(SpelExpression.java:262)
> at
> org.apache.beam.sdk.io.kafka.ConsumerSpEL.evaluateAssign(ConsumerSpEL.java:124)
> at
> org.apache.beam.sdk.io.kafka.KafkaUnboundedReader.start(KafkaUnboundedReader.java:85)
> at
> org.apache.beam.runners.spark.io.MicrobatchSource$Reader.startIfNeeded(MicrobatchSource.java:207)
> at
> org.apache.beam.runners.spark.io.MicrobatchSource$Reader.start(MicrobatchSource.java:227)
> at
> org.apache.beam.runners.spark.stateful.StateSpecFunctions$1.apply(StateSpecFunctions.java:172)
> at
> org.apache.beam.runners.spark.stateful.StateSpecFunctions$1.apply(StateSpecFunctions.java:111)
> at
> org.apache.spark.streaming.StateSpec$.$anonfun$function$1(StateSpec.scala:181)
> at
> org.apache.spark.streaming.rdd.MapWithStateRDDRecord$.$anonfun$updateRecordWithData$3(MapWithStateRDD.scala:57)
> at scala.collection.Iterator.foreach(Iterator.scala:943)
> at scala.collection.Iterator.foreach$(Iterator.scala:943)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
> at
> org.apache.spark.streaming.rdd.MapWithStateRDDRecord$.updateRecordWithData(MapWithStateRDD.scala:55)
> at
> org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:159)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:380)
> at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:393)
> at
> org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1486)
> at org.apache.spark.storage.BlockManager.org
> $apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1413)
> at
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1477)
> at
> org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1296)
> at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:391)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:342)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:380)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:344)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:380)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:344)
> at
> org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:75)
> at
> com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
> at
> org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:75)
> at
> com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
> at org.apache.spark.scheduler.ResultT

Re: Running small app using Apache Beam, KafkaIO, Azure EventHuband Databricks Spark

2022-02-01 Thread Utkarsh Parekh
If you tested earlier with the same stack, which version did you use?

*Can you enable debug logs to check what’s happening there? *So far, the
following warning was received from log4j on Databricks (no errors other than
that).

*Can you make sure that there is no issue with firewall or something? *No, I
don't think so, because it's working fine locally and in a Databricks notebook.

*Can you run this pipeline locally against a real Kafka server, not Azure
Event Hub, to make sure that it works fine? *Yes, it's working fine with
both Azure Event Hub and Kafka.


org.springframework.core.convert.support.DefaultConversionService.getSharedInstance()'
at
org.springframework.expression.spel.support.StandardTypeConverter.(StandardTypeConverter.java:46)
at
org.springframework.expression.spel.support.StandardEvaluationContext.getTypeConverter(StandardEvaluationContext.java:197)
at
org.springframework.expression.spel.support.ReflectiveMethodResolver.resolve(ReflectiveMethodResolver.java:115)
at
org.springframework.expression.spel.ast.MethodReference.findAccessorForMethod(MethodReference.java:201)
at
org.springframework.expression.spel.ast.MethodReference.getValueInternal(MethodReference.java:130)
at
org.springframework.expression.spel.ast.MethodReference.access$000(MethodReference.java:52)
at
org.springframework.expression.spel.ast.MethodReference$MethodValueRef.getValue(MethodReference.java:377)
at
org.springframework.expression.spel.ast.CompoundExpression.getValueInternal(CompoundExpression.java:88)
at
org.springframework.expression.spel.ast.SpelNodeImpl.getValue(SpelNodeImpl.java:121)
at
org.springframework.expression.spel.standard.SpelExpression.getValue(SpelExpression.java:262)
at
org.apache.beam.sdk.io.kafka.ConsumerSpEL.evaluateAssign(ConsumerSpEL.java:124)
at
org.apache.beam.sdk.io.kafka.KafkaUnboundedReader.start(KafkaUnboundedReader.java:85)
at
org.apache.beam.runners.spark.io.MicrobatchSource$Reader.startIfNeeded(MicrobatchSource.java:207)
at
org.apache.beam.runners.spark.io.MicrobatchSource$Reader.start(MicrobatchSource.java:227)
at
org.apache.beam.runners.spark.stateful.StateSpecFunctions$1.apply(StateSpecFunctions.java:172)
at
org.apache.beam.runners.spark.stateful.StateSpecFunctions$1.apply(StateSpecFunctions.java:111)
at
org.apache.spark.streaming.StateSpec$.$anonfun$function$1(StateSpec.scala:181)
at
org.apache.spark.streaming.rdd.MapWithStateRDDRecord$.$anonfun$updateRecordWithData$3(MapWithStateRDD.scala:57)
at scala.collection.Iterator.foreach(Iterator.scala:943)
at scala.collection.Iterator.foreach$(Iterator.scala:943)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
at
org.apache.spark.streaming.rdd.MapWithStateRDDRecord$.updateRecordWithData(MapWithStateRDD.scala:55)
at
org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:159)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:380)
at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:393)
at
org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1486)
at org.apache.spark.storage.BlockManager.org
$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1413)
at
org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1477)
at
org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1296)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:391)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:342)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:380)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:344)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:380)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:344)
at
org.apache.spark.scheduler.ResultTask.$anonfun$runTask$3(ResultTask.scala:75)
at
com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at
org.apache.spark.scheduler.ResultTask.$anonfun$runTask$1(ResultTask.scala:75)
at
com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:55)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:153)
at org.apache.spark.scheduler.Task.$anonfun$run$1(Task.scala:122)
at
com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.scheduler.Task.run(Task.scala:93)
at
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$13(Executor.scala:824)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1621)
at
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:827)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at
com.databricks.spark.util.ExecutorFrameProfiler$.record(ExecutorFrameProfiler.scala:110)
at org.apache.spark.executor.Executor$TaskRunner.run

Re: Running small app using Apache Beam, KafkaIO, Azure EventHuband Databricks Spark

2022-02-01 Thread Alexey Romanenko
Well, personally I didn’t test with this version, but it should be fine… 
Can you enable debug logs to check what’s happening there? 
Can you make sure that there is no issue with firewall or something? 
Can you run this pipeline locally against a real Kafka server, not Azure Event 
Hub, to make sure that it works fine?
Otherwise, it would require remote debugging of the worker process.

> On 1 Feb 2022, at 19:18, Utkarsh Parekh  wrote:
> 
> Sorry I sent the last message in a hurry. Here is the Beam java to kafka: Is 
> something missing here?
> 
> 
> org.apache.beam
> beam-sdks-java-io-kafka
> 2.35.0
> 
> 
> On Tue, Feb 1, 2022 at 9:01 AM Utkarsh Parekh  <mailto:utkarsh.s.par...@gmail.com>> wrote:
> Here it is 
> 
> 
> org.apache.kafka
> kafka-clients
> 2.8.0
> 
> 
> On Tue, Feb 1, 2022 at 8:53 AM Alexey Romanenko  <mailto:aromanenko@gmail.com>> wrote:
> Hmm, this is strange. Which version of Kafka client do you use while running 
> it with Beam?
> 
>> On 1 Feb 2022, at 16:56, Utkarsh Parekh > <mailto:utkarsh.s.par...@gmail.com>> wrote:
>> 
>> Hi Alexey, 
>> 
>> First of all, thank you for the response! Yes I did have it in Consumer 
>> configuration and try to increase "session.timeout".
>> 
>> From consumer side so far I've following settings:
>> props.put("sasl.mechanism", SASL_MECHANISM);
>> props.put("security.protocol", SECURITY_PROTOCOL);
>> props.put("sasl.jaas.config", saslJaasConfig);
>> props.put("request.timeout.ms <http://request.timeout.ms/>", 6);
>> props.put("session.timeout.ms <http://session.timeout.ms/>", 6);
>> props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, AUTO_OFFSET_RESET_CONFIG);
>> props.put(ConsumerConfig.GROUP_ID_CONFIG, consumerGroup);
>> 
>> It works fine using following code in Databricks Notebook. The problem has 
>> been occurring when I run it through Apache beam and KafkaIO (Just providing 
>> more context if that may help you to understand problem)
>> 
>> val df = spark.readStream
>> .format("kafka")
>> .option("subscribe", TOPIC)
>> .option("kafka.bootstrap.servers", BOOTSTRAP_SERVERS)
>> .option("kafka.sasl.mechanism", "PLAIN")
>> .option("kafka.security.protocol", "SASL_SSL")
>> .option("kafka.sasl.jaas.config", EH_SASL)
>> .option("kafka.request.timeout.ms <http://kafka.request.timeout.ms/>", 
>> "6")
>> .option("kafka.session.timeout.ms <http://kafka.session.timeout.ms/>", 
>> "6")
>> .option("failOnDataLoss", "false")
>> //.option("kafka.group.id <http://kafka.group.id/>", "testsink")
>> .option("startingOffsets", "latest")
>> .load()
>> 
>> Utkarsh
>> 
>> On Tue, Feb 1, 2022 at 6:20 AM Alexey Romanenko > <mailto:aromanenko@gmail.com>> wrote:
>> Hi Utkarsh,
>> 
>> Can it be related to this configuration problem?
>> https://docs.microsoft.com/en-us/azure/event-hubs/apache-kafka-troubleshooting-guide#no-records-received
>>  
>> <https://docs.microsoft.com/en-us/azure/event-hubs/apache-kafka-troubleshooting-guide#no-records-received>
>> 
>> Did you check timeout settings?
>> 
>> —
>> Alexey   
>> 
>> 
>>> On 1 Feb 2022, at 02:27, Utkarsh Parekh >> <mailto:utkarsh.s.par...@gmail.com>> wrote:
>>> 
>>> Hello,
>>> 
>>> I'm doing POC with KafkaIO and spark runner on Azure Databricks. I'm trying 
>>> to create a simple streaming app with Apache Beam, where it reads data from 
>>> an Azure event hub and produces messages into another Azure event hub. 
>>> 
>>> I'm creating and running spark jobs on Azure Databricks.
>>> 
>>> The problem is the consumer (uses SparkRunner) is not able to read data 
>>> from Event hub (queue). There is no activity and no errors on the Spark 
>>> cluster.
>>> 
>>> I would appreciate it if anyone could help to fix this issue.
>>> 
>>> Thank you
>>> 
>>> Utkarsh
>> 
> 



Re: Running small app using Apache Beam, KafkaIO, Azure EventHuband Databricks Spark

2022-02-01 Thread Utkarsh Parekh
Sorry, I sent the last message in a hurry. Here is the Beam Java Kafka IO
dependency. Is something missing here?

<dependency>
  <groupId>org.apache.beam</groupId>
  <artifactId>beam-sdks-java-io-kafka</artifactId>
  <version>2.35.0</version>
</dependency>


On Tue, Feb 1, 2022 at 9:01 AM Utkarsh Parekh 
wrote:

> Here it is
>
> 
> org.apache.kafka
> kafka-clients
> 2.8.0
> 
>
>
> On Tue, Feb 1, 2022 at 8:53 AM Alexey Romanenko 
> wrote:
>
>> Hmm, this is strange. Which version of Kafka client do you use while
>> running it with Beam?
>>
>> On 1 Feb 2022, at 16:56, Utkarsh Parekh 
>> wrote:
>>
>> Hi Alexey,
>>
>> First of all, thank you for the response! Yes I did have it in Consumer
>> configuration and try to increase "session.timeout".
>>
>> From consumer side so far I've following settings:
>>
>> props.put("sasl.mechanism", SASL_MECHANISM);
>> props.put("security.protocol", SECURITY_PROTOCOL);
>> props.put("sasl.jaas.config", saslJaasConfig);
>> props.put("request.timeout.ms", 6);
>> props.put("session.timeout.ms", 6);
>> props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, AUTO_OFFSET_RESET_CONFIG);
>> props.put(ConsumerConfig.GROUP_ID_CONFIG, consumerGroup);
>>
>>
>> It works fine using following code in Databricks Notebook. The problem
>> has been occurring when I run it through Apache beam and KafkaIO (Just
>> providing more context if that may help you to understand problem)
>>
>> val df = spark.readStream
>> .format("kafka")
>> .option("subscribe", TOPIC)
>> .option("kafka.bootstrap.servers", BOOTSTRAP_SERVERS)
>> .option("kafka.sasl.mechanism", "PLAIN")
>> .option("kafka.security.protocol", "SASL_SSL")
>> .option("kafka.sasl.jaas.config", EH_SASL)
>> .option("kafka.request.timeout.ms", "6")
>> .option("kafka.session.timeout.ms", "6")
>> .option("failOnDataLoss", "false")
>> //.option("kafka.group.id", "testsink")
>> .option("startingOffsets", "latest")
>> .load()
>>
>> Utkarsh
>>
>> On Tue, Feb 1, 2022 at 6:20 AM Alexey Romanenko 
>> wrote:
>>
>>> Hi Utkarsh,
>>>
>>> Can it be related to this configuration problem?
>>>
>>> https://docs.microsoft.com/en-us/azure/event-hubs/apache-kafka-troubleshooting-guide#no-records-received
>>>
>>> Did you check timeout settings?
>>>
>>> —
>>> Alexey
>>>
>>>
>>> On 1 Feb 2022, at 02:27, Utkarsh Parekh 
>>> wrote:
>>>
>>> Hello,
>>>
>>> I'm doing POC with KafkaIO and spark runner on Azure Databricks. I'm
>>> trying to create a simple streaming app with Apache Beam, where it reads
>>> data from an Azure event hub and produces messages into another Azure event
>>> hub.
>>>
>>> I'm creating and running spark jobs on Azure Databricks.
>>>
>>> The problem is the consumer (uses SparkRunner) is not able to read data
>>> from Event hub (queue). There is no activity and no errors on the Spark
>>> cluster.
>>>
>>> I would appreciate it if anyone could help to fix this issue.
>>>
>>> Thank you
>>>
>>> Utkarsh
>>>
>>>
>>>
>>


Re: Running small app using Apache Beam, KafkaIO, Azure EventHuband Databricks Spark

2022-02-01 Thread Utkarsh Parekh
Here it is:

<dependency>
  <groupId>org.apache.kafka</groupId>
  <artifactId>kafka-clients</artifactId>
  <version>2.8.0</version>
</dependency>


On Tue, Feb 1, 2022 at 8:53 AM Alexey Romanenko 
wrote:

> Hmm, this is strange. Which version of Kafka client do you use while
> running it with Beam?
>
> On 1 Feb 2022, at 16:56, Utkarsh Parekh 
> wrote:
>
> Hi Alexey,
>
> First of all, thank you for the response! Yes I did have it in Consumer
> configuration and try to increase "session.timeout".
>
> From consumer side so far I've following settings:
>
> props.put("sasl.mechanism", SASL_MECHANISM);
> props.put("security.protocol", SECURITY_PROTOCOL);
> props.put("sasl.jaas.config", saslJaasConfig);
> props.put("request.timeout.ms", 6);
> props.put("session.timeout.ms", 6);
> props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, AUTO_OFFSET_RESET_CONFIG);
> props.put(ConsumerConfig.GROUP_ID_CONFIG, consumerGroup);
>
>
> It works fine using following code in Databricks Notebook. The problem has
> been occurring when I run it through Apache beam and KafkaIO (Just
> providing more context if that may help you to understand problem)
>
> val df = spark.readStream
> .format("kafka")
> .option("subscribe", TOPIC)
> .option("kafka.bootstrap.servers", BOOTSTRAP_SERVERS)
> .option("kafka.sasl.mechanism", "PLAIN")
> .option("kafka.security.protocol", "SASL_SSL")
> .option("kafka.sasl.jaas.config", EH_SASL)
> .option("kafka.request.timeout.ms", "6")
> .option("kafka.session.timeout.ms", "6")
> .option("failOnDataLoss", "false")
> //.option("kafka.group.id", "testsink")
> .option("startingOffsets", "latest")
> .load()
>
> Utkarsh
>
> On Tue, Feb 1, 2022 at 6:20 AM Alexey Romanenko 
> wrote:
>
>> Hi Utkarsh,
>>
>> Can it be related to this configuration problem?
>>
>> https://docs.microsoft.com/en-us/azure/event-hubs/apache-kafka-troubleshooting-guide#no-records-received
>>
>> Did you check timeout settings?
>>
>> —
>> Alexey
>>
>>
>> On 1 Feb 2022, at 02:27, Utkarsh Parekh 
>> wrote:
>>
>> Hello,
>>
>> I'm doing POC with KafkaIO and spark runner on Azure Databricks. I'm
>> trying to create a simple streaming app with Apache Beam, where it reads
>> data from an Azure event hub and produces messages into another Azure event
>> hub.
>>
>> I'm creating and running spark jobs on Azure Databricks.
>>
>> The problem is the consumer (uses SparkRunner) is not able to read data
>> from Event hub (queue). There is no activity and no errors on the Spark
>> cluster.
>>
>> I would appreciate it if anyone could help to fix this issue.
>>
>> Thank you
>>
>> Utkarsh
>>
>>
>>
>


Re: Running small app using Apache Beam, KafkaIO, Azure EventHuband Databricks Spark

2022-02-01 Thread Alexey Romanenko
Hmm, this is strange. Which version of Kafka client do you use while running it 
with Beam?

> On 1 Feb 2022, at 16:56, Utkarsh Parekh  wrote:
> 
> Hi Alexey, 
> 
> First of all, thank you for the response! Yes I did have it in Consumer 
> configuration and try to increase "session.timeout".
> 
> From consumer side so far I've following settings:
> props.put("sasl.mechanism", SASL_MECHANISM);
> props.put("security.protocol", SECURITY_PROTOCOL);
> props.put("sasl.jaas.config", saslJaasConfig);
> props.put("request.timeout.ms <http://request.timeout.ms/>", 6);
> props.put("session.timeout.ms <http://session.timeout.ms/>", 6);
> props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, AUTO_OFFSET_RESET_CONFIG);
> props.put(ConsumerConfig.GROUP_ID_CONFIG, consumerGroup);
> 
> It works fine using following code in Databricks Notebook. The problem has 
> been occurring when I run it through Apache beam and KafkaIO (Just providing 
> more context if that may help you to understand problem)
> 
> val df = spark.readStream
> .format("kafka")
> .option("subscribe", TOPIC)
> .option("kafka.bootstrap.servers", BOOTSTRAP_SERVERS)
> .option("kafka.sasl.mechanism", "PLAIN")
> .option("kafka.security.protocol", "SASL_SSL")
> .option("kafka.sasl.jaas.config", EH_SASL)
> .option("kafka.request.timeout.ms <http://kafka.request.timeout.ms/>", 
> "6")
> .option("kafka.session.timeout.ms <http://kafka.session.timeout.ms/>", 
> "6")
> .option("failOnDataLoss", "false")
> //.option("kafka.group.id <http://kafka.group.id/>", "testsink")
> .option("startingOffsets", "latest")
> .load()
> 
> Utkarsh
> 
> On Tue, Feb 1, 2022 at 6:20 AM Alexey Romanenko  <mailto:aromanenko@gmail.com>> wrote:
> Hi Utkarsh,
> 
> Can it be related to this configuration problem?
> https://docs.microsoft.com/en-us/azure/event-hubs/apache-kafka-troubleshooting-guide#no-records-received
>  
> <https://docs.microsoft.com/en-us/azure/event-hubs/apache-kafka-troubleshooting-guide#no-records-received>
> 
> Did you check timeout settings?
> 
> —
> Alexey
> 
> 
>> On 1 Feb 2022, at 02:27, Utkarsh Parekh > <mailto:utkarsh.s.par...@gmail.com>> wrote:
>> 
>> Hello,
>> 
>> I'm doing POC with KafkaIO and spark runner on Azure Databricks. I'm trying 
>> to create a simple streaming app with Apache Beam, where it reads data from 
>> an Azure event hub and produces messages into another Azure event hub. 
>> 
>> I'm creating and running spark jobs on Azure Databricks.
>> 
>> The problem is the consumer (uses SparkRunner) is not able to read data from 
>> Event hub (queue). There is no activity and no errors on the Spark cluster.
>> 
>> I would appreciate it if anyone could help to fix this issue.
>> 
>> Thank you
>> 
>> Utkarsh
> 



Re: Running small app using Apache Beam, KafkaIO, Azure EventHuband Databricks Spark

2022-02-01 Thread Utkarsh Parekh
Hi Alexey,

First of all, thank you for the response! Yes, I did have it in the consumer
configuration and tried to increase "session.timeout".

From the consumer side, so far I have the following settings:

props.put("sasl.mechanism", SASL_MECHANISM);
props.put("security.protocol", SECURITY_PROTOCOL);
props.put("sasl.jaas.config", saslJaasConfig);
props.put("request.timeout.ms", 6);
props.put("session.timeout.ms", 6);
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, AUTO_OFFSET_RESET_CONFIG);
props.put(ConsumerConfig.GROUP_ID_CONFIG, consumerGroup);


It works fine using the following code in a Databricks notebook. The problem has
been occurring when I run it through Apache Beam and KafkaIO (just providing
more context in case that helps you understand the problem).

val df = spark.readStream
.format("kafka")
.option("subscribe", TOPIC)
.option("kafka.bootstrap.servers", BOOTSTRAP_SERVERS)
.option("kafka.sasl.mechanism", "PLAIN")
.option("kafka.security.protocol", "SASL_SSL")
.option("kafka.sasl.jaas.config", EH_SASL)
.option("kafka.request.timeout.ms", "6")
.option("kafka.session.timeout.ms", "6")
.option("failOnDataLoss", "false")
//.option("kafka.group.id", "testsink")
.option("startingOffsets", "latest")
.load()

Utkarsh
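
For comparison, a rough sketch (not the code from this thread) of how the same
consumer settings can be handed to Beam's KafkaIO; the Python cross-language
ReadFromKafka is shown for brevity, and in Java the equivalent is passing the
same map to KafkaIO.read().withConsumerConfigUpdates(...). All endpoint,
credential and timeout values below are placeholders.

import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka

consumer_config = {
    "bootstrap.servers": "<namespace>.servicebus.windows.net:9093",  # placeholder
    "sasl.mechanism": "PLAIN",
    "security.protocol": "SASL_SSL",
    "sasl.jaas.config": "<Event Hub SASL JAAS config>",  # placeholder
    "request.timeout.ms": "60000",  # placeholder timeouts
    "session.timeout.ms": "60000",
    "auto.offset.reset": "latest",
    "group.id": "beam-eventhub-poc",  # placeholder group id
}

with beam.Pipeline() as p:  # add the Spark runner options you normally use
    _ = (p
         | ReadFromKafka(consumer_config=consumer_config, topics=["<topic>"])
         | beam.Map(print))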

On Tue, Feb 1, 2022 at 6:20 AM Alexey Romanenko 
wrote:

> Hi Utkarsh,
>
> Can it be related to this configuration problem?
>
> https://docs.microsoft.com/en-us/azure/event-hubs/apache-kafka-troubleshooting-guide#no-records-received
>
> Did you check timeout settings?
>
> —
> Alexey
>
>
> On 1 Feb 2022, at 02:27, Utkarsh Parekh 
> wrote:
>
> Hello,
>
> I'm doing POC with KafkaIO and spark runner on Azure Databricks. I'm
> trying to create a simple streaming app with Apache Beam, where it reads
> data from an Azure event hub and produces messages into another Azure event
> hub.
>
> I'm creating and running spark jobs on Azure Databricks.
>
> The problem is the consumer (uses SparkRunner) is not able to read data
> from Event hub (queue). There is no activity and no errors on the Spark
> cluster.
>
> I would appreciate it if anyone could help to fix this issue.
>
> Thank you
>
> Utkarsh
>
>
>


Re: Running small app using Apache Beam, KafkaIO, Azure EventHuband Databricks Spark

2022-02-01 Thread Alexey Romanenko
Hi Utkarsh,

Can it be related to this configuration problem?
https://docs.microsoft.com/en-us/azure/event-hubs/apache-kafka-troubleshooting-guide#no-records-received
 
<https://docs.microsoft.com/en-us/azure/event-hubs/apache-kafka-troubleshooting-guide#no-records-received>

Did you check timeout settings?

—
Alexey  


> On 1 Feb 2022, at 02:27, Utkarsh Parekh  wrote:
> 
> Hello,
> 
> I'm doing POC with KafkaIO and spark runner on Azure Databricks. I'm trying 
> to create a simple streaming app with Apache Beam, where it reads data from 
> an Azure event hub and produces messages into another Azure event hub. 
> 
> I'm creating and running spark jobs on Azure Databricks.
> 
> The problem is the consumer (uses SparkRunner) is not able to read data from 
> Event hub (queue). There is no activity and no errors on the Spark cluster.
> 
> I would appreciate it if anyone could help to fix this issue.
> 
> Thank you
> 
> Utkarsh



Running small app using Apache Beam, KafkaIO, Azure EventHuband Databricks Spark

2022-01-31 Thread Utkarsh Parekh
Hello,

I'm doing POC with KafkaIO and spark runner on Azure Databricks. I'm trying
to create a simple streaming app with Apache Beam, where it reads data from
an Azure event hub and produces messages into another Azure event hub.

I'm creating and running spark jobs on Azure Databricks.

The problem is the consumer (uses SparkRunner) is not able to read data
from Event hub (queue). There is no activity and no errors on the Spark
cluster.

I would appreciate it if anyone could help to fix this issue.

Thank you

Utkarsh


Re: Compatibility of Spark portable runner

2022-01-05 Thread Alexey Romanenko
Generally speaking, to avoid potential issues, the versions used at compile time 
and at runtime should be the same (the Scala version is the most important one) 
but, due to Spark’s backward compatibility, the minor versions can differ.
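
For reference, one quick way to see which Spark and Scala versions the cluster is
actually running is to ask the cluster itself; a small sketch, assuming a PySpark
shell or driver that can reach the target cluster:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # run against the target cluster
print("Spark version:", spark.version)
print("Scala version:",
      spark.sparkContext._jvm.scala.util.Properties.versionString())

Those two values are what should line up with the job server and runner artifacts
(the Spark 3 runner artifacts are compiled against Scala 2.12); minor Spark
versions can usually differ.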

> On 5 Jan 2022, at 07:50, Zheng Ni  wrote:
> 
> Hi Beam Community,
>  
> Greetings.
>  
> I am interested in submitting a spark job through portable runner. I have a 
> question about the compatibility between spark_job_server and spark cluster.
>  
> Let’s say I am going to use beam_spark_job_server of version 2.35.0. How 
> could I know which spark cluster version is compatible with it? Or could it 
> work with any versions of the spark cluster?
>  
> 
> Regards,
> Zheng 
> 



Compatibility of Spark portable runner

2022-01-04 Thread Zheng Ni
Hi Beam Community,



Greetings.



I am interested in submitting a spark job through portable runner. I have a
question about the compatibility between spark_job_server and spark cluster.



Let’s say I am going to use beam_spark_job_server of version 2.35.0. How
could I know which spark cluster version is compatible with it? Or could it
work with any versions of the spark cluster?





Regards,

Zheng


Re: Perf issue with Beam on spark (spark runner)

2021-10-12 Thread Alexey Romanenko
Robert,

Do you have any numbers by chance regarding this optimisation?

Alexey
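
As a rough illustration of the per-element coder round-trip that the quoted
discussion below attributes much of the overhead to, here is a hypothetical
micro-benchmark; the Python PickleCoder is used purely for convenience, while
the thread itself concerns the Java Spark runner.

import timeit
from apache_beam.coders import PickleCoder

coder = PickleCoder()
element = {"id": 123, "payload": "x" * 1000}  # arbitrary sample element
n = 100_000
secs = timeit.timeit(lambda: coder.decode(coder.encode(element)), number=n)
print(f"{n} encode/decode round-trips took {secs:.2f}s")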

> On 5 Oct 2021, at 00:27, Robert Bradshaw  wrote:
> 
> https://github.com/apache/beam/pull/15637 
> <https://github.com/apache/beam/pull/15637> might help some.
> 
> On Thu, Sep 9, 2021 at 5:21 PM Tao Li  <mailto:t...@zillow.com>> wrote:
> Thanks Mike for this info!
> 
>  
> 
> From: Mike Kaplinskiy mailto:m...@ladderlife.com>>
> Reply-To: "user@beam.apache.org <mailto:user@beam.apache.org>" 
> mailto:user@beam.apache.org>>
> Date: Tuesday, September 7, 2021 at 2:15 PM
> To: "user@beam.apache.org <mailto:user@beam.apache.org>" 
> mailto:user@beam.apache.org>>
> Cc: Alexey Romanenko  <mailto:aromanenko@gmail.com>>, Andrew Pilloud  <mailto:apill...@google.com>>, Ismaël Mejía  <mailto:ieme...@gmail.com>>, Kyle Weaver  <mailto:kcwea...@google.com>>, Yuchu Cao  <mailto:yuc...@trulia.com>>
> Subject: Re: Perf issue with Beam on spark (spark runner)
> 
>  
> 
> A long time ago when I was experimenting with the Spark runner for a batch 
> job, I noticed that a lot of time was spend in GC as well. In my case I 
> narrowed it down to how the Spark runner implements Coders. 
> 
>  
> 
> Spark's value prop is that it only serializes data when it truly has no other 
> choice - i.e. when it needs to reclaim memory or when it sends things over 
> the wire. Unfortunately due to the mismatch in serialization APIs between 
> Beam and Spark, Beam's Spark runner actually just serializes things all the 
> time. My theory was that the to/from byte array dance was slow. I attempted 
> to fix this at https://github.com/apache/beam/pull/8371
> but I could never actually reproduce a speedup in performance benchmarks.
> 
>  
> 
> If you're feeling up to it, you could try reviving something like that PR and 
> see if it helps.
> 
>  
> 
> Mike.
> 
> Ladder <http://bit.ly/1VRtWfS>. The smart, modern way to insure your life.
> 
>  
> 
>  
> 
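To make the overhead Mike describes concrete: each element crossing a Spark boundary gets pushed through its Beam coder and back. A rough Python illustration of that round trip (illustrative only — the runner does this in Java via its coder bridge, and the element here is made up):

from apache_beam import coders

# A typical KV element as the Spark runner would see it.
kv_coder = coders.TupleCoder((coders.StrUtf8Coder(), coders.VarIntCoder()))

element = ("auction-42", 1300)
encoded = kv_coder.encode(element)   # bytes materialized at the boundary
decoded = kv_coder.decode(encoded)   # parsed back again on the other side
assert decoded == element            # correctness is fine; the cost is CPU/GC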
> On Sat, Aug 14, 2021 at 4:35 PM Tao Li  <mailto:t...@zillow.com>> wrote:
> 
> @Alexey Romanenko <mailto:aromanenko@gmail.com> I tried out ParquetIO 
> splittable and the processing time improved from 10 min to 6 min, but still 
> much longer than 2 min using a native spark app.
> 
>  
> 
> We are still seeing a lot of GC cost from below call stack. Do you think this 
> ticket can fix this issue https://issues.apache.org/jira/browse/BEAM-12646 ? Thanks.
> 
>  
> 
> 
> 
>  
> 
>  
> 
>  
> 
> From: Tao Li mailto:t...@zillow.com>>
> Reply-To: "user@beam.apache.org <mailto:user@beam.apache.org>" 
> mailto:user@beam.apache.org>>
> Date: Friday, August 6, 2021 at 11:12 AM
> To: Alexey Romanenko  <mailto:aromanenko@gmail.com>>
> Cc: "user@beam.apache.org <mailto:user@beam.apache.org>" 
> mailto:user@beam.apache.org>>, Andrew Pilloud 
> mailto:apill...@google.com>>, Ismaël Mejía 
> mailto:ieme...@gmail.com>>, Kyle Weaver 
> mailto:kcwea...@google.com>>, Yuchu Cao 
> mailto:yuc...@trulia.com>>
> Subject: Re: Perf issue with Beam on spark (spark runner)
> 
>  
> 
> Thanks @Alexey Romanenko <mailto:aromanenko@gmail.com> please see my 
> clarifications below.
> 
>  
> 
>  
> 
> | “Well, of course, if you read all fields (columns) then you don’t need 
> column projection. Otherwise, it can give a quite significant performance 
> boost, especially for large tables with many columns.”

Re: Trying to run Beam on Spark cluster

2021-09-29 Thread Mark Striebeck
Thanks Kyle,

On Fri, Sep 24, 2021 at 1:48 PM Kyle Weaver  wrote:

> Hi Mark. Looks like a problem with artifact staging. PortableRunner
> implicitly requires a directory (configurable with --artifacts_dir, under
> /tmp by default) that is accessible by both the job server and Beam worker.
>

Hmmm, I guess I could create an NFS share between the machines and use that
for the artifacts_dir. But if I use enironment_type=DOCKER, the docker
image won't have access to that. Is there some easy way to modify the
docker command that the worker runs when it starts the docker image to map
this directory (via '-v') into the docker image?


> You should be able to get around this by using --runner SparkRunner
> instead:
>
> python -m apache_beam.examples.wordcount
>  gs://datapipeline-output/shakespeare-alls-11.txt --output
> gs://datapipeline-output/output/   --project august-ascent-325423
> --environment_type=DOCKER --runner SparkRunner
> --spark_rest_url http://hostname:6066
>
> This requires you to enable REST on your Spark master by putting
> `spark.master.rest.enabled` in your config, and then setting the Beam
> pipeline option --spark_rest_url to use its address (6066 is the default
> port).
>

I'll try that next while waiting for the answer above.

Thanks
 Mark

>
> This starts the job server for you, so you don't need to do that ahead of
> time.
>
> Best,
> Kyle
>
> On Sun, Sep 19, 2021 at 12:22 PM Mark Striebeck 
> wrote:
>
>> Hi,
>>
>> I am trying to run beam on a small spark cluster. I setup spark (master
>> plus one slave). I am using the portable runner and invoke the beam
>> pipeline with:
>>
>> python -m apache_beam.examples.wordcount
>>  gs://datapipeline-output/shakespeare-alls-11.txt --output
>> gs://datapipeline-output/output/   --project august-ascent-325423 --runner
>> PortableRunner --job_endpoint=localhost:8099 --environment_type=DOCKER
>>
>> I always get an error:
>> Caused by: java.util.concurrent.TimeoutException: Timed out while waiting
>> for command 'docker run -d --network=host --env=DOCKER_MAC_CONTAINER=null
>> apache/beam_python3.8_sdk:2.32.0 --id=4-1
>> --provision_endpoint=localhost:46757'
>>
>> It takes ~2.5 minutes to pull the beam image which should be enough. But
>> I pulled the image manually (docker pull apache/beam_python3.8_sdk:2.32.0)
>> and then tried to run the pipeline again.
>>
>> Now, when I run the pipeline I get an error:
>> java.io.FileNotFoundException:
>> /tmp/beam-artifact-staging/60321f712323c195764ab31b3e205b228a405fbb80b50fafa67b38b21959c63f/1-ref_Environment_default_e-pickled_main_session
>> (No such file or directory)
>>
>> and then further down
>>
>> ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
>> org.apache.beam.vendor.guava.v26_0_jre.com.google.common.util.concurrent.UncheckedExecutionException:
>> java.lang.IllegalStateException: No container running for id
>> 7014a9ea98dc0b3f453a9d3860aff43ba42214195d2240d7cefcefcfabf93879
>>
>> (here is the full stack trace:
>> https://drive.google.com/file/d/1mRzt8G7I9Akkya48KfAbrqPp8wRCzXDe/view)
>>
>> Any pointer or idea is appreciated (sorry, if this is something obvious -
>> I'm still pretty new to beam/spark).
>>
>> Thanks
>>   Mark
>>
>
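The REST-based submission described above can also be expressed directly as pipeline options instead of command-line flags — a minimal sketch, with hostname:6066 standing in for a Spark master that has spark.master.rest.enabled turned on:

from apache_beam.options.pipeline_options import PipelineOptions

# hostname:6066 is a placeholder for your Spark master's REST endpoint.
options = PipelineOptions([
    "--runner=SparkRunner",
    "--spark_rest_url=http://hostname:6066",
    "--environment_type=DOCKER",
])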


Trying to run Beam on Spark cluster

2021-09-19 Thread Mark Striebeck
Hi,

I am trying to run beam on a small spark cluster. I setup spark (master
plus one slave). I am using the portable runner and invoke the beam
pipeline with:

python -m apache_beam.examples.wordcount
 gs://datapipeline-output/shakespeare-alls-11.txt --output
gs://datapipeline-output/output/   --project august-ascent-325423 --runner
PortableRunner --job_endpoint=localhost:8099 --environment_type=DOCKER

I always get an error:
Caused by: java.util.concurrent.TimeoutException: Timed out while waiting
for command 'docker run -d --network=host --env=DOCKER_MAC_CONTAINER=null
apache/beam_python3.8_sdk:2.32.0 --id=4-1
--provision_endpoint=localhost:46757'

It takes ~2.5 minutes to pull the beam image which should be enough. But I
pulled the image manually (docker pull apache/beam_python3.8_sdk:2.32.0)
and then tried to run the pipeline again.

Now, when I run the pipeline I get an error:
java.io.FileNotFoundException:
/tmp/beam-artifact-staging/60321f712323c195764ab31b3e205b228a405fbb80b50fafa67b38b21959c63f/1-ref_Environment_default_e-pickled_main_session
(No such file or directory)

and then further down

ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
org.apache.beam.vendor.guava.v26_0_jre.com.google.common.util.concurrent.UncheckedExecutionException:
java.lang.IllegalStateException: No container running for id
7014a9ea98dc0b3f453a9d3860aff43ba42214195d2240d7cefcefcfabf93879

(here is the full stack trace:
https://drive.google.com/file/d/1mRzt8G7I9Akkya48KfAbrqPp8wRCzXDe/view)

Any pointer or idea is appreciated (sorry, if this is something obvious -
I'm still pretty new to beam/spark).

Thanks
  Mark


Re: java.io.InvalidClassException with Spark 3.1.2

2021-08-22 Thread Yu Watanabe
Sure. I starred your repository.

On Sat, Aug 21, 2021 at 11:27 AM cw  wrote:

> Hello Yu,
> I've done a lot of testing; it only works for Spark 2.x, not 3. If you need a
> working example on kubernetes,
> https://github.com/cometta/python-apache-beam-spark , feel free to
> improve the code, if you would like to contribute. Help me by starring it if it
> is useful for you. Thank you
>
> On Monday, August 16, 2021, 12:37:46 AM GMT+8, Yu Watanabe <
> yu.w.ten...@gmail.com> wrote:
>
>
> Hello .
>
> I would like to ask question for spark runner.
>
> Using spark downloaded from below link,
>
>
> https://www.apache.org/dyn/closer.lua/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
>
> I get below error when submitting a pipeline.
> Full error is on
> https://gist.github.com/yuwtennis/7b0c1dc0dcf98297af1e3179852ca693.
>
>
> --
> 21/08/16 01:10:26 WARN TransportChannelHandler: Exception in connection
> from /192.168.11.2:35601
> java.io.InvalidClassException:
> scala.collection.mutable.WrappedArray$ofRef; local class incompatible:
> stream classdesc serialVersionUID = 3456489343829468865, local class
> serialVersionUID = 1028182004549731694
> at
> java.base/java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:689)
> ...
>
> --
>
> SDK Harness and Job service are deployed as below.
>
> 1. SDK Harness
>
> sudo docker run --net=host apache/beam_spark3_job_server:2.31.0
> --spark-master-url=spark://localhost:7077 --clean-artifacts-per-job true
>
> 2. Job service
>
> sudo docker run --net=host apache/beam_python3.8_sdk:2.31.0 --worker_pool
>
> * apache/beam_spark_job_server:2.31.0 for spark 2.4.8
>
> 3. SDK client code
>
> https://gist.github.com/yuwtennis/2e4c13c79f71e8f713e947955115b3e2
>
> Spark 2.4.8 succeeded without any errors using above components.
>
>
> https://archive.apache.org/dist/spark/spark-2.4.8/spark-2.4.8-bin-hadoop2.7.tgz
>
> Would there be any setting which you need to be aware of for spark 3.1.2 ?
>
> Thanks,
> Yu Watanabe
>
> --
> Yu Watanabe
>
> linkedin: www.linkedin.com/in/yuwatanabe1/
> twitter:   twitter.com/yuwtennis
>
>


-- 
Yu Watanabe

linkedin: www.linkedin.com/in/yuwatanabe1/
twitter:   twitter.com/yuwtennis


Re: java.io.InvalidClassException with Spark 3.1.2

2021-08-21 Thread cw
 Hello Yu,   I've done a lot of testing; it only works for Spark 2.x, not 3. If you 
need a working example on kubernetes, 
https://github.com/cometta/python-apache-beam-spark , feel free to improve the 
code, if you would like to contribute. Help me by starring it if it is useful for 
you. Thank you

On Monday, August 16, 2021, 12:37:46 AM GMT+8, Yu Watanabe 
 wrote:  
 
 Hello .
I would like to ask question for spark runner.
Using spark downloaded from below link,
https://www.apache.org/dyn/closer.lua/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz

I get below error when submitting a pipeline. Full error is on 
https://gist.github.com/yuwtennis/7b0c1dc0dcf98297af1e3179852ca693.
--
21/08/16 01:10:26 WARN TransportChannelHandler: Exception in connection from /192.168.11.2:35601
java.io.InvalidClassException: scala.collection.mutable.WrappedArray$ofRef; local class incompatible: stream classdesc serialVersionUID = 3456489343829468865, local class serialVersionUID = 1028182004549731694
 at java.base/java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:689)
...
--
SDK Harness and Job service are deployed as below.
1. SDK Harness
sudo docker run --net=host apache/beam_spark3_job_server:2.31.0 
--spark-master-url=spark://localhost:7077 --clean-artifacts-per-job true

2. Job service
sudo docker run --net=host apache/beam_python3.8_sdk:2.31.0 --worker_pool

* apache/beam_spark_job_server:2.31.0 for spark 2.4.8
3. SDK client code
https://gist.github.com/yuwtennis/2e4c13c79f71e8f713e947955115b3e2

Spark 2.4.8 succeeded without any errors using above components.
https://archive.apache.org/dist/spark/spark-2.4.8/spark-2.4.8-bin-hadoop2.7.tgz

Would there be any setting which you need to be aware of for spark 3.1.2 ?
Thanks,
Yu Watanabe
-- 
Yu Watanabe

linkedin: www.linkedin.com/in/yuwatanabe1/
twitter:   twitter.com/yuwtennis
      

Re: java.io.InvalidClassException with Spark 3.1.2

2021-08-19 Thread Yu Watanabe
Kyle.

Thank you.

On Tue, Aug 17, 2021 at 5:55 AM Kyle Weaver  wrote:

> I was able to reproduce the error. I'm not sure why this would happen,
> since as far as I can tell the Beam 2.31.0 Spark runner should be using
> Spark 3.1.2 and Scala 2.12 [1]. I filed a JIRA issue for it. [2]
>
> [1]
> https://github.com/apache/beam/pull/14897/commits/b6fca2bb79d9e7a69044b477460445456720ec58
> [2] https://issues.apache.org/jira/browse/BEAM-12762
>
>
> On Sun, Aug 15, 2021 at 9:37 AM Yu Watanabe  wrote:
>
>> Hello .
>>
>> I would like to ask question for spark runner.
>>
>> Using spark downloaded from below link,
>>
>>
>> https://www.apache.org/dyn/closer.lua/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
>>
>> I get below error when submitting a pipeline.
>> Full error is on
>> https://gist.github.com/yuwtennis/7b0c1dc0dcf98297af1e3179852ca693.
>>
>>
>> --
>> 21/08/16 01:10:26 WARN TransportChannelHandler: Exception in connection
>> from /192.168.11.2:35601
>> java.io.InvalidClassException:
>> scala.collection.mutable.WrappedArray$ofRef; local class incompatible:
>> stream classdesc serialVersionUID = 3456489343829468865, local class
>> serialVersionUID = 1028182004549731694
>> at
>> java.base/java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:689)
>> ...
>>
>> ------
>>
>> SDK Harness and Job service are deployed as below.
>>
>> 1. SDK Harness
>>
>> sudo docker run --net=host apache/beam_spark3_job_server:2.31.0
>> --spark-master-url=spark://localhost:7077 --clean-artifacts-per-job true
>>
>> 2. Job service
>>
>> sudo docker run --net=host apache/beam_python3.8_sdk:2.31.0 --worker_pool
>>
>> * apache/beam_spark_job_server:2.31.0 for spark 2.4.8
>>
>> 3. SDK client code
>>
>> https://gist.github.com/yuwtennis/2e4c13c79f71e8f713e947955115b3e2
>>
>> Spark 2.4.8 succeeded without any errors using above components.
>>
>>
>> https://archive.apache.org/dist/spark/spark-2.4.8/spark-2.4.8-bin-hadoop2.7.tgz
>>
>> Would there be any setting which you need to be aware of for spark 3.1.2 ?
>>
>> Thanks,
>> Yu Watanabe
>>
>> --
>> Yu Watanabe
>>
>> linkedin: www.linkedin.com/in/yuwatanabe1/
>> twitter:   twitter.com/yuwtennis
>>
>>
>

-- 
Yu Watanabe

linkedin: www.linkedin.com/in/yuwatanabe1/
twitter:   twitter.com/yuwtennis


java.io.InvalidClassException with Spark 3.1.2

2021-08-15 Thread Yu Watanabe
Hello .

I would like to ask question for spark runner.

Using spark downloaded from below link,

https://www.apache.org/dyn/closer.lua/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz

I get below error when submitting a pipeline.
Full error is on
https://gist.github.com/yuwtennis/7b0c1dc0dcf98297af1e3179852ca693.

--
21/08/16 01:10:26 WARN TransportChannelHandler: Exception in connection
from /192.168.11.2:35601
java.io.InvalidClassException: scala.collection.mutable.WrappedArray$ofRef;
local class incompatible: stream classdesc serialVersionUID =
3456489343829468865, local class serialVersionUID = 1028182004549731694
at
java.base/java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:689)
...
--

SDK Harness and Job service are deployed as below.

1. SDK Harness

sudo docker run --net=host apache/beam_spark3_job_server:2.31.0
--spark-master-url=spark://localhost:7077 --clean-artifacts-per-job true

2. Job service

sudo docker run --net=host apache/beam_python3.8_sdk:2.31.0 --worker_pool

* apache/beam_spark_job_server:2.31.0 for spark 2.4.8

3. SDK client code

https://gist.github.com/yuwtennis/2e4c13c79f71e8f713e947955115b3e2

Spark 2.4.8 succeeded without any errors using above components.

https://archive.apache.org/dist/spark/spark-2.4.8/spark-2.4.8-bin-hadoop2.7.tgz

Would there be any setting which you need to be aware of for spark 3.1.2 ?

Thanks,
Yu Watanabe

-- 
Yu Watanabe

linkedin: www.linkedin.com/in/yuwatanabe1/
twitter:   twitter.com/yuwtennis


example on python + spark + k8s

2021-08-15 Thread cw
Hello
 I created an example of running python code on Apache Beam + Spark + 
Kubernetes at https://github.com/cometta/python-apache-beam-spark because I was 
unable to find any step-by-step guide on how to do this for the past few weeks. 
I hope it will be useful for others who are also looking for a similar 
implementation. Have a good day. Contributors are welcome to further improve 
the code.

Thank you





Re: Submit Python Beam on Spark Dataproc

2021-08-15 Thread Yu Watanabe
Hello Mahan.

Sorry for the late reply.

> Still waiting for startup of environment from localhost:5 for worker
id 1-1

From the message, it seems that something is wrong with the connection between
Worker node in spark cluster and SDK harness.

According to this slide, the runner worker (in your context, the Spark worker)
should also have connectivity with the SDK harness container.

https://docs.google.com/presentation/d/1Cso0XP9dmj77OD9Bd53C1M3W1sPJF0ZnA20gzb2BPhE/edit#slide=id.g42e4c9aad6_1_0

Could you please also try setting ssh tunneling to spark worker node as
well ?
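
Concretely, the address given to --environment_config has to be resolvable and reachable from the Spark executors themselves, not just from the machine that submits the job — a minimal sketch, where the worker-pool host is a placeholder:

from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder host: wherever `apache/beam_python3.7_sdk --worker_pool` runs,
# as seen from the Spark worker nodes (not "localhost" on your laptop).
WORKER_POOL = "beam-worker-pool.example.internal:50000"

options = PipelineOptions([
    "--runner=PortableRunner",
    "--job_endpoint=localhost:8099",        # job server, reachable from the client
    "--environment_type=EXTERNAL",
    "--environment_config=" + WORKER_POOL,  # must be reachable from the executors
])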

Thanks,
Yu

On Thu, Aug 12, 2021 at 9:07 PM Mahan Hosseinzadeh 
wrote:

> Thanks Yu for the help and the tips.
>
> I ran the following steps but my job is stuck and can't get submitted to
> Dataproc and I keep getting this message in job-server:
> Still waiting for startup of environment from localhost:5 for worker
> id 1-1
>
>
> -
> *Beam code:*
> pipeline_options = PipelineOptions([
> "--runner=PortableRunner",
> "--job_endpoint=localhost:8099",
> "--environment_type=EXTERNAL",
> "--environment_config=localhost:5"
> ])
>
> -
> *Job Server:*
> I couldn't use Docker because host networking doesn't work on Mac OS and I
> used Gradle instead
>
> ./gradlew :runners:spark:3:job-server:runShadow
>
> -
> *Beam Worker Pool:*
> docker run -p=5:5 apache/beam_python3.7_sdk --worker_pool
>
> -
> *SSH tunnel to the master node:*
> gcloud compute ssh  \
> --project  \
> --zone   \
> -- -NL 7077:localhost:7077
>
> -
>
> Thanks,
> Mahan
>
> On Tue, Aug 10, 2021 at 3:53 PM Yu Watanabe  wrote:
>
>> Hello .
>>
>> Would this page help ? I hope it helps.
>>
>> https://beam.apache.org/documentation/runners/spark/
>>
>> > Running on a pre-deployed Spark cluster
>>
>> 1- What's spark-master-url in case of a remote cluster on Dataproc? Is
>> 7077 the master url port?
>> * Yes.
>>
>> 2- Should we ssh tunnel to sparkMasterUrl port using gcloud compute ssh?
>> * Job server should be able to communicate with Spark master node port
>> 7077. So I believe it is Yes.
>>
>> 3- What's the environment_type? Can we use DOCKER? Then what's the SDK
>> Harness Configuration?
>> * This is the configuration of how you want  your harness container to
>> spin up.
>>
>> https://beam.apache.org/documentation/runtime/sdk-harness-config/
>>
>> For DOCKER , you will need docker deployed on all spark worker nodes.
>> > User code is executed within a container started on each worker node
>>
>> I used EXTERNAL when I did it with flink cluster before.
>>
>> e.g
>>
>> https://github.com/yuwtennis/apache-beam/blob/master/flink-session-cluster/docker/samples/src/sample.py#L14
>>
>> 4- Should we run the job-server outside of the Dataproc cluster or should
>> we run it in the master node?
>> * Depends. It could be inside or outside the master node. But if you are
>> connecting to full managed service, then outside might be better.
>>
>> https://beam.apache.org/documentation/runners/spark/
>>
>> > Start JobService that will connect with the Spark master
>>
>> Thanks,
>> Yu
>>
>> On Tue, Aug 10, 2021 at 7:53 PM Mahan Hosseinzadeh 
>> wrote:
>>
>>> Hi,
>>>
>>> I have a Python Beam job that works on Dataflow but we would like to
>>> submit it on a Spark Dataproc cluster with no Flink involvement.
>>> I already spent days but failed to figure out how to use PortableRunner
>>> with the beam_spark_job_server to submit my Python Beam job to Spark
>>> Dataproc. All the Beam docs are about Flink and there is no guideline about
>>> Spark with Dataproc.
>>> Some relevant questions might be:
>>> 1- What's spark-master-url in case of a remote cluster on Dataproc? Is
>>> 7077 the master url port?
>>> 2- Should we ssh tunnel to sparkMasterUrl port using gcloud compute ssh?
>>> 3- What's the environment_type? Can we use DOCKER? Then what's the SDK
>>> Harness Configuration?
>>> 4- Should we run the job-server outside of the Dataproc cluster or
>>> should we run it in the master node?
>>>
>>> Thanks,
>>> Mahan
>>>
>>
>>
>> --
>> Yu Watanabe
>>
>> linkedin: www.linkedin.com/in/yuwatanabe1/
>> twitter:   twitter.com/yuwtennis
>>
>>
>

-- 
Yu Watanabe

linkedin: www.linkedin.com/in/yuwatanabe1/
twitter:   twitter.com/yuwtennis


Re: Submit Python Beam on Spark Dataproc

2021-08-12 Thread Mahan Hosseinzadeh
Thanks Yu for the help and the tips.

I ran the following steps but my job is stuck and can't get submitted to
Dataproc and I keep getting this message in job-server:
Still waiting for startup of environment from localhost:5 for worker id
1-1

-
*Beam code:*
pipeline_options = PipelineOptions([
"--runner=PortableRunner",
"--job_endpoint=localhost:8099",
"--environment_type=EXTERNAL",
"--environment_config=localhost:5"
])
-
*Job Server:*
I couldn't use Docker because host networking doesn't work on Mac OS and I
used Gradle instead

./gradlew :runners:spark:3:job-server:runShadow
-
*Beam Worker Pool:*
docker run -p=5:5 apache/beam_python3.7_sdk --worker_pool
-
*SSH tunnel to the master node:*
gcloud compute ssh  \
--project  \
--zone   \
-- -NL 7077:localhost:7077
-

Thanks,
Mahan

On Tue, Aug 10, 2021 at 3:53 PM Yu Watanabe  wrote:

> Hello .
>
> Would this page help ? I hope it helps.
>
> https://beam.apache.org/documentation/runners/spark/
>
> > Running on a pre-deployed Spark cluster
>
> 1- What's spark-master-url in case of a remote cluster on Dataproc? Is
> 7077 the master url port?
> * Yes.
>
> 2- Should we ssh tunnel to sparkMasterUrl port using gcloud compute ssh?
> * Job server should be able to communicate with Spark master node port
> 7077. So I believe it is Yes.
>
> 3- What's the environment_type? Can we use DOCKER? Then what's the SDK
> Harness Configuration?
> * This is the configuration of how you want  your harness container to
> spin up.
>
> https://beam.apache.org/documentation/runtime/sdk-harness-config/
>
> For DOCKER , you will need docker deployed on all spark worker nodes.
> > User code is executed within a container started on each worker node
>
> I used EXTERNAL when I did it with flink cluster before.
>
> e.g
>
> https://github.com/yuwtennis/apache-beam/blob/master/flink-session-cluster/docker/samples/src/sample.py#L14
>
> 4- Should we run the job-server outside of the Dataproc cluster or should
> we run it in the master node?
> * Depends. It could be inside or outside the master node. But if you are
> connecting to full managed service, then outside might be better.
>
> https://beam.apache.org/documentation/runners/spark/
>
> > Start JobService that will connect with the Spark master
>
> Thanks,
> Yu
>
> On Tue, Aug 10, 2021 at 7:53 PM Mahan Hosseinzadeh 
> wrote:
>
>> Hi,
>>
>> I have a Python Beam job that works on Dataflow but we would like to
>> submit it on a Spark Dataproc cluster with no Flink involvement.
>> I already spent days but failed to figure out how to use PortableRunner
>> with the beam_spark_job_server to submit my Python Beam job to Spark
>> Dataproc. All the Beam docs are about Flink and there is no guideline about
>> Spark with Dataproc.
>> Some relevant questions might be:
>> 1- What's spark-master-url in case of a remote cluster on Dataproc? Is
>> 7077 the master url port?
>> 2- Should we ssh tunnel to sparkMasterUrl port using gcloud compute ssh?
>> 3- What's the environment_type? Can we use DOCKER? Then what's the SDK
>> Harness Configuration?
>> 4- Should we run the job-server outside of the Dataproc cluster or should
>> we run it in the master node?
>>
>> Thanks,
>> Mahan
>>
>
>
> --
> Yu Watanabe
>
> linkedin: www.linkedin.com/in/yuwatanabe1/
> twitter:   twitter.com/yuwtennis
>
>


Re: Submit Python Beam on Spark Dataproc

2021-08-10 Thread Yu Watanabe
Hello .

Would this page help ? I hope it helps.

https://beam.apache.org/documentation/runners/spark/

> Running on a pre-deployed Spark cluster

1- What's spark-master-url in case of a remote cluster on Dataproc? Is 7077
the master url port?
* Yes.

2- Should we ssh tunnel to sparkMasterUrl port using gcloud compute ssh?
* Job server should be able to communicate with Spark master node port
7077. So I believe it is Yes.

3- What's the environment_type? Can we use DOCKER? Then what's the SDK
Harness Configuration?
* This is the configuration of how you want  your harness container to spin
up.

https://beam.apache.org/documentation/runtime/sdk-harness-config/

For DOCKER , you will need docker deployed on all spark worker nodes.
> User code is executed within a container started on each worker node

I used EXTERNAL when I did it with flink cluster before.

e.g
https://github.com/yuwtennis/apache-beam/blob/master/flink-session-cluster/docker/samples/src/sample.py#L14

4- Should we run the job-server outside of the Dataproc cluster or should
we run it in the master node?
* Depends. It could be inside or outside the master node. But if you are
connecting to full managed service, then outside might be better.

https://beam.apache.org/documentation/runners/spark/

> Start JobService that will connect with the Spark master

Thanks,
Yu

On Tue, Aug 10, 2021 at 7:53 PM Mahan Hosseinzadeh 
wrote:

> Hi,
>
> I have a Python Beam job that works on Dataflow but we would like to
> submit it on a Spark Dataproc cluster with no Flink involvement.
> I already spent days but failed to figure out how to use PortableRunner
> with the beam_spark_job_server to submit my Python Beam job to Spark
> Dataproc. All the Beam docs are about Flink and there is no guideline about
> Spark with Dataproc.
> Some relevant questions might be:
> 1- What's spark-master-url in case of a remote cluster on Dataproc? Is
> 7077 the master url port?
> 2- Should we ssh tunnel to sparkMasterUrl port using gcloud compute ssh?
> 3- What's the environment_type? Can we use DOCKER? Then what's the SDK
> Harness Configuration?
> 4- Should we run the job-server outside of the Dataproc cluster or should
> we run it in the master node?
>
> Thanks,
> Mahan
>


-- 
Yu Watanabe

linkedin: www.linkedin.com/in/yuwatanabe1/
twitter:   twitter.com/yuwtennis


Submit Python Beam on Spark Dataproc

2021-08-10 Thread Mahan Hosseinzadeh
Hi,

I have a Python Beam job that works on Dataflow but we would like to submit
it on a Spark Dataproc cluster with no Flink involvement.
I already spent days but failed to figure out how to use PortableRunner
with the beam_spark_job_server to submit my Python Beam job to Spark
Dataproc. All the Beam docs are about Flink and there is no guideline about
Spark with Dataproc.
Some relevant questions might be:
1- What's spark-master-url in case of a remote cluster on Dataproc? Is 7077
the master url port?
2- Should we ssh tunnel to sparkMasterUrl port using gcloud compute ssh?
3- What's the environment_type? Can we use DOCKER? Then what's the SDK
Harness Configuration?
4- Should we run the job-server outside of the Dataproc cluster or should
we run it in the master node?

Thanks,
Mahan


Re: Perf issue with Beam on spark (spark runner)

2021-08-06 Thread Alexey Romanenko


> On 5 Aug 2021, at 18:17, Tao Li  wrote:
> 
> It was a great presentation!

Thanks!

>  Regarding my perf testing, I was not doing aggregation, filtering, 
> projection or joining. I was simply reading all the fields of parquet and 
> then immediately save PCollection back to parquet.

Well, of course, if you read all fields (columns) then you don’t need column 
projection. Otherwise, it can give a quite significant performance boost, 
especially for large tables with many columns. 


> Regarding SDF translation, is it enabled by default?

From Beam 2.30.0 release notes:

"Legacy Read transform (non-SDF based Read) is used by default for non-FnAPI 
opensource runners. Use `use_sdf_read` experimental flag to re-enable SDF based 
Read transforms 
([BEAM-10670](https://issues.apache.org/jira/browse/BEAM-10670))”
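
So since 2.30.0 the non-SDF read is what the Spark runner uses by default; the SDF-based read only comes back if you opt in via the experiment flag named in that release note — roughly like this (a sketch, nothing here is specific to your job):

from apache_beam.options.pipeline_options import PipelineOptions

# Default behaviour since 2.30.0: legacy (non-SDF) Read on the Spark runner.
default_opts = PipelineOptions(["--runner=SparkRunner"])

# Explicitly opt back in to the SDF-based Read, e.g. to compare the two:
sdf_opts = PipelineOptions(["--runner=SparkRunner", "--experiments=use_sdf_read"])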

—
Alexey

>  I will check out ParquetIO splittable. Thanks!
>  
> From: Alexey Romanenko 
> Date: Thursday, August 5, 2021 at 6:40 AM
> To: Tao Li 
> Cc: "user@beam.apache.org" , Andrew Pilloud 
> , Ismaël Mejía , Kyle Weaver 
> , Yuchu Cao 
> Subject: Re: Perf issue with Beam on spark (spark runner)
>  
> It’s very likely that Spark SQL may have much better performance because of 
> SQL push-downs and avoiding additional ser/deser operations.
>  
> In the same time, did you try to leverage "withProjection()” in ParquetIO and 
> project only the fields that you needed? 
>  
> Did you use ParquetIO splittable (it's not enabled by default, fixed in [1])?
>  
> Also, using SDF translation for Read on Spark Runner can cause performance 
> degradation as well (we noticed that in our experiments). Try to use non-SDF 
> read (if not yet) [2]
>  
>  
> PS: Yesterday, on Beam Summit, we (Ismael and me) gave a related talk. I’m 
> not sure if a recording is already available but you can find the slides here 
> [3] that can be helpful.
>  
>  
> —
> Alexey
>  
> [1] https://issues.apache.org/jira/browse/BEAM-12070
> [2] https://issues.apache.org/jira/browse/BEAM-10670
> [3] https://drive.google.com/file/d/17rJC0BkxpFFL1abVL01c-D0oHvRRmQ-O/view?usp=sharing
>  
> 
> 
>> On 5 Aug 2021, at 03:07, Tao Li mailto:t...@zillow.com>> 
>> wrote:
>>  
>> @Alexey Romanenko <mailto:aromanenko@gmail.com> @Ismaël Mejía 
>> <mailto:ieme...@gmail.com> I assume you are experts on spark runner. Can you 
>> please take a look at this thread and confirm this jira covers the causes 
>> https://issues.apache.org/jira/browse/BEAM-12646 ?
>>  
>> This perf issue is currently a blocker to me..
>>  
>> Thanks so much!
>>  
>> From: Tao Li mailto:t...@zillow.com>>
>> Reply-To: "user@beam.apache.org <mailto:user@beam.apache.org>" 
>> mailto:user@beam.apache.org>>
>> Date: Friday, July 30, 2021 at 3:53 PM
>> To: Andrew Pilloud mailto:apill...@google.com>>, 
>> "user@beam.apache.org <mailto:user@beam.apache.org>" > <mailto:user@beam.apache.org>>
>> Cc: Kyle Weaver mailto:kcwea...@google.com>>, Yuchu 
>> Cao mailto:yuc...@trulia.com>

Re: Spark Structured Streaming runner migrated to Spark 3

2021-08-05 Thread Austin Bennett
Hooray!  Thanks, Etienne!

On Thu, Aug 5, 2021 at 3:11 AM Etienne Chauchot 
wrote:

> Hi all,
>
> Just to let you know that Spark Structured Streaming runner was migrated
> to Spark 3.
>
> Enjoy !
>
> Etienne
>
>


Re: Perf issue with Beam on spark (spark runner)

2021-08-05 Thread Tao Li
Hi Alexey,

It was a great presentation!

Regarding my perf testing, I was not doing aggregation, filtering, projection 
or joining. I was simply reading all the fields of parquet and then immediately 
save PCollection back to parquet.

Regarding SDF translation, is it enabled by default?

I will check out ParquetIO splittable. Thanks!

From: Alexey Romanenko 
Date: Thursday, August 5, 2021 at 6:40 AM
To: Tao Li 
Cc: "user@beam.apache.org" , Andrew Pilloud 
, Ismaël Mejía , Kyle Weaver 
, Yuchu Cao 
Subject: Re: Perf issue with Beam on spark (spark runner)

It’s very likely that Spark SQL may have much better performance because of SQL 
push-downs and avoiding additional ser/deser operations.

At the same time, did you try to leverage "withProjection()” in ParquetIO and 
project only the fields that you needed?

Did you use ParquetIO splittable (it's not enabled by default, fixed in [1])?

Also, using SDF translation for Read on Spark Runner can cause performance 
degradation as well (we noticed that in our experiments). Try to use non-SDF 
read (if not yet) [2]


PS: Yesterday, on Beam Summit, we (Ismael and me) gave a related talk. I’m not 
sure if a recording is already available but you can find the slides here [3] 
that can be helpful.


—
Alexey

[1] https://issues.apache.org/jira/browse/BEAM-12070
[2] https://issues.apache.org/jira/browse/BEAM-10670
[3] https://drive.google.com/file/d/17rJC0BkxpFFL1abVL01c-D0oHvRRmQ-O/view?usp=sharing



On 5 Aug 2021, at 03:07, Tao Li mailto:t...@zillow.com>> wrote:

@Alexey Romanenko<mailto:aromanenko@gmail.com> @Ismaël 
Mejía<mailto:ieme...@gmail.com> I assume you are experts on spark runner. Can 
you please take a look at this thread and confirm this jira covers the causes 
https://issues.apache.org/jira/browse/BEAM-12646 ?

This perf issue is currently a blocker to me..

Thanks so much!

From: Tao Li mailto:t...@zillow.com>>
Reply-To: "user@beam.apache.org<mailto:user@beam.apache.org>" 
mailto:user@beam.apache.org>>
Date: Friday, July 30, 2021 at 3:53 PM
To: Andrew Pilloud mailto:apill...@google.com>>, 
"user@beam.apache.org<mailto:user@beam.apache.org>" 
mailto:user@beam.apache.org>>
Cc: Kyle Weaver mailto:kcwea...@google.com>>, Yuchu Cao 
mailto:yuc...@trulia.com>>
Subject: Re: Perf issue with Beam on spark (spark runner)

Thanks everyone for your help.

We actually did another round of perf comparison between Beam (on spark) and 
native spark, without any projection/filtering in the query (to rule out the 
“predicate pushdown” factor).

The time spent on Beam with spark runner is still taking 3-5x period of time 
compared with native spark, and the cause 
is https://issues.apache.org/jira/browse/BEAM-12646
 according to the spark metrics. Spark runner is pretty much the bottleneck.



From: Andrew Pilloud mailto:apill...@google.com>>
Date: Thursday, July 29,

Re: Perf issue with Beam on spark (spark runner)

2021-08-05 Thread Alexey Romanenko
It’s very likely that Spark SQL may have much better performance because of SQL 
push-downs and avoiding additional ser/deser operations.

At the same time, did you try to leverage "withProjection()” in ParquetIO and 
project only the fields that you needed? 

Did you use ParquetIO splittable (it's not enabled by default, fixed in [1])?

Also, using SDF translation for Read on Spark Runner can cause performance 
degradation as well (we noticed that in our experiments). Try to use non-SDF 
read (if not yet) [2]


PS: Yesterday, on Beam Summit, we (Ismael and me) gave a related talk. I’m not 
sure if a recording is already available but you can find the slides here [3] 
that can be helpful.


—
Alexey

[1] https://issues.apache.org/jira/browse/BEAM-12070
[2] https://issues.apache.org/jira/browse/BEAM-10670
[3] 
https://drive.google.com/file/d/17rJC0BkxpFFL1abVL01c-D0oHvRRmQ-O/view?usp=sharing
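
As a concrete illustration of the column-projection point, the Python SDK exposes the same knob through the columns argument of ReadFromParquet — a minimal sketch, with the path and column names as placeholders:

import apache_beam as beam
from apache_beam.io.parquetio import ReadFromParquet

# Placeholder path/columns: reading only the columns you actually need lets
# the Parquet reader skip the rest, analogous to withProjection in ParquetIO.
with beam.Pipeline() as p:
    bids = p | ReadFromParquet(
        "s3://bucket/path/to/table/*.parquet",
        columns=["auction", "bidder", "price"],
    )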
 

> On 5 Aug 2021, at 03:07, Tao Li  wrote:
> 
> @Alexey Romanenko <mailto:aromanenko@gmail.com> @Ismaël Mejía 
> <mailto:ieme...@gmail.com> I assume you are experts on spark runner. Can you 
> please take a look at this thread and confirm this jira covers the causes 
> https://issues.apache.org/jira/browse/BEAM-12646 ?
>  
> This perf issue is currently a blocker to me..
>  
> Thanks so much!
>  
> From: Tao Li mailto:t...@zillow.com>>
> Reply-To: "user@beam.apache.org <mailto:user@beam.apache.org>" 
> mailto:user@beam.apache.org>>
> Date: Friday, July 30, 2021 at 3:53 PM
> To: Andrew Pilloud mailto:apill...@google.com>>, 
> "user@beam.apache.org <mailto:user@beam.apache.org>"  <mailto:user@beam.apache.org>>
> Cc: Kyle Weaver mailto:kcwea...@google.com>>, Yuchu Cao 
> mailto:yuc...@trulia.com>>
> Subject: Re: Perf issue with Beam on spark (spark runner)
>  
> Thanks everyone for your help.
>  
> We actually did another round of perf comparison between Beam (on spark) and 
> native spark, without any projection/filtering in the query (to rule out the 
> “predicate pushdown” factor).
>  
> The time spent on Beam with spark runner is still taking 3-5x period of time 
> compared with native spark, and the cause 
> is https://issues.apache.org/jira/browse/BEAM-12646
>  according to the spark metrics. Spark runner is pretty much the bottleneck.
>  
> 
>  
> From: Andrew Pilloud mailto:apill...@google.com>>
> Date: Thursday, July 29, 2021 at 2:11 PM
> To: "user@beam.apache.org <mailto:user@beam.apache.org>" 
> mailto:user@beam.apache.org>>
> Cc: Tao Li mailto:t...@zillow.com>>, Kyle Weaver 
> mailto:kcwea...@google.com>>, Yuchu Cao 
> mailto:yuc...@trulia.com>>
> Subject: Re: Perf issue with Beam on spark (spark runner)
>  
> Actually, ParquetIO got pushdown in Beam SQL starting at v2.29.0.
>  
> Andrew
>  
> On Mon, Jul 26, 2021 at 10:05 AM Andrew Pilloud  <mailto:apill...@google.com>> wrote:
>> Beam SQL doesn't currently have project pushdown for ParquetIO (we are 
>> working to expand this to more IOs). Using ParquetIO withProjection directly 
>> will produce better results.
>>  
>> On Mon, Jul 26, 2021 at 9:46 AM Robert Bradshaw > <mailto:rober...@google.com>> wrote:
>>> Could you try using Beam SQL [1] and see if that gives more similar result 
>>> to your Spark SQL query? I would also be curious if the performance is 
>>> sufficient using withProjection to only read the auction, price, and bidder 
>>> columns. 
>>>  
>>> [1] https://beam.apache.org/documentation/dsls/sql/overview/ 

Spark Structured Streaming runner migrated to Spark 3

2021-08-05 Thread Etienne Chauchot

Hi all,

Just to let you know that Spark Structured Streaming runner was migrated 
to Spark 3.


Enjoy !

Etienne



Re: Spark Structured Streaming Runner Roadmap

2021-08-03 Thread Etienne Chauchot

Hi,

Sorry for the late answer: the streaming mode in the Spark Structured 
Streaming runner is blocked by the watermark implementation in the Spark 
Structured Streaming framework itself, on the Apache Spark side. See


https://echauchot.blogspot.com/2020/11/watermark-architecture-proposal-for.html

best

Etienne

On 20/05/2021 20:37, Yu Zhang wrote:

Hi Beam Community,

Would there be any roadmap for Spark Structured Runner to support streaming and 
Splittable DoFn API? Like the specific timeline or release version.

Thanks,
Yu


example on running python apache beam with standalone spark

2021-07-28 Thread cw
I start the Spark docker with this commands on the host machine
curl -LO 
https://raw.githubusercontent.com/bitnami/bitnami-docker-spark/master/docker-compose.yml
# edit and expose port 7077 to host machine
$ docker-compose up


then, i run docker run --net=host apache/beam_spark_job_server:latest 
--spark-master-url=spark://localhost:7077 on the host machine

lastly, i run this command on the host machine python -m 
apache_beam.examples.wordcount --input ./a_file_input \
 --output ./counts \
 --runner=PortableRunner 
--job_endpoint=localhost:8099 --environment_type=LOOPBACK


I get below errors from the beam_spark_job_server docker container

21/07/28 13:41:56 INFO org.apache.beam.runners.spark.SparkJobInvoker: Invoking 
job BeamApp-root-0728134156-a85a64c5_019bd04e-aabc-4deb-8b12-8c61f173cd5b
21/07/28 13:41:56 INFO org.apache.beam.runners.jobsubmission.JobInvocation: 
Starting job invocation 
BeamApp-root-0728134156-a85a64c5_019bd04e-aabc-4deb-8b12-8c61f173cd5b
21/07/28 13:41:56 INFO 
org.apache.beam.runners.core.construction.resources.PipelineResources: 
PipelineOptions.filesToStage was not specified. Defaulting to files from the 
classpath: will stage 6 files. Enable logging at DEBUG level to see which files 
will be staged.
21/07/28 13:41:57 INFO 
org.apache.beam.runners.spark.translation.SparkContextFactory: Creating a brand 
new Spark Context.
21/07/28 13:41:57 WARN org.apache.spark.util.Utils: Your hostname, cometstrike 
resolves to a loopback address: 127.0.1.1; using 192.168. instead (on 
interface wlo1)
21/07/28 13:41:57 WARN org.apache.spark.util.Utils: Set SPARK_LOCAL_IP if you 
need to bind to another address
21/07/28 13:41:57 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load 
native-hadoop library for your platform... using builtin-java classes where 
applicable
21/07/28 13:42:57 ERROR 
org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend: Application has 
been killed. Reason: All masters are unresponsive! Giving up.
21/07/28 13:42:57 WARN 
org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend: Application ID 
is not initialized yet.
21/07/28 13:42:57 WARN 
org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint: Drop 
UnregisterApplication(null) because has not yet connected to master
21/07/28 13:42:57 ERROR org.apache.spark.SparkContext: Error initializing 
SparkContext.
java.lang.NullPointerException
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:560)
    at 
org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
    at 
org.apache.beam.runners.spark.translation.SparkContextFactory.createSparkContext(SparkContextFactory.java:101)
    at 
org.apache.beam.runners.spark.translation.SparkContextFactory.getSparkContext(SparkContextFactory.java:67)
    at 
org.apache.beam.runners.spark.SparkPipelineRunner.run(SparkPipelineRunner.java:118)
    at 
org.apache.beam.runners.jobsubmission.JobInvocation.runPipeline(JobInvocation.java:86)
    at 
org.apache.beam.vendor.guava.v26_0_jre.com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
    at 
org.apache.beam.vendor.guava.v26_0_jre.com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:57)
    at 
org.apache.beam.vendor.guava.v26_0_jre.com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
21/07/28 13:42:57 ERROR org.apache.spark.scheduler.AsyncEventQueue: Listener 
AppStatusListener threw an exception
java.lang.NullPointerException
    at 
org.apache.spark.status.AppStatusListener.onApplicationEnd(AppStatusListener.scala:167)
    at 
org.apache.spark.scheduler.SparkListenerBus$class.doPostEvent(SparkListenerBus.scala:57)
    at 
org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
    at 
org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:37)
    at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:91)
    at 
org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$super$postToAll(AsyncEventQueue.scala:92)
    at 
org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply$mcJ$sp(AsyncEventQueue.scala:92)
    at 
org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87)
    at 
org.apache.spark.scheduler.AsyncEventQueue$$anonfun$org$apache$spark$scheduler$AsyncEventQueue$$dispatch$1.apply(AsyncEventQueue.scala:87)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58

Re: spark-submit and the portable runner

2021-06-24 Thread Trevor Kramer
Thanks. That was the issue. The latest release of Beam indicates it
supports Spark 3 and I find references in the code to Hadoop 3.2.1. Is it
possible to configure beam to run on Hadoop 3.2.1?

Trevor

On Mon, Jun 21, 2021 at 6:19 PM Kyle Weaver  wrote:

> Looks like a version mismatch between Hadoop dependencies [1]. Beam's
> Python wrapper for Spark is currently pinned to Spark 2.4.8 [2]. Which
> Spark and Hadoop versions are your cluster using?
>
> [1]
> https://stackoverflow.com/questions/62880009/error-through-remote-spark-job-java-lang-illegalaccesserror-class-org-apache-h
> [2] https://issues.apache.org/jira/browse/BEAM-12094
>
> On Mon, Jun 21, 2021 at 9:52 AM Trevor Kramer 
> wrote:
>
>> Does anyone have an example of using spark-submit to run a beam job using
>> the portable runner? The documentation indicates this is possible but
>> doesn't give an example of how to do it.
>>
>> I am using the following pipeline options to generate the jar
>>
>> options = PipelineOptions(['--runner=SparkRunner',
>>'--environment_type=DOCKER',
>>'--environment_config=path-to-image:latest',
>>'--output_executable_path=output.jar'
>>])
>>
>> and am trying to run it with
>>
>> spark-submit --master yarn --deploy-mode cluster --class
>> org.apache.beam.runners.spark.SparkPipelineRunner output.jar
>>
>>
>> but I get the following error
>>
>>
>> Exception in thread "main" java.lang.IllegalAccessError: class
>> org.apache.hadoop.hdfs.web.HftpFileSystem cannot access its superinterface
>> org.apache.hadoop.hdfs.web.TokenAspect$TokenManagementDelegator
>>
>> at java.lang.ClassLoader.defineClass1(Native Method)
>>
>> at java.lang.ClassLoader.defineClass(ClassLoader.java:757)
>>
>> at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>>
>> at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
>>
>> at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
>>
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
>>
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
>>
>> at java.security.AccessController.doPrivileged(Native Method)
>>
>> at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
>>
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
>>
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
>>
>> at java.lang.Class.forName0(Native Method)
>>
>> at java.lang.Class.forName(Class.java:348)
>>
>> at
>> java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:370)
>>
>> at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
>>
>> at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
>>
>> at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:3268)
>>
>> at
>> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3313)
>>
>> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3352)
>>
>> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:123)
>>
>> at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3403)
>>
>> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3371)
>>
>> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:482)
>>
>> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:230)
>>
>> at
>> org.apache.spark.deploy.yarn.Client.$anonfun$appStagingBaseDir$2(Client.scala:138)
>>
>> at scala.Option.getOrElse(Option.scala:189)
>>
>> at org.apache.spark.deploy.yarn.Client.<init>(Client.scala:138)
>>
>> at
>> org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1526)
>>
>> at org.apache.spark.deploy.SparkSubmit.org
>> $apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:853)
>>
>> at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
>>
>> at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
>>
>> at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
>>
>> at
>> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:928)
>>
>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:937)
>>
>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>
>>


spark-submit and the portable runner

2021-06-21 Thread Trevor Kramer
Does anyone have an example of using spark-submit to run a beam job using
the portable runner? The documentation indicates this is possible but
doesn't give an example of how to do it.

I am using the following pipeline options to generate the jar

options = PipelineOptions(['--runner=SparkRunner',
   '--environment_type=DOCKER',
   '--environment_config=path-to-image:latest',
   '--output_executable_path=output.jar'
   ])

and am trying to run it with

spark-submit --master yarn --deploy-mode cluster --class
org.apache.beam.runners.spark.SparkPipelineRunner output.jar


but I get the following error


Exception in thread "main" java.lang.IllegalAccessError: class
org.apache.hadoop.hdfs.web.HftpFileSystem cannot access its superinterface
org.apache.hadoop.hdfs.web.TokenAspect$TokenManagementDelegator

at java.lang.ClassLoader.defineClass1(Native Method)

at java.lang.ClassLoader.defineClass(ClassLoader.java:757)

at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)

at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)

at java.net.URLClassLoader.access$100(URLClassLoader.java:74)

at java.net.URLClassLoader$1.run(URLClassLoader.java:369)

at java.net.URLClassLoader$1.run(URLClassLoader.java:363)

at java.security.AccessController.doPrivileged(Native Method)

at java.net.URLClassLoader.findClass(URLClassLoader.java:362)

at java.lang.ClassLoader.loadClass(ClassLoader.java:419)

at java.lang.ClassLoader.loadClass(ClassLoader.java:352)

at java.lang.Class.forName0(Native Method)

at java.lang.Class.forName(Class.java:348)

at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:370)

at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)

at java.util.ServiceLoader$1.next(ServiceLoader.java:480)

at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:3268)

at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3313)

at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3352)

at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:123)

at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3403)

at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3371)

at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:482)

at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:230)

at
org.apache.spark.deploy.yarn.Client.$anonfun$appStagingBaseDir$2(Client.scala:138)

at scala.Option.getOrElse(Option.scala:189)

at org.apache.spark.deploy.yarn.Client.<init>(Client.scala:138)

at
org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1526)

at org.apache.spark.deploy.SparkSubmit.org
$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:853)

at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)

at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)

at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)

at
org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:928)

at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:937)

at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)


Re: How to specify a spark config with Beam spark runner

2021-06-17 Thread Tao Li
Hi Alexey,

Thanks we will give it a try.

From: Alexey Romanenko 
Reply-To: "user@beam.apache.org" 
Date: Thursday, June 10, 2021 at 5:14 AM
To: "user@beam.apache.org" 
Subject: Re: How to specify a spark config with Beam spark runner

Hi Tao,

"Limited spark options”, that you mentioned, are Beam's application arguments 
and if you run your job via "spark-submit" you should still be able to 
configure Spark application via normal spark-submit “--conf key=value” CLI 
option.
Doesn’t it work for you?

—
Alexey


On 10 Jun 2021, at 01:29, Tao Li mailto:t...@zillow.com>> 
wrote:

Hi Beam community,

We are trying to specify a spark config 
“spark.hadoop.fs.s3a.canned.acl=BucketOwnerFullControl” in the spark-submit 
command for a beam app. I only see limited spark options supported according to 
this doc: 
https://beam.apache.org/documentation/runners/spark/

How can we specify an arbitrary spark config? Please advise. Thanks!



Re: How to specify a spark config with Beam spark runner

2021-06-10 Thread Alexey Romanenko
Hi Tao,

"Limited spark options”, that you mentioned, are Beam's application arguments 
and if you run your job via "spark-submit" you should still be able to 
configure Spark application via normal spark-submit “--conf key=value” CLI 
option. 
Doesn’t it work for you?

—
Alexey
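
For a concrete illustration: besides passing “--conf key=value” flags to spark-submit, a programmatic alternative with the classic Java SparkRunner is to hand the runner a pre-configured Spark context. The sketch below assumes Beam's SparkContextOptions interface and its provided-context setters; treat it as a rough outline under those assumptions rather than a tested recipe.

```
import org.apache.beam.runners.spark.SparkContextOptions;
import org.apache.beam.runners.spark.SparkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class ProvidedSparkContextExample {
  public static void main(String[] args) {
    // Build a SparkConf carrying the arbitrary Spark property.
    SparkConf conf = new SparkConf()
        .setAppName("beam-on-spark")
        .set("spark.hadoop.fs.s3a.canned.acl", "BucketOwnerFullControl");
    JavaSparkContext jsc = new JavaSparkContext(conf);

    // Hand the pre-configured context to the Spark runner.
    SparkContextOptions options =
        PipelineOptionsFactory.fromArgs(args).as(SparkContextOptions.class);
    options.setRunner(SparkRunner.class);
    options.setUsesProvidedSparkContext(true);
    options.setProvidedSparkContext(jsc);

    Pipeline pipeline = Pipeline.create(options);
    // ... build the pipeline here ...
    pipeline.run().waitUntilFinish();
  }
}
```

The provided-context route only makes sense when your driver code constructs the JavaSparkContext itself; when submitting a fat jar through spark-submit, the “--conf” flags remain the simpler option.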

> On 10 Jun 2021, at 01:29, Tao Li  wrote:
> 
> Hi Beam community,
>  
> We are trying to specify a spark config 
> “spark.hadoop.fs.s3a.canned.acl=BucketOwnerFullControl” in the spark-submit 
> command for a beam app. I only see limited spark options supported according 
> to this doc: https://beam.apache.org/documentation/runners/spark/ 
> <https://beam.apache.org/documentation/runners/spark/>
>  
> How can we specify an arbitrary spark config? Please advise. Thanks!



How to specify a spark config with Beam spark runner

2021-06-09 Thread Tao Li
Hi Beam community,

We are trying to specify a spark config 
“spark.hadoop.fs.s3a.canned.acl=BucketOwnerFullControl” in the spark-submit 
command for a beam app. I only see limited spark options supported according to 
this doc: https://beam.apache.org/documentation/runners/spark/

How can we specify an arbitrary spark config? Please advise. Thanks!




Re: JdbcIO parallel read on spark

2021-05-27 Thread Thomas Fredriksen(External)
Quick update.

After some testing, we have noticed that the splittable JdbcIO-poc works
well when the number of splits does not exceed the number of spark tasks.
In cases where the number of splits does exceed the task count, the pipeline
freezes after each worker has processed a single split.

In our case, we are attempting to read 895676 rows, with 1 row splits.
We are running this on a single Spark worker for testing purposes, and the
stage is split into 8 tasks. The pipeline freezes after 8 splits have been
processed. We are able to verify that the @ProcessElements-function returns
without exceptions.

On the direct-runner, we noticed that setting the split size to less than
the checkpoint-size (hardcoded to 100) mitigated this problem. Any higher
value will cause the pipeline to freeze just like with the SparkRunner.

Is there anyone who can help shine a light on this?

On Wed, May 26, 2021 at 7:50 AM Thomas Fredriksen(External) <
thomas.fredrik...@cognite.com> wrote:

> There is no forking after the "Generate Queries" transform.
>
> We noticed that the "Generate Queries" transform is in a different stage
> than the reading itself. This is likely due to the Reparallelize-transform,
> and we also see this with JdbcIO.readAll.
>
> After reading up on Splittable DoFn's, we decided to give it a try. We
> essentially copied the source of JdbcIO into our project and changed
> `ReadFn` into `SplittableJdbcIO` which acts as (more or less) a drop-in
> replacement (see source below).
>
> The DAG here is seemingly simpler with a single stage containing all steps
> from reading the DB to writing. We are also seeing that the job is
> parallelizing much better than before.
>
> class SplittableJdbcIO {
>> /* ... */
>> @ProcessElement
>> public void processElement(@Element ParameterT element,
>>RestrictionTracker
>> tracker,
>>OutputReceiver out) throws
>> Exception {
>> if (connection == null) {
>> connection = dataSource.getConnection();
>> }
>>
>> if
>> (!tracker.tryClaim(tracker.currentRestriction().getFrom())) {
>> LOG.error("Failed to claim restriction");
>> ProcessContinuation.stop();
>> }
>>
>> LOG.info("Preparing query. fetchSize={}, shardSize={},
>> from={}, to={}", fetchSize, shardSize,
>> tracker.currentRestriction().getFrom(),
>> tracker.currentRestriction().getTo());
>>
>> String executeQuery = String.format("SELECT * FROM (%s) t
>> OFFSET ? ROWS FETCH NEXT ? ROWS ONLY;", query.get());
>>
>> // PostgreSQL requires autocommit to be disabled to enable
>> cursor streaming
>> // see
>> https://jdbc.postgresql.org/documentation/head/query.html#query-with-cursor
>> LOG.debug("Autocommit has been disabled");
>> connection.setAutoCommit(false);
>> try (PreparedStatement statement =
>> connection.prepareStatement(executeQuery, ResultSet.TYPE_FORWARD_ONLY,
>> ResultSet.CONCUR_READ_ONLY)) {
>> statement.setFetchSize(fetchSize);
>> parameterSetter.setParameters(element, statement);
>>
>>
>> statement.setLong(statement.getParameterMetaData().getParameterCount() - 1,
>> tracker.currentRestriction().getFrom());
>>
>> statement.setLong(statement.getParameterMetaData().getParameterCount(),
>> tracker.currentRestriction().getTo() -
>> tracker.currentRestriction().getFrom());
>>
>> queryCounter.inc();
>>
>> int count = 0;
>> long t0 = Instant.now().getMillis();
>>
>> try (ResultSet resultSet = statement.executeQuery()) {
>> queryLatency.update(Instant.now().getMillis() - t0);
>> LOG.info("Query took {} ms",
>> Instant.now().getMillis() - t0);
>>
>> while (resultSet.next()) {
>> out.output(rowMapper.mapRow(resultSet));
>>
>> rowCounter.inc();
>> count++;
>> }
>> LOG.info("Fetched {} rows", count);
>>
>> }
>> }
>> }
>>
>> @SplitRestriction
>> public void splitRestriction(@Element ParameterT element,
>> 
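
The quoted code is cut off at this point in the archive. As a rough orientation, the restriction-splitting piece of such a splittable JDBC read can be sketched as follows; the element type, shard size, and class name are placeholders, not the original author's code:

```
import org.apache.beam.sdk.io.range.OffsetRange;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.splittabledofn.RestrictionTracker;

// Rough sketch of the SDF restriction methods for an offset-based JDBC read.
class SplittableJdbcReadSketch extends DoFn<String, String> {
  private final long shardSize = 10_000L;

  @GetInitialRestriction
  public OffsetRange initialRestriction(@Element String query) {
    // The real code runs a COUNT(*) query here; hard-coded for the sketch.
    return new OffsetRange(0, 1_000_000L);
  }

  @SplitRestriction
  public void splitRestriction(
      @Element String query,
      @Restriction OffsetRange restriction,
      OutputReceiver<OffsetRange> receiver) {
    // Emit fixed-size shards that the runner can hand to different workers.
    for (OffsetRange shard : restriction.split(shardSize, shardSize)) {
      receiver.output(shard);
    }
  }

  @ProcessElement
  public void processElement(
      @Element String query,
      RestrictionTracker<OffsetRange, Long> tracker,
      OutputReceiver<String> out) {
    // Claim the restriction and run the OFFSET/FETCH query for its range,
    // as in the code quoted above.
    if (!tracker.tryClaim(tracker.currentRestriction().getFrom())) {
      return;
    }
    // ... execute the query and out.output(...) each mapped row ...
  }
}
```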

Re: JdbcIO parallel read on spark

2021-05-25 Thread Thomas Fredriksen(External)
 try (ResultSet resultSet = statement.executeQuery()) {
> resultSet.next();
>
> long result = resultSet.getLong(1);
> LOG.info("The query results in {} rows", result);
> return new OffsetRange(0, result);
> }
> }
> }
> }
> }
>



On Tue, May 25, 2021 at 6:35 PM Alexey Romanenko 
wrote:

> Hi,
>
> Did you check a Spark DAG if it doesn’t fork branches after "Genereate
> queries” transform?
>
> —
> Alexey
>
> On 24 May 2021, at 20:32, Thomas Fredriksen(External) <
> thomas.fredrik...@cognite.com> wrote:
>
> Hi there,
>
> We are struggling to get the JdbcIO-connector to read a large table on
> spark.
>
> In short - we wish to read a large table (several billion rows), transform
> then write the transformed data to a new table.
>
> We are aware that `JdbcIO.read()` does not parallelize. In order to solve
> this, we attempted to create ranges then generate `limit/offset` queries
> and use `JdbcIO.readAll()` instead.
>
> The overall steps look something like this (sanitized for readability):
>
> ```
> pipeline
>   .apply("Read row count", JdbcIo.read()
> .withQuery("select count(*) from MYTABLE;")
> .withCoder(VarLongCoder.of())
> .withOtherOptions(...))
>   .apply("Genereate queries", ParDo.of(new DoFn() {...}) //
> Outputs table offsets
>   .apply("Read results", JdbcIO.readAll()
> .withCoder(SchemaCoder.of(...))
> .withOutputParallelization(false)
> .withQuery("select * from MYTABLE offset ? limit MYLIMIT;")
> .withParameterSetter((element, statement) -> statement.setLong(1,
> element))
> .withOtherOptions(...))
>   .apply("more steps", ...);
> ```
>
> The problem is that this does not seem to parallelize on the spark runner.
> Only a single worker seem to be doing all the work.
>
> We have tried to break fusion using a variant of `JdbcIO.Reparallelize()`,
> however this did not seem to make a difference.
>
> Our goal is to avoid all data from the query be cached in memory between
> the read and transform operations. This causes OOM-exceptions. Having a
> single worker reading the database is okay as long as other workers can
> process the data as soon as it is read and not having to wait for all the
> data to be ready.
>
> Any advice on how we approach this.
>
> Best Regards
> Thomas Li Fredriksen
>
>
>


Re: JdbcIO parallel read on spark

2021-05-25 Thread Alexey Romanenko
Hi,

Did you check the Spark DAG to see whether it forks into branches after the "Genereate queries” 
transform?

—
Alexey

> On 24 May 2021, at 20:32, Thomas Fredriksen(External) 
>  wrote:
> 
> Hi there,
> 
> We are struggling to get the JdbcIO-connector to read a large table on spark.
> 
> In short - we wish to read a large table (several billion rows), transform 
> then write the transformed data to a new table.
> 
> We are aware that `JdbcIO.read()` does not parallelize. In order to solve 
> this, we attempted to create ranges then generate `limit/offset` queries and 
> use `JdbcIO.readAll()` instead.
> 
> The overall steps look something like this (sanitized for readability):
> 
> ```
> pipeline
>   .apply("Read row count", JdbcIo.read()
> .withQuery("select count(*) from MYTABLE;")
> .withCoder(VarLongCoder.of())
> .withOtherOptions(...))
>   .apply("Genereate queries", ParDo.of(new DoFn() {...}) // 
> Outputs table offsets
>   .apply("Read results", JdbcIO.readAll()
> .withCoder(SchemaCoder.of(...))
> .withOutputParallelization(false)
> .withQuery("select * from MYTABLE offset ? limit MYLIMIT;")
> .withParameterSetter((element, statement) -> statement.setLong(1, 
> element))
> .withOtherOptions(...))
>   .apply("more steps", ...);
> ```
> 
> The problem is that this does not seem to parallelize on the spark runner. 
> Only a single worker seem to be doing all the work.
> 
> We have tried to break fusion using a variant of `JdbcIO.Reparallelize()`, 
> however this did not seem to make a difference.
> 
> Our goal is to avoid all data from the query be cached in memory between the 
> read and transform operations. This causes OOM-exceptions. Having a single 
> worker reading the database is okay as long as other workers can process the 
> data as soon as it is read and not having to wait for all the data to be 
> ready.
> 
> Any advice on how we approach this.
> 
> Best Regards
> Thomas Li Fredriksen



JdbcIO parallel read on spark

2021-05-24 Thread Thomas Fredriksen(External)
Hi there,

We are struggling to get the JdbcIO-connector to read a large table on
spark.

In short - we wish to read a large table (several billion rows), transform
then write the transformed data to a new table.

We are aware that `JdbcIO.read()` does not parallelize. In order to solve
this, we attempted to create ranges then generate `limit/offset` queries
and use `JdbcIO.readAll()` instead.

The overall steps look something like this (sanitized for readability):

```
pipeline
  .apply("Read row count", JdbcIo.read()
.withQuery("select count(*) from MYTABLE;")
.withCoder(VarLongCoder.of())
.withOtherOptions(...))
  .apply("Genereate queries", ParDo.of(new DoFn() {...}) //
Outputs table offsets
  .apply("Read results", JdbcIO.readAll()
.withCoder(SchemaCoder.of(...))
.withOutputParallelization(false)
.withQuery("select * from MYTABLE offset ? limit MYLIMIT;")
.withParameterSetter((element, statement) -> statement.setLong(1,
element))
.withOtherOptions(...))
  .apply("more steps", ...);
```

The problem is that this does not seem to parallelize on the spark runner.
Only a single worker seems to be doing all the work.

We have tried to break fusion using a variant of `JdbcIO.Reparallelize()`,
however this did not seem to make a difference.

Our goal is to avoid all data from the query be cached in memory between
the read and transform operations. This causes OOM-exceptions. Having a
single worker reading the database is okay as long as other workers can
process the data as soon as it is read and not having to wait for all the
data to be ready.

Any advice on how we should approach this would be appreciated.

Best Regards
Thomas Li Fredriksen
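
Regarding the fusion-breaking attempt mentioned in this thread: one generic way to force a shuffle between the offset-generating step and the JdbcIO.readAll step is Beam's Reshuffle transform (nominally deprecated but widely used for exactly this purpose). The sketch below only shows where such a reshuffle would sit, with Create and MapElements standing in for the real count query and JDBC read; it is not a claim that it resolves the Spark parallelism issue described above.

```
import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.Reshuffle;
import org.apache.beam.sdk.values.TypeDescriptors;

public class ReshuffleSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("Generate offsets", Create.of(Arrays.asList(0L, 100_000L, 200_000L)))
        // Force a shuffle so the downstream read is not fused with the
        // single-threaded step that generated the offsets.
        .apply("Break fusion", Reshuffle.viaRandomKey())
        // Stand-in for JdbcIO.readAll(); replace with the real read.
        .apply("Read results (stand-in)",
            MapElements.into(TypeDescriptors.strings())
                .via(offset -> "rows starting at " + offset));

    p.run().waitUntilFinish();
  }
}
```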


Spark Structured Streaming Runner Roadmap

2021-05-20 Thread Yu Zhang
Hi Beam Community, 

Is there any roadmap for the Spark Structured Streaming runner to support streaming and the 
Splittable DoFn API, e.g. a specific timeline or release version? 

Thanks, 
Yu

Re: Does SnowflakeIO support spark runner

2021-05-06 Thread Tao Li
Thanks Kyle!

From: Kyle Weaver 
Date: Thursday, May 6, 2021 at 12:19 PM
To: Tao Li 
Cc: "user@beam.apache.org" , Anuj Gandhi 

Subject: Re: Does SnowflakeIO support spark runner

Yeah, I'm pretty sure that documentation is just misleading. All of the options 
from --runner onward are runner-specific and don't have anything to do with 
Snowflake, so they should probably be removed from the doc.

On Thu, May 6, 2021 at 12:06 PM Tao Li 
mailto:t...@zillow.com>> wrote:
Hi @Kyle Weaver<mailto:kcwea...@google.com>

According to this doc<https://beam.apache.org/documentation/io/built-in/snowflake/>: --runner=

From: Kyle Weaver mailto:kcwea...@google.com>>
Reply-To: "user@beam.apache.org<mailto:user@beam.apache.org>" 
mailto:user@beam.apache.org>>
Date: Thursday, May 6, 2021 at 12:01 PM
To: "user@beam.apache.org<mailto:user@beam.apache.org>" 
mailto:user@beam.apache.org>>
Cc: Anuj Gandhi mailto:an...@zillowgroup.com>>
Subject: Re: Does SnowflakeIO support spark runner

As far as I know, it should be supported (Beam's abstract model means IOs 
usually "just work" on all runners). What makes you think it isn't supported?

On Thu, May 6, 2021 at 11:52 AM Tao Li 
mailto:t...@zillow.com>> wrote:
Hi Beam community,

Does SnowflakeIO support spark runner? Seems like only direct runner and 
dataflow runner are supported..

Thanks!


Re: Does SnowflakeIO support spark runner

2021-05-06 Thread Kyle Weaver
Yeah, I'm pretty sure that documentation is just misleading. All of the
options from --runner onward are runner-specific and don't have anything to
do with Snowflake, so they should probably be removed from the doc.
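
To make the point concrete: the runner choice is just another generic pipeline option, independent of which IO the pipeline uses. A minimal sketch, assuming the Spark runner is on the classpath and selected via --runner=SparkRunner on the command line:

```
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class RunnerOptionExample {
  public static void main(String[] args) {
    // e.g. run with: --runner=SparkRunner (plus any runner-specific options)
    PipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().create();
    Pipeline p = Pipeline.create(options);
    // ... SnowflakeIO (or any other IO) transforms go here, unchanged ...
    p.run().waitUntilFinish();
  }
}
```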

On Thu, May 6, 2021 at 12:06 PM Tao Li  wrote:

> Hi @Kyle Weaver 
>
>
>
> According to this doc
> <https://beam.apache.org/documentation/io/built-in/snowflake/>:
> --runner=
>
>
>
> *From: *Kyle Weaver 
> *Reply-To: *"user@beam.apache.org" 
> *Date: *Thursday, May 6, 2021 at 12:01 PM
> *To: *"user@beam.apache.org" 
> *Cc: *Anuj Gandhi 
> *Subject: *Re: Does SnowflakeIO support spark runner
>
>
>
> As far as I know, it should be supported (Beam's abstract model means IOs
> usually "just work" on all runners). What makes you think it isn't
> supported?
>
>
>
> On Thu, May 6, 2021 at 11:52 AM Tao Li  wrote:
>
> Hi Beam community,
>
>
>
> Does SnowflakeIO support spark runner? Seems like only direct runner and
> dataflow runner are supported..
>
>
>
> Thanks!
>
>


Re: Does SnowflakeIO support spark runner

2021-05-06 Thread Tao Li
Hi @Kyle Weaver<mailto:kcwea...@google.com>

According to this 
doc<https://beam.apache.org/documentation/io/built-in/snowflake/>: 
--runner=

From: Kyle Weaver 
Reply-To: "user@beam.apache.org" 
Date: Thursday, May 6, 2021 at 12:01 PM
To: "user@beam.apache.org" 
Cc: Anuj Gandhi 
Subject: Re: Does SnowflakeIO support spark runner

As far as I know, it should be supported (Beam's abstract model means IOs 
usually "just work" on all runners). What makes you think it isn't supported?

On Thu, May 6, 2021 at 11:52 AM Tao Li 
mailto:t...@zillow.com>> wrote:
Hi Beam community,

Does SnowflakeIO support spark runner? Seems like only direct runner and 
dataflow runner are supported..

Thanks!


Re: Does SnowflakeIO support spark runner

2021-05-06 Thread Kyle Weaver
As far as I know, it should be supported (Beam's abstract model means IOs
usually "just work" on all runners). What makes you think it isn't
supported?

On Thu, May 6, 2021 at 11:52 AM Tao Li  wrote:

> Hi Beam community,
>
>
>
> Does SnowflakeIO support spark runner? Seems like only direct runner and
> dataflow runner are supported..
>
>
>
> Thanks!
>


Does SnowflakeIO support spark runner

2021-05-06 Thread Tao Li
Hi Beam community,

Does SnowflakeIO support spark runner? Seems like only direct runner and 
dataflow runner are supported..

Thanks!


Re: [Question] Docker and File Errors with Spark PortableRunner

2021-04-02 Thread Kyle Weaver
Hi Michael,

Your problems in 1. and 2. are related to the artifact staging workflow,
where Beam tries to copy your pipeline’s dependencies to the workers. When
artifacts cannot be fetched because of file system or other issues, the
workers cannot be started successfully. In this case, your pipeline depends
on the pickled main session from estimate_pi.py [1].

In order for artifact staging to work, the job server’s --artifacts-dir
must be accessible by the Spark worker.* Since you start your job server in
a Docker container, /users/mkuchnik/staging is hidden inside that Docker
container’s filesystem, which is not accessible from your network
filesystem. You mentioned in 2. that you tried mounting the directory to
the [worker (?)] containers, but have you tried mounting that directory to
the job server container?

Thanks,
Kyle

* It looks like this is unclear in current documentation, so I will edit it.

[1]
https://github.com/apache/beam/blob/2c619c81082839e054f16efee9311b9f74b6e436/sdks/python/apache_beam/examples/complete/estimate_pi.py#L118


Re: Is there a perf comparison between Beam (on spark) and native Spark?

2021-03-25 Thread Tao Li
Thanks @Alexey Romanenko<mailto:aromanenko@gmail.com> for this info. Do we have a rough idea how Beam (on Spark) compares with native Spark when using TPC-DS or any other benchmark? I am just wondering whether running Beam SQL with the Spark runner will have a similar processing time compared with Spark SQL. Thanks!

From: Alexey Romanenko 
Reply-To: "user@beam.apache.org" 
Date: Tuesday, March 23, 2021 at 12:58 PM
To: "user@beam.apache.org" 
Subject: Re: Is there a perf comparison between Beam (on spark) and native 
Spark?

There is an extension in Beam to support TPC-DS benchmark [1] that basically 
runs TPC-DS SQL queries via Beam SQL. Though, I’m not sure if it runs regularly 
and, IIRC (when I took a look on this last time, maybe I’m mistaken), it 
requires some adjustments to run on any other runners than Dataflow. Also, when 
I tried to run it on SparkRunner many queries failed because of different 
reasons [2].

I believe that if we manage to make it run for most of the queries on any runner, it will be a good addition to the Nexmark benchmark that we have for now, since TPC-DS results can be used to compare with other data processing systems as well.

[1] https://github.com/apache/beam/tree/master/sdks/java/testing/tpcds
[2] https://issues.apache.org/jira/browse/BEAM-9891


On 22 Mar 2021, at 18:00, Tao Li mailto:t...@zillow.com>> 
wrote:

Hi Beam community,

I am wondering if there is a doc comparing the performance of Beam (on Spark) and native Spark for batch processing, for example using the TPC-DS benchmark.

I did find some relevant links like this <https://archive.fosdem.org/2018/schedule/event/nexmark_benchmarking_suite/attachments/slides/2494/export/events/attachments/nexmark_benchmarking_suite/slides/2494/Nexmark_Suite_for_Apache_Beam_(FOSDEM18).pdf> but it's old and it mostly covers the streaming scenarios.

Thanks!



Re: Is there a perf comparison between Beam (on spark) and native Spark?

2021-03-23 Thread Alexey Romanenko
There is an extension in Beam to support TPC-DS benchmark [1] that basically 
runs TPC-DS SQL queries via Beam SQL. Though, I’m not sure if it runs regularly 
and, IIRC (when I took a look on this last time, maybe I’m mistaken), it 
requires some adjustments to run on any other runners than Dataflow. Also, when 
I tried to run it on SparkRunner many queries failed because of different 
reasons [2].

I believe that if we manage to make it run for most of the queries on any runner, it will be a good addition to the Nexmark benchmark that we have for now, since TPC-DS results can be used to compare with other data processing systems as well.

[1] https://github.com/apache/beam/tree/master/sdks/java/testing/tpcds
[2] https://issues.apache.org/jira/browse/BEAM-9891

> On 22 Mar 2021, at 18:00, Tao Li  wrote:
> 
> Hi Beam community,
>  
> I am wondering if there is a doc to compare perf of Beam (on Spark) and 
> native spark for batch processing? For example using TPCDS benmark.
>  
> I did find some relevant links like this 
> <https://archive.fosdem.org/2018/schedule/event/nexmark_benchmarking_suite/attachments/slides/2494/export/events/attachments/nexmark_benchmarking_suite/slides/2494/Nexmark_Suite_for_Apache_Beam_(FOSDEM18).pdf>
>  but it’s old and it mostly covers the streaming scenarios.
>  
> Thanks!



Re: Is there a perf comparison between Beam (on spark) and native Spark?

2021-03-22 Thread Boyuan Zhang
+Kyle Weaver 
Kyle, do you happen to have some information here?

On Mon, Mar 22, 2021 at 10:00 AM Tao Li  wrote:

> Hi Beam community,
>
>
>
> I am wondering if there is a doc to compare perf of Beam (on Spark) and
> native spark for batch processing? For example using TPCDS benmark.
>
>
>
> I did find some relevant links like this
> <https://archive.fosdem.org/2018/schedule/event/nexmark_benchmarking_suite/attachments/slides/2494/export/events/attachments/nexmark_benchmarking_suite/slides/2494/Nexmark_Suite_for_Apache_Beam_(FOSDEM18).pdf>
> but it’s old and it mostly covers the streaming scenarios.
>
>
>
> Thanks!
>


Is there a perf comparison between Beam (on spark) and native Spark?

2021-03-22 Thread Tao Li
Hi Beam community,

I am wondering if there is a doc comparing the performance of Beam (on Spark) and native Spark for batch processing, for example using the TPC-DS benchmark.

I did find some relevant links like 
this<https://archive.fosdem.org/2018/schedule/event/nexmark_benchmarking_suite/attachments/slides/2494/export/events/attachments/nexmark_benchmarking_suite/slides/2494/Nexmark_Suite_for_Apache_Beam_(FOSDEM18).pdf>
 but it’s old and it mostly covers the streaming scenarios.

Thanks!


Re: NotSerializableException in Spark runner

2020-12-01 Thread Ajit Dongre
Yes Alexey, com.walmart.dataplatform.aorta.river.meta.Payload class is 
serializable.

A strange observation is that the overall Spark job is successful, but in the executor logs we continuously observe this exception. Since the code is the same across all nodes, how can the same code work fine on one node and fail on another node with a NotSerializableException?

Regards,
Ajit Dongre

From: Alexey Romanenko 
Reply to: "user@beam.apache.org" 
Date: Tuesday, 1 December 2020 at 11:35 PM
To: "user@beam.apache.org" 
Subject: EXT: Re: NotSerializableException in Spark runner

Could you make sure that the instance of 
com.walmart.dataplatform.aorta.river.meta.Payload, created from a failed 
record, is serializable?


On 1 Dec 2020, at 06:06, Ajit Dongre 
mailto:ajit.don...@walmartlabs.com>> wrote:

Hi all,

I have a pipeline that reads data from Kafka and writes to a file. I am using Beam 2.12 with the Spark runner in Java. While executing it, I am getting the exception below:

org.apache.spark.network.client.ChunkFetchFailureException: Failure while 
fetching StreamChunkId{streamId=1660417857014, chunkIndex=0}: 
java.io.NotSerializableException: 
org.apache.beam.sdk.util.WindowedValue$TimestampedValueInGlobalWindow
Serialization stack:
- object not serializable (class: 
org.apache.beam.sdk.util.WindowedValue$TimestampedValueInGlobalWindow, value: 
TimestampedValueInGlobalWindow{value=com.walmart.dataplatform.aorta.river.meta.Payload@727f8c85,
 timestamp=2020-10-07T09:02:16.340Z, pane=PaneInfo{isFirst=true, timing=EARLY, 
index=0}})
- field (class: scala.Tuple2, name: _2, type: class java.lang.Object)
- object (class scala.Tuple2, 
(Tag:49#bb20b45fd4d95138>,TimestampedValueInGlobalWindow{value=com.walmart.dataplatform.aorta.river.meta.Payload@727f8c85,
 timestamp=2020-10-07T09:02:16.340Z, pane=PaneInfo{isFirst=true, timing=EARLY, 
index=0}}))
at 
org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at 
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at 
org.apache.spark.serializer.SerializationStream.writeAll(Serializer.scala:140)
at 
org.apache.spark.serializer.SerializerManager.dataSerializeWithExplicitClassTag(SerializerManager.scala:193)
at 
org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doGetLocalBytes(BlockManager.scala:608)
at 
org.apache.spark.storage.BlockManager$$anonfun$getLocalBytes$2.apply(BlockManager.scala:583)
at 
org.apache.spark.storage.BlockManager$$anonfun$getLocalBytes$2.apply(BlockManager.scala:583)
at scala.Option.map(Option.scala:146)
at 
org.apache.spark.storage.BlockManager.getLocalBytes(BlockManager.scala:583)
at 
org.apache.spark.storage.BlockManager.getBlockData(BlockManager.scala:377)
at 
org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$1.apply(NettyBlockRpcServer.scala:61)
at 
org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$1.apply(NettyBlockRpcServer.scala:60)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
at 
scala.collection.convert.Wrappers$IteratorWrapper.next(Wrappers.scala:31)
at 
org.apache.spark.network.server.OneForOneStreamManager.getChunk(OneForOneStreamManager.java:92)
at 
org.apache.spark.network.server.TransportRequestHandler.processFetchRequest(TransportRequestHandler.java:137)
at 
org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:109)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:118)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at 
io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at 
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at 
org.apache.spark.network.util.

Re: NotSerializableException in Spark runner

2020-12-01 Thread Alexey Romanenko
Could you make sure that the instance of 
com.walmart.dataplatform.aorta.river.meta.Payload, created from a failed 
record, is serializable?
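
To illustrate what is being asked here: if the failing element type cannot be Java-serialized, one common fix is to make it implement Serializable and, optionally, declare an explicit coder. The class below is a hypothetical stand-in for the user's Payload, not the actual class from this thread:

```
import java.io.Serializable;
import org.apache.beam.sdk.coders.DefaultCoder;
import org.apache.beam.sdk.coders.SerializableCoder;

// Hypothetical stand-in for the Payload element type.
@DefaultCoder(SerializableCoder.class) // tells Beam how to encode the element
public class Payload implements Serializable { // lets Spark Java-serialize it too
  private String id;
  private byte[] body;

  public String getId() { return id; }
  public void setId(String id) { this.id = id; }
  public byte[] getBody() { return body; }
  public void setBody(byte[] body) { this.body = body; }
}
```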

> On 1 Dec 2020, at 06:06, Ajit Dongre  wrote:
> 
> Hi all,
>  
> I have pipeline to read data from kafka & write to file.  I am using Beam 
> 2.12 with spark runner in java.  While executing I am getting below exception 
> :
>  
> org.apache.spark.network.client.ChunkFetchFailureException: Failure while 
> fetching StreamChunkId{streamId=1660417857014, chunkIndex=0}: 
> java.io.NotSerializableException: 
> org.apache.beam.sdk.util.WindowedValue$TimestampedValueInGlobalWindow
> Serialization stack:
> - object not serializable (class: 
> org.apache.beam.sdk.util.WindowedValue$TimestampedValueInGlobalWindow, value: 
> TimestampedValueInGlobalWindow{value=com.walmart.dataplatform.aorta.river.meta.Payload@727f8c85,
>  timestamp=2020-10-07T09:02:16.340Z, pane=PaneInfo{isFirst=true, 
> timing=EARLY, index=0}})
> - field (class: scala.Tuple2, name: _2, type: class java.lang.Object)
> - object (class scala.Tuple2, 
> (Tag:49#bb20b45fd4d95138>,TimestampedValueInGlobalWindow{value=com.walmart.dataplatform.aorta.river.meta.Payload@727f8c85,
>  timestamp=2020-10-07T09:02:16.340Z, pane=PaneInfo{isFirst=true, 
> timing=EARLY, index=0}}))
> at 
> org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
> at 
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
> at 
> org.apache.spark.serializer.SerializationStream.writeAll(Serializer.scala:140)
> at 
> org.apache.spark.serializer.SerializerManager.dataSerializeWithExplicitClassTag(SerializerManager.scala:193)
> at 
> org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doGetLocalBytes(BlockManager.scala:608)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$getLocalBytes$2.apply(BlockManager.scala:583)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$getLocalBytes$2.apply(BlockManager.scala:583)
> at scala.Option.map(Option.scala:146)
> at 
> org.apache.spark.storage.BlockManager.getLocalBytes(BlockManager.scala:583)
> at 
> org.apache.spark.storage.BlockManager.getBlockData(BlockManager.scala:377)
> at 
> org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$1.apply(NettyBlockRpcServer.scala:61)
> at 
> org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$1.apply(NettyBlockRpcServer.scala:60)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
> at 
> scala.collection.convert.Wrappers$IteratorWrapper.next(Wrappers.scala:31)
> at 
> org.apache.spark.network.server.OneForOneStreamManager.getChunk(OneForOneStreamManager.java:92)
> at 
> org.apache.spark.network.server.TransportRequestHandler.processFetchRequest(TransportRequestHandler.java:137)
> at 
> org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:109)
> at 
> org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:118)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
> at 
> io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
> at 
> io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
> at 
> io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
> at 
> org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandler

NotSerializableException in Spark runner

2020-11-30 Thread Ajit Dongre
Hi all,

I have a pipeline that reads data from Kafka and writes to a file. I am using Beam 2.12 with the Spark runner in Java. While executing it, I am getting the exception below:

org.apache.spark.network.client.ChunkFetchFailureException: Failure while 
fetching StreamChunkId{streamId=1660417857014, chunkIndex=0}: 
java.io.NotSerializableException: 
org.apache.beam.sdk.util.WindowedValue$TimestampedValueInGlobalWindow
Serialization stack:
- object not serializable (class: 
org.apache.beam.sdk.util.WindowedValue$TimestampedValueInGlobalWindow, value: 
TimestampedValueInGlobalWindow{value=com.walmart.dataplatform.aorta.river.meta.Payload@727f8c85,
 timestamp=2020-10-07T09:02:16.340Z, pane=PaneInfo{isFirst=true, timing=EARLY, 
index=0}})
- field (class: scala.Tuple2, name: _2, type: class java.lang.Object)
- object (class scala.Tuple2, 
(Tag:49#bb20b45fd4d95138>,TimestampedValueInGlobalWindow{value=com.walmart.dataplatform.aorta.river.meta.Payload@727f8c85,
 timestamp=2020-10-07T09:02:16.340Z, pane=PaneInfo{isFirst=true, timing=EARLY, 
index=0}}))
at 
org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at 
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at 
org.apache.spark.serializer.SerializationStream.writeAll(Serializer.scala:140)
at 
org.apache.spark.serializer.SerializerManager.dataSerializeWithExplicitClassTag(SerializerManager.scala:193)
at 
org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doGetLocalBytes(BlockManager.scala:608)
at 
org.apache.spark.storage.BlockManager$$anonfun$getLocalBytes$2.apply(BlockManager.scala:583)
at 
org.apache.spark.storage.BlockManager$$anonfun$getLocalBytes$2.apply(BlockManager.scala:583)
at scala.Option.map(Option.scala:146)
at 
org.apache.spark.storage.BlockManager.getLocalBytes(BlockManager.scala:583)
at 
org.apache.spark.storage.BlockManager.getBlockData(BlockManager.scala:377)
at 
org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$1.apply(NettyBlockRpcServer.scala:61)
at 
org.apache.spark.network.netty.NettyBlockRpcServer$$anonfun$1.apply(NettyBlockRpcServer.scala:60)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
at 
scala.collection.convert.Wrappers$IteratorWrapper.next(Wrappers.scala:31)
at 
org.apache.spark.network.server.OneForOneStreamManager.getChunk(OneForOneStreamManager.java:92)
at 
org.apache.spark.network.server.TransportRequestHandler.processFetchRequest(TransportRequestHandler.java:137)
at 
org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:109)
at 
org.apache.spark.network.server.TransportChannelHandler.channelRead(TransportChannelHandler.java:118)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at 
io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at 
io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at 
org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at 
io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at 
io.netty.channel.DefaultChannelPipeline.fireChan

Re: is apache beam go sdk supported by spark runner?

2020-11-25 Thread Robert Bradshaw
Yes, it should be for batch (just like for Python). There is ongoing
work to make it work for Streaming as well.

On Sat, Nov 21, 2020 at 2:57 PM Meriem Sara  wrote:
>
> Hello everyone. I am trying to use apache beam with Golang to execute a data 
> processing workflow using apache Spark.  However, I am confused if the go SDK 
> is supported by apache Spark. Could you please provide us wirh more 
> information ?
>
> Thank you


Re: is apache beam go sdk supported by apache Spark runner

2020-11-23 Thread Alexey Romanenko
As far as I can say for the Spark runner, natively it supports only the Java SDK. I'm far from a Go SDK expert, but I think you can run a pipeline written with the Go SDK using the Portable Runner and the Spark runner job server, as is possible for Python SDK pipelines. I'm not sure if it's already officially supported, but I believe the Go SDK people can provide more details on this.

Actually, I went ahead and tried to run it on my side, but with no success so far.

1) Run a Spark Runner Job Server:

$ docker run --net=host apache/beam_spark_job_server:latest
20/11/23 17:50:04 INFO org.apache.beam.runners.jobsubmission.JobServerDriver: 
ArtifactStagingService started on localhost:8098
20/11/23 17:50:04 INFO org.apache.beam.runners.jobsubmission.JobServerDriver: 
Java ExpansionService started on localhost:8097
20/11/23 17:50:04 INFO org.apache.beam.runners.jobsubmission.JobServerDriver: 
JobService started on localhost:8099

2) Run the stringsplit example on master, where I get the following protobuf error:

$ go run sdks/go/examples/stringsplit/stringsplit.go --runner=universal 
--endpoint=localhost:8099

panic: proto: file "v1.proto" is already registered
See 
https://developers.google.com/protocol-buffers/docs/reference/go/faq#namespace-conflict


goroutine 1 [running]:
google.golang.org/protobuf/reflect/protoregistry.glob..func1(0x1d85000, 
0xc00039c700, 0x1d68780, 0xc0003a4430, 0xc00039c700)

/Users/aromanenko/go/src/google.golang.org/protobuf/reflect/protoregistry/registry.go:38
 +0x21f
google.golang.org/protobuf/reflect/protoregistry.(*Files).RegisterFile(0xc80520,
 0x1d87ac0, 0xc00039c700, 0x0, 0x0)

/Users/aromanenko/go/src/google.golang.org/protobuf/reflect/protoregistry/registry.go:111
 +0xb72
google.golang.org/protobuf/internal/filedesc.Builder.Build(0x0, 0x0, 
0xc000230a00, 0x12a, 0x200, 0x10001, 0x0, 0x1d6f3a0, 0xc3c450, 
0x1d79460, ...)

/Users/aromanenko/go/src/google.golang.org/protobuf/internal/filedesc/build.go:113
 +0x1aa
github.com/golang/protobuf/proto.RegisterFile(0x1c60292, 0x8, 0x23708a0, 0xe2, 
0xe2)

/Users/aromanenko/go/src/github.com/golang/protobuf/proto/registry.go:47 +0x147
github.com/apache/beam/sdks/go/pkg/beam/io/pubsubio/v1.init.1()

/Users/aromanenko/go/src/github.com/apache/beam/sdks/go/pkg/beam/io/pubsubio/v1/v1.pb.go:115
 +0x5a
exit status 2

3) 
$ go version
go version go1.15.5 darwin/amd64

I wonder if it's a known issue or it’s something wrong with my environment?


> On 22 Nov 2020, at 00:05, Meriem Sara  wrote:
> 
> Hello everyone. I am trying to use apache beam with Golang to execute a data 
> processing workflow using apache Spark.  However, I am confused if the go SDK 
> is supported by apache Spark. Could you please provide us wirh more 
> information ?
> 
> Thank you



is apache beam go sdk supported by apache Spark runner

2020-11-21 Thread Meriem Sara
Hello everyone. I am trying to use Apache Beam with Golang to execute a data processing workflow using Apache Spark. However, I am confused about whether the Go SDK is supported by Apache Spark. Could you please provide us with more information?

Thank you


is apache beam go sdk supported by spark runner?

2020-11-21 Thread Meriem Sara
Hello everyone. I am trying to use Apache Beam with Golang to execute a data processing workflow using Apache Spark. However, I am confused about whether the Go SDK is supported by Apache Spark. Could you please provide us with more information?

Thank you


Re: Spark Portable Runner + Docker

2020-10-28 Thread Ramesh Mathikumar
Hi Alex -- Please see the details you are looking for.

I am running a sample pipeline and my environment is this.

python "SaiStudy - Apache-Beam-Spark.py" --runner=PortableRunner 
--job_endpoint=192.168.99.102:8099

My Spark is running on a Docker Container and I can see that the JobService is 
running at 8099.

I am getting the following error: grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with: status = StatusCode.UNAVAILABLE details = "failed to connect to all addresses" debug_error_string = "{"created":"@1603539936.53600","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":4090,"referenced_errors":[{"created":"@1603539936.53600","description":"failed to connect to all addresses","file":"src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":394,"grpc_status":14}]}"

When I curl to ip:port, I can see the following error from the docker logs Oct 
24, 2020 11:34:50 AM 
org.apache.beam.vendor.grpc.v1p26p0.io.grpc.netty.NettyServerTransport 
notifyTerminated INFO: Transport failed 
org.apache.beam.vendor.grpc.v1p26p0.io.netty.handler.codec.http2.Http2Exception:
 Unexpected HTTP/1.x request: GET / at 
org.apache.beam.vendor.grpc.v1p26p0.io.netty.handler.codec.http2.Http2Exception.connectionError(Http2Exception.java:103)
 at 
org.apache.beam.vendor.grpc.v1p26p0.io.netty.handler.codec.http2.Http2ConnectionHandler$PrefaceDecoder.readClientPrefaceString(Http2ConnectionHandler.java:302)
 at 
org.apache.beam.vendor.grpc.v1p26p0.io.netty.handler.codec.http2.Http2ConnectionHandler$PrefaceDecoder.decode(Http2ConnectionHandler.java:239)
 at 
org.apache.beam.vendor.grpc.v1p26p0.io.netty.handler.codec.http2.Http2ConnectionHandler.decode(Http2ConnectionHandler.java:438)
 at 
org.apache.beam.vendor.grpc.v1p26p0.io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProt
 ection(ByteToMessageDecoder.java:505) at 
org.apache.beam.vendor.grpc.v1p26p0.io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:444)
 at 
org.apache.beam.vendor.grpc.v1p26p0.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:283)
 at 
org.apache.beam.vendor.grpc.v1p26p0.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374)
 at 
org.apache.beam.vendor.grpc.v1p26p0.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
 at 
org.apache.beam.vendor.grpc.v1p26p0.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352)
 at 
org.apache.beam.vendor.grpc.v1p26p0.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1422)
 at 
org.apache.beam.vendor.grpc.v1p26p0.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:37
 4) at 
org.apache.beam.vendor.grpc.v1p26p0.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360)
 at 
org.apache.beam.vendor.grpc.v1p26p0.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:931)
 at 
org.apache.beam.vendor.grpc.v1p26p0.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
 at 
org.apache.beam.vendor.grpc.v1p26p0.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:700)
 at 
org.apache.beam.vendor.grpc.v1p26p0.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:635)
 at 
org.apache.beam.vendor.grpc.v1p26p0.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:552)
 at 
org.apache.beam.vendor.grpc.v1p26p0.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:514)
 at 
org.apache.beam.vendor.grpc.v1p26p0.io.netty.util.concurrent.SingleThreadEventExecutor$6.run(SingleThreadEventExecutor.java:1044
 ) at 
org.apache.beam.vendor.grpc.v1p26p0.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
 at 
org.apache.beam.vendor.grpc.v1p26p0.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
 at java.lang.Thread.run(Thread.java:748)

Help Please.

On 2020/10/28 15:37:22, Alexey Romanenko  wrote: 
> Hi Ramesh,
> 
> By “+ Docker” do you mean Docker SDK Harness or running a Spark in Docker? 
> For the former I believe it works fine. 
> 
> Could you share more details of what kind of error you are facing? 
> 
> > On 27 Oct 2020, at 21:10, Ramesh Mathikumar  wrote:
> > 
> > Hi Group -- Has anyone got this to work? For me it does not either in the 
> > IDE or in Colab. Whats the community take on this one?
> 
> 


Re: Spark Portable Runner + Docker

2020-10-28 Thread Alexey Romanenko
Hi Ramesh,

By “+ Docker” do you mean Docker SDK Harness or running a Spark in Docker? For 
the former I believe it works fine. 

Could you share more details of what kind of error you are facing? 

> On 27 Oct 2020, at 21:10, Ramesh Mathikumar  wrote:
> 
> Hi Group -- Has anyone got this to work? For me it does not either in the IDE 
> or in Colab. Whats the community take on this one?



Spark Portable Runner + Docker

2020-10-27 Thread Ramesh Mathikumar
Hi Group -- Has anyone got this to work? For me it does not either in the IDE 
or in Colab. Whats the community take on this one?


Re: Beam + Spark Portable Runner

2020-10-27 Thread Kyle Weaver
On the Spark runner webpage [1], we recommend running the job server
container with --net=host so all the job server's ports are exposed to the
host. However, host networking is not available if you're using Windows, so
you would have to expose individual ports instead. The job server uses
ports 8099, 8098, and 8097, so you can add them to your Docker command like
this:

docker run -p 8099:8099 -p 8098:8098 -p 8097:8097
apache/beam_spark_job_server:latest
--spark-master-url=spark://localhost:7077

[1] https://beam.apache.org/documentation/runners/spark/

On Sat, Oct 24, 2020 at 4:12 PM Ramesh Mathikumar 
wrote:

> It's very strange.
>
> I run the same code in Colab and it works fine. Just to ensure it's not
> using a local runner - I closed the Flink connector and then it came back
> as connection refused. When I ran it again, but now with Flink running again
> - it worked as a doddle.
>
>
>
> On 2020/10/24 22:56:18, Ankur Goenka  wrote:
> > Can you try running
> > java -jar
> >
> C:\\Users\\rekharamesh/.apache_beam/cache/jars\\beam-runners-flink-1.8-job-server-2.24.0.jar
> > --flink-master http://localhost:8081 --artifacts-dir
> >
> C:\\Users\\REKHAR~1\\AppData\\Local\\Temp\\beam-temp4r_emy7q\\artifactskmzxyxkl
> > --job-port 57115 --artifact-port 0 --expansion-port 0
> >
> > to see why the job server is failing.
> >
> > On Sat, Oct 24, 2020 at 3:52 PM Ramesh Mathikumar <
> meetr...@googlemail.com>
> > wrote:
> >
> > > Hi Ankur,
> > >
> > > Thanks for the prompt response. I suspected a similar issue to to
> validate
> > > that I ran a local cluster of Flink. My Parameters are are follows.
> > >
> > > options = PipelineOptions([
> > > "--runner=FlinkRunner",
> > > "--flink_version=1.8",
> > > "--flink_master=localhost:8081",
> > > "--environment_type=LOOPBACK"
> > > ])
> > >
> > > And when I run it - it does not even reach the cluster - instead it
> bombs
> > > with a following message.
> > >
> > >
> > > WARNING:root:Make sure that locally built Python SDK docker image has
> > > Python 3.6 interpreter.
> > > ERROR:apache_beam.utils.subprocess_server:Starting job service with
> > > ['java', '-jar',
> > >
> 'C:\\Users\\rekharamesh/.apache_beam/cache/jars\\beam-runners-flink-1.8-job-se
> > > rver-2.24.0.jar', '--flink-master', 'http://localhost:8081',
> > > '--artifacts-dir',
> > >
> 'C:\\Users\\REKHAR~1\\AppData\\Local\\Temp\\beam-temp4r_emy7q\\artifactskmzxyxkl',
> > > '--job-port', '57115', '--artifact-port', '0', '--expansion-port', '0']
> > > ERROR:apache_beam.utils.subprocess_server:Error bringing up service
> > > Traceback (most recent call last):
> > >   File
> > >
> "C:\Users\rekharamesh\AppData\Local\Programs\Python\Python36-32\lib\site-packages\apache_beam\utils\subprocess_server.py",
> > > line 88, in start
> > > 'Service failed to start up with error %s' % self._process.poll())
> > > RuntimeError: Service failed to start up with error 1
> > > Traceback (most recent call last):
> > >   File "SaiStudy - Apache-Beam-Spark.py", line 34, in 
> > > | 'Write results' >> beam.io.WriteToText(outputs_prefix)
> > >   File
> > >
> "C:\Users\rekharamesh\AppData\Local\Programs\Python\Python36-32\lib\site-packages\apache_beam\pipeline.py",
> > > line 555, in __exit__
> > > self.result = self.run()
> > >   File
> > >
> "C:\Users\rekharamesh\AppData\Local\Programs\Python\Python36-32\lib\site-packages\apache_beam\pipeline.py",
> > > line 534, in run
> > > return self.runner.run_pipeline(self, self._options)
> > >   File
> > >
> "C:\Users\rekharamesh\AppData\Local\Programs\Python\Python36-32\lib\site-packages\apache_beam\runners\portability\flink_runner.py",
> > > line 49, in run_pipeline
> > >
> > > return super(FlinkRunner, self).run_pipeline(pipeline, options)
> > >   File
> > >
> "C:\Users\rekharamesh\AppData\Local\Programs\Python\Python36-32\lib\site-packages\apache_beam\runners\portability\portable_runner.py",
> > > line 388, in run_pipe
> > > line
> > > job_service_handle = self.create_job_service(options)
> > >   File
> > >
> "C:\Users\rekharamesh\AppData\Local\Programs\Python\Python36-32\lib\site-packages\apache_beam\runners\portability\portable_runner.py",
> > > line 304, in create_j
> > > ob_service
&
