Re: [E] COMMERCIAL BULK: Re: TensorFlow on Spark
Dear Sean,

I do agree with you to a certain extent; that makes sense. Perhaps I am wrong to ask for native integrations instead of depending on over-engineered external solutions, which have their own performance issues and bottlenecks in live production environments. But asking, and stating one's opinion, should be fine, I think.

Just as we went for Koalas in spite of having Pandas UDFs, native SPARK integrations that are lightweight, easy to use, and extend to deep learning frameworks make sense to me.

Regards,
Gourav Sengupta
Re: [E] COMMERCIAL BULK: Re: TensorFlow on Spark
Hi Bitfox,

Yes, distributed training using PyTorch and TensorFlow is really superb, and you are spot on. There is actually no need for solutions like Ray, Petastorm, etc.

But if I want to preprocess data in SPARK and push the results to these deep learning libraries, then what do we do? Creating professional-quality data loaders is a very big job, so these solutions try to occupy that space as an entry point.

Regards,
Gourav Sengupta

On Thu, Feb 24, 2022 at 1:21 PM Bitfox wrote:

> I have been using tensorflow for a long time, it's not hard to implement a
> distributed training job at all, either by model parallelization or data
> parallelization. I don't think there is much need to develop spark to
> support tensorflow jobs. Just my thoughts...
>
> On Thu, Feb 24, 2022 at 4:36 PM Gourav Sengupta wrote:
>
>> Hi,
>>
>> I do not think that there is any reason for using over engineered
>> platforms like Petastorm and Ray, except for certain use cases.
>>
>> What Ray is doing, except for certain use cases, could have been easily
>> done by SPARK, I think, had the open source community got that steer. But
>> maybe I am wrong and someone should be able to explain why the SPARK open
>> source community cannot develop the capabilities which are so natural to
>> almost all use cases of data processing in SPARK where the data gets
>> consumed by deep learning frameworks and we are asked to use Ray or
>> Petastorm?
>>
>> For those of us who are asking what does native integrations means please
>> try to compare delta between release 2.x and 3.x and koalas before 3.2 and
>> after 3.2.
>>
>> I am sure that the SPARK community can push for extending the dataframes
>> from SPARK to deep learning and other frameworks by natively integrating
>> them.
>>
>> Regards,
>> Gourav Sengupta
>>
>> On Wed, Feb 23, 2022 at 4:42 PM Dennis Suhari wrote:
>>
>>> Currently we are trying AnalyticsZoo and Ray
>>>
>>> Sent from my iPhone
>>>
>>> On 23.02.2022 at 04:53, Bitfox wrote:
>>>
>>> tensorflow itself can implement the distributed computing via a
>>> parameter server. Why did you want spark here?
>>>
>>> regards.
>>>
>>> On Wed, Feb 23, 2022 at 11:27 AM Vijayant Kumar wrote:
>>>
>>> Thanks Sean for your response. !!
>>>
>>> Want to add some more background here.
>>>
>>> I am using Spark3.0+ version with Tensorflow 2.0+.
>>> My use case is not for the image data but for the Time-series data where
>>> I am using LSTM and transformers to forecast.
>>>
>>> I evaluated *SparkFlow* and *spark_tensorflow_distributor* libraries, and
>>> there has been no major development recently on those libraries. I faced
>>> the issue of version dependencies on those and had a hard time fixing the
>>> library compatibilities. Hence a couple of below doubts:-
>>>
>>> - Does *Horovod* have any dependencies?
>>> - Any other library which is suitable for my use case.?
>>> - Any example code would really be of great help to understand.
>>>
>>> Thanks,
>>> Vijayant
>>>
>>> *From:* Sean Owen
>>> *Sent:* Wednesday, February 23, 2022 8:40 AM
>>> *To:* Vijayant Kumar
>>> *Cc:* user @spark
>>> *Subject:* [E] COMMERCIAL BULK: Re: TensorFlow on Spark
>>>
>>> Sure, Horovod is commonly used on Spark for this:
>>> https://horovod.readthedocs.io/en/stable/spark_include.html
>>>
>>> On Tue, Feb 22, 2022 at 8:51 PM Vijayant Kumar <
>>> vijayant.ku...@mavenir.com.invalid> wrote:
>>>
>>> Hi All,
>>>
>>> Anyone using Apache spark with TensorFlow for building models. My
>>> requirement is to use TensorFlow distributed model training across the
>>> Spark executors.
>>> Please help me with some resources or some sample code.
>>>
>>> Thanks,
>>> Vijayant
>>> --
>>> This e-mail message may contain confidential or proprietary information
>>> of Mavenir Systems, Inc. or its affiliates and is intended solely for the
>>> use of the intended recipient(s). If you are not the intended recipient of
>>> this message, you are hereby notified that any review, use or distribution
>>> of this information is absolutely prohibited and we request that you delete
>>> all copies in your control and contact us by e-mailing to
>>> secur...@mavenir.com. This message contains the views of its author and
>>> may not necessarily reflect the views of Mavenir Systems, Inc. or its
>>> affiliates, who employ systems to monitor email messages, but make no
>>> representation that such messages are authorized, secure, uncompromised, or
>>> free from computer viruses, malware, or other defects. Thank You
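Since the thread asks for example code, here is a minimal sketch of what the Horovod-on-Spark route for a time-series LSTM might look like. It is an illustration under assumptions, not a tested recipe: it assumes Horovod was built with TensorFlow and Spark support, that the series is small enough to ship to each task as a plain list, and that the installed TensorFlow and Horovod versions are compatible with each other. The helper names (`make_windows`, `train_fn`, `train_on_spark`), the layer sizes, and the learning rate are all hypothetical choices.

```python
from typing import List, Sequence, Tuple


def make_windows(series: Sequence[float], lookback: int) -> Tuple[List[List[float]], List[float]]:
    """Turn a 1-D series into (window, next value) pairs for forecasting."""
    n = len(series) - lookback
    xs = [list(series[i:i + lookback]) for i in range(n)]
    ys = [series[i + lookback] for i in range(n)]
    return xs, ys


def train_fn(xs, ys, lookback, epochs=2):
    """Runs once per Horovod process, each inside a Spark task."""
    # Imports live inside the function so they are resolved on the executors.
    import numpy as np
    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    hvd.init()
    x = np.asarray(xs, dtype="float32").reshape(-1, lookback, 1)
    y = np.asarray(ys, dtype="float32")
    # Each rank trains on its own slice of the windows.
    x, y = x[hvd.rank()::hvd.size()], y[hvd.rank()::hvd.size()]

    model = tf.keras.Sequential([
        tf.keras.layers.LSTM(32, input_shape=(lookback, 1)),
        tf.keras.layers.Dense(1),
    ])
    # Wrapping the optimizer makes Horovod average gradients across ranks.
    opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(0.001 * hvd.size()))
    model.compile(optimizer=opt, loss="mse")
    model.fit(
        x, y, epochs=epochs, batch_size=32, verbose=0,
        # Start every rank from rank 0's initial weights.
        callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],
    )
    return float(model.evaluate(x, y, verbose=0))


def train_on_spark(series, lookback=24, num_proc=2):
    """Call from a driver that already has an active SparkSession."""
    import horovod.spark

    xs, ys = make_windows(series, lookback)
    # horovod.spark.run launches train_fn in num_proc Spark tasks
    # and returns one result per rank.
    return horovod.spark.run(train_fn, args=(xs, ys, lookback), num_proc=num_proc)
```

For data that does not fit in a Python list, the windows would more likely be written out from a Spark DataFrame and read by each rank, which is the gap tools like Petastorm aim to fill.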
Non-Partition based Workload Distribution
We have a Spark program that iterates through a while loop on the same input DataFrame and produces a different result per iteration. The Spark UI shows that the workload is concentrated on a single core of the same worker. Is there any way to distribute the workload to different cores/workers, e.g. per iteration, since the iterations do not depend on each other?

Certainly this type of problem could be easily implemented using threads, e.g. spawn a child thread for each iteration and wait at the end of the loop. But threads apparently don't go beyond the worker boundary. We also thought about using MapReduce, but that won't be straightforward, since mapping deals only with rows, not with whole DataFrames.

Any thoughts/suggestions are highly appreciated.

- To unsubscribe e-mail: user-unsubscr...@spark.apache.org
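One possible approach, sketched here under assumptions: threads started on the driver only submit jobs, while the tasks of each job still run on the executors, and Spark's scheduler accepts jobs submitted concurrently from separate threads of one application. The helper `run_iterations` is a hypothetical name, and the DataFrame, column, and thresholds in the usage comment are placeholders. If the input DataFrame has a single partition, each job would still occupy one core, so repartitioning the input may also be needed.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, TypeVar

P = TypeVar("P")
R = TypeVar("R")


def run_iterations(job: Callable[[P], R], params: Iterable[P], max_workers: int = 4) -> Dict[P, R]:
    """Run job(p) for every p concurrently and collect the results by parameter."""
    params = list(params)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with params.
        results = list(pool.map(job, params))
    return dict(zip(params, results))


# Hypothetical Spark usage (df and the "value" column are placeholders):
#   df.cache()  # avoid recomputing the shared input in every iteration
#   counts = run_iterations(lambda t: df.filter(df["value"] > t).count(), [1, 5, 10])
```

Each call to `job` should end in a Spark action (such as `count` or a write) so that the thread actually submits a job rather than only building a lazy plan.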
[no subject]
Unsubscribe
RE: Consuming from Kafka to delta table - stream or batch mode?
Thank you.

This electronic message may contain information that is Proprietary, Confidential, or legally privileged or protected. It is intended only for the use of the individual(s) and entity named in the message. If you are not an intended recipient of this message, please notify the sender immediately and delete the material from your computer. Do not deliver, distribute or copy this message and do not disclose its contents or take any action in reliance on the information it contains. Thank You.
Re: Consuming from Kafka to delta table - stream or batch mode?
If you want to batch-consume from Kafka, the trigger-once configuration works with Structured Streaming, and you get the benefit of checkpointing.
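A minimal sketch of the trigger-once pattern, assuming PySpark with the Kafka source package and Delta Lake available on the cluster. The function names, the selected columns, and all paths, broker addresses, and topic names are hypothetical. The query reads whatever is available, writes it, records its offsets in the checkpoint, and stops, so the next run resumes where this one ended.

```python
from typing import Dict, Iterable


def kafka_source_options(bootstrap_servers: str, topics: Iterable[str]) -> Dict[str, str]:
    """Build the Kafka source options for one or more topics."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": ",".join(topics),
        # Only used on the very first run; later runs resume from the checkpoint.
        "startingOffsets": "earliest",
    }


def consume_available(spark, options: Dict[str, str], table_path: str, checkpoint_path: str) -> None:
    """Process the messages currently available, then stop (spark is a SparkSession)."""
    stream = spark.readStream.format("kafka").options(**options).load()
    query = (
        stream.selectExpr(
            "CAST(key AS STRING) AS key",
            "CAST(value AS STRING) AS value",
            "topic",
            "timestamp",
        )
        .writeStream.format("delta")  # assumes the Delta Lake package is installed
        .option("checkpointLocation", checkpoint_path)
        .trigger(once=True)
        .start(table_path)
    )
    query.awaitTermination()
```

Run on a schedule (for example as a Kubernetes CronJob), this gives batch-style operation; dropping the trigger line turns the same code into a continuously running stream.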
Re: DataTables 1.10.20 reported vulnerable in spark-core_2.13:3.2.1
What is the vulnerability, and does it affect Spark? What is the remediation? Can you try updating these files and open a pull request if it works?
Consuming from Kafka to delta table - stream or batch mode?
Hello,

Our team is working with Spark (for the first time) and one of the sources we need to consume is Kafka (multiple topics). Are there any practical or operational issues to be aware of when deciding whether to a) consume in batches until all messages are consumed then shut down the spark job, then when new messages show up, start a new job; or b) use spark streaming and run the job continuously? If it makes a difference, the environment is on-premise spark on k8s.

Any experience shared is appreciated.

Thank you,
Mike
Re: [E] COMMERCIAL BULK: Re: TensorFlow on Spark
On the contrary, distributed deep learning is not data parallel. It's dominated by the need to share parameters across workers.

Gourav, I don't understand what you're looking for. Have you looked at Petastorm and Horovod? They _use Spark_, not another platform like Ray. Why recreate this, which has worked for years? What would it matter if it were in the Spark project? I think you're out on a limb there.

One goal of Spark is very much not to build in everything that could exist as a library, and distributed deep learning remains an important but niche use case. Instead it provides the infra for these things, like barrier mode.
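To illustrate the point about sharing parameters, here is a toy simulation in plain Python, not Spark or Horovod code. It assumes a one-parameter model y = w * x and simulated workers that each hold a shard of the data; all names are hypothetical. Even though the data is sharded, every step ends in a synchronization where all workers exchange gradients, which is the communication pattern that independent Spark tasks do not provide by themselves.

```python
from typing import List, Sequence, Tuple

Shard = Sequence[Tuple[float, float]]  # (x, y) pairs held by one simulated worker


def local_gradient(w: float, shard: Shard) -> float:
    """Gradient of the mean squared error of y = w * x on one worker's shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values: List[float]) -> float:
    """Stand-in for a network all-reduce: every worker receives the mean."""
    return sum(values) / len(values)


def train(shards: List[Shard], steps: int = 200, lr: float = 0.05) -> float:
    """Synchronous training: all workers apply the same averaged gradient."""
    w = 0.0
    for _ in range(steps):
        grads = [local_gradient(w, s) for s in shards]  # parallel in a real system
        w -= lr * all_reduce_mean(grads)  # synchronization point on every step
    return w
```

In a real job the all-reduce moves the full set of model gradients over the network on every step, so workers must start together and stay in lockstep.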
DataTables 1.10.20 reported vulnerable in spark-core_2.13:3.2.1
Hi Team,

We are using spark-core_2.13:3.2.1 in our project. In that version, a Blackduck scan reports the JS files below as vulnerable:

dataTables.bootstrap4.1.10.20.min.js
jquery.dataTables..1.10.20.min.js

Please let me know whether this can be fixed in my project, or whether the DataTables version used in spark-core will be updated to a non-vulnerable version.

Regards,
Vinodh Palaniswamy
Re: [E] COMMERCIAL BULK: Re: TensorFlow on Spark
I have been using TensorFlow for a long time; it's not hard to implement a distributed training job at all, either by model parallelism or data parallelism. I don't think there is much need to develop Spark to support TensorFlow jobs. Just my thoughts...
Re: [E] COMMERCIAL BULK: Re: TensorFlow on Spark
Hi,

I do not think that there is any reason for using over-engineered platforms like Petastorm and Ray, except for certain use cases.

What Ray is doing, except for certain use cases, could easily have been done by SPARK, I think, had the open source community got that steer. But maybe I am wrong, and someone should be able to explain why the SPARK open source community cannot develop capabilities that are so natural to almost all use cases of data processing in SPARK, where the data gets consumed by deep learning frameworks, and why we are instead asked to use Ray or Petastorm.

For those asking what native integration means, please compare the delta between releases 2.x and 3.x, and Koalas before and after 3.2.

I am sure that the SPARK community can push for extending SPARK dataframes to deep learning and other frameworks by natively integrating them.

Regards,
Gourav Sengupta