Re: [E] COMMERCIAL BULK: Re: TensorFlow on Spark

2022-02-24 Thread Gourav Sengupta
Dear Sean,

I do agree with you to a certain extent; that makes sense. Perhaps I am wrong
in asking for native integrations rather than depending on over-engineered
external solutions, which have their own performance issues and bottlenecks
in live production environments. But asking and stating one's opinion should
be fine, I think.

Just as we went for Koalas in spite of having Pandas UDFs, SPARK-native
integrations that are lightweight, easy to use, and extend to deep learning
frameworks perhaps make sense, in my view.

Regards,
Gourav Sengupta

On Thu, Feb 24, 2022 at 2:06 PM Sean Owen  wrote:

> On the contrary, distributed deep learning is not data parallel. It's
> dominated by the need to share parameters across workers.
> Gourav, I don't understand what you're looking for. Have you looked at
> Petastorm and Horovod? they _use Spark_, not another platform like Ray. Why
> recreate this which has worked for years? what would it matter if it were
> in the Spark project? I think you're on a limb there.
> One goal of Spark is very much not to build in everything that could exist
> as a library, and distributed deep learning remains an important but niche
> use case. Instead it provides the infra for these things, like barrier mode.
>
> On Thu, Feb 24, 2022 at 7:21 AM Bitfox  wrote:
>
>> I have been using tensorflow for a long time, it's not hard to implement
>> a distributed training job at all, either by model parallelization or data
>> parallelization. I don't think there is much need to develop spark to
>> support tensorflow jobs. Just my thoughts...
>>
>>
>> On Thu, Feb 24, 2022 at 4:36 PM Gourav Sengupta <
>> gourav.sengu...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I do not think that there is any reason for using over engineered
>>> platforms like Petastorm and Ray, except for certain use cases.
>>>
>>> What Ray is doing, except for certain use cases, could have been easily
>>> done by SPARK, I think, had the open source community got that steer. But
>>> maybe I am wrong and someone should be able to explain why the SPARK open
>>> source community cannot develop the capabilities which are so natural to
>>> almost all use cases of data processing in SPARK where the data gets
>>> consumed by deep learning frameworks and we are asked to use Ray or
>>> Petastorm?
>>>
>>> For those of us who are asking what does native integrations means
>>> please try to compare delta between release 2.x and 3.x and koalas before
>>> 3.2 and after 3.2.
>>>
>>> I am sure that the SPARK community can push for extending the dataframes
>>> from SPARK to deep learning and other frameworks by natively integrating
>>> them.
>>>
>>>
>>> Regards,
>>> Gourav Sengupta
>>>
>>>


Re: [E] COMMERCIAL BULK: Re: TensorFlow on Spark

2022-02-24 Thread Gourav Sengupta
Hi Bitfox,

Yes, distributed training using PyTorch and TensorFlow is really superb,
and you are spot on. There is actually absolutely no need for
solutions like Ray, Petastorm, etc.

But if I want to preprocess data in SPARK and push the results to
these deep learning libraries, what do we do? Creating
professional-quality data loaders is a very big job, so these
solutions try to occupy that space as an entry point.


Regards,
Gourav Sengupta



On Thu, Feb 24, 2022 at 1:21 PM Bitfox  wrote:

> I have been using tensorflow for a long time, it's not hard to implement a
> distributed training job at all, either by model parallelization or data
> parallelization. I don't think there is much need to develop spark to
> support tensorflow jobs. Just my thoughts...
>
>
> On Thu, Feb 24, 2022 at 4:36 PM Gourav Sengupta 
> wrote:
>
>> Hi,
>>
>> I do not think that there is any reason for using over engineered
>> platforms like Petastorm and Ray, except for certain use cases.
>>
>> What Ray is doing, except for certain use cases, could have been easily
>> done by SPARK, I think, had the open source community got that steer. But
>> maybe I am wrong and someone should be able to explain why the SPARK open
>> source community cannot develop the capabilities which are so natural to
>> almost all use cases of data processing in SPARK where the data gets
>> consumed by deep learning frameworks and we are asked to use Ray or
>> Petastorm?
>>
>> For those of us who are asking what does native integrations means please
>> try to compare delta between release 2.x and 3.x and koalas before 3.2 and
>> after 3.2.
>>
>> I am sure that the SPARK community can push for extending the dataframes
>> from SPARK to deep learning and other frameworks by natively integrating
>> them.
>>
>>
>> Regards,
>> Gourav Sengupta
>>
>>
>> On Wed, Feb 23, 2022 at 4:42 PM Dennis Suhari 
>> wrote:
>>
>>> Currently we are trying AnalyticsZoo and Ray
>>>
>>>
>>> Sent from my iPhone
>>>
>>> On 23.02.2022 at 04:53, Bitfox wrote:
>>>
>>> 
>>> tensorflow itself can implement the distributed computing via a
>>> parameter server. Why did you want spark here?
>>>
>>> regards.
>>>
>>> On Wed, Feb 23, 2022 at 11:27 AM Vijayant Kumar
>>>  wrote:
>>>
>>>> Thanks Sean for your response. !!
>>>>
>>>> Want to add some more background here.
>>>>
>>>> I am using Spark3.0+ version with Tensorflow 2.0+.
>>>>
>>>> My use case is not for the image data but for the Time-series data
>>>> where I am using LSTM and transformers to forecast.
>>>>
>>>> I evaluated *SparkFlow* and *spark_tensorflow_distributor *libraries, and
>>>> there has been no major development recently on those libraries. I faced
>>>> the issue of version dependencies on those and had a hard time fixing the
>>>> library compatibilities. Hence a couple of below doubts:-
>>>>
>>>>    - Does *Horovod* have any dependencies?
>>>>    - Any other library which is suitable for my use case.?
>>>>    - Any example code would really be of great help to understand.
>>>>
>>>> Thanks,
>>>>
>>>> Vijayant
>>>>
>>>> *From:* Sean Owen
>>>> *Sent:* Wednesday, February 23, 2022 8:40 AM
>>>> *To:* Vijayant Kumar
>>>> *Cc:* user @spark
>>>> *Subject:* [E] COMMERCIAL BULK: Re: TensorFlow on Spark
>>>>
>>>> *Email is from a Free Mail Service (Gmail/Yahoo/Hotmail….) *: Beware
>>>> of Phishing Scams, Report questionable emails to s...@mavenir.com
>>>>
>>>> Sure, Horovod is commonly used on Spark for this:
>>>>
>>>> https://horovod.readthedocs.io/en/stable/spark_include.html
>>>>
>>>> On Tue, Feb 22, 2022 at 8:51 PM Vijayant Kumar <
>>>> vijayant.ku...@mavenir.com.invalid> wrote:
>>>>
>>>> Hi All,
>>>>
>>>> Anyone using Apache spark with TensorFlow for building models. My
>>>> requirement is to use TensorFlow distributed model training across the
>>>> Spark executors.
>>>>
>>>> Please help me with some resources or some sample code.
>>>>
>>>> Thanks,
>>>>
>>>> Vijayant
>>>> --
>>>>
>>>> This e-mail message may contain confidential or proprietary information
>>>> of Mavenir Systems, Inc. or its affiliates and is intended solely for the
>>>> use of the intended recipient(s). If you are not the intended recipient of
>>>> this message, you are hereby notified that any review, use or distribution
>>>> of this information is absolutely prohibited and we request that you delete
>>>> all copies in your control and contact us by e-mailing to
>>>> secur...@mavenir.com. This message contains the views of its author
>>>> and may not necessarily reflect the views of Mavenir Systems, Inc. or its
>>>> affiliates, who employ systems to monitor email messages, but make no
>>>> representation that such messages are authorized, secure, uncompromised, or
>>>> free from computer viruses, malware, or other defects. Thank You

Non-Partition based Workload Distribution

2022-02-24 Thread Artemis User
We have a Spark program that iterates through a while loop on the same 
input DataFrame and produces a different result per iteration. The 
Spark UI shows that the workload is concentrated on a single core of 
the same worker.  Is there any way to distribute the workload across 
different cores/workers, e.g. per iteration, since the iterations are 
independent of each other?


This type of problem could certainly be implemented using 
threads, e.g. spawn a child thread for each iteration and wait at the 
end of the loop.  But threads apparently don't go beyond the worker 
boundary.  We also thought about using MapReduce, but that won't be 
straightforward, since mapping deals only with rows, not with whole 
DataFrames.  Any thoughts/suggestions are highly appreciated.


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



[no subject]

2022-02-24 Thread Luca Borin
Unsubscribe


RE: Consuming from Kafka to delta table - stream or batch mode?

2022-02-24 Thread Michael Williams (SSI)
Thank you.

From: Peyman Mohajerian [mailto:mohaj...@gmail.com]
Sent: Thursday, February 24, 2022 9:00 AM
To: Michael Williams (SSI) 
Cc: user@spark.apache.org
Subject: Re: Consuming from Kafka to delta table - stream or batch mode?

If you want to batch consume from Kafka, trigger-once config would work with 
structured streaming and you get the benefit of the checkpointing.

On Thu, Feb 24, 2022 at 6:07 AM Michael Williams (SSI) 
<michael.willi...@ssigroup.com> wrote:
Hello,

Our team is working with Spark (for the first time) and one of the sources we 
need to consume is Kafka (multiple topics).  Are there any practical or 
operational issues to be aware of when deciding whether to a) consume in 
batches until all messages are consumed then shut down the spark job, then when 
new messages show up, start a new job; or b) use spark streaming and run the 
job continuously?  If it makes a difference, the environment is on-premise 
spark on k8s.

Any experience shared is appreciated.

Thank you,
Mike


This electronic message may contain information that is Proprietary, 
Confidential, or legally privileged or protected. It is intended only for the 
use of the individual(s) and entity named in the message. If you are not an 
intended recipient of this message, please notify the sender immediately and 
delete the material from your computer. Do not deliver, distribute or copy this 
message and do not disclose its contents or take any action in reliance on the 
information it contains. Thank You.




Re: Consuming from Kafka to delta table - stream or batch mode?

2022-02-24 Thread Peyman Mohajerian
If you want to batch consume from Kafka, trigger-once config would work
with structured streaming and you get the benefit of the checkpointing.
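
A minimal PySpark sketch of the trigger-once approach, under assumptions: the Kafka connector (`spark-sql-kafka`) and Delta Lake packages are available to the session, and the function names, broker address, topic names, and paths are all placeholders. Each run consumes whatever is new since the last checkpoint, writes it, and stops, so the job can be launched on a schedule.

```python
def kafka_source_options(bootstrap_servers, topics, starting_offsets="earliest"):
    """Build the option map for Spark's Kafka source."""
    return {
        "kafka.bootstrap.servers": bootstrap_servers,
        "subscribe": ",".join(topics),  # comma-separated topic list
        # Only applies to the first run; later runs resume from the checkpoint.
        "startingOffsets": starting_offsets,
    }


def run_once(spark, bootstrap_servers, topics, table_path, checkpoint_path):
    """Consume the available Kafka messages into a Delta table, then stop."""
    source = (
        spark.readStream.format("kafka")
        .options(**kafka_source_options(bootstrap_servers, topics))
        .load()
    )
    query = (
        source.selectExpr(
            "topic",
            "CAST(key AS STRING) AS key",
            "CAST(value AS STRING) AS value",
            "timestamp",
        )
        .writeStream.format("delta")
        .option("checkpointLocation", checkpoint_path)
        .trigger(once=True)  # process what is available, then terminate
        .start(table_path)
    )
    query.awaitTermination()
    return query
```

A call might look like `run_once(spark, "broker:9092", ["orders", "payments"], "/data/events", "/data/_chk/events")`, where every argument is a made-up value.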

On Thu, Feb 24, 2022 at 6:07 AM Michael Williams (SSI) <
michael.willi...@ssigroup.com> wrote:

> Hello,
>
>
>
> Our team is working with Spark (for the first time) and one of the sources
> we need to consume is Kafka (multiple topics).  Are there any practical or
> operational issues to be aware of when deciding whether to a) consume in
> batches until all messages are consumed then shut down the spark job, then
> when new messages show up, start a new job; or b) use spark streaming and
> run the job continuously?  If it makes a difference, the environment is
> on-premise spark on k8s.
>
>
>
> Any experience shared is appreciated.
>
>
>
> Thank you,
>
> Mike
>
>


Re: DataTables 1.10.20 reported vulnerable in spark-core_2.13:3.2.1

2022-02-24 Thread Sean Owen
What is the vulnerability, and does it affect Spark? What is the remediation?
Can you try updating these and open a pull request if it works?

On Thu, Feb 24, 2022 at 7:28 AM vinodh palanisamy 
wrote:

> Hi Team,
>   We are using spark-core_2.13:3.2.1 in our project. Where in that
> version Blackduck scan reports the below the js files as vulnerable.
>
> dataTables.bootstrap4.1.10.20.min.js
> jquery.dataTables..1.10.20.min.js
>
> Please let me know if this can be fixed in my project or Datatables
> version used in the spark-core would be updated to a non vulnerable version.
>
> Regards
> Vinodh Palaniswamy
>
>


Consuming from Kafka to delta table - stream or batch mode?

2022-02-24 Thread Michael Williams (SSI)
Hello,

Our team is working with Spark (for the first time) and one of the sources we 
need to consume is Kafka (multiple topics).  Are there any practical or 
operational issues to be aware of when deciding whether to a) consume in 
batches until all messages are consumed then shut down the spark job, then when 
new messages show up, start a new job; or b) use spark streaming and run the 
job continuously?  If it makes a difference, the environment is on-premise 
spark on k8s.

Any experience shared is appreciated.

Thank you,
Mike




Re: [E] COMMERCIAL BULK: Re: TensorFlow on Spark

2022-02-24 Thread Sean Owen
On the contrary, distributed deep learning is not data parallel. It's
dominated by the need to share parameters across workers.
Gourav, I don't understand what you're looking for. Have you looked at
Petastorm and Horovod? They _use Spark_, not another platform like Ray. Why
recreate something that has worked for years? What would it matter if it were
in the Spark project? I think you're out on a limb there.
One goal of Spark is very much not to build in everything that could exist
as a library, and distributed deep learning remains an important but niche
use case. Instead, it provides the infra for these things, like barrier mode.
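
To illustrate the point about sharing parameters, here is a toy sketch in plain Python; it is not Spark, TensorFlow, or Horovod code. It simulates synchronous training in which each worker computes a gradient on its own shard, and every worker must then exchange and average gradients before anyone can take the next step. The one-weight linear model, the data, and all function names are made up for illustration.

```python
def shard_gradient(w, shard):
    """Gradient of the mean squared error for y ~ w * x on one worker's shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values):
    """Stand-in for the communication step: every worker receives the mean."""
    return sum(values) / len(values)


def train(shards, steps=50, lr=0.01, w=0.0):
    for _ in range(steps):
        # Each worker computes a gradient locally on its own data...
        grads = [shard_gradient(w, shard) for shard in shards]
        # ...but no worker can update until the gradients have been exchanged.
        w -= lr * all_reduce_mean(grads)
    return w


if __name__ == "__main__":
    data = [(x, 3.0 * x) for x in range(1, 9)]  # the true weight is 3.0
    shards = [data[:4], data[4:]]               # two equal-sized workers
    print(round(train(shards), 3))
```

The exchange happens on every step, which is why this kind of workload is shaped differently from a map over independent partitions.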

On Thu, Feb 24, 2022 at 7:21 AM Bitfox  wrote:

> I have been using tensorflow for a long time, it's not hard to implement a
> distributed training job at all, either by model parallelization or data
> parallelization. I don't think there is much need to develop spark to
> support tensorflow jobs. Just my thoughts...
>
>
> On Thu, Feb 24, 2022 at 4:36 PM Gourav Sengupta 
> wrote:
>
>> Hi,
>>
>> I do not think that there is any reason for using over engineered
>> platforms like Petastorm and Ray, except for certain use cases.
>>
>> What Ray is doing, except for certain use cases, could have been easily
>> done by SPARK, I think, had the open source community got that steer. But
>> maybe I am wrong and someone should be able to explain why the SPARK open
>> source community cannot develop the capabilities which are so natural to
>> almost all use cases of data processing in SPARK where the data gets
>> consumed by deep learning frameworks and we are asked to use Ray or
>> Petastorm?
>>
>> For those of us who are asking what does native integrations means please
>> try to compare delta between release 2.x and 3.x and koalas before 3.2 and
>> after 3.2.
>>
>> I am sure that the SPARK community can push for extending the dataframes
>> from SPARK to deep learning and other frameworks by natively integrating
>> them.
>>
>>
>> Regards,
>> Gourav Sengupta
>>
>>


DataTables 1.10.20 reported vulnerable in spark-core_2.13:3.2.1

2022-02-24 Thread vinodh palanisamy
Hi Team,
  We are using spark-core_2.13:3.2.1 in our project. For that version, a
Black Duck scan reports the JS files below as vulnerable.

dataTables.bootstrap4.1.10.20.min.js
jquery.dataTables..1.10.20.min.js

Please let me know whether this can be fixed in my project, or whether the
DataTables version used in spark-core will be updated to a non-vulnerable
version.

Regards
Vinodh Palaniswamy
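
As a way to confirm which script files a given jar actually bundles before deciding on a remediation, here is a small standard-library sketch. It relies only on the fact that a jar is a zip archive; the function name and the default search string are assumptions made for illustration, and it says nothing about whether the files are exploitable.

```python
import zipfile


def find_bundled_scripts(jar_path, needle="datatables"):
    """Return the names of .js entries in a jar whose name contains needle.

    The match is case-insensitive, and the result is sorted by entry name.
    """
    with zipfile.ZipFile(jar_path) as jar:
        return sorted(
            name
            for name in jar.namelist()
            if name.lower().endswith(".js") and needle in name.lower()
        )
```

For example, `find_bundled_scripts("spark-core_2.13-3.2.1.jar")` would list the matching entries, assuming that file name is where the jar sits locally.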


Re: [E] COMMERCIAL BULK: Re: TensorFlow on Spark

2022-02-24 Thread Bitfox
I have been using TensorFlow for a long time, and it's not hard to implement
a distributed training job at all, with either model parallelism or data
parallelism. I don't think there is much need to extend Spark to
support TensorFlow jobs. Just my thoughts...


On Thu, Feb 24, 2022 at 4:36 PM Gourav Sengupta 
wrote:

> Hi,
>
> I do not think that there is any reason for using over engineered
> platforms like Petastorm and Ray, except for certain use cases.
>
> What Ray is doing, except for certain use cases, could have been easily
> done by SPARK, I think, had the open source community got that steer. But
> maybe I am wrong and someone should be able to explain why the SPARK open
> source community cannot develop the capabilities which are so natural to
> almost all use cases of data processing in SPARK where the data gets
> consumed by deep learning frameworks and we are asked to use Ray or
> Petastorm?
>
> For those of us who are asking what does native integrations means please
> try to compare delta between release 2.x and 3.x and koalas before 3.2 and
> after 3.2.
>
> I am sure that the SPARK community can push for extending the dataframes
> from SPARK to deep learning and other frameworks by natively integrating
> them.
>
>
> Regards,
> Gourav Sengupta
>
>
> On Wed, Feb 23, 2022 at 4:42 PM Dennis Suhari 
> wrote:
>
>> Currently we are trying AnalyticsZoo and Ray
>>
>>
>> Sent from my iPhone
>>
>> On 23.02.2022 at 04:53, Bitfox wrote:
>>
>> 
>> tensorflow itself can implement the distributed computing via a
>> parameter server. Why did you want spark here?
>>
>> regards.
>>
>> On Wed, Feb 23, 2022 at 11:27 AM Vijayant Kumar
>>  wrote:
>>
>>> Thanks Sean for your response. !!
>>>
>>>
>>>
>>> Want to add some more background here.
>>>
>>>
>>>
>>> I am using Spark3.0+ version with Tensorflow 2.0+.
>>>
>>> My use case is not for the image data but for the Time-series data where
>>> I am using LSTM and transformers to forecast.
>>>
>>>
>>>
>>> I evaluated *SparkFlow* and *spark_tensorflow_distributor *libraries, and
>>> there has been no major development recently on those libraries. I faced
>>> the issue of version dependencies on those and had a hard time fixing the
>>> library compatibilities. Hence a couple of below doubts:-
>>>
>>>
>>>
>>>- Does *Horovod* have any dependencies?
>>>- Any other library which is suitable for my use case.?
>>>- Any example code would really be of great help to understand.
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Vijayant
>>>
>>>
>>>
>>> *From:* Sean Owen 
>>> *Sent:* Wednesday, February 23, 2022 8:40 AM
>>> *To:* Vijayant Kumar 
>>> *Cc:* user @spark 
>>> *Subject:* [E] COMMERCIAL BULK: Re: TensorFlow on Spark
>>>
>>>
>>>
>>> *Email is from a Free Mail Service (Gmail/Yahoo/Hotmail….) *: Beware of
>>> Phishing Scams, Report questionable emails to s...@mavenir.com
>>>
>>> Sure, Horovod is commonly used on Spark for this:
>>>
>>> https://horovod.readthedocs.io/en/stable/spark_include.html
>>>
>>>
>>>
>>> On Tue, Feb 22, 2022 at 8:51 PM Vijayant Kumar <
>>> vijayant.ku...@mavenir.com.invalid> wrote:
>>>
>>> Hi All,
>>>
>>>
>>>
>>> Anyone using Apache spark with TensorFlow for building models. My
>>> requirement is to use TensorFlow distributed model training across the
>>> Spark executors.
>>>
>>> Please help me with some resources or some sample code.
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Vijayant

Re: [E] COMMERCIAL BULK: Re: TensorFlow on Spark

2022-02-24 Thread Gourav Sengupta
Hi,

I do not think that there is any reason for using over-engineered platforms
like Petastorm and Ray, except for certain use cases.

What Ray is doing, except for certain use cases, could easily have been
done by SPARK, I think, had the open source community got that steer. But
maybe I am wrong, and someone should be able to explain why the SPARK open
source community cannot develop capabilities that are so natural to
almost all SPARK data processing use cases in which the data gets
consumed by deep learning frameworks, so that we are instead asked to use
Ray or Petastorm.

For those who are asking what native integration means, please
try to compare delta between releases 2.x and 3.x, and Koalas before and
after 3.2.

I am sure that the SPARK community can push for extending the dataframes
from SPARK to deep learning and other frameworks by natively integrating
them.


Regards,
Gourav Sengupta


On Wed, Feb 23, 2022 at 4:42 PM Dennis Suhari 
wrote:

> Currently we are trying AnalyticsZoo and Ray
>
>
> Sent from my iPhone
>
> On 23.02.2022 at 04:53, Bitfox wrote:
>
> 
> tensorflow itself can implement the distributed computing via a
> parameter server. Why did you want spark here?
>
> regards.
>
> On Wed, Feb 23, 2022 at 11:27 AM Vijayant Kumar
>  wrote:
>
>> Thanks Sean for your response. !!
>>
>>
>>
>> Want to add some more background here.
>>
>>
>>
>> I am using Spark3.0+ version with Tensorflow 2.0+.
>>
>> My use case is not for the image data but for the Time-series data where
>> I am using LSTM and transformers to forecast.
>>
>>
>>
>> I evaluated *SparkFlow* and *spark_tensorflow_distributor *libraries, and
>> there has been no major development recently on those libraries. I faced
>> the issue of version dependencies on those and had a hard time fixing the
>> library compatibilities. Hence a couple of below doubts:-
>>
>>
>>
>>- Does *Horovod* have any dependencies?
>>- Any other library which is suitable for my use case.?
>>- Any example code would really be of great help to understand.
>>
>>
>>
>> Thanks,
>>
>> Vijayant
>>
>>
>>
>> *From:* Sean Owen 
>> *Sent:* Wednesday, February 23, 2022 8:40 AM
>> *To:* Vijayant Kumar 
>> *Cc:* user @spark 
>> *Subject:* [E] COMMERCIAL BULK: Re: TensorFlow on Spark
>>
>>
>>
>> *Email is from a Free Mail Service (Gmail/Yahoo/Hotmail….) *: Beware of
>> Phishing Scams, Report questionable emails to s...@mavenir.com
>>
>> Sure, Horovod is commonly used on Spark for this:
>>
>> https://horovod.readthedocs.io/en/stable/spark_include.html
>>
>>
>>
>> On Tue, Feb 22, 2022 at 8:51 PM Vijayant Kumar <
>> vijayant.ku...@mavenir.com.invalid> wrote:
>>
>> Hi All,
>>
>>
>>
>> Anyone using Apache spark with TensorFlow for building models. My
>> requirement is to use TensorFlow distributed model training across the
>> Spark executors.
>>
>> Please help me with some resources or some sample code.
>>
>>
>>
>> Thanks,
>>
>> Vijayant
>>
>