Re: Integrating ML/DL frameworks with Spark

2018-05-07 Thread Reynold Xin
I don't think it's sufficient to have them in YARN (or any other service)
without Spark being aware of them. If Spark is not aware of them, there is
no way to efficiently utilize these accelerators when you run anything that
also requires non-accelerator resources (which is almost 100% of real-world
workloads).
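
For concreteness, here is a rough PySpark sketch of what accelerator awareness
could look like from the user side. The configuration keys and TaskContext API
below are illustrative (they mirror the accelerator-aware scheduling that later
landed in Spark 3.x as SPARK-24615), not a design agreed on in this thread, and
the discovery-script path is a placeholder:

    # Illustrative sketch only: these resource configs and the TaskContext
    # resources() API mirror the accelerator-aware scheduling that later landed
    # in Spark 3.x (SPARK-24615); they are not part of Spark as of this thread,
    # and the discovery-script path is a placeholder.
    from pyspark import TaskContext
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("gpu-aware-sketch")
        # Ask the cluster manager for one GPU per executor ...
        .config("spark.executor.resource.gpu.amount", "1")
        # ... and tell the scheduler that each task needs a whole GPU.
        .config("spark.task.resource.gpu.amount", "1")
        # Script that reports which GPU addresses an executor owns.
        .config("spark.executor.resource.gpu.discoveryScript", "/opt/spark/getGpus.sh")
        .getOrCreate()
    )

    def train_partition(rows):
        # The scheduler hands each task the GPU addresses it may use, so
        # accelerators are neither oversubscribed nor left idle.
        gpus = TaskContext.get().resources()["gpu"].addresses
        # ... pin the DL framework to `gpus` and train on `rows` ...
        yield sum(1 for _ in rows)

    spark.sparkContext.parallelize(range(8), 4).mapPartitions(train_partition).collect()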

For the other two, the point is not to implement all the ML/DL algorithms
in Spark, but to make Spark integrate well with ML/DL frameworks. Otherwise
you will run into the problems I described (very low performance when
exchanging data between Spark and ML/DL frameworks, and hanging issues with
MPI-based programs).


On Mon, May 7, 2018 at 10:05 PM Jörn Franke  wrote:

> Hadoop / YARN 3.1 added GPU scheduling, and 3.2 is planned to add FPGA
> scheduling, so it might be worth making the last point generic so that not
> only the Spark scheduler but all supported schedulers can use GPUs.
>
> For the other two points, I just wonder whether it makes more sense to
> address them in the ML frameworks themselves or in Spark.
>
> On 8. May 2018, at 06:59, Xiangrui Meng  wrote:
>
> Thanks Reynold for summarizing the offline discussion! I added a few
> comments inline. -Xiangrui
>
> On Mon, May 7, 2018 at 5:37 PM Reynold Xin  wrote:
>
>> Hi all,
>>
>> Xiangrui and I were talking with a heavy Apache Spark user last week
>> about their experiences integrating machine learning (and deep learning)
>> frameworks with Spark and some of their pain points. A couple of things
>> were obvious, and I wanted to share our learnings with the list.
>>
>> (1) Most organizations already use Spark for data plumbing and want to be
>> able to run the ML part of their stack on Spark as well (not necessarily by
>> re-implementing all the algorithms, but by integrating various frameworks
>> such as TensorFlow and MXNet with Spark).
>>
>> (2) The integration, however, is painful from a systems perspective:
>>
>>
>>    - Performance: data exchange between Spark and other frameworks is
>>    slow, because UDFs across process boundaries (with native code) are slow.
>>    This works much better now with Pandas UDFs (given that a lot of the ML/DL
>>    frameworks are in Python). However, there may still be some low-hanging
>>    fruit here.
>>
> The Arrow support behind Pandas UDFs can be reused to exchange data with
> other frameworks. One possible performance improvement is to support
> pipelining when supplying data to other frameworks. For example, while
> Spark is pumping data from external sources into TensorFlow, TensorFlow
> starts the computation on GPUs. This would significantly improve speed and
> resource utilization.
>
>>
>>- Fault tolerance and execution model: Spark assumes fine-grained
>>task recovery, i.e. if something fails, only that task is rerun. This
>>doesn’t match the execution model of distributed ML/DL frameworks that are
>>typically MPI-based, and rerunning a single task would lead to the entire
>>system hanging. A whole stage needs to be re-run.
>>
> This is not only useful for integrating with 3rd-party frameworks, but
> also useful for scaling MLlib algorithms. One of my earliest attempts in
> Spark MLlib was to implement an All-Reduce primitive (SPARK-1485). But we
> ended up with some compromise solutions. With the new execution model, we
> can set up a hybrid cluster and do all-reduce properly.
>
>
>>
>>- Accelerator-aware scheduling: The DL frameworks leverage GPUs and
>>sometimes FPGAs as accelerators for speedup, and Spark’s scheduler isn’t
>>aware of those resources, leading to either over-utilizing the 
>> accelerators
>>or under-utilizing the CPUs.
>>
>>
>> The good thing is that none of these seem very difficult to address (and
>> we have already made progress on one of them). Xiangrui has graciously
>> accepted the challenge to come up with solutions and an SPIP for these.
>>
>>
> I will do more homework, exploring existing JIRAs or creating new JIRAs
> for the proposal. We'd like to hear your feedback, and about past efforts
> in those directions if they are not fully captured by our JIRAs.
>
>
>> Xiangrui - please also chime in if I didn’t capture everything.
>>
>>
>> --
>
> Xiangrui Meng
>
> Software Engineer
>
> Databricks Inc. (http://databricks.com)
>
>


Re: Integrating ML/DL frameworks with Spark

2018-05-07 Thread Jörn Franke
Hadoop / YARN 3.1 added GPU scheduling, and 3.2 is planned to add FPGA scheduling,
so it might be worth making the last point generic so that not only the Spark
scheduler but all supported schedulers can use GPUs.

For the other two points, I just wonder whether it makes more sense to address
them in the ML frameworks themselves or in Spark.

> On 8. May 2018, at 06:59, Xiangrui Meng  wrote:
> 
> Thanks Reynold for summarizing the offline discussion! I added a few comments 
> inline. -Xiangrui
> 
>> On Mon, May 7, 2018 at 5:37 PM Reynold Xin  wrote:
>> Hi all,
>> 
>> Xiangrui and I were talking with a heavy Apache Spark user last week about 
>> their experiences integrating machine learning (and deep learning) 
>> frameworks with Spark and some of their pain points. A couple of things were 
>> obvious, and I wanted to share our learnings with the list.
>> 
>> (1) Most organizations already use Spark for data plumbing and want to be 
>> able to run the ML part of their stack on Spark as well (not necessarily by 
>> re-implementing all the algorithms, but by integrating various frameworks 
>> such as TensorFlow and MXNet with Spark).
>> 
>> (2) The integration, however, is painful from a systems perspective:
>> 
>> Performance: data exchange between Spark and other frameworks is slow, 
>> because UDFs across process boundaries (with native code) are slow. This 
>> works much better now with Pandas UDFs (given that a lot of the ML/DL 
>> frameworks are in Python). However, there may still be some low-hanging fruit here.
> The Arrow support behind Pandas UDFs can be reused to exchange data with other 
> frameworks. One possible performance improvement is to support pipelining 
> when supplying data to other frameworks. For example, while Spark is pumping 
> data from external sources into TensorFlow, TensorFlow starts the computation 
> on GPUs. This would significantly improve speed and resource utilization.
>> Fault tolerance and execution model: Spark assumes fine-grained task 
>> recovery, i.e. if something fails, only that task is rerun. This doesn’t 
>> match the execution model of distributed ML/DL frameworks that are typically 
>> MPI-based, and rerunning a single task would lead to the entire system 
>> hanging. A whole stage needs to be re-run.
> This is not only useful for integrating with 3rd-party frameworks, but also 
> useful for scaling MLlib algorithms. One of my earliest attempts in Spark 
> MLlib was to implement an All-Reduce primitive (SPARK-1485). But we ended up 
> with some compromise solutions. With the new execution model, we can set up 
> a hybrid cluster and do all-reduce properly.
>  
>> Accelerator-aware scheduling: The DL frameworks leverage GPUs and sometimes 
>> FPGAs as accelerators for speedup, and Spark’s scheduler isn’t aware of 
>> those resources, leading to either over-utilizing the accelerators or 
>> under-utilizing the CPUs.
>> 
>> The good thing is that none of these seem very difficult to address (and we 
>> have already made progress on one of them). Xiangrui has graciously accepted 
>> the challenge to come up with solutions and an SPIP for these.
>> 
> 
> I will do more homework, exploring existing JIRAs or creating new JIRAs for 
> the proposal. We'd like to hear your feedback, and about past efforts in those 
> directions if they are not fully captured by our JIRAs.
>  
>> Xiangrui - please also chime in if I didn’t capture everything. 
>> 
>> 
> -- 
> Xiangrui Meng
> Software Engineer
> Databricks Inc. 


Re: Integrating ML/DL frameworks with Spark

2018-05-07 Thread Xiangrui Meng
Thanks Reynold for summarizing the offline discussion! I added a few
comments inline. -Xiangrui

On Mon, May 7, 2018 at 5:37 PM Reynold Xin  wrote:

> Hi all,
>
> Xiangrui and I were talking with a heavy Apache Spark user last week about
> their experiences integrating machine learning (and deep learning)
> frameworks with Spark and some of their pain points. A couple of things were
> obvious, and I wanted to share our learnings with the list.
>
> (1) Most organizations already use Spark for data plumbing and want to be
> able to run the ML part of their stack on Spark as well (not necessarily by
> re-implementing all the algorithms, but by integrating various frameworks
> such as TensorFlow and MXNet with Spark).
>
> (2) The integration, however, is painful from a systems perspective:
>
>
>    - Performance: data exchange between Spark and other frameworks is
>    slow, because UDFs across process boundaries (with native code) are slow.
>    This works much better now with Pandas UDFs (given that a lot of the ML/DL
>    frameworks are in Python). However, there may still be some low-hanging
>    fruit here.
>
> The Arrow support behind Pandas UDFs can be reused to exchange data with
other frameworks. One possible performance improvement is to support
pipelining when supplying data to other frameworks. For example, while
Spark is pumping data from external sources into TensorFlow, TensorFlow
starts the computation on GPUs. This would significantly improve speed and
resource utilization.
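
As a minimal sketch of that Arrow-backed exchange path (Spark 2.3+): a scalar
Pandas UDF receives whole Arrow record batches as pandas Series, so a Python
framework can score a batch at a time instead of row by row. Here df is assumed
to be an existing DataFrame with a numeric "feature" column, and
load_model/predict are hypothetical stand-ins for whatever framework is being
integrated:

    import pandas as pd
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    @pandas_udf("double", PandasUDFType.SCALAR)
    def score(feature):
        # `feature` arrives as a pandas Series backed by a whole Arrow record
        # batch, so Python frameworks see columnar batches rather than rows.
        model = load_model()                      # hypothetical model loader
        return pd.Series(model.predict(feature.values))

    scored = df.withColumn("prediction", score(df["feature"]))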

>
>- Fault tolerance and execution model: Spark assumes fine-grained task
>recovery, i.e. if something fails, only that task is rerun. This doesn’t
>match the execution model of distributed ML/DL frameworks that are
>typically MPI-based, and rerunning a single task would lead to the entire
>system hanging. A whole stage needs to be re-run.
>
> This is not only useful for integrating with 3rd-party frameworks, but
also useful for scaling MLlib algorithms. One of my earliest attempts in
Spark MLlib was to implement an All-Reduce primitive (SPARK-1485). But we ended
up with some compromise solutions. With the new execution model, we can set up a
hybrid cluster and do all-reduce properly.
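
A minimal sketch of what such a gang-scheduled stage could look like, using the
barrier execution mode that later grew out of this effort (SPARK-24374, Spark
2.4+); sc is an existing SparkContext, and the framework setup and all-reduce
call are placeholders:

    from pyspark import BarrierTaskContext

    def train(iterator):
        ctx = BarrierTaskContext.get()
        data = list(iterator)
        # All tasks in a barrier stage are launched together; if any task
        # fails, the whole stage is retried, matching the MPI/all-reduce model.
        ctx.barrier()                             # wait until every peer is here
        peers = [info.address for info in ctx.getTaskInfos()]
        # ... hand `peers` and `data` to the ML/DL framework's all-reduce ...
        yield len(data)

    result = sc.parallelize(range(1000), 8).barrier().mapPartitions(train).collect()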


>
>- Accelerator-aware scheduling: The DL frameworks leverage GPUs and
>sometimes FPGAs as accelerators for speedup, and Spark’s scheduler isn’t
>aware of those resources, leading to either over-utilizing the accelerators
>or under-utilizing the CPUs.
>
>
> The good thing is that none of these seem very difficult to address (and
> we have already made progress on one of them). Xiangrui has graciously
> accepted the challenge to come up with solutions and an SPIP for these.
>
>
I will do more homework, exploring existing JIRAs or creating new JIRAs
for the proposal. We'd like to hear your feedback, and about past efforts
in those directions if they are not fully captured by our JIRAs.


> Xiangrui - please also chime in if I didn’t capture everything.
>
>
> --

Xiangrui Meng

Software Engineer

Databricks Inc. (http://databricks.com)


Integrating ML/DL frameworks with Spark

2018-05-07 Thread Reynold Xin
Hi all,

Xiangrui and I were talking with a heavy Apache Spark user last week about
their experiences integrating machine learning (and deep learning)
frameworks with Spark and some of their pain points. A couple of things were
obvious, and I wanted to share our learnings with the list.

(1) Most organizations already use Spark for data plumbing and want to be
able to run the ML part of their stack on Spark as well (not necessarily by
re-implementing all the algorithms, but by integrating various frameworks
such as TensorFlow and MXNet with Spark).

(2) The integration, however, is painful from a systems perspective:


   - Performance: data exchange between Spark and other frameworks is
   slow, because UDFs across process boundaries (with native code) are slow.
   This works much better now with Pandas UDFs (given that a lot of the ML/DL
   frameworks are in Python). However, there may still be some low-hanging
   fruit here.


   - Fault tolerance and execution model: Spark assumes fine-grained task
   recovery, i.e. if something fails, only that task is rerun. This doesn’t
   match the execution model of distributed ML/DL frameworks that are
   typically MPI-based, and rerunning a single task would lead to the entire
   system hanging. A whole stage needs to be re-run.


   - Accelerator-aware scheduling: The DL frameworks leverage GPUs and
   sometimes FPGAs as accelerators for speedup, and Spark’s scheduler isn’t
   aware of those resources, leading to either over-utilizing the accelerators
   or under-utilizing the CPUs.


The good thing is that none of these seem very difficult to address (and we
have already made progress on one of them). Xiangrui has graciously
accepted the challenge to come up with solutions and an SPIP for these.

Xiangrui - please also chime in if I didn’t capture everything.


Re: Spark UI Source Code

2018-05-07 Thread Marcelo Vanzin
On Mon, May 7, 2018 at 1:44 AM, Anshi Shrivastava
 wrote:
> I've found a KVStore wrapper which stores all the metrics in a LevelDB
> store. This KVStore wrapper is available as a Spark dependency but we cannot
> access the metrics directly from Spark since they are all private.

I'm not sure what it is you're trying to do exactly, but there's a
public REST API that exposes all the data Spark keeps about
applications. There's also a programmatic status tracker
(SparkContext.statusTracker) that's easier to use from within the
running Spark app, but has a lot less info.
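
For example, a small PySpark sketch (sc is an existing SparkContext):

    # Coarse job/stage status from inside the running application.
    tracker = sc.statusTracker()
    for job_id in tracker.getActiveJobsIds():
        job = tracker.getJobInfo(job_id)
        print(job_id, job.status, job.stageIds)

    # The data backing the UI is also served over the REST API, e.g.
    #   http://<driver-host>:4040/api/v1/applications/<app-id>/stages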

> Can we use this store to store our own metrics?

No.

> Also can we retrieve these metrics based on timestamp?

Only if the REST API has that feature; I don't remember off the top of my head.


-- 
Marcelo




Re: Design for continuous processing shuffle

2018-05-07 Thread Yuanjian Li
Hi Joseph and devs,

Happy to see the discussion of CP shuffle. As commented in
https://issues.apache.org/jira/browse/SPARK-20928?focusedCommentId=16245556=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16245556,
our team has also done some design and demo work on CP shuffle. This doc describes the work in more detail:

https://docs.google.com/document/d/14cGJ75v9myznywtB35ytEqL9wHy9xfZRv06B6g2tUgI




> On May 5, 2018, at 02:27, Joseph Torres wrote:
> 
> Hi all,
> 
> A few of us have been working on a design for how to do shuffling in 
> continuous processing. Feel free to chip in if you have any comments or 
> questions.
> 
> doc:
> https://docs.google.com/document/d/1IL4kJoKrZWeyIhklKUJqsW-yEN7V7aL05MmM65AYOfE
>  
> 
> 
> continuous processing SPIP: https://issues.apache.org/jira/browse/SPARK-20928 
> 
> 
> 
> Jose



Spark UI Source Code

2018-05-07 Thread Anshi Shrivastava
Hi All,

I've been trying to debug the Spark UI source code to replicate the same
metric-monitoring mechanism in my application.

I've found a KVStore wrapper which stores all the metrics in a LevelDB
store. This KVStore wrapper is available as a Spark dependency, but we
cannot access the metrics directly from Spark since they are all private.
Can we use this store to store our own metrics? Also, can we retrieve these
metrics based on timestamp?

Thanks and Regards,
Anshi
