Re: Integrating ML/DL frameworks with Spark

2018-05-09 Thread Xiangrui Meng
Shivaram: Yes, we can call it "gang scheduling" or "barrier
synchronization". Spark doesn't support it now. The proposal is to add
proper support in Spark's job scheduler, so we can integrate well with
MPI-like frameworks.
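
As a rough, illustrative sketch, a gang-scheduled ("barrier") stage could
look like this from PySpark. The barrier()/BarrierTaskContext names below
are hypothetical here and used only to make the idea concrete:

from pyspark import BarrierTaskContext  # illustrative API for this sketch
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gang-scheduling-sketch").getOrCreate()
sc = spark.sparkContext

def train_partition(iterator):
    # In a gang-scheduled stage, either all tasks start together or the
    # whole stage fails/retries, so MPI-style workers can rendezvous safely.
    ctx = BarrierTaskContext.get()
    peers = [info.address for info in ctx.getTaskInfos()]  # worker addresses
    ctx.barrier()  # global synchronization point across all tasks
    # ... hand this partition to an xgboost/mxnet/MPI worker here ...
    yield (ctx.partitionId(), len(peers))

# 50 partitions -> 50 tasks, scheduled as a single wave or not at all.
result = (sc.parallelize(range(50 * 1000), 50)
            .barrier()
            .mapPartitions(train_partition)
            .collect())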

On Tue, May 8, 2018 at 11:17 AM Nan Zhu  wrote:

> ...how I skipped the last part
>
> On Tue, May 8, 2018 at 11:16 AM, Reynold Xin  wrote:
>
>> Yes, Nan, totally agree. To be on the same page, that's exactly what I
>> wrote, wasn't it?
>>
>> On Tue, May 8, 2018 at 11:14 AM Nan Zhu  wrote:
>>
>>> besides that, one of the things which is needed by multiple frameworks
>>> is to schedule tasks in a single wave
>>>
>>> i.e.
>>>
>>> if a framework like xgboost/mxnet requires 50 parallel workers,
>>> Spark should provide a capability to ensure that either we run all 50
>>> tasks at once, or we quit the complete application/job after some
>>> timeout period
>>>
>>> Best,
>>>
>>> Nan
>>>
>>> On Tue, May 8, 2018 at 11:10 AM, Reynold Xin 
>>> wrote:
>>>
 I think that's what Xiangrui was referring to. Instead of retrying a
 single task, retry the entire stage, and the entire stage of tasks needs to
 be scheduled all at once.


 On Tue, May 8, 2018 at 8:53 AM Shivaram Venkataraman <
 shiva...@eecs.berkeley.edu> wrote:

>
>>
>>>    - Fault tolerance and execution model: Spark assumes fine-grained
>>>    task recovery, i.e. if something fails, only that task is rerun. This
>>>    doesn’t match the execution model of distributed ML/DL frameworks that
>>>    are typically MPI-based, and rerunning a single task would lead to the
>>>    entire system hanging. A whole stage needs to be re-run.
>>>
>> This is not only useful for integrating with 3rd-party frameworks,
>> but also useful for scaling MLlib algorithms. One of my earliest attempts
>> in Spark MLlib was to implement an all-reduce primitive (SPARK-1485,
>> https://issues.apache.org/jira/browse/SPARK-1485). But we ended up
>> with some compromise solutions. With the new execution model, we can set
>> up a hybrid cluster and do all-reduce properly.
>>
>>
> Is there a particular new execution model you are referring to, or do
> we plan to investigate a new execution model? For the MPI-like model, we
> also need gang scheduling (i.e. schedule all tasks at once or none of them),
> and I don't think we have support for that in the scheduler right now.
>
>>
>>> --
>>
>> Xiangrui Meng
>>
>> Software Engineer
>>
>> Databricks Inc. (http://databricks.com)
>>
>
>
>>>
> --

Xiangrui Meng

Software Engineer

Databricks Inc. (http://databricks.com)


Problem with Spark Master shutting down when zookeeper leader is shutdown

2018-05-09 Thread agateaaa
Dear Spark community,

I just wanted to bring up this issue, which was filed for Spark 1.6.1
(https://issues.apache.org/jira/browse/SPARK-15544) but also exists in Spark
2.3.0 (https://issues.apache.org/jira/browse/SPARK-23530).

We have run into this in production: the Spark Master shuts down if the
ZooKeeper leader on another node is shut down during our upgrade procedure.
In our opinion this is a serious issue, and it defeats the purpose of Spark
being Highly Available.
Other software components, such as Kafka, are not affected by a ZooKeeper
leader shutdown.

The problem manifests in an unusual way: it affects not the node that is
being rebooted or upgraded but some other node in the cluster, so it can
go unnoticed unless we are actively monitoring for this on the other nodes
during the upgrade.

(BTW, by upgrade we mean an upgrade of our application software stack, which
might include changes to base operating system packages, not a Spark version
upgrade.)

Can we increase the priority of these two JIRAs, or better still, can
someone please pick this issue up?

Thank you
Ashwin


Revisiting Online serving of Spark models?

2018-05-09 Thread Holden Karau
Hi y'all,

With the renewed interest in ML in Apache Spark, now seems like as good a
time as any to revisit the online serving situation in Spark ML. DB &
others have done some excellent work moving a lot of the necessary
tools into a local linear algebra package that doesn't depend on having a
SparkContext.

There are a few different commercial and non-commercial solutions around
this, but currently our individual transform/predict methods are private, so
those solutions either need to copy or re-implement them (or put themselves
in org.apache.spark) to get access. How would folks feel about adding a new
trait for ML pipeline stages that exposes transformation of single-element
inputs (or local collections), to be optionally implemented by stages which
support this? That way we would have less copy-and-paste code that can get
out of sync with our model training.
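
To make the idea concrete, here is a very rough Python-flavoured sketch of
what an optional single-element transform hook might look like. All names
here are hypothetical and purely illustrative; the actual proposal would be
a trait on the JVM side:

from pyspark.ml.linalg import Vector, Vectors

class SupportsLocalTransform:
    """Hypothetical mixin for stages that can transform one element locally."""

    def transform_one(self, features: Vector):
        # Transform a single input without a SparkContext, reusing the same
        # math as the distributed transform() so serving logic cannot drift
        # from what was used at training time.
        raise NotImplementedError

# A serving tool could then call something like:
#   prediction = model.transform_one(Vectors.dense([0.1, 2.3, 4.5]))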

I think continuing to have online serving grow in different projects is
probably the right path forward (folks have different needs), but I'd love
to see us make it simpler for other projects to build reliable serving
tools.

I realize this may put some of the folks in an awkward position with
their own commercial offerings, but hopefully if we make it easier for
everyone, the commercial vendors can benefit as well.

Cheers,

Holden :)

-- 
Twitter: https://twitter.com/holdenkarau


Re: eager execution and debuggability

2018-05-09 Thread Tim Hunter
The repr() trick is neat when working in a notebook. When working in a
library, I used to use an evaluate(dataframe) -> DataFrame function that
simply forces the materialization of a dataframe. As Reynold mentioned, this
is very convenient when working with a lot of chained UDFs, and it is a
standard trick in lazy environments and languages.
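
For reference, a minimal sketch of such a helper, assuming an active
SparkSession (the name evaluate is just the convention described above):

from pyspark.sql import DataFrame

def evaluate(df: DataFrame) -> DataFrame:
    # Force materialization so any error in the chained transformations
    # surfaces here rather than at a much later action.
    df.cache()   # optional: keep the materialized result around
    df.count()   # triggers execution of the full lineage
    return df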

Tim

On Wed, May 9, 2018 at 3:26 AM, Reynold Xin  wrote:

> Yes, it would be great if possible, but it's non-trivial (it might be
> impossible to do in general; we already have stack traces that point to line
> numbers when an error occurs in UDFs, but clearly that's not sufficient).
> Also, in environments like the REPL it's still more useful to show the error
> as soon as it occurs, rather than showing it potentially 30 lines later.
>
> On Tue, May 8, 2018 at 7:22 PM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> This may be technically impractical, but it would be fantastic if we
>> could make it easier to debug Spark programs without needing to rely on
>> eager execution. Sprinkling .count() and .checkpoint() at various points
>> in my code is still a debugging technique I use, but it always makes me
>> wish Spark could point more directly to the offending transformation when
>> something goes wrong.
>>
>> Is it somehow possible to have each individual operator (is that the
>> correct term?) in a DAG include metadata pointing back to the line of code
>> that generated the operator? That way when an action triggers an error, the
>> failing operation can point to the relevant line of code — even if it’s a
>> transformation — and not just the action on the tail end that triggered the
>> error.
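>>
>> As a generic illustration of the idea in plain Python (not Spark
>> internals; the class and field names are made up), capturing the creation
>> site of an operator could look like:
>>
>> import traceback
>>
>> class TrackedOp:
>>     def __init__(self, name):
>>         self.name = name
>>         # Remember where this operator was created, so a failure surfaced
>>         # by a later action can point back to this line.
>>         caller = traceback.extract_stack()[-2]
>>         self.created_at = f"{caller.filename}:{caller.lineno}"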
>>
>> I don’t know how feasible this is, but addressing it would directly solve
>> the issue of linking failures to the responsible transformation, as opposed
>> to leaving the user to break up a chain of transformations with several
>> debug actions. And this would benefit new and experienced users alike.
>>
>> Nick
>>
>> On Tue, May 8, 2018 at 7:09 PM, Ryan Blue rb...@netflix.com.invalid
>> wrote:
>>
>> I've opened SPARK-24215 to track this.
>>>
>>> On Tue, May 8, 2018 at 3:58 PM, Reynold Xin  wrote:
>>>
 Yup. Sounds great. This is something simple that Spark can do to provide
 huge value to the end users.


 On Tue, May 8, 2018 at 3:53 PM Ryan Blue  wrote:

> Would be great if it is something more turn-key.
>
> We can easily add the __repr__ and _repr_html_ methods and behavior
> to PySpark classes. We could also add a configuration property to determine
> whether the dataset evaluation is eager or not. That would make it turn-key
> for anyone running PySpark in Jupyter.
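>
> A minimal sketch of the notebook side of that idea, assuming Jupyter,
> pandas, and an existing SparkSession (whether it is enabled would be gated
> by the configuration property mentioned above):
>
> from pyspark.sql import DataFrame
>
> def _df_repr_html_(self):
>     # Eagerly render the first rows as HTML so a notebook cell that ends
>     # with a DataFrame triggers execution right away.
>     return self.limit(20).toPandas().to_html()
>
> DataFrame._repr_html_ = _df_repr_html_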
>
> For JVM languages, we could also add a dependency on jvm-repr and do
> the same thing.
>
> rb
> ​
>
> On Tue, May 8, 2018 at 3:47 PM, Reynold Xin 
> wrote:
>
>> s/underestimated/overestimated/
>>
>> On Tue, May 8, 2018 at 3:44 PM Reynold Xin 
>> wrote:
>>
>>> Marco,
>>>
>>> There is understanding how Spark works, and there is finding bugs
>>> early in their own program. One can perfectly understand how Spark works
>>> and still find it valuable to get feedback asap, and that's why we built
>>> eager analysis in the first place.
>>>
>>> Also I'm afraid you've significantly underestimated the level of
>>> technical sophistication of users. In many cases they struggle to get
>>> anything to work, and performance optimization of their programs is
>>> secondary to getting things working. As John Ousterhout says, "the greatest
>>> performance improvement of all is when a system goes from not-working to
>>> working".
>>>
>>> I really like Ryan's approach. Would be great if it is something
>>> more turn-key.
>>>
>>> On Tue, May 8, 2018 at 2:35 PM Marco Gaido 
>>> wrote:
>>>
 I am not sure how useful this is. For students, it is important to
 understand how Spark works. This can be critical in many decisions they
 have to take (whether and what to cache, for instance) in order to have a
 performant Spark application. Eager execution can probably help them get
 something running more easily, but it also lets them use Spark while
 knowing less about how it works, so they are likely to write worse
 applications and to have more problems debugging any kind of issue that
 may occur later (in production), therefore affecting their experience
 with the tool.

 Moreover, as Ryan also mentioned, there are tools/ways to force the
 execution, helping in the debugging phase. So they can