Re: MLLib - Thoughts about refactoring Updater for LBFGS?

2014-04-07 Thread Debasish Das
By the way...what's the idea...the labeled data set is an RDD which is
cached on all nodes...

Is the BFGS solver maintained on the master, or is each worker supposed to
maintain its own BFGS...
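
For illustration, a minimal sketch of the split being discussed: the L-BFGS solver and its
history live on the driver, and each iteration only ships the current weights out and
aggregates (loss, gradient) back from the cached RDD. The binary logistic loss, the
(label, features) layout, and all names below are illustrative, not the actual PR code;
numerics are simplified.

  import breeze.linalg.DenseVector
  import breeze.optimize.{CachedDiffFunction, DiffFunction, LBFGS}
  import org.apache.spark.rdd.RDD

  // Sketch only: data is (label, features) with labels in {0, 1}, cached on the workers.
  // The optimizer state (the L-BFGS history) stays on the driver.
  def trainLogisticLBFGS(data: RDD[(Double, Array[Double])], numFeatures: Int): DenseVector[Double] = {
    val n = data.count().toDouble

    val costFun = new DiffFunction[DenseVector[Double]] {
      override def calculate(w: DenseVector[Double]): (Double, DenseVector[Double]) = {
        val dim = numFeatures                          // local copies, so the closures below
        val bcW = data.context.broadcast(w.toArray)    // do not drag this DiffFunction along
        // Sum per-example loss and gradient across the cluster.
        val (lossSum, gradSum) = data.aggregate((0.0, new Array[Double](dim)))(
          { case ((loss, grad), (label, x)) =>
            val margin = (0 until dim).map(i => bcW.value(i) * x(i)).sum
            val prob = 1.0 / (1.0 + math.exp(-margin))
            var i = 0
            while (i < dim) { grad(i) += (prob - label) * x(i); i += 1 }
            (loss - (label * math.log(prob) + (1.0 - label) * math.log(1.0 - prob)), grad)
          },
          { case ((l1, g1), (l2, g2)) =>
            var i = 0
            while (i < dim) { g1(i) += g2(i); i += 1 }
            (l1 + l2, g1)
          })
        (lossSum / n, new DenseVector(gradSum.map(_ / n)))
      }
    }

    val lbfgs = new LBFGS[DenseVector[Double]](100, 10)   // maxIter = 100, history size m = 10
    // CachedDiffFunction avoids recomputing the same (loss, gradient) pair during the line search.
    lbfgs.minimize(new CachedDiffFunction(costFun), DenseVector.zeros[Double](numFeatures))
  }

In this arrangement only the broadcast weights and the aggregated (loss, gradient) cross the
network each iteration; nothing about the optimizer itself needs to be serialized.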


On Mon, Apr 7, 2014 at 11:23 PM, Debasish Das wrote:

> I got your checkin...I need to run logistic regression SGD vs BFGS for my
> current use cases, but your next checkin will update the logistic regression
> with LBFGS, right? Are you adding it to the regression package as well?
>
> Thanks.
> Deb
>
>
> On Mon, Apr 7, 2014 at 7:00 PM, DB Tsai  wrote:
>
>> Hi guys,
>>
>> The latest PR uses Breeze's L-BFGS implementation, which was introduced by
>> Xiangrui's sparse input format work in SPARK-1212.
>>
>> https://github.com/apache/spark/pull/353
>>
>> Now, it works with the new sparse framework!
>>
>> Any feedback would be greatly appreciated.
>>
>> Thanks.
>>
>> Sincerely,
>>
>> DB Tsai
>> ---
>> My Blog: https://www.dbtsai.com
>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>
>>
>> On Thu, Apr 3, 2014 at 5:02 PM, DB Tsai  wrote:
>> > -- Forwarded message --
>> > From: David Hall 
>> > Date: Sat, Mar 15, 2014 at 10:02 AM
>> > Subject: Re: MLLib - Thoughts about refactoring Updater for LBFGS?
>> > To: DB Tsai 
>> >
>> >
>> > On Fri, Mar 7, 2014 at 10:56 PM, DB Tsai  wrote:
>> >>
>> >> Hi David,
>> >>
>> >> Please let me know the version of Breeze in which LBFGS can be serialized
>> >> and CachedDiffFunction is built into LBFGS once you finish. I'll
>> >> update the Spark PR from the RISO implementation to the Breeze
>> >> implementation.
>> >
>> >
>> > The current master (0.7-SNAPSHOT) has these problems fixed.
>> >
>> >>
>> >>
>> >> Thanks.
>> >>
>> >> Sincerely,
>> >>
>> >> DB Tsai
>> >> Machine Learning Engineer
>> >> Alpine Data Labs
>> >> --
>> >> Web: http://alpinenow.com/
>> >>
>> >>
>> >> On Thu, Mar 6, 2014 at 4:26 PM, David Hall 
>> wrote:
>> >> > On Thu, Mar 6, 2014 at 4:21 PM, DB Tsai 
>> wrote:
>> >> >
>> >> >> Hi David,
>> >> >>
>> >> >> I can converge to the same result with your breeze LBFGS and Fortran
>> >> >> implementations now. Probably, I made some mistakes when I tried
>> >> >> breeze before. I apologize that I claimed it's not stable.
>> >> >>
>> >> >> See the test case in BreezeLBFGSSuite.scala
>> >> >> https://github.com/AlpineNow/spark/tree/dbtsai-breezeLBFGS
>> >> >>
>> >> >> This is training multinomial logistic regression against iris
>> dataset,
>> >> >> and both optimizers can train the models with 98% training accuracy.
>> >> >>
>> >> >
>> >> > great to hear! There were some bugs in LBFGS about 6 months ago, so
>> >> > depending on the last time you tried it, it might indeed have been
>> >> > bugged.
>> >> >
>> >> >
>> >> >>
>> >> >> There are two issues with using Breeze in Spark:
>> >> >>
>> >> >> 1) When the gradientSum and lossSum are computed distributively in a
>> >> >> custom-defined DiffFunction which is passed into your optimizer,
>> >> >> Spark will complain that the LBFGS class is not serializable. In
>> >> >> BreezeLBFGS.scala, I had to convert the RDD to an array to make it work
>> >> >> locally. It should be easy to fix by just having LBFGS implement
>> >> >> Serializable.
>> >> >>
>> >> >
>> >> > I'm not sure why Spark should be serializing LBFGS? Shouldn't it
>> live on
>> >> > the controller node? Or is this a per-node thing?
>> >> >
>> >> > But no problem to make it serializable.
>> >> >
>> >> >
>> >> >>
>> >> >> 2) Breeze computes redundant gradient and loss. See the following
>> log
>> >> >> from both Fortran and Breeze implementations.
>> >> >>
>> >> >
>> >> > Err, yeah. I should probably have LBFGS do this automatically, but
>> >> > there's
>> >> > a CachedDiffFunction that gets rid of the redundant calculations.
>> >> >
>> >> > -- David
>> >> >
>> >> >
>> >> >>
>> >> >> Thanks.
>> >> >>
>> >> >> Fortran:
>> >> >> Iteration -1: loss 1.3862943611198926, diff 1.0
>> >> >> Iteration 0: loss 1.5846343143210866, diff 0.14307193024217352
>> >> >> Iteration 1: loss 1.1242501524477688, diff 0.29053004039012126
>> >> >> Iteration 2: loss 1.0930151243303563, diff 0.027782962952189336
>> >> >> Iteration 3: loss 1.054036932835569, diff 0.03566113127440601
>> >> >> Iteration 4: loss 0.9907956302751622, diff 0.0507649459571
>> >> >> Iteration 5: loss 0.9184205380342829, diff 0.07304737423337761
>> >> >> Iteration 6: loss 0.8259870936519937, diff 0.10064381175132982
>> >> >> Iteration 7: loss 0.6327447552109574, diff 0.23395293458364716
>> >> >> Iteration 8: loss 0.5534101162436359, diff 0.1253815427665277
>> >> >> Iteration 9: loss 0.4045020086612566, diff 0.26907321376758075
>> >> >> Iteration 10: loss 0.3078824990823728, diff 0.23885980452569627
>> >> >>
>> >> >> Breeze:
>> >> >> Iteration -1: loss 1.3862943611198926, diff 1.0
>> >> >> Mar 6, 2014 3:59:11 PM com.github.fommil.netlib.BLAS 
>> >> >> WARNING: Failed to load implementation from:
>> >> >> com.github.fommil.netlib.NativeSystemBLA

Re: MLLib - Thoughts about refactoring Updater for LBFGS?

2014-04-07 Thread Debasish Das
I got your checkin...I need to run logistic regression SGD vs BFGS for my
current use cases, but your next checkin will update the logistic regression
with LBFGS, right? Are you adding it to the regression package as well?

Thanks.
Deb


On Mon, Apr 7, 2014 at 7:00 PM, DB Tsai  wrote:

> Hi guys,
>
> The latest PR uses Breeze's L-BFGS implementation, which was introduced by
> Xiangrui's sparse input format work in SPARK-1212.
>
> https://github.com/apache/spark/pull/353
>
> Now, it works with the new sparse framework!
>
> Any feedback would be greatly appreciated.
>
> Thanks.
>
> Sincerely,
>
> DB Tsai
> ---
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
>
>
> On Thu, Apr 3, 2014 at 5:02 PM, DB Tsai  wrote:
> > -- Forwarded message --
> > From: David Hall 
> > Date: Sat, Mar 15, 2014 at 10:02 AM
> > Subject: Re: MLLib - Thoughts about refactoring Updater for LBFGS?
> > To: DB Tsai 
> >
> >
> > On Fri, Mar 7, 2014 at 10:56 PM, DB Tsai  wrote:
> >>
> >> Hi David,
> >>
> >> Please let me know the version of Breeze in which LBFGS can be serialized
> >> and CachedDiffFunction is built into LBFGS once you finish. I'll
> >> update the Spark PR from the RISO implementation to the Breeze
> >> implementation.
> >
> >
> > The current master (0.7-SNAPSHOT) has these problems fixed.
> >
> >>
> >>
> >> Thanks.
> >>
> >> Sincerely,
> >>
> >> DB Tsai
> >> Machine Learning Engineer
> >> Alpine Data Labs
> >> --
> >> Web: http://alpinenow.com/
> >>
> >>
> >> On Thu, Mar 6, 2014 at 4:26 PM, David Hall 
> wrote:
> >> > On Thu, Mar 6, 2014 at 4:21 PM, DB Tsai  wrote:
> >> >
> >> >> Hi David,
> >> >>
> >> >> I can converge to the same result with your breeze LBFGS and Fortran
> >> >> implementations now. Probably, I made some mistakes when I tried
> >> >> breeze before. I apologize that I claimed it's not stable.
> >> >>
> >> >> See the test case in BreezeLBFGSSuite.scala
> >> >> https://github.com/AlpineNow/spark/tree/dbtsai-breezeLBFGS
> >> >>
> >> >> This is training multinomial logistic regression against iris
> dataset,
> >> >> and both optimizers can train the models with 98% training accuracy.
> >> >>
> >> >
> >> > great to hear! There were some bugs in LBFGS about 6 months ago, so
> >> > depending on the last time you tried it, it might indeed have been
> >> > bugged.
> >> >
> >> >
> >> >>
> >> >> There are two issues with using Breeze in Spark:
> >> >>
> >> >> 1) When the gradientSum and lossSum are computed distributively in a
> >> >> custom-defined DiffFunction which is passed into your optimizer,
> >> >> Spark will complain that the LBFGS class is not serializable. In
> >> >> BreezeLBFGS.scala, I had to convert the RDD to an array to make it work
> >> >> locally. It should be easy to fix by just having LBFGS implement
> >> >> Serializable.
> >> >>
> >> >
> >> > I'm not sure why Spark should be serializing LBFGS? Shouldn't it live
> on
> >> > the controller node? Or is this a per-node thing?
> >> >
> >> > But no problem to make it serializable.
> >> >
> >> >
> >> >>
> >> >> 2) Breeze computes redundant gradient and loss. See the following log
> >> >> from both Fortran and Breeze implementations.
> >> >>
> >> >
> >> > Err, yeah. I should probably have LBFGS do this automatically, but
> >> > there's
> >> > a CachedDiffFunction that gets rid of the redundant calculations.
> >> >
> >> > -- David
> >> >
> >> >
> >> >>
> >> >> Thanks.
> >> >>
> >> >> Fortran:
> >> >> Iteration -1: loss 1.3862943611198926, diff 1.0
> >> >> Iteration 0: loss 1.5846343143210866, diff 0.14307193024217352
> >> >> Iteration 1: loss 1.1242501524477688, diff 0.29053004039012126
> >> >> Iteration 2: loss 1.0930151243303563, diff 0.027782962952189336
> >> >> Iteration 3: loss 1.054036932835569, diff 0.03566113127440601
> >> >> Iteration 4: loss 0.9907956302751622, diff 0.0507649459571
> >> >> Iteration 5: loss 0.9184205380342829, diff 0.07304737423337761
> >> >> Iteration 6: loss 0.8259870936519937, diff 0.10064381175132982
> >> >> Iteration 7: loss 0.6327447552109574, diff 0.23395293458364716
> >> >> Iteration 8: loss 0.5534101162436359, diff 0.1253815427665277
> >> >> Iteration 9: loss 0.4045020086612566, diff 0.26907321376758075
> >> >> Iteration 10: loss 0.3078824990823728, diff 0.23885980452569627
> >> >>
> >> >> Breeze:
> >> >> Iteration -1: loss 1.3862943611198926, diff 1.0
> >> >> Mar 6, 2014 3:59:11 PM com.github.fommil.netlib.BLAS 
> >> >> WARNING: Failed to load implementation from:
> >> >> com.github.fommil.netlib.NativeSystemBLAS
> >> >> Mar 6, 2014 3:59:11 PM com.github.fommil.netlib.BLAS 
> >> >> WARNING: Failed to load implementation from:
> >> >> com.github.fommil.netlib.NativeRefBLAS
> >> >> Iteration 0: loss 1.3862943611198926, diff 0.0
> >> >> Iteration 1: loss 1.5846343143210866, diff 0.14307193024217352
> >> >> Iteration 2: loss 1.1242501524477688, diff 0.29053004039012126
> >> >> Iteration 3: loss 1.12425015244776

Re: Contributing to Spark

2014-04-07 Thread Matei Zaharia
I’d suggest looking for the issues labeled “Starter” on JIRA. You can find them 
here: 
https://issues.apache.org/jira/browse/SPARK-1438?jql=project%20%3D%20SPARK%20AND%20labels%20%3D%20Starter%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)

Matei

On Apr 7, 2014, at 9:45 PM, Mukesh G  wrote:

> Hi Sujeet,
> 
> Thanks. I went through the website and it looks great. Is there a list of
> items that I can choose from for contribution?
> 
> Thanks
> 
> Mukesh
> 
> 
> On Mon, Apr 7, 2014 at 10:14 PM, Sujeet Varakhedi
> wrote:
> 
>> This is a good place to start:
>> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
>> 
>> Sujeet
>> 
>> 
>> On Mon, Apr 7, 2014 at 9:20 AM, Mukesh G  wrote:
>> 
>>> Hi,
>>> 
>>>   How do I contribute to Spark and its associated projects?
>>> 
>>> Appreciate the help...
>>> 
>>> Thanks
>>> 
>>> Mukesh
>>> 
>> 



Re: Contributing to Spark

2014-04-07 Thread Mukesh G
Hi Sujeet,

Thanks. I went through the website and it looks great. Is there a list of
items that I can choose from for contribution?

Thanks

Mukesh


On Mon, Apr 7, 2014 at 10:14 PM, Sujeet Varakhedi
wrote:

> This is a good place to start:
> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
>
> Sujeet
>
>
> On Mon, Apr 7, 2014 at 9:20 AM, Mukesh G  wrote:
>
> > Hi,
> >
> > How do I contribute to Spark and its associated projects?
> >
> > Appreciate the help...
> >
> > Thanks
> >
> > Mukesh
> >
>


Re: MLLib - Thoughts about refactoring Updater for LBFGS?

2014-04-07 Thread DB Tsai
Hi guys,

The latest PR uses Breeze's L-BFGS implementation, which was introduced by
Xiangrui's sparse input format work in SPARK-1212.

https://github.com/apache/spark/pull/353

Now, it works with the new sparse framework!

Any feedback would be greatly appreciated.

Thanks.

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai
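
For readers following along, a tiny sketch of what the new sparse framework looks like from
the caller's side (assuming the mllib.linalg API from the SPARK-1212 work; exact imports may
vary by version):

  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.regression.LabeledPoint

  // A labeled example with 10 features, non-zeros at indices 1 and 4.
  val sparsePoint = LabeledPoint(1.0, Vectors.sparse(10, Array(1, 4), Array(0.5, 2.0)))

  // The same point as a dense vector, for comparison.
  val densePoint = LabeledPoint(1.0,
    Vectors.dense(0.0, 0.5, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0))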


On Thu, Apr 3, 2014 at 5:02 PM, DB Tsai  wrote:
> -- Forwarded message --
> From: David Hall 
> Date: Sat, Mar 15, 2014 at 10:02 AM
> Subject: Re: MLLib - Thoughts about refactoring Updater for LBFGS?
> To: DB Tsai 
>
>
> On Fri, Mar 7, 2014 at 10:56 PM, DB Tsai  wrote:
>>
>> Hi David,
>>
>> Please let me know the version of Breeze in which LBFGS can be serialized
>> and CachedDiffFunction is built into LBFGS once you finish. I'll
>> update the Spark PR from the RISO implementation to the Breeze
>> implementation.
>
>
> The current master (0.7-SNAPSHOT) has these problems fixed.
>
>>
>>
>> Thanks.
>>
>> Sincerely,
>>
>> DB Tsai
>> Machine Learning Engineer
>> Alpine Data Labs
>> --
>> Web: http://alpinenow.com/
>>
>>
>> On Thu, Mar 6, 2014 at 4:26 PM, David Hall  wrote:
>> > On Thu, Mar 6, 2014 at 4:21 PM, DB Tsai  wrote:
>> >
>> >> Hi David,
>> >>
>> >> I can converge to the same result with your breeze LBFGS and Fortran
>> >> implementations now. Probably, I made some mistakes when I tried
>> >> breeze before. I apologize that I claimed it's not stable.
>> >>
>> >> See the test case in BreezeLBFGSSuite.scala
>> >> https://github.com/AlpineNow/spark/tree/dbtsai-breezeLBFGS
>> >>
>> >> This is training multinomial logistic regression against iris dataset,
>> >> and both optimizers can train the models with 98% training accuracy.
>> >>
>> >
>> > great to hear! There were some bugs in LBFGS about 6 months ago, so
>> > depending on the last time you tried it, it might indeed have been
>> > bugged.
>> >
>> >
>> >>
>> >> There are two issues with using Breeze in Spark:
>> >>
>> >> 1) When the gradientSum and lossSum are computed distributively in a
>> >> custom-defined DiffFunction which is passed into your optimizer,
>> >> Spark will complain that the LBFGS class is not serializable. In
>> >> BreezeLBFGS.scala, I had to convert the RDD to an array to make it work
>> >> locally. It should be easy to fix by just having LBFGS implement
>> >> Serializable.
>> >>
>> >
>> > I'm not sure why Spark should be serializing LBFGS? Shouldn't it live on
>> > the controller node? Or is this a per-node thing?
>> >
>> > But no problem to make it serializable.
>> >
>> >
>> >>
>> >> 2) Breeze computes redundant gradient and loss. See the following log
>> >> from both Fortran and Breeze implementations.
>> >>
>> >
>> > Err, yeah. I should probably have LBFGS do this automatically, but
>> > there's
>> > a CachedDiffFunction that gets rid of the redundant calculations.
>> >
>> > -- David
>> >
>> >
>> >>
>> >> Thanks.
>> >>
>> >> Fortran:
>> >> Iteration -1: loss 1.3862943611198926, diff 1.0
>> >> Iteration 0: loss 1.5846343143210866, diff 0.14307193024217352
>> >> Iteration 1: loss 1.1242501524477688, diff 0.29053004039012126
>> >> Iteration 2: loss 1.0930151243303563, diff 0.027782962952189336
>> >> Iteration 3: loss 1.054036932835569, diff 0.03566113127440601
>> >> Iteration 4: loss 0.9907956302751622, diff 0.0507649459571
>> >> Iteration 5: loss 0.9184205380342829, diff 0.07304737423337761
>> >> Iteration 6: loss 0.8259870936519937, diff 0.10064381175132982
>> >> Iteration 7: loss 0.6327447552109574, diff 0.23395293458364716
>> >> Iteration 8: loss 0.5534101162436359, diff 0.1253815427665277
>> >> Iteration 9: loss 0.4045020086612566, diff 0.26907321376758075
>> >> Iteration 10: loss 0.3078824990823728, diff 0.23885980452569627
>> >>
>> >> Breeze:
>> >> Iteration -1: loss 1.3862943611198926, diff 1.0
>> >> Mar 6, 2014 3:59:11 PM com.github.fommil.netlib.BLAS 
>> >> WARNING: Failed to load implementation from:
>> >> com.github.fommil.netlib.NativeSystemBLAS
>> >> Mar 6, 2014 3:59:11 PM com.github.fommil.netlib.BLAS 
>> >> WARNING: Failed to load implementation from:
>> >> com.github.fommil.netlib.NativeRefBLAS
>> >> Iteration 0: loss 1.3862943611198926, diff 0.0
>> >> Iteration 1: loss 1.5846343143210866, diff 0.14307193024217352
>> >> Iteration 2: loss 1.1242501524477688, diff 0.29053004039012126
>> >> Iteration 3: loss 1.1242501524477688, diff 0.0
>> >> Iteration 4: loss 1.1242501524477688, diff 0.0
>> >> Iteration 5: loss 1.0930151243303563, diff 0.027782962952189336
>> >> Iteration 6: loss 1.0930151243303563, diff 0.0
>> >> Iteration 7: loss 1.0930151243303563, diff 0.0
>> >> Iteration 8: loss 1.054036932835569, diff 0.03566113127440601
>> >> Iteration 9: loss 1.054036932835569, diff 0.0
>> >> Iteration 10: loss 1.054036932835569, diff 0.0
>> >> Iteration 11: loss 0.9907956302751622, diff 0.0507649459571
>> >> Iteration 12: loss 0.9907956302751622, diff 0.0
>> >> I

Re: Spark Streaming and Flume Avro RPC Servers

2014-04-07 Thread Christophe Clapp
Cool. I'll look at making the code change in FlumeUtils and generating a
pull request.

As for the use case, the volume of messages we have is currently about
30 MB per second, which may grow beyond what a 1 Gbit network adapter can
handle.

- Christophe
On Apr 7, 2014 1:51 PM, "Michael Ernest"  wrote:

> I don't see why not. If one were doing something similar with straight
> Flume, you'd start an agent on each node on which you want to receive Avro/RPC
> events. In the absence of clearer insight into your use case, I'm puzzled
> just a little as to why it's necessary for each Worker to be its own receiver,
> but there's no real objection or concern to fuel the puzzlement, just
> curiosity.
>
>
> On Mon, Apr 7, 2014 at 4:16 PM, Christophe Clapp
> wrote:
>
> > Could it be as simple as just changing FlumeUtils to accept a list of
> > host/port number pairs to start the RPC servers on?
> >
> >
> >
> > On 4/7/14, 12:58 PM, Christophe Clapp wrote:
> >
> >> Based on the source code here:
> >> https://github.com/apache/spark/blob/master/external/
> >> flume/src/main/scala/org/apache/spark/streaming/flume/FlumeUtils.scala
> >>
> >> It looks like in its current version, FlumeUtils does not support
> >> starting an Avro RPC server on more than one worker.
> >>
> >> - Christophe
> >>
> >> On 4/7/14, 12:23 PM, Michael Ernest wrote:
> >>
> >>> You can configure your sinks to write to one or more Avro sources in a
> >>> load-balanced configuration.
> >>>
> >>> https://flume.apache.org/FlumeUserGuide.html#flume-sink-processors
> >>>
> >>> mfe
> >>>
> >>>
> >>> On Mon, Apr 7, 2014 at 3:19 PM, Christophe Clapp
> >>> wrote:
> >>>
> >>>  Hi,
> 
>   From my testing of Spark Streaming with Flume, it seems that there's
>  only
>  one of the Spark worker nodes that runs a Flume Avro RPC server to
>  receive
>  messages at any given time, as opposed to every Spark worker running
> an
>  Avro RPC server to receive messages. Is this the case? Our use-case
>  would
>  benefit from balancing the load across Workers because of our volume
> of
>  messages. We would be using a load balancer in front of the Spark
>  workers
>  running the Avro RPC servers, essentially round-robinning the messages
>  across all of them.
> 
>  If this is something that is currently not supported, I'd be
> interested
>  in
>  contributing to the code to make it happen.
> 
>  - Christophe
> 
> 
> >>>
> >>>
> >>
> >
>
>
> --
> Michael Ernest
> Sr. Solutions Consultant
> West Coast
>


Re: Spark Streaming and Flume Avro RPC Servers

2014-04-07 Thread Michael Ernest
I don't see why not. If one were doing something similar with straight
Flume, you'd start an agent on each node on which you want to receive Avro/RPC
events. In the absence of clearer insight into your use case, I'm puzzled
just a little as to why it's necessary for each Worker to be its own receiver,
but there's no real objection or concern to fuel the puzzlement, just
curiosity.


On Mon, Apr 7, 2014 at 4:16 PM, Christophe Clapp
wrote:

> Could it be as simple as just changing FlumeUtils to accept a list of
> host/port number pairs to start the RPC servers on?
>
>
>
> On 4/7/14, 12:58 PM, Christophe Clapp wrote:
>
>> Based on the source code here:
>> https://github.com/apache/spark/blob/master/external/
>> flume/src/main/scala/org/apache/spark/streaming/flume/FlumeUtils.scala
>>
>> It looks like in its current version, FlumeUtils does not support
>> starting an Avro RPC server on more than one worker.
>>
>> - Christophe
>>
>> On 4/7/14, 12:23 PM, Michael Ernest wrote:
>>
>>> You can configure your sinks to write to one or more Avro sources in a
>>> load-balanced configuration.
>>>
>>> https://flume.apache.org/FlumeUserGuide.html#flume-sink-processors
>>>
>>> mfe
>>>
>>>
>>> On Mon, Apr 7, 2014 at 3:19 PM, Christophe Clapp
>>> wrote:
>>>
>>>  Hi,

  From my testing of Spark Streaming with Flume, it seems that there's
 only
 one of the Spark worker nodes that runs a Flume Avro RPC server to
 receive
 messages at any given time, as opposed to every Spark worker running an
 Avro RPC server to receive messages. Is this the case? Our use-case
 would
 benefit from balancing the load across Workers because of our volume of
 messages. We would be using a load balancer in front of the Spark
 workers
 running the Avro RPC servers, essentially round-robinning the messages
 across all of them.

 If this is something that is currently not supported, I'd be interested
 in
 contributing to the code to make it happen.

 - Christophe


>>>
>>>
>>
>


-- 
Michael Ernest
Sr. Solutions Consultant
West Coast


Re: Spark Streaming and Flume Avro RPC Servers

2014-04-07 Thread Christophe Clapp
Could it be as simple as just changing FlumeUtils to accept a list of 
host/port number pairs to start the RPC servers on?
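
Something roughly along those lines can also be sketched today without changing FlumeUtils:
create one receiver stream per (host, port) pair with the existing single-endpoint call and
union them. The helper and the endpoints parameter below are hypothetical, not the proposed
API, and where each receiver actually runs is still up to the scheduler.

  import org.apache.spark.streaming.StreamingContext
  import org.apache.spark.streaming.dstream.DStream
  import org.apache.spark.streaming.flume.{FlumeUtils, SparkFlumeEvent}

  // Hypothetical helper: one receiver per endpoint, unioned into a single DStream.
  def createFlumeStreams(ssc: StreamingContext,
                         endpoints: Seq[(String, Int)]): DStream[SparkFlumeEvent] = {
    val streams = endpoints.map { case (host, port) =>
      FlumeUtils.createStream(ssc, host, port)
    }
    ssc.union(streams)
  }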



On 4/7/14, 12:58 PM, Christophe Clapp wrote:

Based on the source code here:
https://github.com/apache/spark/blob/master/external/flume/src/main/scala/org/apache/spark/streaming/flume/FlumeUtils.scala 



It looks like in its current version, FlumeUtils does not support 
starting an Avro RPC server on more than one worker.


- Christophe

On 4/7/14, 12:23 PM, Michael Ernest wrote:

You can configure your sinks to write to one or more Avro sources in a
load-balanced configuration.

https://flume.apache.org/FlumeUserGuide.html#flume-sink-processors

mfe


On Mon, Apr 7, 2014 at 3:19 PM, Christophe Clapp
wrote:


Hi,

 From my testing of Spark Streaming with Flume, it seems that 
there's only
one of the Spark worker nodes that runs a Flume Avro RPC server to 
receive

messages at any given time, as opposed to every Spark worker running an
Avro RPC server to receive messages. Is this the case? Our use-case 
would

benefit from balancing the load across Workers because of our volume of
messages. We would be using a load balancer in front of the Spark 
workers

running the Avro RPC servers, essentially round-robinning the messages
across all of them.

If this is something that is currently not supported, I'd be 
interested in

contributing to the code to make it happen.

- Christophe










Re: Spark Streaming and Flume Avro RPC Servers

2014-04-07 Thread Christophe Clapp

Based on the source code here:
https://github.com/apache/spark/blob/master/external/flume/src/main/scala/org/apache/spark/streaming/flume/FlumeUtils.scala

It looks like in its current version, FlumeUtils does not support 
starting an Avro RPC server on more than one worker.


- Christophe

On 4/7/14, 12:23 PM, Michael Ernest wrote:

You can configure your sinks to write to one or more Avro sources in a
load-balanced configuration.

https://flume.apache.org/FlumeUserGuide.html#flume-sink-processors

mfe


On Mon, Apr 7, 2014 at 3:19 PM, Christophe Clapp
wrote:


Hi,

 From my testing of Spark Streaming with Flume, it seems that there's only
one of the Spark worker nodes that runs a Flume Avro RPC server to receive
messages at any given time, as opposed to every Spark worker running an
Avro RPC server to receive messages. Is this the case? Our use-case would
benefit from balancing the load across Workers because of our volume of
messages. We would be using a load balancer in front of the Spark workers
running the Avro RPC servers, essentially round-robinning the messages
across all of them.

If this is something that is currently not supported, I'd be interested in
contributing to the code to make it happen.

- Christophe








Re: Spark Streaming and Flume Avro RPC Servers

2014-04-07 Thread Christophe Clapp
Right, but at least in my case, no avro RPC server was started on any of
the spark worker nodes except for one. I don't know if that's just some
configuration issue with my setup or if it's expected behavior. I would
need spark to start avro RPC servers on every worker rather than just one.

- Christophe
On Apr 7, 2014 12:24 PM, "Michael Ernest"  wrote:

> You can configure your sinks to write to one or more Avro sources in a
> load-balanced configuration.
>
> https://flume.apache.org/FlumeUserGuide.html#flume-sink-processors
>
> mfe
>
>
> On Mon, Apr 7, 2014 at 3:19 PM, Christophe Clapp
> wrote:
>
> > Hi,
> >
> > From my testing of Spark Streaming with Flume, it seems that there's only
> > one of the Spark worker nodes that runs a Flume Avro RPC server to
> receive
> > messages at any given time, as opposed to every Spark worker running an
> > Avro RPC server to receive messages. Is this the case? Our use-case would
> > benefit from balancing the load across Workers because of our volume of
> > messages. We would be using a load balancer in front of the Spark workers
> > running the Avro RPC servers, essentially round-robinning the messages
> > across all of them.
> >
> > If this is something that is currently not supported, I'd be interested
> in
> > contributing to the code to make it happen.
> >
> > - Christophe
> >
>
>
>
> --
> Michael Ernest
> Sr. Solutions Consultant
> West Coast
>


Re: Spark Streaming and Flume Avro RPC Servers

2014-04-07 Thread Michael Ernest
You can configure your sinks to write to one or more Avro sources in a
load-balanced configuration.

https://flume.apache.org/FlumeUserGuide.html#flume-sink-processors

mfe


On Mon, Apr 7, 2014 at 3:19 PM, Christophe Clapp
wrote:

> Hi,
>
> From my testing of Spark Streaming with Flume, it seems that there's only
> one of the Spark worker nodes that runs a Flume Avro RPC server to receive
> messages at any given time, as opposed to every Spark worker running an
> Avro RPC server to receive messages. Is this the case? Our use-case would
> benefit from balancing the load across Workers because of our volume of
> messages. We would be using a load balancer in front of the Spark workers
> running the Avro RPC servers, essentially round-robinning the messages
> across all of them.
>
> If this is something that is currently not supported, I'd be interested in
> contributing to the code to make it happen.
>
> - Christophe
>



-- 
Michael Ernest
Sr. Solutions Consultant
West Coast


Spark Streaming and Flume Avro RPC Servers

2014-04-07 Thread Christophe Clapp

Hi,

From my testing of Spark Streaming with Flume, it seems that there's 
only one of the Spark worker nodes that runs a Flume Avro RPC server to 
receive messages at any given time, as opposed to every Spark worker 
running an Avro RPC server to receive messages. Is this the case? Our 
use-case would benefit from balancing the load across Workers because of 
our volume of messages. We would be using a load balancer in front of 
the Spark workers running the Avro RPC servers, essentially 
round-robinning the messages across all of them.


If this is something that is currently not supported, I'd be interested 
in contributing to the code to make it happen.


- Christophe


Re: ALS array index out of bound with 50 factors

2014-04-07 Thread Xiangrui Meng
Hi Deb,

It would be helpful if you can attach the logs. It is strange to see
that you can make 4 iterations but not 10.

Xiangrui

On Mon, Apr 7, 2014 at 10:36 AM, Debasish Das  wrote:
> I am using master...
>
> No negative indexes...
>
> If I run with 4 iterations it runs fine and I can generate factors...
>
> With 10 iterations run fails with array index out of bound...
>
> 25m users and 3m products are within int limits
>
> Does it help if I can point the logs for both the runs to you ?
>
> I will debug it further today...
>  On Apr 7, 2014 9:54 AM, "Xiangrui Meng"  wrote:
>
>> Hi Deb,
>>
>> This thread is for the out-of-bound error you described. I don't think
>> the number of iterations has any effect here. My questions were:
>>
>> 1) Are you using the master branch or a particular commit?
>>
>> 2) Do you have negative or out-of-integer-range user or product ids?
>> Try to print out the max/min value of user/product ids.
>>
>> Best,
>> Xiangrui
>>
>> On Sun, Apr 6, 2014 at 11:01 PM, Debasish Das 
>> wrote:
>> > Hi Xiangrui,
>> >
>> > With 4 ALS iterations it runs fine...If I run 10 I am failing...I
>> believe I
>> > have to cut the lineage chain and call checkpoint...Trying to follow the
>> > other email chain on checkpointing...
>> >
>> > Thanks.
>> > Deb
>> >
>> >
>> > On Sun, Apr 6, 2014 at 9:08 PM, Xiangrui Meng  wrote:
>> >
>> >> Hi Deb,
>> >>
>> >> Are you using the master branch or a particular commit? Do you have
>> >> negative or out-of-integer-range user or product ids? There is an
>> >> issue with ALS' partitioning
>> >> (https://spark-project.atlassian.net/browse/SPARK-1281), but I'm not
>> >> sure whether that is the reason. Could you try to see whether you can
>> >> reproduce the error on a public data set, e.g., movielens? Thanks!
>> >>
>> >> Best,
>> >> Xiangrui
>> >>
>> >> On Sat, Apr 5, 2014 at 10:53 PM, Debasish Das > >
>> >> wrote:
>> >> > Hi,
>> >> >
>> >> > I deployed apache/spark master today and recently there were many ALS
>> >> > related checkins and enhancements..
>> >> >
>> >> > I am running ALS with explicit feedback and I remember most
>> enhancements
>> >> > were related to implicit feedback...
>> >> >
>> >> > With 25 factors my runs were successful but with 50 factors I am
>> getting
>> >> > array index out of bound...
>> >> >
>> >> > Note that I was hitting gc errors before with an older version of
>> spark
>> >> but
>> >> > it seems like the sparse matrix partitioning scheme has changed
>> >> now...data
>> >> > caching looks much balanced now...earlier one node was becoming
>> >> > bottleneck...Although I ran with 64g memory per node...
>> >> >
>> >> > There are around 3M products, 25M users...
>> >> >
>> >> > Anyone noticed this bug or something similar ?
>> >> >
>> >> > 14/04/05 23:03:15 WARN TaskSetManager: Loss was due to
>> >> > java.lang.ArrayIndexOutOfBoundsException
>> >> > java.lang.ArrayIndexOutOfBoundsException: 81029
>> >> > at
>> >> >
>> >>
>> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1$$anonfun$apply$mcVI$sp$1.apply$mcVI$sp(ALS.scala:450)
>> >> > at
>> scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
>> >> > at
>> >> >
>> >>
>> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:446)
>> >> > at
>> scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
>> >> > at org.apache.spark.mllib.recommendation.ALS.org
>> >> > $apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:445)
>> >> > at
>> >> >
>> >>
>> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:416)
>> >> > at
>> >> >
>> >>
>> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:415)
>> >> > at
>> >> >
>> >>
>> org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
>> >> > at
>> >> >
>> >>
>> org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
>> >> > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>> >> > at
>> >> >
>> >>
>> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:149)
>> >> > at
>> >> >
>> >>
>> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:147)
>> >> > at
>> >> >
>> >>
>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>> >> > at
>> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>> >> > at
>> org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:147)
>> >> > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:229)
>> >> > at org.apache.spark.rdd.RDD.iterator(RDD.scala:220)
>> >> > at
>> >> > org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
>> >> > at o

Re: Flaky streaming tests

2014-04-07 Thread Michael Armbrust
I agree these should be disabled right away, and the JIRA can be used to
track fixing / turning them back on.


On Mon, Apr 7, 2014 at 11:33 AM, Michael Armbrust wrote:

> There is a JIRA for one of the flakey tests here:
> https://issues.apache.org/jira/browse/SPARK-1409
>
>
> On Mon, Apr 7, 2014 at 11:32 AM, Patrick Wendell wrote:
>
>> TD - do you know what is going on here?
>>
>> I looked into this a bit and at least a few of these use
>> Thread.sleep() and assume the sleep will be exact, which is wrong. We
>> should disable all the tests that do, and they should probably be
>> rewritten to virtualize time.
>>
>> - Patrick
>>
>>
>> On Mon, Apr 7, 2014 at 10:52 AM, Kay Ousterhout > >wrote:
>>
>> > Hi all,
>> >
>> > The InputStreamsSuite seems to have some serious flakiness issues --
>> I've
>> > seen the file input stream fail many times and now I'm seeing some actor
>> > input stream test failures (
>> >
>> >
>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13846/consoleFull
>> > )
>> > on what I think is an unrelated change.  Does anyone know anything about
>> > these?  Should we just remove some of these tests since they seem to be
>> > constantly failing?
>> >
>> > -Kay
>> >
>>
>
>


Re: Flaky streaming tests

2014-04-07 Thread Tathagata Das
Yes, I will take a look at those tests ASAP.

TD



On Mon, Apr 7, 2014 at 11:32 AM, Patrick Wendell  wrote:

> TD - do you know what is going on here?
>
> I looked into this a bit and at least a few of these use
> Thread.sleep() and assume the sleep will be exact, which is wrong. We
> should disable all the tests that do, and they should probably be rewritten
> to virtualize time.
>
> - Patrick
>
>
> On Mon, Apr 7, 2014 at 10:52 AM, Kay Ousterhout  >wrote:
>
> > Hi all,
> >
> > The InputStreamsSuite seems to have some serious flakiness issues -- I've
> > seen the file input stream fail many times and now I'm seeing some actor
> > input stream test failures (
> >
> >
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13846/consoleFull
> > )
> > on what I think is an unrelated change.  Does anyone know anything about
> > these?  Should we just remove some of these tests since they seem to be
> > constantly failing?
> >
> > -Kay
> >
>


Re: Flaky streaming tests

2014-04-07 Thread Michael Armbrust
There is a JIRA for one of the flakey tests here:
https://issues.apache.org/jira/browse/SPARK-1409


On Mon, Apr 7, 2014 at 11:32 AM, Patrick Wendell  wrote:

> TD - do you know what is going on here?
>
> I looked into this a bit and at least a few of these use
> Thread.sleep() and assume the sleep will be exact, which is wrong. We
> should disable all the tests that do, and they should probably be rewritten
> to virtualize time.
>
> - Patrick
>
>
> On Mon, Apr 7, 2014 at 10:52 AM, Kay Ousterhout  >wrote:
>
> > Hi all,
> >
> > The InputStreamsSuite seems to have some serious flakiness issues -- I've
> > seen the file input stream fail many times and now I'm seeing some actor
> > input stream test failures (
> >
> >
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13846/consoleFull
> > )
> > on what I think is an unrelated change.  Does anyone know anything about
> > these?  Should we just remove some of these tests since they seem to be
> > constantly failing?
> >
> > -Kay
> >
>


Re: Flaky streaming tests

2014-04-07 Thread Patrick Wendell
TD - do you know what is going on here?

I looked into this a bit and at least a few of these use
Thread.sleep() and assume the sleep will be exact, which is wrong. We
should disable all the tests that do, and they should probably be rewritten
to virtualize time.

- Patrick
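
A generic illustration of what virtualizing time could look like (not Spark's internal clock
API, just the pattern; names are illustrative): the code under test reads time from a Clock,
and the test advances a ManualClock deterministically instead of calling Thread.sleep.

  // Generic sketch of time virtualization for tests.
  trait Clock {
    def currentTime(): Long
  }

  class SystemClock extends Clock {
    override def currentTime(): Long = System.currentTimeMillis()
  }

  class ManualClock(start: Long = 0L) extends Clock {
    private var now = start
    override def currentTime(): Long = synchronized { now }
    def advance(millis: Long): Unit = synchronized {
      now += millis
      notifyAll()   // wake anything waiting for time to pass
    }
  }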


On Mon, Apr 7, 2014 at 10:52 AM, Kay Ousterhout wrote:

> Hi all,
>
> The InputStreamsSuite seems to have some serious flakiness issues -- I've
> seen the file input stream fail many times and now I'm seeing some actor
> input stream test failures (
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13846/consoleFull
> )
> on what I think is an unrelated change.  Does anyone know anything about
> these?  Should we just remove some of these tests since they seem to be
> constantly failing?
>
> -Kay
>


Re: Flaky streaming tests

2014-04-07 Thread Nan Zhu
I hit this issue when Jenkins seemed to be very busy.

On Monday, April 7, 2014, Kay Ousterhout  wrote:

> Hi all,
>
> The InputStreamsSuite seems to have some serious flakiness issues -- I've
> seen the file input stream fail many times and now I'm seeing some actor
> input stream test failures (
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13846/consoleFull
> )
> on what I think is an unrelated change.  Does anyone know anything about
> these?  Should we just remove some of these tests since they seem to be
> constantly failing?
>
> -Kay
>


Flaky streaming tests

2014-04-07 Thread Kay Ousterhout
Hi all,

The InputStreamsSuite seems to have some serious flakiness issues -- I've
seen the file input stream fail many times and now I'm seeing some actor
input stream test failures (
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13846/consoleFull)
on what I think is an unrelated change.  Does anyone know anything about
these?  Should we just remove some of these tests since they seem to be
constantly failing?

-Kay


Re: ALS array index out of bound with 50 factors

2014-04-07 Thread Debasish Das
I am using master...

No negative indexes...

If I run with 4 iterations it runs fine and I can generate factors...

With 10 iterations run fails with array index out of bound...

25m users and 3m products are within int limits

Does it help if I can point the logs for both the runs to you ?

I will debug it further today...
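
In the meantime, a generic sketch of the lineage-cutting idea mentioned further down this
thread (cut the chain and call checkpoint); this is not the ALS internals, and the checkpoint
directory and interval are illustrative:

  import org.apache.spark.SparkContext
  import org.apache.spark.rdd.RDD

  // Sketch: iterate some step function, caching each result and checkpointing
  // every few iterations so the DAG does not keep growing with the iteration count.
  def iterateWithCheckpoints[T](sc: SparkContext,
                                initial: RDD[T],
                                numIterations: Int,
                                checkpointInterval: Int)(step: RDD[T] => RDD[T]): RDD[T] = {
    sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")   // illustrative path
    var current = initial
    for (i <- 1 to numIterations) {
      current = step(current).cache()
      if (i % checkpointInterval == 0) {
        current.checkpoint()   // mark for checkpointing, truncating the lineage
        current.count()        // force materialization so the checkpoint actually happens
      }
    }
    current
  }
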
 On Apr 7, 2014 9:54 AM, "Xiangrui Meng"  wrote:

> Hi Deb,
>
> This thread is for the out-of-bound error you described. I don't think
> the number of iterations has any effect here. My questions were:
>
> 1) Are you using the master branch or a particular commit?
>
> 2) Do you have negative or out-of-integer-range user or product ids?
> Try to print out the max/min value of user/product ids.
>
> Best,
> Xiangrui
>
> On Sun, Apr 6, 2014 at 11:01 PM, Debasish Das 
> wrote:
> > Hi Xiangrui,
> >
> > With 4 ALS iterations it runs fine...If I run 10 I am failing...I
> believe I
> > have to cut the lineage chain and call checkpoint...Trying to follow the
> > other email chain on checkpointing...
> >
> > Thanks.
> > Deb
> >
> >
> > On Sun, Apr 6, 2014 at 9:08 PM, Xiangrui Meng  wrote:
> >
> >> Hi Deb,
> >>
> >> Are you using the master branch or a particular commit? Do you have
> >> negative or out-of-integer-range user or product ids? There is an
> >> issue with ALS' partitioning
> >> (https://spark-project.atlassian.net/browse/SPARK-1281), but I'm not
> >> sure whether that is the reason. Could you try to see whether you can
> >> reproduce the error on a public data set, e.g., movielens? Thanks!
> >>
> >> Best,
> >> Xiangrui
> >>
> >> On Sat, Apr 5, 2014 at 10:53 PM, Debasish Das  >
> >> wrote:
> >> > Hi,
> >> >
> >> > I deployed apache/spark master today and recently there were many ALS
> >> > related checkins and enhancements..
> >> >
> >> > I am running ALS with explicit feedback and I remember most
> enhancements
> >> > were related to implicit feedback...
> >> >
> >> > With 25 factors my runs were successful but with 50 factors I am
> getting
> >> > array index out of bound...
> >> >
> >> > Note that I was hitting gc errors before with an older version of
> spark
> >> but
> >> > it seems like the sparse matrix partitioning scheme has changed
> >> now...data
> >> > caching looks much balanced now...earlier one node was becoming
> >> > bottleneck...Although I ran with 64g memory per node...
> >> >
> >> > There are around 3M products, 25M users...
> >> >
> >> > Anyone noticed this bug or something similar ?
> >> >
> >> > 14/04/05 23:03:15 WARN TaskSetManager: Loss was due to
> >> > java.lang.ArrayIndexOutOfBoundsException
> >> > java.lang.ArrayIndexOutOfBoundsException: 81029
> >> > at
> >> >
> >>
> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1$$anonfun$apply$mcVI$sp$1.apply$mcVI$sp(ALS.scala:450)
> >> > at
> scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
> >> > at
> >> >
> >>
> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:446)
> >> > at
> scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
> >> > at org.apache.spark.mllib.recommendation.ALS.org
> >> > $apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:445)
> >> > at
> >> >
> >>
> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:416)
> >> > at
> >> >
> >>
> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:415)
> >> > at
> >> >
> >>
> org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
> >> > at
> >> >
> >>
> org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
> >> > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> >> > at
> >> >
> >>
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:149)
> >> > at
> >> >
> >>
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:147)
> >> > at
> >> >
> >>
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> >> > at
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> >> > at
> org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:147)
> >> > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:229)
> >> > at org.apache.spark.rdd.RDD.iterator(RDD.scala:220)
> >> > at
> >> > org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
> >> > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:229)
> >> > at org.apache.spark.rdd.RDD.iterator(RDD.scala:220)
> >> > at
> >> >
> >>
> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
> >> > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:229)
> >> > at org.apach

Re: ALS array index out of bound with 50 factors

2014-04-07 Thread Xiangrui Meng
Hi Deb,

This thread is for the out-of-bound error you described. I don't think
the number of iterations has any effect here. My questions were:

1) Are you using the master branch or a particular commit?

2) Do you have negative or out-of-integer-range user or product ids?
Try to print out the max/min value of user/product ids.

Best,
Xiangrui

On Sun, Apr 6, 2014 at 11:01 PM, Debasish Das  wrote:
> Hi Xiangrui,
>
> With 4 ALS iterations it runs fine...If I run 10 I am failing...I believe I
> have to cut the lineage chain and call checkpoint...Trying to follow the
> other email chain on checkpointing...
>
> Thanks.
> Deb
>
>
> On Sun, Apr 6, 2014 at 9:08 PM, Xiangrui Meng  wrote:
>
>> Hi Deb,
>>
>> Are you using the master branch or a particular commit? Do you have
>> negative or out-of-integer-range user or product ids? There is an
>> issue with ALS' partitioning
>> (https://spark-project.atlassian.net/browse/SPARK-1281), but I'm not
>> sure whether that is the reason. Could you try to see whether you can
>> reproduce the error on a public data set, e.g., movielens? Thanks!
>>
>> Best,
>> Xiangrui
>>
>> On Sat, Apr 5, 2014 at 10:53 PM, Debasish Das 
>> wrote:
>> > Hi,
>> >
>> > I deployed apache/spark master today and recently there were many ALS
>> > related checkins and enhancements..
>> >
>> > I am running ALS with explicit feedback and I remember most enhancements
>> > were related to implicit feedback...
>> >
>> > With 25 factors my runs were successful but with 50 factors I am getting
>> > array index out of bound...
>> >
>> > Note that I was hitting gc errors before with an older version of spark
>> but
>> > it seems like the sparse matrix partitioning scheme has changed
>> now...data
>> > caching looks much balanced now...earlier one node was becoming
>> > bottleneck...Although I ran with 64g memory per node...
>> >
>> > There are around 3M products, 25M users...
>> >
>> > Anyone noticed this bug or something similar ?
>> >
>> > 14/04/05 23:03:15 WARN TaskSetManager: Loss was due to
>> > java.lang.ArrayIndexOutOfBoundsException
>> > java.lang.ArrayIndexOutOfBoundsException: 81029
>> > at
>> >
>> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1$$anonfun$apply$mcVI$sp$1.apply$mcVI$sp(ALS.scala:450)
>> > at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
>> > at
>> >
>> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:446)
>> > at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
>> > at org.apache.spark.mllib.recommendation.ALS.org
>> > $apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:445)
>> > at
>> >
>> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:416)
>> > at
>> >
>> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:415)
>> > at
>> >
>> org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
>> > at
>> >
>> org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
>> > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>> > at
>> >
>> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:149)
>> > at
>> >
>> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:147)
>> > at
>> >
>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>> > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>> > at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:147)
>> > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:229)
>> > at org.apache.spark.rdd.RDD.iterator(RDD.scala:220)
>> > at
>> > org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
>> > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:229)
>> > at org.apache.spark.rdd.RDD.iterator(RDD.scala:220)
>> > at
>> >
>> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
>> > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:229)
>> > at org.apache.spark.rdd.RDD.iterator(RDD.scala:220)
>> > at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
>> > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:229)
>> > at org.apache.spark.rdd.RDD.iterator(RDD.scala:220)
>> > at
>> >
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
>> > at
>> >
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
>> > at org.apache.spark.scheduler.Task.run(Task.scala:52)
>> > at
>> >
>> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:211)
>> > at

Re: Contributing to Spark

2014-04-07 Thread Sujeet Varakhedi
This is a good place to start:
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

Sujeet


On Mon, Apr 7, 2014 at 9:20 AM, Mukesh G  wrote:

> Hi,
>
>    How do I contribute to Spark and its associated projects?
>
> Appreciate the help...
>
> Thanks
>
> Mukesh
>


Contributing to Spark

2014-04-07 Thread Mukesh G
Hi,

   How do I contribute to Spark and its associated projects?

Appreciate the help...

Thanks

Mukesh


Re: tachyon dependency

2014-04-07 Thread Haoyuan Li
Tachyon is Java 6 compatible from version 0.4. Besides putting input/output
data in Tachyon ( http://tachyon-project.org/Running-Spark-on-Tachyon.html ),
Spark applications can also persist data into Tachyon (
https://github.com/apache/spark/blob/master/docs/scala-programming-guide.md
).
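
A small sketch of the second option (persisting from a Spark application), assuming the
OFF_HEAP (Tachyon-backed) storage level from the Spark 1.0 work; the input path is
illustrative:

  import org.apache.spark.SparkContext
  import org.apache.spark.storage.StorageLevel

  // Sketch: keep the RDD's blocks in Tachyon (off-heap) instead of the executor JVM heap.
  def cacheInTachyon(sc: SparkContext, path: String) = {
    val data = sc.textFile(path)
    data.persist(StorageLevel.OFF_HEAP)
    data.count()   // materialize the blocks into the off-heap (Tachyon) store
    data
  }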


On Mon, Apr 7, 2014 at 7:42 AM, Koert Kuipers  wrote:

> I noticed there is a dependency on Tachyon in Spark core 1.0.0-SNAPSHOT.
> How does that work? I believe Tachyon is written in Java 7, yet Spark
> claims to be Java 6 compatible.
>



-- 
Haoyuan Li
Algorithms, Machines, People Lab, EECS, UC Berkeley
http://www.cs.berkeley.edu/~haoyuan/


tachyon dependency

2014-04-07 Thread Koert Kuipers
I noticed there is a dependency on Tachyon in Spark core 1.0.0-SNAPSHOT.
How does that work? I believe Tachyon is written in Java 7, yet Spark
claims to be Java 6 compatible.


Re: ALS array index out of bound with 50 factors

2014-04-07 Thread Debasish Das
Nick,

I already have this code, which does the dictionary generation and then maps
strings etc. to ints...I think the core algorithm should stay in ints...if
you like I can add this code in MFUtils.scala...that's the convention I
followed, similar to MLUtils.scala...actually these functions should even be
made part of MLUtils.scala...

The only thing is that the join should be an option, which makes it application
dependent...sometimes people would like to do map-side joins if their
dictionaries are small...in my case the user dictionary has 25M rows and the
product dictionary has 3M rows...so the join optimization did not help...

Thanks.
Deb
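
For concreteness, a rough sketch of the dictionary + join approach described above (the
helper name and the use of zipWithIndex are illustrative; ids are assumed to fit in an Int
as noted earlier):

  import org.apache.spark.SparkContext._   // pair-RDD functions (join, mapValues)
  import org.apache.spark.mllib.recommendation.Rating
  import org.apache.spark.rdd.RDD

  // raw: (userString, productString, rating); the core ALS algorithm stays on Ints.
  def toIntRatings(raw: RDD[(String, String, Double)]): RDD[Rating] = {
    val userIds = raw.map(_._1).distinct().zipWithIndex().mapValues(_.toInt)      // String -> Int
    val productIds = raw.map(_._2).distinct().zipWithIndex().mapValues(_.toInt)

    raw.map { case (u, p, r) => (u, (p, r)) }
      .join(userIds)                                     // attach the Int user id
      .map { case (_, ((p, r), uid)) => (p, (uid, r)) }
      .join(productIds)                                  // attach the Int product id
      .map { case (_, ((uid, r), pid)) => Rating(uid, pid, r) }
  }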



On Mon, Apr 7, 2014 at 6:57 AM, Nick Pentreath wrote:

> On the partitioning / id keys: if we were to look at hash partitioning, how
> feasible would it be to just allow the user and item ids to be strings? A
> lot of the time these ids are strings anyway (UUIDs and so on), and it's
> really painful to translate between String <-> Int the whole time.
>
> Are there any obvious blockers to this? I am a bit rusty on the ALS code
> but from a quick scan I think this may work. Performance may be an issue
> with large String keys... Any major issues/objections to this thinking?
>
> I may be able to find time to take a stab at this if there is demand.
>
>
> On Mon, Apr 7, 2014 at 6:08 AM, Xiangrui Meng  wrote:
>
> > Hi Deb,
> >
> > Are you using the master branch or a particular commit? Do you have
> > negative or out-of-integer-range user or product ids? There is an
> > issue with ALS' partitioning
> > (https://spark-project.atlassian.net/browse/SPARK-1281), but I'm not
> > sure whether that is the reason. Could you try to see whether you can
> > reproduce the error on a public data set, e.g., movielens? Thanks!
> >
> > Best,
> > Xiangrui
> >
> > On Sat, Apr 5, 2014 at 10:53 PM, Debasish Das 
> > wrote:
> > > Hi,
> > >
> > > I deployed apache/spark master today and recently there were many ALS
> > > related checkins and enhancements..
> > >
> > > I am running ALS with explicit feedback and I remember most
> enhancements
> > > were related to implicit feedback...
> > >
> > > With 25 factors my runs were successful but with 50 factors I am
> getting
> > > array index out of bound...
> > >
> > > Note that I was hitting gc errors before with an older version of spark
> > but
> > > it seems like the sparse matrix partitioning scheme has changed
> > now...data
> > > caching looks much balanced now...earlier one node was becoming
> > > bottleneck...Although I ran with 64g memory per node...
> > >
> > > There are around 3M products, 25M users...
> > >
> > > Anyone noticed this bug or something similar ?
> > >
> > > 14/04/05 23:03:15 WARN TaskSetManager: Loss was due to
> > > java.lang.ArrayIndexOutOfBoundsException
> > > java.lang.ArrayIndexOutOfBoundsException: 81029
> > > at
> > >
> >
> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1$$anonfun$apply$mcVI$sp$1.apply$mcVI$sp(ALS.scala:450)
> > > at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
> > > at
> > >
> >
> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:446)
> > > at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
> > > at org.apache.spark.mllib.recommendation.ALS.org
> > > $apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:445)
> > > at
> > >
> >
> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:416)
> > > at
> > >
> >
> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:415)
> > > at
> > >
> >
> org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
> > > at
> > >
> >
> org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
> > > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> > > at
> > >
> >
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:149)
> > > at
> > >
> >
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:147)
> > > at
> > >
> >
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> > > at
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> > > at
> org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:147)
> > > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:229)
> > > at org.apache.spark.rdd.RDD.iterator(RDD.scala:220)
> > > at
> > > org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
> > > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:229)
> > > at org.apache.spark.rdd.RDD.iterator(RDD.scala:220)
> > > at
> > >
> >
> org.apache.spark.rdd.FlatMappedValuesRDD.compute(

Re: ALS array index out of bound with 50 factors

2014-04-07 Thread Nick Pentreath
On the partitioning / id keys: if we were to look at hash partitioning, how
feasible would it be to just allow the user and item ids to be strings? A
lot of the time these ids are strings anyway (UUIDs and so on), and it's
really painful to translate between String <-> Int the whole time.

Are there any obvious blockers to this? I am a bit rusty on the ALS code
but from a quick scan I think this may work. Performance may be an issue
with large String keys... Any major issues/objections to this thinking?

I may be able to find time to take a stab at this if there is demand.
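
For context, the generic pair-RDD machinery already hash-partitions String keys the same way
it does Ints, so the partitioning step itself is not the blocker — a minimal sketch (generic
pair-RDD code, not the ALS internals):

  import org.apache.spark.HashPartitioner
  import org.apache.spark.SparkContext._   // pair-RDD functions
  import org.apache.spark.rdd.RDD

  // Sketch: ratings keyed by a String user id, hash-partitioned across the cluster.
  // (userId, (productId, rating))
  def partitionByUser(ratings: RDD[(String, (String, Double))], numPartitions: Int) =
    ratings.partitionBy(new HashPartitioner(numPartitions))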


On Mon, Apr 7, 2014 at 6:08 AM, Xiangrui Meng  wrote:

> Hi Deb,
>
> Are you using the master branch or a particular commit? Do you have
> negative or out-of-integer-range user or product ids? There is an
> issue with ALS' partitioning
> (https://spark-project.atlassian.net/browse/SPARK-1281), but I'm not
> sure whether that is the reason. Could you try to see whether you can
> reproduce the error on a public data set, e.g., movielens? Thanks!
>
> Best,
> Xiangrui
>
> On Sat, Apr 5, 2014 at 10:53 PM, Debasish Das 
> wrote:
> > Hi,
> >
> > I deployed apache/spark master today and recently there were many ALS
> > related checkins and enhancements..
> >
> > I am running ALS with explicit feedback and I remember most enhancements
> > were related to implicit feedback...
> >
> > With 25 factors my runs were successful but with 50 factors I am getting
> > array index out of bound...
> >
> > Note that I was hitting gc errors before with an older version of spark
> but
> > it seems like the sparse matrix partitioning scheme has changed
> now...data
> > caching looks much balanced now...earlier one node was becoming
> > bottleneck...Although I ran with 64g memory per node...
> >
> > There are around 3M products, 25M users...
> >
> > Anyone noticed this bug or something similar ?
> >
> > 14/04/05 23:03:15 WARN TaskSetManager: Loss was due to
> > java.lang.ArrayIndexOutOfBoundsException
> > java.lang.ArrayIndexOutOfBoundsException: 81029
> > at
> >
> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1$$anonfun$apply$mcVI$sp$1.apply$mcVI$sp(ALS.scala:450)
> > at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
> > at
> >
> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:446)
> > at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
> > at org.apache.spark.mllib.recommendation.ALS.org
> > $apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:445)
> > at
> >
> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:416)
> > at
> >
> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:415)
> > at
> >
> org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
> > at
> >
> org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
> > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> > at
> >
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:149)
> > at
> >
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:147)
> > at
> >
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> > at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:147)
> > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:229)
> > at org.apache.spark.rdd.RDD.iterator(RDD.scala:220)
> > at
> > org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
> > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:229)
> > at org.apache.spark.rdd.RDD.iterator(RDD.scala:220)
> > at
> >
> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
> > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:229)
> > at org.apache.spark.rdd.RDD.iterator(RDD.scala:220)
> > at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
> > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:229)
> > at org.apache.spark.rdd.RDD.iterator(RDD.scala:220)
> > at
> >
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
> > at
> >
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
> > at org.apache.spark.scheduler.Task.run(Task.scala:52)
> > at
> >
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:211)
> > at
> >
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:43)
> > at
> >
> org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(Sp