Re: MLLib - Thoughts about refactoring Updater for LBFGS?
By the way, what's the idea? The labeled data set is an RDD which is cached on all nodes. Is the BFGS solver maintained on the master, or is each worker supposed to maintain its own BFGS? On Mon, Apr 7, 2014 at 11:23 PM, Debasish Das wrote: > I got your checkin. I need to run logistic regression SGD vs BFGS for my > current usecases but your next checkin will update the logistic regression > with LBFGS right ? Are you adding it to regression package as well ? > > Thanks. > Deb > > > On Mon, Apr 7, 2014 at 7:00 PM, DB Tsai wrote: > >> Hi guys, >> >> The latest PR uses Breeze's L-BFGS implement which is introduced by >> Xiangrui's sparse input format work in SPARK-1212. >> >> https://github.com/apache/spark/pull/353 >> >> Now, it works with the new sparse framework! >> >> Any feedback would be greatly appreciated. >> >> Thanks. >> >> Sincerely, >> >> DB Tsai >> --- >> My Blog: https://www.dbtsai.com >> LinkedIn: https://www.linkedin.com/in/dbtsai >> >> >> On Thu, Apr 3, 2014 at 5:02 PM, DB Tsai wrote: >> > -- Forwarded message -- >> > From: David Hall >> > Date: Sat, Mar 15, 2014 at 10:02 AM >> > Subject: Re: MLLib - Thoughts about refactoring Updater for LBFGS? >> > To: DB Tsai >> > >> > >> > On Fri, Mar 7, 2014 at 10:56 PM, DB Tsai wrote: >> >> >> >> Hi David, >> >> >> >> Please let me know the version of Breeze that LBFGS can be serialized, >> >> and CachedDiffFunction is built-in in LBFGS once you finish. I'll >> >> update the PR to Spark from using RISO implementation to Breeze >> >> implementation. >> > >> > >> > The current master (0.7-SNAPSHOT) has these problems fixed. >> > >> >> >> >> >> >> Thanks. 
>> >> >> >> Sincerely, >> >> >> >> DB Tsai >> >> Machine Learning Engineer >> >> Alpine Data Labs >> >> -- >> >> Web: http://alpinenow.com/ >> >> >> >> >> >> On Thu, Mar 6, 2014 at 4:26 PM, David Hall >> wrote: >> >> > On Thu, Mar 6, 2014 at 4:21 PM, DB Tsai >> wrote: >> >> > >> >> >> Hi David, >> >> >> >> >> >> I can converge to the same result with your breeze LBFGS and Fortran >> >> >> implementations now. Probably, I made some mistakes when I tried >> >> >> breeze before. I apologize that I claimed it's not stable. >> >> >> >> >> >> See the test case in BreezeLBFGSSuite.scala >> >> >> https://github.com/AlpineNow/spark/tree/dbtsai-breezeLBFGS >> >> >> >> >> >> This is training multinomial logistic regression against iris >> dataset, >> >> >> and both optimizers can train the models with 98% training accuracy. >> >> >> >> >> > >> >> > great to hear! There were some bugs in LBFGS about 6 months ago, so >> >> > depending on the last time you tried it, it might indeed have been >> >> > bugged. >> >> > >> >> > >> >> >> >> >> >> There are two issues to use Breeze in Spark, >> >> >> >> >> >> 1) When the gradientSum and lossSum are computed distributively in >> >> >> custom defined DiffFunction which will be passed into your >> optimizer, >> >> >> Spark will complain LBFGS class is not serializable. In >> >> >> BreezeLBFGS.scala, I've to convert RDD to array to make it work >> >> >> locally. It should be easy to fix by just having LBFGS to implement >> >> >> Serializable. >> >> >> >> >> > >> >> > I'm not sure why Spark should be serializing LBFGS? Shouldn't it >> live on >> >> > the controller node? Or is this a per-node thing? >> >> > >> >> > But no problem to make it serializable. >> >> > >> >> > >> >> >> >> >> >> 2) Breeze computes redundant gradient and loss. See the following >> log >> >> >> from both Fortran and Breeze implementations. >> >> >> >> >> > >> >> > Err, yeah. 
I should probably have LBFGS do this automatically, but >> >> > there's >> >> > a CachedDiffFunction that gets rid of the redundant calculations. >> >> > >> >> > -- David >> >> > >> >> > >> >> >> >> >> >> Thanks. >> >> >> >> >> >> Fortran: >> >> >> Iteration -1: loss 1.3862943611198926, diff 1.0 >> >> >> Iteration 0: loss 1.5846343143210866, diff 0.14307193024217352 >> >> >> Iteration 1: loss 1.1242501524477688, diff 0.29053004039012126 >> >> >> Iteration 2: loss 1.0930151243303563, diff 0.027782962952189336 >> >> >> Iteration 3: loss 1.054036932835569, diff 0.03566113127440601 >> >> >> Iteration 4: loss 0.9907956302751622, diff 0.0507649459571 >> >> >> Iteration 5: loss 0.9184205380342829, diff 0.07304737423337761 >> >> >> Iteration 6: loss 0.8259870936519937, diff 0.10064381175132982 >> >> >> Iteration 7: loss 0.6327447552109574, diff 0.23395293458364716 >> >> >> Iteration 8: loss 0.5534101162436359, diff 0.1253815427665277 >> >> >> Iteration 9: loss 0.4045020086612566, diff 0.26907321376758075 >> >> >> Iteration 10: loss 0.3078824990823728, diff 0.23885980452569627 >> >> >> >> >> >> Breeze: >> >> >> Iteration -1: loss 1.3862943611198926, diff 1.0 >> >> >> Mar 6, 2014 3:59:11 PM com.github.fommil.netlib.BLAS >> >> >> WARNING: Failed to load implementation from: >> >> >> com.github.fommil.netlib.NativeSystemBLA
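A minimal sketch of the caching idea David describes: remember the last (point, result) pair so a repeated request at the same point costs nothing, which removes the duplicate iterations visible in the Breeze log above. The `DiffFunction` trait and `cached` helper below are simplified stand-ins for illustration, not Breeze's actual `CachedDiffFunction`:

```scala
// Simplified stand-in for a differentiable function: returns (loss, gradient).
trait DiffFunction { def calculate(x: Vector[Double]): (Double, Vector[Double]) }

var evaluations = 0

// An "expensive" function: f(x) = sum(x_i^2), gradient 2x.
val quadratic = new DiffFunction {
  def calculate(x: Vector[Double]) = {
    evaluations += 1
    (x.map(v => v * v).sum, x.map(_ * 2.0))
  }
}

// Cache the last (point, result) pair so repeated requests at the same
// point -- as in the redundant Breeze iterations above -- are free.
def cached(f: DiffFunction): DiffFunction = new DiffFunction {
  private var last: Option[(Vector[Double], (Double, Vector[Double]))] = None
  def calculate(x: Vector[Double]) = last match {
    case Some((px, res)) if px == x => res
    case _ =>
      val res = f.calculate(x)
      last = Some((x, res))
      res
  }
}

val g = cached(quadratic)
val x = Vector(1.0, 2.0)
val (loss1, _) = g.calculate(x)
val (loss2, _) = g.calculate(x) // served from the cache; no recomputation
```

The real caching wrapper in Breeze can hold more state, but the single-entry cache is enough to collapse back-to-back evaluations at the same point.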
Re: MLLib - Thoughts about refactoring Updater for LBFGS?
I got your checkin. I need to run logistic regression SGD vs BFGS for my current use cases, but your next checkin will update the logistic regression with LBFGS, right? Are you adding it to the regression package as well? Thanks. Deb On Mon, Apr 7, 2014 at 7:00 PM, DB Tsai wrote: > Hi guys, > > The latest PR uses Breeze's L-BFGS implement which is introduced by > Xiangrui's sparse input format work in SPARK-1212. > > https://github.com/apache/spark/pull/353 > > Now, it works with the new sparse framework! > > Any feedback would be greatly appreciated. > > Thanks. > > Sincerely, > > DB Tsai > --- > My Blog: https://www.dbtsai.com > LinkedIn: https://www.linkedin.com/in/dbtsai > > > On Thu, Apr 3, 2014 at 5:02 PM, DB Tsai wrote: > > -- Forwarded message -- > > From: David Hall > > Date: Sat, Mar 15, 2014 at 10:02 AM > > Subject: Re: MLLib - Thoughts about refactoring Updater for LBFGS? > > To: DB Tsai > > > > > > On Fri, Mar 7, 2014 at 10:56 PM, DB Tsai wrote: > >> > >> Hi David, > >> > >> Please let me know the version of Breeze that LBFGS can be serialized, > >> and CachedDiffFunction is built-in in LBFGS once you finish. I'll > >> update the PR to Spark from using RISO implementation to Breeze > >> implementation. > > > > > > The current master (0.7-SNAPSHOT) has these problems fixed. > > > >> > >> > >> Thanks. > >> > >> Sincerely, > >> > >> DB Tsai > >> Machine Learning Engineer > >> Alpine Data Labs > >> -- > >> Web: http://alpinenow.com/ > >> > >> > >> On Thu, Mar 6, 2014 at 4:26 PM, David Hall > wrote: > >> > On Thu, Mar 6, 2014 at 4:21 PM, DB Tsai wrote: > >> > > >> >> Hi David, > >> >> > >> >> I can converge to the same result with your breeze LBFGS and Fortran > >> >> implementations now. Probably, I made some mistakes when I tried > >> >> breeze before. I apologize that I claimed it's not stable. 
> >> >> > >> >> See the test case in BreezeLBFGSSuite.scala > >> >> https://github.com/AlpineNow/spark/tree/dbtsai-breezeLBFGS > >> >> > >> >> This is training multinomial logistic regression against iris > dataset, > >> >> and both optimizers can train the models with 98% training accuracy. > >> >> > >> > > >> > great to hear! There were some bugs in LBFGS about 6 months ago, so > >> > depending on the last time you tried it, it might indeed have been > >> > bugged. > >> > > >> > > >> >> > >> >> There are two issues to use Breeze in Spark, > >> >> > >> >> 1) When the gradientSum and lossSum are computed distributively in > >> >> custom defined DiffFunction which will be passed into your optimizer, > >> >> Spark will complain LBFGS class is not serializable. In > >> >> BreezeLBFGS.scala, I've to convert RDD to array to make it work > >> >> locally. It should be easy to fix by just having LBFGS to implement > >> >> Serializable. > >> >> > >> > > >> > I'm not sure why Spark should be serializing LBFGS? Shouldn't it live > on > >> > the controller node? Or is this a per-node thing? > >> > > >> > But no problem to make it serializable. > >> > > >> > > >> >> > >> >> 2) Breeze computes redundant gradient and loss. See the following log > >> >> from both Fortran and Breeze implementations. > >> >> > >> > > >> > Err, yeah. I should probably have LBFGS do this automatically, but > >> > there's > >> > a CachedDiffFunction that gets rid of the redundant calculations. > >> > > >> > -- David > >> > > >> > > >> >> > >> >> Thanks. 
> >> >> > >> >> Fortran: > >> >> Iteration -1: loss 1.3862943611198926, diff 1.0 > >> >> Iteration 0: loss 1.5846343143210866, diff 0.14307193024217352 > >> >> Iteration 1: loss 1.1242501524477688, diff 0.29053004039012126 > >> >> Iteration 2: loss 1.0930151243303563, diff 0.027782962952189336 > >> >> Iteration 3: loss 1.054036932835569, diff 0.03566113127440601 > >> >> Iteration 4: loss 0.9907956302751622, diff 0.0507649459571 > >> >> Iteration 5: loss 0.9184205380342829, diff 0.07304737423337761 > >> >> Iteration 6: loss 0.8259870936519937, diff 0.10064381175132982 > >> >> Iteration 7: loss 0.6327447552109574, diff 0.23395293458364716 > >> >> Iteration 8: loss 0.5534101162436359, diff 0.1253815427665277 > >> >> Iteration 9: loss 0.4045020086612566, diff 0.26907321376758075 > >> >> Iteration 10: loss 0.3078824990823728, diff 0.23885980452569627 > >> >> > >> >> Breeze: > >> >> Iteration -1: loss 1.3862943611198926, diff 1.0 > >> >> Mar 6, 2014 3:59:11 PM com.github.fommil.netlib.BLAS > >> >> WARNING: Failed to load implementation from: > >> >> com.github.fommil.netlib.NativeSystemBLAS > >> >> Mar 6, 2014 3:59:11 PM com.github.fommil.netlib.BLAS > >> >> WARNING: Failed to load implementation from: > >> >> com.github.fommil.netlib.NativeRefBLAS > >> >> Iteration 0: loss 1.3862943611198926, diff 0.0 > >> >> Iteration 1: loss 1.5846343143210866, diff 0.14307193024217352 > >> >> Iteration 2: loss 1.1242501524477688, diff 0.29053004039012126 > >> >> Iteration 3: loss 1.12425015244776
Re: Contributing to Spark
I’d suggest looking for the issues labeled “Starter” on JIRA. You can find them here: https://issues.apache.org/jira/browse/SPARK-1438?jql=project%20%3D%20SPARK%20AND%20labels%20%3D%20Starter%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened) Matei On Apr 7, 2014, at 9:45 PM, Mukesh G wrote: > Hi Sujeet, > >Thanks. I went thru the website and looks great. Is there a list of > items that I can choose from, for contribution? > > Thanks > > Mukesh > > > On Mon, Apr 7, 2014 at 10:14 PM, Sujeet Varakhedi > wrote: > >> This is a good place to start: >> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark >> >> Sujeet >> >> >> On Mon, Apr 7, 2014 at 9:20 AM, Mukesh G wrote: >> >>> Hi, >>> >>> How I contribute to Spark and it's associated projects? >>> >>> Appreciate the help... >>> >>> Thanks >>> >>> Mukesh >>> >>
Re: Contributing to Spark
Hi Sujeet, Thanks. I went through the website and it looks great. Is there a list of items that I can choose from, for contribution? Thanks Mukesh On Mon, Apr 7, 2014 at 10:14 PM, Sujeet Varakhedi wrote: > This is a good place to start: > https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark > > Sujeet > > > On Mon, Apr 7, 2014 at 9:20 AM, Mukesh G wrote: > > > Hi, > > > >How I contribute to Spark and it's associated projects? > > > > Appreciate the help... > > > > Thanks > > > > Mukesh > > >
Re: MLLib - Thoughts about refactoring Updater for LBFGS?
Hi guys, The latest PR uses Breeze's L-BFGS implementation, which was introduced by Xiangrui's sparse input format work in SPARK-1212. https://github.com/apache/spark/pull/353 Now, it works with the new sparse framework! Any feedback would be greatly appreciated. Thanks. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai On Thu, Apr 3, 2014 at 5:02 PM, DB Tsai wrote: > -- Forwarded message -- > From: David Hall > Date: Sat, Mar 15, 2014 at 10:02 AM > Subject: Re: MLLib - Thoughts about refactoring Updater for LBFGS? > To: DB Tsai > > > On Fri, Mar 7, 2014 at 10:56 PM, DB Tsai wrote: >> >> Hi David, >> >> Please let me know the version of Breeze that LBFGS can be serialized, >> and CachedDiffFunction is built-in in LBFGS once you finish. I'll >> update the PR to Spark from using RISO implementation to Breeze >> implementation. > > > The current master (0.7-SNAPSHOT) has these problems fixed. > >> >> >> Thanks. >> >> Sincerely, >> >> DB Tsai >> Machine Learning Engineer >> Alpine Data Labs >> -- >> Web: http://alpinenow.com/ >> >> >> On Thu, Mar 6, 2014 at 4:26 PM, David Hall wrote: >> > On Thu, Mar 6, 2014 at 4:21 PM, DB Tsai wrote: >> > >> >> Hi David, >> >> >> >> I can converge to the same result with your breeze LBFGS and Fortran >> >> implementations now. Probably, I made some mistakes when I tried >> >> breeze before. I apologize that I claimed it's not stable. >> >> >> >> See the test case in BreezeLBFGSSuite.scala >> >> https://github.com/AlpineNow/spark/tree/dbtsai-breezeLBFGS >> >> >> >> This is training multinomial logistic regression against iris dataset, >> >> and both optimizers can train the models with 98% training accuracy. >> >> >> > >> > great to hear! There were some bugs in LBFGS about 6 months ago, so >> > depending on the last time you tried it, it might indeed have been >> > bugged. 
>> > >> > >> >> >> >> There are two issues to use Breeze in Spark, >> >> >> >> 1) When the gradientSum and lossSum are computed distributively in >> >> custom defined DiffFunction which will be passed into your optimizer, >> >> Spark will complain LBFGS class is not serializable. In >> >> BreezeLBFGS.scala, I've to convert RDD to array to make it work >> >> locally. It should be easy to fix by just having LBFGS to implement >> >> Serializable. >> >> >> > >> > I'm not sure why Spark should be serializing LBFGS? Shouldn't it live on >> > the controller node? Or is this a per-node thing? >> > >> > But no problem to make it serializable. >> > >> > >> >> >> >> 2) Breeze computes redundant gradient and loss. See the following log >> >> from both Fortran and Breeze implementations. >> >> >> > >> > Err, yeah. I should probably have LBFGS do this automatically, but >> > there's >> > a CachedDiffFunction that gets rid of the redundant calculations. >> > >> > -- David >> > >> > >> >> >> >> Thanks. 
>> >> >> >> Fortran: >> >> Iteration -1: loss 1.3862943611198926, diff 1.0 >> >> Iteration 0: loss 1.5846343143210866, diff 0.14307193024217352 >> >> Iteration 1: loss 1.1242501524477688, diff 0.29053004039012126 >> >> Iteration 2: loss 1.0930151243303563, diff 0.027782962952189336 >> >> Iteration 3: loss 1.054036932835569, diff 0.03566113127440601 >> >> Iteration 4: loss 0.9907956302751622, diff 0.0507649459571 >> >> Iteration 5: loss 0.9184205380342829, diff 0.07304737423337761 >> >> Iteration 6: loss 0.8259870936519937, diff 0.10064381175132982 >> >> Iteration 7: loss 0.6327447552109574, diff 0.23395293458364716 >> >> Iteration 8: loss 0.5534101162436359, diff 0.1253815427665277 >> >> Iteration 9: loss 0.4045020086612566, diff 0.26907321376758075 >> >> Iteration 10: loss 0.3078824990823728, diff 0.23885980452569627 >> >> >> >> Breeze: >> >> Iteration -1: loss 1.3862943611198926, diff 1.0 >> >> Mar 6, 2014 3:59:11 PM com.github.fommil.netlib.BLAS >> >> WARNING: Failed to load implementation from: >> >> com.github.fommil.netlib.NativeSystemBLAS >> >> Mar 6, 2014 3:59:11 PM com.github.fommil.netlib.BLAS >> >> WARNING: Failed to load implementation from: >> >> com.github.fommil.netlib.NativeRefBLAS >> >> Iteration 0: loss 1.3862943611198926, diff 0.0 >> >> Iteration 1: loss 1.5846343143210866, diff 0.14307193024217352 >> >> Iteration 2: loss 1.1242501524477688, diff 0.29053004039012126 >> >> Iteration 3: loss 1.1242501524477688, diff 0.0 >> >> Iteration 4: loss 1.1242501524477688, diff 0.0 >> >> Iteration 5: loss 1.0930151243303563, diff 0.027782962952189336 >> >> Iteration 6: loss 1.0930151243303563, diff 0.0 >> >> Iteration 7: loss 1.0930151243303563, diff 0.0 >> >> Iteration 8: loss 1.054036932835569, diff 0.03566113127440601 >> >> Iteration 9: loss 1.054036932835569, diff 0.0 >> >> Iteration 10: loss 1.054036932835569, diff 0.0 >> >> Iteration 11: loss 0.9907956302751622, diff 0.0507649459571 >> >> Iteration 12: loss 0.9907956302751622, diff 0.0 >> >> I
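The split DB Tsai describes in issue (1) above, with gradientSum and lossSum computed across the cluster while the optimizer state stays on the driver, can be sketched without Spark at all. Here partitions are simulated with a grouped Seq instead of an RDD (so nothing needs to be serialized), and plain gradient steps stand in for the L-BFGS updates; the data and names are illustrative:

```scala
// (feature, label) pairs for a 1-D least-squares problem: loss = (w*x - y)^2 / 2
val data = Seq((1.0, 2.0), (2.0, 4.0), (3.0, 6.0)) // y = 2x exactly
val partitions = data.grouped(2).toSeq             // stand-in for RDD partitions

// Per-"partition" partial sums, then a driver-side reduce -- the same
// shape as computing lossSum/gradientSum distributively in Spark.
def lossAndGradient(w: Double): (Double, Double) =
  partitions
    .map { part =>
      part.foldLeft((0.0, 0.0)) { case ((l, g), (x, y)) =>
        val err = w * x - y
        (l + 0.5 * err * err, g + err * x)
      }
    }
    .reduce((a, b) => (a._1 + b._1, a._2 + b._2))

// Plain gradient descent stands in for the driver-side L-BFGS updates;
// only (loss, gradient) ever crosses the "cluster" boundary.
var w = 0.0
for (_ <- 1 to 100) { val (_, g) = lossAndGradient(w); w -= 0.05 * g }
```

The point of this shape is exactly David's question above: the optimizer never leaves the driver, so nothing about L-BFGS itself needs to be serializable; only the closure that computes the partial sums ships to the workers.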
Re: Spark Streaming and Flume Avro RPC Servers
Cool. I'll look at making the code change in FlumeUtils and generating a pull request. As for the use case: the volume of messages we have is currently about 30 MB per second, which may grow beyond what a 1 Gbit network adapter can handle. - Christophe On Apr 7, 2014 1:51 PM, "Michael Ernest" wrote: > I don't see why not. If one were doing something similar with straight > Flume, you'd start an agent on each node you care to receive Avro/RPC > events. In the absence of clearer insight to your use case, I'm puzzling > just a little why it's necessary for each Worker to be its own receiver, > but there's no real objection or concern to fuel the puzzlement, just > curiosity. > > > On Mon, Apr 7, 2014 at 4:16 PM, Christophe Clapp > wrote: > > > Could it be as simple as just changing FlumeUtils to accept a list of > > host/port number pairs to start the RPC servers on? > > > > > > > > On 4/7/14, 12:58 PM, Christophe Clapp wrote: > > > >> Based on the source code here: > >> https://github.com/apache/spark/blob/master/external/ > >> flume/src/main/scala/org/apache/spark/streaming/flume/FlumeUtils.scala > >> > >> It looks like in its current version, FlumeUtils does not support > >> starting an Avro RPC server on more than one worker. > >> > >> - Christophe > >> > >> On 4/7/14, 12:23 PM, Michael Ernest wrote: > >> > >>> You can configure your sinks to write to one or more Avro sources in a > >>> load-balanced configuration. > >>> > >>> https://flume.apache.org/FlumeUserGuide.html#flume-sink-processors > >>> > >>> mfe > >>> > >>> > >>> On Mon, Apr 7, 2014 at 3:19 PM, Christophe Clapp > >>> wrote: > >>> > >>> Hi, > > From my testing of Spark Streaming with Flume, it seems that there's > only > one of the Spark worker nodes that runs a Flume Avro RPC server to > receive > messages at any given time, as opposed to every Spark worker running > an > Avro RPC server to receive messages. Is this the case? 
Our use-case > would > benefit from balancing the load across Workers because of our volume > of > messages. We would be using a load balancer in front of the Spark > workers > running the Avro RPC servers, essentially round-robinning the messages > across all of them. > > If this is something that is currently not supported, I'd be > interested > in > contributing to the code to make it happen. > > - Christophe > > > >>> > >>> > >> > > > > > -- > Michael Ernest > Sr. Solutions Consultant > West Coast >
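The load-balancing setup Christophe describes, a balancer round-robinning messages across Avro RPC receivers on the workers, looks roughly like the following; `Receiver` and `RoundRobin` here are plain illustrative stand-ins, not Flume or Spark classes:

```scala
import scala.collection.mutable.ArrayBuffer

// Stand-in for an Avro RPC receiver running on a worker: just records
// what it was handed. Host/port values are illustrative.
final class Receiver(val host: String, val port: Int) {
  val received = ArrayBuffer.empty[String]
  def submit(event: String): Unit = received += event
}

// A trivial round-robin load balancer in front of the receivers,
// standing in for the external balancer described in the thread.
final class RoundRobin(receivers: Seq[Receiver]) {
  private var next = 0
  def dispatch(event: String): Unit = {
    receivers(next).submit(event)
    next = (next + 1) % receivers.size
  }
}

val workers = Seq(
  new Receiver("worker1", 4141),
  new Receiver("worker2", 4141),
  new Receiver("worker3", 4141))
val lb = new RoundRobin(workers)
(1 to 9).foreach(i => lb.dispatch(s"event-$i"))
```

With nine events and three receivers, each receiver ends up with exactly three, which is the even spread the thread is after.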
Re: Spark Streaming and Flume Avro RPC Servers
I don't see why not. If one were doing something similar with straight Flume, you'd start an agent on each node you care to receive Avro/RPC events. In the absence of clearer insight to your use case, I'm puzzling just a little why it's necessary for each Worker to be its own receiver, but there's no real objection or concern to fuel the puzzlement, just curiosity. On Mon, Apr 7, 2014 at 4:16 PM, Christophe Clapp wrote: > Could it be as simple as just changing FlumeUtils to accept a list of > host/port number pairs to start the RPC servers on? > > > > On 4/7/14, 12:58 PM, Christophe Clapp wrote: > >> Based on the source code here: >> https://github.com/apache/spark/blob/master/external/ >> flume/src/main/scala/org/apache/spark/streaming/flume/FlumeUtils.scala >> >> It looks like in its current version, FlumeUtils does not support >> starting an Avro RPC server on more than one worker. >> >> - Christophe >> >> On 4/7/14, 12:23 PM, Michael Ernest wrote: >> >>> You can configure your sinks to write to one or more Avro sources in a >>> load-balanced configuration. >>> >>> https://flume.apache.org/FlumeUserGuide.html#flume-sink-processors >>> >>> mfe >>> >>> >>> On Mon, Apr 7, 2014 at 3:19 PM, Christophe Clapp >>> wrote: >>> >>> Hi, From my testing of Spark Streaming with Flume, it seems that there's only one of the Spark worker nodes that runs a Flume Avro RPC server to receive messages at any given time, as opposed to every Spark worker running an Avro RPC server to receive messages. Is this the case? Our use-case would benefit from balancing the load across Workers because of our volume of messages. We would be using a load balancer in front of the Spark workers running the Avro RPC servers, essentially round-robinning the messages across all of them. If this is something that is currently not supported, I'd be interested in contributing to the code to make it happen. - Christophe >>> >>> >> > -- Michael Ernest Sr. Solutions Consultant West Coast
Re: Spark Streaming and Flume Avro RPC Servers
Could it be as simple as just changing FlumeUtils to accept a list of host/port number pairs to start the RPC servers on? On 4/7/14, 12:58 PM, Christophe Clapp wrote: Based on the source code here: https://github.com/apache/spark/blob/master/external/flume/src/main/scala/org/apache/spark/streaming/flume/FlumeUtils.scala It looks like in its current version, FlumeUtils does not support starting an Avro RPC server on more than one worker. - Christophe On 4/7/14, 12:23 PM, Michael Ernest wrote: You can configure your sinks to write to one or more Avro sources in a load-balanced configuration. https://flume.apache.org/FlumeUserGuide.html#flume-sink-processors mfe On Mon, Apr 7, 2014 at 3:19 PM, Christophe Clapp wrote: Hi, From my testing of Spark Streaming with Flume, it seems that there's only one of the Spark worker nodes that runs a Flume Avro RPC server to receive messages at any given time, as opposed to every Spark worker running an Avro RPC server to receive messages. Is this the case? Our use-case would benefit from balancing the load across Workers because of our volume of messages. We would be using a load balancer in front of the Spark workers running the Avro RPC servers, essentially round-robinning the messages across all of them. If this is something that is currently not supported, I'd be interested in contributing to the code to make it happen. - Christophe
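The change floated here, FlumeUtils accepting a list of host/port pairs, might start from something as small as the following; the `ReceiverSpec` type and `parseAddresses` helper are hypothetical sketches of the shape such an API could take, not Spark's actual FlumeUtils API:

```scala
// Hypothetical: one receiver would be started per (host, port) entry.
case class ReceiverSpec(host: String, port: Int)

// Parse a comma-separated "host:port" list into receiver specs.
def parseAddresses(spec: String): Seq[ReceiverSpec] =
  spec.split(",").toSeq.map { a =>
    val Array(h, p) = a.trim.split(":")
    ReceiverSpec(h, p.toInt)
  }

val specs = parseAddresses("worker1:4141, worker2:4141, worker3:4142")
```

The actual change would then create one Avro RPC stream per spec and union them, instead of the single host/port FlumeUtils takes today.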
Re: Spark Streaming and Flume Avro RPC Servers
Based on the source code here: https://github.com/apache/spark/blob/master/external/flume/src/main/scala/org/apache/spark/streaming/flume/FlumeUtils.scala It looks like in its current version, FlumeUtils does not support starting an Avro RPC server on more than one worker. - Christophe On 4/7/14, 12:23 PM, Michael Ernest wrote: You can configure your sinks to write to one or more Avro sources in a load-balanced configuration. https://flume.apache.org/FlumeUserGuide.html#flume-sink-processors mfe On Mon, Apr 7, 2014 at 3:19 PM, Christophe Clapp wrote: Hi, From my testing of Spark Streaming with Flume, it seems that there's only one of the Spark worker nodes that runs a Flume Avro RPC server to receive messages at any given time, as opposed to every Spark worker running an Avro RPC server to receive messages. Is this the case? Our use-case would benefit from balancing the load across Workers because of our volume of messages. We would be using a load balancer in front of the Spark workers running the Avro RPC servers, essentially round-robinning the messages across all of them. If this is something that is currently not supported, I'd be interested in contributing to the code to make it happen. - Christophe
Re: Spark Streaming and Flume Avro RPC Servers
Right, but at least in my case, an Avro RPC server was started on only one of the Spark worker nodes. I don't know whether that's just a configuration issue with my setup or expected behavior. I would need Spark to start Avro RPC servers on every worker rather than just one. - Christophe On Apr 7, 2014 12:24 PM, "Michael Ernest" wrote: > You can configure your sinks to write to one or more Avro sources in a > load-balanced configuration. > > https://flume.apache.org/FlumeUserGuide.html#flume-sink-processors > > mfe > > > On Mon, Apr 7, 2014 at 3:19 PM, Christophe Clapp > wrote: > > > Hi, > > > > From my testing of Spark Streaming with Flume, it seems that there's only > > one of the Spark worker nodes that runs a Flume Avro RPC server to > receive > > messages at any given time, as opposed to every Spark worker running an > > Avro RPC server to receive messages. Is this the case? Our use-case would > > benefit from balancing the load across Workers because of our volume of > > messages. We would be using a load balancer in front of the Spark workers > > running the Avro RPC servers, essentially round-robinning the messages > > across all of them. > > > > If this is something that is currently not supported, I'd be interested > in > > contributing to the code to make it happen. > > > > - Christophe > > > > > > -- > Michael Ernest > Sr. Solutions Consultant > West Coast >
Re: Spark Streaming and Flume Avro RPC Servers
You can configure your sinks to write to one or more Avro sources in a load-balanced configuration. https://flume.apache.org/FlumeUserGuide.html#flume-sink-processors mfe On Mon, Apr 7, 2014 at 3:19 PM, Christophe Clapp wrote: > Hi, > > From my testing of Spark Streaming with Flume, it seems that there's only > one of the Spark worker nodes that runs a Flume Avro RPC server to receive > messages at any given time, as opposed to every Spark worker running an > Avro RPC server to receive messages. Is this the case? Our use-case would > benefit from balancing the load across Workers because of our volume of > messages. We would be using a load balancer in front of the Spark workers > running the Avro RPC servers, essentially round-robinning the messages > across all of them. > > If this is something that is currently not supported, I'd be interested in > contributing to the code to make it happen. > > - Christophe > -- Michael Ernest Sr. Solutions Consultant West Coast
Spark Streaming and Flume Avro RPC Servers
Hi, From my testing of Spark Streaming with Flume, it seems that there's only one of the Spark worker nodes that runs a Flume Avro RPC server to receive messages at any given time, as opposed to every Spark worker running an Avro RPC server to receive messages. Is this the case? Our use-case would benefit from balancing the load across Workers because of our volume of messages. We would be using a load balancer in front of the Spark workers running the Avro RPC servers, essentially round-robinning the messages across all of them. If this is something that is currently not supported, I'd be interested in contributing to the code to make it happen. - Christophe
Re: ALS array index out of bound with 50 factors
Hi Deb, It would be helpful if you could attach the logs. It is strange to see that you can run 4 iterations but not 10. Xiangrui On Mon, Apr 7, 2014 at 10:36 AM, Debasish Das wrote: > I am using master... > > No negative indexes... > > If I run with 4 iterations it runs fine and I can generate factors... > > With 10 iterations run fails with array index out of bound... > > 25m users and 3m products are within int limits > > Does it help if I can point the logs for both the runs to you ? > > I will debug it further today... > On Apr 7, 2014 9:54 AM, "Xiangrui Meng" wrote: > >> Hi Deb, >> >> This thread is for the out-of-bound error you described. I don't think >> the number of iterations has any effect here. My questions were: >> >> 1) Are you using the master branch or a particular commit? >> >> 2) Do you have negative or out-of-integer-range user or product ids? >> Try to print out the max/min value of user/product ids. >> >> Best, >> Xiangrui >> >> On Sun, Apr 6, 2014 at 11:01 PM, Debasish Das >> wrote: >> > Hi Xiangrui, >> > >> > With 4 ALS iterations it runs fine...If I run 10 I am failing...I >> believe I >> > have to cut the lineage chain and call checkpoint. Trying to follow the >> > other email chain on checkpointing... >> > >> > Thanks. >> > Deb >> > >> > >> > On Sun, Apr 6, 2014 at 9:08 PM, Xiangrui Meng wrote: >> > >> >> Hi Deb, >> >> >> >> Are you using the master branch or a particular commit? Do you have >> >> negative or out-of-integer-range user or product ids? There is an >> >> issue with ALS' partitioning >> >> (https://spark-project.atlassian.net/browse/SPARK-1281), but I'm not >> >> sure whether that is the reason. Could you try to see whether you can >> >> reproduce the error on a public data set, e.g., movielens? Thanks! 
>> >> >> >> Best, >> >> Xiangrui >> >> >> >> On Sat, Apr 5, 2014 at 10:53 PM, Debasish Das > > >> >> wrote: >> >> > Hi, >> >> > >> >> > I deployed apache/spark master today and recently there were many ALS >> >> > related checkins and enhancements.. >> >> > >> >> > I am running ALS with explicit feedback and I remember most >> enhancements >> >> > were related to implicit feedback... >> >> > >> >> > With 25 factors my runs were successful but with 50 factors I am >> getting >> >> > array index out of bound... >> >> > >> >> > Note that I was hitting gc errors before with an older version of >> spark >> >> but >> >> > it seems like the sparse matrix partitioning scheme has changed >> >> now...data >> >> > caching looks much balanced now...earlier one node was becoming >> >> > bottleneck...Although I ran with 64g memory per node... >> >> > >> >> > There are around 3M products, 25M users... >> >> > >> >> > Anyone noticed this bug or something similar ? >> >> > >> >> > 14/04/05 23:03:15 WARN TaskSetManager: Loss was due to >> >> > java.lang.ArrayIndexOutOfBoundsException >> >> > java.lang.ArrayIndexOutOfBoundsException: 81029 >> >> > at >> >> > >> >> >> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1$$anonfun$apply$mcVI$sp$1.apply$mcVI$sp(ALS.scala:450) >> >> > at >> scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) >> >> > at >> >> > >> >> >> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:446) >> >> > at >> scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141) >> >> > at org.apache.spark.mllib.recommendation.ALS.org >> >> > $apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:445) >> >> > at >> >> > >> >> >> org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:416) >> >> > at >> >> > >> >> >> 
org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:415) >> >> > at >> >> > >> >> >> org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) >> >> > at >> >> > >> >> >> org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31) >> >> > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) >> >> > at >> >> > >> >> >> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:149) >> >> > at >> >> > >> >> >> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:147) >> >> > at >> >> > >> >> >> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) >> >> > at >> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) >> >> > at >> org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:147) >> >> > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:229) >> >> > at org.apache.spark.rdd.RDD.iterator(RDD.scala:220) >> >> > at >> >> > org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31) >> >> > at o
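Xiangrui's question about negative ids is motivated by a concrete failure mode: `%` in Scala (and Java) keeps the sign of the dividend, so a negative id produces a negative block index, and indexing a blocks array with it throws exactly an ArrayIndexOutOfBoundsException. A sketch of the pitfall and the usual non-negative-mod fix (illustrative only; this is not ALS's actual partitioning code, and not necessarily the cause of this particular failure):

```scala
val numBlocks = 4
val blocks = Array.fill(numBlocks)(List.empty[Int])

// Pitfall: Scala's % keeps the dividend's sign, e.g. -7 % 4 == -3,
// so blocks(naiveBlock(-7)) would throw ArrayIndexOutOfBoundsException.
def naiveBlock(id: Int): Int = id % numBlocks

// Fix: fold the remainder back into [0, numBlocks).
def nonNegativeBlock(id: Int): Int = ((id % numBlocks) + numBlocks) % numBlocks

val badIndex = naiveBlock(-7)       // -3: unusable as an array index
val okIndex  = nonNegativeBlock(-7) // 1: safe
blocks(okIndex) = -7 :: blocks(okIndex)
```

Printing the min/max of the user and product ids, as Xiangrui suggests, is the quickest way to rule this class of bug in or out.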
Re: Flaky streaming tests
I agree these should be disabled right away, and the JIRA can be used to track fixing / turning them back on. On Mon, Apr 7, 2014 at 11:33 AM, Michael Armbrust wrote: > There is a JIRA for one of the flakey tests here: > https://issues.apache.org/jira/browse/SPARK-1409 > > > On Mon, Apr 7, 2014 at 11:32 AM, Patrick Wendell wrote: >> TD - do you know what is going on here? >> >> I looked into this a bit and at least a few of these use >> Thread.sleep() and assume the sleep will be exact, which is wrong. We >> should disable all the tests that do and probably they should be >> re-written >> to virtualize time. >> >> - Patrick >> >> >> On Mon, Apr 7, 2014 at 10:52 AM, Kay Ousterhout > >wrote: >> >> > Hi all, >> > >> > The InputStreamsSuite seems to have some serious flakiness issues -- >> I've >> > seen the file input stream fail many times and now I'm seeing some actor >> > input stream test failures ( >> > >> > >> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13846/consoleFull >> > ) >> > on what I think is an unrelated change. Does anyone know anything about >> > these? Should we just remove some of these tests since they seem to be >> > constantly failing? >> > >> > -Kay >> > >> > >
Re: Flaky streaming tests
Yes, I will take a look at those tests ASAP. TD On Mon, Apr 7, 2014 at 11:32 AM, Patrick Wendell wrote: > TD - do you know what is going on here? > > I looked into this ab it and at least a few of these that use > Thread.sleep() and assume the sleep will be exact, which is wrong. We > should disable all the tests that do and probably they should be re-written > to virtualize time. > > - Patrick > > > On Mon, Apr 7, 2014 at 10:52 AM, Kay Ousterhout >wrote: > > > Hi all, > > > > The InputStreamsSuite seems to have some serious flakiness issues -- I've > > seen the file input stream fail many times and now I'm seeing some actor > > input stream test failures ( > > > > > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13846/consoleFull > > ) > > on what I think is an unrelated change. Does anyone know anything about > > these? Should we just remove some of these tests since they seem to be > > constantly failing? > > > > -Kay > > >
Re: Flaky streaming tests
There is a JIRA for one of the flakey tests here: https://issues.apache.org/jira/browse/SPARK-1409 On Mon, Apr 7, 2014 at 11:32 AM, Patrick Wendell wrote: > TD - do you know what is going on here? > > I looked into this ab it and at least a few of these that use > Thread.sleep() and assume the sleep will be exact, which is wrong. We > should disable all the tests that do and probably they should be re-written > to virtualize time. > > - Patrick > > > On Mon, Apr 7, 2014 at 10:52 AM, Kay Ousterhout >wrote: > > > Hi all, > > > > The InputStreamsSuite seems to have some serious flakiness issues -- I've > > seen the file input stream fail many times and now I'm seeing some actor > > input stream test failures ( > > > > > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13846/consoleFull > > ) > > on what I think is an unrelated change. Does anyone know anything about > > these? Should we just remove some of these tests since they seem to be > > constantly failing? > > > > -Kay > > >
Re: Flaky streaming tests
TD - do you know what is going on here? I looked into this a bit and at least a few of these tests use Thread.sleep() and assume the sleep will be exact, which is wrong. We should disable all the tests that do, and they should probably be re-written to virtualize time. - Patrick On Mon, Apr 7, 2014 at 10:52 AM, Kay Ousterhout wrote: > Hi all, > > The InputStreamsSuite seems to have some serious flakiness issues -- I've > seen the file input stream fail many times and now I'm seeing some actor > input stream test failures ( > > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13846/consoleFull > ) > on what I think is an unrelated change. Does anyone know anything about > these? Should we just remove some of these tests since they seem to be > constantly failing? > > -Kay >
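Patrick's suggestion to "virtualize time" can be sketched as follows. This is a minimal illustration in Python (the class and method names are hypothetical, not Spark's streaming API): the test advances a manual clock explicitly, so a trigger fires deterministically no matter how loaded the build machine is.

```python
class ManualClock:
    """A clock the test advances explicitly, instead of Thread.sleep()."""
    def __init__(self, start_ms=0):
        self.now_ms = start_ms
        self.waiters = []  # (wake_time_ms, callback) pairs

    def schedule(self, delay_ms, callback):
        """Register work to run once the clock reaches now + delay_ms."""
        self.waiters.append((self.now_ms + delay_ms, callback))

    def advance(self, delta_ms):
        """Move virtual time forward and fire anything that came due."""
        self.now_ms += delta_ms
        due = [w for w in self.waiters if w[0] <= self.now_ms]
        self.waiters = [w for w in self.waiters if w[0] > self.now_ms]
        for _, callback in sorted(due, key=lambda w: w[0]):
            callback()

fired = []
clock = ManualClock()
clock.schedule(500, lambda: fired.append("batch-1"))
clock.advance(499)   # not yet due: nothing fires, and nothing actually waits
clock.advance(1)     # exactly due: fires deterministically
```

A busy Jenkins machine cannot perturb such a test, because nothing in it depends on how long a real sleep took.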
Re: Flaky streaming tests
I've run into this issue when Jenkins seems to be very busy. On Monday, April 7, 2014, Kay Ousterhout wrote: > Hi all, > > The InputStreamsSuite seems to have some serious flakiness issues -- I've > seen the file input stream fail many times and now I'm seeing some actor > input stream test failures ( > > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13846/consoleFull > ) > on what I think is an unrelated change. Does anyone know anything about > these? Should we just remove some of these tests since they seem to be > constantly failing? > > -Kay >
Flaky streaming tests
Hi all, The InputStreamsSuite seems to have some serious flakiness issues -- I've seen the file input stream fail many times and now I'm seeing some actor input stream test failures ( https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13846/consoleFull) on what I think is an unrelated change. Does anyone know anything about these? Should we just remove some of these tests since they seem to be constantly failing? -Kay
Re: ALS array index out of bound with 50 factors
I am using master... No negative indexes... If I run with 4 iterations it runs fine and I can generate factors... With 10 iterations the run fails with array index out of bound... 25m users and 3m products are within int limits. Does it help if I can point the logs for both the runs to you ? I will debug it further today... On Apr 7, 2014 9:54 AM, "Xiangrui Meng" wrote: > Hi Deb, > > This thread is for the out-of-bound error you described. I don't think > the number of iterations has any effect here. My questions were: > > 1) Are you using the master branch or a particular commit? > > 2) Do you have negative or out-of-integer-range user or product ids? > Try to print out the max/min value of user/product ids. > > Best, > Xiangrui > > On Sun, Apr 6, 2014 at 11:01 PM, Debasish Das > wrote: > > Hi Xiangrui, > > > > With 4 ALS iterations it runs fine...If I run 10 I am failing...I > believe I > > have to cut the lineage chain and call checkpoint...Trying to follow the > > other email chain on checkpointing... > > > > Thanks. > > Deb > > > > > > On Sun, Apr 6, 2014 at 9:08 PM, Xiangrui Meng wrote: > > > >> Hi Deb, > >> > >> Are you using the master branch or a particular commit? Do you have > >> negative or out-of-integer-range user or product ids? There is an > >> issue with ALS' partitioning > >> (https://spark-project.atlassian.net/browse/SPARK-1281), but I'm not > >> sure whether that is the reason. Could you try to see whether you can > >> reproduce the error on a public data set, e.g., movielens? Thanks! > >> > >> Best, > >> Xiangrui > >> > >> On Sat, Apr 5, 2014 at 10:53 PM, Debasish Das > > >> wrote: > >> > Hi, > >> > > >> > I deployed apache/spark master today and recently there were many ALS > >> > related checkins and enhancements.. > >> > > >> > I am running ALS with explicit feedback and I remember most enhancements > >> > were related to implicit feedback...
> >> >
> >> > With 25 factors my runs were successful but with 50 factors I am getting
> >> > array index out of bound...
> >> >
> >> > Note that I was hitting gc errors before with an older version of spark but
> >> > it seems like the sparse matrix partitioning scheme has changed now...data
> >> > caching looks much balanced now...earlier one node was becoming
> >> > bottleneck...Although I ran with 64g memory per node...
> >> >
> >> > There are around 3M products, 25M users...
> >> >
> >> > Anyone noticed this bug or something similar ?
> >> >
> >> > 14/04/05 23:03:15 WARN TaskSetManager: Loss was due to
> >> > java.lang.ArrayIndexOutOfBoundsException
> >> > java.lang.ArrayIndexOutOfBoundsException: 81029
> >> > at org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1$$anonfun$apply$mcVI$sp$1.apply$mcVI$sp(ALS.scala:450)
> >> > at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
> >> > at org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:446)
> >> > at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
> >> > at org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:445)
> >> > at org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:416)
> >> > at org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:415)
> >> > at org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
> >> > at org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
> >> > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> >> > at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:149)
> >> > at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:147)
> >> > at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> >> > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> >> > at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:147)
> >> > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:229)
> >> > at org.apache.spark.rdd.RDD.iterator(RDD.scala:220)
> >> > at org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
> >> > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:229)
> >> > at org.apache.spark.rdd.RDD.iterator(RDD.scala:220)
> >> > at org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
> >> > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:229)
> >> > at org.apach
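Deb's observation that 4 iterations succeed while 10 fail is consistent with the lineage of the iterated RDD growing one stage per iteration until task serialization or stack depth gives out. The remedy he mentions, checkpointing to cut the lineage chain, amounts to the following cadence (a toy Python sketch of the bookkeeping only, not Spark code; the interval of 3 is illustrative):

```python
def max_lineage_depth(num_iters, checkpoint_interval):
    """Track how deep the lineage gets if we checkpoint every k iterations."""
    depth = 0       # stages accumulated since the last materialized dataset
    deepest = 0
    for i in range(1, num_iters + 1):
        depth += 1                      # each iteration adds a stage to the lineage
        deepest = max(deepest, depth)
        if i % checkpoint_interval == 0:
            depth = 0                   # a checkpoint materializes data, truncating lineage
    return deepest

# Without checkpointing, 10 iterations means a 10-deep lineage;
# checkpointing every 3 iterations keeps it bounded.
assert max_lineage_depth(10, checkpoint_interval=10**9) == 10
assert max_lineage_depth(10, checkpoint_interval=3) == 3
```

The bounded depth is why cutting the chain periodically lets long runs behave like short ones.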
Re: ALS array index out of bound with 50 factors
Hi Deb, This thread is for the out-of-bound error you described. I don't think the number of iterations has any effect here. My questions were: 1) Are you using the master branch or a particular commit? 2) Do you have negative or out-of-integer-range user or product ids? Try to print out the max/min value of user/product ids. Best, Xiangrui On Sun, Apr 6, 2014 at 11:01 PM, Debasish Das wrote: > Hi Xiangrui, > > With 4 ALS iterations it runs fine...If I run 10 I am failing...I believe I > have to cut the lineage chain and call checkpointTrying to follow the > other email chain on checkpointing... > > Thanks. > Deb > > > On Sun, Apr 6, 2014 at 9:08 PM, Xiangrui Meng wrote: > >> Hi Deb, >> >> Are you using the master branch or a particular commit? Do you have >> negative or out-of-integer-range user or product ids? There is an >> issue with ALS' partitioning >> (https://spark-project.atlassian.net/browse/SPARK-1281), but I'm not >> sure whether that is the reason. Could you try to see whether you can >> reproduce the error on a public data set, e.g., movielens? Thanks! >> >> Best, >> Xiangrui >> >> On Sat, Apr 5, 2014 at 10:53 PM, Debasish Das >> wrote: >> > Hi, >> > >> > I deployed apache/spark master today and recently there were many ALS >> > related checkins and enhancements.. >> > >> > I am running ALS with explicit feedback and I remember most enhancements >> > were related to implicit feedback... >> > >> > With 25 factors my runs were successful but with 50 factors I am getting >> > array index out of bound... >> > >> > Note that I was hitting gc errors before with an older version of spark >> but >> > it seems like the sparse matrix partitioning scheme has changed >> now...data >> > caching looks much balanced now...earlier one node was becoming >> > bottleneck...Although I ran with 64g memory per node... >> > >> > There are around 3M products, 25M users... >> > >> > Anyone noticed this bug or something similar ? 
>> > [stack trace elided; identical to the ArrayIndexOutOfBoundsException trace quoted earlier in the thread]
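Xiangrui's question about negative or out-of-range ids is pointed because block-partitioned code typically derives a block or array index with the JVM's `%` operator, whose result takes the sign of the dividend: a negative id (for example, a long id truncated to Int) then yields a negative index and exactly this kind of ArrayIndexOutOfBoundsException. A Python sketch of the failure mode and the standard non-negative-mod guard (the helper names here are illustrative, not Spark's):

```python
def jvm_mod(a, n):
    """Emulate Java's % operator, whose result takes the sign of the dividend."""
    r = abs(a) % n
    return -r if a < 0 else r

def non_negative_mod(a, n):
    """Always returns an index in [0, n), safe for array lookup."""
    r = jvm_mod(a, n)
    return r + n if r < 0 else r

num_blocks = 8
user_id = -3                                      # e.g. an id that wrapped around Int range
assert jvm_mod(user_id, num_blocks) == -3         # invalid as an array index in Java
assert non_negative_mod(user_id, num_blocks) == 5 # valid block index
```

Printing the max/min ids, as suggested, is the quickest way to tell whether this is the failure being hit.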
Re: Contributing to Spark
This is a good place to start: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark Sujeet On Mon, Apr 7, 2014 at 9:20 AM, Mukesh G wrote: > Hi, > > How do I contribute to Spark and its associated projects? > > Appreciate the help... > > Thanks > > Mukesh >
Contributing to Spark
Hi, How do I contribute to Spark and its associated projects? Appreciate the help... Thanks Mukesh
Re: tachyon dependency
Tachyon has been Java 6 compatible since version 0.4. Besides putting input/output data in Tachyon ( http://tachyon-project.org/Running-Spark-on-Tachyon.html ), Spark applications can also persist data into Tachyon ( https://github.com/apache/spark/blob/master/docs/scala-programming-guide.md ). On Mon, Apr 7, 2014 at 7:42 AM, Koert Kuipers wrote: > i noticed there is a dependency on tachyon in spark core 1.0.0-SNAPSHOT. > how does that work? i believe tachyon is written in java 7, yet spark > claims to be java 6 compatible. > -- Haoyuan Li Algorithms, Machines, People Lab, EECS, UC Berkeley http://www.cs.berkeley.edu/~haoyuan/
tachyon dependency
i noticed there is a dependency on tachyon in spark core 1.0.0-SNAPSHOT. how does that work? i believe tachyon is written in java 7, yet spark claims to be java 6 compatible.
Re: ALS array index out of bound with 50 factors
Nick, I already have this code which calls dictionary generation and then maps strings etc. to ints...I think the core algorithm should stay in ints...if you like I can add this code in MFUtils.scala...that's the convention I followed, similar to MLUtils.scala...actually these functions should even be made part of MLUtils.scala... Only thing is that the join should be an option, which makes it application dependent...sometimes people would like to do map side joins if their dictionaries are small...in my case the user dictionary has 25M rows and the product dictionary has 3M rows...so the join optimization did not help... Thanks. Deb On Mon, Apr 7, 2014 at 6:57 AM, Nick Pentreath wrote: > On the partitioning / id keys. If we would look at hash partitioning, how > feasible will it be to just allow the user and item ids to be strings? A > lot of the time these ids are strings anyway (UUIDs and so on), and it's > really painful to translate between String <-> Int the whole time. > > Are there any obvious blockers to this? I am a bit rusty on the ALS code > but from a quick scan I think this may work. Performance may be an issue > with large String keys... Any major issues/objections to this thinking? > > I may be able to find time to take a stab at this if there is demand. > > > On Mon, Apr 7, 2014 at 6:08 AM, Xiangrui Meng wrote: > > > Hi Deb, > > > > Are you using the master branch or a particular commit? Do you have > > negative or out-of-integer-range user or product ids? There is an > > issue with ALS' partitioning > > (https://spark-project.atlassian.net/browse/SPARK-1281), but I'm not > > sure whether that is the reason. Could you try to see whether you can > > reproduce the error on a public data set, e.g., movielens? Thanks! > > > > Best, > > Xiangrui > > > > On Sat, Apr 5, 2014 at 10:53 PM, Debasish Das > > wrote: > > > Hi, > > > > > > I deployed apache/spark master today and recently there were many ALS > > > related checkins and enhancements..
> > > > > > I am running ALS with explicit feedback and I remember most enhancements > > > were related to implicit feedback... > > > > > > With 25 factors my runs were successful but with 50 factors I am getting > > > array index out of bound... > > > > > > Note that I was hitting gc errors before with an older version of spark but > > > it seems like the sparse matrix partitioning scheme has changed now...data > > > caching looks much balanced now...earlier one node was becoming > > > bottleneck...Although I ran with 64g memory per node... > > > > > > There are around 3M products, 25M users... > > > > > > Anyone noticed this bug or something similar ? > > > [stack trace elided; identical to the ArrayIndexOutOfBoundsException trace quoted earlier in the thread]
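The dictionary pass Deb describes, translating string ids to a dense Int range before ALS and back afterwards, looks roughly like this single-machine sketch (in Spark it would be a distributed job over the distinct ids; `MFUtils.scala` is Deb's own utility file, and all names below are illustrative):

```python
def build_dictionary(ids):
    """Assign each distinct id a dense index in [0, n), in first-seen order."""
    index = {}
    for i in ids:
        if i not in index:
            index[i] = len(index)
    return index

ratings = [("user-a", "prod-x", 5.0), ("user-b", "prod-x", 3.0), ("user-a", "prod-y", 1.0)]
users = build_dictionary(u for u, _, _ in ratings)
products = build_dictionary(p for _, p, _ in ratings)

# Encode for the int-based core algorithm.
encoded = [(users[u], products[p], r) for u, p, r in ratings]

# The reverse mapping recovers the original ids after training.
reverse_users = {v: k for k, v in users.items()}
```

The join-vs-broadcast question Deb raises is about this encoding step: with dictionaries of 25M and 3M rows, a map-side (broadcast) join is no longer attractive and a shuffle join is needed.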
Re: ALS array index out of bound with 50 factors
On the partitioning / id keys. If we would look at hash partitioning, how feasible will it be to just allow the user and item ids to be strings? A lot of the time these ids are strings anyway (UUIDs and so on), and it's really painful to translate between String <-> Int the whole time. Are there any obvious blockers to this? I am a bit rusty on the ALS code but from a quick scan I think this may work. Performance may be an issue with large String keys... Any major issues/objections to this thinking? I may be able to find time to take a stab at this if there is demand. On Mon, Apr 7, 2014 at 6:08 AM, Xiangrui Meng wrote: > Hi Deb, > > Are you using the master branch or a particular commit? Do you have > negative or out-of-integer-range user or product ids? There is an > issue with ALS' partitioning > (https://spark-project.atlassian.net/browse/SPARK-1281), but I'm not > sure whether that is the reason. Could you try to see whether you can > reproduce the error on a public data set, e.g., movielens? Thanks! > > Best, > Xiangrui > > On Sat, Apr 5, 2014 at 10:53 PM, Debasish Das > wrote: > > Hi, > > > > I deployed apache/spark master today and recently there were many ALS > > related checkins and enhancements.. > > > > I am running ALS with explicit feedback and I remember most enhancements > > were related to implicit feedback... > > > > With 25 factors my runs were successful but with 50 factors I am getting > > array index out of bound... > > > > Note that I was hitting gc errors before with an older version of spark > but > > it seems like the sparse matrix partitioning scheme has changed > now...data > > caching looks much balanced now...earlier one node was becoming > > bottleneck...Although I ran with 64g memory per node... > > > > There are around 3M products, 25M users... > > > > Anyone noticed this bug or something similar ?
> > [stack trace elided; identical to the ArrayIndexOutOfBoundsException trace quoted earlier in the thread]
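Nick's alternative, hash-partitioning the raw string ids instead of translating them to Ints, can be sketched as below (Python for illustration; a deterministic digest is used because a salted or runtime-dependent hash would place the same id in different blocks across runs). The trade-off he notes is real: every shuffle then moves full UUID-sized keys instead of 4-byte Ints.

```python
import hashlib

def block_of(key, num_blocks):
    """Deterministically place a string id into a block in [0, num_blocks)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_blocks

# Same id always lands in the same block, with no String -> Int dictionary pass.
blocks = {}
for uid in ["0f8fad5b-d9cb-469f-a165-70867728950e", "user-42", "user-43"]:
    blocks.setdefault(block_of(uid, 4), []).append(uid)
```

Because the digest is non-negative by construction, this also sidesteps the negative-index hazard discussed earlier in the thread.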