Re: Problems with Pyspark + Dill tests

2014-06-25 Thread Josh Rosen
The problem seems to be that unpicklable RDD objects are being pulled into function closures. In your failing doctests, it looks like the rdd created through sc.parallelize is being pulled into the map lambda’s function closure. I opened a new Dill bug with a small test case that reproduces this
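To illustrate the closure capture described in this message, here is a minimal sketch that needs neither Spark nor Dill. FakeRDD, closure_contents, make_bad_mapper, make_good_mapper and is_picklable are hypothetical names invented for this example; FakeRDD only assumes that a real RDD refuses to be pickled, and the nested functions stand in for the doctest setup, which may differ.

```python
import pickle


class FakeRDD:
    """Hypothetical stand-in for an RDD: it refuses to be pickled."""

    def __init__(self, data):
        self.data = data

    def __reduce__(self):
        raise pickle.PicklingError("FakeRDD cannot be pickled")


def closure_contents(fn):
    """Map each free variable name of fn to the object it captured."""
    names = fn.__code__.co_freevars
    cells = fn.__closure__ or ()
    return {name: cell.cell_contents for name, cell in zip(names, cells)}


def make_bad_mapper():
    rdd = FakeRDD([1, 2, 3])
    # The lambda refers to rdd, so the whole object rides along in its closure.
    return lambda x: x + len(rdd.data)


def make_good_mapper():
    rdd = FakeRDD([1, 2, 3])
    n = len(rdd.data)  # pull out the plain value first
    # Only the integer n is captured; rdd stays out of the closure.
    return lambda x: x + n


def is_picklable(obj):
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False
```

Calling closure_contents(make_bad_mapper()) shows the FakeRDD sitting in the closure, while the second mapper captures only an integer. A serializer that walks closures would have to serialize everything it finds there, which is presumably where an unpicklable object causes trouble.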

Re: Contributing to MLlib on GLM

2014-06-25 Thread Sung Hwan Chung
Well, as you said, MLlib already supports GLM in a sense, except that it only supports two link functions: identity (linear regression) and logit (logistic regression). It should not be too hard to add other link functions, as all you have to do is add a different gradient function for Poisson/Gamma,
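As a rough sketch of what "a different gradient function" could look like for the Poisson case, the snippet below computes the per-example loss and gradient for Poisson regression with a log link. It is plain Python rather than MLlib code; poisson_loss and poisson_gradient are hypothetical names, and the loss drops the constant log(y!) term, which does not affect the gradient.

```python
import math


def _margin(weights, features):
    """Linear predictor w . x for a single example."""
    return sum(w * x for w, x in zip(weights, features))


def poisson_loss(weights, features, label):
    """Negative Poisson log-likelihood (up to a constant) with a log link."""
    margin = _margin(weights, features)
    return math.exp(margin) - label * margin


def poisson_gradient(weights, features, label):
    """Gradient of poisson_loss with respect to the weights.

    With mean = exp(w . x), the gradient is (mean - label) * x.
    """
    mean = math.exp(_margin(weights, features))
    return [(mean - label) * x for x in features]
```

For comparison, the least-squares gradient for a single example has the same shape, (w . x - label) * x, with the identity link in place of exp, which is why swapping the gradient is most of the work.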

Re: Problems with Pyspark + Dill tests

2014-06-25 Thread Mark Baker
Hey,
On Mon, Jun 23, 2014 at 5:27 PM, Mark Baker wrote:
> Thanks for the context, Josh.
>
> I've gone ahead and created a new test case and just opened a new issue;
>
> https://github.com/uqfoundation/dill/issues/49
So that one's dealt with; it was a sys.prefix issue with me using a virtualenv a

Re: balancing RDDs

2014-06-25 Thread Sean McNamara
Yep, exactly! I’m not sure how complicated it would be to pull off. If someone wouldn’t mind pointing me in the right direction, I would be happy to look into it and contribute this functionality. I imagine this would be implemented in the scheduler codebase and there would be some s