Re: SizeEstimator

2018-02-26 Thread Xin Liu
has less free > memory spilling may become more expensive. > > > If the walk is your bottleneck and not GC then I would recommend JOL and > guessing to better predict memory. > > On Mon, Feb 26, 2018, 4:47 PM Xin Liu <xin.e@gmail.com> wrote: > >

Re: SizeEstimator

2018-02-26 Thread Xin Liu
Thanks! Our protobuf object is fairly complex. Even O(N) takes a lot of time. On Mon, Feb 26, 2018 at 6:33 PM, 叶先进 <advance...@gmail.com> wrote: > Hi Xin Liu, > > Could you provide a concrete use case if possible (code to reproduce > protobuf object and comparisons between p

SizeEstimator

2018-02-26 Thread Xin Liu
Hi folks, We have a situation where shuffled data is protobuf-based and SizeEstimator is taking a lot of time. We have tried overriding SizeEstimator to return a constant value, which speeds things up a lot. My question: what is the side effect of disabling SizeEstimator? Is it just spark

Parquet Multiple Output

2015-06-12 Thread Xin Liu
Hi, I have a scenario where I'd like to store an RDD in Parquet format in many files, which correspond to days, such as 2015/01/01, 2015/02/02, etc. So far I used this method http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job to store text files

Re: Compare LogisticRegression results using Mllib with those using other libraries (e.g. statsmodel)

2015-05-22 Thread Xin Liu
:42 PM, Xin Liu liuxin...@gmail.com wrote: Hi, I have tried a few models in Mllib to train a LogisticRegression model. However, I consistently get much better results using other libraries such as statsmodel (which gives similar results to R) in terms of AUC. For illustration purposes, I used

Compare LogisticRegression results using Mllib with those using other libraries (e.g. statsmodel)

2015-05-20 Thread Xin Liu
Hi, I have tried a few models in Mllib to train a LogisticRegression model. However, I consistently get much better results using other libraries such as statsmodel (which gives similar results to R) in terms of AUC. For illustration purposes, I used a small dataset (I have tried much bigger data)