Re: SizeEstimator
Thanks David. Another solution is to convert the protobuf object to a byte array. It does speed up SizeEstimator.

On Mon, Feb 26, 2018 at 5:34 PM, David Capwell <dcapw...@gmail.com> wrote:

> This is used to predict the current cost of memory so Spark knows whether
> to flush or not. This is very costly for us, so we use a flag marked in the
> code as private to lower the cost:
>
> spark.shuffle.spill.numElementsForceSpillThreshold (on phone, hope no
> typo) - how many records before flush
>
> This lowers the cost because it lets us leave data in young; if we don't
> bound it, everything gets promoted to old and GC becomes an issue. This
> doesn't solve the fact that the walk is slow, but it lowers the cost of GC.
> We make sure to have spare memory on the system for page cache, so spilling
> to disk is a memory write for us 99% of the time. If your host has less free
> memory, spilling may become more expensive.
>
> If the walk is your bottleneck and not GC, then I would recommend JOL and
> guessing to better predict memory.
>
> On Mon, Feb 26, 2018, 4:47 PM Xin Liu <xin.e@gmail.com> wrote:
>
>> Hi folks,
>>
>> We have a situation where the shuffled data is protobuf-based, and
>> SizeEstimator is taking a lot of time.
>>
>> We have tried to override SizeEstimator to return a constant value, which
>> speeds things up a lot.
>>
>> My question: what are the side effects of disabling SizeEstimator? Is it
>> just that Spark does memory reallocation, or are there more severe
>> consequences?
>>
>> Thanks!
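The trade-off David describes can be sketched with a toy model. This is not Spark's implementation: `ToySpillBuffer`, `max_bytes`, and `max_records` are hypothetical names made up for illustration, and `max_records` only plays the role that the record-count threshold plays in the message above. The sketch assumes a buffer that spills when either an estimated byte budget or a record cap is reached, and shows that a constant size estimate which undercounts never trips the byte budget, while a record cap still bounds what is held between spills.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ToySpillBuffer:
    """Toy in-memory buffer that 'spills' when either limit is reached."""
    estimate_size: Callable[[object], int]  # stand-in for a size estimator
    max_bytes: int                          # hypothetical memory budget
    max_records: int                        # hypothetical record-count cap
    records: List[object] = field(default_factory=list)
    estimated_bytes: int = 0
    spills: int = 0

    def insert(self, record: object) -> None:
        self.records.append(record)
        self.estimated_bytes += self.estimate_size(record)
        over_budget = self.estimated_bytes >= self.max_bytes
        over_count = len(self.records) >= self.max_records
        if over_budget or over_count:
            self._spill()

    def _spill(self) -> None:
        # A real system would write to disk here; the toy just resets.
        self.records.clear()
        self.estimated_bytes = 0
        self.spills += 1


def run(buffer: ToySpillBuffer, records: List[object]) -> int:
    for record in records:
        buffer.insert(record)
    return buffer.spills


# 10,000 payloads of roughly 1 KB each.
records = ["x" * 1000 for _ in range(10_000)]

# Constant estimate of 1 byte per record and, effectively, no record cap.
unbounded = ToySpillBuffer(lambda r: 1, max_bytes=1_000_000, max_records=10**9)
# Same constant estimate, but at most 1,000 records held between spills.
capped = ToySpillBuffer(lambda r: 1, max_bytes=1_000_000, max_records=1_000)

spills_unbounded = run(unbounded, records)
spills_capped = run(capped, records)
```

In this toy, the uncapped buffer holds all of the data without ever spilling, which is the kind of memory pressure a wrong constant estimate could cause; whether Spark behaves the same way depends on internals not covered in this thread.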
Re: SizeEstimator
Thanks! Our protobuf object is fairly complex. Even O(N) takes a lot of time.

On Mon, Feb 26, 2018 at 6:33 PM, 叶先进 <advance...@gmail.com> wrote:

> Hi Xin Liu,
>
> Could you provide a concrete use case if possible (code to reproduce the
> protobuf object and comparisons between protobuf and normal objects)?
>
> I contributed a bit to SizeEstimator years ago, and to my understanding,
> the time complexity should be O(N), where N is the number of fields
> referenced recursively.
>
> We should definitely investigate this case if it indeed takes a lot of
> time on protobuf objects.
>
> On 27 Feb 2018, at 8:47 AM, Xin Liu <xin.e@gmail.com> wrote:
>
> Hi folks,
>
> We have a situation where the shuffled data is protobuf-based, and
> SizeEstimator is taking a lot of time.
>
> We have tried to override SizeEstimator to return a constant value, which
> speeds things up a lot.
>
> My question: what are the side effects of disabling SizeEstimator? Is it
> just that Spark does memory reallocation, or are there more severe
> consequences?
>
> Thanks!
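As a rough analogy for why an O(N) walk hurts on a deeply nested message and why a byte array helps, the sketch below counts how many distinct objects a reference-following walk has to visit. Everything here is an assumption made for illustration: `count_reachable` and `make_message` are hypothetical helpers, nested dicts and lists stand in for a complex protobuf object, and `pickle.dumps` stands in for serializing the message to bytes. It does not reproduce SizeEstimator itself.

```python
import pickle


def count_reachable(obj, seen=None):
    """Count distinct objects reachable from obj by following references."""
    if seen is None:
        seen = set()
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    count = 1
    if isinstance(obj, dict):
        for key, value in obj.items():
            count += count_reachable(key, seen)
            count += count_reachable(value, seen)
    elif isinstance(obj, (list, tuple, set)):
        for item in obj:
            count += count_reachable(item, seen)
    # str, bytes, and numbers are treated as leaves with nothing to follow.
    return count


def make_message(depth, fanout):
    """Build a nested structure standing in for a complex message."""
    if depth == 0:
        return {"value": "leaf"}
    children = [make_message(depth - 1, fanout) for _ in range(fanout)]
    return {"children": children}


message = make_message(depth=5, fanout=4)
visited_nested = count_reachable(message)

# Serialized form: one flat bytes object, no references to follow.
payload = pickle.dumps(message)
visited_bytes = count_reachable(payload)
```

The nested form forces the walk through well over a thousand objects, while the serialized form is a single leaf whose size follows from its length. The cost moves to serialization and deserialization, which this sketch does not measure.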
SizeEstimator
Hi folks,

We have a situation where the shuffled data is protobuf-based, and SizeEstimator is taking a lot of time.

We have tried to override SizeEstimator to return a constant value, which speeds things up a lot.

My question: what are the side effects of disabling SizeEstimator? Is it just that Spark does memory reallocation, or are there more severe consequences?

Thanks!
Parquet Multiple Output
Hi,

I have a scenario where I'd like to store an RDD in Parquet format in many files, which correspond to days, such as 2015/01/01, 2015/02/02, etc.

So far I have used this method

http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job

to store text files (then I have to read the text files, convert them to Parquet, and store them again). Has anyone tried to store many Parquet files from one RDD?

Thanks,
Xin
Re: Compare LogisticRegression results using Mllib with those using other libraries (e.g. statsmodel)
Thank you guys for the prompt help.

I ended up building Spark master and verified what DB has suggested.

    val lr = (new MlLogisticRegression)
      .setFitIntercept(true)
      .setMaxIter(35)

    val model = lr.fit(sqlContext.createDataFrame(training))

    val scoreAndLabels = model.transform(sqlContext.createDataFrame(test))
      .select("probability", "label")
      .map { case Row(probability: Vector, label: Double) =>
        (probability(1), label)
      }

Without doing much tuning, the above generates

    Weights: [0.0013971323020715888,0.8559779783186241,-0.5052275562089914]
    Intercept: -3.3076806966913006
    Area under ROC: 0.7033511043412033

I also tried it on a much bigger dataset I have, and its results are close to what I get from statsmodel.

Now eagerly waiting for the 1.4 release.

Thanks,
Xin

On Wed, May 20, 2015 at 9:37 PM, Chris Gore cdg...@cdgore.com wrote:

I tried running this data set as described with my own implementation of L2 regularized logistic regression using LBFGS to compare:

https://github.com/cdgore/fitbox

    Intercept: -0.886745823033
    Weights (['gre', 'gpa', 'rank']):[ 0.28862268 0.19402388 -0.36637964]
    Area under ROC: 0.724056603774

The difference could be from the feature preprocessing as mentioned. I normalized the features around 0:

    binary_train_normalized = (binary_train - binary_train.mean()) / binary_train.std()
    binary_test_normalized = (binary_test - binary_train.mean()) / binary_train.std()

On a data set this small, the difference in models could also be the result of how the training/test sets were split.

Have you tried running k-folds cross validation on a larger data set?

Chris

On May 20, 2015, at 6:15 PM, DB Tsai d...@netflix.com.INVALID wrote:

Hi Xin,

If you take a look at the model you trained, the intercept from Spark is significantly smaller than the one from StatsModel, and the intercept represents a prior on categories in LOR, which causes the low accuracy in the Spark implementation.
In LogisticRegressionWithLBFGS, the intercept is regularized due to the implementation of Updater, and the intercept should not be regularized.

In the new pipeline APIs, a LOR with elasticNet is implemented, and the intercept is properly handled.

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala

As you can see in the tests,

https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala

the result is exactly the same as R now.

BTW, in both versions, the feature scaling is done before training, and we train the model in the scaled space but transform the model weights back to the original space. The only difference is that in the mllib version, LogisticRegressionWithLBFGS regularizes the intercept, while in the ml version, the intercept is excluded from regularization. As a result, if lambda is zero, the model should be the same.

On Wed, May 20, 2015 at 3:42 PM, Xin Liu liuxin...@gmail.com wrote:

Hi,

I have tried a few models in Mllib to train a LogisticRegression model. However, I consistently get much better results using other libraries such as statsmodel (which gives similar results as R) in terms of AUC. For illustration purposes, I used a small data set (I have tried much bigger data)

http://www.ats.ucla.edu/stat/data/binary.csv

in

http://www.ats.ucla.edu/stat/r/dae/logit.htm

Here is the snippet of my usage of LogisticRegressionWithLBFGS.

    val algorithm = new LogisticRegressionWithLBFGS
    algorithm.setIntercept(true)
    algorithm.optimizer
      .setNumIterations(100)
      .setRegParam(0.01)
      .setConvergenceTol(1e-5)

    val model = algorithm.run(training)
    model.clearThreshold()

    val scoreAndLabels = test.map { point =>
      val score = model.predict(point.features)
      (score, point.label)
    }

    val metrics = new BinaryClassificationMetrics(scoreAndLabels)
    val auROC = metrics.areaUnderROC()

I did a (0.6, 0.4) split for training/test.
The response is admit, and the features are GRE score, GPA, and college Rank.

Spark:

    Weights (GRE, GPA, Rank): [0.0011576276331509304,0.048544858567336854,-0.394202150286076]
    Intercept: -0.6488972641282202
    Area under ROC: 0.6294070512820512

StatsModel:

    Weights [0.0018, 0.7220, -0.3148]
    Intercept: -3.5913
    Area under ROC: 0.69

The weights from statsmodel seem more reasonable: for a one-unit increase in GPA, the log odds of being admitted to graduate school increase by 0.72 in statsmodel versus 0.04 in Spark.

I have seen much bigger differences with other data. So my question is: has anyone compared the results with other libraries, and is anything wrong with my code to invoke LogisticRegressionWithLBFGS?

The real data I am processing is pretty big, and I really want to use Spark to get this to work. Please let me know if you have had a similar experience and how you resolved it.

Thanks,
Xin
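DB's point about the intercept can be illustrated with a small pure-Python sketch. It is not the mllib or ml implementation: `fit_logistic`, the synthetic one-feature data, and the plain gradient-descent optimizer are all assumptions chosen to keep the example self-contained. The objective is the mean log-loss plus `(reg_param / 2)` times the squared weight, with the squared intercept optionally added to the penalty.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, reg_param, penalize_intercept, steps=5000, lr=0.5):
    """Gradient descent on mean log-loss with an L2 penalty on the weight
    and, optionally, on the intercept as well."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n + reg_param * w
        grad_b = sum(errors) / n
        if penalize_intercept:
            grad_b += reg_param * b
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Synthetic, imbalanced data: one centered feature, 5 positives out of 20.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0] * 4
ys = [0, 0, 0, 0, 1,
      0, 0, 0, 1, 1,
      0, 0, 0, 0, 1,
      0, 0, 1, 0, 0]

# Intercept excluded from the penalty versus intercept penalized.
w_free, b_free = fit_logistic(xs, ys, reg_param=0.1, penalize_intercept=False)
w_pen, b_pen = fit_logistic(xs, ys, reg_param=0.1, penalize_intercept=True)

# With reg_param = 0 the two variants optimize the same objective.
w0_free, b0_free = fit_logistic(xs, ys, reg_param=0.0, penalize_intercept=False)
w0_pen, b0_pen = fit_logistic(xs, ys, reg_param=0.0, penalize_intercept=True)
```

On this data the penalized intercept is pulled toward zero relative to the unpenalized one, which is the direction of the gap between the Spark and statsmodel intercepts reported in the thread, and the two variants coincide when the regularization parameter is zero.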
Compare LogisticRegression results using Mllib with those using other libraries (e.g. statsmodel)
Hi,

I have tried a few models in Mllib to train a LogisticRegression model. However, I consistently get much better results using other libraries such as statsmodel (which gives similar results as R) in terms of AUC. For illustration purposes, I used a small data set (I have tried much bigger data)

http://www.ats.ucla.edu/stat/data/binary.csv

in

http://www.ats.ucla.edu/stat/r/dae/logit.htm

Here is the snippet of my usage of LogisticRegressionWithLBFGS.

    val algorithm = new LogisticRegressionWithLBFGS
    algorithm.setIntercept(true)
    algorithm.optimizer
      .setNumIterations(100)
      .setRegParam(0.01)
      .setConvergenceTol(1e-5)

    val model = algorithm.run(training)
    model.clearThreshold()

    val scoreAndLabels = test.map { point =>
      val score = model.predict(point.features)
      (score, point.label)
    }

    val metrics = new BinaryClassificationMetrics(scoreAndLabels)
    val auROC = metrics.areaUnderROC()

I did a (0.6, 0.4) split for training/test. The response is admit, and the features are GRE score, GPA, and college Rank.

Spark:

    Weights (GRE, GPA, Rank): [0.0011576276331509304,0.048544858567336854,-0.394202150286076]
    Intercept: -0.6488972641282202
    Area under ROC: 0.6294070512820512

StatsModel:

    Weights [0.0018, 0.7220, -0.3148]
    Intercept: -3.5913
    Area under ROC: 0.69

The weights from statsmodel seem more reasonable: for a one-unit increase in GPA, the log odds of being admitted to graduate school increase by 0.72 in statsmodel versus 0.04 in Spark.

I have seen much bigger differences with other data. So my question is: has anyone compared the results with other libraries, and is anything wrong with my code to invoke LogisticRegressionWithLBFGS?

The real data I am processing is pretty big, and I really want to use Spark to get this to work. Please let me know if you have had a similar experience and how you resolved it.

Thanks,
Xin
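To make the log-odds comparison above concrete, a coefficient can be converted to an odds ratio by exponentiating it. The sketch below uses the two GPA weights quoted in the message; `odds_ratio` is a hypothetical helper name, and the only assumption is the standard logistic-regression reading of a coefficient as the change in log odds per unit of the feature.

```python
import math


def odds_ratio(coefficient, delta=1.0):
    """Multiplicative change in the odds when the feature rises by delta."""
    return math.exp(coefficient * delta)


# GPA weights as reported in the message above.
statsmodel_gpa = 0.7220
spark_gpa = 0.048544858567336854

statsmodel_ratio = odds_ratio(statsmodel_gpa)  # odds roughly double
spark_ratio = odds_ratio(spark_gpa)            # odds rise by about 5%
```

Read this way, the statsmodel fit says one extra GPA point about doubles the odds of admission, while the mllib fit says it raises them by only about five percent.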