subject:"\[GitHub\] spark pull request\: MLI\-1 Decision Trees"

[GitHub] spark pull request: MLI-1 Decision Trees

2014-04-02 Thread manishamde

Github user manishamde commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-39401757
  
@hirakendu Thanks a lot for the detailed comments and feedback. Yes, we 
have a responsibility to keep improving the trees going forward so getting 
additional feedback is awesome.

The feedback around the current implementation in the 'Miscellaneous' 
section is around renaming and minor refactoring of code. I agree with most of 
the feedback and some choices are personal preferences which should discuss and 
resolve. We should address it soon when we implement the better ```Impurity``` 
interface that we promised to implement ASAP. 

I think the "Error" interface you described is very similar to what @mengxr 
proposed as well. We should discuss the naming convention. Even though I don't 
feel strongly about "Impurity" but "Error" might not be the best name for the 
classification scenario. I am open to better names and ready to be convinced 
otherwise. :-)

For 'General Design Notes', I have similar thoughts but I will wait for 
@etrain's comments since he has thought about it carefully. In general, I like 
@etrain 's MLI design for Model and Algorithm. I did not tie the current 
implementation to the existing traits yet since I wanted to have a broader 
conversation about it after the tree PR. It's straightforward to implement once 
we agree on the interfaces for mllib algorithms.

Finally, and most importantly, thanks a ton for performing such extensive 
tests on a massive dataset! The results are not too shabby. ;-)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: MLI-1 Decision Trees

2014-04-02 Thread hirakendu

Github user hirakendu commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-39396722
  
Thanks for noticing the typo, it's 500 million examples. Corrected it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: MLI-1 Decision Trees

2014-04-02 Thread etrain

Github user etrain commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-39392123
  
Hi Hirakendu - thanks for all the detailed suggestions and information. I 
will reply to that separately.

One question - you say there are 500,000 examples and this equates to 90GB 
of raw data. If that's the case, this works out to ~200KB per example - is that 
right or are you off by an order of magnitude in either the number of features 
or the number of data points? Or are we throwing a bunch of data out before 
fitting?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: MLI-1 Decision Trees

2014-04-02 Thread hirakendu

Github user hirakendu commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-39390821
  
Since it's a long review, I have put things together in one place 
[spark_mllib_tree_pr79_review.md](https://github.com/hirakendu/temp/blob/master/spark_mllib_tree_pr79_review/spark_mllib_tree_pr79_review.md)
 so that it's easy to track later.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] spark pull request: MLI-1 Decision Trees

2014-04-02 Thread hirakendu

Github user hirakendu commented on the pull request:

https://github.com/apache/spark/pull/79#issuecomment-39390268

In terms of running time performance, here are some scalability results for
a large scale dataset. Results look satisfactory :).

The code was tested on a dataset for a binary classification problem. A
regression tree with `Variance` (square loss) was trained because it's
computationally more intensive. The dataset consists of about `500,000`
training instances. There are 20 features, all numeric. Although there are
categorical features in the dataset and the algorithm implementation supports
it, they were not used since the main program doesn't have options.

The dataset is of size about 90 GB in plain text format. It consists of 100
part files, each about 900 MB. To _optimize_ number of tasks to align with
number of workers, the task split size was chosen to be 160 MB to have 300
tasks.

The model training experiements were done on a Yahoo internal Hadoop
cluster. The CPUs are of type Intel Xeon 2.4 GHz. The Spark YARN adaptation was
used to run on Hadoop cluster. Both master-memory and worker-memory were set to
`7500m` and the cluster was under light to moderate load at the time of
experiments. The fraction of memory used for caching was `0.3`,
leading to about `2 GB` of memory per worker for caching. The number of
workers used for various experiments were `20, 30, 60, 100, 150, 300, 600` to
evenly align with 600 tasks. Note that with 20 and 30 workers, only `48%` and
`78%` of the 90 GB data could be cached in memory. The decision tree of depth
10 was trained and minor code changes were made to record the individual level
training times. The training command (with additional JVM settings) used is

time \
SPARK_JAVA_OPTS="${SPARK_JAVA_OPTS} \
-Dspark.hadoop.mapred.min.split.size=167772160 \
-Dspark.hadoop.mapred.max.split.size=167772160" \
-Dspark.storage.memoryFraction=0.3 \
spark-class org.apache.spark.deploy.yarn.Client \
--queue ${QUEUE} \
--num-workers ${NUM_WORKERS} \
--worker-memory 7500m \
--worker-cores 1 \
--master-memory 7500m \
--jar ${JARS}/spark_mllib_tree.jar \
--class org.apache.spark.mllib.tree.DecisionTree \
--args yarn-standalone \
--args --algo --args Regression \
--args --trainDataDir --args ${DIST_WORK}/train_data_lp.txt \
--args --testDataDir --args ${DIST_WORK}/test_data_0p1pc_lp.txt \
--args --maxDepth --args 10 \
--args --impurity --args Variance \
--args --maxBins --args 100

The training times for training each depth and for each choice of number of
workers is in the attachments
[workers_times.txt](https://raw.githubusercontent.com/hirakendu/temp/master/spark_mllib_tree_pr79_review/workers_times.txt).
The attached graphs demonstrate the scalability in terms of
cumulative training times for various depths and various number of workers.

![](https://raw.githubusercontent.com/hirakendu/temp/master/spark_mllib_tree_pr79_review/workers_times.png)

![](https://raw.githubusercontent.com/hirakendu/temp/master/spark_mllib_tree_pr79_review/workers_speedups.png)

For obtaining the speed-ups, the training times are compared to those for
60 workers, since the data could not be cached completely for 20 and 30
workers. For all experiments, the resources requested
were fully allocated by the cluster and all experiments ran to completion
in their first and only run.
As we can see, the scaling is nearly linear for higher depths 9 and 10,
across the range of 60 workers to 600 workers, although the slope is less than
1 as expected. For such loads, 60 to 100 workers are a reasonable computing
resource and the total training times are about 160 minutes and 100 minutes
respectively. Overall, the performance is satisfactory, in particular when trees
of depth 4 or 5 are trained for boosting models. But clearly, there is room
for improvement in the following sense. The dataset when fully cached takes
about 10s for a count operation, whereas the training time for first level that
involves simple histogram calculation of three error statistics takes roughly
30 seconds.

The error performance in terms of RMSE was verified to be close to that of
alternate implementations.

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

1 2 >

1 - 100 of 123 matches

Mail list logo