Github user mengxr commented on the pull request:
https://github.com/apache/spark/pull/79#issuecomment-37224328
@manishamde Thanks for updating the code style and adding more docs! I made
a first pass over the code.
For the code style, we do not have a good style checker for Scala; @rxin
can tell you more about style checking. However, it is easy to pick up
Spark's code style through code review and make your code consistent with it
in the next update. Please see my comments for some examples and update
similar code in other places.
For the implementation, I have the following suggestions:
1. Whether the tree is for regression or classification is checked in many
places. It would be nice to create a DecisionTree base class and make
RegressionTree and ClassificationTree two subclasses of it.
2. For loops are used in some performance-critical code. These should be
replaced by "while" loops, which are much faster than "for" in Scala.
3. Several nested methods are used in findBestSplits. It would feel safer to
have some unit tests for them.
4. The threshold for classification is set at 0.5. This should be
configurable.
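
To make suggestions 1, 2 and 4 concrete, here is a rough sketch of the shape I
have in mind. It is not taken from the pull request: the object name
TreeSketch, the methods predictFromMean and mean, and the ">=" comparison
against the threshold are all assumptions for illustration, so please adapt
them to the actual code.

```scala
// Hypothetical sketch: every name and signature below is illustrative only.
object TreeSketch {

  // Suggestion 1: shared logic lives in a base class, so callers do not
  // need to check "regression or classification" at each call site.
  abstract class DecisionTree {
    // Turns the mean label of a leaf into a prediction.
    def predictFromMean(mean: Double): Double
  }

  class RegressionTree extends DecisionTree {
    def predictFromMean(mean: Double): Double = mean
  }

  // Suggestion 4: the threshold is a constructor parameter that defaults
  // to 0.5 instead of being hard-coded. Using ">=" here is an assumption.
  class ClassificationTree(val threshold: Double = 0.5) extends DecisionTree {
    def predictFromMean(mean: Double): Double =
      if (mean >= threshold) 1.0 else 0.0
  }

  // Suggestion 2: an index-based while loop. The equivalent
  // `for (i <- 0 until values.length)` desugars to Range.foreach with a
  // closure, which is the overhead the while loop avoids.
  def mean(values: Array[Double]): Double = {
    require(values.nonEmpty, "values must be non-empty")
    var sum = 0.0
    var i = 0
    while (i < values.length) {
      sum += values(i)
      i += 1
    }
    sum / values.length
  }
}
```

With this shape, `new TreeSketch.ClassificationTree(0.8)` would use a custom
threshold, while `new TreeSketch.ClassificationTree()` keeps the current 0.5.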
I will try to make a second pass focusing on the algorithm later today. In
the meantime, would you please fix the remaining code style problems and
replace the for loops? Thanks!