Perhaps the best way is to read the code.
Spark implements a decision tree as a random forest with a single tree, and
the entry point is the `run` method:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala#L88
I'm not familiar with the named decision tree algorithms, such as ID3 and
CART. However, I believe that the decision tree implementation in sklearn
is quite similar to the one in Spark. Some differences are listed below:
1. Continuous features.
sklearn evaluates every candidate value to find the best split, while Spark
groups the candidate values into a fixed number of bins.
2. Tree building.
sklearn provides two strategies, depth-first and best-first, while Spark
has only one: depth-first.
3. Number of splits.
sklearn creates one split per iteration, while Spark can split several
nodes in parallel.
If I'm wrong, please let me know.
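
To make point 1 concrete, here is a minimal Python sketch of the two ways
of generating candidate thresholds for a continuous feature. It is only an
illustration: the function names are made up for this email and are not
part of sklearn or Spark, and using quantiles to pick the bin boundaries is
my assumption about the general idea behind Spark's `maxBins` setting, not
a copy of its code.

```python
from statistics import quantiles


def exact_thresholds(values):
    """Exhaustive candidates: midpoints between consecutive distinct values."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]


def binned_thresholds(values, max_bins):
    """Binned candidates: at most max_bins - 1 thresholds.

    Assumption: bin boundaries are taken from quantiles of the feature.
    If there are few distinct values, fall back to the exhaustive list.
    """
    distinct = sorted(set(values))
    if len(distinct) <= max_bins:
        return exact_thresholds(values)
    # quantiles(..., n=k) returns k - 1 cut points; drop any duplicates.
    return sorted(set(quantiles(values, n=max_bins)))


values = [float(v) for v in range(100)]
print(len(exact_thresholds(values)))      # 99
print(len(binned_thresholds(values, 4)))  # 3
```

With binning, the number of splits to evaluate per feature is bounded by
the bin count instead of growing with the number of distinct values, at the
cost of possibly missing the exact best threshold.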
On Sat, Oct 1, 2016 at 10:34 AM, janardhan shetty <[email protected]>
wrote:
> It would be good to know which paper has inspired to implement the version
> which we use in spark 2.0 decision trees ?
>
> On Fri, Sep 30, 2016 at 4:44 PM, Peter Figliozzi <[email protected]
> > wrote:
>
>> It's a good question. People have been publishing papers on decision
>> trees and various methods of constructing and pruning them for over 30
>> years. I think it's rather a question for a historian at this point.
>>
>> On Fri, Sep 30, 2016 at 5:08 PM, janardhan shetty <[email protected]
>> > wrote:
>>
>>> Read this explanation but wondering if this algorithm has the base from
>>> a research paper for detail understanding.
>>>
>>> On Fri, Sep 30, 2016 at 1:36 PM, Kevin Mellott <
>>> [email protected]> wrote:
>>>
>>>> The documentation details the algorithm being used at
>>>> http://spark.apache.org/docs/latest/mllib-decision-tree.html
>>>>
>>>> Thanks,
>>>> Kevin
>>>>
>>>> On Fri, Sep 30, 2016 at 1:14 AM, janardhan shetty <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Any help here is appreciated ..
>>>>>
>>>>> On Wed, Sep 28, 2016 at 11:34 AM, janardhan shetty <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Is there a reference to the research paper which is implemented in
>>>>>> spark 2.0 ?
>>>>>>
>>>>>> On Wed, Sep 28, 2016 at 9:52 AM, janardhan shetty <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Which algorithm is used under the covers while doing decision trees
>>>>>>> FOR SPARK ?
>>>>>>> for example: scikit-learn (python) uses an optimised version of the
>>>>>>> CART algorithm.
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>