Re: Feedback: Feature request

2015-08-27 Thread Manish Amde
Hi James,

It's a good idea. A JSON format is more convenient for visualization, though a
little inconvenient to read. How about a toJson() method? It might make the
MLlib API inconsistent across models, though.
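For illustration, a minimal sketch of what such a helper could look like today,
outside of MLlib, by walking the public tree API (the toJson name, the JSON
field names, and the continuous-splits-only handling are all assumptions, not
an existing API):

    import org.apache.spark.mllib.tree.model.{DecisionTreeModel, Node}

    // Hypothetical helper (not part of MLlib): serialize a tree to JSON by
    // recursing over the public Node accessors (topNode, id, isLeaf, split,
    // leftNode, rightNode, predict in the 1.x API). Continuous splits only.
    def nodeToJson(node: Node): String =
      if (node.isLeaf) {
        s"""{"name":"node${node.id}","predict":${node.predict.predict}}"""
      } else {
        val split = node.split.get
        val children =
          Seq(node.leftNode, node.rightNode).flatten.map(nodeToJson).mkString(",")
        s"""{"name":"node${node.id}","rule":"feature ${split.feature} <= ${split.threshold}","children":[$children]}"""
      }

    def toJson(model: DecisionTreeModel): String = nodeToJson(model.topNode)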

You should probably create a JIRA for this.

CC: dev list

-Manish

> On Aug 26, 2015, at 11:29 AM, Murphy, James  wrote:
> 
> Hey all,
>  
> In working with the DecisionTree classifier, I found it difficult to extract 
> rules that could easily facilitate visualization with libraries like D3.
>  
> So for example, using print(model.toDebugString()), I get the following
> result:
>  
>If (feature 0 <= -35.0)
>   If (feature 24 <= 176.0)
> Predict: 2.1
>   If (feature 24 = 176.0)
> Predict: 4.2
>   Else (feature 24 > 176.0)
> Predict: 6.3
> Else (feature 0 > -35.0)
>   If (feature 24 <= 11.0)
> Predict: 4.5
>   Else (feature 24 > 11.0)
> Predict: 10.2
>  
> But ideally, I could see results in a more parseable format like JSON:
>  
> {
>   "node": [
>     {
>       "name": "node1",
>       "rule": "feature 0 <= -35.0",
>       "children": [
>         {
>           "name": "node2",
>           "rule": "feature 24 <= 176.0",
>           "children": [
>             {
>               "name": "node4",
>               "rule": "feature 20 < 116.0",
>               "predict": 2.1
>             },
>             {
>               "name": "node5",
>               "rule": "feature 20 = 116.0",
>               "predict": 4.2
>             },
>             {
>               "name": "node6",
>               "rule": "feature 20 > 116.0",
>               "predict": 6.3
>             }
>           ]
>         },
>         {
>           "name": "node3",
>           "rule": "feature 0 > -35.0",
>           "children": [
>             {
>               "name": "node7",
>               "rule": "feature 3 <= 11.0",
>               "predict": 4.5
>             },
>             {
>               "name": "node8",
>               "rule": "feature 3 > 11.0",
>               "predict": 10.2
>             }
>           ]
>         }
>       ]
>     }
>   ]
> }
>  
> Food for thought!
>  
> Thanks,
>  
> Jim
>  


Re: Feedback: Feature request

2015-08-28 Thread Manish Amde
Sounds good. It's a request I have seen a few times in the past and have
needed it personally. Maybe Joseph Bradley has something to add.

I think a JIRA to capture this would be great. We can move this discussion
to the JIRA then.
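As a concrete illustration of the separately parseable operator/operand idea
Cody raises below, a split could be represented roughly like this (the field
names are only an example, not a settled format):

    // Illustrative only: keep the operator and operands as separate fields so
    // a consumer never has to parse strings like "feature 0 <= -35.0".
    case class SplitRule(lhs: Int, op: String, rhs: Double)

    val rule = SplitRule(lhs = 0, op = "<=", rhs = -35.0)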

On Friday, August 28, 2015, Cody Koeninger  wrote:

> I wrote some code for this a while back, pretty sure it didn't need access
> to anything private in the decision tree / random forest model.  If people
> want it added to the api I can put together a PR.
>
> I think it's important to have separately parseable operators / operands
> though, e.g.:
>
> "lhs":0,"op":"<=","rhs":-35.0
> On Aug 28, 2015 12:03 AM, "Manish Amde"  > wrote:
>
>> Hi James,
>>
>> It's a good idea. A JSON format is more convenient for visualization,
>> though a little inconvenient to read. How about a toJson() method? It might
>> make the MLlib API inconsistent across models, though.
>>
>> You should probably create a JIRA for this.
>>
>> CC: dev list
>>
>> -Manish
>>
>> On Aug 26, 2015, at 11:29 AM, Murphy, James wrote:
>>
>> Hey all,
>>
>> In working with the DecisionTree classifier, I found it difficult to
>> extract rules that could easily facilitate visualization with libraries
>> like D3.
>>
>> So for example, using print(model.toDebugString()), I get the following
>> result:
>>
>>    If (feature 0 <= -35.0)
>>      If (feature 24 <= 176.0)
>>        Predict: 2.1
>>      If (feature 24 = 176.0)
>>        Predict: 4.2
>>      Else (feature 24 > 176.0)
>>        Predict: 6.3
>>    Else (feature 0 > -35.0)
>>      If (feature 24 <= 11.0)
>>        Predict: 4.5
>>      Else (feature 24 > 11.0)
>>        Predict: 10.2
>>
>> But ideally, I could see results in a more parseable format like JSON:
>>
>> {
>>   "node": [
>>     {
>>       "name": "node1",
>>       "rule": "feature 0 <= -35.0",
>>       "children": [
>>         {
>>           "name": "node2",
>>           "rule": "feature 24 <= 176.0",
>>           "children": [
>>             {
>>               "name": "node4",
>>               "rule": "feature 20 < 116.0",
>>               "predict": 2.1
>>             },
>>             {
>>               "name": "node5",
>>               "rule": "feature 20 = 116.0",
>>               "predict": 4.2
>>             },
>>             {
>>               "name": "node6",
>>               "rule": "feature 20 > 116.0",
>>               "predict": 6.3
>>             }
>>           ]
>>         },
>>         {
>>           "name": "node3",
>>           "rule": "feature 0 > -35.0",
>>           "children": [
>>             {
>>               "name": "node7",
>>               "rule": "feature 3 <= 11.0",
>>               "predict": 4.5
>>             },
>>             {
>>               "name": "node8",
>>               "rule": "feature 3 > 11.0",
>>               "predict": 10.2
>>             }
>>           ]
>>         }
>>       ]
>>     }
>>   ]
>> }
>>
>> Food for thought!
>>
>> Thanks,
>>
>> Jim


Re: DecisionTree Algorithm used in Spark MLLib

2015-01-01 Thread Manish Amde
Hi Anoop,

The Spark decision tree implementation supports regression and multi-class
classification, continuous and categorical features, and pruning; it does not
support missing features at present. You can probably think of it as
distributed CART, though personally I always find the acronyms confusing.

How much difference are you seeing? There is a very small difference in how
the candidate split thresholds are calculated in various libraries (there is
no single right way), but it should not lead to a significant difference in
performance.
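For reference, a minimal training call on the MLlib side for such a
side-by-side comparison could look like the following (a sketch only; the file
path, impurity, depth, and bins are placeholders, and sc is an existing
SparkContext):

    import org.apache.spark.mllib.tree.DecisionTree
    import org.apache.spark.mllib.util.MLUtils

    // Load a LIBSVM-format dataset and train a small classification tree.
    val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")
    // Arguments: input, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins
    val model = DecisionTree.trainClassifier(data, 2, Map[Int, Int](), "gini", 5, 32)
    println(model.toDebugString)   // compare the learned splits with the R tree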

-Manish


On Monday, December 29, 2014, Anoop Shiralige 
wrote:

> Hi All,
>
> I am trying to do a comparison, by building the model locally using R and
> on a cluster using Spark.
> There is some difference in the results.
>
> Any idea what the internal implementation of Decision Tree in Spark MLlib
> is (ID3, C4.5, C5.0, or the CART algorithm)?
>
> Thanks,
> AnoopShiralige
>


Re: Random Forest on Spark

2014-04-18 Thread Manish Amde
Sorry for arriving late to the party! Evan has clearly explained the
current implementation, our future plans and key differences with the
PLANET paper. I don't think I can add more to his comments. :-)

I apologize for not creating the corresponding JIRA tickets for the tree
improvements (multiclass classification, deep trees, post-shuffle
single-machine computation for small datasets, code refactoring for
pluggable loss calculation) and tree ensembles (RF, GBT, AdaBoost,
ExtraTrees, partial implementation of RF). I will create them soon.

We are currently working on very fast ensemble trees which will be different
from the current ensemble tree implementations in other libraries. PRs for
tree improvements will be great -- just make sure you go carefully through
the tree code (which I think is fairly well documented but non-trivial to
understand) and discuss your changes on JIRA before implementation to avoid
duplication.

-Manish


On Fri, Apr 18, 2014 at 8:43 AM, Evan R. Sparks wrote:

> Interesting, and thanks for the thoughts.
>
> I think we're on the same page with 100s of millions of records. We've
> tested the tree implementation in mllib on 1b rows and up to 100 features -
> though this isn't hitting the 1000s of features you mention.
>
> Obviously multi class support isn't there yet, but I can see your point
> about deeper trees for many class problems. Will try them out on some image
> processing stuff with 1k classes we're doing in the lab once they are more
> developed to get a sense for where the issues are.
>
> If you're only allocating 2GB/worker you're going to have a hard time
> getting the real advantages of Spark.
>
> For your 1k features causing heap exceptions at depth 5  - are these
> categorical or continuous? The categorical vars create much smaller
> histograms.
>
> If you're fitting all continuous features, the memory requirements are
> O(b*d*2^l) where b=number of histogram bins, d=number of features, and l =
> level of the tree. Even accounting for object overhead, with the default
> number of bins, the histograms at this depth should be order of 10s of MB,
> not 2GB - so I'm guessing your cached data is occupying a significant chunk
> of that 2GB? In the tree PR - Hirakendu Das tested down to depth 10 on 500m
> data points with 20 continuous features and was able to run without running
> into memory issues (and scaling properties got better as the depth grew).
> His worker mem was 7.5GB and 30% of that was reserved for caching. If you
> wanted to go 1000 features at depth 10 I'd estimate a couple of gigs
> necessary for heap space for the worker to compute/store the histograms,
> and I guess 2x that on the master to do the reduce.
>
> Again 2GB per worker is pretty tight, because there are overheads of just
> starting the jvm, launching a worker, loading libraries, etc.
>
> - Evan
>
> On Apr 17, 2014, at 6:10 PM, Sung Hwan Chung 
> wrote:
>
> Yes, it should be data specific and perhaps we're biased toward the data
> sets that we are playing with. To put things in perspective, we're highly
> interested in (and I believe, our customers are):
>
> 1. large (hundreds of millions of rows)
> 2. multi-class classification - nowadays, dozens of target categories are
> common and even thousands in some cases - you could imagine that this is a
> big reason for us requiring more 'complex' models
> 3. high dimensional with thousands of descriptive and sort-of-independent
> features
>
> From the theoretical perspective, I would argue that it's usually in the
> best interest to prune as little as possible. I believe that pruning
> inherently increases bias of an individual tree, which RF can't do anything
> about while decreasing variance - which is what RF is for.
>
> The default pruning criterion for R's reference implementation is a minimum
> node size of 1 (meaning a fully grown tree) for classification, and 5 for
> regression. I'd imagine they did at least some empirical testing to justify
> these values at the time - although at a time of small datasets :).
>
> FYI, we are also considering the MLLib decision tree for our Gradient
> Boosting implementation, however, the memory requirement is still a bit too
> steep (we were getting heap exceptions at depth limit of 5 with 2GB per
> worker with approximately 1000 features). Now 2GB per worker is about what
> we expect our typical customers would tolerate and I don't think that it's
> unreasonable for shallow trees.
>
>
>
> On Thu, Apr 17, 2014 at 3:54 PM, Evan R. Sparks wrote:
>
>> What kind of data are you training on? These effects are *highly* data
>> dependent, and while saying "the depth of 10 is simply not adequate to
>> build high-accuracy models" may be accurate for the particular problem
>> you're modeling, it is not true in general. From a statistical perspective,
>> I consider each node in each tree an additional degree of freedom for the
>> model, and all else equal I'd expect a model with fewer degrees of freedom
>> to generalize better.
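For concreteness, a rough back-of-the-envelope check of the O(b * d * 2^l)
histogram estimate discussed above (illustrative numbers only: b = 100 bins
and 8 bytes per bin statistic are assumptions, and per-object overhead is
ignored):

    val b = 100L              // histogram bins per feature (assumed)
    val d = 1000L             // continuous features
    val bytesPerBin = 8L      // one double per bin statistic

    // Approximate histogram size in MB at a given tree level.
    def histogramMB(level: Int): Long = b * d * (1L << level) * bytesPerBin / (1L << 20)

    println(histogramMB(5))   // ~24 MB at depth 5  -> "order of 10s of MB"
    println(histogramMB(10))  // ~781 MB at depth 10 -> GBs once overhead is included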

Re: MLLib : Decision Tree not getting built for 5 or more levels(maxDepth=5) and the one built for 3 levels is performing poorly

2014-06-13 Thread Manish Amde
Hi Suraj,

I can't answer 1) without knowing the data. However, the results for 2) are
surprising indeed. We have tested with a billion samples for regression
tasks, so I am perplexed by the behavior.

Could you try the latest Spark master to see whether this problem goes
away? It has code that limits memory consumption at the master and worker
nodes to 128 MB by default, which ideally should not be needed given the
amount of RAM on your cluster.
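For reference, a sketch of how that budget can be raised when constructing the
training strategy (this assumes the Strategy.maxMemoryInMB parameter in the
1.x API; the depth and impurity here are placeholders):

    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.tree.DecisionTree
    import org.apache.spark.mllib.tree.configuration.{Algo, Strategy}
    import org.apache.spark.mllib.tree.impurity.Variance
    import org.apache.spark.mllib.tree.model.DecisionTreeModel
    import org.apache.spark.rdd.RDD

    // Sketch: raise the per-node aggregate memory budget above the default.
    def trainWithLargerBudget(trainingData: RDD[LabeledPoint]): DecisionTreeModel = {
      val strategy = new Strategy(algo = Algo.Regression, impurity = Variance,
        maxDepth = 5, maxMemoryInMB = 256)   // assumed knob; default was 128 MB
      DecisionTree.train(trainingData, strategy)
    }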

Also, feel free to send the DEBUG logs. It might give me a better idea of
where the algorithm is getting stuck.

-Manish



On Wed, Jun 11, 2014 at 1:20 PM, SURAJ SHETH  wrote:

> Hi Filipus,
> The train data is already oversampled.
> The number of positives I mentioned above is for the test dataset : 12028
> (apologies for not making this clear earlier)
> The train dataset has 61,264 positives out of 689,763 total rows. The
> number of negatives is 628,499.
> Oversampling was done for the train dataset to ensure that we have at least
> 9-10% of positives in the train part.
> No oversampling is done for the test dataset.
>
> So, the only difference that remains is the amount of data used for
> building a tree.
>
> But, I have a few more questions:
> Has anyone tried how much data can be used at most to build a single Decision
> Tree?
> Since I have enough RAM to fit all the data into memory (only 1.3 GB of
> train data and 30x3 GB of RAM), I would expect it to build a single
> Decision Tree with all the data without any issues. But, for maxDepth >= 5,
> it is not able to. I confirmed that when it keeps running for hours, the
> amount of free memory available is more than 70%. So, it doesn't seem to be
> a memory issue either.
>
>
> Thanks and Regards,
> Suraj Sheth
>
>
> On Wed, Jun 11, 2014 at 10:19 PM, filipus  wrote:
>
>> well I guess your problem is quite unbalanced, and due to the information
>> value as a splitting criterion I guess the algo stops after very few
>> splits
>>
>> a workaround is oversampling
>>
>> build many training datasets like this:
>>
>> take randomly 50% of the positives and from the negatives the same amount,
>> or let's say double that
>>
>> => 6000 positives and 12000 negatives
>>
>> build a tree
>>
>> do this many times => many models (agents)
>>
>> and then you make an ensemble model, i.e. let all the models vote
>>
>> in a way similar to random forest, but put together in a completely
>> different way
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/MLLib-Decision-Tree-not-getting-built-for-5-or-more-levels-maxDepth-5-and-the-one-built-for-3-levelsy-tp7401p7405.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>
>
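For concreteness, a rough sketch of the resample-train-and-vote approach
suggested above (illustrative only: the sampling fractions, tree parameters,
and helper names are assumptions, not an MLlib API; trainClassifier is the
Spark 1.1+ convenience method):

    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.tree.DecisionTree
    import org.apache.spark.mllib.tree.model.DecisionTreeModel
    import org.apache.spark.rdd.RDD

    // Train several trees, each on ~50% of the positives plus roughly as many
    // negatives as positives, then combine the trees by majority vote.
    def trainVotingEnsemble(data: RDD[LabeledPoint], numModels: Int): Seq[DecisionTreeModel] = {
      val positives = data.filter(_.label == 1.0)
      val negatives = data.filter(_.label == 0.0)
      val negFraction = math.min(1.0, positives.count.toDouble / negatives.count)
      (1 to numModels).map { seed =>
        val sample = positives.sample(false, 0.5, seed)
          .union(negatives.sample(false, negFraction, seed))
        // input, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins
        DecisionTree.trainClassifier(sample, 2, Map[Int, Int](), "gini", 5, 32)
      }
    }

    // Majority vote over the individual tree predictions.
    def predictByVote(models: Seq[DecisionTreeModel], features: Vector): Double =
      if (models.count(_.predict(features) == 1.0) * 2 >= models.size) 1.0 else 0.0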


Re: MLLib : Decision Tree not getting built for 5 or more levels(maxDepth=5) and the one built for 3 levels is performing poorly

2014-06-15 Thread Manish Amde
Hi Suraj,

I don't see any logs from MLlib. You might need to explicitly set the logging
level to DEBUG for MLlib. Adding this line to log4j.properties should fix the
problem:
log4j.logger.org.apache.spark.mllib.tree=DEBUG

Also, please let me know whether you encounter similar problems with the
Spark master.

-Manish


On Sat, Jun 14, 2014 at 3:19 AM, SURAJ SHETH  wrote:

> Hi Manish,
> Thanks for your reply.
>
> I am attaching the logs here (regression, 5 levels). It contains the last
> few hundred lines. Also, I am attaching a screenshot of the Spark UI. The
> first 4 levels complete in less than 6 seconds, while the 5th level doesn't
> complete even after several hours.
> Because this is somebody else's data, I can't share it.
>
> Can you check the code snippet attached in my first email and see if it
> needs something to enable it to work for large data and >= 5 levels? It is
> working for 3 levels on the same dataset, but not for 5 levels.
>
> In the meantime, I will try to run it on the latest master and let you
> know the results. If it runs fine there, then it can be related to the
> 128 MB limit issue that you mentioned.
>
> Thanks and Regards,
> Suraj Sheth
>
>
>
> On Sat, Jun 14, 2014 at 12:05 AM, Manish Amde  wrote:
>
>> Hi Suraj,
>>
>> I can't answer 1) without knowing the data. However, the results for 2)
>> are surprising indeed. We have tested with a billion samples for regression
>> tasks, so I am perplexed by the behavior.
>>
>> Could you try the latest Spark master to see whether this problem goes
>> away? It has code that limits memory consumption at the master and worker
>> nodes to 128 MB by default, which ideally should not be needed given the
>> amount of RAM on your cluster.
>>
>> Also, feel free to send the DEBUG logs. It might give me a better idea of
>> where the algorithm is getting stuck.
>>
>> -Manish
>>
>>
>>
>> On Wed, Jun 11, 2014 at 1:20 PM, SURAJ SHETH  wrote:
>>
>>> Hi Filipus,
>>> The train data is already oversampled.
>>> The number of positives I mentioned above is for the test dataset :
>>> 12028 (apologies for not making this clear earlier)
>>> The train dataset has 61,264 positives out of 689,763 total rows. The
>>> number of negatives is 628,499.
>>> Oversampling was done for the train dataset to ensure that we have
>>> at least 9-10% of positives in the train part.
>>> No oversampling is done for the test dataset.
>>>
>>> So, the only difference that remains is the amount of data used for
>>> building a tree.
>>>
>>> But, I have a few more questions:
>>> Has anyone tried how much data can be used at most to build a single
>>> Decision Tree?
>>> Since I have enough RAM to fit all the data into memory (only 1.3 GB of
>>> train data and 30x3 GB of RAM), I would expect it to build a single
>>> Decision Tree with all the data without any issues. But, for maxDepth >= 5,
>>> it is not able to. I confirmed that when it keeps running for hours, the
>>> amount of free memory available is more than 70%. So, it doesn't seem to be
>>> a memory issue either.
>>>
>>>
>>> Thanks and Regards,
>>> Suraj Sheth
>>>
>>>
>>> On Wed, Jun 11, 2014 at 10:19 PM, filipus  wrote:
>>>
>>>> well I guess your problem is quite unbalanced, and due to the information
>>>> value as a splitting criterion I guess the algo stops after very few
>>>> splits
>>>>
>>>> a workaround is oversampling
>>>>
>>>> build many training datasets like this:
>>>>
>>>> take randomly 50% of the positives and from the negatives the same amount,
>>>> or let's say double that
>>>>
>>>> => 6000 positives and 12000 negatives
>>>>
>>>> build a tree
>>>>
>>>> do this many times => many models (agents)
>>>>
>>>> and then you make an ensemble model, i.e. let all the models vote
>>>>
>>>> in a way similar to random forest, but put together in a completely
>>>> different way
>>>>
>>>>
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/MLLib-Decision-Tree-not-getting-built-for-5-or-more-levels-maxDepth-5-and-the-one-built-for-3-levelsy-tp7401p7405.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>>
>>>
>>>
>>
>


Re: MLLib : Decision Tree with minimum points per node

2014-06-19 Thread Manish Amde
Hi Justin,

I am glad to know that trees are working well for you.

The trees will support minimum samples per node in a future release. Thanks
for the feedback.

-Manish


On Fri, Jun 13, 2014 at 8:55 PM, Justin Yip  wrote:

> Hello,
>
> I have been playing around with mllib's decision tree library. It is
> working great, thanks.
>
> I have a question regarding overfitting. It appears to me that the current
> implementation doesn't allow users to specify the minimum number of samples
> per node. This results in some nodes containing only very few samples, which
> potentially leads to overfitting.
>
> I would like to know if there is a workaround or any way to prevent
> overfitting. Or will decision trees support min-samples-per-node in future
> releases?
>
> Thanks.
>
> Justin
>
>
>


Re: MLLib : Decision Tree with minimum points per node

2014-06-19 Thread Manish Amde
Hi Justin,

I have created a JIRA ticket to keep track of your request. Thanks.
https://issues.apache.org/jira/browse/SPARK-2207
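Once that lands, usage could look roughly like the following (a sketch only;
the minInstancesPerNode name is taken from the JIRA above, and the surrounding
Strategy fields are assumed from the 1.x API):

    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.tree.DecisionTree
    import org.apache.spark.mllib.tree.configuration.{Algo, Strategy}
    import org.apache.spark.mllib.tree.impurity.Gini
    import org.apache.spark.mllib.tree.model.DecisionTreeModel
    import org.apache.spark.rdd.RDD

    // Sketch: require e.g. 20 training samples per node to limit overfitting
    // (minInstancesPerNode is assumed from SPARK-2207, not yet released).
    def trainWithMinNodeSize(trainingData: RDD[LabeledPoint]): DecisionTreeModel = {
      val strategy = new Strategy(algo = Algo.Classification, impurity = Gini, maxDepth = 10)
      strategy.minInstancesPerNode = 20   // assumed knob
      DecisionTree.train(trainingData, strategy)
    }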

-Manish


On Thu, Jun 19, 2014 at 2:35 PM, Manish Amde  wrote:

> Hi Justin,
>
> I am glad to know that trees are working well for you.
>
> The trees will support minimum samples per node in a future release.
> Thanks for the feedback.
>
> -Manish
>
>
> On Fri, Jun 13, 2014 at 8:55 PM, Justin Yip  wrote:
>
>> Hello,
>>
>> I have been playing around with mllib's decision tree library. It is
>> working great, thanks.
>>
>> I have a question regarding overfitting. It appears to me that the
>> current implementation doesn't allow users to specify the minimum number of
>> samples per node. This results in some nodes containing only very few
>> samples, which potentially leads to overfitting.
>>
>> I would like to know if there is a workaround or any way to prevent
>> overfitting. Or will decision trees support min-samples-per-node in future
>> releases?
>>
>> Thanks.
>>
>> Justin
>>
>>
>>
>


Re: Gradient Boosted Machines

2014-08-05 Thread Manish Amde
Hi Daniel,

Thanks a lot for your interest. Gradient boosting and AdaBoost algorithms
are under active development and should be a part of release 1.2.

-Manish


On Mon, Jul 14, 2014 at 11:24 AM, Daniel Bendavid <
daniel.benda...@creditkarma.com> wrote:

>  Hi,
>
>  My company is strongly considering implementing a recommendation engine
> that is built off of statistical models using Spark.  We attended the Spark
> Summit and were incredibly impressed with the technology and the entire
> community.  Since then, we have been exploring the technology and
> determining how we could use it for our specific needs.
>
>  One algorithm that we ideally want to use as part of our project is
> Gradient Boosted Machines.  We are aware that they have not yet been
> implemented in MLlib and would like to submit our request that they be
> considered for future implementation.  Additionally, we would love to see
> the AdaBoost algorithm implemented in MLlib and feature preprocessing
> implemented in Python (as it already exists for Scala).
>
>  Otherwise, thank you for taking our feedback and for providing us with
> this incredible technology.
>
>  Daniel
>


Re: Anybody built the branch for Adaptive Boosting, extension to MLlib by Manish Amde?

2014-09-18 Thread Manish Amde
Hi Aris,


Thanks for the interest. First and foremost, tree ensembles are a top priority
for the 1.2 release and we are working hard towards it. A random forests PR is
already under review, and AdaBoost and gradient boosting will be added soon
after.

Unfortunately, the GBDT branch I shared is way off master. There have been a
lot of under-the-hood optimizations for decision trees, and I am not surprised
that the branch doesn't compile. It will be best if you could wait a few days
until I make the branch compatible with the latest master.

Again, thanks for your interest in boosting algos. We are eager to add them to
MLlib ASAP.

On Thu, Sep 18, 2014 at 7:27 PM, Aris  wrote:

> Thank you Spark community, you make life much more lovely - suffering in
> silence is not fun!
> I am trying to build the Spark Git branch from Manish Amde, available here:
> https://github.com/manishamde/spark/tree/ada_boost
> I am trying to build the non-master branch 'ada_boost' (in the link above),
> but './sbt/sbt assembly' does not work, as it sees all kinds of new code
> that doesn't build. I saw another script at the top-level called
> 'make-distribution.sh' which requires maven and specifically Java 6 (does
> not allow javac version 7), but that also fails.
> Does anybody have any pointers for building this developmental build of
> Spark with support for adaptive boosting (adaboost ensemble decision tree
> method) in MLlib?
> Thanks!

Re: Status of MLLib exporting models to PMML

2014-11-13 Thread Manish Amde
@Aris, we are closely following the PMML work that is going on and, as
Xiangrui mentioned, it might be easier to migrate models such as logistic
regression first and then migrate trees. Some of the models get fairly large
(as pointed out by Sung Chung), with deep trees as building blocks, and we
might have to consider a distributed storage and prediction strategy.


On Tuesday, November 11, 2014, Xiangrui Meng  wrote:

> Vincenzo sent a PR and included k-means as an example. Sean is helping
> review it. PMML standard is quite large. So we may start with simple
> model export, like linear methods, then move forward to tree-based.
> -Xiangrui
>
> On Mon, Nov 10, 2014 at 11:27 AM, Aris  > wrote:
> > Hello Spark and MLLib folks,
> >
> > So a common problem in the real world of using machine learning is that
> > some data analysts use tools like R, but the more "data engineers" out
> > there will use more advanced systems like Spark MLlib or even Python
> > scikit-learn.
> >
> > In the real world, I want to have "a system" where multiple different
> > modeling environments can learn from data / build models, represent the
> > models in a common language, and then have a layer which just takes the
> > model and run model.predict() all day long -- scores the models in other
> > words.
> >
> > It looks like the project openscoring.io and jpmml-evaluator are some
> > amazing systems for this, but they fundamentally use PMML as the model
> > representation here.
> >
> > I have read some JIRA tickets that Xiangrui Meng is interested in getting
> > PMML implemented to export MLlib models - is that happening? Further, would
> > something like Manish Amde's boosted ensemble tree methods be representable
> > in PMML?
> >
> > Thank you!!
> > Aris
>
>
>


Re: Status of MLLib exporting models to PMML

2014-11-17 Thread Manish Amde
Hi Charles,

I am not aware of other storage formats. Perhaps Sean or Sandy can
elaborate more given their experience with Oryx.

There is work by Smola et al at Google that talks about large scale model
update and deployment.
https://www.usenix.org/conference/osdi14/technical-sessions/presentation/li_mu

-Manish

On Sunday, November 16, 2014, Charles Earl  wrote:

> Manish and others,
> A follow-up question on my mind is whether there are protobuf (or other
> binary format) frameworks in the vein of PMML. Perhaps scientific data
> storage frameworks like NetCDF or ROOT are also possible.
> I like the comprehensiveness of PMML but as you mention the complexity of
> management for large models is a concern.
> Cheers
>
> On Fri, Nov 14, 2014 at 1:35 AM, Manish Amde  > wrote:
>
>> @Aris, we are closely following the PMML work that is going on and, as
>> Xiangrui mentioned, it might be easier to migrate models such as logistic
>> regression first and then migrate trees. Some of the models get fairly large
>> (as pointed out by Sung Chung), with deep trees as building blocks, and we
>> might have to consider a distributed storage and prediction strategy.
>>
>>
>> On Tuesday, November 11, 2014, Xiangrui Meng > > wrote:
>>
>>> Vincenzo sent a PR and included k-means as an example. Sean is helping
>>> review it. PMML standard is quite large. So we may start with simple
>>> model export, like linear methods, then move forward to tree-based.
>>> -Xiangrui
>>>
>>> On Mon, Nov 10, 2014 at 11:27 AM, Aris  wrote:
>>> > Hello Spark and MLLib folks,
>>> >
>>> > So a common problem in the real world of using machine learning is
>>> > that some data analysts use tools like R, but the more "data engineers"
>>> > out there will use more advanced systems like Spark MLlib or even Python
>>> > scikit-learn.
>>> >
>>> > In the real world, I want to have "a system" where multiple different
>>> > modeling environments can learn from data / build models, represent the
>>> > models in a common language, and then have a layer which just takes the
>>> > model and run model.predict() all day long -- scores the models in
>>> > other words.
>>> >
>>> > It looks like the project openscoring.io and jpmml-evaluator are some
>>> > amazing systems for this, but they fundamentally use PMML as the model
>>> > representation here.
>>> >
>>> > I have read some JIRA tickets that Xiangrui Meng is interested in
>>> > getting PMML implemented to export MLlib models - is that happening?
>>> > Further, would something like Manish Amde's boosted ensemble tree methods
>>> > be representable in PMML?
>>> >
>>> > Thank you!!
>>> > Aris
>>>
>>>
>>>
>
>
> --
> - Charles
>


Re: Print Node info. of Decision Tree

2014-12-08 Thread Manish Amde
Hi Jake,

The "toString" method should print the full model in versions 1.1.x.

The current master branch has a "toDebugString" method on DecisionTreeModel
which should print out all the nodes, while the "toString" method has been
updated to print only a summary, so there is a slight change in the upcoming
1.2.x release.
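A quick illustration of the difference (assuming a trained DecisionTreeModel
named model):

    println(model)                 // 1.1.x: full tree; 1.2.x: summary only
    println(model.toDebugString)   // 1.2.x (current master): full per-node if/else structure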

-Manish

On Sun, Dec 7, 2014 at 9:17 PM, jake Lim  wrote:

> How can I print the Node info. of a Decision Tree model?
> I want to navigate and print all information of the Decision Tree model.
> Is there some kind of function/method to support this?
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Print-Node-info-of-Decision-Tree-tp20572.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
>
>