GradientBoostedTrees.trainRegressor with categoricalFeaturesInfo

2015-05-20 Thread Don Drake
I'm running Spark v1.3.1 and when I run the following against my dataset:

model = GradientBoostedTrees.trainRegressor(trainingData,
    categoricalFeaturesInfo=catFeatures, maxDepth=6, numIterations=3)

The job will fail with the following message:
Traceback (most recent call last):
  File "/Users/drake/fd/spark/mltest.py", line 73, in <module>
    model = GradientBoostedTrees.trainRegressor(trainingData, categoricalFeaturesInfo=catFeatures, maxDepth=6, numIterations=3)
  File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/mllib/tree.py", line 553, in trainRegressor
    loss, numIterations, learningRate, maxDepth)
  File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/mllib/tree.py", line 438, in _train
    loss, numIterations, learningRate, maxDepth)
  File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/mllib/common.py", line 120, in callMLlibFunc
    return callJavaFunc(sc, api, *args)
  File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/pyspark/mllib/common.py", line 113, in callJavaFunc
    return _java2py(sc, func(*args))
  File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
  File "/Users/drake/spark/spark-1.3.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
15/05/20 16:40:12 INFO BlockManager: Removing block rdd_32_95
py4j.protocol.Py4JJavaError: An error occurred while calling o69.trainGradientBoostedTreesModel.
: java.lang.IllegalArgumentException: requirement failed: DecisionTree requires maxBins (= 32) >= max categories in categorical features (= 1895)
    at scala.Predef$.require(Predef.scala:233)
    at org.apache.spark.mllib.tree.impl.DecisionTreeMetadata$.buildMetadata(DecisionTreeMetadata.scala:128)
    at org.apache.spark.mllib.tree.RandomForest.run(RandomForest.scala:138)
    at org.apache.spark.mllib.tree.DecisionTree.run(DecisionTree.scala:60)
    at org.apache.spark.mllib.tree.GradientBoostedTrees$.org$apache$spark$mllib$tree$GradientBoostedTrees$$boost(GradientBoostedTrees.scala:150)
    at org.apache.spark.mllib.tree.GradientBoostedTrees.run(GradientBoostedTrees.scala:63)
    at org.apache.spark.mllib.tree.GradientBoostedTrees$.train(GradientBoostedTrees.scala:96)
    at org.apache.spark.mllib.api.python.PythonMLLibAPI.trainGradientBoostedTreesModel(PythonMLLibAPI.scala:595)

So it's complaining about maxBins. If I provide maxBins=1900 and re-run it:

model = GradientBoostedTrees.trainRegressor(trainingData,
    categoricalFeaturesInfo=catFeatures, maxDepth=6, numIterations=3, maxBins=1900)

Traceback (most recent call last):
  File "/Users/drake/fd/spark/mltest.py", line 73, in <module>
    model = GradientBoostedTrees.trainRegressor(trainingData, categoricalFeaturesInfo=catFeatures, maxDepth=6, numIterations=3, maxBins=1900)
TypeError: trainRegressor() got an unexpected keyword argument 'maxBins'

It now says it knows nothing of maxBins.

If I run the same command against DecisionTree or RandomForest (with
maxBins=1900) it works just fine.
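For reference, here is roughly the RandomForest version of that call, which does
accept maxBins (a sketch based on the snippet above; the numTrees and
featureSubsetStrategy values are just illustrative):

from pyspark.mllib.tree import RandomForest

# Same training data and categorical feature map as above; only the
# estimator changes, and maxBins is accepted here.
rf_model = RandomForest.trainRegressor(trainingData,
    categoricalFeaturesInfo=catFeatures, numTrees=10,
    featureSubsetStrategy="auto", maxDepth=6, maxBins=1900)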

Seems like a bug in GradientBoostedTrees.

Suggestions?

-Don

-- 
Donald Drake
Drake Consulting
http://www.drakeconsulting.com/
800-733-2143


Re: GradientBoostedTrees.trainRegressor with categoricalFeaturesInfo

2015-05-20 Thread Burak Yavuz
Could you please open a JIRA for it? The maxBins parameter is missing from the
Python API.

Would it be possible for you to use the current master? On the current master,
you should be able to use trees with the Pipeline API and DataFrames.
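Roughly something like this should work there (just a sketch, not tested; the
exact class and parameter names on master may differ, and maxCategories/maxBins
would need to be at least as large as your largest category count):

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.regression import GBTRegressor

# Index categorical features inside the vector column; anything with more
# distinct values than maxCategories is treated as continuous.
indexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures",
                        maxCategories=1900)
gbt = GBTRegressor(featuresCol="indexedFeatures", labelCol="label",
                   maxDepth=6, maxIter=3, maxBins=1900)

pipeline = Pipeline(stages=[indexer, gbt])
model = pipeline.fit(trainingDF)  # trainingDF: a DataFrame with label/features columns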

Best,
Burak



Re: GradientBoostedTrees.trainRegressor with categoricalFeaturesInfo

2015-05-20 Thread Joseph Bradley
One more comment: that's a lot of categories for a single feature. If it makes
sense for your data, training will run faster if you can group the categories or
split the 1895 categories into a few features with fewer categories each.
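For example (purely a sketch; the RDD, column name, and the cutoff of 50 are
made up), you could keep only the most frequent values and collapse the rest
into a single "other" bucket before building your LabeledPoints:

TOP_K = 50  # arbitrary cutoff for how many distinct values to keep

# Count how often each category value occurs (raw_rdd / row.category are placeholders).
counts = raw_rdd.map(lambda row: (row.category, 1)).reduceByKey(lambda a, b: a + b)
top_values = [v for v, _ in sorted(counts.collect(), key=lambda kv: -kv[1])[:TOP_K]]
value_to_index = {v: i for i, v in enumerate(top_values)}
other_index = len(value_to_index)  # every remaining value maps here

def encode_category(value):
    return value_to_index.get(value, other_index)

# The encoded feature now has at most TOP_K + 1 categories, so
# categoricalFeaturesInfo = {feature_index: TOP_K + 1} and the default
# maxBins (or a modest increase) is enough.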



Re: GradientBoostedTrees.trainRegressor with categoricalFeaturesInfo

2015-05-20 Thread Don Drake
JIRA created: https://issues.apache.org/jira/browse/SPARK-7781

Joseph, I agree. I'm debating removing this feature altogether, but for now I'm
putting the model through its paces.

Thanks.

-Don



-- 
Donald Drake
Drake Consulting
http://www.drakeconsulting.com/
http://www.MailLaunder.com/
http://www.DrudgeSiren.com/
http://plu.gd/
800-733-2143