Re: Problem with Generalized Regression Model

2017-01-09 Thread sethah
This likely indicates that the IRLS solver for GLR has encountered a singular matrix. Can you check if you have linearly dependent columns in your data? Also, this error message should be fixed in the latest version of Spark, after: https://issues.apache.org/jira/browse/SPARK-11918
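One way to check for linearly dependent columns on a small local sample is to compare the numerical rank of the data matrix with its column count. The sketch below is a hypothetical helper, not part of Spark: `ColumnRank`, `rank`, and `hasDependentColumns` are names made up for illustration, and the absolute tolerance assumes features on a roughly unit scale.

```scala
object ColumnRank {
  // Numerical rank of a dense row-major matrix, computed by Gaussian
  // elimination with partial pivoting. `tol` is an assumed absolute
  // threshold below which a pivot is treated as zero.
  def rank(rows: Array[Array[Double]], tol: Double = 1e-9): Int = {
    val a = rows.map(_.clone) // work on a copy so the input is not mutated
    val nRows = a.length
    val nCols = if (nRows == 0) 0 else a(0).length
    var pivotsFound = 0
    for (col <- 0 until nCols) {
      if (pivotsFound < nRows) {
        // Among the rows not yet used as pivots, pick the largest entry in this column.
        val pivot = (pivotsFound until nRows).maxBy(r => math.abs(a(r)(col)))
        if (math.abs(a(pivot)(col)) > tol) {
          val tmp = a(pivot); a(pivot) = a(pivotsFound); a(pivotsFound) = tmp
          // Eliminate this column from every row below the pivot row.
          for (r <- pivotsFound + 1 until nRows) {
            val factor = a(r)(col) / a(pivotsFound)(col)
            for (c <- col until nCols) a(r)(c) -= factor * a(pivotsFound)(c)
          }
          pivotsFound += 1
        }
      }
    }
    pivotsFound
  }

  // True when at least one column is a linear combination of the others.
  def hasDependentColumns(rows: Array[Array[Double]]): Boolean =
    rows.nonEmpty && rank(rows) < rows(0).length
}
```

For example, a matrix whose third column is the sum of the first two has rank 2 rather than 3, so `hasDependentColumns` returns `true` for it.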

Re: Getting info from DecisionTreeClassificationModel

2015-10-21 Thread sethah
I believe this question will give you the answer you're looking for: Decision Tree Accuracy. Basically, you can traverse the tree from the root node.

Re: Does feature parity exist between Spark and PySpark

2015-10-07 Thread sethah
Regarding features, the general workflow for the Spark community when adding new features is to first add them in Scala (since Spark is written in Scala). Once this is done, a Jira ticket will be created requesting that the feature be added to the Python API (example - SPARK-9773

Re: Adding the values in a column of a dataframe

2015-10-02 Thread sethah
df.agg(sum("age")).show()

Re: Distance metrics in KMeans

2015-09-25 Thread sethah
It looks like the distance metric is hard-coded to the L2 norm (Euclidean distance) in MLlib. As you might expect, you are not the first person to want other metrics, and there has been some prior effort. Please reference this PR: https://github.com/apache/spark/pull/2634 And corresponding JIRA:

Re: Sprk RDD : want to combine elements that have approx same keys

2015-09-10 Thread sethah
If you want each key to be combined only once, you can just create a mapping of keys to a reduced key space. Something like this: val data = sc.parallelize(Array((0,0.030513227), (1,0.11088216), (2,0.69165534), (3,0.78524816), (4,0.8516909), (5,0.37751913), (6,0.05674714), (7,0.27523404),
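The idea of mapping keys to a reduced key space can be sketched with plain Scala collections standing in for the RDD. This is not the original example, which is cut off above: the fixed-width bucketing rule, the `width` parameter, and the names `ReducedKeys`, `reducedKey`, and `combine` are assumptions made for illustration. On a pair RDD, the same shape would be a `map` to the reduced key followed by `reduceByKey(_ + _)`.

```scala
object ReducedKeys {
  // Map an original key to a coarser key. Keys that fall in the same
  // fixed-width bucket share one reduced key (assumed bucketing rule).
  def reducedKey(key: Int, width: Int): Int = Math.floorDiv(key, width)

  // Re-key every pair, then sum the values that share a reduced key.
  def combine(data: Seq[(Int, Double)], width: Int): Map[Int, Double] =
    data
      .map { case (k, v) => (reducedKey(k, width), v) }
      .groupBy(_._1)
      .map { case (k, pairs) => k -> pairs.map(_._2).sum }
}
```

Because each original key maps to exactly one reduced key, every element is combined exactly once.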

Re: Spark MLlib Decision Tree Node Accuracy

2015-09-09 Thread sethah
If you are able to traverse the tree, then you can extract the id of the leaf node for each feature vector. This is like a modified predict method where it returns the leaf node assigned to the data point instead of the prediction for that leaf node. The following example code should work:
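The original example code is not included in this excerpt. As a rough illustration of the idea only, here is a sketch over a hypothetical tree structure: `TreeNode`, `Leaf`, `Internal`, and `LeafLookup` are made up for this example and are not MLlib classes, and the rule that a feature value at or below the threshold goes left is an assumed convention for continuous splits.

```scala
import scala.annotation.tailrec

// Hypothetical tree types for illustration only; these are not MLlib classes.
sealed trait TreeNode { def id: Int }
final case class Leaf(id: Int, prediction: Double) extends TreeNode
final case class Internal(
    id: Int,
    featureIndex: Int,
    threshold: Double,
    left: TreeNode,
    right: TreeNode
) extends TreeNode

object LeafLookup {
  // Walk from the given node down to the leaf that the feature vector
  // falls into, and return that leaf's id instead of its prediction.
  @tailrec
  def leafId(node: TreeNode, features: Array[Double]): Int = node match {
    case Leaf(id, _) => id
    case Internal(_, featureIndex, threshold, left, right) =>
      // Assumed convention: values at or below the threshold go left.
      if (features(featureIndex) <= threshold) leafId(left, features)
      else leafId(right, features)
  }
}
```

Grouping the data points by the returned leaf id then makes it possible to compute per-leaf statistics, such as the accuracy within each leaf.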

Re: Does Spark.ml LogisticRegression assumes only Double valued features?

2015-09-09 Thread sethah
When you pass a DataFrame into the train method of LogisticRegression and other ML algorithms, the data is extracted using the parameters `labelCol` and `featuresCol`, which should be set before calling the train method (they default to "label" and "features", respectively).