Re: Problem with Generalized Regression Model

2017-01-09 Thread sethah
This likely indicates that the IRLS solver for GLR has encountered a singular
matrix. Can you check if you have linearly dependent columns in your data?
Also, this error message should be fixed in the latest version of Spark (see
https://issues.apache.org/jira/browse/SPARK-11918).
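
One quick way to check for linearly dependent columns (just a sketch with toy
data; in practice you would build the RowMatrix from your own feature vectors)
is to compare the numerical rank of the feature matrix to its number of columns:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Toy feature matrix: the third column is the sum of the first two,
// so the columns are linearly dependent.
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(2.0, 1.0, 3.0),
  Vectors.dense(4.0, 0.0, 4.0)))
val mat = new RowMatrix(rows)

// Numerical rank = number of singular values that are not (close to) zero.
val sv = mat.computeSVD(mat.numCols().toInt).s.toArray
val rank = sv.count(_ > 1e-9 * sv.max)
// rank < numCols indicates linearly dependent columns.
println(s"rank = $rank, numCols = ${mat.numCols()}")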
  






Re: Getting info from DecisionTreeClassificationModel

2015-10-21 Thread sethah
I believe this question will give you the answer you're looking for: Decision
Tree Accuracy.

Basically, you can traverse the tree from the root node.
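
For the spark.ml model specifically, something along these lines should work
for walking the tree (a rough sketch, assuming a trained
DecisionTreeClassificationModel named model):

import org.apache.spark.ml.tree.{InternalNode, LeafNode, Node}

// Recursively visit every node, printing internal splits and leaf predictions.
def traverse(node: Node, depth: Int = 0): Unit = {
  val indent = "  " * depth
  node match {
    case leaf: LeafNode =>
      println(s"${indent}leaf: prediction = ${leaf.prediction}")
    case internal: InternalNode =>
      println(s"${indent}split on feature ${internal.split.featureIndex}")
      traverse(internal.leftChild, depth + 1)
      traverse(internal.rightChild, depth + 1)
  }
}

traverse(model.rootNode)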






Re: Does feature parity exist between Spark and PySpark

2015-10-07 Thread sethah
Regarding features, the general workflow for the Spark community when adding
new features is to first add them in Scala (since Spark is written in
Scala). Once this is done, a Jira ticket will be created requesting that the
feature be added to the Python API (for example, SPARK-9773). Some of these Python
API tickets get done very quickly, some don't. As such, the Scala API will
always be more feature rich from a Spark perspective, while the Python API
can lag behind in some cases. In general, the intent is to make the PySpark
API contain all features of the Scala API, since Python is considered a
first class citizen in the Spark community; the difference is that if you
need the latest and greatest and need it right away, Scala is the best
choice.

Regarding performance, others have said it very eloquently:

https://www.linkedin.com/pulse/why-i-choose-scala-apache-spark-project-lan-jiang
http://stackoverflow.com/questions/17236936/api-compatibility-between-scala-and-python
http://apache-spark-developers-list.1001551.n3.nabble.com/A-Comparison-of-Platforms-for-Implementing-and-Running-Very-Large-Scale-Machine-Learning-Algorithms-td7823.html#a7824






Re: Adding the values in a column of a dataframe

2015-10-02 Thread sethah
import org.apache.spark.sql.functions.sum
df.agg(sum("age")).show()
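
If you need the sum back as a plain Scala value rather than just displayed,
something like this should work (a sketch assuming "age" is an integer column,
whose sum comes back as a Long):

val total = df.agg(sum("age")).first().getLong(0)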







Re: Distance metrics in KMeans

2015-09-25 Thread sethah
It looks like the distance metric is hard-coded to the L2 norm (Euclidean
distance) in MLlib. As you might expect, you are not the first person to want
other metrics, and there has been some prior effort.

Please reference this PR: https://github.com/apache/spark/pull/2634

And the corresponding JIRA: https://issues.apache.org/jira/browse/SPARK-3219

It seems that adding arbitrary distance metrics is non-trivial given the
current implementation in MLlib. I'm not aware of any current work on this
issue.
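
For cosine distance specifically, one common workaround (not a general fix,
just a sketch) is to L2-normalize the vectors before clustering: on unit
vectors, squared Euclidean distance equals 2 * (1 - cosine similarity), so the
built-in metric ranks points the same way cosine distance would.

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.feature.Normalizer
import org.apache.spark.mllib.linalg.Vectors

val raw = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0),
  Vectors.dense(2.0, 4.1),
  Vectors.dense(5.0, 1.0)))

// Scale every vector to unit L2 norm, then cluster with the standard KMeans.
val normalizer = new Normalizer(2.0)
val unitVecs = raw.map(v => normalizer.transform(v))
val model = KMeans.train(unitVecs, 2, 20)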






Re: Sprk RDD : want to combine elements that have approx same keys

2015-09-10 Thread sethah
If you want each key to be combined only once, you can just create a mapping
of keys to a reduced key space. Something like this:

val data = sc.parallelize(Array((0,0.030513227), (1,0.11088216),
(2,0.69165534), (3,0.78524816), (4,0.8516909), (5,0.37751913),
(6,0.05674714), (7,0.27523404), (8,0.40828508), (9,0.9491552)))

data.map { case (k, v) => (k / 3, v) }.reduceByKey(_ + _)

That code buckets the keys into groups of three consecutive keys (0-2, 3-5,
6-8, ...) and then sums each bucket. Could you clarify: if you have the
following keys: [141, 142, 143, 144, 145], do you want disjoint groups like
[(141, 142, 143), (144, 145)], or do you need overlapping groups like
[(141, 142, 143), (142, 143, 144), (143, 144, 145), (144, 145)]?







Re: Spark MLlib Decision Tree Node Accuracy

2015-09-09 Thread sethah
If you are able to traverse the tree, then you can extract the id of the leaf
node for each feature vector. This is like a modified predict method where
it returns the leaf node assigned to the data point instead of the
prediction for that leaf node. The following example code should work: 

import org.apache.spark.mllib.tree.model.Node
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.mllib.tree.configuration.FeatureType._
import org.apache.spark.mllib.linalg.Vector

// Load and parse the data file.
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
// Split the data into training and test sets (30% held out for testing)
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

// Train a DecisionTree model.
//  Empty categoricalFeaturesInfo indicates all features are continuous.
val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()
val impurity = "gini"
val maxDepth = 5
val maxBins = 32

val model = DecisionTree.trainClassifier(trainingData, numClasses,
  categoricalFeaturesInfo, impurity, maxDepth, maxBins)

def predictImpl(node: Node, features: Vector): Node = {
  if (node.isLeaf) {
    node
  } else {
    if (node.split.get.featureType == Continuous) {
      if (features(node.split.get.feature) <= node.split.get.threshold) {
        predictImpl(node.leftNode.get, features)
      } else {
        predictImpl(node.rightNode.get, features)
      }
    } else {
      if (node.split.get.categories.contains(features(node.split.get.feature))) {
        predictImpl(node.leftNode.get, features)
      } else {
        predictImpl(node.rightNode.get, features)
      }
    }
  }
}

val nodeIDAndPredsAndLabels = data.map { lp => 
  val node = predictImpl(model.topNode, lp.features)
  (node.id, (node.predict.predict, lp.label))
}

From here, you should be able to analyze the accuracy of each leaf node.
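
For example, here is a rough sketch of per-leaf accuracy, continuing from the
nodeIDAndPredsAndLabels RDD above:

// For each leaf id, count how often the leaf's prediction matches the true label.
val leafAccuracy = nodeIDAndPredsAndLabels
  .mapValues { case (pred, label) => (if (pred == label) 1L else 0L, 1L) }
  .reduceByKey { (a, b) => (a._1 + b._1, a._2 + b._2) }
  .mapValues { case (correct, total) => correct.toDouble / total }

leafAccuracy.collect().foreach { case (nodeId, acc) =>
  println(s"leaf $nodeId: accuracy = $acc")
}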

Note that the new Spark ML library implements a predictNodeIndex method (which
is being converted to a predictImpl method), similar to the implementation
above. Hopefully that code helps.






Re: Does Spark.ml LogisticRegression assumes only Double valued features?

2015-09-09 Thread sethah
When you pass a DataFrame into the train method of LogisticRegression and
other ML learning algorithms, the data is extracted using the `labelCol` and
`featuresCol` parameters, which should have been set before calling the train
method (they default to "label" and "features", respectively). `featuresCol`
should be a column of type Vector containing Doubles. When the train method is
called, it verifies that `featuresCol` is of type Vector and that `labelCol`
is of type Double, and it will throw an exception if other data types are
found.

Spark ML has special ways of handling features that are not inherently
continuous or numerical. I urge you to review this question on StackOverflow
which covers it quite well:

http://stackoverflow.com/questions/32277576/spark-ml-categorical-features
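
For a concrete illustration (just a sketch with made-up column names "country",
"age", and "label"), a typical approach is to index and encode the categorical
column, assemble everything into a single vector column, and then fit the model:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}

// Map the string categories to indices, then one-hot encode them.
val indexer = new StringIndexer().setInputCol("country").setOutputCol("countryIndex")
val encoder = new OneHotEncoder().setInputCol("countryIndex").setOutputCol("countryVec")
// Combine the encoded categorical column and the numeric column into "features".
val assembler = new VectorAssembler()
  .setInputCols(Array("countryVec", "age"))
  .setOutputCol("features")
val lr = new LogisticRegression().setLabelCol("label").setFeaturesCol("features")

val pipeline = new Pipeline().setStages(Array(indexer, encoder, assembler, lr))
// trainingDF is assumed to have "country" (string), "age" (numeric), and "label" (double) columns.
val model = pipeline.fit(trainingDF)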


