Garbage stats in Random Forest leaf node?

2015-03-16 Thread cjwang
I dumped the trees in the random forest model, and occasionally saw a leaf node with strange stats: - pred=1.00 prob=0.80 imp=-1.00
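
A minimal sketch of producing this kind of tree dump with MLlib; the data path and training parameters below are assumptions, not taken from the original post:

```scala
import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.util.MLUtils

val data = MLUtils.loadLibSVMFile(sc, "data/sample_libsvm_data.txt")  // hypothetical path
val model = RandomForest.trainClassifier(
  data, numClasses = 2, categoricalFeaturesInfo = Map[Int, Int](),
  numTrees = 10, featureSubsetStrategy = "auto",
  impurity = "gini", maxDepth = 5, maxBins = 32)

// toDebugString prints every tree; per-leaf predictions show up at the deepest nodes.
println(model.toDebugString)
```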

LogisticRegressionWithLBFGS shows ERRORs

2015-03-13 Thread cjwang
I am running LogisticRegressionWithLBFGS. I got these lines on my console: 2015-03-12 17:38:03,897 ERROR breeze.optimize.StrongWolfeLineSearch | Encountered bad values in function evaluation. Decreasing step size to 0.5 2015-03-12 17:38:03,967 ERROR breeze.optimize.StrongWolfeLineSearch |
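
The StrongWolfeLineSearch messages come from breeze's line search hitting bad function values. A sketch of the training call with feature standardization first, which is a common mitigation but only an assumption here, not a fix confirmed by this thread:

```scala
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

def trainScaled(training: RDD[LabeledPoint]) = {
  // Standardize features to zero mean and unit variance before optimizing.
  val scaler = new StandardScaler(withMean = true, withStd = true)
    .fit(training.map(_.features))
  val scaled = training.map(p => LabeledPoint(p.label, scaler.transform(p.features))).cache()

  new LogisticRegressionWithLBFGS()
    .setNumClasses(2)
    .run(scaled)
}
```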

Logistic Regression displays ERRORs

2015-03-12 Thread cjwang
I am running LogisticRegressionWithLBFGS. I got these lines on my console: 2015-03-12 17:38:03,897 ERROR breeze.optimize.StrongWolfeLineSearch | Encountered bad values in function evaluation. Decreasing step size to 0.5 2015-03-12 17:38:03,967 ERROR breeze.optimize.StrongWolfeLineSearch |

Extra output from Spark run

2015-03-04 Thread cjwang
When I run Spark 1.2.1, I see output that wasn't there in previous releases: [Stage 12:= (6 + 1) / 16] [Stage 12:(8 + 1) / 16] [Stage 12:==
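
Those [Stage N: ...] lines are the console progress bar added in Spark 1.2. A sketch of turning it off via spark.ui.showConsoleProgress, based on that setting's documented purpose; it has to be set before the SparkContext is created:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("quiet-console")
  .set("spark.ui.showConsoleProgress", "false")  // suppress the [Stage ...] console bar
val sc = new SparkContext(conf)
```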

Complexity/Efficiency of SortByKey

2014-09-15 Thread cjwang
I wonder what algorithm is used to implement sortByKey? I assume it is some O(n*log(n)) sort parallelized across x nodes, right? Then, what size of data would make it worthwhile to use sortByKey on multiple processors rather than standard Scala sort functions on a single processor
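
sortByKey is a distributed sort: it samples the keys to build range partitions, shuffles, and then sorts each partition locally, so the total work is roughly O(n*log(n)) split across partitions. A hedged sketch contrasting it with a local sort; the sample data and partition count are made up for illustration:

```scala
import org.apache.spark.rdd.RDD

val pairs: RDD[(String, Int)] = sc.parallelize(Seq("b" -> 2, "a" -> 1, "c" -> 3))

// Distributed: range-partition by sampled key ranges, then sort within each partition.
val distributedSorted = pairs.sortByKey(ascending = true, numPartitions = 4)

// Local: for data that fits comfortably on one machine, a plain Scala sort avoids the shuffle.
val localSorted = pairs.collect().sortBy(_._1)
```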

Creating an RDD in another RDD causes deadlock

2014-09-02 Thread cjwang
My code seemed to deadlock when I tried to do this: object MoreRdd extends Serializable { def apply(i: Int) = { val rdd2 = sc.parallelize(0 to 10) rdd2.map(j => i*10 + j).collect } } val rdd1 = sc.parallelize(0 to 10) val y = rdd1.map(i =>
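
RDD operations cannot be nested: a closure that runs on executors must not create or use other RDDs or the SparkContext. A sketch of restructuring the example so only the driver touches RDDs:

```scala
// Only the driver creates RDDs; the inner loop becomes a plain local collection.
val rdd1 = sc.parallelize(0 to 10)
val y = rdd1.map(i => (0 to 10).map(j => i * 10 + j)).collect()

// Or keep both sides distributed without nesting by using cartesian.
val rdd2 = sc.parallelize(0 to 10)
val z = rdd1.cartesian(rdd2).map { case (i, j) => i * 10 + j }.collect()
```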

Re: Creating an RDD in another RDD causes deadlock

2014-09-02 Thread cjwang
I didn't know this restriction. Thank you.

What is the better data structure in an RDD

2014-08-29 Thread cjwang
I need some advice regarding how data are stored in an RDD. I have millions of records, called Measures. They are bucketed with keys of String type. I wonder if I need to store them as RDD[(String, Measure)] or RDD[(String, Iterable[Measure])], and why? Data in each bucket are not related
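
A hedged sketch of the two layouts; Measure here is a placeholder case class, not the poster's actual type. The bucketed form can be derived from the flat form on demand, and when only per-bucket aggregates are needed, reduceByKey on the flat form avoids materializing whole buckets:

```scala
import org.apache.spark.rdd.RDD

case class Measure(value: Double)  // placeholder type for illustration

val flat: RDD[(String, Measure)] = sc.parallelize(Seq(
  ("bucketA", Measure(1.0)), ("bucketA", Measure(2.0)), ("bucketB", Measure(3.0))))

// Bucketed layout, derivable from the flat layout when a whole bucket is needed.
val bucketed: RDD[(String, Iterable[Measure])] = flat.groupByKey()

// Per-bucket aggregate without building whole buckets in memory.
val totals: RDD[(String, Double)] = flat.mapValues(_.value).reduceByKey(_ + _)
```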

Re: Finding previous and next element in a sorted RDD

2014-08-22 Thread cjwang
It would be nice if an RDD that was massaged by OrderedRDDFunctions could know its neighbors.

Finding previous and next element in a sorted RDD

2014-08-21 Thread cjwang
I have an RDD containing elements sorted in a certain order. I would like to map over the elements knowing the values of their respective previous and next elements. With a regular List, I used to do this: (input is a List below) // The first of the previous measures and the last of the next
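
The List version above is cut off; a hedged reconstruction of that local-collection idea, padding with None and taking sliding windows of three so every element sees its neighbors:

```scala
val input: List[Int] = List(1, 3, 5, 7)  // made-up sample data

val padded = None +: input.map(Option(_)) :+ None
val withNeighbors = padded.sliding(3).collect {
  case List(prev, Some(cur), next) => (prev, cur, next)
}.toList
// => List((None,1,Some(3)), (Some(1),3,Some(5)), (Some(3),5,Some(7)), (Some(5),7,None))
```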

Re: Finding previous and next element in a sorted RDD

2014-08-21 Thread cjwang
One way is to do zipWithIndex on the RDD. Then use the index as a key. Add or subtract 1 for previous or next element. Then use cogroup or join to bind them together. val idx = input.zipWithIndex val previous = idx.map(x => (x._2+1, x._1)) val current = idx.map(x => (x._2, x._1)) val next =
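
The snippet above is truncated; a hedged completion of the same index-shift-and-join idea (the sample input is made up):

```scala
// Index the RDD, shift the index, then join so each element is paired with its neighbors.
val input = sc.parallelize(Seq("a", "b", "c", "d"))    // made-up sample data

val idx = input.zipWithIndex()                         // RDD[(String, Long)]
val current  = idx.map { case (v, i) => (i, v) }
val previous = idx.map { case (v, i) => (i + 1, v) }   // v becomes the "previous" of i+1
val next     = idx.map { case (v, i) => (i - 1, v) }   // v becomes the "next" of i-1

val withNeighbors = current
  .leftOuterJoin(previous)
  .leftOuterJoin(next)
  .map { case (_, ((cur, prevOpt), nextOpt)) => (prevOpt, cur, nextOpt) }
```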

Re: SparkR failed to connect to the master

2014-07-14 Thread cjwang
I restarted Spark Master with spark-0.9.1 and SparkR was able to communicate with the Master. I am using the latest SparkR pkg-e1f95b6. Maybe it has a problem communicating with Spark 1.0.0?

Re: SparkR failed to connect to the master

2014-07-14 Thread cjwang
I tried installing the latest Spark 1.0.1 and SparkR couldn't find the master either. I restarted with Spark 0.9.1 and SparkR was able to find the master. So, there seemed to be something that changed after Spark 1.0.0.

Re: --cores option in spark-shell

2014-07-14 Thread cjwang
They don't work in the new 1.0.1 either.

Re: executor failed, cannot find compute-classpath.sh

2014-07-10 Thread cjwang
Not sure that was what I wanted. I tried to run Spark Shell on a machine other than the master and got the same error. The 192 was supposed to be a simple shell script change that alters SPARK_HOME before submitting jobs. Too bad it wasn't there anymore. The build described in the pull request

SparkR failed to connect to the master

2014-07-10 Thread cjwang
I have a cluster running. I was able to run Spark Shell and submit programs. But when I tried to use SparkR, I got these errors: wifi-orcus:sparkR cwang$ MASTER=spark://wifi-orcus.dhcp.carrieriq.com:7077 sparkR R version 3.1.0 (2014-04-10) -- Spring Dance Copyright (C) 2014 The R Foundation

Re: executor failed, cannot find compute-classpath.sh

2014-07-10 Thread cjwang
Andrew, Thanks for replying. I did the following and the result was still the same. 1. Added spark.home /root/spark-1.0.0 to local conf/spark-defaults.conf, where /root was the place in the cluster where I put Spark. 2. Ran bin/spark-shell --master