Re: [Help]: DataframeNAfunction fill method throwing exception

2016-03-01 Thread ai he
Hi Divya, I guess the error is thrown from spark-csv. Spark-csv tries to parse the string "null" as a double. The workaround is to add the nullValue option, like .option("nullValue", "null"). But this nullValue feature is not included in the current spark-csv 1.3. Just check out the master branch of spark-csv and use
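The failure mode described above can be sketched without Spark. This is only an illustration in plain Python, not spark-csv's implementation: `parse_double` and its `null_value` parameter are hypothetical names standing in for a numeric column parser and the nullValue option.

```python
from typing import Optional


def parse_double(cell: str, null_value: Optional[str] = None) -> Optional[float]:
    """Parse one CSV cell as a float, treating `null_value` as missing.

    Hypothetical helper; it only mimics the idea behind a null sentinel.
    """
    if null_value is not None and cell == null_value:
        return None  # configured sentinel: report a missing value
    return float(cell)  # raises ValueError for text such as "null"


cells = ["1.5", "null", "3.0"]

# Without a sentinel, the literal text "null" cannot be parsed as a number.
try:
    [parse_double(c) for c in cells]
    failed = False
except ValueError:
    failed = True

# With the sentinel configured, "null" becomes a missing value instead.
parsed = [parse_double(c, null_value="null") for c in cells]
```

Once the sentinel is mapped to a missing value, a later fill step has real nulls to replace rather than unparseable text.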

Re: Re: Job aborted due to stage failure: java.lang.StringIndexOutOfBoundsException: String index out of range: 18

2015-08-28 Thread ai he
Hi Ricky, In your first try, you are using flatMap. It will give you a flat list of strings. Then you are trying to map each string to a Row, which definitely throws an exception. Following Terry's idea, you are mapping the input to a list of arrays, each of which contains some strings. Then you
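The difference in shape between the two approaches can be sketched with plain Python lists instead of RDDs. The sample lines and the `to_row` helper are assumptions made for illustration; they are not taken from Ricky's job.

```python
from itertools import chain

lines = ["a,b,c", "d,e,f"]

# flatMap-style: split every line, then flatten into one list of strings.
flat = list(chain.from_iterable(line.split(",") for line in lines))

# map-style: split every line, keeping one list of fields per input line.
nested = [line.split(",") for line in lines]


def to_row(fields):
    """Hypothetical row builder that expects exactly three fields."""
    first, second, third = fields
    return (first, second, third)


# Works: each element of `nested` is a full record of three fields.
rows = [to_row(fields) for fields in nested]

# Fails: each element of `flat` is a single string, not a record.
try:
    to_row(flat[0])
    flat_failed = False
except ValueError:
    flat_failed = True
```

The flattened version loses the record boundaries, so any per-record conversion has to run on the nested form.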

Re: Sporadic Input validation failed error when executing LogisticRegressionWithLBFGS.train

2015-08-11 Thread ai he
Hi Francis, In my experience with Spark SQL, dataframe.limit(n) does not necessarily return the same result on each run of an application. More precisely, within one application the result should be the same for the same n; however, changing n might not yield the same prefix (the result for n =

Re: question about the TFIDF.

2015-05-07 Thread ai he
Hi Dan, In https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/feature/HashingTF.scala, you can see that Spark uses Utils.nonNegativeMod(term.##, numFeatures) to locate a term. It's also mentioned in the doc that it Maps a sequence of terms to their term frequencies
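The indexing idea can be sketched in plain Python. This is not Spark's code: `zlib.crc32` is a deterministic stand-in for Scala's `term.##`, so the indices will differ from what HashingTF produces, and `term_index` and `hashing_tf` are hypothetical names.

```python
import zlib
from collections import Counter


def non_negative_mod(x: int, mod: int) -> int:
    """Return x modulo mod, always in the range [0, mod)."""
    return ((x % mod) + mod) % mod


def term_index(term: str, num_features: int) -> int:
    # zlib.crc32 is a stand-in hash; Spark hashes with term.## instead.
    return non_negative_mod(zlib.crc32(term.encode("utf-8")), num_features)


def hashing_tf(terms, num_features: int = 16):
    """Count how often each hashed index occurs in a sequence of terms."""
    return Counter(term_index(t, num_features) for t in terms)


tf = hashing_tf(["spark", "hash", "spark"])
```

Because the index comes from a hash, two different terms can land on the same index; a larger number of features makes such collisions less likely.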

Re: MLLib SVMWithSGD is failing for large dataset

2015-04-28 Thread ai he
Hi Sarath, Setting num-executors to 64 is questionable if you only have 8 nodes. Do you use any action like collect, which would overwhelm the driver since you have a large dataset? Thanks On Tue, Apr 28, 2015 at 10:50 AM, sarath sarathkrishn...@gmail.com wrote: I am trying to train a