Re: scala.MatchError while doing BinaryClassificationMetrics

2016-11-15 Thread Bhaarat Sharma
Thanks, Nick.

This worked for me.

val evaluator = new BinaryClassificationEvaluator().
setLabelCol("label").
setRawPredictionCol("ModelProbability").
setMetricName("areaUnderROC")

val auROC = evaluator.evaluate(testResults)
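
For completeness, the same setup can also report area under the
precision-recall curve by switching the metric name; a minimal variation on
the snippet above (nothing else changes):

val evaluatorPR = new BinaryClassificationEvaluator().
setLabelCol("label").
setRawPredictionCol("ModelProbability").
setMetricName("areaUnderPR")

val auPR = evaluatorPR.evaluate(testResults)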

On Mon, Nov 14, 2016 at 4:00 PM, Nick Pentreath <nick.pentre...@gmail.com>
wrote:

> Typically you pass in the result of a model transform to the evaluator.
>
> So:
> val model = estimator.fit(data)
> val auc = evaluator.evaluate(model.transform(testData))
>
> Check Scala API docs for some details:
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
>
> On Mon, 14 Nov 2016 at 20:02 Bhaarat Sharma <bhaara...@gmail.com> wrote:
>
> Can you please suggest how I can use BinaryClassificationEvaluator? I
> tried:
>
> scala> import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
> import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
>
> scala>  val evaluator = new BinaryClassificationEvaluator()
> evaluator: org.apache.spark.ml.evaluation.BinaryClassificationEvaluator =
> binEval_0d57372b7579
>
> Try 1:
>
> scala> evaluator.evaluate(testScoreAndLabel.rdd)
> :105: error: type mismatch;
>  found   : org.apache.spark.rdd.RDD[(Double, Double)]
>  required: org.apache.spark.sql.Dataset[_]
>evaluator.evaluate(testScoreAndLabel.rdd)
>
> Try 2:
>
> scala> evaluator.evaluate(testScoreAndLabel)
> java.lang.IllegalArgumentException: Field "rawPrediction" does not exist.
>   at org.apache.spark.sql.types.StructType$$anonfun$apply$1.
> apply(StructType.scala:228)
>
> Try 3:
>
> scala> evaluator.evaluate(testScoreAndLabel.select("Label","ModelProbability"))
> org.apache.spark.sql.AnalysisException: cannot resolve '`Label`' given
> input columns: [_1, _2];
>   at org.apache.spark.sql.catalyst.analysis.package$
> AnalysisErrorAt.failAnalysis(package.scala:42)
>
>
> On Mon, Nov 14, 2016 at 1:44 PM, Nick Pentreath <nick.pentre...@gmail.com>
> wrote:
>
> DataFrame.rdd returns an RDD[Row]. You'll need to use map to extract the
> doubles from the test score and label DF.
>
> But you may prefer to just use spark.ml evaluators, which work with
> DataFrames. Try BinaryClassificationEvaluator.
>
> On Mon, 14 Nov 2016 at 19:30, Bhaarat Sharma <bhaara...@gmail.com> wrote:
>
> I am getting scala.MatchError in the code below. I'm not able to see why
> this would be happening. I am using Spark 2.0.1
>
> scala> testResults.columns
> res538: Array[String] = Array(TopicVector, subject_id, hadm_id, isElective, 
> isNewborn, isUrgent, isEmergency, isMale, isFemale, oasis_score, 
> sapsii_score, sofa_score, age, hosp_death, test, ModelFeatures, Label, 
> rawPrediction, ModelProbability, ModelPrediction)
>
> scala> testResults.select("Label","ModelProbability").take(1)
> res542: Array[org.apache.spark.sql.Row] = 
> Array([0.0,[0.737304818744076,0.262695181255924]])
>
> scala> val testScoreAndLabel = testResults.
>  | select("Label","ModelProbability").
>  | map { case Row(l:Double, p:Vector) => (p(1), l) }
> testScoreAndLabel: org.apache.spark.sql.Dataset[(Double, Double)] = [_1: 
> double, _2: double]
>
> scala> testScoreAndLabel
> res539: org.apache.spark.sql.Dataset[(Double, Double)] = [_1: double, _2: 
> double]
>
> scala> testScoreAndLabel.columns
> res540: Array[String] = Array(_1, _2)
>
> scala> val testMetrics = new 
> BinaryClassificationMetrics(testScoreAndLabel.rdd)
> testMetrics: org.apache.spark.mllib.evaluation.BinaryClassificationMetrics = 
> org.apache.spark.mllib.evaluation.BinaryClassificationMetrics@36e780d1
>
> The code below gives the error
>
> val auROC = testMetrics.areaUnderROC() //this line gives the error
>
> Caused by: scala.MatchError: [0.0,[0.7316583497453766,0.2683416502546234]] 
> (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
>
>
>


Re: scala.MatchError while doing BinaryClassificationMetrics

2016-11-14 Thread Nick Pentreath
Typically you pass in the result of a model transform to the evaluator.

So:
val model = estimator.fit(data)
val auc = evaluator.evaluate(model.transform(testData))

Check Scala API docs for some details:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
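
A minimal end-to-end sketch of that flow (the estimator and the trainingData /
testData names here are illustrative placeholders, assuming DataFrames with
"label" and "features" columns; this is not the original code):

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

val lr = new LogisticRegression().setLabelCol("label").setFeaturesCol("features")
val model = lr.fit(trainingData)             // fit on the training split
val predictions = model.transform(testData)  // adds rawPrediction, probability, prediction

val evaluator = new BinaryClassificationEvaluator().
setLabelCol("label").
setRawPredictionCol("rawPrediction").
setMetricName("areaUnderROC")

val auc = evaluator.evaluate(predictions)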

On Mon, 14 Nov 2016 at 20:02 Bhaarat Sharma <bhaara...@gmail.com> wrote:

Can you please suggest how I can use BinaryClassificationEvaluator? I tried:

scala> import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

scala>  val evaluator = new BinaryClassificationEvaluator()
evaluator: org.apache.spark.ml.evaluation.BinaryClassificationEvaluator =
binEval_0d57372b7579

Try 1:

scala> evaluator.evaluate(testScoreAndLabel.rdd)
:105: error: type mismatch;
 found   : org.apache.spark.rdd.RDD[(Double, Double)]
 required: org.apache.spark.sql.Dataset[_]
   evaluator.evaluate(testScoreAndLabel.rdd)

Try 2:

scala> evaluator.evaluate(testScoreAndLabel)
java.lang.IllegalArgumentException: Field "rawPrediction" does not exist.
  at
org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:228)

Try 3:

scala>
evaluator.evaluate(testScoreAndLabel.select("Label","ModelProbability"))
org.apache.spark.sql.AnalysisException: cannot resolve '`Label`' given
input columns: [_1, _2];
  at
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)


On Mon, Nov 14, 2016 at 1:44 PM, Nick Pentreath <nick.pentre...@gmail.com>
wrote:

DataFrame.rdd returns an RDD[Row]. You'll need to use map to extract the
doubles from the test score and label DF.

But you may prefer to just use spark.ml evaluators, which work with
DataFrames. Try BinaryClassificationEvaluator.

On Mon, 14 Nov 2016 at 19:30, Bhaarat Sharma <bhaara...@gmail.com> wrote:

I am getting scala.MatchError in the code below. I'm not able to see why
this would be happening. I am using Spark 2.0.1

scala> testResults.columns
res538: Array[String] = Array(TopicVector, subject_id, hadm_id,
isElective, isNewborn, isUrgent, isEmergency, isMale, isFemale,
oasis_score, sapsii_score, sofa_score, age, hosp_death, test,
ModelFeatures, Label, rawPrediction, ModelProbability,
ModelPrediction)

scala> testResults.select("Label","ModelProbability").take(1)
res542: Array[org.apache.spark.sql.Row] =
Array([0.0,[0.737304818744076,0.262695181255924]])

scala> val testScoreAndLabel = testResults.
 | select("Label","ModelProbability").
 | map { case Row(l:Double, p:Vector) => (p(1), l) }
testScoreAndLabel: org.apache.spark.sql.Dataset[(Double, Double)] =
[_1: double, _2: double]

scala> testScoreAndLabel
res539: org.apache.spark.sql.Dataset[(Double, Double)] = [_1: double,
_2: double]

scala> testScoreAndLabel.columns
res540: Array[String] = Array(_1, _2)

scala> val testMetrics = new BinaryClassificationMetrics(testScoreAndLabel.rdd)
testMetrics: org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
= org.apache.spark.mllib.evaluation.BinaryClassificationMetrics@36e780d1

The code below gives the error

val auROC = testMetrics.areaUnderROC() //this line gives the error

Caused by: scala.MatchError:
[0.0,[0.7316583497453766,0.2683416502546234]] (of class
org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)


Re: scala.MatchError while doing BinaryClassificationMetrics

2016-11-14 Thread Bhaarat Sharma
Can you please suggest how I can use BinaryClassificationEvaluator? I tried:

scala> import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

scala>  val evaluator = new BinaryClassificationEvaluator()
evaluator: org.apache.spark.ml.evaluation.BinaryClassificationEvaluator =
binEval_0d57372b7579

Try 1:

scala> evaluator.evaluate(testScoreAndLabel.rdd)
:105: error: type mismatch;
 found   : org.apache.spark.rdd.RDD[(Double, Double)]
 required: org.apache.spark.sql.Dataset[_]
   evaluator.evaluate(testScoreAndLabel.rdd)

Try 2:

scala> evaluator.evaluate(testScoreAndLabel)
java.lang.IllegalArgumentException: Field "rawPrediction" does not exist.
  at
org.apache.spark.sql.types.StructType$$anonfun$apply$1.apply(StructType.scala:228)

Try 3:

scala>
evaluator.evaluate(testScoreAndLabel.select("Label","ModelProbability"))
org.apache.spark.sql.AnalysisException: cannot resolve '`Label`' given
input columns: [_1, _2];
  at
org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)


On Mon, Nov 14, 2016 at 1:44 PM, Nick Pentreath <nick.pentre...@gmail.com>
wrote:

> DataFrame.rdd returns an RDD[Row]. You'll need to use map to extract the
> doubles from the test score and label DF.
>
> But you may prefer to just use spark.ml evaluators, which work with
> DataFrames. Try BinaryClassificationEvaluator.
>
> On Mon, 14 Nov 2016 at 19:30, Bhaarat Sharma <bhaara...@gmail.com> wrote:
>
>> I am getting scala.MatchError in the code below. I'm not able to see why
>> this would be happening. I am using Spark 2.0.1
>>
>> scala> testResults.columns
>> res538: Array[String] = Array(TopicVector, subject_id, hadm_id, isElective, 
>> isNewborn, isUrgent, isEmergency, isMale, isFemale, oasis_score, 
>> sapsii_score, sofa_score, age, hosp_death, test, ModelFeatures, Label, 
>> rawPrediction, ModelProbability, ModelPrediction)
>>
>> scala> testResults.select("Label","ModelProbability").take(1)
>> res542: Array[org.apache.spark.sql.Row] = 
>> Array([0.0,[0.737304818744076,0.262695181255924]])
>>
>> scala> val testScoreAndLabel = testResults.
>>  | select("Label","ModelProbability").
>>  | map { case Row(l:Double, p:Vector) => (p(1), l) }
>> testScoreAndLabel: org.apache.spark.sql.Dataset[(Double, Double)] = [_1: 
>> double, _2: double]
>>
>> scala> testScoreAndLabel
>> res539: org.apache.spark.sql.Dataset[(Double, Double)] = [_1: double, _2: 
>> double]
>>
>> scala> testScoreAndLabel.columns
>> res540: Array[String] = Array(_1, _2)
>>
>> scala> val testMetrics = new 
>> BinaryClassificationMetrics(testScoreAndLabel.rdd)
>> testMetrics: org.apache.spark.mllib.evaluation.BinaryClassificationMetrics = 
>> org.apache.spark.mllib.evaluation.BinaryClassificationMetrics@36e780d1
>>
>> The code below gives the error
>>
>> val auROC = testMetrics.areaUnderROC() //this line gives the error
>>
>> Caused by: scala.MatchError: [0.0,[0.7316583497453766,0.2683416502546234]] 
>> (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
>>
>>


Re: scala.MatchError while doing BinaryClassificationMetrics

2016-11-14 Thread Nick Pentreath
DataFrame.rdd returns an RDD[Row]. You'll need to use map to extract the
doubles from the test score and label DF.

But you may prefer to just use spark.ml evaluators, which work with
DataFrames. Try BinaryClassificationEvaluator.
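
For reference, a minimal sketch of that extraction (assuming Spark 2.x, where
the probability column holds an org.apache.spark.ml.linalg.Vector, and using
the column names from the DataFrame shown below):

import org.apache.spark.ml.linalg.Vector
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.sql.Row

val scoreAndLabel = testResults.
  select("ModelProbability", "Label").
  rdd.
  map { case Row(p: Vector, l: Double) => (p(1), l) }  // (score for class 1, label)

val metrics = new BinaryClassificationMetrics(scoreAndLabel)
val auROC = metrics.areaUnderROC()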

On Mon, 14 Nov 2016 at 19:30, Bhaarat Sharma <bhaara...@gmail.com> wrote:

> I am getting scala.MatchError in the code below. I'm not able to see why
> this would be happening. I am using Spark 2.0.1
>
> scala> testResults.columns
> res538: Array[String] = Array(TopicVector, subject_id, hadm_id, isElective, 
> isNewborn, isUrgent, isEmergency, isMale, isFemale, oasis_score, 
> sapsii_score, sofa_score, age, hosp_death, test, ModelFeatures, Label, 
> rawPrediction, ModelProbability, ModelPrediction)
>
> scala> testResults.select("Label","ModelProbability").take(1)
> res542: Array[org.apache.spark.sql.Row] = 
> Array([0.0,[0.737304818744076,0.262695181255924]])
>
> scala> val testScoreAndLabel = testResults.
>  | select("Label","ModelProbability").
>  | map { case Row(l:Double, p:Vector) => (p(1), l) }
> testScoreAndLabel: org.apache.spark.sql.Dataset[(Double, Double)] = [_1: 
> double, _2: double]
>
> scala> testScoreAndLabel
> res539: org.apache.spark.sql.Dataset[(Double, Double)] = [_1: double, _2: 
> double]
>
> scala> testScoreAndLabel.columns
> res540: Array[String] = Array(_1, _2)
>
> scala> val testMetrics = new 
> BinaryClassificationMetrics(testScoreAndLabel.rdd)
> testMetrics: org.apache.spark.mllib.evaluation.BinaryClassificationMetrics = 
> org.apache.spark.mllib.evaluation.BinaryClassificationMetrics@36e780d1
>
> The code below gives the error
>
> val auROC = testMetrics.areaUnderROC() //this line gives the error
>
> Caused by: scala.MatchError: [0.0,[0.7316583497453766,0.2683416502546234]] 
> (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
>
>


scala.MatchError while doing BinaryClassificationMetrics

2016-11-14 Thread Bhaarat Sharma
I am getting scala.MatchError in the code below. I'm not able to see why
this would be happening. I am using Spark 2.0.1

scala> testResults.columns
res538: Array[String] = Array(TopicVector, subject_id, hadm_id,
isElective, isNewborn, isUrgent, isEmergency, isMale, isFemale,
oasis_score, sapsii_score, sofa_score, age, hosp_death, test,
ModelFeatures, Label, rawPrediction, ModelProbability,
ModelPrediction)

scala> testResults.select("Label","ModelProbability").take(1)
res542: Array[org.apache.spark.sql.Row] =
Array([0.0,[0.737304818744076,0.262695181255924]])

scala> val testScoreAndLabel = testResults.
 | select("Label","ModelProbability").
 | map { case Row(l:Double, p:Vector) => (p(1), l) }
testScoreAndLabel: org.apache.spark.sql.Dataset[(Double, Double)] =
[_1: double, _2: double]

scala> testScoreAndLabel
res539: org.apache.spark.sql.Dataset[(Double, Double)] = [_1: double,
_2: double]

scala> testScoreAndLabel.columns
res540: Array[String] = Array(_1, _2)

scala> val testMetrics = new BinaryClassificationMetrics(testScoreAndLabel.rdd)
testMetrics: org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
= org.apache.spark.mllib.evaluation.BinaryClassificationMetrics@36e780d1

The code below gives the error

val auROC = testMetrics.areaUnderROC() //this line gives the error

Caused by: scala.MatchError:
[0.0,[0.7316583497453766,0.2683416502546234]] (of class
org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)


Re: scala.MatchError on stand-alone cluster mode

2016-07-17 Thread Mekal Zheng
Hi, Rishabh Bhardwaj, Saisai Shao,

Thanks for your help. I have found that the key reason is that I forgot to
upload the jar package to all of the nodes in the cluster, so after the master
distributed the job and selected one node as the driver, the driver could not
find the jar package and threw an exception.

-- 
Mekal Zheng
Sent with Airmail

From: Rishabh Bhardwaj <rbnex...@gmail.com>
Reply-To: Rishabh Bhardwaj <rbnex...@gmail.com>
Date: July 15, 2016 at 17:28:43
To: Saisai Shao <sai.sai.s...@gmail.com>
Cc: Mekal Zheng <mekal.zh...@gmail.com>, spark users <user@spark.apache.org>
Subject: Re: scala.MatchError on stand-alone cluster mode

Hi Mekal,
It may be a Scala version mismatch error. Kindly check whether you are
running both your streaming app and the Spark cluster on Scala 2.10 or 2.11.

Thanks,
Rishabh.

On Fri, Jul 15, 2016 at 1:38 PM, Saisai Shao <sai.sai.s...@gmail.com> wrote:

> The error stack is throwing from your code:
>
> Caused by: scala.MatchError: [Ljava.lang.String;@68d279ec (of class
> [Ljava.lang.String;)
> at com.jd.deeplog.LogAggregator$.main(LogAggregator.scala:29)
> at com.jd.deeplog.LogAggregator.main(LogAggregator.scala)
>
> I think you should debug the code yourself; it may not be a problem with
> Spark.
>
> On Fri, Jul 15, 2016 at 3:17 PM, Mekal Zheng <mekal.zh...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I have a Spark Streaming job written in Scala that runs well in local and
>> client mode, but when I submit it in cluster mode, the driver reports the
>> error shown below. Does anyone know what is wrong here? Please help me!
>>
>> The job code follows the log output below.
>>
>> 16/07/14 17:28:21 DEBUG ByteBufUtil:
>> -Dio.netty.threadLocalDirectBufferSize: 65536
>> 16/07/14 17:28:21 DEBUG NetUtil: Loopback interface: lo (lo,
>> 0:0:0:0:0:0:0:1%lo)
>> 16/07/14 17:28:21 DEBUG NetUtil: /proc/sys/net/core/somaxconn: 32768
>> 16/07/14 17:28:21 DEBUG TransportServer: Shuffle server started on port
>> :43492
>> 16/07/14 17:28:21 INFO Utils: Successfully started service 'Driver' on
>> port 43492.
>> 16/07/14 17:28:21 INFO WorkerWatcher: Connecting to worker spark://
>> Worker@172.20.130.98:23933
>> 16/07/14 17:28:21 DEBUG TransportClientFactory: Creating new connection
>> to /172.20.130.98:23933
>> Exception in thread "main" java.lang.reflect.InvocationTargetException
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:497)
>> at
>> org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
>> at
>> org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
>> Caused by: scala.MatchError: [Ljava.lang.String;@68d279ec (of class
>> [Ljava.lang.String;)
>> at com.jd.deeplog.LogAggregator$.main(LogAggregator.scala:29)
>> at com.jd.deeplog.LogAggregator.main(LogAggregator.scala)
>> ... 6 more
>>
>> ==
>> Job CODE:
>>
>> object LogAggregator {
>>
>>   val batchDuration = Seconds(5)
>>
>>   def main(args:Array[String]) {
>>
>> val usage =
>>   """Usage: LogAggregator 
>> 
>> |  logFormat: fieldName:fieldRole[,fieldName:fieldRole] each field 
>> must have both name and role
>> |  logFormat.role: can be key|avg|enum|sum|ignore
>>   """.stripMargin
>>
>> if (args.length < 9) {
>>   System.err.println(usage)
>>   System.exit(1)
>> }
>>
>> val Array(zkQuorum, group, topics, numThreads, logFormat, logSeparator, 
>> batchDuration, destType, destPath) = args
>>
>> println("Start streaming calculation...")
>>
>> val conf = new SparkConf().setAppName("LBHaproxy-LogAggregator")
>> val ssc = new StreamingContext(conf, Seconds(batchDuration.toInt))
>>
>> val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
>>
>> val lines = KafkaUtils.createStream(ssc, zkQuorum, group, 
>> topicMap).map(_._2)
>>
>> val logFields = logFormat.split(",").map(field => {
>>   val fld = field.split(":")
>>   if (fld.s

Re: scala.MatchError on stand-alone cluster mode

2016-07-15 Thread Saisai Shao
The error stack is throwing from your code:

Caused by: scala.MatchError: [Ljava.lang.String;@68d279ec (of class
[Ljava.lang.String;)
at com.jd.deeplog.LogAggregator$.main(LogAggregator.scala:29)
at com.jd.deeplog.LogAggregator.main(LogAggregator.scala)

I think you should debug the code yourself; it may not be a problem with
Spark.
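
For what it's worth, a MatchError whose offending value has class
[Ljava.lang.String; usually comes from an Array extractor pattern that did not
match. A tiny standalone illustration (the argument names are made up; this is
not the original LogAggregator code):

val cmdArgs = Array("zk:2181", "mygroup")        // only 2 elements

// val Array(zkQuorum, group, topics) = cmdArgs  // would throw scala.MatchError:
//   [Ljava.lang.String;@... (of class [Ljava.lang.String;)

// A safer variant matches explicitly and reports the mismatch:
cmdArgs match {
  case Array(zkQuorum, group, topics) =>
    println(s"zk=$zkQuorum group=$group topics=$topics")
  case other =>
    System.err.println(s"Expected 3 arguments, got ${other.length}")
    sys.exit(1)
}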

On Fri, Jul 15, 2016 at 3:17 PM, Mekal Zheng <mekal.zh...@gmail.com> wrote:

> Hi,
>
> I have a Spark Streaming job written in Scala that runs well in local and
> client mode, but when I submit it in cluster mode, the driver reports the
> error shown below. Does anyone know what is wrong here? Please help me!
>
> The job code follows the log output below.
>
> 16/07/14 17:28:21 DEBUG ByteBufUtil:
> -Dio.netty.threadLocalDirectBufferSize: 65536
> 16/07/14 17:28:21 DEBUG NetUtil: Loopback interface: lo (lo,
> 0:0:0:0:0:0:0:1%lo)
> 16/07/14 17:28:21 DEBUG NetUtil: /proc/sys/net/core/somaxconn: 32768
> 16/07/14 17:28:21 DEBUG TransportServer: Shuffle server started on port
> :43492
> 16/07/14 17:28:21 INFO Utils: Successfully started service 'Driver' on
> port 43492.
> 16/07/14 17:28:21 INFO WorkerWatcher: Connecting to worker spark://
> Worker@172.20.130.98:23933
> 16/07/14 17:28:21 DEBUG TransportClientFactory: Creating new connection to
> /172.20.130.98:23933
> Exception in thread "main" java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at
> org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
> at
> org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
> Caused by: scala.MatchError: [Ljava.lang.String;@68d279ec (of class
> [Ljava.lang.String;)
> at com.jd.deeplog.LogAggregator$.main(LogAggregator.scala:29)
> at com.jd.deeplog.LogAggregator.main(LogAggregator.scala)
> ... 6 more
>
> ==
> Job CODE:
>
> object LogAggregator {
>
>   val batchDuration = Seconds(5)
>
>   def main(args:Array[String]) {
>
> val usage =
>   """Usage: LogAggregator 
> 
> |  logFormat: fieldName:fieldRole[,fieldName:fieldRole] each field 
> must have both name and role
> |  logFormat.role: can be key|avg|enum|sum|ignore
>   """.stripMargin
>
> if (args.length < 9) {
>   System.err.println(usage)
>   System.exit(1)
> }
>
> val Array(zkQuorum, group, topics, numThreads, logFormat, logSeparator, 
> batchDuration, destType, destPath) = args
>
> println("Start streaming calculation...")
>
> val conf = new SparkConf().setAppName("LBHaproxy-LogAggregator")
> val ssc = new StreamingContext(conf, Seconds(batchDuration.toInt))
>
> val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
>
> val lines = KafkaUtils.createStream(ssc, zkQuorum, group, 
> topicMap).map(_._2)
>
> val logFields = logFormat.split(",").map(field => {
>   val fld = field.split(":")
>   if (fld.size != 2) {
> System.err.println("Wrong parameters for logFormat!\n")
> System.err.println(usage)
> System.exit(1)
>   }
>   // TODO: ensure the field has both 'name' and 'role'
>   new LogField(fld(0), fld(1))
> })
>
> val keyFields = logFields.filter(logFieldName => {
>   logFieldName.role == "key"
> })
> val keys = keyFields.map(key => {
>   key.name
> })
>
> val logsByKey = lines.map(line => {
>   val log = new Log(logFields, line, logSeparator)
>   log.toMap
> }).filter(log => log.nonEmpty).map(log => {
>   val keys = keyFields.map(logField => {
> log(logField.name).value
>   })
>
>   val key = keys.reduce((key1, key2) => {
> key1.asInstanceOf[String] + key2.asInstanceOf[String]
>   })
>
>   val fullLog = log + ("count" -> new LogSegment("sum", 1))
>
>   (key, fullLog)
> })
>
>
> val aggResults = logsByKey.reduceByKey((log_a, log_b) => {
>
>   log_a.map(logField => {
> val logFieldName = logField._1
> val logSegment_a = logField._2
> val logSegment_b = log_b(logFieldName)
>
> val segValue = logSegment_a.role match {
>

scala.MatchError on stand-alone cluster mode

2016-07-15 Thread Mekal Zheng
Hi,

I have a Spark Streaming job written in Scala that runs well in local and
client mode, but when I submit it in cluster mode, the driver reports the
error shown below. Does anyone know what is wrong here? Please help me!

The job code follows the log output below.

16/07/14 17:28:21 DEBUG ByteBufUtil:
-Dio.netty.threadLocalDirectBufferSize: 65536
16/07/14 17:28:21 DEBUG NetUtil: Loopback interface: lo (lo,
0:0:0:0:0:0:0:1%lo)
16/07/14 17:28:21 DEBUG NetUtil: /proc/sys/net/core/somaxconn: 32768
16/07/14 17:28:21 DEBUG TransportServer: Shuffle server started on port
:43492
16/07/14 17:28:21 INFO Utils: Successfully started service 'Driver' on port
43492.
16/07/14 17:28:21 INFO WorkerWatcher: Connecting to worker spark://
Worker@172.20.130.98:23933
16/07/14 17:28:21 DEBUG TransportClientFactory: Creating new connection to /
172.20.130.98:23933
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at
org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
at
org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
Caused by: scala.MatchError: [Ljava.lang.String;@68d279ec (of class
[Ljava.lang.String;)
at com.jd.deeplog.LogAggregator$.main(LogAggregator.scala:29)
at com.jd.deeplog.LogAggregator.main(LogAggregator.scala)
... 6 more

==
Job CODE:

object LogAggregator {

  val batchDuration = Seconds(5)

  def main(args:Array[String]) {

val usage =
  """Usage: LogAggregator

|  logFormat: fieldName:fieldRole[,fieldName:fieldRole] each
field must have both name and role
|  logFormat.role: can be key|avg|enum|sum|ignore
  """.stripMargin

if (args.length < 9) {
  System.err.println(usage)
  System.exit(1)
}

val Array(zkQuorum, group, topics, numThreads, logFormat,
logSeparator, batchDuration, destType, destPath) = args

println("Start streaming calculation...")

val conf = new SparkConf().setAppName("LBHaproxy-LogAggregator")
val ssc = new StreamingContext(conf, Seconds(batchDuration.toInt))

val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap

val lines = KafkaUtils.createStream(ssc, zkQuorum, group,
topicMap).map(_._2)

val logFields = logFormat.split(",").map(field => {
  val fld = field.split(":")
  if (fld.size != 2) {
System.err.println("Wrong parameters for logFormat!\n")
System.err.println(usage)
System.exit(1)
  }
  // TODO: ensure the field has both 'name' and 'role'
  new LogField(fld(0), fld(1))
})

val keyFields = logFields.filter(logFieldName => {
  logFieldName.role == "key"
})
val keys = keyFields.map(key => {
  key.name
})

val logsByKey = lines.map(line => {
  val log = new Log(logFields, line, logSeparator)
  log.toMap
}).filter(log => log.nonEmpty).map(log => {
  val keys = keyFields.map(logField => {
log(logField.name).value
  })

  val key = keys.reduce((key1, key2) => {
key1.asInstanceOf[String] + key2.asInstanceOf[String]
  })

  val fullLog = log + ("count" -> new LogSegment("sum", 1))

  (key, fullLog)
})


val aggResults = logsByKey.reduceByKey((log_a, log_b) => {

  log_a.map(logField => {
val logFieldName = logField._1
val logSegment_a = logField._2
val logSegment_b = log_b(logFieldName)

val segValue = logSegment_a.role match {
  case "avg" => {
logSegment_a.value.toString.toInt +
logSegment_b.value.toString.toInt
  }
  case "sum" => {
logSegment_a.value.toString.toInt +
logSegment_b.value.toString.toInt
  }
  case "enum" => {
val list_a = logSegment_a.value.asInstanceOf[List[(String, Int)]]
val list_b = logSegment_b.value.asInstanceOf[List[(String, Int)]]
list_a ++ list_b
  }
  case _ => logSegment_a.value
}
(logFieldName, new LogSegment(logSegment_a.role, segValue))
  })
}).map(logRecord => {
  val log = logRecord._2
  val count = log("count").value.toString.toInt


  val logContent = log.map(logField => {
val logFieldName = logField._1
val logSegment = logField._2
val fieldValue = logSegment.role match {
  case "avg" => {
logSegment.value.toString.toInt / count
  }

Re: [MARKETING] Spark Streaming stateful transformation mapWithState function getting error scala.MatchError: [Ljava.lang.Object]

2016-03-14 Thread Vinti Maheshwari
Hi Iain,


Thanks for your reply. Actually, I changed my trackStateFunc and it's working
now.

For reference my working code with mapWithState:


def trackStateFunc(batchTime: Time, key: String, value:
Option[Array[Long]], state: State[Array[Long]])
  : Option[(String, Array[Long])] = {
  // Check if state exists
  if (state.exists) {
val newState:Array[Long] = Array(state.get, value.get).transpose.map(_.sum)
state.update(newState)// Set the new state
Some((key, newState))
  } else {
val initialState = value.get
state.update(initialState) // Set the initial state
Some((key, initialState))
  }
}

// StateSpec[KeyType, ValueType, StateType, MappedType]
val stateSpec: StateSpec[String, Array[Long], Array[Long], (String,
Array[Long])] = StateSpec.function(trackStateFunc _)

val state: MapWithStateDStream[String, Array[Long], Array[Long],
(String, Array[Long])] = parsedStream.mapWithState(stateSpec)
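
As an aside, the accumulation line Array(state.get, value.get).transpose.map(_.sum)
sums two equal-length arrays element-wise; a tiny standalone example of the idiom:

val a = Array(1L, 2L, 3L)
val b = Array(10L, 20L, 30L)
val summed = Array(a, b).transpose.map(_.sum)   // Array(11L, 22L, 33L)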


Thanks & Regards,

Vinti


On Mon, Mar 14, 2016 at 7:06 AM, Iain Cundy <iain.cu...@amdocs.com> wrote:

> Hi Vinti
>
>
>
> I don’t program in Scala, but I think you’ve changed the meaning of the
> current variable – look again at what is state and what is new data.
>
>
>
> Assuming it works like the Java API, to use this function to maintain
> State you must call State.update, while you can return anything, not just
> the state.
>
>
>
> Cheers
>
> Iain
>
>
>
> *From:* Vinti Maheshwari [mailto:vinti.u...@gmail.com]
> *Sent:* 12 March 2016 22:10
> *To:* user
> *Subject:* [MARKETING] Spark Streaming stateful transformation
> mapWithState function getting error scala.MatchError: [Ljava.lang.Object]
>
>
>
> Hi All,
>
> I wanted to replace my updateStateByKey function with mapWithState
> function (Spark 1.6) to improve performance of my program.
>
> I was following these two documents:
> https://databricks.com/blog/2016/02/01/faster-stateful-stream-processing-in-spark-streaming.html
>
>
> https://docs.cloud.databricks.com/docs/spark/1.6/index.html#examples/Streaming%20mapWithState.html
>
> but I am getting the error *scala.MatchError: [Ljava.lang.Object]*
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 71.0 failed 4 times, most recent failure: Lost task 0.3 in stage 71.0 
> (TID 88, ttsv-lab-vmdb-01.englab.juniper.net): scala.MatchError: 
> [Ljava.lang.Object;@eaf8bc8 (of class [Ljava.lang.Object;)
>
> at 
> HbaseCovrageStream$$anonfun$HbaseCovrageStream$$tracketStateFunc$1$3.apply(HbaseCoverageStream_mapwithstate.scala:84)
>
> at 
> HbaseCovrageStream$$anonfun$HbaseCovrageStream$$tracketStateFunc$1$3.apply(HbaseCoverageStream_mapwithstate.scala:84)
>
> at scala.Option.flatMap(Option.scala:170)
>
> at 
> HbaseCovrageStream$.HbaseCovrageStream$$tracketStateFunc$1(HbaseCoverageStream_mapwithstate.scala:84)
>
> Reference code:
>
> def trackStateFunc(key:String, value:Option[Array[Long]], 
> current:State[Array[Long]]) = {
>
>
>
> //either we can use this
>
> // current.update(value)
>
>
>
> value.map(_ :+ current).orElse(Some(current)).flatMap{
>
>   case x:Array[Long] => Try(x.map(BDV(_)).reduce(_ + 
> _).toArray).toOption
>
>   case None => ???
>
> }
>
>   }
>
>
>
>   val statespec:StateSpec[String, Array[Long], Array[Long], 
> Option[Array[Long]]] = StateSpec.function(trackStateFunc _)
>
>
>
>   val state: MapWithStateDStream[String, Array[Long], Array[Long], 
> Option[Array[Long]]] = parsedStream.mapWithState(statespec)
>
> My previous working code which was using updateStateByKey function:
>
> val state: DStream[(String, Array[Long])] = parsedStream.updateStateByKey(
>
> (current: Seq[Array[Long]], prev: Option[Array[Long]]) =>  {
>
>  prev.map(_ +: current).orElse(Some(current))
>
>   .flatMap(as => Try(as.map(BDV(_)).reduce(_ + _).toArray).toOption)
>
>   })
>
> Does anyone have an idea what the issue could be?
>
> Thanks & Regards,
>
> Vinti
>


Spark Streaming stateful transformation mapWithState function getting error scala.MatchError: [Ljava.lang.Object]

2016-03-12 Thread Vinti Maheshwari
Hi All,

I wanted to replace my updateStateByKey function with the mapWithState function
(Spark 1.6) to improve the performance of my program.

I was following these two documents:
https://databricks.com/blog/2016/02/01/faster-stateful-stream-processing-in-spark-streaming.html

https://docs.cloud.databricks.com/docs/spark/1.6/index.html#examples/Streaming%20mapWithState.html

but I am getting the error *scala.MatchError: [Ljava.lang.Object]*

org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 71.0 failed 4 times, most recent failure: Lost task
0.3 in stage 71.0 (TID 88, ttsv-lab-vmdb-01.englab.juniper.net):
scala.MatchError: [Ljava.lang.Object;@eaf8bc8 (of class
[Ljava.lang.Object;)
at 
HbaseCovrageStream$$anonfun$HbaseCovrageStream$$tracketStateFunc$1$3.apply(HbaseCoverageStream_mapwithstate.scala:84)
at 
HbaseCovrageStream$$anonfun$HbaseCovrageStream$$tracketStateFunc$1$3.apply(HbaseCoverageStream_mapwithstate.scala:84)
at scala.Option.flatMap(Option.scala:170)
at 
HbaseCovrageStream$.HbaseCovrageStream$$tracketStateFunc$1(HbaseCoverageStream_mapwithstate.scala:84)

Reference code:

def trackStateFunc(key:String, value:Option[Array[Long]],
current:State[Array[Long]]) = {

//either we can use this
// current.update(value)

value.map(_ :+ current).orElse(Some(current)).flatMap{
  case x:Array[Long] => Try(x.map(BDV(_)).reduce(_ +
_).toArray).toOption
  case None => ???
}
  }

  val statespec:StateSpec[String, Array[Long], Array[Long],
Option[Array[Long]]] = StateSpec.function(trackStateFunc _)

  val state: MapWithStateDStream[String, Array[Long], Array[Long],
Option[Array[Long]]] = parsedStream.mapWithState(statespec)

My previous working code which was using updateStateByKey function:

val state: DStream[(String, Array[Long])] = parsedStream.updateStateByKey(
(current: Seq[Array[Long]], prev: Option[Array[Long]]) =>  {
 prev.map(_ +: current).orElse(Some(current))
  .flatMap(as => Try(as.map(BDV(_)).reduce(_ + _).toArray).toOption)
  })

Does anyone have an idea what the issue could be?

Thanks & Regards,
Vinti


sql:Exception in thread "main" scala.MatchError: StringType

2016-01-03 Thread Bonsen
(sbt) scala:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.sql
object SimpleApp {
  def main(args: Array[String]) {
val conf = new SparkConf()
conf.setAppName("mytest").setMaster("spark://Master:7077")
val sc = new SparkContext(conf)
val sqlContext = new sql.SQLContext(sc)
val
d=sqlContext.read.json("/home/hadoop/2015data_test/Data/Data/100808cb11e9898816ef15fcdde4e1d74cbc0b/Db6Jh2XeQ.json")
sc.stop()
  }
}
__
after sbt package :
./spark-submit --class "SimpleApp" 
/home/hadoop/Downloads/sbt/bin/target/scala-2.10/simple-project_2.10-1.0.jar
___
json fIle:
{
"programmers": [
{
"firstName": "Brett",
"lastName": "McLaughlin",
"email": ""
},
{
"firstName": "Jason",
"lastName": "Hunter",
"email": ""
},
{
"firstName": "Elliotte",
"lastName": "Harold",
"email": ""
}
],
"authors": [
{
"firstName": "Isaac",
"lastName": "Asimov",
"genre": "sciencefiction"
},
{
"firstName": "Tad",
"lastName": "Williams",
"genre": "fantasy"
},
{
"firstName": "Frank",
"lastName": "Peretti",
"genre": "christianfiction"
}
],
"musicians": [
{
"firstName": "Eric",
"lastName": "Clapton",
"instrument": "guitar"
},
{
"firstName": "Sergei",
"lastName": "Rachmaninoff",
"instrument": "piano"
}
]
}
___
Exception in thread "main" scala.MatchError: StringType (of class
org.apache.spark.sql.types.StringType$)
at org.apache.spark.sql.json.InferSchema$.apply(InferSchema.scala:58)
at
org.apache.spark.sql.json.JSONRelation$$anonfun$schema$1.apply(JSONRelation.scala:139)
___
Why does this happen?






Re: sql:Exception in thread "main" scala.MatchError: StringType

2016-01-03 Thread Jeff Zhang
Spark only supports one JSON object per line. You need to reformat your
file.
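
A minimal sketch of one way to work around it without hand-editing the file
(Spark 1.x API; assumes each input file holds a single pretty-printed JSON
object with no embedded newlines inside string values, and the path is just a
placeholder):

// Read whole files, collapse each object onto one line, then let Spark infer the schema.
val raw = sc.wholeTextFiles("/path/to/json-dir")
val oneRecordPerLine = raw.map { case (_, content) => content.replaceAll("\\s*\\n\\s*", " ") }
val df = sqlContext.read.json(oneRecordPerLine)  // DataFrameReader.json(rdd: RDD[String])
df.printSchema()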

On Mon, Jan 4, 2016 at 11:26 AM, Bonsen <hengbohe...@126.com> wrote:

> (sbt) scala:
> import org.apache.spark.SparkContext
> import org.apache.spark.SparkConf
> import org.apache.spark.sql
> object SimpleApp {
>   def main(args: Array[String]) {
> val conf = new SparkConf()
> conf.setAppName("mytest").setMaster("spark://Master:7077")
> val sc = new SparkContext(conf)
> val sqlContext = new sql.SQLContext(sc)
> val
>
> d=sqlContext.read.json("/home/hadoop/2015data_test/Data/Data/100808cb11e9898816ef15fcdde4e1d74cbc0b/Db6Jh2XeQ.json")
> sc.stop()
>   }
> }
>
> __
> after sbt package :
> ./spark-submit --class "SimpleApp"
>
> /home/hadoop/Downloads/sbt/bin/target/scala-2.10/simple-project_2.10-1.0.jar
>
> ___
> json fIle:
> {
> "programmers": [
> {
> "firstName": "Brett",
> "lastName": "McLaughlin",
> "email": ""
> },
> {
> "firstName": "Jason",
> "lastName": "Hunter",
> "email": ""
> },
> {
> "firstName": "Elliotte",
> "lastName": "Harold",
> "email": ""
> }
> ],
> "authors": [
> {
> "firstName": "Isaac",
> "lastName": "Asimov",
> "genre": "sciencefiction"
> },
> {
> "firstName": "Tad",
> "lastName": "Williams",
> "genre": "fantasy"
> },
> {
> "firstName": "Frank",
> "lastName": "Peretti",
> "genre": "christianfiction"
> }
> ],
> "musicians": [
> {
> "firstName": "Eric",
> "lastName": "Clapton",
> "instrument": "guitar"
> },
> {
> "firstName": "Sergei",
> "lastName": "Rachmaninoff",
> "instrument": "piano"
> }
> ]
> }
>
> ___
> Exception in thread "main" scala.MatchError: StringType (of class
> org.apache.spark.sql.types.StringType$)
> at
> org.apache.spark.sql.json.InferSchema$.apply(InferSchema.scala:58)
> at
>
> org.apache.spark.sql.json.JSONRelation$$anonfun$schema$1.apply(JSONRelation.scala:139)
>
> ___
> Why does this happen?
>
>
>
>
>


-- 
Best Regards

Jeff Zhang


Re: Reading JSON in Pyspark throws scala.MatchError

2015-10-20 Thread Jeff Zhang
BTW, I think the JSON parser should verify the JSON format, at least when
inferring the schema.

On Wed, Oct 21, 2015 at 12:59 PM, Jeff Zhang <zjf...@gmail.com> wrote:

> I think this is due to the JSON file format. DataFrame can only accept a
> JSON file with one valid record per line. Multiple lines per record are
> invalid for DataFrame.
>
>
> On Tue, Oct 6, 2015 at 2:48 AM, Davies Liu <dav...@databricks.com> wrote:
>
>> Could you create a JIRA to track this bug?
>>
>> On Fri, Oct 2, 2015 at 1:42 PM, balajikvijayan
>> <balaji.k.vija...@gmail.com> wrote:
>> > Running Windows 8.1, Python 2.7.x, Scala 2.10.5, Spark 1.4.1.
>> >
>> > I'm trying to read in a large quantity of json data in a couple of
>> files and
>> > I receive a scala.MatchError when I do so. Json, Python and stack trace
>> all
>> > shown below.
>> >
>> > Json:
>> >
>> > {
>> > "dataunit": {
>> > "page_view": {
>> > "nonce": 438058072,
>> > "person": {
>> > "user_id": 5846
>> > },
>> > "page": {
>> > "url": "http://mysite.com/blog"
>> > }
>> > }
>> > },
>> > "pedigree": {
>> > "true_as_of_secs": 1438627992
>> > }
>> > }
>> >
>> > Python:
>> >
>> > import pyspark
>> > sc = pyspark.SparkContext()
>> > sqlContext = pyspark.SQLContext(sc)
>> > pageviews = sqlContext.read.json("[Path to folder containing file with
>> above
>> > json]")
>> > pageviews.collect()
>> >
>> > Stack Trace:
>> > Py4JJavaError: An error occurred while calling
>> > z:org.apache.spark.api.python.PythonRDD.collectAndServe.
>> > : org.apache.spark.SparkException: Job aborted due to stage failure:
>> Task 1
>> > in stage 32.0 failed 1 times, most recent failure: Lost task 1.0 in
>> stage
>> > 32.0 (TID 133, localhost): scala.MatchError:
>> > (VALUE_STRING,ArrayType(StructType(),true)) (of class scala.Tuple2)
>> > at
>> >
>> org.apache.spark.sql.json.JacksonParser$.convertField(JacksonParser.scala:49)
>> > at
>> >
>> org.apache.spark.sql.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$1.apply(JacksonParser.scala:201)
>> > at
>> >
>> org.apache.spark.sql.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$1.apply(JacksonParser.scala:193)
>> > at
>> scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>> > at
>> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>> > at
>> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>> > at
>> >
>> org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:116)
>> > at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>> > at
>> >
>> org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:111)
>> > at
>> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>> > at
>> >
>> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>> > at
>> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>> > at scala.collection.TraversableOnce$class.to
>> (TraversableOnce.scala:273)
>> > at
>> >
>> org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.to(SerDeUtil.scala:111)
>> > at
>> >
>> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>> > at
>> >
>> org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toBuffer(SerDeUtil.scala:111)
>> > at
>> >
>> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
>> > at
>> >
>> org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toArray(SerDeUtil.scala:111)
>> > at
>> >
>> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:885)
>> > at
>> >
>> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:885)
>> > at
>> >
>> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkCo

Re: Reading JSON in Pyspark throws scala.MatchError

2015-10-20 Thread Jeff Zhang
I think this is due to the JSON file format. DataFrame can only accept a
JSON file with one valid record per line. Multiple lines per record are
invalid for DataFrame.


On Tue, Oct 6, 2015 at 2:48 AM, Davies Liu <dav...@databricks.com> wrote:

> Could you create a JIRA to track this bug?
>
> On Fri, Oct 2, 2015 at 1:42 PM, balajikvijayan
> <balaji.k.vija...@gmail.com> wrote:
> > Running Windows 8.1, Python 2.7.x, Scala 2.10.5, Spark 1.4.1.
> >
> > I'm trying to read in a large quantity of json data in a couple of files
> and
> > I receive a scala.MatchError when I do so. Json, Python and stack trace
> all
> > shown below.
> >
> > Json:
> >
> > {
> > "dataunit": {
> > "page_view": {
> > "nonce": 438058072,
> > "person": {
> > "user_id": 5846
> > },
> > "page": {
> > "url": "http://mysite.com/blog"
> > }
> > }
> > },
> > "pedigree": {
> > "true_as_of_secs": 1438627992
> > }
> > }
> >
> > Python:
> >
> > import pyspark
> > sc = pyspark.SparkContext()
> > sqlContext = pyspark.SQLContext(sc)
> > pageviews = sqlContext.read.json("[Path to folder containing file with
> above
> > json]")
> > pageviews.collect()
> >
> > Stack Trace:
> > Py4JJavaError: An error occurred while calling
> > z:org.apache.spark.api.python.PythonRDD.collectAndServe.
> > : org.apache.spark.SparkException: Job aborted due to stage failure:
> Task 1
> > in stage 32.0 failed 1 times, most recent failure: Lost task 1.0 in stage
> > 32.0 (TID 133, localhost): scala.MatchError:
> > (VALUE_STRING,ArrayType(StructType(),true)) (of class scala.Tuple2)
> > at
> >
> org.apache.spark.sql.json.JacksonParser$.convertField(JacksonParser.scala:49)
> > at
> >
> org.apache.spark.sql.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$1.apply(JacksonParser.scala:201)
> > at
> >
> org.apache.spark.sql.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$1.apply(JacksonParser.scala:193)
> > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> > at
> >
> org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:116)
> > at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> > at
> >
> org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:111)
> > at
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> > at
> > scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
> > at
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
> > at scala.collection.TraversableOnce$class.to
> (TraversableOnce.scala:273)
> > at
> >
> org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.to(SerDeUtil.scala:111)
> > at
> >
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
> > at
> >
> org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toBuffer(SerDeUtil.scala:111)
> > at
> > scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
> > at
> >
> org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toArray(SerDeUtil.scala:111)
> > at
> >
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:885)
> > at
> >
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:885)
> > at
> >
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1767)
> > at
> >
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1767)
> > at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
> > at org.apache.spark.scheduler.Task.run(Task.scala:70)
> > at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> > at
> >
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> > at
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> > at java.lang.T

Re: Reading JSON in Pyspark throws scala.MatchError

2015-10-20 Thread Balaji Vijayan
You are correct, that was the issue.

On Tue, Oct 20, 2015 at 10:18 PM, Jeff Zhang <zjf...@gmail.com> wrote:

> BTW, I think the JSON parser should verify the JSON format, at least when
> inferring the schema.
>
> On Wed, Oct 21, 2015 at 12:59 PM, Jeff Zhang <zjf...@gmail.com> wrote:
>
>> I think this is due to the JSON file format. DataFrame can only accept a
>> JSON file with one valid record per line. Multiple lines per record are
>> invalid for DataFrame.
>>
>>
>> On Tue, Oct 6, 2015 at 2:48 AM, Davies Liu <dav...@databricks.com> wrote:
>>
>>> Could you create a JIRA to track this bug?
>>>
>>> On Fri, Oct 2, 2015 at 1:42 PM, balajikvijayan
>>> <balaji.k.vija...@gmail.com> wrote:
>>> > Running Windows 8.1, Python 2.7.x, Scala 2.10.5, Spark 1.4.1.
>>> >
>>> > I'm trying to read in a large quantity of json data in a couple of
>>> files and
>>> > I receive a scala.MatchError when I do so. Json, Python and stack
>>> trace all
>>> > shown below.
>>> >
>>> > Json:
>>> >
>>> > {
>>> > "dataunit": {
>>> > "page_view": {
>>> > "nonce": 438058072,
>>> > "person": {
>>> > "user_id": 5846
>>> > },
>>> > "page": {
>>> > "url": "http://mysite.com/blog"
>>> > }
>>> > }
>>> > },
>>> > "pedigree": {
>>> > "true_as_of_secs": 1438627992
>>> > }
>>> > }
>>> >
>>> > Python:
>>> >
>>> > import pyspark
>>> > sc = pyspark.SparkContext()
>>> > sqlContext = pyspark.SQLContext(sc)
>>> > pageviews = sqlContext.read.json("[Path to folder containing file with
>>> above
>>> > json]")
>>> > pageviews.collect()
>>> >
>>> > Stack Trace:
>>> > Py4JJavaError: An error occurred while calling
>>> > z:org.apache.spark.api.python.PythonRDD.collectAndServe.
>>> > : org.apache.spark.SparkException: Job aborted due to stage failure:
>>> Task 1
>>> > in stage 32.0 failed 1 times, most recent failure: Lost task 1.0 in
>>> stage
>>> > 32.0 (TID 133, localhost): scala.MatchError:
>>> > (VALUE_STRING,ArrayType(StructType(),true)) (of class scala.Tuple2)
>>> > at
>>> >
>>> org.apache.spark.sql.json.JacksonParser$.convertField(JacksonParser.scala:49)
>>> > at
>>> >
>>> org.apache.spark.sql.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$1.apply(JacksonParser.scala:201)
>>> > at
>>> >
>>> org.apache.spark.sql.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$1.apply(JacksonParser.scala:193)
>>> > at
>>> scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>>> > at
>>> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>>> > at
>>> scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>>> > at
>>> >
>>> org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:116)
>>> > at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>>> > at
>>> >
>>> org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:111)
>>> > at
>>> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
>>> > at
>>> >
>>> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
>>> > at
>>> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
>>> > at scala.collection.TraversableOnce$class.to
>>> (TraversableOnce.scala:273)
>>> > at
>>> >
>>> org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.to(SerDeUtil.scala:111)
>>> > at
>>> >
>>> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
>>> > at
>>> >
>>> org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toBuffer(SerDeUtil.scala:111)
>>> > at
>>> >
>>> scala.collection.Trav

Re: Reading JSON in Pyspark throws scala.MatchError

2015-10-05 Thread Davies Liu
Could you create a JIRA to track this bug?

On Fri, Oct 2, 2015 at 1:42 PM, balajikvijayan
<balaji.k.vija...@gmail.com> wrote:
> Running Windows 8.1, Python 2.7.x, Scala 2.10.5, Spark 1.4.1.
>
> I'm trying to read in a large quantity of json data in a couple of files and
> I receive a scala.MatchError when I do so. Json, Python and stack trace all
> shown below.
>
> Json:
>
> {
> "dataunit": {
> "page_view": {
> "nonce": 438058072,
> "person": {
> "user_id": 5846
> },
> "page": {
> "url": "http://mysite.com/blog"
> }
> }
> },
> "pedigree": {
> "true_as_of_secs": 1438627992
> }
> }
>
> Python:
>
> import pyspark
> sc = pyspark.SparkContext()
> sqlContext = pyspark.SQLContext(sc)
> pageviews = sqlContext.read.json("[Path to folder containing file with above
> json]")
> pageviews.collect()
>
> Stack Trace:
> Py4JJavaError: An error occurred while calling
> z:org.apache.spark.api.python.PythonRDD.collectAndServe.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1
> in stage 32.0 failed 1 times, most recent failure: Lost task 1.0 in stage
> 32.0 (TID 133, localhost): scala.MatchError:
> (VALUE_STRING,ArrayType(StructType(),true)) (of class scala.Tuple2)
> at
> org.apache.spark.sql.json.JacksonParser$.convertField(JacksonParser.scala:49)
> at
> org.apache.spark.sql.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$1.apply(JacksonParser.scala:201)
> at
> org.apache.spark.sql.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$1.apply(JacksonParser.scala:193)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at
> org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:116)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at
> org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:111)
> at 
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> at
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
> at 
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
> at
> org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.to(SerDeUtil.scala:111)
> at
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
> at
> org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toBuffer(SerDeUtil.scala:111)
> at
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
> at
> org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toArray(SerDeUtil.scala:111)
> at
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:885)
> at
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:885)
> at
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1767)
> at
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1767)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
> at org.apache.spark.scheduler.Task.run(Task.scala:70)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
> Driver stacktrace:
> at
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263)
> at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1263)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonf

Reading JSON in Pyspark throws scala.MatchError

2015-10-02 Thread balajikvijayan
Running Windows 8.1, Python 2.7.x, Scala 2.10.5, Spark 1.4.1.

I'm trying to read in a large quantity of json data in a couple of files and
I receive a scala.MatchError when I do so. Json, Python and stack trace all
shown below.

Json:

{
"dataunit": {
"page_view": {
"nonce": 438058072,
"person": {
"user_id": 5846
},
"page": {
"url": "http://mysite.com/blog;
}
}
},
"pedigree": {
"true_as_of_secs": 1438627992
}
}

Python:

import pyspark
sc = pyspark.SparkContext()
sqlContext = pyspark.SQLContext(sc)
pageviews = sqlContext.read.json("[Path to folder containing file with above
json]")
pageviews.collect()

Stack Trace:
Py4JJavaError: An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1
in stage 32.0 failed 1 times, most recent failure: Lost task 1.0 in stage
32.0 (TID 133, localhost): scala.MatchError:
(VALUE_STRING,ArrayType(StructType(),true)) (of class scala.Tuple2)
at
org.apache.spark.sql.json.JacksonParser$.convertField(JacksonParser.scala:49)
at
org.apache.spark.sql.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$1.apply(JacksonParser.scala:201)
at
org.apache.spark.sql.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$1.apply(JacksonParser.scala:193)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at
org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:116)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at
org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:111)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at
org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.to(SerDeUtil.scala:111)
at
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at
org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toBuffer(SerDeUtil.scala:111)
at
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at
org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toArray(SerDeUtil.scala:111)
at
org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:885)
at
org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:885)
at
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1767)
at
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1767)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1263)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at scala.Option.foreach(Option.scala:236)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1457)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

It seems like this issue has been resolved in Scala per SPARK-339.
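
A common workaround for this kind of schema-inference failure is to supply an explicit schema instead of letting Spark infer one from the data. Below is a minimal sketch in Scala, assuming Spark 1.4+ and an existing SparkContext sc (PySpark exposes the same read.schema(...).json(...) hook); the field types and the path are assumptions based on the sample record above:

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types._

val sqlContext = new SQLContext(sc)

// Explicit schema for the sample record; adjust the types to match the real data.
val schema = StructType(Seq(
  StructField("dataunit", StructType(Seq(
    StructField("page_view", StructType(Seq(
      StructField("nonce", LongType),
      StructField("person", StructType(Seq(StructField("user_id", LongType)))),
      StructField("page", StructType(Seq(StructField("url", StringType))))
    )))
  ))),
  StructField("pedigree", StructType(Seq(StructField("true_as_of_secs", LongType))))
))

// Skipping inference avoids the empty-struct types that trigger the MatchError above.
val pageviews = sqlContext.read.schema(schema).json("/path/to/json/folder")  // placeholder path
pageviews.show()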

Re: Reading JSON in Pyspark throws scala.MatchError

2015-10-02 Thread Ted Yu
I got the following when parsing your input with master branch (Python
version 2.6.6):

http://pastebin.com/1w8WM3tz

FYI

On Fri, Oct 2, 2015 at 1:42 PM, balajikvijayan <balaji.k.vija...@gmail.com>
wrote:

> Running Windows 8.1, Python 2.7.x, Scala 2.10.5, Spark 1.4.1.
>
> I'm trying to read in a large quantity of json data in a couple of files
> and
> I receive a scala.MatchError when I do so. Json, Python and stack trace all
> shown below.
>
> Json:
>
> {
> "dataunit": {
> "page_view": {
> "nonce": 438058072,
> "person": {
> "user_id": 5846
> },
> "page": {
> "url": "http://mysite.com/blog;
> }
> }
> },
> "pedigree": {
> "true_as_of_secs": 1438627992
> }
> }
>
> Python:
>
> import pyspark
> sc = pyspark.SparkContext()
> sqlContext = pyspark.SQLContext(sc)
> pageviews = sqlContext.read.json("[Path to folder containing file with
> above
> json]")
> pageviews.collect()
>
> Stack Trace:
> Py4JJavaError: An error occurred while calling
> z:org.apache.spark.api.python.PythonRDD.collectAndServe.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1
> in stage 32.0 failed 1 times, most recent failure: Lost task 1.0 in stage
> 32.0 (TID 133, localhost): scala.MatchError:
> (VALUE_STRING,ArrayType(StructType(),true)) (of class scala.Tuple2)
> at
>
> org.apache.spark.sql.json.JacksonParser$.convertField(JacksonParser.scala:49)
> at
>
> org.apache.spark.sql.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$1.apply(JacksonParser.scala:201)
> at
>
> org.apache.spark.sql.json.JacksonParser$$anonfun$parseJson$1$$anonfun$apply$1.apply(JacksonParser.scala:193)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at
>
> org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:116)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at
>
> org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:111)
> at
> scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
> at
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
> at
> scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
> at scala.collection.TraversableOnce$class.to
> (TraversableOnce.scala:273)
> at
>
> org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.to(SerDeUtil.scala:111)
> at
> scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
> at
>
> org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toBuffer(SerDeUtil.scala:111)
> at
> scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
> at
>
> org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.toArray(SerDeUtil.scala:111)
> at
>
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:885)
> at
>
> org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:885)
> at
>
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1767)
> at
>
> org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1767)
> at
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
> at org.apache.spark.scheduler.Task.run(Task.scala:70)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
> Driver stacktrace:
> at
> org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273)
> at
>
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264)
> at
>
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263)
> at
>
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at
> scala.collection.mutable.ArrayBuffer.forea

Re: SparkSQL : using Hive UDF returning Map throws error: scala.MatchError: interface java.util.Map (of class java.lang.Class) (state=,code=0)

2015-06-05 Thread Okehee Goh
I will. It would be great if a simple UDF could return a complex type.
Thanks!

On Fri, Jun 5, 2015 at 12:17 AM, Cheng, Hao hao.ch...@intel.com wrote:
 Confirmed: with the latest master, we don't support complex data types for simple 
 Hive UDFs. Do you mind filing an issue in JIRA?

 -Original Message-
 From: Cheng, Hao [mailto:hao.ch...@intel.com]
 Sent: Friday, June 5, 2015 12:35 PM
 To: ogoh; user@spark.apache.org
 Subject: RE: SparkSQL : using Hive UDF returning Map throws rror: 
 scala.MatchError: interface java.util.Map (of class java.lang.Class) 
 (state=,code=0)

 Which version of Hive jar are you using? Hive 0.13.1 or Hive 0.12.0?

 -Original Message-
 From: ogoh [mailto:oke...@gmail.com]
 Sent: Friday, June 5, 2015 10:10 AM
 To: user@spark.apache.org
 Subject: SparkSQL : using Hive UDF returning Map throws rror: 
 scala.MatchError: interface java.util.Map (of class java.lang.Class) 
 (state=,code=0)


 Hello,
 I tested some custom udf on SparkSql's ThriftServer & Beeline (Spark 1.3.1).
 Some udfs work fine (access array parameter and returning int or string type).
 But my udf returning map type throws an error:
 Error: scala.MatchError: interface java.util.Map (of class java.lang.Class) 
 (state=,code=0)

 I converted the code into Hive's GenericUDF since I worried that using 
 complex type parameter (array of map) and returning complex type (map) can be 
 supported in Hive's GenericUDF instead of simple UDF.
 But SparkSQL doesn't seem to support GenericUDF. (error message: Error:
 java.lang.IllegalAccessException: Class
 org.apache.spark.sql.hive.HiveFunctionWrapper can not access ..).

 Below is my example udf code returning MAP type.
 I appreciate any advice.
 Thanks

 --

 public final class ArrayToMap extends UDF {

     public Map<String, String> evaluate(ArrayList<String> arrayOfString) {
         // add code to handle all index problems

         Map<String, String> map = new HashMap<String, String>();

         int count = 0;
         for (String element : arrayOfString) {
             map.put(count + "", element);
             count++;
         }
         return map;
     }
 }






 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-using-Hive-UDF-returning-Map-throws-rror-scala-MatchError-interface-java-util-Map-of-class--tp23164.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: SparkSQL : using Hive UDF returning Map throws error: scala.MatchError: interface java.util.Map (of class java.lang.Class) (state=,code=0)

2015-06-05 Thread Okehee Goh
It is Spark 1.3.1.e (it is an AWS release; I think it is close to Spark
1.3.1 with some bug fixes).

My report about GenericUDF not working in SparkSQL is wrong. I tested
with an open-source GenericUDF and it worked fine; just my GenericUDF
which returns a Map type didn't work. Sorry for the false report.



On Thu, Jun 4, 2015 at 9:35 PM, Cheng, Hao hao.ch...@intel.com wrote:
 Which version of Hive jar are you using? Hive 0.13.1 or Hive 0.12.0?

 -Original Message-
 From: ogoh [mailto:oke...@gmail.com]
 Sent: Friday, June 5, 2015 10:10 AM
 To: user@spark.apache.org
 Subject: SparkSQL : using Hive UDF returning Map throws rror: 
 scala.MatchError: interface java.util.Map (of class java.lang.Class) 
 (state=,code=0)


 Hello,
 I tested some custom udf on SparkSql's ThriftServer & Beeline (Spark 1.3.1).
 Some udfs work fine (access array parameter and returning int or string type).
 But my udf returning map type throws an error:
 Error: scala.MatchError: interface java.util.Map (of class java.lang.Class) 
 (state=,code=0)

 I converted the code into Hive's GenericUDF since I worried that using 
 complex type parameter (array of map) and returning complex type (map) can be 
 supported in Hive's GenericUDF instead of simple UDF.
 But SparkSQL doesn't seem to support GenericUDF. (error message: Error:
 java.lang.IllegalAccessException: Class
 org.apache.spark.sql.hive.HiveFunctionWrapper can not access ..).

 Below is my example udf code returning MAP type.
 I appreciate any advice.
 Thanks

 --

 public final class ArrayToMap extends UDF {

     public Map<String, String> evaluate(ArrayList<String> arrayOfString) {
         // add code to handle all index problems

         Map<String, String> map = new HashMap<String, String>();

         int count = 0;
         for (String element : arrayOfString) {
             map.put(count + "", element);
             count++;
         }
         return map;
     }
 }






 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-using-Hive-UDF-returning-Map-throws-rror-scala-MatchError-interface-java-util-Map-of-class--tp23164.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.




RE: SparkSQL : using Hive UDF returning Map throws error: scala.MatchError: interface java.util.Map (of class java.lang.Class) (state=,code=0)

2015-06-05 Thread Cheng, Hao
Confirmed: with the latest master, we don't support complex data types for simple 
Hive UDFs. Do you mind filing an issue in JIRA?
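
Until that is supported, one alternative (a sketch, not from this thread) is to skip the Hive UDF path and register a native Spark SQL UDF from Scala, which can return complex types such as maps. This assumes Spark 1.3+ and that you can run Scala code against the same SQLContext (for example in spark-shell); the function and column names are illustrative:

// Equivalent of the ArrayToMap UDF below, registered as a Spark SQL UDF.
sqlContext.udf.register("arrayToMap", (arr: Seq[String]) =>
  arr.zipWithIndex.map { case (element, i) => i.toString -> element }.toMap)

// Usage, assuming a table with an array<string> column named tags:
//   SELECT arrayToMap(tags) FROM some_table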

-Original Message-
From: Cheng, Hao [mailto:hao.ch...@intel.com] 
Sent: Friday, June 5, 2015 12:35 PM
To: ogoh; user@spark.apache.org
Subject: RE: SparkSQL : using Hive UDF returning Map throws rror: 
scala.MatchError: interface java.util.Map (of class java.lang.Class) 
(state=,code=0)

Which version of Hive jar are you using? Hive 0.13.1 or Hive 0.12.0?

-Original Message-
From: ogoh [mailto:oke...@gmail.com] 
Sent: Friday, June 5, 2015 10:10 AM
To: user@spark.apache.org
Subject: SparkSQL : using Hive UDF returning Map throws rror: 
scala.MatchError: interface java.util.Map (of class java.lang.Class) 
(state=,code=0)


Hello,
I tested some custom udf on SparkSql's ThriftServer & Beeline (Spark 1.3.1).
Some udfs work fine (access array parameter and returning int or string type). 
But my udf returning map type throws an error:
Error: scala.MatchError: interface java.util.Map (of class java.lang.Class) 
(state=,code=0)

I converted the code into Hive's GenericUDF since I worried that using complex 
type parameter (array of map) and returning complex type (map) can be supported 
in Hive's GenericUDF instead of simple UDF.
But SparkSQL doesn't seem to support GenericUDF. (error message: Error:
java.lang.IllegalAccessException: Class
org.apache.spark.sql.hive.HiveFunctionWrapper can not access ..).

Below is my example udf code returning MAP type.
I appreciate any advice.
Thanks

--

public final class ArrayToMap extends UDF {

    public Map<String, String> evaluate(ArrayList<String> arrayOfString) {
        // add code to handle all index problems

        Map<String, String> map = new HashMap<String, String>();

        int count = 0;
        for (String element : arrayOfString) {
            map.put(count + "", element);
            count++;
        }
        return map;
    }
}






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-using-Hive-UDF-returning-Map-throws-rror-scala-MatchError-interface-java-util-Map-of-class--tp23164.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




SparkSQL : using Hive UDF returning Map throws error: scala.MatchError: interface java.util.Map (of class java.lang.Class) (state=,code=0)

2015-06-04 Thread ogoh

Hello,
I tested some custom udf on SparkSql's ThriftServer & Beeline (Spark 1.3.1).
Some udfs work fine (access array parameter and returning int or string
type). 
But my udf returning map type throws an error:
Error: scala.MatchError: interface java.util.Map (of class java.lang.Class)
(state=,code=0)

I converted the code into Hive's GenericUDF since I worried that using
complex type parameter (array of map) and returning complex type (map) can
be supported in Hive's GenericUDF instead of simple UDF.
But SparkSQL doesn't seem to support GenericUDF. (error message: Error:
java.lang.IllegalAccessException: Class
org.apache.spark.sql.hive.HiveFunctionWrapper can not access ..).

Below is my example udf code returning MAP type.
I appreciate any advice.
Thanks

--

public final class ArrayToMap extends UDF {

    public Map<String, String> evaluate(ArrayList<String> arrayOfString) {
        // add code to handle all index problems

        Map<String, String> map = new HashMap<String, String>();

        int count = 0;
        for (String element : arrayOfString) {
            map.put(count + "", element);
            count++;
        }
        return map;
    }
}






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-using-Hive-UDF-returning-Map-throws-rror-scala-MatchError-interface-java-util-Map-of-class--tp23164.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




RE: SparkSQL : using Hive UDF returning Map throws error: scala.MatchError: interface java.util.Map (of class java.lang.Class) (state=,code=0)

2015-06-04 Thread Cheng, Hao
Which version of Hive jar are you using? Hive 0.13.1 or Hive 0.12.0?

-Original Message-
From: ogoh [mailto:oke...@gmail.com] 
Sent: Friday, June 5, 2015 10:10 AM
To: user@spark.apache.org
Subject: SparkSQL : using Hive UDF returning Map throws rror: 
scala.MatchError: interface java.util.Map (of class java.lang.Class) 
(state=,code=0)


Hello,
I tested some custom udf on SparkSql's ThriftServer & Beeline (Spark 1.3.1).
Some udfs work fine (access array parameter and returning int or string type). 
But my udf returning map type throws an error:
Error: scala.MatchError: interface java.util.Map (of class java.lang.Class) 
(state=,code=0)

I converted the code into Hive's GenericUDF since I worried that using complex 
type parameter (array of map) and returning complex type (map) can be supported 
in Hive's GenericUDF instead of simple UDF.
But SparkSQL doesn't seem to support GenericUDF. (error message: Error:
java.lang.IllegalAccessException: Class
org.apache.spark.sql.hive.HiveFunctionWrapper can not access ..).

Below is my example udf code returning MAP type.
I appreciate any advice.
Thanks

--

public final class ArrayToMap extends UDF {

    public Map<String, String> evaluate(ArrayList<String> arrayOfString) {
        // add code to handle all index problems

        Map<String, String> map = new HashMap<String, String>();

        int count = 0;
        for (String element : arrayOfString) {
            map.put(count + "", element);
            count++;
        }
        return map;
    }
}






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkSQL-using-Hive-UDF-returning-Map-throws-rror-scala-MatchError-interface-java-util-Map-of-class--tp23164.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




scala.MatchError: class org.apache.avro.Schema (of class java.lang.Class)

2015-04-07 Thread Yamini
I am using Spark (1.2) Streaming to read Avro-schema-based topics flowing in Kafka
and then using the Spark SQL context to register the data as a temp table. The Avro
maven plugin (version 1.7.7) generates the Java bean class for the avro file, but it
includes a field named SCHEMA$ of type org.apache.avro.Schema, which is not
supported by the JavaSQLContext class [method: applySchema].
How can I auto-generate the Java bean class for the avro file and overcome the
above problem?

Thanks.




-
Thanks,
Yamini
--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/scala-MatchError-class-org-apache-avro-Schema-of-class-java-lang-Class-tp22402.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: scala.MatchError: class org.apache.avro.Schema (of class java.lang.Class)

2015-04-07 Thread Yamini Maddirala
For more details on my question
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-generate-Java-bean-class-for-avro-files-using-spark-avro-project-tp22413.html

Thanks,
Yamini

On Tue, Apr 7, 2015 at 2:23 PM, Yamini Maddirala yamini.m...@gmail.com
wrote:

 Hi Michael,

 Yes, I did try the spark-avro 0.2.0 Databricks project. I am using CDH 5.3,
 which is based on Spark 1.2; hence I'm bound to use spark-avro 0.2.0
 instead of the latest.

 I'm not sure how the spark-avro project can help me in this scenario.

 1. I have a JavaDStream of Avro generic records:
    JavaDStream<GenericRecord> [this is the data being read from the Kafka topics]
 2. I'm able to get a JavaSchemaRDD using the avro file like this:
    final JavaSchemaRDD schemaRDD2 = AvroUtils.avroFile(sqlContext,
        "/xyz-Project/trunk/src/main/resources/xyz.avro");
 3. I don't know how I can apply the schema in step 2 to the data in step 1.
    I chose to do something like this:
    JavaSchemaRDD schemaRDD = sqlContext.applySchema(genericRecordJavaRDD,
        xyz.class);

    I used the avro maven plugin to generate the xyz class in Java, but this is not
    good because the plugin creates a field SCHEMA$ which is not
    supported by the applySchema method.

 Please let me know how to deal with this.

 Appreciate your help

 Thanks,
 Yamini












 On Tue, Apr 7, 2015 at 1:57 PM, Michael Armbrust mich...@databricks.com
 wrote:

 Have you looked at spark-avro?

 https://github.com/databricks/spark-avro

 On Tue, Apr 7, 2015 at 3:57 AM, Yamini yamini.m...@gmail.com wrote:

 Using spark(1.2) streaming to read avro schema based topics flowing in
 kafka
 and then using spark sql context to register data as temp table. Avro
 maven
 plugin(1.7.7 version) generates the java bean class for the avro file but
 includes a field named SCHEMA$ of type org.apache.avro.Schema which is
 not
 supported in the JavaSQLContext class[Method : applySchema].
 How to auto generate java bean class for the avro file and over come the
 above mentioned problem.

 Thanks.




 -
 Thanks,
 Yamini
 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/scala-MatchError-class-org-apache-avro-Schema-of-class-java-lang-Class-tp22402.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.







Re: scala.MatchError: class org.apache.avro.Schema (of class java.lang.Class)

2015-04-07 Thread Michael Armbrust
Have you looked at spark-avro?

https://github.com/databricks/spark-avro

On Tue, Apr 7, 2015 at 3:57 AM, Yamini yamini.m...@gmail.com wrote:

 Using spark(1.2) streaming to read avro schema based topics flowing in
 kafka
 and then using spark sql context to register data as temp table. Avro maven
 plugin(1.7.7 version) generates the java bean class for the avro file but
 includes a field named SCHEMA$ of type org.apache.avro.Schema which is not
 supported in the JavaSQLContext class[Method : applySchema].
 How to auto generate java bean class for the avro file and over come the
 above mentioned problem.

 Thanks.




 -
 Thanks,
 Yamini
 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/scala-MatchError-class-org-apache-avro-Schema-of-class-java-lang-Class-tp22402.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.





Re: scala.MatchError: class org.apache.avro.Schema (of class java.lang.Class)

2015-04-07 Thread Yamini Maddirala
Hi Michael,

Yes, I did try the spark-avro 0.2.0 Databricks project. I am using CDH 5.3, which
is based on Spark 1.2; hence I'm bound to use spark-avro 0.2.0 instead of
the latest.

I'm not sure how the spark-avro project can help me in this scenario.

1. I have a JavaDStream of Avro generic records:
   JavaDStream<GenericRecord> [this is the data being read from the Kafka topics]
2. I'm able to get a JavaSchemaRDD using the avro file like this:
   final JavaSchemaRDD schemaRDD2 = AvroUtils.avroFile(sqlContext,
       "/xyz-Project/trunk/src/main/resources/xyz.avro");
3. I don't know how I can apply the schema in step 2 to the data in step 1.
   I chose to do something like this:
   JavaSchemaRDD schemaRDD = sqlContext.applySchema(genericRecordJavaRDD,
       xyz.class);

   I used the avro maven plugin to generate the xyz class in Java, but this is not
   good because the plugin creates a field SCHEMA$ which is not
   supported by the applySchema method.

Please let me know how to deal with this.

Appreciate your help

Thanks,
Yamini
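
One way to avoid the generated bean class (and its SCHEMA$ field) entirely is to build Rows from the GenericRecords and apply an explicit StructType. A rough sketch, not from this thread, assuming Spark 1.3+ (where createDataFrame is available) rather than the 1.2 applySchema API; the function name and field names are placeholders:

import org.apache.avro.generic.GenericRecord
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

def toDataFrame(sqlContext: SQLContext, records: RDD[GenericRecord]): DataFrame = {
  // Field names and types here are placeholders; derive them from your Avro schema.
  val schema = StructType(Seq(
    StructField("name", StringType, nullable = true),
    StructField("count", IntegerType, nullable = true)))
  val rows = records.map { r =>
    Row(
      Option(r.get("name")).map(_.toString).orNull,  // Avro strings arrive as Utf8
      r.get("count").asInstanceOf[Integer])
  }
  sqlContext.createDataFrame(rows, schema)
}

// With a DStream, this would be called inside foreachRDD and the result registered
// as a temp table.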












On Tue, Apr 7, 2015 at 1:57 PM, Michael Armbrust mich...@databricks.com
wrote:

 Have you looked at spark-avro?

 https://github.com/databricks/spark-avro

 On Tue, Apr 7, 2015 at 3:57 AM, Yamini yamini.m...@gmail.com wrote:

 Using spark(1.2) streaming to read avro schema based topics flowing in
 kafka
 and then using spark sql context to register data as temp table. Avro
 maven
 plugin(1.7.7 version) generates the java bean class for the avro file but
 includes a field named SCHEMA$ of type org.apache.avro.Schema which is not
 supported in the JavaSQLContext class[Method : applySchema].
 How to auto generate java bean class for the avro file and over come the
 above mentioned problem.

 Thanks.




 -
 Thanks,
 Yamini
 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/scala-MatchError-class-org-apache-avro-Schema-of-class-java-lang-Class-tp22402.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.






Q about Spark MLlib- Decision tree - scala.MatchError: 2.0 (of class java.lang.Double)

2014-12-14 Thread jake Lim
I am working on a Spark MLlib test (decision tree) and I used the IRIS data
from the CRAN R package.
The original IRIS data is not in a good format for Spark MLlib, so I changed the
data format (the data layout and the features' locations).

When I ran the sample Spark MLlib code for the decision tree, I got the error below.
How can I solve this error?
==
14/12/15 14:27:30 ERROR TaskSetManager: Task 21.0:0 failed 4 times; aborting
job
14/12/15 14:27:30 INFO TaskSchedulerImpl: Cancelling stage 21
14/12/15 14:27:30 INFO DAGScheduler: Failed to run aggregate at
DecisionTree.scala:657
14/12/15 14:27:30 INFO TaskSchedulerImpl: Stage 21 was cancelled
14/12/15 14:27:30 WARN TaskSetManager: Loss was due to
org.apache.spark.TaskKilledException
org.apache.spark.TaskKilledException
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
14/12/15 14:27:30 INFO TaskSchedulerImpl: Removed TaskSet 21.0, whose tasks
have all completed, from pool
org.apache.spark.SparkException: Job aborted due to stage failure: Task
21.0:0 failed 4 times, most recent failure: Exception failure in TID 34 on
host krbda1anode01.kr.test.com: scala.MatchError: 2.0 (of class
java.lang.Double)
   
org.apache.spark.mllib.tree.DecisionTree$.classificationBinSeqOp$1(DecisionTree.scala:568)
   
org.apache.spark.mllib.tree.DecisionTree$.org$apache$spark$mllib$tree$DecisionTree$$binSeqOp$1(DecisionTree.scala:623)
   
org.apache.spark.mllib.tree.DecisionTree$$anonfun$4.apply(DecisionTree.scala:657)
   
org.apache.spark.mllib.tree.DecisionTree$$anonfun$4.apply(DecisionTree.scala:657)
   
scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
   
scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
scala.collection.Iterator$class.foreach(Iterator.scala:727)
scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
   
scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)
   
scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)
scala.collection.AbstractIterator.aggregate(Iterator.scala:1157)
org.apache.spark.rdd.RDD$$anonfun$21.apply(RDD.scala:838)
org.apache.spark.rdd.RDD$$anonfun$21.apply(RDD.scala:838)
   
org.apache.spark.SparkContext$$anonfun$23.apply(SparkContext.scala:1116)
   
org.apache.spark.SparkContext$$anonfun$23.apply(SparkContext.scala:1116)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
org.apache.spark.scheduler.Task.run(Task.scala:51)
   
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
   
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
at scala.Option.foreach(Option.scala:236)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979
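
The MatchError on the label 2.0 is consistent with running classification with the default two classes: DecisionTree expects class labels in {0, 1, ..., numClasses - 1}, so a third IRIS species (label 2.0) falls through the match unless numClasses is raised. A small sketch, not from this thread, assuming a newer MLlib API (Spark 1.2+) and an existing SparkContext sc; the feature values are just stand-ins for the re-encoded IRIS rows:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree

// Toy stand-in for the re-encoded IRIS data: species mapped to 0.0, 1.0, 2.0.
val data = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(5.1, 3.5, 1.4, 0.2)),
  LabeledPoint(1.0, Vectors.dense(7.0, 3.2, 4.7, 1.4)),
  LabeledPoint(2.0, Vectors.dense(6.3, 3.3, 6.0, 2.5))))

val numClasses = 3                              // labels must lie in {0, ..., numClasses - 1}
val categoricalFeaturesInfo = Map[Int, Int]()   // all features treated as continuous
val model = DecisionTree.trainClassifier(
  data, numClasses, categoricalFeaturesInfo, "gini", 5, 32)  // impurity, maxDepth, maxBins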

scala.MatchError on SparkSQL when creating ArrayType of StructType

2014-12-05 Thread Hao Ren
Hi, 

I am using SparkSQL on 1.1.0 branch. 

The following code leads to a scala.MatchError 
at
org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:247) 

val scm = StructType(inputRDD.schema.fields.init :+
  StructField("list",
    ArrayType(
      StructType(
        Seq(StructField("date", StringType, nullable = false),
            StructField("nbPurchase", IntegerType, nullable = false)))),
    nullable = false))

// purchaseRDD is an RDD[sql.Row] whose schema corresponds to scm. It is
// transformed from inputRDD
val schemaRDD = hiveContext.applySchema(purchaseRDD, scm)
schemaRDD.registerTempTable("t_purchase")

Here's the stackTrace: 
scala.MatchError: ArrayType(StructType(List(StructField(date,StringType,
true ), StructField(n_reachat,IntegerType, true ))),true) (of class
org.apache.spark.sql.catalyst.types.ArrayType) 
at
org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:247) 
at
org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:247) 
at
org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:263) 
at
org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:84) 
at
org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:66)
 
at
org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:50)
 
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) 
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) 
at
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:149)
 
at
org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158)
 
at
org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158)
 
at
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62) 
at org.apache.spark.scheduler.Task.run(Task.scala:54) 
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) 
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) 
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) 
at java.lang.Thread.run(Thread.java:744) 

The strange thing is that nullable for the date and nbPurchase fields is set to
true, while they were false in the code. If I set both to true, it works. But,
in fact, they should not be nullable.

Here's what I find at Cast.scala:247 on 1.1.0 branch 

  private[this] lazy val cast: Any => Any = dataType match {
    case StringType => castToString
    case BinaryType => castToBinary
    case DecimalType => castToDecimal
    case TimestampType => castToTimestamp
    case BooleanType => castToBoolean
    case ByteType => castToByte
    case ShortType => castToShort
    case IntegerType => castToInt
    case FloatType => castToFloat
    case LongType => castToLong
    case DoubleType => castToDouble
  }

Any idea? Thank you. 

Hao



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/scala-MatchError-on-SparkSQL-when-creating-ArrayType-of-StructType-tp20459.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: scala.MatchError on SparkSQL when creating ArrayType of StructType

2014-12-05 Thread Michael Armbrust
All values in Hive are always nullable, though you should still not be
seeing this error.

It should be addressed by this patch:
https://github.com/apache/spark/pull/3150
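
In the meantime, the workaround hinted at in the original message (declaring the nested fields nullable, which is how Hive treats them anyway) looks like this; the sketch simply reuses the inputRDD, hiveContext and purchaseRDD names from the snippet quoted below, with the same type imports:

val scm = StructType(inputRDD.schema.fields.init :+
  StructField("list",
    ArrayType(StructType(Seq(
      StructField("date", StringType, nullable = true),
      StructField("nbPurchase", IntegerType, nullable = true)))),
    nullable = true))

val schemaRDD = hiveContext.applySchema(purchaseRDD, scm)
schemaRDD.registerTempTable("t_purchase")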

On Fri, Dec 5, 2014 at 2:36 AM, Hao Ren inv...@gmail.com wrote:

 Hi,

 I am using SparkSQL on 1.1.0 branch.

 The following code leads to a scala.MatchError
 at

 org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:247)

 val scm = StructType(inputRDD.schema.fields.init :+
   StructField("list",
     ArrayType(
       StructType(
         Seq(StructField("date", StringType, nullable = false),
             StructField("nbPurchase", IntegerType, nullable = false)))),
     nullable = false))

 // purchaseRDD is an RDD[sql.Row] whose schema corresponds to scm. It is
 // transformed from inputRDD
 val schemaRDD = hiveContext.applySchema(purchaseRDD, scm)
 schemaRDD.registerTempTable("t_purchase")

 Here's the stackTrace:
 scala.MatchError: ArrayType(StructType(List(StructField(date,StringType,
 true ), StructField(n_reachat,IntegerType, true ))),true) (of class
 org.apache.spark.sql.catalyst.types.ArrayType)
 at

 org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:247)
 at
 org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:247)
 at
 org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:263)
 at

 org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:84)
 at

 org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:66)
 at

 org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:50)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
 at
 org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org
 $apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:149)
 at

 org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158)
 at

 org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158)
 at
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
 at org.apache.spark.scheduler.Task.run(Task.scala:54)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
 at

 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)

 The strange thing is that nullable of date and nbPurchase field are set to
 true while it were false in the code. If I set both to true, it works. But,
 in fact, they should not be nullable.

 Here's what I find at Cast.scala:247 on 1.1.0 branch

   private[this] lazy val cast: Any => Any = dataType match {
     case StringType => castToString
     case BinaryType => castToBinary
     case DecimalType => castToDecimal
     case TimestampType => castToTimestamp
     case BooleanType => castToBoolean
     case ByteType => castToByte
     case ShortType => castToShort
     case IntegerType => castToInt
     case FloatType => castToFloat
     case LongType => castToLong
     case DoubleType => castToDouble
   }

 Any idea? Thank you.

 Hao



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/scala-MatchError-on-SparkSQL-when-creating-ArrayType-of-StructType-tp20459.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.





RE: scala.MatchError

2014-11-12 Thread Naveen Kumar Pokala
Hi,

Do you mean that with Java I shouldn't have an Issue class as a property (attribute)
of the Instrument class?

Ex:

class Issue {
    int a;
}

class Instrument {
    Issue issue;
}

How about Scala? Does it support such user-defined data types in classes, e.g. a
case class Issue?

case class Issue(a: Int = 0)

case class Instrument(issue: Issue = null)




-Naveen
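
For the Scala side of the question: nested case classes should work with the Scala API, since the schema is inferred through Scala reflection rather than the Java bean introspection that fails here. A rough sketch, not from this thread, assuming the Spark 1.1-style SchemaRDD API and an existing SparkContext sc:

import org.apache.spark.sql.SQLContext

case class Issue(a: Int = 0)
case class Instrument(issue: Issue)

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD  // implicit RDD[Product] -> SchemaRDD conversion

val instruments = sc.parallelize(Seq(Instrument(Issue(1)), Instrument(Issue(2))))
instruments.registerTempTable("instrument")
sqlContext.sql("SELECT issue.a FROM instrument").collect()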

From: Michael Armbrust [mailto:mich...@databricks.com]
Sent: Wednesday, November 12, 2014 12:09 AM
To: Xiangrui Meng
Cc: Naveen Kumar Pokala; user@spark.apache.org
Subject: Re: scala.MatchError

Xiangrui is correct that it must be a Java bean; also, nested classes are not 
yet supported in Java.

On Tue, Nov 11, 2014 at 10:11 AM, Xiangrui Meng 
men...@gmail.com wrote:
I think you need a Java bean class instead of a normal class. See
example here: http://spark.apache.org/docs/1.1.0/sql-programming-guide.html
(switch to the java tab). -Xiangrui

On Tue, Nov 11, 2014 at 7:18 AM, Naveen Kumar Pokala
npok...@spcapitaliq.com wrote:
 Hi,



 This is my Instrument java constructor.



 public Instrument(Issue issue, Issuer issuer, Issuing issuing) {

 super();

 this.issue = issue;

 this.issuer = issuer;

 this.issuing = issuing;

 }





 I am trying to create javaschemaRDD



 JavaSchemaRDD schemaInstruments = sqlCtx.applySchema(distData,
 Instrument.class);



 Remarks:

 



 Instrument, Issue, Issuer, Issuing all are java classes



 distData is holding List<Instrument>





 I am getting the following error.







 Exception in thread Driver java.lang.reflect.InvocationTargetException

 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

 at java.lang.reflect.Method.invoke(Method.java:483)

 at
 org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162)

 Caused by: scala.MatchError: class sample.spark.test.Issue (of class
 java.lang.Class)

 at
 org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$getSchema$1.apply(JavaSQLContext.scala:189)

 at
 org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$getSchema$1.apply(JavaSQLContext.scala:188)

 at
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)

 at
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)

 at
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)

 at
 scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)

 at
 scala.collection.TraversableLike$class.map(TraversableLike.scala:244)

 at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)

 at
 org.apache.spark.sql.api.java.JavaSQLContext.getSchema(JavaSQLContext.scala:188)

 at
 org.apache.spark.sql.api.java.JavaSQLContext.applySchema(JavaSQLContext.scala:90)

 at sample.spark.test.SparkJob.main(SparkJob.java:33)

 ... 5 more



 Please help me.



 Regards,

 Naveen.



scala.MatchError

2014-11-11 Thread Naveen Kumar Pokala
Hi,

This is my Instrument java constructor.

public Instrument(Issue issue, Issuer issuer, Issuing issuing) {
super();
this.issue = issue;
this.issuer = issuer;
this.issuing = issuing;
}


I am trying to create javaschemaRDD

JavaSchemaRDD schemaInstruments = sqlCtx.applySchema(distData, 
Instrument.class);

Remarks:


Instrument, Issue, Issuer, Issuing all are java classes

distData is holding List<Instrument>


I am getting the following error.



Exception in thread Driver java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162)
Caused by: scala.MatchError: class sample.spark.test.Issue (of class 
java.lang.Class)
at 
org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$getSchema$1.apply(JavaSQLContext.scala:189)
at 
org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$getSchema$1.apply(JavaSQLContext.scala:188)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at 
org.apache.spark.sql.api.java.JavaSQLContext.getSchema(JavaSQLContext.scala:188)
at 
org.apache.spark.sql.api.java.JavaSQLContext.applySchema(JavaSQLContext.scala:90)
at sample.spark.test.SparkJob.main(SparkJob.java:33)
... 5 more

Please help me.

Regards,
Naveen.


Re: scala.MatchError

2014-11-11 Thread Xiangrui Meng
I think you need a Java bean class instead of a normal class. See
example here: http://spark.apache.org/docs/1.1.0/sql-programming-guide.html
(switch to the java tab). -Xiangrui

On Tue, Nov 11, 2014 at 7:18 AM, Naveen Kumar Pokala
npok...@spcapitaliq.com wrote:
 Hi,



 This is my Instrument java constructor.



 public Instrument(Issue issue, Issuer issuer, Issuing issuing) {

 super();

 this.issue = issue;

 this.issuer = issuer;

 this.issuing = issuing;

 }





 I am trying to create javaschemaRDD



 JavaSchemaRDD schemaInstruments = sqlCtx.applySchema(distData,
 Instrument.class);



 Remarks:

 



 Instrument, Issue, Issuer, Issuing all are java classes



  distData is holding List<Instrument>





 I am getting the following error.







 Exception in thread Driver java.lang.reflect.InvocationTargetException

 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)

 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

 at java.lang.reflect.Method.invoke(Method.java:483)

 at
 org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162)

 Caused by: scala.MatchError: class sample.spark.test.Issue (of class
 java.lang.Class)

 at
 org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$getSchema$1.apply(JavaSQLContext.scala:189)

 at
 org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$getSchema$1.apply(JavaSQLContext.scala:188)

 at
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)

 at
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)

 at
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)

 at
 scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)

 at
 scala.collection.TraversableLike$class.map(TraversableLike.scala:244)

 at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)

 at
 org.apache.spark.sql.api.java.JavaSQLContext.getSchema(JavaSQLContext.scala:188)

 at
 org.apache.spark.sql.api.java.JavaSQLContext.applySchema(JavaSQLContext.scala:90)

 at sample.spark.test.SparkJob.main(SparkJob.java:33)

 ... 5 more



 Please help me.



 Regards,

 Naveen.




Re: scala.MatchError

2014-11-11 Thread Michael Armbrust
Xiangrui is correct that it must be a Java bean; also, nested classes are
not yet supported in Java.

On Tue, Nov 11, 2014 at 10:11 AM, Xiangrui Meng men...@gmail.com wrote:

 I think you need a Java bean class instead of a normal class. See
 example here:
 http://spark.apache.org/docs/1.1.0/sql-programming-guide.html
 (switch to the java tab). -Xiangrui

 On Tue, Nov 11, 2014 at 7:18 AM, Naveen Kumar Pokala
 npok...@spcapitaliq.com wrote:
  Hi,
 
 
 
  This is my Instrument java constructor.
 
 
 
  public Instrument(Issue issue, Issuer issuer, Issuing issuing) {
 
  super();
 
  this.issue = issue;
 
  this.issuer = issuer;
 
  this.issuing = issuing;
 
  }
 
 
 
 
 
  I am trying to create javaschemaRDD
 
 
 
  JavaSchemaRDD schemaInstruments = sqlCtx.applySchema(distData,
  Instrument.class);
 
 
 
  Remarks:
 
  
 
 
 
  Instrument, Issue, Issuer, Issuing all are java classes
 
 
 
   distData is holding List<Instrument>
 
 
 
 
 
  I am getting the following error.
 
 
 
 
 
 
 
  Exception in thread Driver java.lang.reflect.InvocationTargetException
 
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 
  at
 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 
  at
 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 
  at java.lang.reflect.Method.invoke(Method.java:483)
 
  at
 
 org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:162)
 
  Caused by: scala.MatchError: class sample.spark.test.Issue (of class
  java.lang.Class)
 
  at
 
 org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$getSchema$1.apply(JavaSQLContext.scala:189)
 
  at
 
 org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$getSchema$1.apply(JavaSQLContext.scala:188)
 
  at
 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 
  at
 
 scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
 
  at
 
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
 
  at
  scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
 
  at
  scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
 
  at
 scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
 
  at
 
 org.apache.spark.sql.api.java.JavaSQLContext.getSchema(JavaSQLContext.scala:188)
 
  at
 
 org.apache.spark.sql.api.java.JavaSQLContext.applySchema(JavaSQLContext.scala:90)
 
  at sample.spark.test.SparkJob.main(SparkJob.java:33)
 
  ... 5 more
 
 
 
  Please help me.
 
 
 
  Regards,
 
  Naveen.





scala.MatchError: class java.sql.Timestamp

2014-10-19 Thread Ge, Yao (Y.)
I am working with Spark 1.1.0 and I believe Timestamp is a supported data type 
for Spark SQL. However I keep getting this MatchError for java.sql.Timestamp 
when I try to use reflection to register a Java Bean with Timestamp field. 
Anything wrong with my code below?

public static class Event implements Serializable {
    private String name;
    private Timestamp time;
    public String getName() {
        return name;
    }
    public void setName(String name) {
        this.name = name;
    }
    public Timestamp getTime() {
        return time;
    }
    public void setTime(Timestamp time) {
        this.time = time;
    }
}

@Test
public void testTimeStamp() {
    JavaSparkContext sc = new JavaSparkContext("local", "timestamp");
    String[] data = {"1,2014-01-01", "2,2014-02-01"};
    JavaRDD<String> input = sc.parallelize(Arrays.asList(data));
    JavaRDD<Event> events = input.map(new Function<String, Event>() {
        public Event call(String arg0) throws Exception {
            String[] c = arg0.split(",");
            Event e = new Event();
            e.setName(c[0]);
            DateFormat fmt = new SimpleDateFormat("yyyy-MM-dd");
            e.setTime(new Timestamp(fmt.parse(c[1]).getTime()));
            return e;
        }
    });

    JavaSQLContext sqlCtx = new JavaSQLContext(sc);
    JavaSchemaRDD schemaEvent = sqlCtx.applySchema(events, Event.class);
    schemaEvent.registerTempTable("event");

    sc.stop();
}


RE: scala.MatchError: class java.sql.Timestamp

2014-10-19 Thread Wang, Daoyuan
Can you provide the exception stack?

Thanks,
Daoyuan

From: Ge, Yao (Y.) [mailto:y...@ford.com]
Sent: Sunday, October 19, 2014 10:17 PM
To: user@spark.apache.org
Subject: scala.MatchError: class java.sql.Timestamp

I am working with Spark 1.1.0 and I believe Timestamp is a supported data type 
for Spark SQL. However I keep getting this MatchError for java.sql.Timestamp 
when I try to use reflection to register a Java Bean with Timestamp field. 
Anything wrong with my code below?

public static class Event implements Serializable {
    private String name;
    private Timestamp time;
    public String getName() {
        return name;
    }
    public void setName(String name) {
        this.name = name;
    }
    public Timestamp getTime() {
        return time;
    }
    public void setTime(Timestamp time) {
        this.time = time;
    }
}

@Test
public void testTimeStamp() {
    JavaSparkContext sc = new JavaSparkContext("local", "timestamp");
    String[] data = {"1,2014-01-01", "2,2014-02-01"};
    JavaRDD<String> input = sc.parallelize(Arrays.asList(data));
    JavaRDD<Event> events = input.map(new Function<String, Event>() {
        public Event call(String arg0) throws Exception {
            String[] c = arg0.split(",");
            Event e = new Event();
            e.setName(c[0]);
            DateFormat fmt = new SimpleDateFormat("yyyy-MM-dd");
            e.setTime(new Timestamp(fmt.parse(c[1]).getTime()));
            return e;
        }
    });

    JavaSQLContext sqlCtx = new JavaSQLContext(sc);
    JavaSchemaRDD schemaEvent = sqlCtx.applySchema(events, Event.class);
    schemaEvent.registerTempTable("event");

    sc.stop();
}


RE: scala.MatchError: class java.sql.Timestamp

2014-10-19 Thread Ge, Yao (Y.)
scala.MatchError: class java.sql.Timestamp (of class java.lang.Class)
at 
org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$getSchema$1.apply(JavaSQLContext.scala:189)
at 
org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$getSchema$1.apply(JavaSQLContext.scala:188)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at 
scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at 
scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at 
org.apache.spark.sql.api.java.JavaSQLContext.getSchema(JavaSQLContext.scala:188)
at 
org.apache.spark.sql.api.java.JavaSQLContext.applySchema(JavaSQLContext.scala:90)
at 
com.ford.dtc.ff.SessionStats.testTimeStamp(SessionStats.java:111)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown 
Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
at 
org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60)
at 
org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229)
at 
org.junit.runners.ParentRunner.access$000(ParentRunner.java:50)
at 
org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222)
at org.junit.runners.ParentRunner.run(ParentRunner.java:300)
at 
org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
at 
org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)


From: Wang, Daoyuan [mailto:daoyuan.w...@intel.com]
Sent: Sunday, October 19, 2014 10:31 AM
To: Ge, Yao (Y.); user@spark.apache.org
Subject: RE: scala.MatchError: class java.sql.Timestamp

Can you provide the exception stack?

Thanks,
Daoyuan

From: Ge, Yao (Y.) [mailto:y...@ford.com]
Sent: Sunday, October 19, 2014 10:17 PM
To: user@spark.apache.org
Subject: scala.MatchError: class java.sql.Timestamp

I am working with Spark 1.1.0 and I believe Timestamp is a supported data type 
for Spark SQL. However I keep getting this MatchError for java.sql.Timestamp 
when I try to use reflection to register a Java Bean with Timestamp field. 
Anything wrong with my code below?

public static class Event implements Serializable {
private String name;
private Timestamp time;
public String getName() {
return name;
}
public void setName(String name) {
this.name = name;
}
public Timestamp getTime() {
return time;
}
public void setTime(Timestamp time) {
this.time = time;
}
}

@Test
public void testTimeStamp() {
JavaSparkContext sc = new

RE: scala.MatchError: class java.sql.Timestamp

2014-10-19 Thread Cheng, Hao
Seems like a bug in JavaSQLContext.getSchema(), which doesn't enumerate all of 
the data types supported by Catalyst.

From: Ge, Yao (Y.) [mailto:y...@ford.com]
Sent: Sunday, October 19, 2014 11:44 PM
To: Wang, Daoyuan; user@spark.apache.org
Subject: RE: scala.MatchError: class java.sql.Timestamp

scala.MatchError: class java.sql.Timestamp (of class java.lang.Class)
at 
org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$getSchema$1.apply(JavaSQLContext.scala:189)
at 
org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$getSchema$1.apply(JavaSQLContext.scala:188)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at 
scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at 
scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at 
org.apache.spark.sql.api.java.JavaSQLContext.getSchema(JavaSQLContext.scala:188)
at 
org.apache.spark.sql.api.java.JavaSQLContext.applySchema(JavaSQLContext.scala:90)
at 
com.ford.dtc.ff.SessionStats.testTimeStamp(SessionStats.java:111)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown 
Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
at 
org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60)
at 
org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229)
at 
org.junit.runners.ParentRunner.access$000(ParentRunner.java:50)
at 
org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222)
at org.junit.runners.ParentRunner.run(ParentRunner.java:300)
at 
org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
at 
org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)


From: Wang, Daoyuan [mailto:daoyuan.w...@intel.com]
Sent: Sunday, October 19, 2014 10:31 AM
To: Ge, Yao (Y.); user@spark.apache.org
Subject: RE: scala.MatchError: class java.sql.Timestamp

Can you provide the exception stack?

Thanks,
Daoyuan

From: Ge, Yao (Y.) [mailto:y...@ford.com]
Sent: Sunday, October 19, 2014 10:17 PM
To: user@spark.apache.org
Subject: scala.MatchError: class java.sql.Timestamp

I am working with Spark 1.1.0 and I believe Timestamp is a supported data type 
for Spark SQL. However I keep getting this MatchError for java.sql.Timestamp 
when I try to use reflection to register a Java Bean with Timestamp field. 
Anything wrong with my code below?

public static class Event implements Serializable {
private String name;
private Timestamp time;
public String getName() {
return name;
}
public void setName(String name) {
this.name = name;
}
public Timestamp getTime() {
return time

RE: scala.MatchError: class java.sql.Timestamp

2014-10-19 Thread Wang, Daoyuan
I have created an issue for this 
https://issues.apache.org/jira/browse/SPARK-4003
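
Until that is fixed, a possible workaround (a sketch, not from this thread) is to build the schema through the Scala API, whose reflection-based inference should handle java.sql.Timestamp. It assumes the Spark 1.1-style SchemaRDD API and an existing SparkContext sc:

import java.sql.Timestamp
import org.apache.spark.sql.SQLContext

case class Event(name: String, time: Timestamp)

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD  // implicit RDD[Product] -> SchemaRDD conversion

val events = sc.parallelize(Seq("1,2014-01-01", "2,2014-02-01")).map { line =>
  val Array(name, date) = line.split(",")
  Event(name, Timestamp.valueOf(date + " 00:00:00"))
}
events.registerTempTable("event")
sqlContext.sql("SELECT name, time FROM event").collect()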


From: Cheng, Hao
Sent: Monday, October 20, 2014 9:20 AM
To: Ge, Yao (Y.); Wang, Daoyuan; user@spark.apache.org
Subject: RE: scala.MatchError: class java.sql.Timestamp

Seems bugs in the JavaSQLContext.getSchema(), which doesn't enumerate all of 
the data types supported by Catalyst.
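
As a hedged sketch (not from this thread): in later Spark versions a Scala case class with a java.sql.Timestamp field is mapped to TimestampType through the Scala reflection path, so a case-class version of the bean sidesteps JavaSQLContext.getSchema entirely. The SparkSession named spark and the sample row are assumptions for illustration, e.g. a 2.x spark-shell.

import java.sql.Timestamp

// Case-class equivalent of the Event bean from the original post (names assumed).
case class Event(name: String, time: Timestamp)

// `spark` is the SparkSession provided by the 2.x spark-shell.
val df = spark.createDataFrame(Seq(Event("login", new Timestamp(System.currentTimeMillis()))))
df.printSchema()   // time is inferred as TimestampType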

From: Ge, Yao (Y.) [mailto:y...@ford.com]
Sent: Sunday, October 19, 2014 11:44 PM
To: Wang, Daoyuan; user@spark.apache.org
Subject: RE: scala.MatchError: class java.sql.Timestamp

scala.MatchError: class java.sql.Timestamp (of class java.lang.Class)
at 
org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$getSchema$1.apply(JavaSQLContext.scala:189)
at 
org.apache.spark.sql.api.java.JavaSQLContext$$anonfun$getSchema$1.apply(JavaSQLContext.scala:188)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at 
scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at 
scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at 
scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at 
org.apache.spark.sql.api.java.JavaSQLContext.getSchema(JavaSQLContext.scala:188)
at 
org.apache.spark.sql.api.java.JavaSQLContext.applySchema(JavaSQLContext.scala:90)
at 
com.ford.dtc.ff.SessionStats.testTimeStamp(SessionStats.java:111)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown 
Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68)
at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
at 
org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60)
at 
org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229)
at 
org.junit.runners.ParentRunner.access$000(ParentRunner.java:50)
at 
org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222)
at org.junit.runners.ParentRunner.run(ParentRunner.java:300)
at 
org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
at 
org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)


From: Wang, Daoyuan [mailto:daoyuan.w...@intel.com]
Sent: Sunday, October 19, 2014 10:31 AM
To: Ge, Yao (Y.); user@spark.apache.org
Subject: RE: scala.MatchError: class java.sql.Timestamp

Can you provide the exception stack?

Thanks,
Daoyuan

From: Ge, Yao (Y.) [mailto:y...@ford.com]
Sent: Sunday, October 19, 2014 10:17 PM
To: user@spark.apache.org
Subject: scala.MatchError: class java.sql.Timestamp

I am working with Spark 1.1.0 and I believe Timestamp is a supported data type 
for Spark SQL. However I keep getting this MatchError for java.sql.Timestamp 
when I try to use reflection to register a Java Bean with Timestamp field. 
Anything wrong with my code below?

public static class Event implements Serializable {
    private String name;
    private Timestamp time;
    public String getName() {
        return name;
    }
    public void

Are scala.MatchError messages a problem?

2014-06-08 Thread Jeremy Lee
I shut down my first (working) cluster and brought up a fresh one... and
It's been a bit of a horror and I need to sleep now. Should I be worried
about these errors? Or did I just have the old log4j.config tuned so I
didn't see them?

I

14/06/08 16:32:52 ERROR scheduler.JobScheduler: Error running job streaming
job 1402245172000 ms.2
scala.MatchError: 0101-01-10 (of class java.lang.String)
at SimpleApp$$anonfun$6$$anonfun$apply$6.apply(SimpleApp.scala:218)
at SimpleApp$$anonfun$6$$anonfun$apply$6.apply(SimpleApp.scala:217)
at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at
scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at SimpleApp$$anonfun$6.apply(SimpleApp.scala:217)
at SimpleApp$$anonfun$6.apply(SimpleApp.scala:214)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:527)
at
org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:527)
at
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41)
at
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
at
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:172)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)


The error comes from this code, which seemed like a sensible way to match
things:
(The case cmd_plus(w) statement is generating the error.)

val cmd_plus = """[+]([\w]+)""".r
val cmd_minus = """[-]([\w]+)""".r
// find command user tweets
val commands = stream.map(
status => ( status.getUser().getId(), status.getText() )
).foreachRDD(rdd => {
rdd.join(superusers).map(
x => x._2._1
).collect().foreach{ cmd => {
218:  cmd match {
case cmd_plus(w) => {
...
} case cmd_minus(w) => { ... } } }} })

It seems a bit excessive for Scala to throw exceptions because a regex
didn't match. Something feels wrong.


Re: Are scala.MatchError messages a problem?

2014-06-08 Thread Sean Owen
A match clause needs to cover all the possibilities, and not matching
any regex is a distinct possibility. It's not really like 'switch'
because it requires this and I think that has benefits, like being
able to interpret a match as something with a type. I think it's all
in order, but it's more of a Scala thing than Spark thing.

You just need a case _ => ... to cover anything else.

(You can avoid two extra levels of scope with .foreach(_ match { ... }) BTW)
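
A minimal sketch of the catch-all case suggested above, reusing the regexes from the original post; the handle name and the println bodies are placeholders, not the original actions:

val cmd_plus = """[+](\w+)""".r
val cmd_minus = """[-](\w+)""".r

def handle(cmd: String): Unit = cmd match {
  case cmd_plus(w)  => println(s"plus command for $w")
  case cmd_minus(w) => println(s"minus command for $w")
  case _            => ()   // anything that is not a command is ignored instead of throwing
}

handle("+alice")           // plus command for alice
handle("not a command")    // silently ignored, no scala.MatchError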

On Sun, Jun 8, 2014 at 12:44 PM, Jeremy Lee
unorthodox.engine...@gmail.com wrote:

 I shut down my first (working) cluster and brought up a fresh one... and
 It's been a bit of a horror and I need to sleep now. Should I be worried
 about these errors? Or did I just have the old log4j.config tuned so I
 didn't see them?

 I

 14/06/08 16:32:52 ERROR scheduler.JobScheduler: Error running job streaming
 job 1402245172000 ms.2
 scala.MatchError: 0101-01-10 (of class java.lang.String)
 at SimpleApp$$anonfun$6$$anonfun$apply$6.apply(SimpleApp.scala:218)
 at SimpleApp$$anonfun$6$$anonfun$apply$6.apply(SimpleApp.scala:217)
 at
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
 at
 scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
 at SimpleApp$$anonfun$6.apply(SimpleApp.scala:217)
 at SimpleApp$$anonfun$6.apply(SimpleApp.scala:214)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:527)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:527)
 at
 org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41)
 at
 org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
 at
 org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
 at scala.util.Try$.apply(Try.scala:161)
 at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
 at
 org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:172)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)


 The error comes from this code, which seemed like a sensible way to match
 things:
 (The case cmd_plus(w) statement is generating the error,)

 val cmd_plus = """[+]([\w]+)""".r
 val cmd_minus = """[-]([\w]+)""".r
 // find command user tweets
 val commands = stream.map(
 status => ( status.getUser().getId(), status.getText() )
 ).foreachRDD(rdd => {
 rdd.join(superusers).map(
 x => x._2._1
 ).collect().foreach{ cmd => {
 218: cmd match {
 case cmd_plus(w) => {
 ...
 } case cmd_minus(w) => { ... } } }} })

 It seems a bit excessive for scala to throw exceptions because a regex
 didn't match. Something feels wrong.


Re: Are scala.MatchError messages a problem?

2014-06-08 Thread Nick Pentreath
When you use match, the match must be exhaustive. That is, a match error is 
thrown if the match fails. 


That's why you usually handle the default case using case _ => ...




Here it looks like you're taking the text of all statuses - which means not all 
of them will be commands... which means your match will not be exhaustive.




The solution is either to add a default case which does nothing, or probably 
better to add a .filter such that you filter out anything that's not a command 
before matching.
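
A sketch of the filter-first idea, using a small local RDD in place of the original stream; the SparkContext, sample data and the println bodies are assumptions for illustration, while the regexes mirror the original post:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("cmd-filter").setMaster("local[*]"))

val cmd_plus = """[+](\w+)""".r
val cmd_minus = """[-](\w+)""".r

val texts = sc.parallelize(Seq("+alice", "-bob", "just a normal tweet"))

// Keep only strings the later match can handle, using the same patterns.
val commands = texts.filter {
  case cmd_plus(_) | cmd_minus(_) => true
  case _                          => false
}

commands.collect().foreach {
  case cmd_plus(w)  => println(s"add $w")     // placeholder action
  case cmd_minus(w) => println(s"remove $w")  // placeholder action
}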




Just looking at it again, it could also be that you take x => x._2._1 ... What 
type is that? Should it not be a Seq if you're joining, in which case the match 
will also fail...




Hope this helps.
—
Sent from Mailbox

On Sun, Jun 8, 2014 at 6:45 PM, Jeremy Lee unorthodox.engine...@gmail.com
wrote:

 I shut down my first (working) cluster and brought up a fresh one... and
 It's been a bit of a horror and I need to sleep now. Should I be worried
 about these errors? Or did I just have the old log4j.config tuned so I
 didn't see them?
 I
 14/06/08 16:32:52 ERROR scheduler.JobScheduler: Error running job streaming
 job 1402245172000 ms.2
 scala.MatchError: 0101-01-10 (of class java.lang.String)
 at SimpleApp$$anonfun$6$$anonfun$apply$6.apply(SimpleApp.scala:218)
 at SimpleApp$$anonfun$6$$anonfun$apply$6.apply(SimpleApp.scala:217)
 at
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
 at
 scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
 at SimpleApp$$anonfun$6.apply(SimpleApp.scala:217)
 at SimpleApp$$anonfun$6.apply(SimpleApp.scala:214)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:527)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:527)
 at
 org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41)
 at
 org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
 at
 org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
 at scala.util.Try$.apply(Try.scala:161)
 at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
 at
 org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:172)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)
 The error comes from this code, which seemed like a sensible way to match
 things:
 (The case cmd_plus(w) statement is generating the error,)
 val cmd_plus = """[+]([\w]+)""".r
 val cmd_minus = """[-]([\w]+)""".r
 // find command user tweets
 val commands = stream.map(
 status => ( status.getUser().getId(), status.getText() )
 ).foreachRDD(rdd => {
 rdd.join(superusers).map(
 x => x._2._1
 ).collect().foreach{ cmd => {
 218:  cmd match {
 case cmd_plus(w) => {
 ...
 } case cmd_minus(w) => { ... } } }} })
 It seems a bit excessive for scala to throw exceptions because a regex
 didn't match. Something feels wrong.

Re: Are scala.MatchError messages a problem?

2014-06-08 Thread Mark Hamstra

 The solution is either to add a default case which does nothing, or
 probably better to add a .filter such that you filter out anything that's
 not a command before matching.


And you probably want to push down that filter into the cluster --
collecting all of the elements of an RDD only to not use or filter out some
of them isn't an efficient usage of expensive (at least in terms of
time/performance) network resources.  There may also be a good opportunity
to use the partial function form of collect to push even more processing
into the cluster.
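
A sketch of the partial-function form of collect mentioned above: the pattern match runs on the executors and non-commands are dropped there, so only parsed commands come back to the driver. An existing SparkContext sc and the sample data are assumptions for illustration; the regexes mirror the original post.

val cmd_plus = """[+](\w+)""".r
val cmd_minus = """[-](\w+)""".r

val texts = sc.parallelize(Seq("+alice", "-bob", "not a command"))

// RDD.collect(PartialFunction) filters and maps in one distributed pass.
val parsed = texts.collect {
  case cmd_plus(w)  => ("add", w)
  case cmd_minus(w) => ("remove", w)
}

parsed.collect().foreach(println)   // only ("add", "alice") and ("remove", "bob") reach the driver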



On Sun, Jun 8, 2014 at 10:00 AM, Nick Pentreath nick.pentre...@gmail.com
wrote:

 When you use match, the match must be exhaustive. That is, a match error
 is thrown if the match fails.

 That's why you usually handle the default case using case _ => ...

 Here it looks like you're taking the text of all statuses - which means not
 all of them will be commands... Which means your match will not be
 exhaustive.

 The solution is either to add a default case which does nothing, or
 probably better to add a .filter such that you filter out anything that's
 not a command before matching.

 Just looking at it again, it could also be that you take x => x._2._1 ...
 What type is that? Should it not be a Seq if you're joining, in which case
 the match will also fail...

 Hope this helps.
 —
 Sent from Mailbox


 On Sun, Jun 8, 2014 at 6:45 PM, Jeremy Lee unorthodox.engine...@gmail.com
  wrote:


 I shut down my first (working) cluster and brought up a fresh one... and
 It's been a bit of a horror and I need to sleep now. Should I be worried
 about these errors? Or did I just have the old log4j.config tuned so I
 didn't see them?

 I

  14/06/08 16:32:52 ERROR scheduler.JobScheduler: Error running job
 streaming job 1402245172000 ms.2
 scala.MatchError: 0101-01-10 (of class java.lang.String)
  at
 SimpleApp$$anonfun$6$$anonfun$apply$6.apply(SimpleApp.scala:218)
 at
 SimpleApp$$anonfun$6$$anonfun$apply$6.apply(SimpleApp.scala:217)
 at
 scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
 at
 scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
 at SimpleApp$$anonfun$6.apply(SimpleApp.scala:217)
 at SimpleApp$$anonfun$6.apply(SimpleApp.scala:214)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:527)
 at
 org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:527)
 at
 org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41)
 at
 org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
 at
 org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
 at scala.util.Try$.apply(Try.scala:161)
 at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
 at
 org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:172)
 at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:744)


 The error comes from this code, which seemed like a sensible way to match
 things:
 (The case cmd_plus(w) statement is generating the error,)

   val cmd_plus = """[+]([\w]+)""".r
  val cmd_minus = """[-]([\w]+)""".r
  // find command user tweets
  val commands = stream.map(
  status => ( status.getUser().getId(), status.getText() )
  ).foreachRDD(rdd => {
  rdd.join(superusers).map(
  x => x._2._1
  ).collect().foreach{ cmd => {
  218:  cmd match {
  case cmd_plus(w) => {
  ...
 } case cmd_minus(w) => { ... } } }} })

  It seems a bit excessive for scala to throw exceptions because a regex
 didn't match. Something feels wrong.





Re: Are scala.MatchError messages a problem?

2014-06-08 Thread Jeremy Lee
On Sun, Jun 8, 2014 at 10:00 AM, Nick Pentreath nick.pentre...@gmail.com
 wrote:

 When you use match, the match must be exhaustive. That is, a match error
 is thrown if the match fails.


Ahh, right. That makes sense. Scala is applying its strong typing rules
here instead of no ceremony... but isn't the idea that type errors should
get picked up at compile time? I suppose the compiler can't tell there's
not complete coverage, but it seems strange to throw that at runtime when
it is literally the 'default case'.

I think I need a good Scala Programming Guide... any suggestions? I've
read and watched the usual resources and videos, but it feels like a shotgun
approach and I've clearly missed a lot.

On Mon, Jun 9, 2014 at 3:26 AM, Mark Hamstra m...@clearstorydata.com
wrote:

 And you probably want to push down that filter into the cluster --
 collecting all of the elements of an RDD only to not use or filter out some
 of them isn't an efficient usage of expensive (at least in terms of
 time/performance) network resources.  There may also be a good opportunity
 to use the partial function form of collect to push even more processing
 into the cluster.


I almost certainly do :-) And I am really looking forward to spending time
optimizing the code, but I keep getting caught up on deployment issues,
uberjars, missing /mnt/spark directories, only being able to submit from
the master, and being thoroughly confused about sample code from three
versions ago.

I'm even thinking of learning maven, if it means I never have to use sbt
again. Does it mean that?

-- 
Jeremy Lee  BCompSci(Hons)
  The Unorthodox Engineers


Re: Are scala.MatchError messages a problem?

2014-06-08 Thread Tobias Pfeiffer
Jeremy,

On Mon, Jun 9, 2014 at 10:22 AM, Jeremy Lee
unorthodox.engine...@gmail.com wrote:
 When you use match, the match must be exhaustive. That is, a match error
 is thrown if the match fails.

 Ahh, right. That makes sense. Scala is applying its strong typing rules
 here instead of no ceremony... but isn't the idea that type errors should
 get picked up at compile time? I suppose the compiler can't tell there's not
 complete coverage, but it seems strange to throw that at runtime when it is
 literally the 'default case'.

You can use subclasses of sealed traits to get a compiler warning
for non-exhaustive matches:
http://stackoverflow.com/questions/11203268/what-is-a-sealed-trait
I don't know if it can be applied for regular expression matching, though...

Tobias
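
A sketch of the sealed-trait idea: parse the raw strings into a small command ADT first (the regex step still needs its own catch-all), and any later match on the sealed type gets a "match may not be exhaustive" warning at compile time if a case is missing. All names below are illustrative, not from the thread.

sealed trait Command
case class Plus(user: String) extends Command
case class Minus(user: String) extends Command
case class NotACommand(raw: String) extends Command

val cmd_plus = """[+](\w+)""".r
val cmd_minus = """[-](\w+)""".r

def parse(s: String): Command = s match {
  case cmd_plus(w)  => Plus(w)
  case cmd_minus(w) => Minus(w)
  case other        => NotACommand(other)   // the String match still needs a default
}

// Dropping any of these cases makes the compiler warn that the match may not
// be exhaustive, instead of the program failing with a MatchError at runtime.
def describe(c: Command): String = c match {
  case Plus(u)        => s"add $u"
  case Minus(u)       => s"remove $u"
  case NotACommand(r) => s"ignore: $r"
}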