Untraceable NullPointerException

2022-12-09 Thread Alberto Huélamo
Hello,

I have a job that runs on Databricks Runtime 11.3 LTS (so Spark 3.3.0) and is
failing with a NullPointerException, and the stack trace contains no reference
to the job's own code, only Spark internals. This makes me wonder whether we're
dealing with a Spark bug or whether there is indeed an underlying cause on our
side.

I found a thread
https://lists.apache.org/thread/x55y75sdgo334nmm0gm2r8glhy8w4j05 that
reports a similar stack trace to the one I'm linking.

Here's the complete stack trace:
https://gist.github.com/alhuelamo/a77baa211ea4bb6febc4303cc3d9fd8a

Thanks in advance for any insights you may provide!

~Alberto


Re: NullPointerException in SparkSession while reading Parquet files on S3

2021-05-25 Thread YEONWOO BAEK
unsubscribe



NullPointerException in SparkSession while reading Parquet files on S3

2021-05-25 Thread Eric Beabes
I keep getting the following exception when I am trying to read a Parquet
file from a Path on S3 in Spark/Scala. Note: I am running this on EMR.

java.lang.NullPointerException
at 
org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:144)
at 
org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:142)
at 
org.apache.spark.sql.DataFrameReader.<init>(DataFrameReader.scala:789)
at org.apache.spark.sql.SparkSession.read(SparkSession.scala:656)

Interestingly I can read the path from Spark shell:

scala> val df = spark.read.parquet("s3://my-path/").count
df: Long = 47

I've created the SparkSession as follows:

val sparkConf = new SparkConf().setAppName("My spark app")
val spark = SparkSession.builder.config(sparkConf).enableHiveSupport().getOrCreate()
spark.sparkContext.setLogLevel("WARN")
spark.sparkContext.hadoopConfiguration.set("java.library.path",
"/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native")
spark.conf.set("spark.sql.parquet.mergeSchema", "true")
spark.conf.set("spark.speculation", "false")
spark.conf.set("spark.sql.crossJoin.enabled", "true")
spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "true")
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
spark.sparkContext.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version",
"2")
spark.sparkContext.hadoopConfiguration.setBoolean("mapreduce.fileoutputcommitter.cleanup.skipped",
true)
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key",
System.getenv("AWS_ACCESS_KEY_ID"))
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key",
System.getenv("AWS_SECRET_ACCESS_KEY"))
spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint",
"s3.amazonaws.com")

Here's the line where I am getting this exception:

val df1 = spark.read.parquet(pathToRead)

What am I doing wrong? I have tried without setting 'access key' &
'secret key' as well with no luck.


Re: [Structured Streaming] NullPointerException in long running query

2020-04-29 Thread ZHANG Wei
Is there any chance we could also print the least recent (first) failure in the
stage, in the same way as the following most recent failure, before the driver
stacktrace?

> >>   Caused by: org.apache.spark.SparkException: Job aborted due to stage
> >> failure: Task 10 in stage 1.0 failed 4 times, most recent failure: Lost
> >> task 10.3 in stage 1.0 (TID 81, spark6, executor 1):
> >> java.lang.NullPointerException
> >> Driver stacktrace:

-- 
Cheers,
-z

On Tue, 28 Apr 2020 23:48:17 -0700
"Shixiong(Ryan) Zhu"  wrote:

> The stack trace is omitted by JVM when an exception is thrown too
> many times. This usually happens when you have multiple Spark tasks on the
> same executor JVM throwing the same exception. See
> https://stackoverflow.com/a/3010106
> 
> Best Regards,
> Ryan
> 
> 
> On Tue, Apr 28, 2020 at 10:45 PM lec ssmi  wrote:
> 
> > It should be a problem of my data quality. It's curious why the
> > driver-side exception stack has no specific exception information.
> >
> > Edgardo Szrajber wrote on Tue, Apr 28, 2020 at 3:32 PM:
> >
> >> The exception occured while aborting the stage. It might be interesting
> >> to try to understand the reason for the abortion.
> >> Maybe timeout? How long the query run?
> >> Bentzi
> >>
> >> Sent from Yahoo Mail on Android
> >> 
> >>
> >> On Tue, Apr 28, 2020 at 9:25, Jungtaek Lim
> >>  wrote:
> >> The root cause of exception is occurred in executor side "Lost task 10.3
> >> in stage 1.0 (TID 81, spark6, executor 1)" so you may need to check there.
> >>
> >> On Tue, Apr 28, 2020 at 2:52 PM lec ssmi  wrote:
> >>
> >> Hi:
> >>   One of my long-running queries occasionally encountered the following
> >> exception:
> >>
> >>
> >>   Caused by: org.apache.spark.SparkException: Job aborted due to stage
> >> failure: Task 10 in stage 1.0 failed 4 times, most recent failure: Lost
> >> task 10.3 in stage 1.0 (TID 81, spark6, executor 1):
> >> java.lang.NullPointerException
> >> Driver stacktrace:
> >> at org.apache.spark.scheduler.DAGScheduler.org
> >> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
> >> at
> >> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
> >> at
> >> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
> >> at
> >> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> >> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> >> at
> >> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
> >> at
> >> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
> >> at
> >> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
> >> at scala.Option.foreach(Option.scala:257)
> >> at
> >> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
> >> at
> >> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
> >> at
> >> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
> >> at
> >> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
> >> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> >> at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
> >> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
> >> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
> >> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
> >> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
> >> at
> >> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:929)
> >> at
> >> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:927)
> >> at
> >> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> >> at
> >> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
> >> at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
> >> at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:927)
> >> at
> >> org.apache.spark.sql.execution.streaming.ForeachSink.addBatch(ForeachSink.scala:49)
> >> at
> >> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3$$anonfun$apply$16.apply(MicroBatchExecution.scala:475)
> >> at
> >> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
> >> at
> >> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3.apply(MicroBatchExecution.scala:473)
> >> at
> >> 

Re: [Structured Streaming] NullPointerException in long running query

2020-04-29 Thread Shixiong(Ryan) Zhu
The stack trace is omitted by the JVM when an exception is thrown too many
times. This usually happens when you have multiple Spark tasks on the same
executor JVM throwing the same exception. See
https://stackoverflow.com/a/3010106
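
For reference, that HotSpot optimization can be disabled so the full trace is always kept, by passing -XX:-OmitStackTraceInFastThrow to the executor JVMs. A minimal sketch, assuming the flag is set through the usual Spark conf; the app name is illustrative, and the driver-side equivalent is normally passed via --driver-java-options on spark-submit:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Keep full stack traces even for exceptions the JIT considers "hot".
val conf = new SparkConf()
  .setAppName("npe-debug")    // illustrative
  .set("spark.executor.extraJavaOptions", "-XX:-OmitStackTraceInFastThrow")

val spark = SparkSession.builder.config(conf).getOrCreate()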

Best Regards,
Ryan


On Tue, Apr 28, 2020 at 10:45 PM lec ssmi  wrote:

> It should be a problem of my data quality. It's curious why the
> driver-side exception stack has no specific exception information.
>
> Edgardo Szrajber wrote on Tue, Apr 28, 2020 at 3:32 PM:
>
>> The exception occured while aborting the stage. It might be interesting
>> to try to understand the reason for the abortion.
>> Maybe timeout? How long the query run?
>> Bentzi
>>
>> Sent from Yahoo Mail on Android
>> 
>>
>> On Tue, Apr 28, 2020 at 9:25, Jungtaek Lim
>>  wrote:
>> The root cause of exception is occurred in executor side "Lost task 10.3
>> in stage 1.0 (TID 81, spark6, executor 1)" so you may need to check there.
>>
>> On Tue, Apr 28, 2020 at 2:52 PM lec ssmi  wrote:
>>
>> Hi:
>>   One of my long-running queries occasionally encountered the following
>> exception:
>>
>>
>>   Caused by: org.apache.spark.SparkException: Job aborted due to stage
>> failure: Task 10 in stage 1.0 failed 4 times, most recent failure: Lost
>> task 10.3 in stage 1.0 (TID 81, spark6, executor 1):
>> java.lang.NullPointerException
>> Driver stacktrace:
>> at org.apache.spark.scheduler.DAGScheduler.org
>> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
>> at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
>> at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
>> at
>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>> at
>> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
>> at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
>> at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
>> at scala.Option.foreach(Option.scala:257)
>> at
>> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
>> at
>> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
>> at
>> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
>> at
>> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
>> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>> at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
>> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
>> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
>> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
>> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
>> at
>> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:929)
>> at
>> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:927)
>> at
>> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>> at
>> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>> at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
>> at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:927)
>> at
>> org.apache.spark.sql.execution.streaming.ForeachSink.addBatch(ForeachSink.scala:49)
>> at
>> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3$$anonfun$apply$16.apply(MicroBatchExecution.scala:475)
>> at
>> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
>> at
>> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3.apply(MicroBatchExecution.scala:473)
>> at
>> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
>> at
>> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
>> at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org
>> $apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:472)
>> at
>> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:133)
>> at
>> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
>> at
>> 

Re: [Structured Streaming] NullPointerException in long running query

2020-04-28 Thread lec ssmi
It is probably a problem with my data quality. It's curious that the driver-side
exception stack has no specific exception information.

Edgardo Szrajber wrote on Tue, Apr 28, 2020 at 3:32 PM:

> The exception occured while aborting the stage. It might be interesting to
> try to understand the reason for the abortion.
> Maybe timeout? How long the query run?
> Bentzi
>
> Sent from Yahoo Mail on Android
> 
>
> On Tue, Apr 28, 2020 at 9:25, Jungtaek Lim
>  wrote:
> The root cause of exception is occurred in executor side "Lost task 10.3
> in stage 1.0 (TID 81, spark6, executor 1)" so you may need to check there.
>
> On Tue, Apr 28, 2020 at 2:52 PM lec ssmi  wrote:
>
> Hi:
>   One of my long-running queries occasionally encountered the following
> exception:
>
>
>   Caused by: org.apache.spark.SparkException: Job aborted due to stage
> failure: Task 10 in stage 1.0 failed 4 times, most recent failure: Lost
> task 10.3 in stage 1.0 (TID 81, spark6, executor 1):
> java.lang.NullPointerException
> Driver stacktrace:
> at org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
> at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
> at scala.Option.foreach(Option.scala:257)
> at
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
> at
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:929)
> at
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:927)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
> at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:927)
> at
> org.apache.spark.sql.execution.streaming.ForeachSink.addBatch(ForeachSink.scala:49)
> at
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3$$anonfun$apply$16.apply(MicroBatchExecution.scala:475)
> at
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
> at
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3.apply(MicroBatchExecution.scala:473)
> at
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
> at
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
> at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org
> $apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:472)
> at
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:133)
> at
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
> at
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
> at
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
> at
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
> at
> 

Re: [Structured Streaming] NullPointerException in long running query

2020-04-28 Thread Edgardo Szrajber
The exception occurred while aborting the stage. It might be interesting to try
to understand the reason for the abort. Maybe a timeout? How long does the query
run?
Bentzi

Sent from Yahoo Mail on Android

On Tue, Apr 28, 2020 at 9:25, Jungtaek Lim wrote:
The root cause of the exception occurred on the executor side ("Lost task 10.3
in stage 1.0 (TID 81, spark6, executor 1)"), so you may need to check there.
On Tue, Apr 28, 2020 at 2:52 PM lec ssmi  wrote:

Hi:  One of my long-running queries occasionally encountered the following 
exception:


  Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 10 in stage 1.0 failed 4 times, most recent failure: Lost task 10.3 in 
stage 1.0 (TID 81, spark6, executor 1): java.lang.NullPointerException
Driver stacktrace:
at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
at scala.Option.foreach(Option.scala:257)
at 
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:929)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:927)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:927)
at 
org.apache.spark.sql.execution.streaming.ForeachSink.addBatch(ForeachSink.scala:49)
at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3$$anonfun$apply$16.apply(MicroBatchExecution.scala:475)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3.apply(MicroBatchExecution.scala:473)
at 
org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
at 
org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:472)
at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:133)
at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
at 
org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
at 
org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:121)
at 
org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:117)
at 
org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279)
... 1 more


According to the 

Re: [Structured Streaming] NullPointerException in long running query

2020-04-28 Thread Jungtaek Lim
The root cause of the exception occurred on the executor side ("Lost task 10.3 in
stage 1.0 (TID 81, spark6, executor 1)"), so you may need to check there.

On Tue, Apr 28, 2020 at 2:52 PM lec ssmi  wrote:

> Hi:
>   One of my long-running queries occasionally encountered the following
> exception:
>
>
>   Caused by: org.apache.spark.SparkException: Job aborted due to stage
>> failure: Task 10 in stage 1.0 failed 4 times, most recent failure: Lost
>> task 10.3 in stage 1.0 (TID 81, spark6, executor 1):
>> java.lang.NullPointerException
>> Driver stacktrace:
>> at org.apache.spark.scheduler.DAGScheduler.org
>> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
>> at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
>> at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
>> at
>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>> at
>> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
>> at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
>> at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
>> at scala.Option.foreach(Option.scala:257)
>> at
>> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
>> at
>> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
>> at
>> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
>> at
>> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
>> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>> at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
>> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
>> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
>> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
>> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
>> at
>> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:929)
>> at
>> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:927)
>> at
>> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>> at
>> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>> at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
>> at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:927)
>> at
>> org.apache.spark.sql.execution.streaming.ForeachSink.addBatch(ForeachSink.scala:49)
>> at
>> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3$$anonfun$apply$16.apply(MicroBatchExecution.scala:475)
>> at
>> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
>> at
>> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3.apply(MicroBatchExecution.scala:473)
>> at
>> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
>> at
>> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
>> at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org
>> $apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:472)
>> at
>> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:133)
>> at
>> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
>> at
>> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
>> at
>> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
>> at
>> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
>> at
>> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:121)
>> at
>> org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
>> at
>> org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:117)
>> at org.apache.spark.sql.execution.streaming.StreamExecution.org
>> $apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279)
>> ... 1 more
>
>
>
> According to the exception stack, it seems to have 

[Structured Streaming] NullPointerException in long running query

2020-04-27 Thread lec ssmi
Hi:
  One of my long-running queries occasionally encountered the following
exception:


  Caused by: org.apache.spark.SparkException: Job aborted due to stage
> failure: Task 10 in stage 1.0 failed 4 times, most recent failure: Lost
> task 10.3 in stage 1.0 (TID 81, spark6, executor 1):
> java.lang.NullPointerException
> Driver stacktrace:
> at org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1602)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1590)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1589)
> at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1589)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
> at scala.Option.foreach(Option.scala:257)
> at
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1823)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1772)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1761)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
> at
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:929)
> at
> org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:927)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
> at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
> at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:927)
> at
> org.apache.spark.sql.execution.streaming.ForeachSink.addBatch(ForeachSink.scala:49)
> at
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3$$anonfun$apply$16.apply(MicroBatchExecution.scala:475)
> at
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
> at
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$3.apply(MicroBatchExecution.scala:473)
> at
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
> at
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
> at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org
> $apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:472)
> at
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:133)
> at
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
> at
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
> at
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
> at
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
> at
> org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:121)
> at
> org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
> at
> org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:117)
> at org.apache.spark.sql.execution.streaming.StreamExecution.org
> $apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279)
> ... 1 more



According to the exception stack, it seems to have nothing to do with the
logic of my code. Is this a Spark bug or something else? The Spark version is
2.3.1.

Best
Lec Ssmi


NullPointerException at FileBasedWriteAheadLogRandomReader

2019-12-27 Thread Kang Minwoo
Hello, Users.

While using write-ahead logs in Spark Streaming, I got an error, a
NullPointerException at FileBasedWriteAheadLogRandomReader.scala:48 [1]

[1]: 
https://github.com/apache/spark/blob/v2.4.4/streaming/src/main/scala/org/apache/spark/streaming/util/FileBasedWriteAheadLogRandomReader.scala#L48

Full stack trace:
Caused by: org.apache.spark.SparkException: Could not read data from write 
ahead log record 
FileBasedWriteAheadLogSegment(hdfs://.../receivedData/0/log-...,...)
at 
org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.org$apache$spark$streaming$rdd$WriteAheadLogBackedBlockRDD$$getBlockFromWriteAheadLog$1(WriteAheadLogBackedBlockRDD.scala:145)
at 
org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD$$anonfun$compute$1.apply(WriteAheadLogBackedBlockRDD.scala:173)
at 
org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD$$anonfun$compute$1.apply(WriteAheadLogBackedBlockRDD.scala:173)
at scala.Option.getOrElse(Option.scala:121)
at 
org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.compute(WriteAheadLogBackedBlockRDD.scala:173)
//...
at 
org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
at 
org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:935)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
... 3 more
Caused by: java.lang.NullPointerException
at 
org.apache.spark.streaming.util.FileBasedWriteAheadLogRandomReader.close(FileBasedWriteAheadLogRandomReader.scala:48)
at 
org.apache.spark.streaming.util.FileBasedWriteAheadLog.read(FileBasedWriteAheadLog.scala:122)
at 
org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDD.org$apache$spark$streaming$rdd$WriteAheadLogBackedBlockRDD$$getBlockFromWriteAheadLog$1(WriteAheadLogBackedBlockRDD.scala:142)
... 50 more


- Spark version: 2.4.4
- Hadoop version: 2.7.1
- spark conf
- "spark.streaming.receiver.writeAheadLog.enable" -> "true"

Did I do something wrong?

Best regards,
Minwoo Kang


NullPointerException when scanning HBase table

2018-04-30 Thread Huiliang Zhang
Hi,

In my Spark job, I need to scan an HBase table. I set up a Scan with custom
filters, then use the newAPIHadoopRDD function to get a JavaPairRDD variable X.

The problem is that when no record in HBase matches my filters, calling
X.isEmpty() or X.count() causes a java.lang.NullPointerException.
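
For context, a rough sketch of that kind of setup (a custom Scan handed to newAPIHadoopRDD); the table name, filter and variable names here are assumptions for illustration, not the original code, and it assumes an HBase version in which TableMapReduceUtil.convertScanToString is public:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{Result, Scan}
import org.apache.hadoop.hbase.filter.PrefixFilter
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableMapReduceUtil}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.SparkContext

val sc = SparkContext.getOrCreate()   // an already-running SparkContext

// A custom scan; the filter is illustrative only.
val scan = new Scan()
scan.setFilter(new PrefixFilter(Bytes.toBytes("some-prefix")))

val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "my_table")   // assumed table name
hbaseConf.set(TableInputFormat.SCAN, TableMapReduceUtil.convertScanToString(scan))

val x = sc.newAPIHadoopRDD(hbaseConf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

x.isEmpty()   // the call that failed here when the filter matched nothing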

Part of trace is here:
Caused by: java.lang.RuntimeException: java.lang.NullPointerException
at 
org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:208)
at 
org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:320)
at 
org.apache.hadoop.hbase.client.ClientScanner.nextScanner(ClientScanner.java:295)
at 
org.apache.hadoop.hbase.client.ClientScanner.initializeScannerInConstruction(ClientScanner.java:160)
at 
org.apache.hadoop.hbase.client.ClientScanner.<init>(ClientScanner.java:155)
at org.apache.hadoop.hbase.client.HTable.getScanner(HTable.java:821)
at 
org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:193)
at 
org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:89)
at 
org.apache.hadoop.hbase.client.MetaScanner.allTableRegions(MetaScanner.java:324)
at 
org.apache.hadoop.hbase.client.HRegionLocator.getAllRegionLocations(HRegionLocator.java:88)
at 
org.apache.hadoop.hbase.util.RegionSizeCalculator.init(RegionSizeCalculator.java:94)
at 
org.apache.hadoop.hbase.util.RegionSizeCalculator.<init>(RegionSizeCalculator.java:81)
at 
org.apache.hadoop.hbase.mapreduce.TableInputFormatBase.getSplits(TableInputFormatBase.java:256)
at 
org.apache.spark.rdd.NewHadoopRDD.getPartitions(NewHadoopRDD.scala:125)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
at 
org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply$mcZ$sp(RDD.scala:1461)
at org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply(RDD.scala:1461)
at org.apache.spark.rdd.RDD$$anonfun$isEmpty$1.apply(RDD.scala:1461)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.RDD.isEmpty(RDD.scala:1460)
at 
org.apache.spark.api.java.JavaRDDLike$class.isEmpty(JavaRDDLike.scala:544)
at 
org.apache.spark.api.java.AbstractJavaRDDLike.isEmpty(JavaRDDLike.scala:45)
...


I just override the setConf() function to call setScan().

Please let me know whether this is a Spark bug or an issue in my code.

Thanks,

Huiliang


Re: Nullpointerexception error when in repartition

2018-04-12 Thread Junfeng Chen
Hi,
I know that, but my purpose is to transform JSON strings that are already in a
Dataset into a structured Dataset, while spark.readStream can only read JSON
files from a specified path.
https://stackoverflow.com/questions/48617474/how-to-convert-json-dataset-to-dataframe-in-spark-structured-streaming
gives essentially the right method, but the formats of the individual JSON
records are not all the same. Also, the Spark Java API does not seem to support
syntax like

.select(from_json($"value", colourSchema))
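
For what it's worth, the `$"value"` shorthand is Scala-only, but the same select is expressible through the static helpers in org.apache.spark.sql.functions (from_json, col), which are callable from Java as well. A minimal Scala sketch, with the schema and sample record assumed:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

// Assumed schema; the real JSON layout varies per record in this case.
val colourSchema = StructType(Seq(
  StructField("colour", StringType),
  StructField("code", StringType)))

// Stand-in for a Dataset of JSON strings with a column named "value".
val ds = Seq("""{"colour":"red","code":"#f00"}""").toDS().toDF("value")

val parsed = ds.select(from_json(col("value"), colourSchema).as("data")).select("data.*")
parsed.show()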



Regard,
Junfeng Chen

On Fri, Apr 13, 2018 at 7:09 AM, Tathagata Das <tathagata.das1...@gmail.com>
wrote:

> Have you read through the documentation of Structured Streaming?
> https://spark.apache.org/docs/latest/structured-streaming-
> programming-guide.html
>
> One of the basic mistakes you are making is defining the dataset as with
> `spark.read()`. You define a streaming Dataset as `spark.readStream()`
>
> On Thu, Apr 12, 2018 at 3:02 AM, Junfeng Chen <darou...@gmail.com> wrote:
>
>> Hi, Tathagata
>>
>> I have tried structured streaming, but in line
>>
>>> Dataset rowDataset = spark.read().json(jsondataset);
>>
>>
>> Always throw
>>
>>> Queries with streaming sources must be executed with writeStream.start()
>>
>>
>> But what i need to do in this step is only transforming json string data
>> to Dataset . How to fix it?
>>
>> Thanks!
>>
>>
>> Regard,
>> Junfeng Chen
>>
>> On Thu, Apr 12, 2018 at 3:08 PM, Tathagata Das <
>> tathagata.das1...@gmail.com> wrote:
>>
>>> It's not very surprising that doing this sort of RDD to DF conversion
>>> inside DStream.foreachRDD has weird corner cases like this. In fact, you
>>> are going to have additional problems with partial parquet files (when
>>> there are failures) in this approach. I strongly suggest that you use
>>> Structured Streaming, which is designed to do this sort of processing. It
>>> will take care of tracking the written parquet files correctly.
>>>
>>> TD
>>>
>>> On Wed, Apr 11, 2018 at 6:58 PM, Junfeng Chen <darou...@gmail.com>
>>> wrote:
>>>
>>>> I write a program to read some json data from kafka and purpose to save
>>>> them to parquet file on hdfs.
>>>> Here is my code:
>>>>
>>>>> JavaInputDstream stream = ...
>>>>> JavaDstream rdd = stream.map...
>>>>> rdd.repartition(taksNum).foreachRDD(VoldFunction<JavaRDD
>>>>> stringjavardd->{
>>>>> Dataset df = spark.read().json( stringjavardd ); // convert
>>>>> json to df
>>>>> JavaRDD rowJavaRDD = df.javaRDD().map...  //add some new
>>>>> fields
>>>>> StructType type = df.schema()...; // constuct new type for new
>>>>> added fields
>>>>> Dataset<Row) newdf = spark.createDataFrame(rowJavaRDD.type);
>>>>> //create new dataframe
>>>>> newdf.repatition(taskNum).write().mode(SaveMode.Append).pati
>>>>> tionedBy("appname").parquet(savepath); // save to parquet
>>>>> })
>>>>
>>>>
>>>>
>>>> However, if I remove the repartition method of newdf in writing parquet
>>>> stage, the program always throw nullpointerexception error in json convert
>>>> line:
>>>>
>>>> Java.lang.NullPointerException
>>>>>  at org.apache.spark.SparkContext.getPreferredLocs(SparkContext.
>>>>> scala:1783)
>>>>> ...
>>>>
>>>>
>>>> While it looks make no sense, writing parquet operation should be in
>>>> different stage with json transforming operation.
>>>> So how to solve it? Thanks!
>>>>
>>>> Regard,
>>>> Junfeng Chen
>>>>
>>>
>>>
>>
>


Re: Nullpointerexception error when in repartition

2018-04-12 Thread Tathagata Das
Have you read through the documentation of Structured Streaming?
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html

One of the basic mistakes you are making is defining the dataset with
`spark.read()`. You define a streaming Dataset with `spark.readStream()`.

On Thu, Apr 12, 2018 at 3:02 AM, Junfeng Chen <darou...@gmail.com> wrote:

> Hi, Tathagata
>
> I have tried structured streaming, but in line
>
>> Dataset rowDataset = spark.read().json(jsondataset);
>
>
> Always throw
>
>> Queries with streaming sources must be executed with writeStream.start()
>
>
> But what i need to do in this step is only transforming json string data
> to Dataset . How to fix it?
>
> Thanks!
>
>
> Regard,
> Junfeng Chen
>
> On Thu, Apr 12, 2018 at 3:08 PM, Tathagata Das <
> tathagata.das1...@gmail.com> wrote:
>
>> It's not very surprising that doing this sort of RDD to DF conversion
>> inside DStream.foreachRDD has weird corner cases like this. In fact, you
>> are going to have additional problems with partial parquet files (when
>> there are failures) in this approach. I strongly suggest that you use
>> Structured Streaming, which is designed to do this sort of processing. It
>> will take care of tracking the written parquet files correctly.
>>
>> TD
>>
>> On Wed, Apr 11, 2018 at 6:58 PM, Junfeng Chen <darou...@gmail.com> wrote:
>>
>>> I write a program to read some json data from kafka and purpose to save
>>> them to parquet file on hdfs.
>>> Here is my code:
>>>
>>>> JavaInputDstream stream = ...
>>>> JavaDstream rdd = stream.map...
>>>> rdd.repartition(taksNum).foreachRDD(VoldFunction<JavaRDD
>>>> stringjavardd->{
>>>> Dataset df = spark.read().json( stringjavardd ); // convert
>>>> json to df
>>>> JavaRDD rowJavaRDD = df.javaRDD().map...  //add some new fields
>>>> StructType type = df.schema()...; // constuct new type for new
>>>> added fields
>>>> Dataset<Row) newdf = spark.createDataFrame(rowJavaRDD.type);
>>>> //create new dataframe
>>>> newdf.repatition(taskNum).write().mode(SaveMode.Append).pati
>>>> tionedBy("appname").parquet(savepath); // save to parquet
>>>> })
>>>
>>>
>>>
>>> However, if I remove the repartition method of newdf in writing parquet
>>> stage, the program always throw nullpointerexception error in json convert
>>> line:
>>>
>>> Java.lang.NullPointerException
>>>>  at org.apache.spark.SparkContext.getPreferredLocs(SparkContext.
>>>> scala:1783)
>>>> ...
>>>
>>>
>>> While it looks make no sense, writing parquet operation should be in
>>> different stage with json transforming operation.
>>> So how to solve it? Thanks!
>>>
>>> Regard,
>>> Junfeng Chen
>>>
>>
>>
>


Re: Nullpointerexception error when in repartition

2018-04-12 Thread Junfeng Chen
Hi, Tathagata

I have tried Structured Streaming, but the line

> Dataset rowDataset = spark.read().json(jsondataset);

always throws

> Queries with streaming sources must be executed with writeStream.start()

But all I need to do in this step is transform the JSON string data into a
Dataset. How do I fix it?

Thanks!


Regard,
Junfeng Chen

On Thu, Apr 12, 2018 at 3:08 PM, Tathagata Das <tathagata.das1...@gmail.com>
wrote:

> It's not very surprising that doing this sort of RDD to DF conversion
> inside DStream.foreachRDD has weird corner cases like this. In fact, you
> are going to have additional problems with partial parquet files (when
> there are failures) in this approach. I strongly suggest that you use
> Structured Streaming, which is designed to do this sort of processing. It
> will take care of tracking the written parquet files correctly.
>
> TD
>
> On Wed, Apr 11, 2018 at 6:58 PM, Junfeng Chen <darou...@gmail.com> wrote:
>
>> I write a program to read some json data from kafka and purpose to save
>> them to parquet file on hdfs.
>> Here is my code:
>>
>>> JavaInputDstream stream = ...
>>> JavaDstream rdd = stream.map...
>>> rdd.repartition(taksNum).foreachRDD(VoldFunction<JavaRDD
>>> stringjavardd->{
>>> Dataset df = spark.read().json( stringjavardd ); // convert
>>> json to df
>>> JavaRDD rowJavaRDD = df.javaRDD().map...  //add some new fields
>>> StructType type = df.schema()...; // constuct new type for new added
>>> fields
>>> Dataset<Row) newdf = spark.createDataFrame(rowJavaRDD.type);
>>> //create new dataframe
>>> newdf.repatition(taskNum).write().mode(SaveMode.Append).pati
>>> tionedBy("appname").parquet(savepath); // save to parquet
>>> })
>>
>>
>>
>> However, if I remove the repartition method of newdf in writing parquet
>> stage, the program always throw nullpointerexception error in json convert
>> line:
>>
>> Java.lang.NullPointerException
>>>  at org.apache.spark.SparkContext.getPreferredLocs(SparkContext.
>>> scala:1783)
>>> ...
>>
>>
>> While it looks make no sense, writing parquet operation should be in
>> different stage with json transforming operation.
>> So how to solve it? Thanks!
>>
>> Regard,
>> Junfeng Chen
>>
>
>


Re: Nullpointerexception error when in repartition

2018-04-12 Thread Tathagata Das
It's not very surprising that doing this sort of RDD to DF conversion
inside DStream.foreachRDD has weird corner cases like this. In fact, you
are going to have additional problems with partial parquet files (when
there are failures) in this approach. I strongly suggest that you use
Structured Streaming, which is designed to do this sort of processing. It
will take care of tracking the written parquet files correctly.

TD
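
A rough sketch of what that looks like for the Kafka-to-parquet case in question; the topic, brokers, schema, paths and partition column are assumptions for illustration:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{StringType, StructType}

val spark = SparkSession.builder.appName("kafka-to-parquet").getOrCreate()   // illustrative

// Assumed schema of the incoming JSON records.
val jsonSchema = new StructType().add("appname", StringType).add("payload", StringType)

val parsed = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")   // illustrative
  .option("subscribe", "events")                      // illustrative
  .load()
  .select(from_json(col("value").cast("string"), jsonSchema).as("data"))
  .select("data.*")

// Structured Streaming tracks committed files itself, so partial files from
// failed batches are not exposed to downstream readers.
val query = parsed.writeStream
  .format("parquet")
  .option("path", "hdfs:///data/events")              // illustrative
  .option("checkpointLocation", "hdfs:///chk/events") // illustrative
  .partitionBy("appname")
  .start()

query.awaitTermination()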

On Wed, Apr 11, 2018 at 6:58 PM, Junfeng Chen <darou...@gmail.com> wrote:

> I write a program to read some json data from kafka and purpose to save
> them to parquet file on hdfs.
> Here is my code:
>
>> JavaInputDstream stream = ...
>> JavaDstream rdd = stream.map...
>> rdd.repartition(taksNum).foreachRDD(VoldFunction<JavaRDD
>> stringjavardd->{
>> Dataset df = spark.read().json( stringjavardd ); // convert
>> json to df
>> JavaRDD rowJavaRDD = df.javaRDD().map...  //add some new fields
>> StructType type = df.schema()...; // constuct new type for new added
>> fields
>> Dataset<Row) newdf = spark.createDataFrame(rowJavaRDD.type);
>> //create new dataframe
>> newdf.repatition(taskNum).write().mode(SaveMode.Append).
>> patitionedBy("appname").parquet(savepath); // save to parquet
>> })
>
>
>
> However, if I remove the repartition method of newdf in writing parquet
> stage, the program always throw nullpointerexception error in json convert
> line:
>
> Java.lang.NullPointerException
>>  at org.apache.spark.SparkContext.getPreferredLocs(SparkContext.
>> scala:1783)
>> ...
>
>
> While it looks make no sense, writing parquet operation should be in
> different stage with json transforming operation.
> So how to solve it? Thanks!
>
> Regard,
> Junfeng Chen
>


Nullpointerexception error when in repartition

2018-04-11 Thread Junfeng Chen
I wrote a program that reads some JSON data from Kafka and saves it as Parquet
files on HDFS.
Here is my code:

> JavaInputDStream<String> stream = ...
> JavaDStream<String> rdd = stream.map...
> rdd.repartition(taskNum).foreachRDD(stringjavardd -> {
>     Dataset<Row> df = spark.read().json(stringjavardd); // convert json to df
>     JavaRDD<Row> rowJavaRDD = df.javaRDD().map...; // add some new fields
>     StructType type = df.schema()...; // construct a new schema for the added fields
>     Dataset<Row> newdf = spark.createDataFrame(rowJavaRDD, type); // create new dataframe
>
>     newdf.repartition(taskNum).write().mode(SaveMode.Append)
>         .partitionBy("appname").parquet(savepath); // save to parquet
> });



However, if I remove the repartition call on newdf in the parquet-writing
stage, the program always throws a NullPointerException at the json-conversion
line:

Java.lang.NullPointerException
>  at org.apache.spark.SparkContext.getPreferredLocs(SparkContext.scala:1783)
> ...


This seems to make no sense, since the parquet-writing operation should be in a
different stage from the json-transforming operation.
So how can I solve it? Thanks!

Regard,
Junfeng Chen


NullPointerException issue in LDA.train()

2018-02-09 Thread Kevin Lam
Hi,

We're encountering an issue with training an LDA model in PySpark. The
issue is as follows:

- Running LDA on some large set of documents (12M, ~2-5kB each)
- Works fine for small subset of full set (100K - 1M)
- Hit a NullPointerException for full data set
- Running workload on google cloud dataproc

The following two issues I was able to find online appear relevant:
https://issues.apache.org/jira/browse/SPARK-299 (which may not have been
addressed as far as I can tell)
http://apache-spark-user-list.1001560.n3.nabble.com/MLLIB-LDA-throws-
NullPointerException-td26686.html

Also I've heavily followed the code outlined here:
http://sean.lane.sh/blog/2016/PySpark_and_LDA

Any ideas or help is appreciated!!

Thanks in advance,
Kevin

Example trace of output:

16:22:55 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 8.0 in
>> stage 42.0 (TID 16163, 
>> royallda-20180209-152710-w-1.c.fathom-containers.internal,
>> executor 4): java.lang.NullPointerException
>
> at org.apache.spark.mllib.clustering.LDA$.computePTopic(LDA.scala:432)
>
> at org.apache.spark.mllib.clustering.EMLDAOptimizer$$
>> anonfun$5.apply(LDAOptimizer.scala:190)
>
> at org.apache.spark.mllib.clustering.EMLDAOptimizer$$
>> anonfun$5.apply(LDAOptimizer.scala:184)
>
> at org.apache.spark.graphx.impl.EdgePartition.aggregateMessagesEdgeScan(
>> EdgePartition.scala:409)
>
> at org.apache.spark.graphx.impl.GraphImpl$$anonfun$13$$
>> anonfun$apply$3.apply(GraphImpl.scala:237)
>
> at org.apache.spark.graphx.impl.GraphImpl$$anonfun$13$$
>> anonfun$apply$3.apply(GraphImpl.scala:207)
>
> at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>
> at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>
> at org.apache.spark.util.collection.ExternalSorter.
>> insertAll(ExternalSorter.scala:199)
>
> at org.apache.spark.shuffle.sort.SortShuffleWriter.write(
>> SortShuffleWriter.scala:63)
>
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(
>> ShuffleMapTask.scala:96)
>
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(
>> ShuffleMapTask.scala:53)
>
> at org.apache.spark.scheduler.Task.run(Task.scala:108)
>
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
>
> at java.util.concurrent.ThreadPoolExecutor.runWorker(
>> ThreadPoolExecutor.java:1142)
>
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(
>> ThreadPoolExecutor.java:617)
>
> at java.lang.Thread.run(Thread.java:748)
>
>
>>
>> [Stage 42:> (0 + 24) / 1000][Stage 43:=>   (104 + 0)
>> / 1000]18/02/09 16:22:55 ERROR org.apache.spark.scheduler.TaskSetManager:
>> Task 8 in stage 42.0 failed 4 times; aborting job
>
> Traceback (most recent call last):
>
>   File "/tmp/61801514-d562-433b-ac42-faa758c27b63/pipeline_launcher.py",
>> line 258, in 
>
> fire.Fire({'run_pipeline': run_pipeline})
>
>   File "/usr/local/lib/python3.6/dist-packages/fire/core.py", line 120,
>> in Fire
>
> component_trace = _Fire(component, args, context, name)
>
>   File "/usr/local/lib/python3.6/dist-packages/fire/core.py", line 358,
>> in _Fire
>
> component, remaining_args)
>
>   File "/usr/local/lib/python3.6/dist-packages/fire/core.py", line 561,
>> in _CallCallable
>
> result = fn(*varargs, **kwargs)
>
>   File "/tmp/61801514-d562-433b-ac42-faa758c27b63/pipeline_launcher.py",
>> line 77, in run_pipeline
>
> run_pipeline_local(pipeline_id, **kwargs)
>
>   File "/tmp/61801514-d562-433b-ac42-faa758c27b63/pipeline_launcher.py",
>> line 94, in run_pipeline_local
>
> pipeline.run(**kwargs)
>
>   File 
> "/tmp/61801514-d562-433b-ac42-faa758c27b63/diseaseTools.zip/spark/pipelines/royal_lda.py",
>> line 142, in run
>
>   File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/mllib/clustering.py",
>> line 1039, in train
>
>   File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/mllib/common.py",
>> line 130, in callMLlibFunc
>
>   File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/mllib/common.py",
>> line 123, in callJavaFunc
>
>   File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py",
>> line 1133, in __call__
>
>   File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py",
>> line 319, in get_return_value
>
> py4j.protocol.Py4JJavaError: An error occurred while calling
>> o161.trainLDAModel.
>
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task
>> 8 in stage 42.0 failed 4 times, most recent failure: Lost

Re: [Spark DataFrame]: Passing DataFrame to custom method results in NullPointerException

2018-01-22 Thread Matteo Cossu
Hello,
I did not fully understand your question.
However, I can tell you that if you call .collect() on an RDD you are
collecting all the data on the driver node, so you should use it only when the
RDD is very small.
Your function "validate_hostname" depends on a DataFrame. It's not possible to
reference a DataFrame from a worker node, which is why that operation doesn't
work. The other case works because that "map" runs on the collected array in
the driver; it is not an RDD method.
In these cases you could use broadcast variables, but I have the intuition
that, in general, you are using the wrong approach to solve the problem.
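
A minimal sketch of the broadcast-variable route mentioned above, assuming the DataFrame holds one column of valid hostnames and the RDD elements carry a hostname field (all names here are made up for illustration):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

case class LogLine(hostname: String, rest: String)                 // illustrative

val hostsDF = Seq("host-a", "host-b").toDF("hostname")             // stand-in for the real DataFrame
val linesRDD = spark.sparkContext.parallelize(
  Seq(LogLine("host-a", "x"), LogLine("host-c", "y")))             // stand-in for the real RDD

// Collect the small lookup side once on the driver...
val validHosts: Set[String] = hostsDF.select("hostname").as[String].collect().toSet

// ...and ship it to the executors as a broadcast variable.
val validHostsB = spark.sparkContext.broadcast(validHosts)

// The filter closure now only captures the broadcast handle, not a DataFrame.
val matched = linesRDD.filter(line => validHostsB.value.contains(line.hostname)).count()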

Best Regards,

Matteo Cossu


On 15 January 2018 at 12:56, <abdul.h.huss...@bt.com> wrote:

> Hi,
>
>
>
> My Spark app is mapping lines from a text file to case classes stored
> within an RDD.
>
>
>
> When I run the following code on this rdd:
>
> .collect.map(line => if(validate_hostname(line, data_frame))
> line).foreach(println)
>
>
>
> It correctly calls the method validate_hostname by passing the case class
> and another data_frame defined within the main method. Unfortunately the
> above map only returns a TraversableLike collection so I can’t do
> transformations and joins on this data structure so I’m tried to apply a
> filter on the rdd with the following code:
>
> .filter(line => validate_hostname(line, data_frame)).count()
>
>
>
> Unfortunately the above method of filtering the rdd does not pass the
> data_frame, so I get a NullPointerException, though it correctly passes the
> case class, which I print within the method.
>
>
>
> Where am I going wrong?
>
>
>
> When
>
>
>
> Regards,
>
> Abdul Haseeb Hussain
>


[Spark DataFrame]: Passing DataFrame to custom method results in NullPointerException

2018-01-15 Thread abdul.h.hussain
Hi,

My Spark app is mapping lines from a text file to case classes stored within an 
RDD.

When I run the following code on this rdd:
.collect.map(line => if(validate_hostname(line, data_frame)) 
line).foreach(println)

It correctly calls the method validate_hostname by passing the case class and 
another data_frame defined within the main method. Unfortunately the above map 
only returns a TraversableLike collection so I can't do transformations and 
joins on this data structure, so I tried to apply a filter on the rdd with the
following code:
.filter(line => validate_hostname(line, data_frame)).count()

Unfortunately the above method of filtering the rdd does not pass the
data_frame, so I get a NullPointerException, though it correctly passes the case
class, which I print within the method.

Where am I going wrong?

When

Regards,
Abdul Haseeb Hussain


Re: NullPointerException while reading a column from the row

2017-12-19 Thread Vadim Semenov
getAs is defined as:

def getAs[T](i: Int): T = get(i).asInstanceOf[T]

and when you do toString you call Object.toString, which doesn't depend on
the type, so the asInstanceOf[T] gets dropped by the compiler, i.e.

row.getAs[Int](0).toString -> row.get(0).toString

We can confirm that by writing a simple piece of Scala code:

import org.apache.spark.sql._
object Test {
  val row = Row(null)
  row.getAs[Int](0).toString
}

and then compiling it:

$ scalac -classpath $SPARK_HOME/jars/'*' -print test.scala
[[syntax trees at end of   cleanup]] // test.scala
package  {
  object Test extends Object {
private[this] val row: org.apache.spark.sql.Row = _;
  def row(): org.apache.spark.sql.Row = Test.this.row;
def <init>(): Test.type = {
  Test.super.<init>();
  Test.this.row =
org.apache.spark.sql.Row.apply(scala.this.Predef.genericWrapArray(Array[Object]{null}));
  *Test.this.row().getAs(0).toString();*
  ()
}
  }
}

So the proper way would be:

String.valueOf(row.getAs[Int](0))
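
A null-safe variant, as a sketch (the default "0" below is just an assumption
for illustration, not something from the original thread): check Row.isNullAt
before reading the column.

val safe: String =
  if (row.isNullAt(0)) "0"            // pick whatever default makes sense
  else row.getInt(0).toString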


On Tue, Dec 19, 2017 at 4:23 AM, Anurag Sharma <anu...@logistimo.com> wrote:

> The following Scala (Spark 1.6) code for reading a value from a Row fails
> with a NullPointerException when the value is null.
>
> val test = row.getAs[Int]("ColumnName").toString
>
> while this works fine
>
> val test1 = row.getAs[Int]("ColumnName") // returns 0 for null
> val test2 = test1.toString // converts to String fine
>
> What is causing NullPointerException and what is the recommended way to
> handle such cases?
>
> PS: getting row from DataFrame as follows:
>
>  val myRDD = myDF
> .repartition(partitions)
> .mapPartitions {
>   rows =>
> rows.flatMap {
> row =>
>   functionWithRows(row) // has above logic to read null column, which fails
>   }
>   }
>
> functionWithRows has then above mentioned NullPointerException
>
> MyDF schema:
>
> root
>  |-- LDID: string (nullable = true)
>  |-- KTAG: string (nullable = true)
>  |-- ColumnName: integer (nullable = true)
>
>


NullPointerException while reading a column from the row

2017-12-19 Thread Anurag Sharma
The following Scala (Spark 1.6) code for reading a value from a Row fails
with a NullPointerException when the value is null.

val test = row.getAs[Int]("ColumnName").toString

while this works fine

val test1 = row.getAs[Int]("ColumnName") // returns 0 for null
val test2 = test1.toString // converts to String fine

What is causing NullPointerException and what is the recommended way to
handle such cases?

PS: getting row from DataFrame as follows:

 val myRDD = myDF
.repartition(partitions)
.mapPartitions {
  rows =>
rows.flatMap {
row =>
  functionWithRows(row) // has above logic to read null column, which fails
  }
  }

functionWithRows has then above mentioned NullPointerException

MyDF schema:

root
 |-- LDID: string (nullable = true)
 |-- KTAG: string (nullable = true)
 |-- ColumnName: integer (nullable = true)


[Spark SQL]: DataFrame schema resulting in NullPointerException

2017-11-19 Thread Chitral Verma
Hey,

I'm working on this use case that involves converting DStreams to
Dataframes after some transformations. I've simplified my code into the
following snippet so as to reproduce the error. Also, I've mentioned below
my environment settings.

*Environment:*

Spark Version: 2.2.0
Java: 1.8
Execution mode: local/ IntelliJ


*Code:*

object Tests {

def main(args: Array[String]): Unit = {
val spark: SparkSession =  ...
  import spark.implicits._

val df = List(
("jim", "usa"),
("raj", "india"))
.toDF("name", "country")

df.rdd
  .map(x => x.toSeq)
  .map(x => new GenericRowWithSchema(x.toArray, df.schema))
  .foreach(println)
  }
}


This results in NullPointerException as I'm directly using df.schema in
map().

What I don't understand is that if I use the following code (basically
storing the schema as a value before transforming), it works just fine.

object Tests {

def main(args: Array[String]): Unit = {
val spark: SparkSession =  ...
  import spark.implicits._

val df = List(
("jim", "usa"),
("raj", "india"))
.toDF("name", "country")

val sc = df.schema

df.rdd
  .map(x => x.toSeq)
  .map(x => new GenericRowWithSchema(x.toArray, sc))
  .foreach(println)
  }
}


I wonder why this is happening, as *df.rdd* is not an action and there is no
visible change in the state of the dataframe just yet. What are your thoughts
on this?

Regards,
Chitral Verma


Re: NullPointerException error while saving Scala Dataframe to HBase

2017-10-01 Thread Marco Mistroni
Hi
 The question is getting to the list.
I have no experience in HBase, though; having seen similar stuff when
saving a df somewhere else, it might have to do with the properties you
need to set to let Spark know it is dealing with HBase. Don't you need to set
some properties on the Spark context you are using?
Hth
 Marco
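
A sketch of the kind of setup being hinted at here (this assumes the hbase-spark
connector seen in the stack trace and a reachable hbase-site.xml / ZooKeeper
quorum; the host names below are placeholders, not from the original post):
create an HBaseContext from the HBase configuration before calling save(), so
the connector has a configuration to pick up.

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.spark.HBaseContext

val hbaseConf = HBaseConfiguration.create()            // picks up hbase-site.xml if on the classpath
hbaseConf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3") // placeholder hosts
new HBaseContext(sc, hbaseConf)                        // hands the configuration to the connector

// ...then run the existing write:
// sc.parallelize(data).toDF.write
//   .options(Map(HBaseTableCatalog.tableCatalog -> catalog, HBaseTableCatalog.newTable -> "5"))
//   .format("org.apache.hadoop.hbase.spark").save()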


On Oct 1, 2017 4:33 AM, <mailford...@gmail.com> wrote:

Hi guys - I am not sure whether the email is reaching the community
members. Please can somebody acknowledge.

Sent from my iPhone

> On 30-Sep-2017, at 5:02 PM, Debabrata Ghosh <mailford...@gmail.com> wrote:
>
> Dear All,
>Greetings ! I am repeatedly hitting a NullPointerException
error while saving a Scala Dataframe to HBase. Please can you help
resolving this for me. Here is the code snippet:
>
> scala> def catalog = s"""{
>  ||"table":{"namespace":"default", "name":"table1"},
>  ||"rowkey":"key",
>  ||"columns":{
>  |  |"col0":{"cf":"rowkey", "col":"key", "type":"string"},
>  |  |"col1":{"cf":"cf1", "col":"col1", "type":"string"}
>  ||}
>  |  |}""".stripMargin
> catalog: String
>
> scala> case class HBaseRecord(
>  |col0: String,
>  |col1: String)
> defined class HBaseRecord
>
> scala> val data = (0 to 255).map { i =>  HBaseRecord(i.toString, "extra")}
> data: scala.collection.immutable.IndexedSeq[HBaseRecord] =
Vector(HBaseRecord(0,extra), HBaseRecord(1,extra), HBaseRecord
>
> (2,extra), HBaseRecord(3,extra), HBaseRecord(4,extra),
HBaseRecord(5,extra), HBaseRecord(6,extra), HBaseRecord(7,extra),
>
> HBaseRecord(8,extra), HBaseRecord(9,extra), HBaseRecord(10,extra),
HBaseRecord(11,extra), HBaseRecord(12,extra),
>
> HBaseRecord(13,extra), HBaseRecord(14,extra), HBaseRecord(15,extra),
HBaseRecord(16,extra), HBaseRecord(17,extra),
>
> HBaseRecord(18,extra), HBaseRecord(19,extra), HBaseRecord(20,extra),
HBaseRecord(21,extra), HBaseRecord(22,extra),
>
> HBaseRecord(23,extra), HBaseRecord(24,extra), HBaseRecord(25,extra),
HBaseRecord(26,extra), HBaseRecord(27,extra),
>
> HBaseRecord(28,extra), HBaseRecord(29,extra), HBaseRecord(30,extra),
HBaseRecord(31,extra), HBase...
>
> scala> import org.apache.spark.sql.datasources.hbase
> import org.apache.spark.sql.datasources.hbase
>
>
> scala> import org.apache.spark.sql.datasources.hbase.{HBaseTableCatalog}
> import org.apache.spark.sql.datasources.hbase.HBaseTableCatalog
>
> scala> 
> sc.parallelize(data).toDF.write.options(Map(HBaseTableCatalog.tableCatalog
-> catalog, HBaseTableCatalog.newTable ->
>
> "5")).format("org.apache.hadoop.hbase.spark").save()
>
> java.lang.NullPointerException
>   at org.apache.hadoop.hbase.spark.HBaseRelation.<init>(DefaultSource.scala:134)
>   at org.apache.hadoop.hbase.spark.DefaultSource.createRelation(
DefaultSource.scala:75)
>   at org.apache.spark.sql.execution.datasources.
DataSource.write(DataSource.scala:426)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
>   ... 56 elided
>
>
> Thanks in advance !
>
> Debu
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org


Re: NullPointerException error while saving Scala Dataframe to HBase

2017-09-30 Thread mailfordebu
Hi guys - I am not sure whether the email is reaching the community members.
Please can somebody acknowledge.

Sent from my iPhone

> On 30-Sep-2017, at 5:02 PM, Debabrata Ghosh <mailford...@gmail.com> wrote:
> 
> Dear All,
>Greetings ! I am repeatedly hitting a NullPointerException 
> error while saving a Scala Dataframe to HBase. Please can you help resolving 
> this for me. Here is the code snippet:
> 
> scala> def catalog = s"""{
>  ||"table":{"namespace":"default", "name":"table1"},
>  ||"rowkey":"key",
>  ||"columns":{
>  |  |"col0":{"cf":"rowkey", "col":"key", "type":"string"},
>  |  |"col1":{"cf":"cf1", "col":"col1", "type":"string"}
>  ||}
>  |  |}""".stripMargin
> catalog: String
> 
> scala> case class HBaseRecord(
>  |col0: String,
>  |col1: String)
> defined class HBaseRecord
> 
> scala> val data = (0 to 255).map { i =>  HBaseRecord(i.toString, "extra")}
> data: scala.collection.immutable.IndexedSeq[HBaseRecord] = 
> Vector(HBaseRecord(0,extra), HBaseRecord(1,extra), HBaseRecord
> 
> (2,extra), HBaseRecord(3,extra), HBaseRecord(4,extra), HBaseRecord(5,extra), 
> HBaseRecord(6,extra), HBaseRecord(7,extra), 
> 
> HBaseRecord(8,extra), HBaseRecord(9,extra), HBaseRecord(10,extra), 
> HBaseRecord(11,extra), HBaseRecord(12,extra), 
> 
> HBaseRecord(13,extra), HBaseRecord(14,extra), HBaseRecord(15,extra), 
> HBaseRecord(16,extra), HBaseRecord(17,extra), 
> 
> HBaseRecord(18,extra), HBaseRecord(19,extra), HBaseRecord(20,extra), 
> HBaseRecord(21,extra), HBaseRecord(22,extra), 
> 
> HBaseRecord(23,extra), HBaseRecord(24,extra), HBaseRecord(25,extra), 
> HBaseRecord(26,extra), HBaseRecord(27,extra), 
> 
> HBaseRecord(28,extra), HBaseRecord(29,extra), HBaseRecord(30,extra), 
> HBaseRecord(31,extra), HBase...
> 
> scala> import org.apache.spark.sql.datasources.hbase
> import org.apache.spark.sql.datasources.hbase
>  
> 
> scala> import org.apache.spark.sql.datasources.hbase.{HBaseTableCatalog}
> import org.apache.spark.sql.datasources.hbase.HBaseTableCatalog
> 
> scala> 
> sc.parallelize(data).toDF.write.options(Map(HBaseTableCatalog.tableCatalog -> 
> catalog, HBaseTableCatalog.newTable -> 
> 
> "5")).format("org.apache.hadoop.hbase.spark").save()
> 
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.hbase.spark.HBaseRelation.<init>(DefaultSource.scala:134)
>   at 
> org.apache.hadoop.hbase.spark.DefaultSource.createRelation(DefaultSource.scala:75)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:426)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
>   ... 56 elided
> 
> 
> Thanks in advance !
> 
> Debu
> 

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



NullPointerException error while saving Scala Dataframe to HBase

2017-09-30 Thread Debabrata Ghosh
Dear All,
   Greetings! I am repeatedly hitting a NullPointerException
error while saving a Scala DataFrame to HBase. Please can you help me
resolve this? Here is the code snippet:

scala> def catalog = s"""{
 ||"table":{"namespace":"default", "name":"table1"},
 ||"rowkey":"key",
 ||"columns":{
 |  |"col0":{"cf":"rowkey", "col":"key", "type":"string"},
 |  |"col1":{"cf":"cf1", "col":"col1", "type":"string"}
 ||}
 |  |}""".stripMargin
catalog: String

scala> case class HBaseRecord(
 |col0: String,
 |col1: String)
defined class HBaseRecord

scala> val data = (0 to 255).map { i =>  HBaseRecord(i.toString, "extra")}
data: scala.collection.immutable.IndexedSeq[HBaseRecord] =
Vector(HBaseRecord(0,extra), HBaseRecord(1,extra), HBaseRecord

(2,extra), HBaseRecord(3,extra), HBaseRecord(4,extra),
HBaseRecord(5,extra), HBaseRecord(6,extra), HBaseRecord(7,extra),

HBaseRecord(8,extra), HBaseRecord(9,extra), HBaseRecord(10,extra),
HBaseRecord(11,extra), HBaseRecord(12,extra),

HBaseRecord(13,extra), HBaseRecord(14,extra), HBaseRecord(15,extra),
HBaseRecord(16,extra), HBaseRecord(17,extra),

HBaseRecord(18,extra), HBaseRecord(19,extra), HBaseRecord(20,extra),
HBaseRecord(21,extra), HBaseRecord(22,extra),

HBaseRecord(23,extra), HBaseRecord(24,extra), HBaseRecord(25,extra),
HBaseRecord(26,extra), HBaseRecord(27,extra),

HBaseRecord(28,extra), HBaseRecord(29,extra), HBaseRecord(30,extra),
HBaseRecord(31,extra), HBase...

scala> import org.apache.spark.sql.datasources.hbase
import org.apache.spark.sql.datasources.hbase


scala> import org.apache.spark.sql.datasources.hbase.{HBaseTableCatalog}
import org.apache.spark.sql.datasources.hbase.HBaseTableCatalog

scala>
sc.parallelize(data).toDF.write.options(Map(HBaseTableCatalog.tableCatalog
-> catalog, HBaseTableCatalog.newTable ->

"5")).format("org.apache.hadoop.hbase.spark").save()

java.lang.NullPointerException
  at
org.apache.hadoop.hbase.spark.HBaseRelation.<init>(DefaultSource.scala:134)
  at
org.apache.hadoop.hbase.spark.DefaultSource.createRelation(DefaultSource.scala:75)
  at
org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:426)
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
  ... 56 elided


Thanks in advance !

Debu


Spark Streaming: NullPointerException when restoring Spark Streaming job from hdfs/s3 checkpoint

2017-05-16 Thread Richard Moorhead

I'm having some difficulty reliably restoring a streaming job from a checkpoint.
When restoring a streaming job constructed from the following snippet, I
receive NullPointerExceptions when `map` is called on the restored RDD.


lazy val ssc = StreamingContext.getOrCreate(checkpointDir, 
createStreamingContext _)

private def createStreamingContext: StreamingContext = {
  val ssc = new StreamingContext(spark.sparkContext, batchInterval)
  ssc.checkpoint(checkpointDir)
  consumeStreamingContext(ssc)
  ssc
}

def consumeStreamingContext(ssc: StreamingContext) = {
  //... create dstreams
  val dstream = KinesisUtil.createStream(
  ...

  dstream.checkpoint(batchInterval)

  dstream
.foreachRDD(process)
}

def process(events: RDD[Event]) = {
  if (!events.isEmpty()) {
logger.info("Transforming events for processing")
//rdd seems to support some operations?
logger.info(s"RDD LENGTH: ${events.count}")
//nullpointer exception on call to .map
val df = events.map(e => {
  ...
}

  }
}




. . . . . . . . . . . . . . . . . . . . . . . . . . .

Richard Moorhead
Software Engineer
richard.moorh...@c2fo.com

C2FO: The World's Market for Working Capital®






NullPointerException while joining two avro Hive tables

2017-02-04 Thread Понькин Алексей

Hi,

I have a table in Hive (data is stored as Avro files).
Using the Python Spark shell I am trying to join two datasets

events = spark.sql('select * from mydb.events')

intersect = events.where('attr2 in (5,6,7) and attr1 in (1,2,3)')
intersect.count()

But I am constantly receiving the following

java.lang.NullPointerException
at 
org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.supportedCategories(AvroObjectInspectorGenerator.java:142)
at 
org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspectorWorker(AvroObjectInspectorGenerator.java:91)
at 
org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspectorWorker(AvroObjectInspectorGenerator.java:104)
at 
org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.createObjectInspector(AvroObjectInspectorGenerator.java:83)
at 
org.apache.hadoop.hive.serde2.avro.AvroObjectInspectorGenerator.<init>(AvroObjectInspectorGenerator.java:56)
at 
org.apache.hadoop.hive.serde2.avro.AvroSerDe.initialize(AvroSerDe.java:124)
at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$5$$anonfun$10.apply(TableReader.scala:251)
at 
org.apache.spark.sql.hive.HadoopTableReader$$anonfun$5$$anonfun$10.apply(TableReader.scala:239)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:766)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:766)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:103)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Using Spark 2.0.0.2.5.0.0-1245

Any help will be appreciated


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



[ML - Intermediate - Debug] - Loading Customized Transformers in Apache Spark raised a NullPointerException

2017-01-24 Thread Saulo Ricci
Hi,

Sorry if I'm being short here. I'm facing the issue described in this link
<http://stackoverflow.com/questions/41844035/loading-customized-transformers-in-apache-spark-raised-a-nullpointerexception>.
I would really appreciate any help from the team, and I'm happy to talk and
discuss more about this issue.

Looking forward to hearing from you.

Best,

Saulo


Re: Spark-Sql 2.0 nullpointerException

2016-10-12 Thread Selvam Raman
What I am trying to achieve is:

Trigger a query to get a set of numbers (i.e., 1, 2, 3, ..., n), and
for every number I have to trigger another 3 queries.

Thanks,
selvam R
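
A minimal sketch of doing this from the driver (the per-number query below is
purely illustrative, not from the original job): spark.sql can only be called on
the driver, which is why the call inside foreach hits the NullPointerException,
so collect the small driving result first and loop over it on the driver.

// collect the (small) driving result to the driver
val keys = spark.sql("select c.citation_num from test c").collect().map(_.get(0))

keys.foreach { k =>
  // issued on the driver, so the SparkSession is fully usable; Spark still
  // distributes the work of each query across the cluster
  val detail = spark.sql(s"select * from test where citation_num = $k")
  println(s"count for $k: ${detail.count()}")
}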

On Wed, Oct 12, 2016 at 4:10 PM, Selvam Raman <sel...@gmail.com> wrote:

> Hi ,
>
> I am reading a parquet file and creating a temp table. When I am trying to
> execute the query outside of the foreach function it works fine, but it
> throws a NullPointerException within the DataFrame.foreach function.
>
> code snippet:
>
> String CITATION_QUERY = "select c.citation_num, c.title, c.publisher from
> test c";
>
> Dataset<Row> citation_query = spark.sql(CITATION_QUERY);
>
> System.out.println("mistery:"+citation_query.count());
>
>
> // Iterator<Row> iterofresulDF = resultDF.toLocalIterator();
>
>
> resultDF.foreach(new ForeachFunction<Row>()
>
> {
>
> private static final long serialVersionUID = 1L;
>
> public void call(Row line)
>
> {
>
> Dataset<Row> row = spark.sql(CITATION_QUERY);
>
> System.out.println("mistery row count:"+row.count());
>
> }
>
> });
>
>
> ​Error trace:
>
> 16/10/12 15:59:53 INFO CodecPool: Got brand-new decompressor [.snappy]
>
> 16/10/12 15:59:53 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID
> 5)
>
> java.lang.NullPointerException
>
> at org.apache.spark.sql.SparkSession.sessionState$
> lzycompute(SparkSession.scala:112)
>
> at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:110)
>
> at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)
>
> at com.elsevier.datasearch.ExecuteSQL.executeQuery(ExecuteSQL.java:11)
>
> at com.elsevier.datasearch.ProcessPetDB$1.call(ProcessPetDB.java:53)
>
> at com.elsevier.datasearch.ProcessPetDB$1.call(ProcessPetDB.java:1)
>
> at org.apache.spark.sql.Dataset$$anonfun$foreach$2.apply(
> Dataset.scala:2118)
>
> at org.apache.spark.sql.Dataset$$anonfun$foreach$2.apply(
> Dataset.scala:2118)
>
> at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>
> at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$
> apply$27.apply(RDD.scala:894)
>
> at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$
> apply$27.apply(RDD.scala:894)
>
> at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(
> SparkContext.scala:1916)
>
> at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(
> SparkContext.scala:1916)
>
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>
> at org.apache.spark.scheduler.Task.run(Task.scala:86)
>
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>
> at java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1142)
>
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:617)
>
> at java.lang.Thread.run(Thread.java:745)
>
>
>
>
> Driver stacktrace:
>
> at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$
> scheduler$DAGScheduler$$failJobAndIndependentStages(
> DAGScheduler.scala:1454)
>
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(
> DAGScheduler.scala:1442)
>
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(
> DAGScheduler.scala:1441)
>
> at scala.collection.mutable.ResizableArray$class.foreach(
> ResizableArray.scala:59)
>
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>
> at org.apache.spark.scheduler.DAGScheduler.abortStage(
> DAGScheduler.scala:1441)
>
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$
> handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
>
> at org.apache.spark.scheduler.DAGScheduler$$anonfun$
> handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
>
> at scala.Option.foreach(Option.scala:257)
>
> at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(
> DAGScheduler.scala:811)
>
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.
> doOnReceive(DAGScheduler.scala:1667)
>
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.
> onReceive(DAGScheduler.scala:1622)
>
> at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.
> onReceive(DAGScheduler.scala:1611)
>
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>
> at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
>
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1890)
>
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1903)
>
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1916)
>
> at org.apache.spark.SparkContext.runJob(SparkContext.scala:1930)
>
> at org.apache.spark.rd

Spark-Sql 2.0 nullpointerException

2016-10-12 Thread Selvam Raman
Hi ,

I am reading a parquet file and creating a temp table. When I am trying to
execute the query outside of the foreach function it works fine, but it
throws a NullPointerException within the DataFrame.foreach function.

code snippet:

String CITATION_QUERY = "select c.citation_num, c.title, c.publisher from
test c";

Dataset<Row> citation_query = spark.sql(CITATION_QUERY);

System.out.println("mistery:"+citation_query.count());


// Iterator<Row> iterofresulDF = resultDF.toLocalIterator();


resultDF.foreach(new ForeachFunction<Row>()

{

private static final long serialVersionUID = 1L;

public void call(Row line)

{

Dataset<Row> row = spark.sql(CITATION_QUERY);

System.out.println("mistery row count:"+row.count());

}

});


​Error trace:

16/10/12 15:59:53 INFO CodecPool: Got brand-new decompressor [.snappy]

16/10/12 15:59:53 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 5)

java.lang.NullPointerException

at
org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:112)

at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:110)

at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:582)

at com.elsevier.datasearch.ExecuteSQL.executeQuery(ExecuteSQL.java:11)

at com.elsevier.datasearch.ProcessPetDB$1.call(ProcessPetDB.java:53)

at com.elsevier.datasearch.ProcessPetDB$1.call(ProcessPetDB.java:1)

at org.apache.spark.sql.Dataset$$anonfun$foreach$2.apply(Dataset.scala:2118)

at org.apache.spark.sql.Dataset$$anonfun$foreach$2.apply(Dataset.scala:2118)

at scala.collection.Iterator$class.foreach(Iterator.scala:893)

at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)

at
org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:894)

at
org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:894)

at
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)

at
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1916)

at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)

at org.apache.spark.scheduler.Task.run(Task.scala:86)

at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)

at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)

at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

at java.lang.Thread.run(Thread.java:745)




Driver stacktrace:

at org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454)

at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442)

at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441)

at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)

at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1441)

at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)

at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)

at scala.Option.foreach(Option.scala:257)

at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)

at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1667)

at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1622)

at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1611)

at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)

at org.apache.spark.SparkContext.runJob(SparkContext.scala:1890)

at org.apache.spark.SparkContext.runJob(SparkContext.scala:1903)

at org.apache.spark.SparkContext.runJob(SparkContext.scala:1916)

at org.apache.spark.SparkContext.runJob(SparkContext.scala:1930)

at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:894)

at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:892)

at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)

at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)

at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)

at org.apache.spark.rdd.RDD.foreach(RDD.scala:892)

at
org.apache.spark.sql.Dataset$$anonfun$foreach$1.apply$mcV$sp(Dataset.scala:2108)

at org.apache.spark.sql.Dataset$$anonfun$foreach$1.apply(Dataset.scala:2108)

at org.apache.spark.sql.Dataset$$anonfun$foreach$1.apply(Dataset.scala:2108)

at
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)

at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2546)

at org.apache.spark.sql.Dataset.foreach(Dataset.scala:2107)

at org.apache.spark.sql.Dataset.foreach(Dataset.scala:2118)

at com.elsevier.datasearch.Proce

question about Broadcast value NullPointerException

2016-08-23 Thread Chong Zhang
Hello,

I'm using Spark Streaming to process Kafka messages, and want to use a prop
file as the input and broadcast the properties:

val props = new Properties()
props.load(new FileInputStream(args(0)))
val sc = initSparkContext()
val propsBC = sc.broadcast(props)
println(s"propFileBC 1: " + propsBC.value)

val lines = createKStream(sc)
val parsedLines = lines.map (l => {
println(s"propFileBC 2: " + propsBC.value)
process(l, propsBC.value)
}).filter(...)

var goodLines = lines.window(2,2)
goodLines.print()


If I run it with spark-submit and master local[2], it works fine.
But if I use --master spark://master:7077 (2 nodes), the 1st
propsBC.value is printed, but the 2nd print inside the map function causes a
null pointer exception:

Caused by: java.lang.NullPointerException
at test.spark.Main$$anonfun$1.apply(Main.scala:79)
at test.spark.Main$$anonfun$1.apply(Main.scala:78)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:389)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at
org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:284)
at
org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)

Appreciate any help,  thanks!


Re: Matrix Factorization Model model.save error "NullPointerException"

2016-07-12 Thread Zhou (Joe) Xing
Does anyone have an idea of what this NPE issue below is about? Thank you!

cheers

zhou

On Jul 11, 2016, at 11:27 PM, Zhou (Joe) Xing 
> wrote:


Hi Guys,

I searched the archive and also googled this problem with saving the ALS-trained
Matrix Factorization Model to the local file system using the Model.save()
method. I found some hints, such as partitioning the model before saving, etc.,
but they do not seem to solve my problem. I'm always getting this NPE error when
running in a cluster of several nodes, while it's totally fine when running on a
local node.

I’m using spark 1.6.2, pyspark. Any hint would be appreciated! thank you

cheers

zhou




model = ALS.train(ratingsRDD, rank, numIter, lmbda, 5)



16/07/12 02:14:32 INFO ParquetFileReader: Initiating action with parallelism: 5
16/07/12 02:14:32 WARN ParquetOutputCommitter: could not write summary file for 
file:/home/ec2-user/myCollaborativeFilterNoTesting_2016_07_12_02_13_35.dat/data/product
java.lang.NullPointerException
at 
org.apache.parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:456)
at 
org.apache.parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:420)
at 
org.apache.parquet.hadoop.ParquetOutputCommitter.writeMetaDataFile(ParquetOutputCommitter.java:58)
at 
org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:48)
at 
org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:230)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:149)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:106)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:106)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:106)
at 
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
at 
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
at 
org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:256)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:139)
at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:334)
at 
org.apache.spark.mllib.recommendation.MatrixFactorizationModel$SaveLoadV1_0$.save(MatrixFactorizationModel.scala:362)
at 
org.apache.spark.mllib.recommendation.MatrixFactorizationModel.save(MatrixFactorizationModel.scala:205)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)



Matrix Factorization Model model.save error "NullPointerException"

2016-07-12 Thread Zhou (Joe) Xing

Hi Guys,

I searched the archive and also googled this problem with saving the ALS-trained
Matrix Factorization Model to the local file system using the Model.save()
method. I found some hints, such as partitioning the model before saving, etc.,
but they do not seem to solve my problem. I'm always getting this NPE error when
running in a cluster of several nodes, while it's totally fine when running on a
local node.

I’m using spark 1.6.2, pyspark. Any hint would be appreciated! thank you

cheers

zhou




model = ALS.train(ratingsRDD, rank, numIter, lmbda, 5)



16/07/12 02:14:32 INFO ParquetFileReader: Initiating action with parallelism: 5
16/07/12 02:14:32 WARN ParquetOutputCommitter: could not write summary file for 
file:/home/ec2-user/myCollaborativeFilterNoTesting_2016_07_12_02_13_35.dat/data/product
java.lang.NullPointerException
at 
org.apache.parquet.hadoop.ParquetFileWriter.mergeFooters(ParquetFileWriter.java:456)
at 
org.apache.parquet.hadoop.ParquetFileWriter.writeMetadataFile(ParquetFileWriter.java:420)
at 
org.apache.parquet.hadoop.ParquetOutputCommitter.writeMetaDataFile(ParquetOutputCommitter.java:58)
at 
org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:48)
at 
org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:230)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:149)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:106)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply(InsertIntoHadoopFsRelation.scala:106)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:56)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:106)
at 
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
at 
org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
at 
org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:256)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:139)
at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:334)
at 
org.apache.spark.mllib.recommendation.MatrixFactorizationModel$SaveLoadV1_0$.save(MatrixFactorizationModel.scala:362)
at 
org.apache.spark.mllib.recommendation.MatrixFactorizationModel.save(MatrixFactorizationModel.scala:205)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)



Re: NullPointerException when starting StreamingContext

2016-06-24 Thread Sunita Arvind
I was able to resolve the serialization issue. The root cause was that I was
accessing the config values within foreachRDD{}.
The solution was to extract the values from the config outside the foreachRDD
scope and pass the values into the loop directly. Probably something obvious,
as we cannot have nested distributed datasets. Mentioning it here for the
benefit of anyone else stumbling upon the same issue.

regards
Sunita
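
A sketch of the fix described above (the config keys and the foreachRDD body are
illustrative, not from the original job): read everything needed from the
Typesafe Config on the driver, keep only plain serializable values, and
reference only those inside foreachRDD.

import com.typesafe.config.ConfigFactory

val custConf   = ConfigFactory.load(customer + ".conf").getConfig(customer)
val outputPath = custConf.getString("outputPath")   // extracted on the driver
val topic      = custConf.getString("kafkaTopic")   // extracted on the driver

dstream.foreachRDD { rdd =>
  // this closure captures only the two extracted strings, not the
  // (non-serializable) Config / ConfigException objects
  rdd.saveAsTextFile(s"$outputPath/$topic")
}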

On Wed, Jun 22, 2016 at 8:20 PM, Sunita Arvind 
wrote:

> Hello Experts,
>
> I am getting this error repeatedly:
>
> 16/06/23 03:06:59 ERROR streaming.StreamingContext: Error starting the 
> context, marking it as stopped
> java.lang.NullPointerException
>   at 
> com.typesafe.config.impl.SerializedConfigValue.writeOrigin(SerializedConfigValue.java:202)
>   at 
> com.typesafe.config.impl.ConfigImplUtil.writeOrigin(ConfigImplUtil.java:228)
>   at 
> com.typesafe.config.ConfigException.writeObject(ConfigException.java:58)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
>   at 
> java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:440)
>   at java.lang.Throwable.writeObject(Throwable.java:985)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>   at 
> java.io.ObjectOutputStream.writeFatalException(ObjectOutputStream.java:1576)
>   at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:350)
>   at 
> org.apache.spark.streaming.Checkpoint$$anonfun$serialize$1.apply$mcV$sp(Checkpoint.scala:141)
>   at 
> org.apache.spark.streaming.Checkpoint$$anonfun$serialize$1.apply(Checkpoint.scala:141)
>   at 
> org.apache.spark.streaming.Checkpoint$$anonfun$serialize$1.apply(Checkpoint.scala:141)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1251)
>   at 
> org.apache.spark.streaming.Checkpoint$.serialize(Checkpoint.scala:142)
>   at 
> org.apache.spark.streaming.StreamingContext.validate(StreamingContext.scala:554)
>   at 
> org.apache.spark.streaming.StreamingContext.liftedTree1$1(StreamingContext.scala:601)
>   at 
> org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:600)
>   at 
> com.edgecast.engine.ProcessingEngine$$anonfun$main$1.apply(ProcessingEngine.scala:73)
>   at 
> com.edgecast.engine.ProcessingEngine$$anonfun$main$1.apply(ProcessingEngine.scala:67)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at com.edgecast.engine.ProcessingEngine$.main(ProcessingEngine.scala:67)
>   at com.edgecast.engine.ProcessingEngine.main(ProcessingEngine.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:542)
>
>
> It seems to be a typical issue. All I am doing here is as below:
>
> Object ProcessingEngine{
>
> def initializeSpark(customer:String):StreamingContext={
>   LogHandler.log.info("InitialeSpark")
>   val custConf = ConfigFactory.load(customer + 
> ".conf").getConfig(customer).withFallback(AppConf)
>   implicit val sparkConf: SparkConf = new SparkConf().setAppName(customer)
>   val ssc: StreamingContext = new StreamingContext(sparkConf, 
> Seconds(custConf.getLong("batchDurSec")))
>   ssc.checkpoint(custConf.getString("checkpointDir"))
>   ssc
> }
>
> def createDataStreamFromKafka(customer:String, ssc: 
> StreamingContext):DStream[Array[Byte]]={
>   val 

Re: NullPointerException when starting StreamingContext

2016-06-24 Thread Cody Koeninger
That looks like a classpath problem.  You should not have to include
the kafka_2.10 artifact in your pom, spark-streaming-kafka_2.10
already has a transitive dependency on it.  That being said, 0.8.2.1
is the correct version, so that's a little strange.

How are you building and submitting your application?

Finally, if this ends up being a CDH related issue, you may have
better luck on their forum.

On Thu, Jun 23, 2016 at 1:16 PM, Sunita Arvind  wrote:
> Also, just to keep it simple, I am trying to use 1.6.0-cdh5.7.0 in the
> pom.xml as the cluster I am trying to run on is CDH5.7.0 with spark 1.6.0.
>
> Here is my pom setting:
>
>
> <properties>
>   <cdh.spark.version>1.6.0-cdh5.7.0</cdh.spark.version>
> </properties>
> <dependencies>
>   <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-core_2.10</artifactId>
>     <version>${cdh.spark.version}</version>
>     <scope>compile</scope>
>   </dependency>
>   <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-streaming_2.10</artifactId>
>     <version>${cdh.spark.version}</version>
>     <scope>compile</scope>
>   </dependency>
>   <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-sql_2.10</artifactId>
>     <version>${cdh.spark.version}</version>
>     <scope>compile</scope>
>   </dependency>
>   <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-streaming-kafka_2.10</artifactId>
>     <version>${cdh.spark.version}</version>
>     <scope>compile</scope>
>   </dependency>
>   <dependency>
>     <groupId>org.apache.kafka</groupId>
>     <artifactId>kafka_2.10</artifactId>
>     <version>0.8.2.1</version>
>     <scope>compile</scope>
>   </dependency>
> </dependencies>
> 
>
> But trying to execute the application throws errors like below:
> Exception in thread "main" java.lang.NoClassDefFoundError:
> kafka/cluster/BrokerEndPoint
> at
> org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$2$$anonfun$3$$anonfun$apply$6$$anonfun$apply$7.apply(KafkaCluster.scala:90)
> at scala.Option.map(Option.scala:145)
> at
> org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$2$$anonfun$3$$anonfun$apply$6.apply(KafkaCluster.scala:90)
> at
> org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$2$$anonfun$3$$anonfun$apply$6.apply(KafkaCluster.scala:87)
> at
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
> at
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
> at
> org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$2$$anonfun$3.apply(KafkaCluster.scala:87)
> at
> org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$2$$anonfun$3.apply(KafkaCluster.scala:86)
> at
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at scala.collection.immutable.Set$Set1.foreach(Set.scala:74)
> at
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
> at
> org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$2.apply(KafkaCluster.scala:86)
> at
> org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$2.apply(KafkaCluster.scala:85)
> at scala.util.Either$RightProjection.flatMap(Either.scala:523)
> at
> org.apache.spark.streaming.kafka.KafkaCluster.findLeaders(KafkaCluster.scala:85)
> at
> org.apache.spark.streaming.kafka.KafkaCluster.getLeaderOffsets(KafkaCluster.scala:179)
> at
> org.apache.spark.streaming.kafka.KafkaCluster.getLeaderOffsets(KafkaCluster.scala:161)
> at
> org.apache.spark.streaming.kafka.KafkaCluster.getLatestLeaderOffsets(KafkaCluster.scala:150)
> at
> org.apache.spark.streaming.kafka.KafkaUtils$$anonfun$5.apply(KafkaUtils.scala:215)
> at
> org.apache.spark.streaming.kafka.KafkaUtils$$anonfun$5.apply(KafkaUtils.scala:211)
> at scala.util.Either$RightProjection.flatMap(Either.scala:523)
> at
> org.apache.spark.streaming.kafka.KafkaUtils$.getFromOffsets(KafkaUtils.scala:211)
> at
> org.apache.spark.streaming.kafka.KafkaUtils$.createDirectStream(KafkaUtils.scala:484)
> at
> com.edgecast.engine.ConcurrentOps$.createDataStreamFromKafka(ConcurrentOps.scala:68)
> at
> com.edgecast.engine.ConcurrentOps$.startProcessing(ConcurrentOps.scala:32)
> at com.edgecast.engine.ProcessingEngine$.main(ProcessingEngine.scala:33)
> at com.edgecast.engine.ProcessingEngine.main(ProcessingEngine.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
> Caused by: java.lang.ClassNotFoundException: kafka.cluster.BrokerEndPoint
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
>   

Re: NullPointerException when starting StreamingContext

2016-06-23 Thread Sunita Arvind
Also, just to keep it simple, I am trying to use 1.6.0-cdh5.7.0 in the
pom.xml as the cluster I am trying to run on is CDH5.7.0 with spark 1.6.0.

Here is my pom setting:


<properties>
  <cdh.spark.version>1.6.0-cdh5.7.0</cdh.spark.version>
</properties>
<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>${cdh.spark.version}</version>
    <scope>compile</scope>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.10</artifactId>
    <version>${cdh.spark.version}</version>
    <scope>compile</scope>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.10</artifactId>
    <version>${cdh.spark.version}</version>
    <scope>compile</scope>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka_2.10</artifactId>
    <version>${cdh.spark.version}</version>
    <scope>compile</scope>
  </dependency>
  <dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka_2.10</artifactId>
    <version>0.8.2.1</version>
    <scope>compile</scope>
  </dependency>
</dependencies>


But trying to execute the application throws errors like below:
Exception in thread "main" java.lang.NoClassDefFoundError:
kafka/cluster/BrokerEndPoint
at
org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$2$$anonfun$3$$anonfun$apply$6$$anonfun$apply$7.apply(KafkaCluster.scala:90)
at scala.Option.map(Option.scala:145)
at
org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$2$$anonfun$3$$anonfun$apply$6.apply(KafkaCluster.scala:90)
at
org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$2$$anonfun$3$$anonfun$apply$6.apply(KafkaCluster.scala:87)
at
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
at
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at
org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$2$$anonfun$3.apply(KafkaCluster.scala:87)
at
org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$2$$anonfun$3.apply(KafkaCluster.scala:86)
at
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.immutable.Set$Set1.foreach(Set.scala:74)
at
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at
org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$2.apply(KafkaCluster.scala:86)
at
org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$2.apply(KafkaCluster.scala:85)
at scala.util.Either$RightProjection.flatMap(Either.scala:523)
at
org.apache.spark.streaming.kafka.KafkaCluster.findLeaders(KafkaCluster.scala:85)
at
org.apache.spark.streaming.kafka.KafkaCluster.getLeaderOffsets(KafkaCluster.scala:179)
at
org.apache.spark.streaming.kafka.KafkaCluster.getLeaderOffsets(KafkaCluster.scala:161)
at
org.apache.spark.streaming.kafka.KafkaCluster.getLatestLeaderOffsets(KafkaCluster.scala:150)
at
org.apache.spark.streaming.kafka.KafkaUtils$$anonfun$5.apply(KafkaUtils.scala:215)
at
org.apache.spark.streaming.kafka.KafkaUtils$$anonfun$5.apply(KafkaUtils.scala:211)
at scala.util.Either$RightProjection.flatMap(Either.scala:523)
at
org.apache.spark.streaming.kafka.KafkaUtils$.getFromOffsets(KafkaUtils.scala:211)
at
org.apache.spark.streaming.kafka.KafkaUtils$.createDirectStream(KafkaUtils.scala:484)
at
com.edgecast.engine.ConcurrentOps$.createDataStreamFromKafka(ConcurrentOps.scala:68)
at
com.edgecast.engine.ConcurrentOps$.startProcessing(ConcurrentOps.scala:32)
at com.edgecast.engine.ProcessingEngine$.main(ProcessingEngine.scala:33)
at com.edgecast.engine.ProcessingEngine.main(ProcessingEngine.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
Caused by: java.lang.ClassNotFoundException: kafka.cluster.BrokerEndPoint
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 38 more
16/06/23 11:09:53 INFO SparkContext: Invoking stop() from shutdown hook


I've tried Kafka versions 0.8.2.0, 0.8.2.2, and 0.9.0.0. With 0.9.0.0 the
processing hangs much sooner.
Can someone help with this error?

regards
Sunita

On Wed, Jun 22, 2016 at 8:20 PM, Sunita Arvind 
wrote:

> Hello Experts,
>
> I am getting this error repeatedly:
>
> 16/06/23 03:06:59 ERROR streaming.StreamingContext: Error 

Re: NullPointerException when starting StreamingContext

2016-06-22 Thread Ted Yu
Which Scala version / Spark release are you using?

Cheers

On Wed, Jun 22, 2016 at 8:20 PM, Sunita Arvind 
wrote:

> Hello Experts,
>
> I am getting this error repeatedly:
>
> 16/06/23 03:06:59 ERROR streaming.StreamingContext: Error starting the 
> context, marking it as stopped
> java.lang.NullPointerException
>   at 
> com.typesafe.config.impl.SerializedConfigValue.writeOrigin(SerializedConfigValue.java:202)
>   at 
> com.typesafe.config.impl.ConfigImplUtil.writeOrigin(ConfigImplUtil.java:228)
>   at 
> com.typesafe.config.ConfigException.writeObject(ConfigException.java:58)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
>   at 
> java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:440)
>   at java.lang.Throwable.writeObject(Throwable.java:985)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>   at 
> java.io.ObjectOutputStream.writeFatalException(ObjectOutputStream.java:1576)
>   at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:350)
>   at 
> org.apache.spark.streaming.Checkpoint$$anonfun$serialize$1.apply$mcV$sp(Checkpoint.scala:141)
>   at 
> org.apache.spark.streaming.Checkpoint$$anonfun$serialize$1.apply(Checkpoint.scala:141)
>   at 
> org.apache.spark.streaming.Checkpoint$$anonfun$serialize$1.apply(Checkpoint.scala:141)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1251)
>   at 
> org.apache.spark.streaming.Checkpoint$.serialize(Checkpoint.scala:142)
>   at 
> org.apache.spark.streaming.StreamingContext.validate(StreamingContext.scala:554)
>   at 
> org.apache.spark.streaming.StreamingContext.liftedTree1$1(StreamingContext.scala:601)
>   at 
> org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:600)
>   at 
> com.edgecast.engine.ProcessingEngine$$anonfun$main$1.apply(ProcessingEngine.scala:73)
>   at 
> com.edgecast.engine.ProcessingEngine$$anonfun$main$1.apply(ProcessingEngine.scala:67)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at com.edgecast.engine.ProcessingEngine$.main(ProcessingEngine.scala:67)
>   at com.edgecast.engine.ProcessingEngine.main(ProcessingEngine.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:542)
>
>
> It seems to be a typical issue. All I am doing here is as below:
>
> Object ProcessingEngine{
>
> def initializeSpark(customer:String):StreamingContext={
>   LogHandler.log.info("InitialeSpark")
>   val custConf = ConfigFactory.load(customer + 
> ".conf").getConfig(customer).withFallback(AppConf)
>   implicit val sparkConf: SparkConf = new SparkConf().setAppName(customer)
>   val ssc: StreamingContext = new StreamingContext(sparkConf, 
> Seconds(custConf.getLong("batchDurSec")))
>   ssc.checkpoint(custConf.getString("checkpointDir"))
>   ssc
> }
>
> def createDataStreamFromKafka(customer:String, ssc: 
> StreamingContext):DStream[Array[Byte]]={
>   val custConf = ConfigFactory.load(customer + 
> ".conf").getConfig(customer).withFallback(ConfigFactory.load())
>   LogHandler.log.info("createDataStreamFromKafka")
>   KafkaUtils.createDirectStream[String,
> Array[Byte],
> StringDecoder,
> DefaultDecoder](
> ssc,
> Map[String, String]("metadata.broker.list" -> 
> 

NullPointerException when starting StreamingContext

2016-06-22 Thread Sunita Arvind
Hello Experts,

I am getting this error repeatedly:

16/06/23 03:06:59 ERROR streaming.StreamingContext: Error starting the
context, marking it as stopped
java.lang.NullPointerException
at 
com.typesafe.config.impl.SerializedConfigValue.writeOrigin(SerializedConfigValue.java:202)
at 
com.typesafe.config.impl.ConfigImplUtil.writeOrigin(ConfigImplUtil.java:228)
at 
com.typesafe.config.ConfigException.writeObject(ConfigException.java:58)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at 
java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at 
java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:440)
at java.lang.Throwable.writeObject(Throwable.java:985)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988)
at 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495)
at 
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at 
java.io.ObjectOutputStream.writeFatalException(ObjectOutputStream.java:1576)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:350)
at 
org.apache.spark.streaming.Checkpoint$$anonfun$serialize$1.apply$mcV$sp(Checkpoint.scala:141)
at 
org.apache.spark.streaming.Checkpoint$$anonfun$serialize$1.apply(Checkpoint.scala:141)
at 
org.apache.spark.streaming.Checkpoint$$anonfun$serialize$1.apply(Checkpoint.scala:141)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1251)
at 
org.apache.spark.streaming.Checkpoint$.serialize(Checkpoint.scala:142)
at 
org.apache.spark.streaming.StreamingContext.validate(StreamingContext.scala:554)
at 
org.apache.spark.streaming.StreamingContext.liftedTree1$1(StreamingContext.scala:601)
at 
org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:600)
at 
com.edgecast.engine.ProcessingEngine$$anonfun$main$1.apply(ProcessingEngine.scala:73)
at 
com.edgecast.engine.ProcessingEngine$$anonfun$main$1.apply(ProcessingEngine.scala:67)
at scala.collection.immutable.List.foreach(List.scala:318)
at com.edgecast.engine.ProcessingEngine$.main(ProcessingEngine.scala:67)
at com.edgecast.engine.ProcessingEngine.main(ProcessingEngine.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:542)


It seems to be a typical issue. All I am doing here is as below:

Object ProcessingEngine{

def initializeSpark(customer:String):StreamingContext={
  LogHandler.log.info("InitialeSpark")
  val custConf = ConfigFactory.load(customer +
".conf").getConfig(customer).withFallback(AppConf)
  implicit val sparkConf: SparkConf = new SparkConf().setAppName(customer)
  val ssc: StreamingContext = new StreamingContext(sparkConf,
Seconds(custConf.getLong("batchDurSec")))
  ssc.checkpoint(custConf.getString("checkpointDir"))
  ssc
}

def createDataStreamFromKafka(customer:String, ssc:
StreamingContext):DStream[Array[Byte]]={
  val custConf = ConfigFactory.load(customer +
".conf").getConfig(customer).withFallback(ConfigFactory.load())
  LogHandler.log.info("createDataStreamFromKafka")
  KafkaUtils.createDirectStream[String,
Array[Byte],
StringDecoder,
DefaultDecoder](
ssc,
Map[String, String]("metadata.broker.list" ->
custConf.getString("brokers"), "group.id" ->
custConf.getString("groupId")),
Set(custConf.getString("topics")))

}

def main(args: Array[String]): Unit = {
  val AppConf = ConfigFactory.load()
  LogHandler.log.info("Starting the processing Engine")
  getListOfCustomers().foreach{cust 

Re: getting NullPointerException while doing left outer join

2016-05-06 Thread Adam Westerman
For anyone interested, the problem ended up being that in some rare cases,
the value from the pair RDD on the right side of the left outer join was
Java's null.  The Spark optionToOptional method attempted to apply Some()
to null, which caused the NPE to be thrown.

The lesson is to filter out any null values before doing an outer join.

-Adam

On Fri, May 6, 2016 at 10:45 AM, Adam Westerman  wrote:

> Hi Ted,
>
> I am working on replicating the problem on a smaller scale.
>
> I saw that Spark 2.0 is moving to Java 8 Optional instead of Guava
> Optional, but in the meantime I'm stuck with 1.6.1.
>
> -Adam
>
> On Fri, May 6, 2016 at 9:40 AM, Ted Yu  wrote:
>
>> Is it possible to write a short test which exhibits this problem ?
>>
>> For Spark 2.0, this part of code has changed:
>>
>> [SPARK-4819] Remove Guava's "Optional" from public API
>>
>> FYI
>>
>> On Fri, May 6, 2016 at 6:57 AM, Adam Westerman 
>> wrote:
>>
>>> Hi,
>>>
>>> I’m attempting to do a left outer join in Spark, and I’m getting an NPE
>>> that appears to be due to some Spark Java API bug. (I’m running Spark 1.6.0
>>> in local mode on a Mac).
>>>
>>> For a little background, the left outer join returns all keys from the
>>> left side of the join regardless of whether or not the key is present on
>>> the right side.  To handle this uncertainty, the value from the right side
>>> is wrapped in Guava’s Optional class.  The Optional class has a method to
>>> check whether the value is present or not (which would indicate the key
>>> appeared in both RDDs being joined).  If the key was indeed present in both
>>> RDDs you can then retrieve the value and move forward.
>>>
>>> After doing a little digging, I found that Spark is using Scala’s Option
>>> functionality internally.  This is the same concept as the Guava Optional,
>>> only native to Scala.  It appears that during the conversion from a Scala
>>> Option back to a Guava Optional (this method can be found here:
>>> https://github.com/apache/spark/blob/v1.6.0/core/src/main/scala/org/apache/spark/api/java/JavaUtils.scala#L28)
>>>  the
>>> conversion method is erroneously passed a Scala Option with the String
>>> value “None” instead of Scala’s null value None.  This is matched to the
>>> first *case*, which causes Guava’s Optional.of method to attempt to
>>> pull the value out.  A NPE is thrown since it wasn’t ever actually there.
>>>
>>> The code basically looks like this, where the classes used are just
>>> plain Java objects with some class attributes inside:
>>> // First RDD
>>> JavaPairRDD rdd1
>>> // Second RDD
>>> JavaPairRDD rdd2
>>>
>>> // Resultant RDD
>>> JavaPairRDD>> Optional>> result = rdd1.leftOuterJoin(rdd2)
>>>
>>> Has anyone ever encountered this problem before, or know why the
>>> optionToOptional method might be getting passed this “None” value?  I’ve
>>> added some more relevant information below, let me know if I can provide
>>> any more details.
>>>
>>> Here's a screenshot showing the string value of “None” being passed into
>>> the optionToOptional method using the debugger:
>>>
>>> Here’s the stack trace (the method shown above is highlighted):
>>>
>>> ERROR 13:17:00,743 com.tgt.allocation.needengine.NeedEngineApplication
>>> Exception while running need engine:
>>> org.apache.spark.SparkException: Job aborted due to stage failure: Task
>>> 8 in stage 31.0 failed 1 times, most recent failure: Lost task 8.0 in stage
>>> 31.0 (TID 50, localhost): java.lang.NullPointerException
>>> at
>>> org.spark-project.guava.base.Preconditions.checkNotNull(Preconditions.java:191)
>>> at com.google.common.base.Optional.of(Optional.java:86)
>>> at org.apache.spark.api.java.JavaUtils$.optionToOptional
>>> (JavaUtils.scala:30)
>>> at
>>> org.apache.spark.api.java.JavaPairRDD$$anonfun$leftOuterJoin$2.apply(JavaPairRDD.scala:564)
>>> at
>>> org.apache.spark.api.java.JavaPairRDD$$anonfun$leftOuterJoin$2.apply(JavaPairRDD.scala:564)
>>> at
>>> org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$41$$anonfun$apply$42.apply(PairRDDFunctions.scala:755)
>>> at
>>> org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$41$$anonfun$apply$42.apply(PairRDDFunctions.scala:755)
>>> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>>> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>>> at
>>> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:149)
>>> at
>>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>>> at
>>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>> at org.apache.spark.scheduler.Task.run(Task.scala:89)
>>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>>> at
>>> 

Re: getting NullPointerException while doing left outer join

2016-05-06 Thread Adam Westerman
Hi Ted,

I am working on replicating the problem on a smaller scale.

I saw that Spark 2.0 is moving to Java 8 Optional instead of Guava
Optional, but in the meantime I'm stuck with 1.6.1.

-Adam

On Fri, May 6, 2016 at 9:40 AM, Ted Yu  wrote:

> Is it possible to write a short test which exhibits this problem ?
>
> For Spark 2.0, this part of code has changed:
>
> [SPARK-4819] Remove Guava's "Optional" from public API
>
> FYI
>
> On Fri, May 6, 2016 at 6:57 AM, Adam Westerman  wrote:
>
>> Hi,
>>
>> I’m attempting to do a left outer join in Spark, and I’m getting an NPE
>> that appears to be due to some Spark Java API bug. (I’m running Spark 1.6.0
>> in local mode on a Mac).
>>
>> For a little background, the left outer join returns all keys from the
>> left side of the join regardless of whether or not the key is present on
>> the right side.  To handle this uncertainty, the value from the right side
>> is wrapped in Guava’s Optional class.  The Optional class has a method to
>> check whether the value is present or not (which would indicate the key
>> appeared in both RDDs being joined).  If the key was indeed present in both
>> RDDs you can then retrieve the value and move forward.
>>
>> After doing a little digging, I found that Spark is using Scala’s Option
>> functionality internally.  This is the same concept as the Guava Optional,
>> only native to Scala.  It appears that during the conversion from a Scala
>> Option back to a Guava Optional (this method can be found here:
>> https://github.com/apache/spark/blob/v1.6.0/core/src/main/scala/org/apache/spark/api/java/JavaUtils.scala#L28)
>>  the
>> conversion method is erroneously passed a Scala Option with the String
>> value “None” instead of Scala’s null value None.  This is matched to the
>> first *case*, which causes Guava’s Optional.of method to attempt to pull
>> the value out.  A NPE is thrown since it wasn’t ever actually there.
>>
>> The code basically looks like this, where the classes used are just plain
>> Java objects with some class attributes inside:
>> // First RDD
>> JavaPairRDD rdd1
>> // Second RDD
>> JavaPairRDD rdd2
>>
>> // Resultant RDD
>> JavaPairRDD> Optional>> result = rdd1.leftOuterJoin(rdd2)
>>
>> Has anyone ever encountered this problem before, or know why the
>> optionToOptional method might be getting passed this “None” value?  I’ve
>> added some more relevant information below, let me know if I can provide
>> any more details.
>>
>> Here's a screenshot showing the string value of “None” being passed into
>> the optionToOptional method using the debugger:
>>
>> Here’s the stack trace (the method shown above is highlighted):
>>
>> ERROR 13:17:00,743 com.tgt.allocation.needengine.NeedEngineApplication
>> Exception while running need engine:
>> org.apache.spark.SparkException: Job aborted due to stage failure: Task 8
>> in stage 31.0 failed 1 times, most recent failure: Lost task 8.0 in stage
>> 31.0 (TID 50, localhost): java.lang.NullPointerException
>> at
>> org.spark-project.guava.base.Preconditions.checkNotNull(Preconditions.java:191)
>> at com.google.common.base.Optional.of(Optional.java:86)
>> at org.apache.spark.api.java.JavaUtils$.optionToOptional
>> (JavaUtils.scala:30)
>> at
>> org.apache.spark.api.java.JavaPairRDD$$anonfun$leftOuterJoin$2.apply(JavaPairRDD.scala:564)
>> at
>> org.apache.spark.api.java.JavaPairRDD$$anonfun$leftOuterJoin$2.apply(JavaPairRDD.scala:564)
>> at
>> org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$41$$anonfun$apply$42.apply(PairRDDFunctions.scala:755)
>> at
>> org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$41$$anonfun$apply$42.apply(PairRDDFunctions.scala:755)
>> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>> at
>> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:149)
>> at
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>> at
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>> at org.apache.spark.scheduler.Task.run(Task.scala:89)
>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> at java.lang.Thread.run(Thread.java:745)
>>
>> Driver stacktrace:
>> at org.apache.spark.scheduler.DAGScheduler.org
>> 
>> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
>> at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
>> at
>> 

Re: getting NullPointerException while doing left outer join

2016-05-06 Thread Ted Yu
Is it possible to write a short test which exhibits this problem ?

For Spark 2.0, this part of code has changed:

[SPARK-4819] Remove Guava's "Optional" from public API

FYI

On Fri, May 6, 2016 at 6:57 AM, Adam Westerman  wrote:

> Hi,
>
> I’m attempting to do a left outer join in Spark, and I’m getting an NPE
> that appears to be due to some Spark Java API bug. (I’m running Spark 1.6.0
> in local mode on a Mac).
>
> For a little background, the left outer join returns all keys from the
> left side of the join regardless of whether or not the key is present on
> the right side.  To handle this uncertainty, the value from the right side
> is wrapped in Guava’s Optional class.  The Optional class has a method to
> check whether the value is present or not (which would indicate the key
> appeared in both RDDs being joined).  If the key was indeed present in both
> RDDs you can then retrieve the value and move forward.
>
> After doing a little digging, I found that Spark is using Scala’s Option
> functionality internally.  This is the same concept as the Guava Optional,
> only native to Scala.  It appears that during the conversion from a Scala
> Option back to a Guava Optional (this method can be found here:
> https://github.com/apache/spark/blob/v1.6.0/core/src/main/scala/org/apache/spark/api/java/JavaUtils.scala#L28)
>  the
> conversion method is erroneously passed a Scala Option with the String
> value “None” instead of Scala’s null value None.  This is matched to the
> first *case*, which causes Guava’s Optional.of method to attempt to pull
> the value out.  A NPE is thrown since it wasn’t ever actually there.
>
> The code basically looks like this, where the classes used are just plain
> Java objects with some class attributes inside:
> // First RDD
> JavaPairRDD rdd1
> // Second RDD
> JavaPairRDD rdd2
>
> // Resultant RDD
> JavaPairRDD>
> result = rdd1.leftOuterJoin(rdd2)
>
> Has anyone ever encountered this problem before, or know why the
> optionToOptional method might be getting passed this “None” value?  I’ve
> added some more relevant information below, let me know if I can provide
> any more details.
>
> Here's a screenshot showing the string value of “None” being passed into
> the optionToOptional method using the debugger:
>
> Here’s the stack trace (the method shown above is highlighted):
>
> ERROR 13:17:00,743 com.tgt.allocation.needengine.NeedEngineApplication
> Exception while running need engine:
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 8
> in stage 31.0 failed 1 times, most recent failure: Lost task 8.0 in stage
> 31.0 (TID 50, localhost): java.lang.NullPointerException
> at
> org.spark-project.guava.base.Preconditions.checkNotNull(Preconditions.java:191)
> at com.google.common.base.Optional.of(Optional.java:86)
> at org.apache.spark.api.java.JavaUtils$.optionToOptional
> (JavaUtils.scala:30)
> at
> org.apache.spark.api.java.JavaPairRDD$$anonfun$leftOuterJoin$2.apply(JavaPairRDD.scala:564)
> at
> org.apache.spark.api.java.JavaPairRDD$$anonfun$leftOuterJoin$2.apply(JavaPairRDD.scala:564)
> at
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$41$$anonfun$apply$42.apply(PairRDDFunctions.scala:755)
> at
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$41$$anonfun$apply$42.apply(PairRDDFunctions.scala:755)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> at
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:149)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
> Driver stacktrace:
> at org.apache.spark.scheduler.DAGScheduler.org
> 
> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
> at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
> at
> 

getting NullPointerException while doing left outer join

2016-05-06 Thread Adam Westerman
Hi,

I’m attempting to do a left outer join in Spark, and I’m getting an NPE
that appears to be due to some Spark Java API bug. (I’m running Spark 1.6.0
in local mode on a Mac).

For a little background, the left outer join returns all keys from the left
side of the join regardless of whether or not the key is present on the
right side.  To handle this uncertainty, the value from the right side is
wrapped in Guava’s Optional class.  The Optional class has a method to
check whether the value is present or not (which would indicate the key
appeared in both RDDs being joined).  If the key was indeed present in both
RDDs you can then retrieve the value and move forward.

After doing a little digging, I found that Spark is using Scala’s Option
functionality internally.  This is the same concept as the Guava Optional,
only native to Scala.  It appears that during the conversion from a Scala
Option back to a Guava Optional (this method can be found here:
https://github.com/apache/spark/blob/v1.6.0/core/src/main/scala/org/apache/spark/api/java/JavaUtils.scala#L28)
the
conversion method is erroneously passed a Scala Option with the String
value “None” instead of Scala’s null value None.  This is matched to the
first *case*, which causes Guava’s Optional.of method to attempt to pull
the value out.  A NPE is thrown since it wasn’t ever actually there.

The code basically looks like this, where the classes used are just plain
Java objects with some class attributes inside:
// First RDD
JavaPairRDD rdd1
// Second RDD
JavaPairRDD rdd2

// Resultant RDD
JavaPairRDD>
result = rdd1.leftOuterJoin(rdd2)

Has anyone ever encountered this problem before, or know why the
optionToOptional method might be getting passed this “None” value?  I’ve
added some more relevant information below, let me know if I can provide
any more details.

Here's a screenshot showing the string value of “None” being passed into
the optionToOptional method using the debugger:

Here’s the stack trace (the method shown above is highlighted):

ERROR 13:17:00,743 com.tgt.allocation.needengine.NeedEngineApplication
Exception while running need engine:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 8
in stage 31.0 failed 1 times, most recent failure: Lost task 8.0 in stage
31.0 (TID 50, localhost): java.lang.NullPointerException
at
org.spark-project.guava.base.Preconditions.checkNotNull(Preconditions.java:191)
at com.google.common.base.Optional.of(Optional.java:86)
at org.apache.spark.api.java.JavaUtils$.optionToOptional(JavaUtils.scala:30)
at
org.apache.spark.api.java.JavaPairRDD$$anonfun$leftOuterJoin$2.apply(JavaPairRDD.scala:564)
at
org.apache.spark.api.java.JavaPairRDD$$anonfun$leftOuterJoin$2.apply(JavaPairRDD.scala:564)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$41$$anonfun$apply$42.apply(PairRDDFunctions.scala:755)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$mapValues$1$$anonfun$apply$41$$anonfun$apply$42.apply(PairRDDFunctions.scala:755)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:149)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org

$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:236)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
at

MLLIB LDA throws NullPointerException

2016-04-06 Thread jamborta
Hi all,

I came across a really weird error on spark 1.6 (calling LDA from pyspark)

//data is [index, DenseVector] 
data1 = corpusZippedDataFiltered.repartition(100).sample(False, 0.1, 100)
data2 = sc.parallelize(data1.collect().repartition(100)

ldaModel1 = LDA.train(data1, k=10, maxIterations=10)
ldaModel2 = LDA.train(data2, k=10, maxIterations=10)

ldaModel2 completes OK (with or without repartitioning), but ldaModel1 fails
with:

An error occurred while calling o1812.trainLDAModel.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 36
in stage 1681.0 failed 4 times, most recent failure: Lost task 36.3 in stage
1681.0 (TID 60425, ip-10-33-65-169.eu-west-1.compute.internal):
java.lang.NullPointerException

Driver stacktrace:
at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
at scala.Option.foreach(Option.scala:236)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at 
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1952)
at org.apache.spark.rdd.RDD$$anonfun$fold$1.apply(RDD.scala:1081)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.fold(RDD.scala:1075)
at
org.apache.spark.mllib.clustering.EMLDAOptimizer.computeGlobalTopicTotals(LDAOptimizer.scala:205)
at
org.apache.spark.mllib.clustering.EMLDAOptimizer.next(LDAOptimizer.scala:192)
at
org.apache.spark.mllib.clustering.EMLDAOptimizer.next(LDAOptimizer.scala:80)
at org.apache.spark.mllib.clustering.LDA.run(LDA.scala:329)
at
org.apache.spark.mllib.api.python.PythonMLLibAPI.trainLDAModel(PythonMLLibAPI.scala:538)
at sun.reflect.GeneratedMethodAccessor123.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
at py4j.Gateway.invoke(Gateway.java:259)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:209)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/MLLIB-LDA-throws-NullPointerException-tp26686.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: NullPointerException

2016-03-12 Thread saurabh guru
I don't see how that would be possible. I am reading from a live stream of
data through kafka.

On Sat 12 Mar, 2016 20:28 Ted Yu,  wrote:

> Interesting.
> If kv._1 was null, shouldn't the NPE have come from getPartition() (line
> 105) ?
>
> Was it possible that records.next() returned null ?
>
> On Fri, Mar 11, 2016 at 11:20 PM, Prabhu Joseph <
> prabhujose.ga...@gmail.com> wrote:
>
>> Looking at ExternalSorter.scala line 192, i suspect some input record has
>> Null key.
>>
>> 189  while (records.hasNext) {
>> 190addElementsRead()
>> 191kv = records.next()
>> 192map.changeValue((getPartition(kv._1), kv._1), update)
>>
>>
>>
>> On Sat, Mar 12, 2016 at 12:48 PM, Prabhu Joseph <
>> prabhujose.ga...@gmail.com> wrote:
>>
>>> Looking at ExternalSorter.scala line 192
>>>
>>> 189
>>> while (records.hasNext) { addElementsRead() kv = records.next()
>>> map.changeValue((getPartition(kv._1), kv._1), update)
>>> maybeSpillCollection(usingMap = true) }
>>>
>>> On Sat, Mar 12, 2016 at 12:31 PM, Saurabh Guru 
>>> wrote:
>>>
 I am seeing the following exception in my Spark Cluster every few days
 in production.

 2016-03-12 05:30:00,541 - WARN  TaskSetManager - Lost task 0.0 in stage
 12528.0 (TID 18792, ip-1X-1XX-1-1XX.us 
 -west-1.compute.internal
 ): java.lang.NullPointerException
at
 org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:192)
at
 org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)


 I have debugged in local machine but haven’t been able to pin point the
 cause of the error. Anyone knows why this might occur? Any suggestions?


 Thanks,
 Saurabh




>>>
>>
>


Re: NullPointerException

2016-03-12 Thread Ted Yu
Interesting.
If kv._1 was null, shouldn't the NPE have come from getPartition() (line
105) ?

Was it possible that records.next() returned null ?

On Fri, Mar 11, 2016 at 11:20 PM, Prabhu Joseph 
wrote:

> Looking at ExternalSorter.scala line 192, i suspect some input record has
> Null key.
>
> 189  while (records.hasNext) {
> 190addElementsRead()
> 191kv = records.next()
> 192map.changeValue((getPartition(kv._1), kv._1), update)
>
>
>
> On Sat, Mar 12, 2016 at 12:48 PM, Prabhu Joseph <
> prabhujose.ga...@gmail.com> wrote:
>
>> Looking at ExternalSorter.scala line 192
>>
>> 189
>> while (records.hasNext) { addElementsRead() kv = records.next()
>> map.changeValue((getPartition(kv._1), kv._1), update)
>> maybeSpillCollection(usingMap = true) }
>>
>> On Sat, Mar 12, 2016 at 12:31 PM, Saurabh Guru 
>> wrote:
>>
>>> I am seeing the following exception in my Spark Cluster every few days
>>> in production.
>>>
>>> 2016-03-12 05:30:00,541 - WARN  TaskSetManager - Lost task 0.0 in stage
>>> 12528.0 (TID 18792, ip-1X-1XX-1-1XX.us 
>>> -west-1.compute.internal
>>> ): java.lang.NullPointerException
>>>at
>>> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:192)
>>>at
>>> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
>>>at
>>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>>>at
>>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>>at org.apache.spark.scheduler.Task.run(Task.scala:89)
>>>at
>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>>>at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>at java.lang.Thread.run(Thread.java:745)
>>>
>>>
>>> I have debugged in local machine but haven’t been able to pin point the
>>> cause of the error. Anyone knows why this might occur? Any suggestions?
>>>
>>>
>>> Thanks,
>>> Saurabh
>>>
>>>
>>>
>>>
>>
>


Re: NullPointerException

2016-03-11 Thread Saurabh Guru
I am using the following versions:


org.apache.spark
spark-streaming_2.10
1.6.0



org.apache.spark
spark-streaming-kafka_2.10
1.6.0



org.elasticsearch
elasticsearch-spark_2.10
2.2.0


Thanks,
Saurabh

:)



> On 12-Mar-2016, at 12:56 PM, Ted Yu  wrote:
> 
> Which Spark release do you use ?
> 
> I wonder if the following may have fixed the problem:
> SPARK-8029 Robust shuffle writer
> 
> JIRA is down, cannot check now.
> 
> On Fri, Mar 11, 2016 at 11:01 PM, Saurabh Guru  > wrote:
> I am seeing the following exception in my Spark Cluster every few days in 
> production.
> 
> 2016-03-12 05:30:00,541 - WARN  TaskSetManager - Lost task 0.0 in stage 
> 12528.0 (TID 18792, ip-1X-1XX-1-1XX.us 
> -west-1.compute.internal
> ): java.lang.NullPointerException
>at 
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:192)
>at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
>at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>at org.apache.spark.scheduler.Task.run(Task.scala:89)
>at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>at java.lang.Thread.run(Thread.java:745)
> 
> 
> I have debugged in local machine but haven’t been able to pin point the cause 
> of the error. Anyone knows why this might occur? Any suggestions? 
> 
> 
> Thanks,
> Saurabh
> 
> 
> 
> 



Re: NullPointerException

2016-03-11 Thread Ted Yu
Which Spark release do you use ?

I wonder if the following may have fixed the problem:
SPARK-8029 Robust shuffle writer

JIRA is down, cannot check now.

On Fri, Mar 11, 2016 at 11:01 PM, Saurabh Guru 
wrote:

> I am seeing the following exception in my Spark Cluster every few days in
> production.
>
> 2016-03-12 05:30:00,541 - WARN  TaskSetManager - Lost task 0.0 in stage
> 12528.0 (TID 18792, ip-1X-1XX-1-1XX.us 
> -west-1.compute.internal
> ): java.lang.NullPointerException
>at
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:192)
>at
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
>at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>at org.apache.spark.scheduler.Task.run(Task.scala:89)
>at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>at java.lang.Thread.run(Thread.java:745)
>
>
> I have debugged in local machine but haven’t been able to pin point the
> cause of the error. Anyone knows why this might occur? Any suggestions?
>
>
> Thanks,
> Saurabh
>
>
>
>


Re: NullPointerException

2016-03-11 Thread Prabhu Joseph
Looking at ExternalSorter.scala line 192, i suspect some input record has
Null key.

189  while (records.hasNext) {
190addElementsRead()
191kv = records.next()
192map.changeValue((getPartition(kv._1), kv._1), update)



On Sat, Mar 12, 2016 at 12:48 PM, Prabhu Joseph 
wrote:

> Looking at ExternalSorter.scala line 192
>
> 189
> while (records.hasNext) { addElementsRead() kv = records.next()
> map.changeValue((getPartition(kv._1), kv._1), update)
> maybeSpillCollection(usingMap = true) }
>
> On Sat, Mar 12, 2016 at 12:31 PM, Saurabh Guru 
> wrote:
>
>> I am seeing the following exception in my Spark Cluster every few days in
>> production.
>>
>> 2016-03-12 05:30:00,541 - WARN  TaskSetManager - Lost task 0.0 in stage
>> 12528.0 (TID 18792, ip-1X-1XX-1-1XX.us 
>> -west-1.compute.internal
>> ): java.lang.NullPointerException
>>at
>> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:192)
>>at
>> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
>>at
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>>at
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>at org.apache.spark.scheduler.Task.run(Task.scala:89)
>>at
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>>at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>at java.lang.Thread.run(Thread.java:745)
>>
>>
>> I have debugged in local machine but haven’t been able to pin point the
>> cause of the error. Anyone knows why this might occur? Any suggestions?
>>
>>
>> Thanks,
>> Saurabh
>>
>>
>>
>>
>


Re: NullPointerException

2016-03-11 Thread Prabhu Joseph
Looking at ExternalSorter.scala line 192

189
while (records.hasNext) { addElementsRead() kv = records.next()
map.changeValue((getPartition(kv._1), kv._1), update)
maybeSpillCollection(usingMap = true) }

On Sat, Mar 12, 2016 at 12:31 PM, Saurabh Guru 
wrote:

> I am seeing the following exception in my Spark Cluster every few days in
> production.
>
> 2016-03-12 05:30:00,541 - WARN  TaskSetManager - Lost task 0.0 in stage
> 12528.0 (TID 18792, ip-1X-1XX-1-1XX.us 
> -west-1.compute.internal
> ): java.lang.NullPointerException
>at
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:192)
>at
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
>at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>at org.apache.spark.scheduler.Task.run(Task.scala:89)
>at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>at java.lang.Thread.run(Thread.java:745)
>
>
> I have debugged in local machine but haven’t been able to pin point the
> cause of the error. Anyone knows why this might occur? Any suggestions?
>
>
> Thanks,
> Saurabh
>
>
>
>


NullPointerException

2016-03-11 Thread Saurabh Guru
I am seeing the following exception in my Spark Cluster every few days in 
production.

2016-03-12 05:30:00,541 - WARN  TaskSetManager - Lost task 0.0 in stage 12528.0 
(TID 18792, ip-1X-1XX-1-1XX.us 
-west-1.compute.internal
): java.lang.NullPointerException
   at 
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:192)
   at 
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
   at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
   at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
   at org.apache.spark.scheduler.Task.run(Task.scala:89)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
   at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)


I have debugged in local machine but haven’t been able to pin point the cause 
of the error. Anyone knows why this might occur? Any suggestions? 


Thanks,
Saurabh





Re: Streaming mapWithState API has NullPointerException

2016-02-23 Thread Tathagata Das
Yes, you should be okay to test your code. :)

On Mon, Feb 22, 2016 at 5:57 PM, Aris <arisofala...@gmail.com> wrote:

> If I build from git branch origin/branch-1.6 will I be OK to test out my
> code?
>
> Thank you so much TD!
>
> Aris
>
> On Mon, Feb 22, 2016 at 2:48 PM, Tathagata Das <
> tathagata.das1...@gmail.com> wrote:
>
>> There were a few bugs that were solved with mapWithState recently. Would
>> be available in 1.6.1 (RC to be cut soon).
>>
>> On Mon, Feb 22, 2016 at 5:29 PM, Aris <arisofala...@gmail.com> wrote:
>>
>>> Hello Spark community, and especially TD and Spark Streaming folks:
>>>
>>> I am using the new Spark 1.6.0 Streaming mapWithState API, in order to
>>> accomplish a streaming joining task with data.
>>>
>>> Things work fine on smaller sets of data, but on a single-node large
>>> cluster with JSON strings amounting to 2.5 GB problems start to occur, I
>>> get a NullPointerException. It appears to happen in my code when I call
>>> DataFrame.write.parquet()
>>>
>>> I am reliably reproducing this, and it appears to be internal to
>>> mapWithState -- I don't know what else I can do to make progress, any
>>> thoughts?
>>>
>>>
>>>
>>> Here is the stack trace:
>>>
>>> 16/02/22 22:03:54 ERROR Executor: Exception in task 1.0 in stage 4349.0
>>>> (TID 6386)
>>>> java.lang.NullPointerException
>>>> at
>>>> org.apache.spark.streaming.util.OpenHashMapBasedStateMap.getByTime(StateMap.scala:117)
>>>> at
>>>> org.apache.spark.streaming.util.OpenHashMapBasedStateMap.getByTime(StateMap.scala:117)
>>>> at
>>>> org.apache.spark.streaming.rdd.MapWithStateRDDRecord$.updateRecordWithData(MapWithStateRDD.scala:69)
>>>> at
>>>> org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:154)
>>>> at
>>>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>>>> at
>>>> org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
>>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
>>>> at
>>>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>>> at
>>>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>>>> at
>>>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>>> at
>>>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>>>> at
>>>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>>>> at
>>>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>>> at org.apache.spark.scheduler.Task.run(Task.scala:89)
>>>> at
>>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>>>> at
>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>> at
>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>> at java.lang.Thread.run(Thread.java:745)
>>>
>>>
>>>
>>>> 16/02/22 22:03:55 ERROR JobScheduler: Error running job streaming job
>>>> 145617858 ms.0
>>>> org.apache.spark.SparkException: Job aborted due to stage failure: Task
>>>> 12 in stage 4349.0 failed 1 times, most recent failure: Lost task 12.0 in
>>>> stage 4349.0 (TID 6397, localhost): java.lang.NullPointerException
>>>> at
>>>> org.apache.spark.streaming.util.OpenHashMapBasedStateMap.getByTime(StateMap.scala:117)
>>>> at
>>>> org.apache.spark.streaming.util.OpenHashMapBasedStateMap.getByTime(StateMap.scala:117)
>>>> at
>>>> org.apache.spark.streaming.rdd.MapWithStateRDDRecord$.updateRecordWithData(MapWithStateRDD.scala:69)
>>>> at
>>>> org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:154)
>>>> at
>>>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>>>> at
>>>> org.apache.spark.CacheManager.getOrCompute(

Re: Streaming mapWithState API has NullPointerException

2016-02-22 Thread Aris
If I build from git branch origin/branch-1.6 will I be OK to test out my
code?

Thank you so much TD!

Aris

On Mon, Feb 22, 2016 at 2:48 PM, Tathagata Das <tathagata.das1...@gmail.com>
wrote:

> There were a few bugs that were solved with mapWithState recently. Would
> be available in 1.6.1 (RC to be cut soon).
>
> On Mon, Feb 22, 2016 at 5:29 PM, Aris <arisofala...@gmail.com> wrote:
>
>> Hello Spark community, and especially TD and Spark Streaming folks:
>>
>> I am using the new Spark 1.6.0 Streaming mapWithState API, in order to
>> accomplish a streaming joining task with data.
>>
>> Things work fine on smaller sets of data, but on a single-node large
>> cluster with JSON strings amounting to 2.5 GB problems start to occur, I
>> get a NullPointerException. It appears to happen in my code when I call
>> DataFrame.write.parquet()
>>
>> I am reliably reproducing this, and it appears to be internal to
>> mapWithState -- I don't know what else I can do to make progress, any
>> thoughts?
>>
>>
>>
>> Here is the stack trace:
>>
>> 16/02/22 22:03:54 ERROR Executor: Exception in task 1.0 in stage 4349.0
>>> (TID 6386)
>>> java.lang.NullPointerException
>>> at
>>> org.apache.spark.streaming.util.OpenHashMapBasedStateMap.getByTime(StateMap.scala:117)
>>> at
>>> org.apache.spark.streaming.util.OpenHashMapBasedStateMap.getByTime(StateMap.scala:117)
>>> at
>>> org.apache.spark.streaming.rdd.MapWithStateRDDRecord$.updateRecordWithData(MapWithStateRDD.scala:69)
>>> at
>>> org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:154)
>>> at
>>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>>> at
>>> org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
>>> at
>>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>> at
>>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>>> at
>>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>> at
>>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>>> at
>>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>>> at
>>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>> at org.apache.spark.scheduler.Task.run(Task.scala:89)
>>> at
>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>>> at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>> at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>> at java.lang.Thread.run(Thread.java:745)
>>
>>
>>
>>> 16/02/22 22:03:55 ERROR JobScheduler: Error running job streaming job
>>> 145617858 ms.0
>>> org.apache.spark.SparkException: Job aborted due to stage failure: Task
>>> 12 in stage 4349.0 failed 1 times, most recent failure: Lost task 12.0 in
>>> stage 4349.0 (TID 6397, localhost): java.lang.NullPointerException
>>> at
>>> org.apache.spark.streaming.util.OpenHashMapBasedStateMap.getByTime(StateMap.scala:117)
>>> at
>>> org.apache.spark.streaming.util.OpenHashMapBasedStateMap.getByTime(StateMap.scala:117)
>>> at
>>> org.apache.spark.streaming.rdd.MapWithStateRDDRecord$.updateRecordWithData(MapWithStateRDD.scala:69)
>>> at
>>> org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:154)
>>> at
>>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>>> at
>>> org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
>>> at
>>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>> at
>>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>>> at
>>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>>> at
>>> org.apac

Re: Streaming mapWithState API has NullPointerException

2016-02-22 Thread Tathagata Das
There were a few bugs that were solved with mapWithState recently. Would be
available in 1.6.1 (RC to be cut soon).

On Mon, Feb 22, 2016 at 5:29 PM, Aris <arisofala...@gmail.com> wrote:

> Hello Spark community, and especially TD and Spark Streaming folks:
>
> I am using the new Spark 1.6.0 Streaming mapWithState API, in order to
> accomplish a streaming joining task with data.
>
> Things work fine on smaller sets of data, but on a single-node large
> cluster with JSON strings amounting to 2.5 GB problems start to occur, I
> get a NullPointerException. It appears to happen in my code when I call
> DataFrame.write.parquet()
>
> I am reliably reproducing this, and it appears to be internal to
> mapWithState -- I don't know what else I can do to make progress, any
> thoughts?
>
>
>
> Here is the stack trace:
>
> 16/02/22 22:03:54 ERROR Executor: Exception in task 1.0 in stage 4349.0
>> (TID 6386)
>> java.lang.NullPointerException
>> at
>> org.apache.spark.streaming.util.OpenHashMapBasedStateMap.getByTime(StateMap.scala:117)
>> at
>> org.apache.spark.streaming.util.OpenHashMapBasedStateMap.getByTime(StateMap.scala:117)
>> at
>> org.apache.spark.streaming.rdd.MapWithStateRDDRecord$.updateRecordWithData(MapWithStateRDD.scala:69)
>> at
>> org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:154)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>> at
>> org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
>> at
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>> at
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>> at
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>> at
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>> at org.apache.spark.scheduler.Task.run(Task.scala:89)
>> at
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> at java.lang.Thread.run(Thread.java:745)
>
>
>
>> 16/02/22 22:03:55 ERROR JobScheduler: Error running job streaming job
>> 145617858 ms.0
>> org.apache.spark.SparkException: Job aborted due to stage failure: Task
>> 12 in stage 4349.0 failed 1 times, most recent failure: Lost task 12.0 in
>> stage 4349.0 (TID 6397, localhost): java.lang.NullPointerException
>> at
>> org.apache.spark.streaming.util.OpenHashMapBasedStateMap.getByTime(StateMap.scala:117)
>> at
>> org.apache.spark.streaming.util.OpenHashMapBasedStateMap.getByTime(StateMap.scala:117)
>> at
>> org.apache.spark.streaming.rdd.MapWithStateRDDRecord$.updateRecordWithData(MapWithStateRDD.scala:69)
>> at
>> org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:154)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>> at
>> org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
>> at
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>> at
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>> at
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>> at
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>> at org.apache.spark.scheduler.Task.run(Task.scala:89)
>> at
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> 

Streaming mapWithState API has NullPointerException

2016-02-22 Thread Aris
Hello Spark community, and especially TD and Spark Streaming folks:

I am using the new Spark 1.6.0 Streaming mapWithState API, in order to
accomplish a streaming joining task with data.

Things work fine on smaller sets of data, but on a single-node large
cluster with JSON strings amounting to 2.5 GB problems start to occur, I
get a NullPointerException. It appears to happen in my code when I call
DataFrame.write.parquet()

I am reliably reproducing this, and it appears to be internal to
mapWithState -- I don't know what else I can do to make progress, any
thoughts?



Here is the stack trace:

16/02/22 22:03:54 ERROR Executor: Exception in task 1.0 in stage 4349.0
> (TID 6386)
> java.lang.NullPointerException
> at
> org.apache.spark.streaming.util.OpenHashMapBasedStateMap.getByTime(StateMap.scala:117)
> at
> org.apache.spark.streaming.util.OpenHashMapBasedStateMap.getByTime(StateMap.scala:117)
> at
> org.apache.spark.streaming.rdd.MapWithStateRDDRecord$.updateRecordWithData(MapWithStateRDD.scala:69)
> at
> org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:154)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at
> org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)



> 16/02/22 22:03:55 ERROR JobScheduler: Error running job streaming job
> 145617858 ms.0
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 12
> in stage 4349.0 failed 1 times, most recent failure: Lost task 12.0 in
> stage 4349.0 (TID 6397, localhost): java.lang.NullPointerException
> at
> org.apache.spark.streaming.util.OpenHashMapBasedStateMap.getByTime(StateMap.scala:117)
> at
> org.apache.spark.streaming.util.OpenHashMapBasedStateMap.getByTime(StateMap.scala:117)
> at
> org.apache.spark.streaming.rdd.MapWithStateRDDRecord$.updateRecordWithData(MapWithStateRDD.scala:69)
> at
> org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:154)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at
> org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Driver stacktrace:
> at org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
>   

RE: Random Forest FeatureImportance throwing NullPointerException

2016-01-14 Thread Rachana Srivastava
Tried using the 1.6 version of Spark, which takes numberOfFeatures as the fifth argument in
the API, but featureImportance still comes back as null.

RandomForestClassifier rfc = getRandomForestClassifier( numTrees,  maxBinSize,  
maxTreeDepth,  seed,  impurity);
RandomForestClassificationModel rfm = 
RandomForestClassificationModel.fromOld(model, rfc, categoricalFeatures, 
numberOfClasses,numberOfFeatures);
System.out.println(rfm.featureImportances());

Stack Trace:
Exception in thread "main" java.lang.NullPointerException
at 
org.apache.spark.ml.tree.impl.RandomForest$.computeFeatureImportance(RandomForest.scala:1152)
at 
org.apache.spark.ml.tree.impl.RandomForest$$anonfun$featureImportances$1.apply(RandomForest.scala:)
at 
org.apache.spark.ml.tree.impl.RandomForest$$anonfun$featureImportances$1.apply(RandomForest.scala:1108)
at 
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at 
scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at 
org.apache.spark.ml.tree.impl.RandomForest$.featureImportances(RandomForest.scala:1108)
at 
org.apache.spark.ml.classification.RandomForestClassificationModel.featureImportances$lzycompute(RandomForestClassifier.scala:237)
at 
org.apache.spark.ml.classification.RandomForestClassificationModel.featureImportances(RandomForestClassifier.scala:237)
at 
com.markmonitor.antifraud.ce.ml.CheckFeatureImportance.main(CheckFeatureImportance.java:49)

From: Rachana Srivastava
Sent: Wednesday, January 13, 2016 3:30 PM
To: 'user@spark.apache.org'; 'd...@spark.apache.org'
Subject: Random Forest FeatureImportance throwing NullPointerException

I have a Random forest model for which I am trying to get the featureImportance 
vector.

Map<Object,Object> categoricalFeaturesParam = new HashMap<>();
scala.collection.immutable.Map<Object,Object> categoricalFeatures =  
(scala.collection.immutable.Map<Object,Object>)
scala.collection.immutable.Map$.MODULE$.apply(JavaConversions.mapAsScalaMap(categoricalFeaturesParam).toSeq());
int numberOfClasses =2;
RandomForestClassifier rfc = new RandomForestClassifier();
RandomForestClassificationModel rfm = 
RandomForestClassificationModel.fromOld(model, rfc, categoricalFeatures, 
numberOfClasses);
System.out.println(rfm.featureImportances());

When I run the above code, featureImportance is null. Do I need to set
anything specific to get the feature importance for the random forest model?

Thanks,

Rachana


Re: Random Forest FeatureImportance throwing NullPointerException

2016-01-14 Thread Bryan Cutler
If you are able to just train the RandomForestClassificationModel from ML
directly instead of training the old model and converting, then that would
be the way to go.
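
A minimal sketch of that direct route, assuming a training DataFrame with the
default "label" and "features" columns (the DataFrame name below is
illustrative; the hyperparameter variables are the ones from the earlier code):

import org.apache.spark.ml.classification.RandomForestClassifier

// Training through the ML API computes the impurity stats that
// featureImportances depends on.
val rf = new RandomForestClassifier()
  .setNumTrees(numTrees)
  .setMaxBins(maxBinSize)
  .setMaxDepth(maxTreeDepth)
  .setSeed(seed)
  .setImpurity(impurity)

val rfModel = rf.fit(training)   // training: DataFrame with "label" and "features"
println(rfModel.featureImportances)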

On Thu, Jan 14, 2016 at 2:21 PM, <rachana.srivast...@thomsonreuters.com>
wrote:

> Thanks so much Bryan for your response.  Is there any workaround?
>
>
>
> *From:* Bryan Cutler [mailto:cutl...@gmail.com]
> *Sent:* Thursday, January 14, 2016 2:19 PM
> *To:* Rachana Srivastava
> *Cc:* user@spark.apache.org; d...@spark.apache.org
> *Subject:* Re: Random Forest FeatureImportance throwing
> NullPointerException
>
>
>
> Hi Rachana,
>
> I got the same exception.  It is because computing the feature importance
> depends on impurity stats, which is not calculated with the old
> RandomForestModel in MLlib.  Feel free to create a JIRA for this if you
> think it is necessary, otherwise I believe this problem will be eventually
> solved as part of this JIRA
> https://issues.apache.org/jira/browse/SPARK-12183
>
> Bryan
>
>
>
> On Thu, Jan 14, 2016 at 8:12 AM, Rachana Srivastava <
> rachana.srivast...@markmonitor.com> wrote:
>
> Tried using 1.6 version of Spark that takes numberOfFeatures fifth
> argument in  the API but still getting featureImportance as null.
>
>
>
> RandomForestClassifier rfc = *getRandomForestClassifier*( numTrees,
> maxBinSize,  maxTreeDepth,  seed,  impurity);
>
> RandomForestClassificationModel rfm = RandomForestClassificationModel.
> *fromOld*(model, rfc, categoricalFeatures, numberOfClasses,
> numberOfFeatures);
>
> System.*out*.println(rfm.featureImportances());
>
>
>
> Stack Trace:
>
> Exception in thread "main" *java.lang.NullPointerException*
>
> at
> org.apache.spark.ml.tree.impl.RandomForest$.computeFeatureImportance(RandomForest.scala:1152)
>
> at
> org.apache.spark.ml.tree.impl.RandomForest$$anonfun$featureImportances$1.apply(RandomForest.scala:)
>
> at
> org.apache.spark.ml.tree.impl.RandomForest$$anonfun$featureImportances$1.apply(RandomForest.scala:1108)
>
> at
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>
> at
> scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>
> at
> org.apache.spark.ml.tree.impl.RandomForest$.featureImportances(RandomForest.scala:1108)
>
> at
> org.apache.spark.ml.classification.RandomForestClassificationModel.featureImportances$lzycompute(RandomForestClassifier.scala:237)
>
> at
> org.apache.spark.ml.classification.RandomForestClassificationModel.featureImportances(RandomForestClassifier.scala:237)
>
> at
> com.markmonitor.antifraud.ce.ml.CheckFeatureImportance.main(
> *CheckFeatureImportance.java:49*)
>
>
>
> *From:* Rachana Srivastava
> *Sent:* Wednesday, January 13, 2016 3:30 PM
> *To:* 'user@spark.apache.org'; 'd...@spark.apache.org'
> *Subject:* Random Forest FeatureImportance throwing NullPointerException
>
>
>
> I have a Random forest model for which I am trying to get the
> featureImportance vector.
>
>
>
> Map<Object,Object> categoricalFeaturesParam = *new* HashMap<>();
>
> scala.collection.immutable.Map<Object,Object> categoricalFeatures =
>  (scala.collection.immutable.Map<Object,Object>)
>
> scala.collection.immutable.Map$.*MODULE$*.apply(JavaConversions.
> *mapAsScalaMap*(categoricalFeaturesParam).toSeq());
>
> *int* numberOfClasses =2;
>
> RandomForestClassifier rfc = *new* RandomForestClassifier();
>
> RandomForestClassificationModel rfm = RandomForestClassificationModel.
> *fromOld*(model, rfc, categoricalFeatures, numberOfClasses);
>
> System.*out*.println(rfm.featureImportances());
>
>
>
> When I run above code I found featureImportance as null.  Do I need to set
> anything in specific to get the feature importance for the random forest
> model.
>
>
>
> Thanks,
>
>
>
> Rachana
>
>
>


Re: Random Forest FeatureImportance throwing NullPointerException

2016-01-14 Thread Bryan Cutler
Hi Rachana,

I got the same exception.  It is because computing the feature importance
depends on impurity stats, which is not calculated with the old
RandomForestModel in MLlib.  Feel free to create a JIRA for this if you
think it is necessary, otherwise I believe this problem will be eventually
solved as part of this JIRA
https://issues.apache.org/jira/browse/SPARK-12183

Bryan

On Thu, Jan 14, 2016 at 8:12 AM, Rachana Srivastava <
rachana.srivast...@markmonitor.com> wrote:

> Tried using 1.6 version of Spark that takes numberOfFeatures fifth
> argument in  the API but still getting featureImportance as null.
>
>
>
> RandomForestClassifier rfc = *getRandomForestClassifier*( numTrees,
> maxBinSize,  maxTreeDepth,  seed,  impurity);
>
> RandomForestClassificationModel rfm = RandomForestClassificationModel.
> *fromOld*(model, rfc, categoricalFeatures, numberOfClasses,
> numberOfFeatures);
>
> System.*out*.println(rfm.featureImportances());
>
>
>
> Stack Trace:
>
> Exception in thread "main" *java.lang.NullPointerException*
>
> at
> org.apache.spark.ml.tree.impl.RandomForest$.computeFeatureImportance(RandomForest.scala:1152)
>
> at
> org.apache.spark.ml.tree.impl.RandomForest$$anonfun$featureImportances$1.apply(RandomForest.scala:)
>
> at
> org.apache.spark.ml.tree.impl.RandomForest$$anonfun$featureImportances$1.apply(RandomForest.scala:1108)
>
> at
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>
> at
> scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>
> at
> org.apache.spark.ml.tree.impl.RandomForest$.featureImportances(RandomForest.scala:1108)
>
> at
> org.apache.spark.ml.classification.RandomForestClassificationModel.featureImportances$lzycompute(RandomForestClassifier.scala:237)
>
> at
> org.apache.spark.ml.classification.RandomForestClassificationModel.featureImportances(RandomForestClassifier.scala:237)
>
> at
> com.markmonitor.antifraud.ce.ml.CheckFeatureImportance.main(
> *CheckFeatureImportance.java:49*)
>
>
>
> *From:* Rachana Srivastava
> *Sent:* Wednesday, January 13, 2016 3:30 PM
> *To:* 'user@spark.apache.org'; 'd...@spark.apache.org'
> *Subject:* Random Forest FeatureImportance throwing NullPointerException
>
>
>
> I have a Random forest model for which I am trying to get the
> featureImportance vector.
>
>
>
> Map<Object,Object> categoricalFeaturesParam = *new* HashMap<>();
>
> scala.collection.immutable.Map<Object,Object> categoricalFeatures =
>  (scala.collection.immutable.Map<Object,Object>)
>
> scala.collection.immutable.Map$.*MODULE$*.apply(JavaConversions.
> *mapAsScalaMap*(categoricalFeaturesParam).toSeq());
>
> *int* numberOfClasses =2;
>
> RandomForestClassifier rfc = *new* RandomForestClassifier();
>
> RandomForestClassificationModel rfm = RandomForestClassificationModel.
> *fromOld*(model, rfc, categoricalFeatures, numberOfClasses);
>
> System.*out*.println(rfm.featureImportances());
>
>
>
> When I run above code I found featureImportance as null.  Do I need to set
> anything in specific to get the feature importance for the random forest
> model.
>
>
>
> Thanks,
>
>
>
> Rachana
>


Random Forest FeatureImportance throwing NullPointerException

2016-01-13 Thread Rachana Srivastava
I have a Random forest model for which I am trying to get the featureImportance 
vector.

Map<Object,Object> categoricalFeaturesParam = new HashMap<>();
scala.collection.immutable.Map<Object,Object> categoricalFeatures =  
(scala.collection.immutable.Map<Object,Object>)
scala.collection.immutable.Map$.MODULE$.apply(JavaConversions.mapAsScalaMap(categoricalFeaturesParam).toSeq());
int numberOfClasses =2;
RandomForestClassifier rfc = new RandomForestClassifier();
RandomForestClassificationModel rfm = 
RandomForestClassificationModel.fromOld(model, rfc, categoricalFeatures, 
numberOfClasses);
System.out.println(rfm.featureImportances());

When I run the above code, featureImportance is null. Do I need to set
anything specific to get the feature importance for the random forest model?

Thanks,

Rachana


DataFrame withColumnRenamed throwing NullPointerException

2016-01-05 Thread Prasad Ravilla
I am joining two data frames as shown in the code below. This is throwing a
NullPointerException.

I have a number of different joins throughout the program, and the SparkContext
throws this NullPointerException randomly on one of the joins.
The two data frames are very large (around 1 TB).

I am using Spark version 1.5.2.

Thanks in advance for any insights.

Regards,
Prasad.


Below is the code.

val userAndFmSegment = userData.as("userdata")
  .join(fmSegmentData.withColumnRenamed("USER_ID", "FM_USER_ID").as("fmsegmentdata"),
    $"userdata.PRIMARY_USER_ID" === $"fmsegmentdata.FM_USER_ID"
      && $"fmsegmentdata.END_DATE" >= date_sub($"userdata.REPORT_DATE", trailingWeeks * 7)
      && $"fmsegmentdata.START_DATE" <= date_sub($"userdata.REPORT_DATE", trailingWeeks * 7),
    "inner")
  .select(
    "USER_ID",
    "PRIMARY_USER_ID",
    "FM_BUYER_TYPE_CD"
  )





Log


16/01/05 17:41:19 ERROR ApplicationMaster: User class threw exception: 
java.lang.NullPointerException

java.lang.NullPointerException

at org.apache.spark.sql.DataFrame.withColumnRenamed(DataFrame.scala:1161)

at DnaAgg$.getUserIdAndFMSegmentId$1(DnaAgg.scala:294)

at DnaAgg$.main(DnaAgg.scala:339)

at DnaAgg.main(DnaAgg.scala)

at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)

at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)

at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:606)

at 
org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:525)





Re: NullPointerException with joda time

2015-11-12 Thread Romain Sagean
at
>>> heatmap.scala:116, took 45,476562 s
>>> Exception in thread "main" org.apache.spark.SparkException: Job aborted
>>> due to stage failure: Task 205 in stage 3.0 failed 4 times, most recent
>>> failure: Lost task 205.3 in stage 3.0 (TID 803, R610-2.pro.hupi.loc):
>>> java.lang.NullPointerException
>>> at org.joda.time.DateTime.plusDays(DateTime.java:1070)
>>> at Heatmap$.allDates(heatmap.scala:34)
>>> at Heatmap$$anonfun$12.apply(heatmap.scala:97)
>>> at Heatmap$$anonfun$12.apply(heatmap.scala:97)
>>> at
>>> org.apache.spark.rdd.PairRDDFunctions$$anonfun$flatMapValues$1$$anonfun$apply$16.apply(PairRDDFunctions.scala:686)
>>> at
>>> org.apache.spark.rdd.PairRDDFunctions$$anonfun$flatMapValues$1$$anonfun$apply$16.apply(PairRDDFunctions.scala:685)
>>> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>>> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>>> at
>>> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125)
>>> at
>>> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:160)
>>> at
>>> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
>>> at
>>> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>>> at
>>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>>> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>>> at
>>> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
>>> at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:159)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>>> at
>>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>>> at
>>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>>> at
>>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>>> at
>>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>>> at
>>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>>> at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>>> at
>>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>>> at
>>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>> at org.apache.spark.scheduler.Task.run(Task.scala:64)
>>> at
>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
>>> at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>> at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>> at java.lang.Thread.run(Thread.java:745)
>>>
>>> Driver stacktrace:
>>> at org.apache.spark.scheduler.DAGScheduler.org
>>> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1210)
>>> at
>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1199)
>>> at
>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1198)
>>> at
>>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>>> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>>> at
>>> org.apache.spark.sc

Re: NullPointerException with joda time

2015-11-12 Thread Ted Yu
.scala:68)
>>>> at
>>>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>>> at org.apache.spark.scheduler.Task.run(Task.scala:64)
>>>> at
>>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
>>>> at
>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>> at
>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>> at java.lang.Thread.run(Thread.java:745)
>>>>
>>>> Driver stacktrace:
>>>> at org.apache.spark.scheduler.DAGScheduler.org
>>>> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1210)
>>>> at
>>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1199)
>>>> at
>>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1198)
>>>> at
>>>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>>>> at
>>>> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>>>> at
>>>> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1198)
>>>> at
>>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
>>>> at
>>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
>>>> at scala.Option.foreach(Option.scala:236)
>>>> at
>>>> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
>>>> at
>>>> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1400)
>>>> at
>>>> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1361)
>>>> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>>>> 15/11/10 21:10:36 WARN TaskSetManager: Lost task 209.0 in stage 3.0
>>>> (TID 804, R610-2.pro.hupi.loc): TaskKilled (killed intentionally)
>>>> SLF4J: Class path contains multiple SLF4J bindings.
>>>> SLF4J: Found binding in
>>>> [jar:file:/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>>> SLF4J: Found binding in
>>>> [jar:file:/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/jars/avro-tools-1.7.6-cdh5.4.7.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>>> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
>>>> explanation.
>>>> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
>>>>
>>>>
>>>> 2015-11-10 18:39 GMT+01:00 Ted Yu <yuzhih...@gmail.com>:
>>>>
>>>>> Can you show the stack trace for the NPE ?
>>>>>
>>>>> Which release of Spark are you using ?
>>>>>
>>>>> Cheers
>>>>>
>>>>> On Tue, Nov 10, 2015 at 8:20 AM, romain sagean <romain.sag...@hupi.fr>
>>>>> wrote:
>>>>>
>>>>>> Hi community,
>>>>>> I try to apply the function below during a flatMapValues or a map but
>>>>>> I get a nullPointerException with the plusDays(1). What did I miss ?
>>>>>>
>>>>>> def allDates(dateSeq: Seq[DateTime], dateEnd: DateTime):
>>>>>> Seq[DateTime] = {
>>>>>> if (dateSeq.last.isBefore(dateEnd)){
>>>>>>   allDates(dateSeq:+ dateSeq.last.plusDays(1), dateEnd)
>>>>>> } else {
>>>>>>   dateSeq
>>>>>> }
>>>>>>   }
>>>>>>
>>>>>> val videoAllDates = .select("player_id", "mindate").flatMapValues(
>>>>>> minDate => allDates(Seq(minDate), endDate))
>>>>>>
>>>>>>
>>>>>> -
>>>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> *Romain Sagean*
>>>>
>>>> *romain.sag...@hupi.fr <romain.sag...@hupi.fr>*
>>>>
>>>>
>>>
>>
>
>
> --
> *Romain Sagean*
>
> *romain.sag...@hupi.fr <romain.sag...@hupi.fr>*
>
>


Re: NullPointerException with joda time

2015-11-12 Thread Koert Kuipers
;>> at
>>>>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>>>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>>>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>>>>> at
>>>>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>>>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>>>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>>>>> at
>>>>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>>>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>>>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>>>>> at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
>>>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>>>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>>>>> at
>>>>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>>>>> at
>>>>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>>>> at org.apache.spark.scheduler.Task.run(Task.scala:64)
>>>>> at
>>>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
>>>>> at
>>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>>> at
>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>>> at java.lang.Thread.run(Thread.java:745)
>>>>>
>>>>> Driver stacktrace:
>>>>> at org.apache.spark.scheduler.DAGScheduler.org
>>>>> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1210)
>>>>> at
>>>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1199)
>>>>> at
>>>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1198)
>>>>> at
>>>>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>>>>> at
>>>>> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>>>>> at
>>>>> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1198)
>>>>> at
>>>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
>>>>> at
>>>>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
>>>>> at scala.Option.foreach(Option.scala:236)
>>>>> at
>>>>> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
>>>>> at
>>>>> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1400)
>>>>> at
>>>>> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1361)
>>>>> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>>>>> 15/11/10 21:10:36 WARN TaskSetManager: Lost task 209.0 in stage 3.0
>>>>> (TID 804, R610-2.pro.hupi.loc): TaskKilled (killed intentionally)
>>>>> SLF4J: Class path contains multiple SLF4J bindings.
>>>>> SLF4J: Found binding in
>>>>> [jar:file:/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>>>> SLF4J: Found binding in
>>>>> [jar:file:/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/jars/avro-tools-1.7.6-cdh5.4.7.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>>>> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
>>>>> explanation.
>>>>> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
>>>>>
>>>>>
>>>>> 2015-11-10 18:39 GMT+01:00 Ted Yu <yuzhih...@gmail.com>:
>>>>>
>>>>>> Can you show the stack trace for the NPE ?
>>>>>>
>>>>>> Which release of Spark are you using ?
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>> On Tue, Nov 10, 2015 at 8:20 AM, romain sagean <romain.sag...@hupi.fr
>>>>>> > wrote:
>>>>>>
>>>>>>> Hi community,
>>>>>>> I try to apply the function below during a flatMapValues or a map
>>>>>>> but I get a nullPointerException with the plusDays(1). What did I miss ?
>>>>>>>
>>>>>>> def allDates(dateSeq: Seq[DateTime], dateEnd: DateTime):
>>>>>>> Seq[DateTime] = {
>>>>>>> if (dateSeq.last.isBefore(dateEnd)){
>>>>>>>   allDates(dateSeq:+ dateSeq.last.plusDays(1), dateEnd)
>>>>>>> } else {
>>>>>>>   dateSeq
>>>>>>> }
>>>>>>>   }
>>>>>>>
>>>>>>> val videoAllDates = .select("player_id", "mindate").flatMapValues(
>>>>>>> minDate => allDates(Seq(minDate), endDate))
>>>>>>>
>>>>>>>
>>>>>>> -
>>>>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>>>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> *Romain Sagean*
>>>>>
>>>>> *romain.sag...@hupi.fr <romain.sag...@hupi.fr>*
>>>>>
>>>>>
>>>>
>>>
>>
>>
>> --
>> *Romain Sagean*
>>
>> *romain.sag...@hupi.fr <romain.sag...@hupi.fr>*
>>
>>
>


Re: NullPointerException with joda time

2015-11-11 Thread Ted Yu
.insertAll(ExternalAppendOnlyMap.scala:125)
>> at
>> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:160)
>> at
>> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
>> at
>> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>> at
>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>> at
>> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
>> at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:159)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>> at
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>> at
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>> at
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>> at
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>> at
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>> at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>> at
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>> at
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>> at org.apache.spark.scheduler.Task.run(Task.scala:64)
>> at
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> at java.lang.Thread.run(Thread.java:745)
>>
>> Driver stacktrace:
>> at org.apache.spark.scheduler.DAGScheduler.org
>> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1210)
>> at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1199)
>> at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1198)
>> at
>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>> at
>> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1198)
>> at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
>> at
>> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
>> at scala.Option.foreach(Option.scala:236)
>> at
>> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
>> at
>> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1400)
>> at
>> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1361)
>> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>> 15/11/10 21:10:36 WARN TaskSetManager: Lost task 209.0 in stage 3.0 (TID
>> 804, R610-2.pro.hupi.loc): TaskKilled (killed intentionally)
>> SLF4J: Class path contains multiple SLF4J bindings.
>> SLF4J: Found binding in
>> [jar:file:/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: Found binding in
>> [jar:file:/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/jars/avro-tools-1.7.6-cdh5.4.7.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
>> explanation.
>> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
>>
>>
>> 2015-11-10 18:39 GMT+01:00 Ted Yu <yuzhih...@gmail.com>:
>>
>>> Can you show the stack trace for the NPE ?
>>>
>>> Which release of Spark are you using ?
>>>
>>> Cheers
>>>
>>> On Tue, Nov 10, 2015 at 8:20 AM, romain sagean <romain.sag...@hupi.fr>
>>> wrote:
>>>
>>>> Hi community,
>>>> I try to apply the function below during a flatMapValues or a map but I
>>>> get a nullPointerException with the plusDays(1). What did I miss ?
>>>>
>>>> def allDates(dateSeq: Seq[DateTime], dateEnd: DateTime): Seq[DateTime]
>>>> = {
>>>> if (dateSeq.last.isBefore(dateEnd)){
>>>>   allDates(dateSeq:+ dateSeq.last.plusDays(1), dateEnd)
>>>> } else {
>>>>   dateSeq
>>>> }
>>>>   }
>>>>
>>>> val videoAllDates = .select("player_id", "mindate").flatMapValues(
>>>> minDate => allDates(Seq(minDate), endDate))
>>>>
>>>>
>>>> -
>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>>
>>>>
>>>
>>
>>
>> --
>> *Romain Sagean*
>>
>> *romain.sag...@hupi.fr <romain.sag...@hupi.fr>*
>>
>>
>


NullPointerException with joda time

2015-11-10 Thread romain sagean

Hi community,
I am trying to apply the function below during a flatMapValues or a map, but I
get a NullPointerException at the plusDays(1) call. What did I miss?


def allDates(dateSeq: Seq[DateTime], dateEnd: DateTime): Seq[DateTime] = {
if (dateSeq.last.isBefore(dateEnd)){
  allDates(dateSeq:+ dateSeq.last.plusDays(1), dateEnd)
} else {
  dateSeq
}
  }

val videoAllDates = .select("player_id", "mindate").flatMapValues( 
minDate => allDates(Seq(minDate), endDate))
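
For what it is worth, the recursion itself behaves as expected when run
locally, outside any Spark closure (a standalone sketch with illustrative
dates):

import org.joda.time.DateTime

def allDates(dateSeq: Seq[DateTime], dateEnd: DateTime): Seq[DateTime] =
  if (dateSeq.last.isBefore(dateEnd)) allDates(dateSeq :+ dateSeq.last.plusDays(1), dateEnd)
  else dateSeq

val start = new DateTime(2015, 11, 1, 0, 0)   // illustrative dates
val end   = new DateTime(2015, 11, 4, 0, 0)
println(allDates(Seq(start), end).map(_.toLocalDate))
// List(2015-11-01, 2015-11-02, 2015-11-03, 2015-11-04)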



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: NullPointerException with joda time

2015-11-10 Thread Ted Yu
Can you show the stack trace for the NPE ?

Which release of Spark are you using ?

Cheers

On Tue, Nov 10, 2015 at 8:20 AM, romain sagean <romain.sag...@hupi.fr>
wrote:

> Hi community,
> I try to apply the function below during a flatMapValues or a map but I
> get a nullPointerException with the plusDays(1). What did I miss ?
>
> def allDates(dateSeq: Seq[DateTime], dateEnd: DateTime): Seq[DateTime] = {
> if (dateSeq.last.isBefore(dateEnd)){
>   allDates(dateSeq:+ dateSeq.last.plusDays(1), dateEnd)
> } else {
>   dateSeq
> }
>   }
>
> val videoAllDates = .select("player_id", "mindate").flatMapValues( minDate
> => allDates(Seq(minDate), endDate))
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: NullPointerException with joda time

2015-11-10 Thread Romain Sagean
apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1210)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1199)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1198)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1198)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1400)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1361)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
15/11/10 21:10:36 WARN TaskSetManager: Lost task 209.0 in stage 3.0 (TID
804, R610-2.pro.hupi.loc): TaskKilled (killed intentionally)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in
[jar:file:/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in
[jar:file:/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/jars/avro-tools-1.7.6-cdh5.4.7.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]


2015-11-10 18:39 GMT+01:00 Ted Yu <yuzhih...@gmail.com>:

> Can you show the stack trace for the NPE ?
>
> Which release of Spark are you using ?
>
> Cheers
>
> On Tue, Nov 10, 2015 at 8:20 AM, romain sagean <romain.sag...@hupi.fr>
> wrote:
>
>> Hi community,
>> I try to apply the function below during a flatMapValues or a map but I
>> get a nullPointerException with the plusDays(1). What did I miss ?
>>
>> def allDates(dateSeq: Seq[DateTime], dateEnd: DateTime): Seq[DateTime] = {
>> if (dateSeq.last.isBefore(dateEnd)){
>>   allDates(dateSeq:+ dateSeq.last.plusDays(1), dateEnd)
>> } else {
>>   dateSeq
>> }
>>   }
>>
>> val videoAllDates = .select("player_id", "mindate").flatMapValues(
>> minDate => allDates(Seq(minDate), endDate))
>>
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


-- 
*Romain Sagean*

*romain.sag...@hupi.fr <romain.sag...@hupi.fr>*


Re: NullPointerException with joda time

2015-11-10 Thread Ted Yu
itionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:64)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
>
> Driver stacktrace:
> at org.apache.spark.scheduler.DAGScheduler.org
> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1210)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1199)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1198)
> at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1198)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
> at
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
> at scala.Option.foreach(Option.scala:236)
> at
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1400)
> at
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1361)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> 15/11/10 21:10:36 WARN TaskSetManager: Lost task 209.0 in stage 3.0 (TID
> 804, R610-2.pro.hupi.loc): TaskKilled (killed intentionally)
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in
> [jar:file:/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in
> [jar:file:/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/jars/avro-tools-1.7.6-cdh5.4.7.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
>
>
> 2015-11-10 18:39 GMT+01:00 Ted Yu <yuzhih...@gmail.com>:
>
>> Can you show the stack trace for the NPE ?
>>
>> Which release of Spark are you using ?
>>
>> Cheers
>>
>> On Tue, Nov 10, 2015 at 8:20 AM, romain sagean <romain.sag...@hupi.fr>
>> wrote:
>>
>>> Hi community,
>>> I try to apply the function below during a flatMapValues or a map but I
>>> get a nullPointerException with the plusDays(1). What did I miss ?
>>>
>>> def allDates(dateSeq: Seq[DateTime], dateEnd: DateTime): Seq[DateTime] =
>>> {
>>> if (dateSeq.last.isBefore(dateEnd)){
>>>   allDates(dateSeq:+ dateSeq.last.plusDays(1), dateEnd)
>>> } else {
>>>   dateSeq
>>> }
>>>   }
>>>
>>> val videoAllDates = .select("player_id", "mindate").flatMapValues(
>>> minDate => allDates(Seq(minDate), endDate))
>>>
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>
>
> --
> *Romain Sagean*
>
> *romain.sag...@hupi.fr <romain.sag...@hupi.fr>*
>
>


Re: NullPointerException when cache DataFrame in Java (Spark1.5.1)

2015-10-29 Thread Romi Kuntsman
Did you try to cache a DataFrame with just a single row?
Do your rows have any columns with null values?
Can you post a code snippet here on how you load/generate the dataframe?
Does dataframe.rdd.cache work?
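
As code, those checks look roughly like this (a sketch; df stands for the
DataFrame in question, and the column name is illustrative):

val single = df.limit(1)                  // a DataFrame with just a single row
single.cache().count()                    // does caching already fail for one row?

df.filter(df("domain").isNull).count()    // any nulls left in a suspect column?

df.rdd.cache()                            // does caching the underlying RDD work?
df.rdd.count()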

*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com

On Thu, Oct 29, 2015 at 4:33 AM, Zhang, Jingyu 
wrote:

> It is not a problem to use JavaRDD.cache() for 200M data (all Objects read
> form Json Format). But when I try to use DataFrame.cache(), It shown
> exception in below.
>
> My machine can cache 1 G data in Avro format without any problem.
>
> 15/10/29 13:26:23 INFO GeneratePredicate: Code generated in 154.531827 ms
>
> 15/10/29 13:26:23 INFO GenerateUnsafeProjection: Code generated in
> 27.832369 ms
>
> 15/10/29 13:26:23 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID
> 1)
>
> java.lang.NullPointerException
>
> at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
>
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(
> DelegatingMethodAccessorImpl.java:43)
>
> at java.lang.reflect.Method.invoke(Method.java:497)
>
> at
> org.apache.spark.sql.SQLContext$$anonfun$9$$anonfun$apply$1$$anonfun$apply$2.apply(
> SQLContext.scala:500)
>
> at
> org.apache.spark.sql.SQLContext$$anonfun$9$$anonfun$apply$1$$anonfun$apply$2.apply(
> SQLContext.scala:500)
>
> at scala.collection.TraversableLike$$anonfun$map$1.apply(
> TraversableLike.scala:244)
>
> at scala.collection.TraversableLike$$anonfun$map$1.apply(
> TraversableLike.scala:244)
>
> at scala.collection.IndexedSeqOptimized$class.foreach(
> IndexedSeqOptimized.scala:33)
>
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>
> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
>
> at org.apache.spark.sql.SQLContext$$anonfun$9$$anonfun$apply$1.apply(
> SQLContext.scala:500)
>
> at org.apache.spark.sql.SQLContext$$anonfun$9$$anonfun$apply$1.apply(
> SQLContext.scala:498)
>
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>
> at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:389)
>
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>
> at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(
> InMemoryColumnarTableScan.scala:127)
>
> at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(
> InMemoryColumnarTableScan.scala:120)
>
> at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:278
> )
>
> at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
>
> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
>
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
>
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38
> )
>
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38
> )
>
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
>
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>
> at java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1142)
>
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:617)
>
> at java.lang.Thread.run(Thread.java:745)
>
> 15/10/29 13:26:23 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1,
> localhost): java.lang.NullPointerException
>
> at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
>
>
> Thanks,
>
>
> Jingyu
>


Re: NullPointerException when cache DataFrame in Java (Spark1.5.1)

2015-10-29 Thread Zhang, Jingyu
Thanks Romi,

I resized the dataset to 7 MB; however, the code throws the NullPointerException
as well.

Did you try to cache a DataFrame with just a single row?

Yes, I tried. But same problem.

Do your rows have any columns with null values?

No, I had filtered out null values before caching the dataframe.

Can you post a code snippet here on how you load/generate the dataframe?

Sure, Here is the working code 1:

JavaRDD pixels = pixelsStr.map(new PixelGenerator()).cache();

System.out.println(pixels.count()); // 3000-4000 rows

Working code 2:

JavaRDD pixels = pixelsStr.map(new PixelGenerator());

DataFrame schemaPixel = sqlContext.createDataFrame(pixels, PixelObject.class
);

DataFrame totalDF1 =
schemaPixel.select(schemaPixel.col("domain")).filter("'domain'
is not null").limit(500);

System.out.println(totalDF1.count());


BUT, after changing limit(500) to limit(1000), the code reports a
NullPointerException.


JavaRDD pixels = pixelsStr.map(new PixelGenerator());

DataFrame schemaPixel = sqlContext.createDataFrame(pixels, PixelObject.class
);

DataFrame totalDF =
schemaPixel.select(schemaPixel.col("domain")).filter("'domain'
is not null").limit(*1000*);

System.out.println(totalDF.count()); // problem at this line

15/10/29 18:56:28 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks
have all completed, from pool

15/10/29 18:56:28 INFO TaskSchedulerImpl: Cancelling stage 0

15/10/29 18:56:28 INFO DAGScheduler: ShuffleMapStage 0 (count at
X.java:113) failed in 3.764 s

15/10/29 18:56:28 INFO DAGScheduler: Job 0 failed: count at XXX.java:113,
took 3.862207 s

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage
0.0 (TID 0, localhost): java.lang.NullPointerException

at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
Does dataframe.rdd.cache work?

No, I tried but same exception.

Thanks,

Jingyu

On 29 October 2015 at 17:38, Romi Kuntsman <r...@totango.com> wrote:

> Did you try to cache a DataFrame with just a single row?
> Do you rows have any columns with null values?
> Can you post a code snippet here on how you load/generate the dataframe?
> Does dataframe.rdd.cache work?
>
> *Romi Kuntsman*, *Big Data Engineer*
> http://www.totango.com
>
> On Thu, Oct 29, 2015 at 4:33 AM, Zhang, Jingyu <jingyu.zh...@news.com.au>
> wrote:
>
>> It is not a problem to use JavaRDD.cache() for 200M data (all Objects
>> read form Json Format). But when I try to use DataFrame.cache(), It shown
>> exception in below.
>>
>> My machine can cache 1 G data in Avro format without any problem.
>>
>> 15/10/29 13:26:23 INFO GeneratePredicate: Code generated in 154.531827 ms
>>
>> 15/10/29 13:26:23 INFO GenerateUnsafeProjection: Code generated in
>> 27.832369 ms
>>
>> 15/10/29 13:26:23 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID
>> 1)
>>
>> java.lang.NullPointerException
>>
>> at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
>>
>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(
>> DelegatingMethodAccessorImpl.java:43)
>>
>> at java.lang.reflect.Method.invoke(Method.java:497)
>>
>> at
>> org.apache.spark.sql.SQLContext$$anonfun$9$$anonfun$apply$1$$anonfun$apply$2.apply(
>> SQLContext.scala:500)
>>
>> at
>> org.apache.spark.sql.SQLContext$$anonfun$9$$anonfun$apply$1$$anonfun$apply$2.apply(
>> SQLContext.scala:500)
>>
>> at scala.collection.TraversableLike$$anonfun$map$1.apply(
>> TraversableLike.scala:244)
>>
>> at scala.collection.TraversableLike$$anonfun$map$1.apply(
>> TraversableLike.scala:244)
>>
>> at scala.collection.IndexedSeqOptimized$class.foreach(
>> IndexedSeqOptimized.scala:33)
>>
>> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>>
>> at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>>
>> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
>>
>> at org.apache.spark.sql.SQLContext$$anonfun$9$$anonfun$apply$1.apply(
>> SQLContext.scala:500)
>>
>> at org.apache.spark.sql.SQLContext$$anonfun$9$$anonfun$apply$1.apply(
>> SQLContext.scala:498)
>>
>> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>>
>> at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:389)
>>
>> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>>
>> at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(
>> InMemoryColumnarTableScan.scala:127)
>>
>> at org.apache.spark.sql.columnar.InMemoryRelat

Re: NullPointerException when cache DataFrame in Java (Spark1.5.1)

2015-10-29 Thread Romi Kuntsman
>
> BUT, after change limit(500) to limit(1000). The code report
> NullPointerException.
>
I had a similar situation, and the problem was with a certain record.
Try to find which records are returned when you limit to 1000 but not
returned when you limit to 500.
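
One way to list those records (a sketch; df here stands for the DataFrame
before the limit, and DataFrame.except has been available since Spark 1.3):

// Rows present in the first 1000 but not in the first 500 -- one of these is
// probably the record that triggers the NPE. Note that limit without a sort is
// not deterministic, so run this against the same cached input if possible.
val suspects = df.limit(1000).except(df.limit(500))
suspects.show(500, false)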

Could it be a NPE thrown from PixelObject?
Are you running spark with master=local, so it's running inside your IDE
and you can see the errors from the driver and worker?


*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com

On Thu, Oct 29, 2015 at 10:04 AM, Zhang, Jingyu <jingyu.zh...@news.com.au>
wrote:

> Thanks Romi,
>
> I resize the dataset to 7MB, however, the code show NullPointerException
>  exception as well.
>
> Did you try to cache a DataFrame with just a single row?
>
> Yes, I tried. But, Same problem.
> .
> Do you rows have any columns with null values?
>
> No, I had filter out null values before cache the dataframe.
>
> Can you post a code snippet here on how you load/generate the dataframe?
>
> Sure, Here is the working code 1:
>
> JavaRDD pixels = pixelsStr.map(new PixelGenerator()).cache();
>
> System.out.println(pixels.count()); // 3000-4000 rows
>
> Working code 2:
>
> JavaRDD pixels = pixelsStr.map(new PixelGenerator());
>
> DataFrame schemaPixel = sqlContext.createDataFrame(pixels, PixelObject.
> class);
>
> DataFrame totalDF1 = 
> schemaPixel.select(schemaPixel.col("domain")).filter("'domain'
> is not null").limit(500);
>
> System.out.println(totalDF1.count());
>
>
> BUT, after change limit(500) to limit(1000). The code report
> NullPointerException.
>
>
> JavaRDD pixels = pixelsStr.map(new PixelGenerator());
>
> DataFrame schemaPixel = sqlContext.createDataFrame(pixels, PixelObject.
> class);
>
> DataFrame totalDF = 
> schemaPixel.select(schemaPixel.col("domain")).filter("'domain'
> is not null").limit(*1000*);
>
> System.out.println(totalDF.count()); // problem at this line
>
> 15/10/29 18:56:28 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks
> have all completed, from pool
>
> 15/10/29 18:56:28 INFO TaskSchedulerImpl: Cancelling stage 0
>
> 15/10/29 18:56:28 INFO DAGScheduler: ShuffleMapStage 0 (count at
> X.java:113) failed in 3.764 s
>
> 15/10/29 18:56:28 INFO DAGScheduler: Job 0 failed: count at XXX.java:113,
> took 3.862207 s
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
> in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage
> 0.0 (TID 0, localhost): java.lang.NullPointerException
>
> at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
> Does dataframe.rdd.cache work?
>
> No, I tried but same exception.
>
> Thanks,
>
> Jingyu
>
> On 29 October 2015 at 17:38, Romi Kuntsman <r...@totango.com> wrote:
>
>> Did you try to cache a DataFrame with just a single row?
>> Do you rows have any columns with null values?
>> Can you post a code snippet here on how you load/generate the dataframe?
>> Does dataframe.rdd.cache work?
>>
>> *Romi Kuntsman*, *Big Data Engineer*
>> http://www.totango.com
>>
>> On Thu, Oct 29, 2015 at 4:33 AM, Zhang, Jingyu <jingyu.zh...@news.com.au>
>> wrote:
>>
>>> It is not a problem to use JavaRDD.cache() for 200M data (all Objects
>>> read form Json Format). But when I try to use DataFrame.cache(), It shown
>>> exception in below.
>>>
>>> My machine can cache 1 G data in Avro format without any problem.
>>>
>>> 15/10/29 13:26:23 INFO GeneratePredicate: Code generated in 154.531827 ms
>>>
>>> 15/10/29 13:26:23 INFO GenerateUnsafeProjection: Code generated in
>>> 27.832369 ms
>>>
>>> 15/10/29 13:26:23 ERROR Executor: Exception in task 0.0 in stage 1.0
>>> (TID 1)
>>>
>>> java.lang.NullPointerException
>>>
>>> at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
>>>
>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(
>>> DelegatingMethodAccessorImpl.java:43)
>>>
>>> at java.lang.reflect.Method.invoke(Method.java:497)
>>>
>>> at
>>> org.apache.spark.sql.SQLContext$$anonfun$9$$anonfun$apply$1$$anonfun$apply$2.apply(
>>> SQLContext.scala:500)
>>>
>>> at
>>> org.apache.spark.sql.SQLContext$$anonfun$9$$anonfun$apply$1$$anonfun$apply$2.apply(
>>> SQLContext.scala:500)
>>>
>>> at scala.collection.TraversableLike$$anonfun$map$1.apply(
>>> TraversableLike.scala:244)
>>>
>>> at scala.collection.TraversableLike$$anonfun$map$1.apply(
>>> Tra

NullPointerException when cache DataFrame in Java (Spark1.5.1)

2015-10-28 Thread Zhang, Jingyu
It is not a problem to use JavaRDD.cache() for 200 MB of data (all objects read
from JSON format). But when I try to use DataFrame.cache(), it throws the
exception shown below.

My machine can cache 1 GB of data in Avro format without any problem.

15/10/29 13:26:23 INFO GeneratePredicate: Code generated in 154.531827 ms

15/10/29 13:26:23 INFO GenerateUnsafeProjection: Code generated in
27.832369 ms

15/10/29 13:26:23 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)

java.lang.NullPointerException

at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)

at sun.reflect.DelegatingMethodAccessorImpl.invoke(
DelegatingMethodAccessorImpl.java:43)

at java.lang.reflect.Method.invoke(Method.java:497)

at
org.apache.spark.sql.SQLContext$$anonfun$9$$anonfun$apply$1$$anonfun$apply$2.apply(
SQLContext.scala:500)

at
org.apache.spark.sql.SQLContext$$anonfun$9$$anonfun$apply$1$$anonfun$apply$2.apply(
SQLContext.scala:500)

at scala.collection.TraversableLike$$anonfun$map$1.apply(
TraversableLike.scala:244)

at scala.collection.TraversableLike$$anonfun$map$1.apply(
TraversableLike.scala:244)

at scala.collection.IndexedSeqOptimized$class.foreach(
IndexedSeqOptimized.scala:33)

at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)

at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)

at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)

at org.apache.spark.sql.SQLContext$$anonfun$9$$anonfun$apply$1.apply(
SQLContext.scala:500)

at org.apache.spark.sql.SQLContext$$anonfun$9$$anonfun$apply$1.apply(
SQLContext.scala:498)

at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)

at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:389)

at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)

at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(
InMemoryColumnarTableScan.scala:127)

at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(
InMemoryColumnarTableScan.scala:120)

at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:278)

at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)

at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)

at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)

at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)

at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)

at org.apache.spark.scheduler.Task.run(Task.scala:88)

at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)

at java.util.concurrent.ThreadPoolExecutor.runWorker(
ThreadPoolExecutor.java:1142)

at java.util.concurrent.ThreadPoolExecutor$Worker.run(
ThreadPoolExecutor.java:617)

at java.lang.Thread.run(Thread.java:745)

15/10/29 13:26:23 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1,
localhost): java.lang.NullPointerException

at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)


Thanks,


Jingyu
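
One thing worth double-checking, based on the snippet quoted earlier in this thread: the expression filter("'domain' is not null") compares the string literal 'domain' against null, which is always true, so rows with a null domain are never removed before the cache/count. A minimal sketch of a column-based null filter (shown in Scala; the Java DataFrame API has an equivalent col(...).isNotNull call), assuming the goal is to drop null domains before caching:

val totalDF = schemaPixel
  .select(schemaPixel.col("domain"))
  .filter(schemaPixel.col("domain").isNotNull)  // column check, not a string literal
  .limit(1000)

totalDF.cache()
println(totalDF.count())

Null fields inside the source objects are a common trigger for NullPointerExceptions in InMemoryColumnarTableScan, so this is only a hedge, not a confirmed root cause.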

-- 
This message and its attachments may contain legally privileged or 
confidential information. It is intended solely for the named addressee. If 
you are not the addressee indicated in this message or responsible for 
delivery of the message to the addressee, you may not copy or deliver this 
message or its attachments to anyone. Rather, you should permanently delete 
this message and its attachments and kindly notify the sender by reply 
e-mail. Any content of this message and its attachments which does not 
relate to the official business of the sending company must be taken not to 
have been sent or endorsed by that company or any of its related entities. 
No warranty is made that the e-mail or attachments are free from computer 
virus or other defect.


NullPointerException when adding to accumulator

2015-10-14 Thread Sela, Amit
I'm running a simple streaming application that reads from Kafka, maps the 
events and prints them, and I'm trying to use accumulators to count the number 
of mapped records.

While this works in standalone(IDE), when submitting to YARN I get 
NullPointerException on accumulator.add(1) or accumulator += 1

Anyone using accumulators in .map() with Spark 1.5 and YARN ?

Thanks,
Amit
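
For reference, a minimal sketch of the setup that is normally expected to work, with the accumulator created from the SparkContext on the driver before the map closure captures it (a socket source stands in for the Kafka stream, and all names here are illustrative assumptions):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MappedRecordCounter {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("MappedRecordCounter"))
    val ssc = new StreamingContext(sc, Seconds(5))

    // Created once on the driver, then captured by the map closure below.
    val mappedRecords = sc.accumulator(0L, "mapped records")

    // Stand-in for the Kafka input stream described in the report.
    val events = ssc.socketTextStream("localhost", 9999)

    val mapped = events.map { e =>
      mappedRecords += 1L   // runs on the executors; updates flow back to the driver
      e
    }
    mapped.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

If the accumulator instead lives in a field that is only initialized on the driver (for example a @transient member of a class that is also deserialized on the executors), the executor-side copy can be null, which would explain an NPE on YARN that does not show up in standalone/IDE runs.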





Re: yarn-cluster mode throwing NullPointerException

2015-10-12 Thread Venkatakrishnan Sowrirajan
Hi Rachana,


Are you by any chance setting something like this in your code?

"sparkConf.setMaster("yarn-cluster");"

Creating a SparkContext with the master hard-coded to "yarn-cluster" is not supported; yarn-cluster mode has to be requested through spark-submit.


I think you are hitting this bug:
https://issues.apache.org/jira/browse/SPARK-7504. It was fixed in
Spark 1.4.0, so you can try upgrading to 1.4.0.

Regards
Venkata krishnan
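
For reference, a minimal sketch of the spark-submit-driven setup that avoids hard-coding the master (shown in Scala with an illustrative object name; the same idea applies to the Java job above):

import org.apache.spark.{SparkConf, SparkContext}

object KafkaUrlStreamingLauncher {
  def main(args: Array[String]): Unit = {
    // No setMaster here: the master and deploy mode come from spark-submit,
    // e.g. spark-submit --master yarn-cluster --class ... on Spark 1.3,
    // or --master yarn --deploy-mode cluster on newer releases.
    val conf = new SparkConf().setAppName("KafkaURLStreaming")
    val sc = new SparkContext(conf)

    // ... build the streaming job on top of sc ...

    sc.stop()
  }
}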

On Sun, Oct 11, 2015 at 8:49 PM, Rachana Srivastava <
rachana.srivast...@markmonitor.com> wrote:

> I am trying to submit a job using yarn-cluster mode using spark-submit
> command.  My code works fine when I use yarn-client mode.
>
>
>
> *Cloudera Version:*
>
> CDH-5.4.7-1.cdh5.4.7.p0.3
>
>
>
> *Command Submitted:*
>
> spark-submit --class "com.markmonitor.antifraud.ce.KafkaURLStreaming"  \
>
> --driver-java-options
> "-Dlog4j.configuration=file:///etc/spark/myconf/log4j.sample.properties" \
>
> --conf
> "spark.driver.extraJavaOptions=-Dlog4j.configuration=file:///etc/spark/myconf/log4j.sample.properties"
> \
>
> --conf
> "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:///etc/spark/myconf/log4j.sample.properties"
> \
>
> --num-executors 2 \
>
> --executor-cores 2 \
>
> ../target/mm-XXX-ce-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
>
> yarn-cluster 10 "XXX:2181" "XXX:9092" groups kafkaurl 5 \
>
> "hdfs://ip-10-0-0-XXX.us-west-2.compute.internal:8020/user/ec2-user/urlFeature.properties"
> \
>
> "hdfs://ip-10-0-0-XXX.us-west-2.compute.internal:8020/user/ec2-user/urlFeatureContent.properties"
> \
>
> "hdfs://ip-10-0-0-XXX.us-west-2.compute.internal:8020/user/ec2-user/hdfsOutputNEWScript/OUTPUTYarn2"
> false
>
>
>
>
>
> *Log Details:*
>
> INFO : org.apache.spark.SparkContext - Running Spark version 1.3.0
>
> INFO : org.apache.spark.SecurityManager - Changing view acls to: ec2-user
>
> INFO : org.apache.spark.SecurityManager - Changing modify acls to: ec2-user
>
> INFO : org.apache.spark.SecurityManager - SecurityManager: authentication
> disabled; ui acls disabled; users with view permissions: Set(ec2-user);
> users with modify permissions: Set(ec2-user)
>
> INFO : akka.event.slf4j.Slf4jLogger - Slf4jLogger started
>
> INFO : Remoting - Starting remoting
>
> INFO : Remoting - Remoting started; listening on addresses
> :[akka.tcp://sparkdri...@ip-10-0-0-xxx.us-west-2.compute.internal:49579]
>
> INFO : Remoting - Remoting now listens on addresses:
> [akka.tcp://sparkdri...@ip-10-0-0-xxx.us-west-2.compute.internal:49579]
>
> INFO : org.apache.spark.util.Utils - Successfully started service
> 'sparkDriver' on port 49579.
>
> INFO : org.apache.spark.SparkEnv - Registering MapOutputTracker
>
> INFO : org.apache.spark.SparkEnv - Registering BlockManagerMaster
>
> INFO : org.apache.spark.storage.DiskBlockManager - Created local directory
> at
> /tmp/spark-1c805495-c7c4-471d-973f-b1ae0e2c8ff9/blockmgr-fff1946f-a716-40fc-a62d-bacba5b17638
>
> INFO : org.apache.spark.storage.MemoryStore - MemoryStore started with
> capacity 265.4 MB
>
> INFO : org.apache.spark.HttpFileServer - HTTP File server directory is
> /tmp/spark-8ed6f513-854f-4ee4-95ea-87185364eeaf/httpd-75cee1e7-af7a-4c82-a9ff-a124ce7ca7ae
>
> INFO : org.apache.spark.HttpServer - Starting HTTP Server
>
> INFO : org.spark-project.jetty.server.Server - jetty-8.y.z-SNAPSHOT
>
> INFO : org.spark-project.jetty.server.AbstractConnector - Started
> SocketConnector@0.0.0.0:46671
>
> INFO : org.apache.spark.util.Utils - Successfully started service 'HTTP
> file server' on port 46671.
>
> INFO : org.apache.spark.SparkEnv - Registering OutputCommitCoordinator
>
> INFO : org.spark-project.jetty.server.Server - jetty-8.y.z-SNAPSHOT
>
> INFO : org.spark-project.jetty.server.AbstractConnector - Started
> SelectChannelConnector@0.0.0.0:4040
>
> INFO : org.apache.spark.util.Utils - Successfully started service
> 'SparkUI' on port 4040.
>
> INFO : org.apache.spark.ui.SparkUI - Started SparkUI at
> http://ip-10-0-0-XXX.us-west-2.compute.internal:4040
>
> INFO : org.apache.spark.SparkContext - Added JAR
> file:/home/ec2-user/CE/correlationengine/scripts/../target/mm-anti-fraud-ce-0.0.1-SNAPSHOT-jar-with-dependencies.jar
> at
> http://10.0.0.XXX:46671/jars/mm-anti-fraud-ce-0.0.1-SNAPSHOT-jar-with-dependencies.jar
> with timestamp 1444620509463
>
> INFO : org.apache.spark.scheduler.cluster.YarnClusterScheduler - Created
> YarnClusterScheduler
>
> ERROR: org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend -
> Application ID is not set.
>
> INFO : org.apache.spark.network.netty.NettyBlockTransferService - Server
> created on 33880
>
> INFO : org.apache.spark.storage.BlockManagerMaster - Trying to register
> BlockManager
>
> INFO : org.apache.spark.storage.BlockManagerMasterActor - Registering
> block manager ip-10-0-0-XXX.us-west-2.compute.internal:33880 with 265.4 MB
> RAM, BlockManagerId(, ip-10-0-0-XXX.us-west-2.compute.internal,
> 33880)
>
> INFO : org.apache.spark.storage.BlockManagerMaster - Registered
> BlockManager
>
> INFO : org.apache.spark.scheduler.EventLoggingListener - Logging events to
> 

yarn-cluster mode throwing NullPointerException

2015-10-11 Thread Rachana Srivastava
I am trying to submit a job using yarn-cluster mode using spark-submit command. 
 My code works fine when I use yarn-client mode.

Cloudera Version:
CDH-5.4.7-1.cdh5.4.7.p0.3

Command Submitted:
spark-submit --class "com.markmonitor.antifraud.ce.KafkaURLStreaming"  \
--driver-java-options 
"-Dlog4j.configuration=file:///etc/spark/myconf/log4j.sample.properties" \
--conf 
"spark.driver.extraJavaOptions=-Dlog4j.configuration=file:///etc/spark/myconf/log4j.sample.properties"
 \
--conf 
"spark.executor.extraJavaOptions=-Dlog4j.configuration=file:///etc/spark/myconf/log4j.sample.properties"
 \
--num-executors 2 \
--executor-cores 2 \
../target/mm-XXX-ce-0.0.1-SNAPSHOT-jar-with-dependencies.jar \
yarn-cluster 10 "XXX:2181" "XXX:9092" groups kafkaurl 5 \
"hdfs://ip-10-0-0-XXX.us-west-2.compute.internal:8020/user/ec2-user/urlFeature.properties"
 \
"hdfs://ip-10-0-0-XXX.us-west-2.compute.internal:8020/user/ec2-user/urlFeatureContent.properties"
 \
"hdfs://ip-10-0-0-XXX.us-west-2.compute.internal:8020/user/ec2-user/hdfsOutputNEWScript/OUTPUTYarn2"
  false


Log Details:
INFO : org.apache.spark.SparkContext - Running Spark version 1.3.0
INFO : org.apache.spark.SecurityManager - Changing view acls to: ec2-user
INFO : org.apache.spark.SecurityManager - Changing modify acls to: ec2-user
INFO : org.apache.spark.SecurityManager - SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(ec2-user); users 
with modify permissions: Set(ec2-user)
INFO : akka.event.slf4j.Slf4jLogger - Slf4jLogger started
INFO : Remoting - Starting remoting
INFO : Remoting - Remoting started; listening on addresses 
:[akka.tcp://sparkdri...@ip-10-0-0-xxx.us-west-2.compute.internal:49579]
INFO : Remoting - Remoting now listens on addresses: 
[akka.tcp://sparkdri...@ip-10-0-0-xxx.us-west-2.compute.internal:49579]
INFO : org.apache.spark.util.Utils - Successfully started service 'sparkDriver' 
on port 49579.
INFO : org.apache.spark.SparkEnv - Registering MapOutputTracker
INFO : org.apache.spark.SparkEnv - Registering BlockManagerMaster
INFO : org.apache.spark.storage.DiskBlockManager - Created local directory at 
/tmp/spark-1c805495-c7c4-471d-973f-b1ae0e2c8ff9/blockmgr-fff1946f-a716-40fc-a62d-bacba5b17638
INFO : org.apache.spark.storage.MemoryStore - MemoryStore started with capacity 
265.4 MB
INFO : org.apache.spark.HttpFileServer - HTTP File server directory is 
/tmp/spark-8ed6f513-854f-4ee4-95ea-87185364eeaf/httpd-75cee1e7-af7a-4c82-a9ff-a124ce7ca7ae
INFO : org.apache.spark.HttpServer - Starting HTTP Server
INFO : org.spark-project.jetty.server.Server - jetty-8.y.z-SNAPSHOT
INFO : org.spark-project.jetty.server.AbstractConnector - Started 
SocketConnector@0.0.0.0:46671
INFO : org.apache.spark.util.Utils - Successfully started service 'HTTP file 
server' on port 46671.
INFO : org.apache.spark.SparkEnv - Registering OutputCommitCoordinator
INFO : org.spark-project.jetty.server.Server - jetty-8.y.z-SNAPSHOT
INFO : org.spark-project.jetty.server.AbstractConnector - Started 
SelectChannelConnector@0.0.0.0:4040
INFO : org.apache.spark.util.Utils - Successfully started service 'SparkUI' on 
port 4040.
INFO : org.apache.spark.ui.SparkUI - Started SparkUI at 
http://ip-10-0-0-XXX.us-west-2.compute.internal:4040
INFO : org.apache.spark.SparkContext - Added JAR 
file:/home/ec2-user/CE/correlationengine/scripts/../target/mm-anti-fraud-ce-0.0.1-SNAPSHOT-jar-with-dependencies.jar
 at 
http://10.0.0.XXX:46671/jars/mm-anti-fraud-ce-0.0.1-SNAPSHOT-jar-with-dependencies.jar
 with timestamp 1444620509463
INFO : org.apache.spark.scheduler.cluster.YarnClusterScheduler - Created 
YarnClusterScheduler
ERROR: org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend - 
Application ID is not set.
INFO : org.apache.spark.network.netty.NettyBlockTransferService - Server 
created on 33880
INFO : org.apache.spark.storage.BlockManagerMaster - Trying to register 
BlockManager
INFO : org.apache.spark.storage.BlockManagerMasterActor - Registering block 
manager ip-10-0-0-XXX.us-west-2.compute.internal:33880 with 265.4 MB RAM, 
BlockManagerId(, ip-10-0-0-XXX.us-west-2.compute.internal, 33880)
INFO : org.apache.spark.storage.BlockManagerMaster - Registered BlockManager
INFO : org.apache.spark.scheduler.EventLoggingListener - Logging events to 
hdfs://ip-10-0-0-XXX.us-west-2.compute.internal:8020/user/spark/applicationHistory/spark-application-1444620509497
Exception in thread "main" java.lang.NullPointerException
at 
org.apache.spark.deploy.yarn.ApplicationMaster$.sparkContextInitialized(ApplicationMaster.scala:580)
at 
org.apache.spark.scheduler.cluster.YarnClusterScheduler.postStartHook(YarnClusterScheduler.scala:32)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:541)
at 
org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:61)
at 
com.markmonitor.antifraud.ce.KafkaURLStreaming.main(KafkaURLStreaming.java:91)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 

Re: Why Checkpoint is throwing "actor.OneForOneStrategy: NullPointerException"

2015-09-25 Thread Uthayan Suthakar
Thank you Tathagata and Terry for your response. You were absolutely
correct: I created a dummy DStream (to prevent the Flume channel from filling
up) and counted the messages, but I never output (printed) it, which is why it
reported that error. Since I added a print() call, the error is no longer
being thrown.

Cheers,

Uthay
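
For reference, a minimal sketch of the shape of that fix, with a socket source standing in for the Flume stream and illustrative names; the point is simply that every DStream in the graph needs at least one output operation, otherwise it is never initialized and the checkpoint cleanup can hit the NPE above:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DummyStreamWithOutput {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("DummyStreamWithOutput")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("hdfs:///tmp/checkpoints")   // hypothetical path

    // Stand-in for the Flume stream that is only counted, never written out.
    val events = ssc.socketTextStream("localhost", 9999)
    val counted = events.count()

    // Without this output operation the DStream stays uninitialized.
    counted.print()

    ssc.start()
    ssc.awaitTermination()
  }
}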

On 25 September 2015 at 03:40, Terry Hoo  wrote:

> I met this before: in my program, some DStreams are not initialized since
> they are not in the path of  of output.
>
> You can  check if you are the same case.
>
>
> Thanks!
> - Terry
>
> On Fri, Sep 25, 2015 at 10:22 AM, Tathagata Das 
> wrote:
>
>> Are you by any chance setting DStream.remember() with null?
>>
>> On Thu, Sep 24, 2015 at 5:02 PM, Uthayan Suthakar <
>> uthayan.sutha...@gmail.com> wrote:
>>
>>> Hello all,
>>>
>>> My Stream job is throwing below exception at every interval. It is first
>>> deleting the the checkpoint file and then it's trying to checkpoint, is
>>> this normal behaviour? I'm using Spark 1.3.0. Do you know what may cause
>>> this issue?
>>>
>>> 15/09/24 16:35:55 INFO scheduler.TaskSetManager: Finished task 1.0 in
>>> stage 84.0 (TID 799) in 12 ms on itrac1511.cern.ch (1/8)
>>> *15/09/24 16:35:55 INFO streaming.CheckpointWriter:
>>> Deleting 
>>> hdfs://p01001532067275/user/wdtmon/wdt-dstream-6/checkpoint-144310422*
>>> *15/09/24 16:35:55 INFO streaming.CheckpointWriter: Checkpoint for time
>>> 144310422 ms saved to file
>>> 'hdfs://p01001532067275/user/wdtmon/wdt-dstream-6/*
>>> checkpoint-144310422', took 10696 bytes and 108 ms
>>> 15/09/24 16:35:55 INFO streaming.DStreamGraph: Clearing checkpoint data
>>> for time 144310422 ms
>>> 15/09/24 16:35:55 INFO streaming.DStreamGraph: Cleared checkpoint data
>>> for time 144310422 ms
>>> 15/09/24 16:35:55 ERROR actor.OneForOneStrategy:
>>> java.lang.NullPointerException
>>> at
>>> org.apache.spark.streaming.DStreamGraph$$anonfun$getMaxInputStreamRememberDuration$2.apply(DStreamGraph.scala:168)
>>> at
>>> org.apache.spark.streaming.DStreamGraph$$anonfun$getMaxInputStreamRememberDuration$2.apply(DStreamGraph.scala:168)
>>> at
>>> scala.collection.TraversableOnce$$anonfun$maxBy$1.apply(TraversableOnce.scala:225)
>>> at
>>> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
>>> at
>>> scala.collection.IndexedSeqOptimized$class.reduceLeft(IndexedSeqOptimized.scala:68)
>>> at
>>> scala.collection.mutable.ArrayBuffer.reduceLeft(ArrayBuffer.scala:47)
>>> at
>>> scala.collection.TraversableOnce$class.maxBy(TraversableOnce.scala:225)
>>> at
>>> scala.collection.AbstractTraversable.maxBy(Traversable.scala:105)
>>> at
>>> org.apache.spark.streaming.DStreamGraph.getMaxInputStreamRememberDuration(DStreamGraph.scala:168)
>>> at
>>> org.apache.spark.streaming.scheduler.JobGenerator.clearCheckpointData(JobGenerator.scala:279)
>>> at org.apache.spark.streaming.scheduler.JobGenerator.org
>>> $apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:181)
>>> at
>>> org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$start$1$$anon$1$$anonfun$receive$1.applyOrElse(JobGenerator.scala:86)
>>> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>>> at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>>> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>>> at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>>> at
>>> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>>> at
>>> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>>> at
>>> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>>> at
>>> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>>> at
>>> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>>
>>>
>>> Cheers,
>>>
>>> Uthay
>>>
>>>
>>
>


Why Checkpoint is throwing "actor.OneForOneStrategy: NullPointerException"

2015-09-24 Thread Uthayan Suthakar
Hello all,

My streaming job is throwing the exception below at every interval. It first
deletes the checkpoint file and then tries to checkpoint again. Is
this normal behaviour? I'm using Spark 1.3.0. Do you know what may cause
this issue?

15/09/24 16:35:55 INFO scheduler.TaskSetManager: Finished task 1.0 in stage
84.0 (TID 799) in 12 ms on itrac1511.cern.ch (1/8)
*15/09/24 16:35:55 INFO streaming.CheckpointWriter:
Deleting 
hdfs://p01001532067275/user/wdtmon/wdt-dstream-6/checkpoint-144310422*
*15/09/24 16:35:55 INFO streaming.CheckpointWriter: Checkpoint for time
144310422 ms saved to file
'hdfs://p01001532067275/user/wdtmon/wdt-dstream-6/*
checkpoint-144310422', took 10696 bytes and 108 ms
15/09/24 16:35:55 INFO streaming.DStreamGraph: Clearing checkpoint data for
time 144310422 ms
15/09/24 16:35:55 INFO streaming.DStreamGraph: Cleared checkpoint data for
time 144310422 ms
15/09/24 16:35:55 ERROR actor.OneForOneStrategy:
java.lang.NullPointerException
at
org.apache.spark.streaming.DStreamGraph$$anonfun$getMaxInputStreamRememberDuration$2.apply(DStreamGraph.scala:168)
at
org.apache.spark.streaming.DStreamGraph$$anonfun$getMaxInputStreamRememberDuration$2.apply(DStreamGraph.scala:168)
at
scala.collection.TraversableOnce$$anonfun$maxBy$1.apply(TraversableOnce.scala:225)
at
scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
at
scala.collection.IndexedSeqOptimized$class.reduceLeft(IndexedSeqOptimized.scala:68)
at
scala.collection.mutable.ArrayBuffer.reduceLeft(ArrayBuffer.scala:47)
at
scala.collection.TraversableOnce$class.maxBy(TraversableOnce.scala:225)
at scala.collection.AbstractTraversable.maxBy(Traversable.scala:105)
at
org.apache.spark.streaming.DStreamGraph.getMaxInputStreamRememberDuration(DStreamGraph.scala:168)
at
org.apache.spark.streaming.scheduler.JobGenerator.clearCheckpointData(JobGenerator.scala:279)
at org.apache.spark.streaming.scheduler.JobGenerator.org
$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:181)
at
org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$start$1$$anon$1$$anonfun$receive$1.applyOrElse(JobGenerator.scala:86)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at
scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)


Cheers,

Uthay


Re: Why Checkpoint is throwing "actor.OneForOneStrategy: NullPointerException"

2015-09-24 Thread Tathagata Das
Are you by any chance setting DStream.remember() with null?

On Thu, Sep 24, 2015 at 5:02 PM, Uthayan Suthakar <
uthayan.sutha...@gmail.com> wrote:

> Hello all,
>
> My Stream job is throwing below exception at every interval. It is first
> deleting the the checkpoint file and then it's trying to checkpoint, is
> this normal behaviour? I'm using Spark 1.3.0. Do you know what may cause
> this issue?
>
> 15/09/24 16:35:55 INFO scheduler.TaskSetManager: Finished task 1.0 in
> stage 84.0 (TID 799) in 12 ms on itrac1511.cern.ch (1/8)
> *15/09/24 16:35:55 INFO streaming.CheckpointWriter:
> Deleting 
> hdfs://p01001532067275/user/wdtmon/wdt-dstream-6/checkpoint-144310422*
> *15/09/24 16:35:55 INFO streaming.CheckpointWriter: Checkpoint for time
> 144310422 ms saved to file
> 'hdfs://p01001532067275/user/wdtmon/wdt-dstream-6/*
> checkpoint-144310422', took 10696 bytes and 108 ms
> 15/09/24 16:35:55 INFO streaming.DStreamGraph: Clearing checkpoint data
> for time 144310422 ms
> 15/09/24 16:35:55 INFO streaming.DStreamGraph: Cleared checkpoint data for
> time 144310422 ms
> 15/09/24 16:35:55 ERROR actor.OneForOneStrategy:
> java.lang.NullPointerException
> at
> org.apache.spark.streaming.DStreamGraph$$anonfun$getMaxInputStreamRememberDuration$2.apply(DStreamGraph.scala:168)
> at
> org.apache.spark.streaming.DStreamGraph$$anonfun$getMaxInputStreamRememberDuration$2.apply(DStreamGraph.scala:168)
> at
> scala.collection.TraversableOnce$$anonfun$maxBy$1.apply(TraversableOnce.scala:225)
> at
> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
> at
> scala.collection.IndexedSeqOptimized$class.reduceLeft(IndexedSeqOptimized.scala:68)
> at
> scala.collection.mutable.ArrayBuffer.reduceLeft(ArrayBuffer.scala:47)
> at
> scala.collection.TraversableOnce$class.maxBy(TraversableOnce.scala:225)
> at
> scala.collection.AbstractTraversable.maxBy(Traversable.scala:105)
> at
> org.apache.spark.streaming.DStreamGraph.getMaxInputStreamRememberDuration(DStreamGraph.scala:168)
> at
> org.apache.spark.streaming.scheduler.JobGenerator.clearCheckpointData(JobGenerator.scala:279)
> at org.apache.spark.streaming.scheduler.JobGenerator.org
> $apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:181)
> at
> org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$start$1$$anon$1$$anonfun$receive$1.applyOrElse(JobGenerator.scala:86)
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
> at akka.actor.ActorCell.invoke(ActorCell.scala:456)
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
> at akka.dispatch.Mailbox.run(Mailbox.scala:219)
> at
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
> at
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> at
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>
>
> Cheers,
>
> Uthay
>
>


Re: Why Checkpoint is throwing "actor.OneForOneStrategy: NullPointerException"

2015-09-24 Thread Terry Hoo
I met this before: in my program, some DStreams were not initialized since
they were not on the path of any output operation.

You can check whether you are hitting the same case.


Thanks!
- Terry

On Fri, Sep 25, 2015 at 10:22 AM, Tathagata Das  wrote:

> Are you by any chance setting DStream.remember() with null?
>
> On Thu, Sep 24, 2015 at 5:02 PM, Uthayan Suthakar <
> uthayan.sutha...@gmail.com> wrote:
>
>> Hello all,
>>
>> My Stream job is throwing below exception at every interval. It is first
>> deleting the the checkpoint file and then it's trying to checkpoint, is
>> this normal behaviour? I'm using Spark 1.3.0. Do you know what may cause
>> this issue?
>>
>> 15/09/24 16:35:55 INFO scheduler.TaskSetManager: Finished task 1.0 in
>> stage 84.0 (TID 799) in 12 ms on itrac1511.cern.ch (1/8)
>> *15/09/24 16:35:55 INFO streaming.CheckpointWriter:
>> Deleting 
>> hdfs://p01001532067275/user/wdtmon/wdt-dstream-6/checkpoint-144310422*
>> *15/09/24 16:35:55 INFO streaming.CheckpointWriter: Checkpoint for time
>> 144310422 ms saved to file
>> 'hdfs://p01001532067275/user/wdtmon/wdt-dstream-6/*
>> checkpoint-144310422', took 10696 bytes and 108 ms
>> 15/09/24 16:35:55 INFO streaming.DStreamGraph: Clearing checkpoint data
>> for time 144310422 ms
>> 15/09/24 16:35:55 INFO streaming.DStreamGraph: Cleared checkpoint data
>> for time 144310422 ms
>> 15/09/24 16:35:55 ERROR actor.OneForOneStrategy:
>> java.lang.NullPointerException
>> at
>> org.apache.spark.streaming.DStreamGraph$$anonfun$getMaxInputStreamRememberDuration$2.apply(DStreamGraph.scala:168)
>> at
>> org.apache.spark.streaming.DStreamGraph$$anonfun$getMaxInputStreamRememberDuration$2.apply(DStreamGraph.scala:168)
>> at
>> scala.collection.TraversableOnce$$anonfun$maxBy$1.apply(TraversableOnce.scala:225)
>> at
>> scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
>> at
>> scala.collection.IndexedSeqOptimized$class.reduceLeft(IndexedSeqOptimized.scala:68)
>> at
>> scala.collection.mutable.ArrayBuffer.reduceLeft(ArrayBuffer.scala:47)
>> at
>> scala.collection.TraversableOnce$class.maxBy(TraversableOnce.scala:225)
>> at
>> scala.collection.AbstractTraversable.maxBy(Traversable.scala:105)
>> at
>> org.apache.spark.streaming.DStreamGraph.getMaxInputStreamRememberDuration(DStreamGraph.scala:168)
>> at
>> org.apache.spark.streaming.scheduler.JobGenerator.clearCheckpointData(JobGenerator.scala:279)
>> at org.apache.spark.streaming.scheduler.JobGenerator.org
>> $apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:181)
>> at
>> org.apache.spark.streaming.scheduler.JobGenerator$$anonfun$start$1$$anon$1$$anonfun$receive$1.applyOrElse(JobGenerator.scala:86)
>> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>> at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>> at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>> at
>> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>> at
>> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>> at
>> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>> at
>> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>> at
>> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>
>>
>> Cheers,
>>
>> Uthay
>>
>>
>


Re: NullPointerException inside RDD when calling sc.textFile

2015-07-23 Thread Akhil Das
Did you try:

val data = indexed_files.groupByKey

val modified_data = data.map { a =>
  var name = a._2.mkString(",")
  (a._1, name)
}

modified_data.foreach { a =>
  var file = sc.textFile(a._2)
  println(file.count)
}


Thanks
Best Regards
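
Note that the snippet above still calls sc.textFile inside an RDD closure; the closure runs on the executors, where the SparkContext is not usable, and that alone can produce this kind of NullPointerException. A sketch of a driver-side variant, assuming the number of distinct filename prefixes is small enough to collect:

val files = sc.wholeTextFiles("*.csv")
val indexed_files = files.map(a => (a._1.split("_")(0), a._1))

// Bring the (prefix, "file1,file2,...") pairs back to the driver, then build
// one RDD per prefix there. sc.textFile accepts comma-separated paths.
val grouped = indexed_files.groupByKey()
  .mapValues(_.mkString(","))
  .collect()

grouped.foreach { case (prefix, fileList) =>
  val rdd = sc.textFile(fileList)
  println(s"$prefix -> ${rdd.count()}")
}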

On Wed, Jul 22, 2015 at 2:18 AM, MorEru hsb.sh...@gmail.com wrote:

 I have a number of CSV files and need to combine them into a RDD by part of
 their filenames.

 For example, for the below files
 $ ls
 20140101_1.csv  20140101_3.csv  20140201_2.csv  20140301_1.csv
 20140301_3.csv 20140101_2.csv  20140201_1.csv  20140201_3.csv

 I need to combine files with names 20140101*.csv into a RDD to work on it
 and so on.

 I am using sc.wholeTextFiles to read the entire directory and then grouping
 the filenames by their patters to form a string of filenames. I am then
 passing the string to sc.textFile to open the files as a single RDD.

 This is the code I have -

 val files = sc.wholeTextFiles("*.csv")
 val indexed_files = files.map(a => (a._1.split("_")(0), a._1))
 val data = indexed_files.groupByKey

 data.map { a =>
   var name = a._2.mkString(",")
   (a._1, name)
 }

 data.foreach { a =>
   var file = sc.textFile(a._2)
   println(file.count)
 }

 And I get SparkException - NullPointerException when I try to call
 textFile.
 The error stack refers to an Iterator inside the RDD. I am not able to
 understand the error -

 15/07/21 15:37:37 INFO TaskSchedulerImpl: Removed TaskSet 65.0, whose tasks
 have all completed, from pool
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 1
 in
 stage 65.0 failed 4 times, most recent failure: Lost task 1.3 in stage 65.0
 (TID 115, 10.132.8.10): java.lang.NullPointerException
 at
 $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(console:33)
 at
 $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(console:32)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at

 org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:870)
 at

 org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:870)
 at

 org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1765)
 at

 org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1765)

 However, when I do sc.textFile(data.first._2).count in the spark shell, I
 am
 able to form the RDD and able to retrieve the count.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-inside-RDD-when-calling-sc-textFile-tp23943.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




NullPointerException inside RDD when calling sc.textFile

2015-07-21 Thread MorEru
I have a number of CSV files and need to combine them into an RDD based on part
of their filenames.

For example, for the below files
$ ls   
20140101_1.csv  20140101_3.csv  20140201_2.csv  20140301_1.csv 
20140301_3.csv 20140101_2.csv  20140201_1.csv  20140201_3.csv 

I need to combine files with names 20140101*.csv into a RDD to work on it
and so on.

I am using sc.wholeTextFiles to read the entire directory and then grouping
the filenames by their patterns to form a string of filenames. I am then
passing the string to sc.textFile to open the files as a single RDD.

This is the code I have -

val files = sc.wholeTextFiles("*.csv")
val indexed_files = files.map(a => (a._1.split("_")(0), a._1))
val data = indexed_files.groupByKey

data.map { a =>
  var name = a._2.mkString(",")
  (a._1, name)
}

data.foreach { a =>
  var file = sc.textFile(a._2)
  println(file.count)
}

And I get SparkException - NullPointerException when I try to call textFile.
The error stack refers to an Iterator inside the RDD. I am not able to
understand the error -

15/07/21 15:37:37 INFO TaskSchedulerImpl: Removed TaskSet 65.0, whose tasks
have all completed, from pool
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in
stage 65.0 failed 4 times, most recent failure: Lost task 1.3 in stage 65.0
(TID 115, 10.132.8.10): java.lang.NullPointerException
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(console:33)
at
$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(console:32)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at
org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:870)
at
org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:870)
at
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1765)
at
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1765)

However, when I do sc.textFile(data.first._2).count in the spark shell, I am
able to form the RDD and able to retrieve the count.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-inside-RDD-when-calling-sc-textFile-tp23943.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: NullPointerException with functions.rand()

2015-06-12 Thread Ted Yu
Created PR and verified the example given by Justin works with the change:
https://github.com/apache/spark/pull/6793

Cheers
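
Until a build containing that fix is available, one workaround grounded in the original report is to go through sc.parallelize(...).toDF, which the report shows working with rand() (assuming a SQLContext named sqlContext with its implicits imported, as in the shell):

import org.apache.spark.sql.functions.rand
import sqlContext.implicits._

val df = sc.parallelize(Seq((1, 2), (3, 100))).toDF("_1", "_2")
df.withColumn("index", rand(30)).show()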

On Wed, Jun 10, 2015 at 7:15 PM, Ted Yu yuzhih...@gmail.com wrote:

 Looks like the NPE came from this line:
   @transient protected lazy val rng = new XORShiftRandom(seed +
 TaskContext.get().partitionId())

 Could TaskContext.get() be null ?

 On Wed, Jun 10, 2015 at 6:15 PM, Justin Yip yipjus...@prediction.io
 wrote:

 Hello,

 I am using 1.4.0 and found the following weird behavior.

 This case works fine:

 scala> sc.parallelize(Seq((1,2), (3, 100))).toDF.withColumn("index",
 rand(30)).show()
 +--+---+---+
 |_1| _2|  index|
 +--+---+---+
 | 1|  2| 0.6662967911724369|
 | 3|100|0.35734504984676396|
 +--+---+---+

 However, when I use sqlContext.createDataFrame instead, I get a NPE:

 scala> sqlContext.createDataFrame(Seq((1,2), (3,
 100))).withColumn("index", rand(30)).show()
 java.lang.NullPointerException
 at
 org.apache.spark.sql.catalyst.expressions.RDG.rng$lzycompute(random.scala:39)
 at org.apache.spark.sql.catalyst.expressions.RDG.rng(random.scala:39)
 ..


 Does any one know why?

 Thanks.

 Justin

 --
 View this message in context: NullPointerException with functions.rand()
 http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-with-functions-rand-tp23267.html
 Sent from the Apache Spark User List mailing list archive
 http://apache-spark-user-list.1001560.n3.nabble.com/ at Nabble.com.





Re: hiveContext.sql NullPointerException

2015-06-11 Thread patcharee

Hi,

Does 
df.write.partitionBy(partitions).format(format).mode("overwrite").saveAsTable(tbl) 
support the ORC format?


I tried df.write.partitionBy("zone", "z", "year", 
"month").format("orc").mode("overwrite").saveAsTable("tbl"), but after 
the insert my table tbl schema has been changed to something I did not 
expect:


-- FROM --
CREATE EXTERNAL TABLE `4dim`(`u` float,   `v` float)
PARTITIONED BY (`zone` int, `z` int, `year` int, `month` int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
TBLPROPERTIES (
  'orc.compress'='ZLIB',
  'transient_lastDdlTime'='1433016475')

-- TO --
CREATE TABLE `4dim`(`col` array<string> COMMENT 'from deserializer')
PARTITIONED BY (`zone` int COMMENT '', `z` int COMMENT '', `year` int 
COMMENT '', `month` int COMMENT '')
ROW FORMAT SERDE 
'org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe'

STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'
TBLPROPERTIES (
  'EXTERNAL'='FALSE',
  'spark.sql.sources.provider'='orc',
  'spark.sql.sources.schema.numParts'='1',
'spark.sql.sources.schema.part.0'='{\"type\":\"struct\",\"fields\":[{\"name\":\"u\",\"type\":\"float\",\"nullable\":true,\"metadata\":{}},{\"name\":\"v\",\"type\":\"float\",\"nullable\":true,\"metadata\":{}},{\"name\":\"zone\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"z\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"year\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"month\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}}]}', 


  'transient_lastDdlTime'='1434055247')


I noticed there are files stored in hdfs as *.orc, but when I tried to 
query from hive I got nothing. How can I fix this? Any suggestions please


BR,
Patcharee
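
One possible direction, offered as an untested sketch rather than a confirmed fix: saveAsTable re-registers the table as a Spark SQL data source table, which is why Hive now only sees the array<string> placeholder column plus the spark.sql.sources.* properties, while insertInto writes into the existing Hive table definition. Assuming dynamic partitioning is enabled and the partition columns are listed last:

hiveContext.setConf("hive.exec.dynamic.partition", "true")
hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

df.select("u", "v", "zone", "z", "year", "month")   // partition columns last
  .write
  .insertInto("4dim")

With insertInto the data should land in the ORC table created with the original CREATE EXTERNAL TABLE statement, so Hive keeps its schema.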


On 07. juni 2015 16:40, Cheng Lian wrote:
Spark SQL supports Hive dynamic partitioning, so one possible 
workaround is to create a Hive table partitioned by zone, z, year, and 
month dynamically, and then insert the whole dataset into it directly.


In 1.4, we also provides dynamic partitioning support for non-Hive 
environment, and you can do something like this:


df.write.partitionBy(zone, z, year, 
month).format(parquet).mode(overwrite).saveAsTable(tbl)


Cheng

On 6/7/15 9:48 PM, patcharee wrote:

Hi,

How can I expect to work on HiveContext on the executor? If only the 
driver can see HiveContext, does it mean I have to collect all 
datasets (very large) to the driver and use HiveContext there? It 
will be memory overload on the driver and fail.


BR,
Patcharee

On 07. juni 2015 11:51, Cheng Lian wrote:

Hi,

This is expected behavior. HiveContext.sql (and also 
DataFrame.registerTempTable) is only expected to be invoked on 
driver side. However, the closure passed to RDD.foreach is executed 
on executor side, where no viable HiveContext instance exists.


Cheng

On 6/7/15 10:06 AM, patcharee wrote:

Hi,

I try to insert data into a partitioned hive table. The groupByKey 
is to combine dataset into a partition of the hive table. After the 
groupByKey, I converted the iterable[X] to DB by X.toList.toDF(). 
But the hiveContext.sql  throws NullPointerException, see below. 
Any suggestions? What could be wrong? Thanks!


val varWHeightFlatRDD = 
varWHeightRDD.flatMap(FlatMapUtilClass().flatKeyFromWrf).groupByKey()

  .foreach(
x = {
  val zone = x._1._1
  val z = x._1._2
  val year = x._1._3
  val month = x._1._4
  val df_table_4dim = x._2.toList.toDF()
  df_table_4dim.registerTempTable(table_4Dim)
  hiveContext.sql(INSERT OVERWRITE table 4dim partition 
(zone= + zone + ,z= + z + ,year= + year + ,month= + month + 
)  +
select date, hh, x, y, height, u, v, w, ph, phb, t, p, 
pb, qvapor, qgraup, qnice, qnrain, tke_pbl, el_pbl from table_4Dim);


})


java.lang.NullPointerException
at 
org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:100)
at 
no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$7.apply(LoadWrfIntoHiveOptReduce1.scala:113)
at 
no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$7.apply(LoadWrfIntoHiveOptReduce1.scala:103)

at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at 
org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at 
org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:798)
at 
org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:798)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1511)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1511)
at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)

at org.apache.spark.scheduler.Task.run(Task.scala:64)
at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203

NullPointerException with functions.rand()

2015-06-10 Thread Justin Yip
Hello,

I am using 1.4.0 and found the following weird behavior.

This case works fine:

scala> sc.parallelize(Seq((1,2), (3, 100))).toDF.withColumn("index",
rand(30)).show()
+--+---+---+
|_1| _2|  index|
+--+---+---+
| 1|  2| 0.6662967911724369|
| 3|100|0.35734504984676396|
+--+---+---+

However, when I use sqlContext.createDataFrame instead, I get a NPE:

scala> sqlContext.createDataFrame(Seq((1,2), (3, 100))).withColumn("index",
rand(30)).show()
java.lang.NullPointerException
at
org.apache.spark.sql.catalyst.expressions.RDG.rng$lzycompute(random.scala:39)
at org.apache.spark.sql.catalyst.expressions.RDG.rng(random.scala:39)
..


Does any one know why?

Thanks.

Justin




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-with-functions-rand-tp23267.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: hiveContext.sql NullPointerException

2015-06-08 Thread patcharee

Hi,

Thanks for your guidelines. I will try it out.

Btw, how do you know HiveContext.sql (and also 
DataFrame.registerTempTable) is only expected to be invoked on the driver 
side? Where can I find documentation for this?


BR,
Patcharee


On 07. juni 2015 16:40, Cheng Lian wrote:
Spark SQL supports Hive dynamic partitioning, so one possible 
workaround is to create a Hive table partitioned by zone, z, year, and 
month dynamically, and then insert the whole dataset into it directly.


In 1.4, we also provides dynamic partitioning support for non-Hive 
environment, and you can do something like this:


df.write.partitionBy(zone, z, year, 
month).format(parquet).mode(overwrite).saveAsTable(tbl)


Cheng

On 6/7/15 9:48 PM, patcharee wrote:

Hi,

How can I expect to work on HiveContext on the executor? If only the 
driver can see HiveContext, does it mean I have to collect all 
datasets (very large) to the driver and use HiveContext there? It 
will be memory overload on the driver and fail.


BR,
Patcharee

On 07. juni 2015 11:51, Cheng Lian wrote:

Hi,

This is expected behavior. HiveContext.sql (and also 
DataFrame.registerTempTable) is only expected to be invoked on 
driver side. However, the closure passed to RDD.foreach is executed 
on executor side, where no viable HiveContext instance exists.


Cheng

On 6/7/15 10:06 AM, patcharee wrote:

Hi,

I try to insert data into a partitioned hive table. The groupByKey 
is to combine dataset into a partition of the hive table. After the 
groupByKey, I converted the iterable[X] to DB by X.toList.toDF(). 
But the hiveContext.sql  throws NullPointerException, see below. 
Any suggestions? What could be wrong? Thanks!


val varWHeightFlatRDD = 
varWHeightRDD.flatMap(FlatMapUtilClass().flatKeyFromWrf).groupByKey()

  .foreach(
x = {
  val zone = x._1._1
  val z = x._1._2
  val year = x._1._3
  val month = x._1._4
  val df_table_4dim = x._2.toList.toDF()
  df_table_4dim.registerTempTable(table_4Dim)
  hiveContext.sql(INSERT OVERWRITE table 4dim partition 
(zone= + zone + ,z= + z + ,year= + year + ,month= + month + 
)  +
select date, hh, x, y, height, u, v, w, ph, phb, t, p, 
pb, qvapor, qgraup, qnice, qnrain, tke_pbl, el_pbl from table_4Dim);


})


java.lang.NullPointerException
at 
org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:100)
at 
no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$7.apply(LoadWrfIntoHiveOptReduce1.scala:113)
at 
no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$7.apply(LoadWrfIntoHiveOptReduce1.scala:103)

at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at 
org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at 
org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:798)
at 
org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:798)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1511)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1511)
at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)

at org.apache.spark.scheduler.Task.run(Task.scala:64)
at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

at java.lang.Thread.run(Thread.java:744)

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org












-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: hiveContext.sql NullPointerException

2015-06-08 Thread Cheng Lian



On 6/8/15 4:02 PM, patcharee wrote:

Hi,

Thanks for your guidelines. I will try it out.

Btw how do you know HiveContext.sql (and also 
DataFrame.registerTempTable) is only expected to be invoked on driver 
side? Where can I find document?
I'm afraid we don't state this explicitly in the SQL programming guide 
for now. This should be mentioned though.


BR,
Patcharee


On 07. juni 2015 16:40, Cheng Lian wrote:
Spark SQL supports Hive dynamic partitioning, so one possible 
workaround is to create a Hive table partitioned by zone, z, year, 
and month dynamically, and then insert the whole dataset into it 
directly.


In 1.4, we also provides dynamic partitioning support for non-Hive 
environment, and you can do something like this:


df.write.partitionBy(zone, z, year, 
month).format(parquet).mode(overwrite).saveAsTable(tbl)


Cheng

On 6/7/15 9:48 PM, patcharee wrote:

Hi,

How can I expect to work on HiveContext on the executor? If only the 
driver can see HiveContext, does it mean I have to collect all 
datasets (very large) to the driver and use HiveContext there? It 
will be memory overload on the driver and fail.


BR,
Patcharee

On 07. juni 2015 11:51, Cheng Lian wrote:

Hi,

This is expected behavior. HiveContext.sql (and also 
DataFrame.registerTempTable) is only expected to be invoked on 
driver side. However, the closure passed to RDD.foreach is executed 
on executor side, where no viable HiveContext instance exists.


Cheng

On 6/7/15 10:06 AM, patcharee wrote:

Hi,

I try to insert data into a partitioned hive table. The groupByKey 
is to combine dataset into a partition of the hive table. After 
the groupByKey, I converted the iterable[X] to DB by 
X.toList.toDF(). But the hiveContext.sql  throws 
NullPointerException, see below. Any suggestions? What could be 
wrong? Thanks!


val varWHeightFlatRDD = 
varWHeightRDD.flatMap(FlatMapUtilClass().flatKeyFromWrf).groupByKey()

  .foreach(
x = {
  val zone = x._1._1
  val z = x._1._2
  val year = x._1._3
  val month = x._1._4
  val df_table_4dim = x._2.toList.toDF()
  df_table_4dim.registerTempTable(table_4Dim)
  hiveContext.sql(INSERT OVERWRITE table 4dim partition 
(zone= + zone + ,z= + z + ,year= + year + ,month= + month + 
)  +
select date, hh, x, y, height, u, v, w, ph, phb, t, 
p, pb, qvapor, qgraup, qnice, qnrain, tke_pbl, el_pbl from 
table_4Dim);


})


java.lang.NullPointerException
at 
org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:100)
at 
no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$7.apply(LoadWrfIntoHiveOptReduce1.scala:113)
at 
no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$7.apply(LoadWrfIntoHiveOptReduce1.scala:103)

at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at 
org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at 
org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:798)
at 
org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:798)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1511)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1511)
at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)

at org.apache.spark.scheduler.Task.run(Task.scala:64)
at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

at java.lang.Thread.run(Thread.java:744)

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org















-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: hiveContext.sql NullPointerException

2015-06-07 Thread patcharee

Hi,

How can I expect HiveContext to work on the executor? If only the 
driver can see HiveContext, does it mean I have to collect all datasets 
(very large) to the driver and use HiveContext there? That would overload 
the driver's memory and fail.


BR,
Patcharee

On 07. juni 2015 11:51, Cheng Lian wrote:

Hi,

This is expected behavior. HiveContext.sql (and also 
DataFrame.registerTempTable) is only expected to be invoked on driver 
side. However, the closure passed to RDD.foreach is executed on 
executor side, where no viable HiveContext instance exists.


Cheng

On 6/7/15 10:06 AM, patcharee wrote:

Hi,

I try to insert data into a partitioned hive table. The groupByKey is 
to combine dataset into a partition of the hive table. After the 
groupByKey, I converted the iterable[X] to DB by X.toList.toDF(). But 
the hiveContext.sql  throws NullPointerException, see below. Any 
suggestions? What could be wrong? Thanks!


val varWHeightFlatRDD = 
varWHeightRDD.flatMap(FlatMapUtilClass().flatKeyFromWrf).groupByKey()

  .foreach(
x = {
  val zone = x._1._1
  val z = x._1._2
  val year = x._1._3
  val month = x._1._4
  val df_table_4dim = x._2.toList.toDF()
  df_table_4dim.registerTempTable(table_4Dim)
  hiveContext.sql(INSERT OVERWRITE table 4dim partition 
(zone= + zone + ,z= + z + ,year= + year + ,month= + month + ) 
 +
select date, hh, x, y, height, u, v, w, ph, phb, t, p, 
pb, qvapor, qgraup, qnice, qnrain, tke_pbl, el_pbl from table_4Dim);


})


java.lang.NullPointerException
at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:100)
at 
no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$7.apply(LoadWrfIntoHiveOptReduce1.scala:113)
at 
no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$7.apply(LoadWrfIntoHiveOptReduce1.scala:103)

at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at 
org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)

at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:798)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:798)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1511)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1511)
at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)

at org.apache.spark.scheduler.Task.run(Task.scala:64)
at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

at java.lang.Thread.run(Thread.java:744)

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org







-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: hiveContext.sql NullPointerException

2015-06-07 Thread Cheng Lian
Spark SQL supports Hive dynamic partitioning, so one possible workaround 
is to create a Hive table partitioned by zone, z, year, and month 
dynamically, and then insert the whole dataset into it directly.


In 1.4, we also provides dynamic partitioning support for non-Hive 
environment, and you can do something like this:


df.write.partitionBy("zone", "z", "year", 
"month").format("parquet").mode("overwrite").saveAsTable("tbl")


Cheng
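
For the Hive dynamic-partitioning route, a minimal sketch of a single driver-side insert, assuming the whole dataset is available as one DataFrame df that already carries the partition columns (table and column names follow the snippets in this thread and are otherwise assumptions):

hiveContext.setConf("hive.exec.dynamic.partition", "true")
hiveContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

df.registerTempTable("table_4Dim")
hiveContext.sql(
  """INSERT OVERWRITE TABLE 4dim PARTITION (zone, z, year, month)
    |SELECT date, hh, x, y, height, u, v, w, ph, phb, t, p, pb,
    |       qvapor, qgraup, qnice, qnrain, tke_pbl, el_pbl,
    |       zone, z, year, month
    |FROM table_4Dim""".stripMargin)

Because this runs once on the driver, it avoids calling hiveContext.sql from inside the foreach closure on the executors.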

On 6/7/15 9:48 PM, patcharee wrote:

Hi,

How can I expect to work on HiveContext on the executor? If only the 
driver can see HiveContext, does it mean I have to collect all 
datasets (very large) to the driver and use HiveContext there? It will 
be memory overload on the driver and fail.


BR,
Patcharee

On 07. juni 2015 11:51, Cheng Lian wrote:

Hi,

This is expected behavior. HiveContext.sql (and also 
DataFrame.registerTempTable) is only expected to be invoked on driver 
side. However, the closure passed to RDD.foreach is executed on 
executor side, where no viable HiveContext instance exists.


Cheng

On 6/7/15 10:06 AM, patcharee wrote:

Hi,

I try to insert data into a partitioned hive table. The groupByKey 
is to combine dataset into a partition of the hive table. After the 
groupByKey, I converted the iterable[X] to DB by X.toList.toDF(). 
But the hiveContext.sql  throws NullPointerException, see below. Any 
suggestions? What could be wrong? Thanks!


val varWHeightFlatRDD = 
varWHeightRDD.flatMap(FlatMapUtilClass().flatKeyFromWrf).groupByKey()

  .foreach(
x = {
  val zone = x._1._1
  val z = x._1._2
  val year = x._1._3
  val month = x._1._4
  val df_table_4dim = x._2.toList.toDF()
  df_table_4dim.registerTempTable(table_4Dim)
  hiveContext.sql(INSERT OVERWRITE table 4dim partition 
(zone= + zone + ,z= + z + ,year= + year + ,month= + month + 
)  +
select date, hh, x, y, height, u, v, w, ph, phb, t, p, 
pb, qvapor, qgraup, qnice, qnrain, tke_pbl, el_pbl from table_4Dim);


})


java.lang.NullPointerException
at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:100)
at 
no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$7.apply(LoadWrfIntoHiveOptReduce1.scala:113)
at 
no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$7.apply(LoadWrfIntoHiveOptReduce1.scala:103)

at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at 
org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)

at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:798)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:798)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1511)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1511)
at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)

at org.apache.spark.scheduler.Task.run(Task.scala:64)
at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

at java.lang.Thread.run(Thread.java:744)

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org










-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: NullPointerException SQLConf.setConf

2015-06-07 Thread Cheng Lian
Are you calling hiveContext.sql within an RDD.map closure or something 
similar? In that case, the call actually happens on the executor side. 
However, HiveContext only exists on the driver side.


Cheng

On 6/4/15 3:45 PM, patcharee wrote:

Hi,

I am using Hive 0.14 and Spark 0.13. I got a 
java.lang.NullPointerException when inserting into Hive. Any 
suggestions please.


hiveContext.sql("INSERT OVERWRITE table 4dim partition (zone=" + ZONE 
+ ",z=" + zz + ",year=" + YEAR + ",month=" + MONTH + ") " +
"select date, hh, x, y, height, u, v, w, ph, phb, t, p, pb, " +
"qvapor, qgraup, qnice, qnrain, tke_pbl, el_pbl from table_4Dim where " +
"z=" + zz);


java.lang.NullPointerException
at org.apache.spark.sql.SQLConf.setConf(SQLConf.scala:196)
at org.apache.spark.sql.SQLContext.setConf(SQLContext.scala:74)
at 
org.apache.spark.sql.hive.HiveContext.hiveconf$lzycompute(HiveContext.scala:251)
at 
org.apache.spark.sql.hive.HiveContext.hiveconf(HiveContext.scala:250)

at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:95)
at 
no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$main$3$$anonfun$apply$1.apply(LoadWrfIntoHiveOptReduce1.scala:110)
at 
no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$main$3$$anonfun$apply$1.apply(LoadWrfIntoHiveOptReduce1.scala:107)

at scala.collection.immutable.Range.foreach(Range.scala:141)
at 
no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$main$3.apply(LoadWrfIntoHiveOptReduce1.scala:107)
at 
no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$main$3.apply(LoadWrfIntoHiveOptReduce1.scala:107)
at 
org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:806)
at 
org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:806)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1511)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1511)

at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

at java.lang.Thread.run(Thread.java:744)

Best,
Patcharee

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org





-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: hiveContext.sql NullPointerException

2015-06-07 Thread Cheng Lian

Hi,

This is expected behavior. HiveContext.sql (and also 
DataFrame.registerTempTable) is only expected to be invoked on the driver 
side. However, the closure passed to RDD.foreach is executed on the executor 
side, where no viable HiveContext instance exists.
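
One way to restructure this is a rough sketch only -- it assumes the records
you already convert with x._2.toList.toDF() carry the partition columns zone,
z, year and month as fields, and that dynamic partition inserts are enabled in
your Hive setup -- build a single DataFrame and run the INSERT once, on the
driver:

    import hiveContext.implicits._

    hiveContext.sql("SET hive.exec.dynamic.partition = true")
    hiveContext.sql("SET hive.exec.dynamic.partition.mode = nonstrict")

    // Build one DataFrame on the driver instead of one per key inside foreach.
    val df = varWHeightRDD.flatMap(FlatMapUtilClass().flatKeyFromWrf).map(_._2).toDF()
    df.registerTempTable("table_4Dim")

    // A single dynamic-partition insert, issued on the driver.
    hiveContext.sql(
      "INSERT OVERWRITE TABLE 4dim PARTITION (zone, z, year, month) " +
      "SELECT date, hh, x, y, height, u, v, w, ph, phb, t, p, pb, " +
      "qvapor, qgraup, qnice, qnrain, tke_pbl, el_pbl, zone, z, year, month " +
      "FROM table_4Dim")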


Cheng

On 6/7/15 10:06 AM, patcharee wrote:

Hi,

I am trying to insert data into a partitioned Hive table. The groupByKey is 
meant to combine the dataset into one partition of the Hive table. After the 
groupByKey, I convert the Iterable[X] to a DataFrame with X.toList.toDF(). But 
the hiveContext.sql call throws a NullPointerException, see below. Any 
suggestions? What could be wrong? Thanks!


val varWHeightFlatRDD =
  varWHeightRDD.flatMap(FlatMapUtilClass().flatKeyFromWrf).groupByKey()
    .foreach(
      x => {
        val zone = x._1._1
        val z = x._1._2
        val year = x._1._3
        val month = x._1._4
        val df_table_4dim = x._2.toList.toDF()
        df_table_4dim.registerTempTable("table_4Dim")
        hiveContext.sql("INSERT OVERWRITE table 4dim partition " +
          "(zone=" + zone + ",z=" + z + ",year=" + year + ",month=" + month + ") " +
          "select date, hh, x, y, height, u, v, w, ph, phb, t, p, " +
          "pb, qvapor, qgraup, qnice, qnrain, tke_pbl, el_pbl from table_4Dim")
      })


java.lang.NullPointerException
at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:100)
at 
no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$7.apply(LoadWrfIntoHiveOptReduce1.scala:113)
at 
no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$7.apply(LoadWrfIntoHiveOptReduce1.scala:103)

at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at 
org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)

at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:798)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:798)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1511)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1511)

at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

at java.lang.Thread.run(Thread.java:744)

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



hiveContext.sql NullPointerException

2015-06-06 Thread patcharee

Hi,

I am trying to insert data into a partitioned Hive table. The groupByKey is 
meant to combine the dataset into one partition of the Hive table. After the 
groupByKey, I convert the Iterable[X] to a DataFrame with X.toList.toDF(). But 
the hiveContext.sql call throws a NullPointerException, see below. Any 
suggestions? What could be wrong? Thanks!


val varWHeightFlatRDD =
  varWHeightRDD.flatMap(FlatMapUtilClass().flatKeyFromWrf).groupByKey()
    .foreach(
      x => {
        val zone = x._1._1
        val z = x._1._2
        val year = x._1._3
        val month = x._1._4
        val df_table_4dim = x._2.toList.toDF()
        df_table_4dim.registerTempTable("table_4Dim")
        hiveContext.sql("INSERT OVERWRITE table 4dim partition " +
          "(zone=" + zone + ",z=" + z + ",year=" + year + ",month=" + month + ") " +
          "select date, hh, x, y, height, u, v, w, ph, phb, t, p, " +
          "pb, qvapor, qgraup, qnice, qnrain, tke_pbl, el_pbl from table_4Dim")
      })


java.lang.NullPointerException
at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:100)
at 
no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$7.apply(LoadWrfIntoHiveOptReduce1.scala:113)
at 
no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$7.apply(LoadWrfIntoHiveOptReduce1.scala:103)

at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at 
org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)

at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:798)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:798)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1511)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1511)

at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

at java.lang.Thread.run(Thread.java:744)

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



NullPointerException SQLConf.setConf

2015-06-04 Thread patcharee

Hi,

I am using Hive 0.14 and Spark 0.13. I get a 
java.lang.NullPointerException when inserting into Hive. Any suggestions, 
please.


hiveContext.sql("INSERT OVERWRITE table 4dim partition (zone=" + ZONE +
  ",z=" + zz + ",year=" + YEAR + ",month=" + MONTH + ") " +
  "select date, hh, x, y, height, u, v, w, ph, phb, t, p, pb, " +
  "qvapor, qgraup, qnice, qnrain, tke_pbl, el_pbl from table_4Dim where z=" 
  + zz);


java.lang.NullPointerException
at org.apache.spark.sql.SQLConf.setConf(SQLConf.scala:196)
at org.apache.spark.sql.SQLContext.setConf(SQLContext.scala:74)
at 
org.apache.spark.sql.hive.HiveContext.hiveconf$lzycompute(HiveContext.scala:251)
at 
org.apache.spark.sql.hive.HiveContext.hiveconf(HiveContext.scala:250)

at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:95)
at 
no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$main$3$$anonfun$apply$1.apply(LoadWrfIntoHiveOptReduce1.scala:110)
at 
no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$main$3$$anonfun$apply$1.apply(LoadWrfIntoHiveOptReduce1.scala:107)

at scala.collection.immutable.Range.foreach(Range.scala:141)
at 
no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$main$3.apply(LoadWrfIntoHiveOptReduce1.scala:107)
at 
no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$main$3.apply(LoadWrfIntoHiveOptReduce1.scala:107)
at 
org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:806)
at 
org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:806)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1511)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1511)

at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

at java.lang.Thread.run(Thread.java:744)

Best,
Patcharee

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



NullPointerException when accessing broadcast variable in DStream

2015-05-18 Thread hotienvu
Hi, I'm trying to use broadcast variables in my Spark Streaming program.

 val conf = new SparkConf().setMaster(SPARK_MASTER).setAppName(APPLICATION_NAME)
 val ssc = new StreamingContext(conf, Seconds(1))

 val LIMIT = ssc.sparkContext.broadcast(5L)
 println(LIMIT.value) // this prints 5
 val lines = ssc.socketTextStream("localhost", /* port elided in the archive */)
 val words = lines.flatMap(_.split(" ")) filter (_.size > LIMIT.value)
 words.print()
 ssc.start()
 ssc.awaitTermination()

It throws java.lang.NullPointerException at the line (_.size > LIMIT.value),
so I'm guessing LIMIT is not accessible within the stream.

I'm running Spark 1.3.1 in standalone mode on a 2-node cluster. I tried the
same thing in spark-shell and it works fine. Please help!

Thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-when-accessing-broadcast-variable-in-DStream-tp22934.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



NullPointerException with Avro + Spark.

2015-05-01 Thread ๏̯͡๏
I have a Spark app that simply needs to do a regular join between two
datasets. It works fine with a tiny data set (2.5G input for each dataset).
When I run against 25G of each input, with .partitionBy(new
org.apache.spark.HashPartitioner(200)), I see a NullPointerException.

The trace does not include a line from my code, so I do not know what is
causing the error.
I have registered the Kryo serializer.

val conf = new SparkConf()
  .setAppName("detail")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.mb", arguments.get("buffersize").get)
  .set("spark.kryoserializer.buffer.max.mb", arguments.get("maxbuffersize").get)
  .set("spark.driver.maxResultSize", arguments.get("maxResultSize").get)
  .set("spark.yarn.maxAppAttempts", "0")
  .registerKryoClasses(Array(
    classOf[com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum]))
val sc = new SparkContext(conf)

I see the exception when this task runs:

val viEvents = details.map { vi => (vi.get(14).asInstanceOf[Long], vi) }

It's a simple mapping of input records to (itemId, record).

I found this
http://stackoverflow.com/questions/23962796/kryo-readobject-cause-nullpointerexception-with-arraylist
and
http://apache-spark-user-list.1001560.n3.nabble.com/Kryo-NPE-with-Array-td19797.html

Looks like Kryo (v2.21) changed something to stop using default
constructors, and the suggested workaround is:

((Kryo.DefaultInstantiatorStrategy)
kryo.getInstantiatorStrategy()).setFallbackInstantiatorStrategy(new
StdInstantiatorStrategy());
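
If that is the problem, one way to wire it into Spark is a custom
KryoRegistrator. This is a sketch only -- the registrator class name is made
up and I have not verified it against your Avro classes:

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.serializer.KryoRegistrator
    import org.objenesis.strategy.StdInstantiatorStrategy

    class AvroKryoRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo): Unit = {
        // Fall back to Objenesis for classes without a usable no-arg
        // constructor (e.g. some Avro-generated records).
        kryo.setInstantiatorStrategy(
          new Kryo.DefaultInstantiatorStrategy(new StdInstantiatorStrategy()))
        kryo.register(
          classOf[com.ebay.ep.poc.spark.reporting.process.model.dw.SpsLevelMetricSum])
      }
    }

    // Then point Spark at it in addition to (or instead of) registerKryoClasses:
    //   .set("spark.kryo.registrator", classOf[AvroKryoRegistrator].getName)

Another route people take for Avro records is a dedicated Avro serializer
(e.g. chill-avro) rather than Kryo's field serializer.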


Please suggest


Trace:
15/05/01 03:02:15 ERROR executor.Executor: Exception in task 110.1 in stage
2.0 (TID 774)
com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException
Serialization trace:
values (org.apache.avro.generic.GenericData$Record)
datum (org.apache.avro.mapred.AvroKey)
at
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:626)
at
com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648)
at
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
at
com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:41)
at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:33)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
at
org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:138)
at
org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
at
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
at
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at
org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:210)
at
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
at org.apache.avro.generic.GenericData$Array.add(GenericData.java:200)
at
com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109)
at
com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18)
at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
at
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:338)
at
com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ObjectArraySerializer.read(DefaultArraySerializers.java:293)
at com.esotericsoftware.kryo.Kryo.readObject(Kryo.java:648)
at
com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:605)
... 27 more

-- 
Deepak


NullPointerException in TaskSetManager

2015-02-26 Thread gtinside
Hi,

I am trying to run a simple Hadoop job (one that uses
CassandraHadoopInputOutputWriter) on Spark (v1.2, Hadoop v1.x), but I am
getting a NullPointerException in TaskSetManager:

WARN 2015-02-26 14:21:43,217 [task-result-getter-0] TaskSetManager - Lost
task 14.2 in stage 0.0 (TID 29, devntom003.dev.blackrock.com):
java.lang.NullPointerException

org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1007)
com.bfm.spark.test.CassandraHadoopMigrator$.main(CassandraHadoopMigrator.scala:77)
com.bfm.spark.test.CassandraHadoopMigrator.main(CassandraHadoopMigrator.scala)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:606)
org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)


The logs don't have any further stack trace. Can someone please help?

Regards,
Gaurav




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-in-TaskSetManager-tp21832.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark NullPointerException

2015-02-25 Thread Máté Gulyás
Hi all,

I am trying to run a Spark Java application on EMR, but I keep getting a
NullPointerException from the ApplicationMaster (Spark version on EMR: 1.2).
The stack trace is below. I also tried to run the application on the
Hortonworks Sandbox (2.2) with Spark 1.2, following the Hortonworks blog post
(http://hortonworks.com/hadoop-tutorial/using-apache-spark-hdp/), but that
failed with the same exception. I run over YARN (master: yarn-cluster). I also
tried the Hortonworks sample application on the virtual machine, which failed
with the very same exception, and setting the Spark home in SparkConf made no
difference. What am I missing?

The stacktrace and the log:
15/02/25 11:38:41 INFO SecurityManager: Changing view acls to: root
15/02/25 11:38:41 INFO SecurityManager: Changing modify acls to: root
15/02/25 11:38:41 INFO SecurityManager: SecurityManager:
authentication disabled; ui acls disabled; users with view
permissions: Set(root); users with modify permissions: Set(root)
15/02/25 11:38:42 INFO Slf4jLogger: Slf4jLogger started
15/02/25 11:38:42 INFO Remoting: Starting remoting
15/02/25 11:38:42 INFO Remoting: Remoting started; listening on
addresses :[akka.tcp://sparkdri...@sandbox.hortonworks.com:53937]
15/02/25 11:38:42 INFO Utils: Successfully started service
'sparkDriver' on port 53937.
15/02/25 11:38:42 INFO SparkEnv: Registering MapOutputTracker
15/02/25 11:38:42 INFO SparkEnv: Registering BlockManagerMaster
15/02/25 11:38:42 INFO DiskBlockManager: Created local directory at
/tmp/spark-local-20150225113842-788f
15/02/25 11:38:42 INFO MemoryStore: MemoryStore started with capacity 265.4 MB
15/02/25 11:38:42 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where
applicable
15/02/25 11:38:42 INFO HttpFileServer: HTTP File server directory is
/tmp/spark-973069b3-aafd-4f1d-b18c-9e0a5d0efcaa
15/02/25 11:38:42 INFO HttpServer: Starting HTTP Server
15/02/25 11:38:43 INFO Utils: Successfully started service 'HTTP file
server' on port 39199.
15/02/25 11:38:43 INFO Utils: Successfully started service 'SparkUI'
on port 4040.
15/02/25 11:38:43 INFO SparkUI: Started SparkUI at
http://sandbox.hortonworks.com:4040
15/02/25 11:38:43 INFO SparkContext: Added JAR
file:/root/logprocessor-1.0-SNAPSHOT-jar-with-dependencies.jar at
http://192.168.100.37:39199/jars/logprocessor-1.0-SNAPSHOT-jar-with-dependencies.jar
with timestamp 1424864323482
15/02/25 11:38:43 INFO YarnClusterScheduler: Created YarnClusterScheduler
Exception in thread main java.lang.NullPointerException
at 
org.apache.spark.deploy.yarn.ApplicationMaster$.getAttempId(ApplicationMaster.scala:524)
at 
org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend.start(YarnClusterSchedulerBackend.scala:34)
at 
org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:140)
at org.apache.spark.SparkContext.init(SparkContext.scala:337)
at 
org.apache.spark.api.java.JavaSparkContext.init(JavaSparkContext.scala:61)
at 
org.apache.spark.api.java.JavaSparkContext.init(JavaSparkContext.scala:75)
at hu.enbritely.logprocessor.Logprocessor.main(Logprocessor.java:43)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:360)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:76)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties


One of the programs I try to run:

public static void main(String[] argv) {
  SparkConf conf = new SparkConf();
  JavaSparkContext spark = new JavaSparkContext("yarn-cluster",
      "Spark logprocessing", conf);
  JavaRDD<String> file = spark.textFile("hdfs://spark-output");
  file.saveAsTextFile("hdfs://output");
  spark.stop();
}

Thank you for your assistance!

Mate Gulyas

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



NullPointerException in ApplicationMaster

2015-02-25 Thread gulyasm
Hi all,

I am trying to run a Spark Java application on EMR, but I keep getting a
NullPointerException from the ApplicationMaster (Spark version on EMR: 1.2).
The stack trace is below. I also tried to run the application on the
Hortonworks Sandbox (2.2) with Spark 1.2, following the Hortonworks blog post
(http://hortonworks.com/hadoop-tutorial/using-apache-spark-hdp/), but that
failed with the same exception. I run over YARN (master: yarn-cluster). I also
tried the Hortonworks sample application on the virtual machine, which failed
with the very same exception, and setting the Spark home in SparkConf made no
difference. What am I missing?

The stacktrace and the log:
15/02/25 11:38:41 INFO SecurityManager: Changing view acls to: root
15/02/25 11:38:41 INFO SecurityManager: Changing modify acls to: root
15/02/25 11:38:41 INFO SecurityManager: SecurityManager: authentication
disabled; ui acls disabled; users with view permissions: Set(root); users
with modify permissions: Set(root)
15/02/25 11:38:42 INFO Slf4jLogger: Slf4jLogger started
15/02/25 11:38:42 INFO Remoting: Starting remoting
15/02/25 11:38:42 INFO Remoting: Remoting started; listening on
addresses :[akka.tcp://sparkdri...@sandbox.hortonworks.com:53937]
15/02/25 11:38:42 INFO Utils: Successfully started service
'sparkDriver' on port 53937.
15/02/25 11:38:42 INFO SparkEnv: Registering MapOutputTracker
15/02/25 11:38:42 INFO SparkEnv: Registering BlockManagerMaster
15/02/25 11:38:42 INFO DiskBlockManager: Created local directory at
/tmp/spark-local-20150225113842-788f
15/02/25 11:38:42 INFO MemoryStore: MemoryStore started with capacity 265.4
MB
15/02/25 11:38:42 WARN NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where
applicable
15/02/25 11:38:42 INFO HttpFileServer: HTTP File server directory is
/tmp/spark-973069b3-aafd-4f1d-b18c-9e0a5d0efcaa
15/02/25 11:38:42 INFO HttpServer: Starting HTTP Server
15/02/25 11:38:43 INFO Utils: Successfully started service 'HTTP file
server' on port 39199.
15/02/25 11:38:43 INFO Utils: Successfully started service 'SparkUI' on port
4040.
15/02/25 11:38:43 INFO SparkUI: Started SparkUI at
http://sandbox.hortonworks.com:4040
15/02/25 11:38:43 INFO SparkContext: Added JAR
file:/root/logprocessor-1.0-SNAPSHOT-jar-with-dependencies.jar at
http://192.168.100.37:39199/jars/logprocessor-1.0-SNAPSHOT-jar-with-dependencies.jar
with timestamp 1424864323482
15/02/25 11:38:43 INFO YarnClusterScheduler: Created YarnClusterScheduler
Exception in thread main java.lang.NullPointerException
at
org.apache.spark.deploy.yarn.ApplicationMaster$.getAttempId(ApplicationMaster.scala:524)
at
org.apache.spark.scheduler.cluster.YarnClusterSchedulerBackend.start(YarnClusterSchedulerBackend.scala:34)
at
org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:140)
at org.apache.spark.SparkContext.init(SparkContext.scala:337)
at
org.apache.spark.api.java.JavaSparkContext.init(JavaSparkContext.scala:61)
at
org.apache.spark.api.java.JavaSparkContext.init(JavaSparkContext.scala:75)
at hu.enbritely.logprocessor.Logprocessor.main(Logprocessor.java:43)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at
org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:360)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:76)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Using Spark's default log4j profile:
org/apache/spark/log4j-defaults.properties


One of the programs I try to run:

public static void main(String[] argv) {
  SparkConf conf = new SparkConf();
  JavaSparkContext spark = new JavaSparkContext("yarn-cluster",
      "Spark logprocessing", conf);
  JavaRDD<String> file = spark.textFile("hdfs://spark-output");
  file.saveAsTextFile("hdfs://output");
  spark.stop();
}

Thank you for your assistance!
Mate Gulyas



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-in-ApplicationMaster-tp21804.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: NullPointerException in ApplicationMaster

2015-02-25 Thread Zhan Zhang
Look at the trace again. It is a very weird error. SparkSubmit is running 
on the client side, but YarnClusterSchedulerBackend is supposed to run in the 
YARN AM.

I suspect you are submitting the job in yarn-client mode, but in 
JavaSparkContext you set "yarn-cluster". As a result, the Spark context 
initiates YarnClusterSchedulerBackend instead of YarnClientSchedulerBackend, 
which I think is the root cause.
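
In other words, the master should come from spark-submit rather than from the
code. A minimal sketch (in Scala for brevity; the same applies to
JavaSparkContext -- the jar name is a placeholder):

    import org.apache.spark.{SparkConf, SparkContext}

    object Logprocessor {
      def main(args: Array[String]): Unit = {
        // No setMaster / "yarn-cluster" here; spark-submit --master supplies it.
        val conf = new SparkConf().setAppName("Spark logprocessing")
        val sc = new SparkContext(conf)
        sc.textFile("hdfs://spark-output").saveAsTextFile("hdfs://output")
        sc.stop()
      }
    }

submitted with something like:

    spark-submit --master yarn-cluster --class Logprocessor logprocessor.jar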

Thanks.

Zhan Zhang

On Feb 25, 2015, at 1:53 PM, Zhan Zhang 
zzh...@hortonworks.commailto:zzh...@hortonworks.com wrote:

Hi Mate,

When you initialize the JavaSparkContext, you don’t need to specify the mode 
“yarn-cluster”. I suspect that is the root cause.

Thanks.

Zhan Zhang

On Feb 25, 2015, at 10:12 AM, gulyasm 
mgulya...@gmail.commailto:mgulya...@gmail.com wrote:

JavaSparkContext.




NullPointerException

2014-12-31 Thread rapelly kartheek
Hi,
I get the following exception when I submit a Spark application that
calculates the frequency of characters in a file. I face this problem
especially when I increase the size of the data.

Exception in thread Thread-47 org.apache.spark.SparkException: Job
aborted due to stage failure: Task 11.0:10 failed 4 times, most recent
failure: Exception failure in TID 295 on host s1:
java.lang.NullPointerException
org.apache.spark.storage.BlockManager.org
$apache$spark$storage$BlockManager$$replicate(BlockManager.scala:786)
org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:752)
org.apache.spark.storage.BlockManager.put(BlockManager.scala:574)
org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:108)
org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
org.apache.spark.scheduler.Task.run(Task.scala:51)

org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
at scala.Option.foreach(Option.scala:236)
at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633)
at
org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)


Any help?

Thank you!


Re: NullPointerException

2014-12-31 Thread Josh Rosen
Which version of Spark are you using?

On Wed, Dec 31, 2014 at 10:24 PM, rapelly kartheek kartheek.m...@gmail.com
wrote:

 Hi,
 I get the following exception when I submit a Spark application that
 calculates the frequency of characters in a file. I face this problem
 especially when I increase the size of the data.

 Exception in thread Thread-47 org.apache.spark.SparkException: Job
 aborted due to stage failure: Task 11.0:10 failed 4 times, most recent
 failure: Exception failure in TID 295 on host s1:
 java.lang.NullPointerException
 org.apache.spark.storage.BlockManager.org
 $apache$spark$storage$BlockManager$$replicate(BlockManager.scala:786)
 org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:752)
 org.apache.spark.storage.BlockManager.put(BlockManager.scala:574)
 org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:108)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
 org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
 org.apache.spark.scheduler.Task.run(Task.scala:51)

 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)

 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 java.lang.Thread.run(Thread.java:745)
 Driver stacktrace:
 at org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
 at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
 at scala.Option.foreach(Option.scala:236)
 at
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633)
 at
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
 at akka.actor.ActorCell.invoke(ActorCell.scala:456)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
 at akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)


 Any help?

 Thank you!



Re: NullPointerException

2014-12-31 Thread rapelly kartheek
spark-1.0.0

On Thu, Jan 1, 2015 at 12:04 PM, Josh Rosen rosenvi...@gmail.com wrote:

 Which version of Spark are you using?

 On Wed, Dec 31, 2014 at 10:24 PM, rapelly kartheek 
 kartheek.m...@gmail.com wrote:

 Hi,
 I get the following exception when I submit a Spark application that
 calculates the frequency of characters in a file. I face this problem
 especially when I increase the size of the data.

 Exception in thread Thread-47 org.apache.spark.SparkException: Job
 aborted due to stage failure: Task 11.0:10 failed 4 times, most recent
 failure: Exception failure in TID 295 on host s1:
 java.lang.NullPointerException
 org.apache.spark.storage.BlockManager.org
 $apache$spark$storage$BlockManager$$replicate(BlockManager.scala:786)

 org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:752)
 org.apache.spark.storage.BlockManager.put(BlockManager.scala:574)
 org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:108)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
 org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
 org.apache.spark.scheduler.Task.run(Task.scala:51)

 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)

 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 java.lang.Thread.run(Thread.java:745)
 Driver stacktrace:
 at org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
 at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
 at scala.Option.foreach(Option.scala:236)
 at
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633)
 at
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
 at akka.actor.ActorCell.invoke(ActorCell.scala:456)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
 at akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)


 Any help?

 Thank you!




