Re: NullPointerException in SparkSession while reading Parquet files on S3

2021-05-25 Thread YEONWOO BAEK
unsubscribe

On Wed, May 26, 2021 at 12:31 AM, Eric Beabes wrote:

> I keep getting the following exception when I am trying to read a Parquet
> file from a Path on S3 in Spark/Scala. Note: I am running this on EMR.
>
> java.lang.NullPointerException
> at 
> org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:144)
> at 
> org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:142)
> at 
> org.apache.spark.sql.DataFrameReader.<init>(DataFrameReader.scala:789)
> at org.apache.spark.sql.SparkSession.read(SparkSession.scala:656)
>
> Interestingly I can read the path from Spark shell:
>
> scala> val df = spark.read.parquet("s3://my-path/").count
> df: Long = 47
>
> I've created the SparkSession as follows:
>
> val sparkConf = new SparkConf().setAppName("My spark app")
> val spark = SparkSession.builder.config(sparkConf).enableHiveSupport().getOrCreate()
> spark.sparkContext.setLogLevel("WARN")
> spark.sparkContext.hadoopConfiguration.set("java.library.path", 
> "/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native")
> spark.conf.set("spark.sql.parquet.mergeSchema", "true")
> spark.conf.set("spark.speculation", "false")
> spark.conf.set("spark.sql.crossJoin.enabled", "true")
> spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", 
> "true")
> spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
> spark.sparkContext.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version",
>  "2")
> spark.sparkContext.hadoopConfiguration.setBoolean("mapreduce.fileoutputcommitter.cleanup.skipped",
>  true)
> spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", 
> System.getenv("AWS_ACCESS_KEY_ID"))
> spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", 
> System.getenv("AWS_SECRET_ACCESS_KEY"))
> spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint", 
> "s3.amazonaws.com")
>
> Here's the line where I am getting this exception:
>
> val df1 = spark.read.parquet(pathToRead)
>
> What am I doing wrong? I have tried without setting the 'access key' &
> 'secret key' as well, with no luck.
>
>
>
>
>


Re: Nullpointerexception error when in repartition

2018-04-12 Thread Junfeng Chen
Hi,
I know, but my purpose is to transform a JSON string Dataset into a Dataset
of Rows, while spark.readStream can only read JSON files from a specified
path.
https://stackoverflow.com/questions/48617474/how-to-convert-json-dataset-to-dataframe-in-spark-structured-streaming
gives a possible method, but the JSON records do not all share the same format.
Also, the Spark Java API does not seem to support syntax like

.select(from_json($"value", colourSchema))



Regard,
Junfeng Chen

On Fri, Apr 13, 2018 at 7:09 AM, Tathagata Das 
wrote:

> Have you read through the documentation of Structured Streaming?
> https://spark.apache.org/docs/latest/structured-streaming-
> programming-guide.html
>
> One of the basic mistakes you are making is defining the dataset as with
> `spark.read()`. You define a streaming Dataset as `spark.readStream()`
>
> On Thu, Apr 12, 2018 at 3:02 AM, Junfeng Chen  wrote:
>
>> Hi, Tathagata
>>
>> I have tried structured streaming, but in line
>>
>>> Dataset rowDataset = spark.read().json(jsondataset);
>>
>>
>> Always throw
>>
>>> Queries with streaming sources must be executed with writeStream.start()
>>
>>
>> But what i need to do in this step is only transforming json string data
>> to Dataset . How to fix it?
>>
>> Thanks!
>>
>>
>> Regard,
>> Junfeng Chen
>>
>> On Thu, Apr 12, 2018 at 3:08 PM, Tathagata Das <
>> tathagata.das1...@gmail.com> wrote:
>>
>>> It's not very surprising that doing this sort of RDD to DF conversion
>>> inside DStream.foreachRDD has weird corner cases like this. In fact, you
>>> are going to have additional problems with partial parquet files (when
>>> there are failures) in this approach. I strongly suggest that you use
>>> Structured Streaming, which is designed to do this sort of processing. It
>>> will take care of tracking the written parquet files correctly.
>>>
>>> TD
>>>
>>> On Wed, Apr 11, 2018 at 6:58 PM, Junfeng Chen 
>>> wrote:
>>>
 I write a program to read some json data from kafka and purpose to save
 them to parquet file on hdfs.
 Here is my code:

> JavaInputDstream stream = ...
> JavaDstream rdd = stream.map...
> rdd.repartition(taksNum).foreachRDD(VoldFunction stringjavardd->{
> Dataset df = spark.read().json( stringjavardd ); // convert
> json to df
> JavaRDD rowJavaRDD = df.javaRDD().map...  //add some new
> fields
> StructType type = df.schema()...; // constuct new type for new
> added fields
> Dataset //create new dataframe
> newdf.repatition(taskNum).write().mode(SaveMode.Append).pati
> tionedBy("appname").parquet(savepath); // save to parquet
> })



 However, if I remove the repartition method of newdf in writing parquet
 stage, the program always throw nullpointerexception error in json convert
 line:

 Java.lang.NullPointerException
>  at org.apache.spark.SparkContext.getPreferredLocs(SparkContext.
> scala:1783)
> ...


 While it looks make no sense, writing parquet operation should be in
 different stage with json transforming operation.
 So how to solve it? Thanks!

 Regard,
 Junfeng Chen

>>>
>>>
>>
>


Re: Nullpointerexception error when in repartition

2018-04-12 Thread Tathagata Das
Have you read through the documentation of Structured Streaming?
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html

One of the basic mistakes you are making is defining the dataset with
`spark.read()`. You define a streaming Dataset with `spark.readStream()`.
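
A minimal Scala sketch of the streaming equivalent, reading the JSON straight
from Kafka (this assumes a SparkSession named spark and the
spark-sql-kafka-0-10 package on the classpath; the broker, topic, schema and
paths are illustrative, not taken from your code):

    import org.apache.spark.sql.functions.{col, from_json}
    import org.apache.spark.sql.types.{StringType, StructType}

    // Illustrative schema -- replace with the real JSON structure
    val jsonSchema = new StructType()
      .add("appname", StringType)
      .add("payload", StringType)

    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")   // illustrative
      .option("subscribe", "mytopic")                      // illustrative
      .load()

    val parsed = raw
      .selectExpr("CAST(value AS STRING) AS value")
      .select(from_json(col("value"), jsonSchema).as("data"))
      .select("data.*")

    parsed.writeStream
      .format("parquet")
      .option("path", "/save/path")                        // illustrative
      .option("checkpointLocation", "/checkpoint/path")    // illustrative
      .partitionBy("appname")
      .start()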

On Thu, Apr 12, 2018 at 3:02 AM, Junfeng Chen  wrote:

> Hi, Tathagata
>
> I have tried structured streaming, but in line
>
>> Dataset rowDataset = spark.read().json(jsondataset);
>
>
> Always throw
>
>> Queries with streaming sources must be executed with writeStream.start()
>
>
> But what i need to do in this step is only transforming json string data
> to Dataset . How to fix it?
>
> Thanks!
>
>
> Regard,
> Junfeng Chen
>
> On Thu, Apr 12, 2018 at 3:08 PM, Tathagata Das <
> tathagata.das1...@gmail.com> wrote:
>
>> It's not very surprising that doing this sort of RDD to DF conversion
>> inside DStream.foreachRDD has weird corner cases like this. In fact, you
>> are going to have additional problems with partial parquet files (when
>> there are failures) in this approach. I strongly suggest that you use
>> Structured Streaming, which is designed to do this sort of processing. It
>> will take care of tracking the written parquet files correctly.
>>
>> TD
>>
>> On Wed, Apr 11, 2018 at 6:58 PM, Junfeng Chen  wrote:
>>
>>> I write a program to read some json data from kafka and purpose to save
>>> them to parquet file on hdfs.
>>> Here is my code:
>>>
 JavaInputDstream stream = ...
 JavaDstream rdd = stream.map...
 rdd.repartition(taksNum).foreachRDD(VoldFunction{
 Dataset df = spark.read().json( stringjavardd ); // convert
 json to df
 JavaRDD rowJavaRDD = df.javaRDD().map...  //add some new fields
 StructType type = df.schema()...; // constuct new type for new
 added fields
 Dataset>>
>>>
>>>
>>> However, if I remove the repartition method of newdf in writing parquet
>>> stage, the program always throw nullpointerexception error in json convert
>>> line:
>>>
>>> Java.lang.NullPointerException
  at org.apache.spark.SparkContext.getPreferredLocs(SparkContext.
 scala:1783)
 ...
>>>
>>>
>>> While it looks make no sense, writing parquet operation should be in
>>> different stage with json transforming operation.
>>> So how to solve it? Thanks!
>>>
>>> Regard,
>>> Junfeng Chen
>>>
>>
>>
>


Re: Nullpointerexception error when in repartition

2018-04-12 Thread Junfeng Chen
Hi, Tathagata

I have tried Structured Streaming, but the line

> Dataset rowDataset = spark.read().json(jsondataset);


always throws

> Queries with streaming sources must be executed with writeStream.start()


But what I need to do in this step is only transform JSON string data to a
Dataset. How do I fix it?

Thanks!


Regard,
Junfeng Chen

On Thu, Apr 12, 2018 at 3:08 PM, Tathagata Das 
wrote:

> It's not very surprising that doing this sort of RDD to DF conversion
> inside DStream.foreachRDD has weird corner cases like this. In fact, you
> are going to have additional problems with partial parquet files (when
> there are failures) in this approach. I strongly suggest that you use
> Structured Streaming, which is designed to do this sort of processing. It
> will take care of tracking the written parquet files correctly.
>
> TD
>
> On Wed, Apr 11, 2018 at 6:58 PM, Junfeng Chen  wrote:
>
>> I write a program to read some json data from kafka and purpose to save
>> them to parquet file on hdfs.
>> Here is my code:
>>
>>> JavaInputDstream stream = ...
>>> JavaDstream rdd = stream.map...
>>> rdd.repartition(taksNum).foreachRDD(VoldFunction>> stringjavardd->{
>>> Dataset df = spark.read().json( stringjavardd ); // convert
>>> json to df
>>> JavaRDD rowJavaRDD = df.javaRDD().map...  //add some new fields
>>> StructType type = df.schema()...; // constuct new type for new added
>>> fields
>>> Dataset>> //create new dataframe
>>> newdf.repatition(taskNum).write().mode(SaveMode.Append).pati
>>> tionedBy("appname").parquet(savepath); // save to parquet
>>> })
>>
>>
>>
>> However, if I remove the repartition method of newdf in writing parquet
>> stage, the program always throw nullpointerexception error in json convert
>> line:
>>
>> Java.lang.NullPointerException
>>>  at org.apache.spark.SparkContext.getPreferredLocs(SparkContext.
>>> scala:1783)
>>> ...
>>
>>
>> While it looks make no sense, writing parquet operation should be in
>> different stage with json transforming operation.
>> So how to solve it? Thanks!
>>
>> Regard,
>> Junfeng Chen
>>
>
>


Re: Nullpointerexception error when in repartition

2018-04-12 Thread Tathagata Das
It's not very surprising that doing this sort of RDD to DF conversion
inside DStream.foreachRDD has weird corner cases like this. In fact, you
are going to have additional problems with partial parquet files (when
there are failures) in this approach. I strongly suggest that you use
Structured Streaming, which is designed to do this sort of processing. It
will take care of tracking the written parquet files correctly.

TD

On Wed, Apr 11, 2018 at 6:58 PM, Junfeng Chen  wrote:

> I write a program to read some json data from kafka and purpose to save
> them to parquet file on hdfs.
> Here is my code:
>
>> JavaInputDstream stream = ...
>> JavaDstream rdd = stream.map...
>> rdd.repartition(taksNum).foreachRDD(VoldFunction> stringjavardd->{
>> Dataset df = spark.read().json( stringjavardd ); // convert
>> json to df
>> JavaRDD rowJavaRDD = df.javaRDD().map...  //add some new fields
>> StructType type = df.schema()...; // constuct new type for new added
>> fields
>> Dataset> //create new dataframe
>> newdf.repatition(taskNum).write().mode(SaveMode.Append).
>> patitionedBy("appname").parquet(savepath); // save to parquet
>> })
>
>
>
> However, if I remove the repartition method of newdf in writing parquet
> stage, the program always throw nullpointerexception error in json convert
> line:
>
> Java.lang.NullPointerException
>>  at org.apache.spark.SparkContext.getPreferredLocs(SparkContext.
>> scala:1783)
>> ...
>
>
> While it looks make no sense, writing parquet operation should be in
> different stage with json transforming operation.
> So how to solve it? Thanks!
>
> Regard,
> Junfeng Chen
>


Re: NullPointerException while reading a column from the row

2017-12-19 Thread Vadim Semenov
getAs defined as:

def getAs[T](i: Int): T = get(i).asInstanceOf[T]

and when you call toString you are calling Object.toString, which doesn't
depend on the type, so the asInstanceOf[T] gets dropped by the compiler, i.e.

row.getAs[Int](0).toString -> row.get(0).toString

we can confirm that by writing a simple scala code:

import org.apache.spark.sql._
object Test {
  val row = Row(null)
  row.getAs[Int](0).toString
}

and then compiling it:

$ scalac -classpath $SPARK_HOME/jars/'*' -print test.scala
[[syntax trees at end of   cleanup]] // test.scala
package  {
  object Test extends Object {
private[this] val row: org.apache.spark.sql.Row = _;
  def row(): org.apache.spark.sql.Row = Test.this.row;
    def <init>(): Test.type = {
      Test.super.<init>();
  Test.this.row =
org.apache.spark.sql.Row.apply(scala.this.Predef.genericWrapArray(Array[Object]{null}));
  *Test.this.row().getAs(0).toString();*
  ()
}
  }
}

So the proper way would be:

String.valueOf(row.getAs[Int](0))
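
or, if you would rather make the null handling explicit, something along these
lines (a sketch -- the column name and the chosen default are illustrative):

    // Check for null explicitly before reading the primitive value
    val idx = row.fieldIndex("ColumnName")
    val asString: String =
      if (row.isNullAt(idx)) "0"               // or whatever default fits
      else row.getInt(idx).toString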


On Tue, Dec 19, 2017 at 4:23 AM, Anurag Sharma  wrote:

> The following Scala (Spark 1.6) code for reading a value from a Row fails
> with a NullPointerException when the value is null.
>
> val test = row.getAs[Int]("ColumnName").toString
>
> while this works fine
>
> val test1 = row.getAs[Int]("ColumnName") // returns 0 for nullval test2 = 
> test1.toString // converts to String fine
>
> What is causing NullPointerException and what is the recommended way to
> handle such cases?
>
> PS: getting row from DataFrame as follows:
>
>  val myRDD = myDF
> .repartition(partitions)
> .mapPartitions {
>   rows =>
> rows.flatMap {
> row =>
>   functionWithRows(row) //has above logic to read null column 
> which fails
>   }
>   }
>
> functionWithRows has then above mentioned NullPointerException
>
> MyDF schema:
>
> root
>  |-- LDID: string (nullable = true)
>  |-- KTAG: string (nullable = true)
>  |-- ColumnName: integer (nullable = true)
>
>


Re: NullPointerException error while saving Scala Dataframe to HBase

2017-10-01 Thread Marco Mistroni
Hi
 The question is getting to the list.
I have no experience in HBase... though, having seen similar issues when
saving a DataFrame somewhere else, it might have to do with the properties you
need to set to let Spark know it is dealing with HBase. Don't you need to set
some properties on the SparkContext you are using?
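
Untested, but with the hbase-spark connector that usually means creating an
HBaseContext before calling save(), so the datasource has an HBase
configuration to work with. A rough sketch, reusing the data and catalog from
your snippet and assuming hbase-site.xml is on the classpath:

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.spark.HBaseContext

    // Registers the HBase configuration with the SparkContext; the
    // "org.apache.hadoop.hbase.spark" datasource looks this up when writing.
    val hbaseConf = HBaseConfiguration.create()
    val hbaseContext = new HBaseContext(sc, hbaseConf)

    sc.parallelize(data).toDF.write
      .options(Map(HBaseTableCatalog.tableCatalog -> catalog,
                   HBaseTableCatalog.newTable -> "5"))
      .format("org.apache.hadoop.hbase.spark")
      .save()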
Hth
 Marco


On Oct 1, 2017 4:33 AM,  wrote:

Hi Guys- am not sure whether the email is reaching to the community
members. Please can somebody acknowledge

Sent from my iPhone

> On 30-Sep-2017, at 5:02 PM, Debabrata Ghosh  wrote:
>
> Dear All,
>Greetings ! I am repeatedly hitting a NullPointerException
error while saving a Scala Dataframe to HBase. Please can you help
resolving this for me. Here is the code snippet:
>
> scala> def catalog = s"""{
>  ||"table":{"namespace":"default", "name":"table1"},
>  ||"rowkey":"key",
>  ||"columns":{
>  |  |"col0":{"cf":"rowkey", "col":"key", "type":"string"},
>  |  |"col1":{"cf":"cf1", "col":"col1", "type":"string"}
>  ||}
>  |  |}""".stripMargin
> catalog: String
>
> scala> case class HBaseRecord(
>  |col0: String,
>  |col1: String)
> defined class HBaseRecord
>
> scala> val data = (0 to 255).map { i =>  HBaseRecord(i.toString, "extra")}
> data: scala.collection.immutable.IndexedSeq[HBaseRecord] =
Vector(HBaseRecord(0,extra), HBaseRecord(1,extra), HBaseRecord
>
> (2,extra), HBaseRecord(3,extra), HBaseRecord(4,extra),
HBaseRecord(5,extra), HBaseRecord(6,extra), HBaseRecord(7,extra),
>
> HBaseRecord(8,extra), HBaseRecord(9,extra), HBaseRecord(10,extra),
HBaseRecord(11,extra), HBaseRecord(12,extra),
>
> HBaseRecord(13,extra), HBaseRecord(14,extra), HBaseRecord(15,extra),
HBaseRecord(16,extra), HBaseRecord(17,extra),
>
> HBaseRecord(18,extra), HBaseRecord(19,extra), HBaseRecord(20,extra),
HBaseRecord(21,extra), HBaseRecord(22,extra),
>
> HBaseRecord(23,extra), HBaseRecord(24,extra), HBaseRecord(25,extra),
HBaseRecord(26,extra), HBaseRecord(27,extra),
>
> HBaseRecord(28,extra), HBaseRecord(29,extra), HBaseRecord(30,extra),
HBaseRecord(31,extra), HBase...
>
> scala> import org.apache.spark.sql.datasources.hbase
> import org.apache.spark.sql.datasources.hbase
>
>
> scala> import org.apache.spark.sql.datasources.hbase.{HBaseTableCatalog}
> import org.apache.spark.sql.datasources.hbase.HBaseTableCatalog
>
> scala> 
> sc.parallelize(data).toDF.write.options(Map(HBaseTableCatalog.tableCatalog
-> catalog, HBaseTableCatalog.newTable ->
>
> "5")).format("org.apache.hadoop.hbase.spark").save()
>
> java.lang.NullPointerException
>   at org.apache.hadoop.hbase.spark.HBaseRelation.<init>(DefaultSource.scala:134)
>   at org.apache.hadoop.hbase.spark.DefaultSource.createRelation(
DefaultSource.scala:75)
>   at org.apache.spark.sql.execution.datasources.
DataSource.write(DataSource.scala:426)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
>   ... 56 elided
>
>
> Thanks in advance !
>
> Debu
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org


Re: NullPointerException error while saving Scala Dataframe to HBase

2017-09-30 Thread mailfordebu
Hi guys - I am not sure whether the email is reaching the community members.
Can somebody please acknowledge?

Sent from my iPhone

> On 30-Sep-2017, at 5:02 PM, Debabrata Ghosh  wrote:
> 
> Dear All,
>Greetings ! I am repeatedly hitting a NullPointerException 
> error while saving a Scala Dataframe to HBase. Please can you help resolving 
> this for me. Here is the code snippet:
> 
> scala> def catalog = s"""{
>  ||"table":{"namespace":"default", "name":"table1"},
>  ||"rowkey":"key",
>  ||"columns":{
>  |  |"col0":{"cf":"rowkey", "col":"key", "type":"string"},
>  |  |"col1":{"cf":"cf1", "col":"col1", "type":"string"}
>  ||}
>  |  |}""".stripMargin
> catalog: String
> 
> scala> case class HBaseRecord(
>  |col0: String,
>  |col1: String)
> defined class HBaseRecord
> 
> scala> val data = (0 to 255).map { i =>  HBaseRecord(i.toString, "extra")}
> data: scala.collection.immutable.IndexedSeq[HBaseRecord] = 
> Vector(HBaseRecord(0,extra), HBaseRecord(1,extra), HBaseRecord
> 
> (2,extra), HBaseRecord(3,extra), HBaseRecord(4,extra), HBaseRecord(5,extra), 
> HBaseRecord(6,extra), HBaseRecord(7,extra), 
> 
> HBaseRecord(8,extra), HBaseRecord(9,extra), HBaseRecord(10,extra), 
> HBaseRecord(11,extra), HBaseRecord(12,extra), 
> 
> HBaseRecord(13,extra), HBaseRecord(14,extra), HBaseRecord(15,extra), 
> HBaseRecord(16,extra), HBaseRecord(17,extra), 
> 
> HBaseRecord(18,extra), HBaseRecord(19,extra), HBaseRecord(20,extra), 
> HBaseRecord(21,extra), HBaseRecord(22,extra), 
> 
> HBaseRecord(23,extra), HBaseRecord(24,extra), HBaseRecord(25,extra), 
> HBaseRecord(26,extra), HBaseRecord(27,extra), 
> 
> HBaseRecord(28,extra), HBaseRecord(29,extra), HBaseRecord(30,extra), 
> HBaseRecord(31,extra), HBase...
> 
> scala> import org.apache.spark.sql.datasources.hbase
> import org.apache.spark.sql.datasources.hbase
>  
> 
> scala> import org.apache.spark.sql.datasources.hbase.{HBaseTableCatalog}
> import org.apache.spark.sql.datasources.hbase.HBaseTableCatalog
> 
> scala> 
> sc.parallelize(data).toDF.write.options(Map(HBaseTableCatalog.tableCatalog -> 
> catalog, HBaseTableCatalog.newTable -> 
> 
> "5")).format("org.apache.hadoop.hbase.spark").save()
> 
> java.lang.NullPointerException
>   at 
> org.apache.hadoop.hbase.spark.HBaseRelation.<init>(DefaultSource.scala:134)
>   at 
> org.apache.hadoop.hbase.spark.DefaultSource.createRelation(DefaultSource.scala:75)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:426)
>   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
>   ... 56 elided
> 
> 
> Thanks in advance !
> 
> Debu
> 

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: NullPointerException when starting StreamingContext

2016-06-24 Thread Sunita Arvind
I was able to resolve the serialization issue. The root cause was that I was
accessing the config values within foreachRDD{}.
The solution was to extract the values from the config outside the foreachRDD
scope and pass those values into the loop directly. Probably something obvious,
as we cannot have nested distributed data sets. Mentioning it here for the
benefit of anyone else stumbling upon the same issue.
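
In other words, roughly this pattern (the config keys and the per-batch work
here are only illustrative):

    // Pull plain, serializable values out of the Typesafe Config on the driver
    val topic      = custConf.getString("topic")        // illustrative key
    val outputPath = custConf.getString("outputPath")   // illustrative key

    stream.foreachRDD { (rdd, time) =>
      // Use only the extracted values here. Capturing custConf itself drags
      // the Config object into the DStream closure, and that is what failed
      // to serialize when the checkpoint was written.
      rdd.map(bytes => s"$topic,${bytes.length}")
         .saveAsTextFile(s"$outputPath/${time.milliseconds}")  // one dir per batch
    }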

regards
Sunita

On Wed, Jun 22, 2016 at 8:20 PM, Sunita Arvind 
wrote:

> Hello Experts,
>
> I am getting this error repeatedly:
>
> 16/06/23 03:06:59 ERROR streaming.StreamingContext: Error starting the 
> context, marking it as stopped
> java.lang.NullPointerException
>   at 
> com.typesafe.config.impl.SerializedConfigValue.writeOrigin(SerializedConfigValue.java:202)
>   at 
> com.typesafe.config.impl.ConfigImplUtil.writeOrigin(ConfigImplUtil.java:228)
>   at 
> com.typesafe.config.ConfigException.writeObject(ConfigException.java:58)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
>   at 
> java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:440)
>   at java.lang.Throwable.writeObject(Throwable.java:985)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>   at 
> java.io.ObjectOutputStream.writeFatalException(ObjectOutputStream.java:1576)
>   at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:350)
>   at 
> org.apache.spark.streaming.Checkpoint$$anonfun$serialize$1.apply$mcV$sp(Checkpoint.scala:141)
>   at 
> org.apache.spark.streaming.Checkpoint$$anonfun$serialize$1.apply(Checkpoint.scala:141)
>   at 
> org.apache.spark.streaming.Checkpoint$$anonfun$serialize$1.apply(Checkpoint.scala:141)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1251)
>   at 
> org.apache.spark.streaming.Checkpoint$.serialize(Checkpoint.scala:142)
>   at 
> org.apache.spark.streaming.StreamingContext.validate(StreamingContext.scala:554)
>   at 
> org.apache.spark.streaming.StreamingContext.liftedTree1$1(StreamingContext.scala:601)
>   at 
> org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:600)
>   at 
> com.edgecast.engine.ProcessingEngine$$anonfun$main$1.apply(ProcessingEngine.scala:73)
>   at 
> com.edgecast.engine.ProcessingEngine$$anonfun$main$1.apply(ProcessingEngine.scala:67)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at com.edgecast.engine.ProcessingEngine$.main(ProcessingEngine.scala:67)
>   at com.edgecast.engine.ProcessingEngine.main(ProcessingEngine.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:542)
>
>
> It seems to be a typical issue. All I am doing here is as below:
>
> Object ProcessingEngine{
>
> def initializeSpark(customer:String):StreamingContext={
>   LogHandler.log.info("InitialeSpark")
>   val custConf = ConfigFactory.load(customer + 
> ".conf").getConfig(customer).withFallback(AppConf)
>   implicit val sparkConf: SparkConf = new SparkConf().setAppName(customer)
>   val ssc: StreamingContext = new StreamingContext(sparkConf, 
> Seconds(custConf.getLong("batchDurSec")))
>   ssc.checkpoint(custConf.getString("checkpointDir"))
>   ssc
> }
>
> def createDataStreamFromKafka(customer:String, ssc: 
> StreamingContext):DStream[Array[Byte]]={
>   val 

Re: NullPointerException when starting StreamingContext

2016-06-24 Thread Cody Koeninger
That looks like a classpath problem.  You should not have to include
the kafka_2.10 artifact in your pom, spark-streaming-kafka_2.10
already has a transitive dependency on it.  That being said, 0.8.2.1
is the correct version, so that's a little strange.

How are you building and submitting your application?

Finally, if this ends up being a CDH related issue, you may have
better luck on their forum.

On Thu, Jun 23, 2016 at 1:16 PM, Sunita Arvind  wrote:
> Also, just to keep it simple, I am trying to use 1.6.0CDH5.7.0 in the
> pom.xml as the cluster I am trying to run on is CDH5.7.0 with spark 1.6.0.
>
> Here is my pom setting:
>
>
> <cdh.spark.version>1.6.0-cdh5.7.0</cdh.spark.version>
> <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-core_2.10</artifactId>
>     <version>${cdh.spark.version}</version>
>     <scope>compile</scope>
> </dependency>
> <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-streaming_2.10</artifactId>
>     <version>${cdh.spark.version}</version>
>     <scope>compile</scope>
> </dependency>
> <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-sql_2.10</artifactId>
>     <version>${cdh.spark.version}</version>
>     <scope>compile</scope>
> </dependency>
> <dependency>
>     <groupId>org.apache.spark</groupId>
>     <artifactId>spark-streaming-kafka_2.10</artifactId>
>     <version>${cdh.spark.version}</version>
>     <scope>compile</scope>
> </dependency>
> <dependency>
>     <groupId>org.apache.kafka</groupId>
>     <artifactId>kafka_2.10</artifactId>
>     <version>0.8.2.1</version>
>     <scope>compile</scope>
> </dependency>
> 
>
> But trying to execute the application throws errors like below:
> Exception in thread "main" java.lang.NoClassDefFoundError:
> kafka/cluster/BrokerEndPoint
> at
> org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$2$$anonfun$3$$anonfun$apply$6$$anonfun$apply$7.apply(KafkaCluster.scala:90)
> at scala.Option.map(Option.scala:145)
> at
> org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$2$$anonfun$3$$anonfun$apply$6.apply(KafkaCluster.scala:90)
> at
> org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$2$$anonfun$3$$anonfun$apply$6.apply(KafkaCluster.scala:87)
> at
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
> at
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
> at
> org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$2$$anonfun$3.apply(KafkaCluster.scala:87)
> at
> org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$2$$anonfun$3.apply(KafkaCluster.scala:86)
> at
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at
> scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
> at scala.collection.immutable.Set$Set1.foreach(Set.scala:74)
> at
> scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
> at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
> at
> org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$2.apply(KafkaCluster.scala:86)
> at
> org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$2.apply(KafkaCluster.scala:85)
> at scala.util.Either$RightProjection.flatMap(Either.scala:523)
> at
> org.apache.spark.streaming.kafka.KafkaCluster.findLeaders(KafkaCluster.scala:85)
> at
> org.apache.spark.streaming.kafka.KafkaCluster.getLeaderOffsets(KafkaCluster.scala:179)
> at
> org.apache.spark.streaming.kafka.KafkaCluster.getLeaderOffsets(KafkaCluster.scala:161)
> at
> org.apache.spark.streaming.kafka.KafkaCluster.getLatestLeaderOffsets(KafkaCluster.scala:150)
> at
> org.apache.spark.streaming.kafka.KafkaUtils$$anonfun$5.apply(KafkaUtils.scala:215)
> at
> org.apache.spark.streaming.kafka.KafkaUtils$$anonfun$5.apply(KafkaUtils.scala:211)
> at scala.util.Either$RightProjection.flatMap(Either.scala:523)
> at
> org.apache.spark.streaming.kafka.KafkaUtils$.getFromOffsets(KafkaUtils.scala:211)
> at
> org.apache.spark.streaming.kafka.KafkaUtils$.createDirectStream(KafkaUtils.scala:484)
> at
> com.edgecast.engine.ConcurrentOps$.createDataStreamFromKafka(ConcurrentOps.scala:68)
> at
> com.edgecast.engine.ConcurrentOps$.startProcessing(ConcurrentOps.scala:32)
> at com.edgecast.engine.ProcessingEngine$.main(ProcessingEngine.scala:33)
> at com.edgecast.engine.ProcessingEngine.main(ProcessingEngine.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
> Caused by: java.lang.ClassNotFoundException: kafka.cluster.BrokerEndPoint
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
>   

Re: NullPointerException when starting StreamingContext

2016-06-23 Thread Sunita Arvind
Also, just to keep it simple, I am trying to use 1.6.0-cdh5.7.0 in the
pom.xml, as the cluster I am trying to run on is CDH 5.7.0 with Spark 1.6.0.

Here is my pom setting:


<cdh.spark.version>1.6.0-cdh5.7.0</cdh.spark.version>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>${cdh.spark.version}</version>
    <scope>compile</scope>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.10</artifactId>
    <version>${cdh.spark.version}</version>
    <scope>compile</scope>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.10</artifactId>
    <version>${cdh.spark.version}</version>
    <scope>compile</scope>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka_2.10</artifactId>
    <version>${cdh.spark.version}</version>
    <scope>compile</scope>
</dependency>
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka_2.10</artifactId>
    <version>0.8.2.1</version>
    <scope>compile</scope>
</dependency>


But trying to execute the application throws errors like below:
Exception in thread "main" java.lang.NoClassDefFoundError:
kafka/cluster/BrokerEndPoint
at
org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$2$$anonfun$3$$anonfun$apply$6$$anonfun$apply$7.apply(KafkaCluster.scala:90)
at scala.Option.map(Option.scala:145)
at
org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$2$$anonfun$3$$anonfun$apply$6.apply(KafkaCluster.scala:90)
at
org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$2$$anonfun$3$$anonfun$apply$6.apply(KafkaCluster.scala:87)
at
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
at
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at
org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$2$$anonfun$3.apply(KafkaCluster.scala:87)
at
org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$2$$anonfun$3.apply(KafkaCluster.scala:86)
at
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.immutable.Set$Set1.foreach(Set.scala:74)
at
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at
org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$2.apply(KafkaCluster.scala:86)
at
org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$2.apply(KafkaCluster.scala:85)
at scala.util.Either$RightProjection.flatMap(Either.scala:523)
at
org.apache.spark.streaming.kafka.KafkaCluster.findLeaders(KafkaCluster.scala:85)
at
org.apache.spark.streaming.kafka.KafkaCluster.getLeaderOffsets(KafkaCluster.scala:179)
at
org.apache.spark.streaming.kafka.KafkaCluster.getLeaderOffsets(KafkaCluster.scala:161)
at
org.apache.spark.streaming.kafka.KafkaCluster.getLatestLeaderOffsets(KafkaCluster.scala:150)
at
org.apache.spark.streaming.kafka.KafkaUtils$$anonfun$5.apply(KafkaUtils.scala:215)
at
org.apache.spark.streaming.kafka.KafkaUtils$$anonfun$5.apply(KafkaUtils.scala:211)
at scala.util.Either$RightProjection.flatMap(Either.scala:523)
at
org.apache.spark.streaming.kafka.KafkaUtils$.getFromOffsets(KafkaUtils.scala:211)
at
org.apache.spark.streaming.kafka.KafkaUtils$.createDirectStream(KafkaUtils.scala:484)
at
com.edgecast.engine.ConcurrentOps$.createDataStreamFromKafka(ConcurrentOps.scala:68)
at
com.edgecast.engine.ConcurrentOps$.startProcessing(ConcurrentOps.scala:32)
at com.edgecast.engine.ProcessingEngine$.main(ProcessingEngine.scala:33)
at com.edgecast.engine.ProcessingEngine.main(ProcessingEngine.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
Caused by: java.lang.ClassNotFoundException: kafka.cluster.BrokerEndPoint
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 38 more
16/06/23 11:09:53 INFO SparkContext: Invoking stop() from shutdown hook


I've tried Kafka versions 0.8.2.0, 0.8.2.2, and 0.9.0.0. With 0.9.0.0 the
processing hangs much sooner.
Can someone help with this error?

regards
Sunita

On Wed, Jun 22, 2016 at 8:20 PM, Sunita Arvind 
wrote:

> Hello Experts,
>
> I am getting this error repeatedly:
>
> 16/06/23 03:06:59 ERROR streaming.StreamingContext: Error 

Re: NullPointerException when starting StreamingContext

2016-06-22 Thread Ted Yu
Which Scala version / Spark release are you using ?

Cheers

On Wed, Jun 22, 2016 at 8:20 PM, Sunita Arvind 
wrote:

> Hello Experts,
>
> I am getting this error repeatedly:
>
> 16/06/23 03:06:59 ERROR streaming.StreamingContext: Error starting the 
> context, marking it as stopped
> java.lang.NullPointerException
>   at 
> com.typesafe.config.impl.SerializedConfigValue.writeOrigin(SerializedConfigValue.java:202)
>   at 
> com.typesafe.config.impl.ConfigImplUtil.writeOrigin(ConfigImplUtil.java:228)
>   at 
> com.typesafe.config.ConfigException.writeObject(ConfigException.java:58)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>   at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
>   at 
> java.io.ObjectOutputStream.defaultWriteObject(ObjectOutputStream.java:440)
>   at java.lang.Throwable.writeObject(Throwable.java:985)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:988)
>   at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1495)
>   at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
>   at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
>   at 
> java.io.ObjectOutputStream.writeFatalException(ObjectOutputStream.java:1576)
>   at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:350)
>   at 
> org.apache.spark.streaming.Checkpoint$$anonfun$serialize$1.apply$mcV$sp(Checkpoint.scala:141)
>   at 
> org.apache.spark.streaming.Checkpoint$$anonfun$serialize$1.apply(Checkpoint.scala:141)
>   at 
> org.apache.spark.streaming.Checkpoint$$anonfun$serialize$1.apply(Checkpoint.scala:141)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1251)
>   at 
> org.apache.spark.streaming.Checkpoint$.serialize(Checkpoint.scala:142)
>   at 
> org.apache.spark.streaming.StreamingContext.validate(StreamingContext.scala:554)
>   at 
> org.apache.spark.streaming.StreamingContext.liftedTree1$1(StreamingContext.scala:601)
>   at 
> org.apache.spark.streaming.StreamingContext.start(StreamingContext.scala:600)
>   at 
> com.edgecast.engine.ProcessingEngine$$anonfun$main$1.apply(ProcessingEngine.scala:73)
>   at 
> com.edgecast.engine.ProcessingEngine$$anonfun$main$1.apply(ProcessingEngine.scala:67)
>   at scala.collection.immutable.List.foreach(List.scala:318)
>   at com.edgecast.engine.ProcessingEngine$.main(ProcessingEngine.scala:67)
>   at com.edgecast.engine.ProcessingEngine.main(ProcessingEngine.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:542)
>
>
> It seems to be a typical issue. All I am doing here is as below:
>
> Object ProcessingEngine{
>
> def initializeSpark(customer:String):StreamingContext={
>   LogHandler.log.info("InitialeSpark")
>   val custConf = ConfigFactory.load(customer + 
> ".conf").getConfig(customer).withFallback(AppConf)
>   implicit val sparkConf: SparkConf = new SparkConf().setAppName(customer)
>   val ssc: StreamingContext = new StreamingContext(sparkConf, 
> Seconds(custConf.getLong("batchDurSec")))
>   ssc.checkpoint(custConf.getString("checkpointDir"))
>   ssc
> }
>
> def createDataStreamFromKafka(customer:String, ssc: 
> StreamingContext):DStream[Array[Byte]]={
>   val custConf = ConfigFactory.load(customer + 
> ".conf").getConfig(customer).withFallback(ConfigFactory.load())
>   LogHandler.log.info("createDataStreamFromKafka")
>   KafkaUtils.createDirectStream[String,
> Array[Byte],
> StringDecoder,
> DefaultDecoder](
> ssc,
> Map[String, String]("metadata.broker.list" -> 
> 

Re: NullPointerException

2016-03-12 Thread saurabh guru
I don't see how that would be possible. I am reading from a live stream of
data through kafka.

On Sat 12 Mar, 2016 20:28 Ted Yu,  wrote:

> Interesting.
> If kv._1 was null, shouldn't the NPE have come from getPartition() (line
> 105) ?
>
> Was it possible that records.next() returned null ?
>
> On Fri, Mar 11, 2016 at 11:20 PM, Prabhu Joseph <
> prabhujose.ga...@gmail.com> wrote:
>
>> Looking at ExternalSorter.scala line 192, i suspect some input record has
>> Null key.
>>
>> 189  while (records.hasNext) {
>> 190addElementsRead()
>> 191kv = records.next()
>> 192map.changeValue((getPartition(kv._1), kv._1), update)
>>
>>
>>
>> On Sat, Mar 12, 2016 at 12:48 PM, Prabhu Joseph <
>> prabhujose.ga...@gmail.com> wrote:
>>
>>> Looking at ExternalSorter.scala line 192
>>>
>>> 189
>>> while (records.hasNext) { addElementsRead() kv = records.next()
>>> map.changeValue((getPartition(kv._1), kv._1), update)
>>> maybeSpillCollection(usingMap = true) }
>>>
>>> On Sat, Mar 12, 2016 at 12:31 PM, Saurabh Guru 
>>> wrote:
>>>
 I am seeing the following exception in my Spark Cluster every few days
 in production.

 2016-03-12 05:30:00,541 - WARN  TaskSetManager - Lost task 0.0 in stage
 12528.0 (TID 18792, ip-1X-1XX-1-1XX.us 
 -west-1.compute.internal
 ): java.lang.NullPointerException
at
 org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:192)
at
 org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)


 I have debugged in local machine but haven’t been able to pin point the
 cause of the error. Anyone knows why this might occur? Any suggestions?


 Thanks,
 Saurabh




>>>
>>
>


Re: NullPointerException

2016-03-12 Thread Ted Yu
Interesting.
If kv._1 was null, shouldn't the NPE have come from getPartition() (line
105) ?

Was it possible that records.next() returned null ?

On Fri, Mar 11, 2016 at 11:20 PM, Prabhu Joseph 
wrote:

> Looking at ExternalSorter.scala line 192, i suspect some input record has
> Null key.
>
> 189  while (records.hasNext) {
> 190addElementsRead()
> 191kv = records.next()
> 192map.changeValue((getPartition(kv._1), kv._1), update)
>
>
>
> On Sat, Mar 12, 2016 at 12:48 PM, Prabhu Joseph <
> prabhujose.ga...@gmail.com> wrote:
>
>> Looking at ExternalSorter.scala line 192
>>
>> 189
>> while (records.hasNext) { addElementsRead() kv = records.next()
>> map.changeValue((getPartition(kv._1), kv._1), update)
>> maybeSpillCollection(usingMap = true) }
>>
>> On Sat, Mar 12, 2016 at 12:31 PM, Saurabh Guru 
>> wrote:
>>
>>> I am seeing the following exception in my Spark Cluster every few days
>>> in production.
>>>
>>> 2016-03-12 05:30:00,541 - WARN  TaskSetManager - Lost task 0.0 in stage
>>> 12528.0 (TID 18792, ip-1X-1XX-1-1XX.us 
>>> -west-1.compute.internal
>>> ): java.lang.NullPointerException
>>>at
>>> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:192)
>>>at
>>> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
>>>at
>>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>>>at
>>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>>at org.apache.spark.scheduler.Task.run(Task.scala:89)
>>>at
>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>>>at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>>at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>>at java.lang.Thread.run(Thread.java:745)
>>>
>>>
>>> I have debugged in local machine but haven’t been able to pin point the
>>> cause of the error. Anyone knows why this might occur? Any suggestions?
>>>
>>>
>>> Thanks,
>>> Saurabh
>>>
>>>
>>>
>>>
>>
>


Re: NullPointerException

2016-03-11 Thread Saurabh Guru
I am using the following versions:


<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming_2.10</artifactId>
    <version>1.6.0</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka_2.10</artifactId>
    <version>1.6.0</version>
</dependency>
<dependency>
    <groupId>org.elasticsearch</groupId>
    <artifactId>elasticsearch-spark_2.10</artifactId>
    <version>2.2.0</version>
</dependency>


Thanks,
Saurabh

:)



> On 12-Mar-2016, at 12:56 PM, Ted Yu  wrote:
> 
> Which Spark release do you use ?
> 
> I wonder if the following may have fixed the problem:
> SPARK-8029 Robust shuffle writer
> 
> JIRA is down, cannot check now.
> 
> On Fri, Mar 11, 2016 at 11:01 PM, Saurabh Guru  > wrote:
> I am seeing the following exception in my Spark Cluster every few days in 
> production.
> 
> 2016-03-12 05:30:00,541 - WARN  TaskSetManager - Lost task 0.0 in stage 
> 12528.0 (TID 18792, ip-1X-1XX-1-1XX.us 
> -west-1.compute.internal
> ): java.lang.NullPointerException
>at 
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:192)
>at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
>at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>at org.apache.spark.scheduler.Task.run(Task.scala:89)
>at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>at java.lang.Thread.run(Thread.java:745)
> 
> 
> I have debugged in local machine but haven’t been able to pin point the cause 
> of the error. Anyone knows why this might occur? Any suggestions? 
> 
> 
> Thanks,
> Saurabh
> 
> 
> 
> 



Re: NullPointerException

2016-03-11 Thread Ted Yu
Which Spark release do you use ?

I wonder if the following may have fixed the problem:
SPARK-8029 Robust shuffle writer

JIRA is down, cannot check now.

On Fri, Mar 11, 2016 at 11:01 PM, Saurabh Guru 
wrote:

> I am seeing the following exception in my Spark Cluster every few days in
> production.
>
> 2016-03-12 05:30:00,541 - WARN  TaskSetManager - Lost task 0.0 in stage
> 12528.0 (TID 18792, ip-1X-1XX-1-1XX.us 
> -west-1.compute.internal
> ): java.lang.NullPointerException
>at
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:192)
>at
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
>at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>at org.apache.spark.scheduler.Task.run(Task.scala:89)
>at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>at java.lang.Thread.run(Thread.java:745)
>
>
> I have debugged in local machine but haven’t been able to pin point the
> cause of the error. Anyone knows why this might occur? Any suggestions?
>
>
> Thanks,
> Saurabh
>
>
>
>


Re: NullPointerException

2016-03-11 Thread Prabhu Joseph
Looking at ExternalSorter.scala line 192, I suspect some input record has a
null key.

189    while (records.hasNext) {
190      addElementsRead()
191      kv = records.next()
192      map.changeValue((getPartition(kv._1), kv._1), update)
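
If that is the cause, a quick way to confirm it on the application side is to
check for null keys before the keyed shuffle, e.g. with a small helper like
this (a sketch; whether to drop or remap null keys is up to you):

    import org.apache.spark.rdd.RDD

    // Count and drop records whose key is null before a keyed shuffle.
    def dropNullKeys[K, V](rdd: RDD[(K, V)]): RDD[(K, V)] = {
      val nullKeyCount = rdd.filter { case (k, _) => k == null }.count()
      if (nullKeyCount > 0)
        println(s"Dropping $nullKeyCount records with null keys")
      rdd.filter { case (k, _) => k != null }
    }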



On Sat, Mar 12, 2016 at 12:48 PM, Prabhu Joseph 
wrote:

> Looking at ExternalSorter.scala line 192
>
> 189
> while (records.hasNext) { addElementsRead() kv = records.next()
> map.changeValue((getPartition(kv._1), kv._1), update)
> maybeSpillCollection(usingMap = true) }
>
> On Sat, Mar 12, 2016 at 12:31 PM, Saurabh Guru 
> wrote:
>
>> I am seeing the following exception in my Spark Cluster every few days in
>> production.
>>
>> 2016-03-12 05:30:00,541 - WARN  TaskSetManager - Lost task 0.0 in stage
>> 12528.0 (TID 18792, ip-1X-1XX-1-1XX.us 
>> -west-1.compute.internal
>> ): java.lang.NullPointerException
>>at
>> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:192)
>>at
>> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
>>at
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>>at
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>>at org.apache.spark.scheduler.Task.run(Task.scala:89)
>>at
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>>at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>at java.lang.Thread.run(Thread.java:745)
>>
>>
>> I have debugged in local machine but haven’t been able to pin point the
>> cause of the error. Anyone knows why this might occur? Any suggestions?
>>
>>
>> Thanks,
>> Saurabh
>>
>>
>>
>>
>


Re: NullPointerException

2016-03-11 Thread Prabhu Joseph
Looking at ExternalSorter.scala line 192

189    while (records.hasNext) {
190      addElementsRead()
191      kv = records.next()
192      map.changeValue((getPartition(kv._1), kv._1), update)
193      maybeSpillCollection(usingMap = true)
194    }

On Sat, Mar 12, 2016 at 12:31 PM, Saurabh Guru 
wrote:

> I am seeing the following exception in my Spark Cluster every few days in
> production.
>
> 2016-03-12 05:30:00,541 - WARN  TaskSetManager - Lost task 0.0 in stage
> 12528.0 (TID 18792, ip-1X-1XX-1-1XX.us 
> -west-1.compute.internal
> ): java.lang.NullPointerException
>at
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:192)
>at
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
>at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>at
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>at org.apache.spark.scheduler.Task.run(Task.scala:89)
>at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>at java.lang.Thread.run(Thread.java:745)
>
>
> I have debugged in local machine but haven’t been able to pin point the
> cause of the error. Anyone knows why this might occur? Any suggestions?
>
>
> Thanks,
> Saurabh
>
>
>
>


Re: NullPointerException with joda time

2015-11-12 Thread Romain Sagean
I still can't make the logger work inside a map function. I can use
logInfo("") in the main but not in the function. Anyway, I rewrote my
program to use java.util.Date instead of Joda-Time and I don't have the NPE
anymore.

I will stick with this solution for the moment even if I find java.util.Date
ugly.
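
For what it's worth, the rewritten date-range helper looks roughly like this
(a sketch of the java.util.Calendar approach, not the exact code):

    import java.util.{Calendar, Date}
    import scala.collection.mutable.ArrayBuffer

    // Same idea as the Joda-based allDates, but built on java.util.Date
    def allDates(dateStart: Date, dateEnd: Date): Seq[Date] = {
      val cal = Calendar.getInstance()
      cal.setTime(dateStart)
      val dates = ArrayBuffer(dateStart)
      while (cal.getTime.before(dateEnd)) {
        cal.add(Calendar.DAY_OF_MONTH, 1)   // step one day at a time
        dates += cal.getTime                // getTime returns a fresh Date
      }
      dates
    }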

Thanks for your help.

2015-11-11 15:54 GMT+01:00 Ted Yu :

> In case you need to adjust log4j properties, see the following thread:
>
>
> http://search-hadoop.com/m/q3RTtJHkzb1t0J66=Re+Spark+Streaming+Log4j+Inside+Eclipse
>
> Cheers
>
> On Tue, Nov 10, 2015 at 1:28 PM, Ted Yu  wrote:
>
>> I took a look at
>> https://github.com/JodaOrg/joda-time/blob/master/src/main/java/org/joda/time/DateTime.java
>> Looks like the NPE came from line below:
>>
>> long instant = getChronology().days().add(getMillis(), days);
>> Maybe catch the NPE and print out the value of currentDate to see if
>> there is more clue ?
>>
>> Cheers
>>
>> On Tue, Nov 10, 2015 at 12:55 PM, Romain Sagean 
>> wrote:
>>
>>> see below a more complete version of the code.
>>> the firstDate (previously minDate) should not be null, I even added an 
>>> extra "filter( _._2 != null)" before the flatMap and the error is still 
>>> there.
>>>
>>> What I don't understand is why I have the error on dateSeq.las.plusDays and 
>>> not on dateSeq.last.isBefore (in the condition).
>>>
>>> I also tried changing the allDates function to use a while loop but i got 
>>> the same error.
>>>
>>>   def allDates(dateStart: DateTime, dateEnd: DateTime): Seq[DateTime] = {
>>> var dateSeq = Seq(dateStart)
>>> var currentDate = dateStart
>>> while (currentDate.isBefore(dateEnd)){
>>>   dateSeq = dateSeq :+ currentDate
>>>   currentDate = currentDate.plusDays(1)
>>> }
>>> return dateSeq
>>>   }
>>>
>>> val videoAllDates = events.select("player_id", "current_ts")  
>>> .filter("player_id is not null")  .filter("current_ts is not null") 
>>>  .map( row => (row.getString(0), timestampToDate(row.getString(1  
>>> .filter(r => r._2.isAfter(minimumDate))  .reduceByKey(minDateTime)  
>>> .flatMapValues( firstDate => allDates(firstDate, endDate))
>>>
>>>
>>> And the stack trace.
>>>
>>> 15/11/10 21:10:36 INFO MapOutputTrackerMasterActor: Asked to send map
>>> output locations for shuffle 2 to sparkexecu...@r610-2.pro.hupi.loc
>>> :50821
>>> 15/11/10 21:10:36 INFO MapOutputTrackerMaster: Size of output statuses
>>> for shuffle 2 is 695 bytes
>>> 15/11/10 21:10:36 INFO MapOutputTrackerMasterActor: Asked to send map
>>> output locations for shuffle 1 to sparkexecu...@r610-2.pro.hupi.loc
>>> :50821
>>> 15/11/10 21:10:36 INFO MapOutputTrackerMaster: Size of output statuses
>>> for shuffle 1 is 680 bytes
>>> 15/11/10 21:10:36 INFO TaskSetManager: Starting task 206.0 in stage 3.0
>>> (TID 798, R610-2.pro.hupi.loc, PROCESS_LOCAL, 4416 bytes)
>>> 15/11/10 21:10:36 WARN TaskSetManager: Lost task 205.0 in stage 3.0 (TID
>>> 797, R610-2.pro.hupi.loc): java.lang.NullPointerException
>>> at org.joda.time.DateTime.plusDays(DateTime.java:1070)
>>> at Heatmap$.allDates(heatmap.scala:34)
>>> at Heatmap$$anonfun$12.apply(heatmap.scala:97)
>>> at Heatmap$$anonfun$12.apply(heatmap.scala:97)
>>> at
>>> org.apache.spark.rdd.PairRDDFunctions$$anonfun$flatMapValues$1$$anonfun$apply$16.apply(PairRDDFunctions.scala:686)
>>> at
>>> org.apache.spark.rdd.PairRDDFunctions$$anonfun$flatMapValues$1$$anonfun$apply$16.apply(PairRDDFunctions.scala:685)
>>> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>>> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>>> at
>>> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125)
>>> at
>>> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:160)
>>> at
>>> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
>>> at
>>> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>>> at
>>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>>> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>>> at
>>> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
>>> at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:159)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>>> at
>>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>>> at
>>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>>> at 

Re: NullPointerException with joda time

2015-11-12 Thread Ted Yu
Even if log4j didn't work, you can still get some clue by wrapping the
following call in a try block:

  currentDate = currentDate.plusDays(1)

catching the NPE and rethrowing an exception that shows the value of
currentDate.
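
i.e. something along these lines (a sketch -- the wrapper exception and its
message are just one way to surface the value in the executor log):

    val next =
      try {
        currentDate.plusDays(1)
      } catch {
        case npe: NullPointerException =>
          // Rethrow with the offending value so it shows up in the task failure
          throw new RuntimeException(s"plusDays failed, currentDate=$currentDate", npe)
      }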


Cheers


On Thu, Nov 12, 2015 at 1:56 AM, Romain Sagean 
wrote:

> I Still can't make the logger work inside a map function. I can use
> "logInfo("")" in the main but not in the function. Anyway I rewrite my
> program to use java.util.Date instead joda time and I don't have NPE
> anymore.
>
> I will stick with this solution for the moment even if I find java Date
> ugly.
>
> Thanks for your help.
>
> 2015-11-11 15:54 GMT+01:00 Ted Yu :
>
>> In case you need to adjust log4j properties, see the following thread:
>>
>>
>> http://search-hadoop.com/m/q3RTtJHkzb1t0J66=Re+Spark+Streaming+Log4j+Inside+Eclipse
>>
>> Cheers
>>
>> On Tue, Nov 10, 2015 at 1:28 PM, Ted Yu  wrote:
>>
>>> I took a look at
>>> https://github.com/JodaOrg/joda-time/blob/master/src/main/java/org/joda/time/DateTime.java
>>> Looks like the NPE came from line below:
>>>
>>> long instant = getChronology().days().add(getMillis(), days);
>>> Maybe catch the NPE and print out the value of currentDate to see if
>>> there is more clue ?
>>>
>>> Cheers
>>>
>>> On Tue, Nov 10, 2015 at 12:55 PM, Romain Sagean 
>>> wrote:
>>>
 see below a more complete version of the code.
 the firstDate (previously minDate) should not be null, I even added an 
 extra "filter( _._2 != null)" before the flatMap and the error is still 
 there.

 What I don't understand is why I have the error on dateSeq.las.plusDays 
 and not on dateSeq.last.isBefore (in the condition).

 I also tried changing the allDates function to use a while loop but i got 
 the same error.

   def allDates(dateStart: DateTime, dateEnd: DateTime): Seq[DateTime] = {
 var dateSeq = Seq(dateStart)
 var currentDate = dateStart
 while (currentDate.isBefore(dateEnd)){
   dateSeq = dateSeq :+ currentDate
   currentDate = currentDate.plusDays(1)
 }
 return dateSeq
   }

 val videoAllDates = events.select("player_id", "current_ts")  
 .filter("player_id is not null")  .filter("current_ts is not null")
   .map( row => (row.getString(0), timestampToDate(row.getString(1  
 .filter(r => r._2.isAfter(minimumDate))  .reduceByKey(minDateTime) 
  .flatMapValues( firstDate => allDates(firstDate, endDate))


 And the stack trace.

 15/11/10 21:10:36 INFO MapOutputTrackerMasterActor: Asked to send map
 output locations for shuffle 2 to sparkexecu...@r610-2.pro.hupi.loc
 :50821
 15/11/10 21:10:36 INFO MapOutputTrackerMaster: Size of output statuses
 for shuffle 2 is 695 bytes
 15/11/10 21:10:36 INFO MapOutputTrackerMasterActor: Asked to send map
 output locations for shuffle 1 to sparkexecu...@r610-2.pro.hupi.loc
 :50821
 15/11/10 21:10:36 INFO MapOutputTrackerMaster: Size of output statuses
 for shuffle 1 is 680 bytes
 15/11/10 21:10:36 INFO TaskSetManager: Starting task 206.0 in stage 3.0
 (TID 798, R610-2.pro.hupi.loc, PROCESS_LOCAL, 4416 bytes)
 15/11/10 21:10:36 WARN TaskSetManager: Lost task 205.0 in stage 3.0
 (TID 797, R610-2.pro.hupi.loc): java.lang.NullPointerException
 at org.joda.time.DateTime.plusDays(DateTime.java:1070)
 at Heatmap$.allDates(heatmap.scala:34)
 at Heatmap$$anonfun$12.apply(heatmap.scala:97)
 at Heatmap$$anonfun$12.apply(heatmap.scala:97)
 at
 org.apache.spark.rdd.PairRDDFunctions$$anonfun$flatMapValues$1$$anonfun$apply$16.apply(PairRDDFunctions.scala:686)
 at
 org.apache.spark.rdd.PairRDDFunctions$$anonfun$flatMapValues$1$$anonfun$apply$16.apply(PairRDDFunctions.scala:685)
 at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
 at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
 at
 org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125)
 at
 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:160)
 at
 org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
 at
 scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
 at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at
 scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at
 scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
 at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:159)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
 at 

Re: NullPointerException with joda time

2015-11-12 Thread Koert Kuipers
I remember us having issues with Joda classes not serializing properly and
coming out null "on the other side" in tasks.
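
One workaround that is sometimes used for this (a sketch, not from this thread; it
assumes the de.javakaffee "kryo-serializers" artifact is on the classpath) is to
register an explicit Kryo serializer for the Joda classes that travel inside RDD
records:

    import com.esotericsoftware.kryo.Kryo
    import de.javakaffee.kryoserializers.jodatime.JodaDateTimeSerializer
    import org.apache.spark.SparkConf
    import org.apache.spark.serializer.KryoRegistrator

    // Give org.joda.time.DateTime a real serializer so it does not come out null
    // (or corrupted) after being shipped to executors.
    class JodaRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo): Unit = {
        kryo.register(classOf[org.joda.time.DateTime], new JodaDateTimeSerializer())
      }
    }

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", classOf[JodaRegistrator].getName)

Note this only affects records stored in RDDs and shuffles; objects captured in
closures still go through Java serialization.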

On Thu, Nov 12, 2015 at 10:12 AM, Ted Yu  wrote:

> Even if log4j didn't work, you can still get some clue by wrapping the
> following call with try block:
>
>   currentDate = currentDate.plusDays(1)
>
> catching NPE and rethrowing with an exception that shows the value of 
> currentDate
>
>
> Cheers
>
>
> On Thu, Nov 12, 2015 at 1:56 AM, Romain Sagean 
> wrote:
>
>> I still can't make the logger work inside a map function. I can use
>> "logInfo("")" in the main but not in the function. Anyway, I rewrote my
>> program to use java.util.Date instead of Joda-Time and I don't have the NPE
>> anymore.
>>
>> I will stick with this solution for the moment even if I find java Date
>> ugly.
>>
>> Thanks for your help.
>>
>> 2015-11-11 15:54 GMT+01:00 Ted Yu :
>>
>>> In case you need to adjust log4j properties, see the following thread:
>>>
>>>
>>> http://search-hadoop.com/m/q3RTtJHkzb1t0J66=Re+Spark+Streaming+Log4j+Inside+Eclipse
>>>
>>> Cheers
>>>
>>> On Tue, Nov 10, 2015 at 1:28 PM, Ted Yu  wrote:
>>>
 I took a look at
 https://github.com/JodaOrg/joda-time/blob/master/src/main/java/org/joda/time/DateTime.java
 Looks like the NPE came from line below:

 long instant = getChronology().days().add(getMillis(), days);
 Maybe catch the NPE and print out the value of currentDate to see if
 there is more clue ?

 Cheers

 On Tue, Nov 10, 2015 at 12:55 PM, Romain Sagean 
 wrote:

> see below a more complete version of the code.
> the firstDate (previously minDate) should not be null, I even added an 
> extra "filter( _._2 != null)" before the flatMap and the error is still 
> there.
>
> What I don't understand is why I have the error on dateSeq.las.plusDays 
> and not on dateSeq.last.isBefore (in the condition).
>
> I also tried changing the allDates function to use a while loop but i got 
> the same error.
>
>   def allDates(dateStart: DateTime, dateEnd: DateTime): Seq[DateTime] = {
> var dateSeq = Seq(dateStart)
> var currentDate = dateStart
> while (currentDate.isBefore(dateEnd)){
>   dateSeq = dateSeq :+ currentDate
>   currentDate = currentDate.plusDays(1)
> }
> return dateSeq
>   }
>
> val videoAllDates = events.select("player_id", "current_ts")
>   .filter("player_id is not null")
>   .filter("current_ts is not null")
>   .map(row => (row.getString(0), timestampToDate(row.getString(1))))
>   .filter(r => r._2.isAfter(minimumDate))
>   .reduceByKey(minDateTime)
>   .flatMapValues(firstDate => allDates(firstDate, endDate))
>
>
> And the stack trace.
>
> 15/11/10 21:10:36 INFO MapOutputTrackerMasterActor: Asked to send map
> output locations for shuffle 2 to sparkexecu...@r610-2.pro.hupi.loc
> :50821
> 15/11/10 21:10:36 INFO MapOutputTrackerMaster: Size of output statuses
> for shuffle 2 is 695 bytes
> 15/11/10 21:10:36 INFO MapOutputTrackerMasterActor: Asked to send map
> output locations for shuffle 1 to sparkexecu...@r610-2.pro.hupi.loc
> :50821
> 15/11/10 21:10:36 INFO MapOutputTrackerMaster: Size of output statuses
> for shuffle 1 is 680 bytes
> 15/11/10 21:10:36 INFO TaskSetManager: Starting task 206.0 in stage
> 3.0 (TID 798, R610-2.pro.hupi.loc, PROCESS_LOCAL, 4416 bytes)
> 15/11/10 21:10:36 WARN TaskSetManager: Lost task 205.0 in stage 3.0
> (TID 797, R610-2.pro.hupi.loc): java.lang.NullPointerException
> at org.joda.time.DateTime.plusDays(DateTime.java:1070)
> at Heatmap$.allDates(heatmap.scala:34)
> at Heatmap$$anonfun$12.apply(heatmap.scala:97)
> at Heatmap$$anonfun$12.apply(heatmap.scala:97)
> at
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$flatMapValues$1$$anonfun$apply$16.apply(PairRDDFunctions.scala:686)
> at
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$flatMapValues$1$$anonfun$apply$16.apply(PairRDDFunctions.scala:685)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125)
> at
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:160)
> at
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
> at
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
> at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at
> 

Re: NullPointerException with joda time

2015-11-11 Thread Ted Yu
In case you need to adjust log4j properties, see the following thread:

http://search-hadoop.com/m/q3RTtJHkzb1t0J66=Re+Spark+Streaming+Log4j+Inside+Eclipse
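
For the related point raised elsewhere in this thread about logging from inside a map
function, a common pattern (a sketch, not from this thread) is to let each executor
create its own logger lazily instead of serializing one from the driver:

    import org.apache.log4j.Logger

    // @transient lazy means the logger is created on whichever JVM runs the task.
    object MapSideLog extends Serializable {
      @transient lazy val log: Logger = Logger.getLogger(getClass.getName)
    }

    rdd.map { record =>   // `rdd` and `record` are placeholders
      MapSideLog.log.info("processing " + record)
      record
    }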

Cheers

On Tue, Nov 10, 2015 at 1:28 PM, Ted Yu  wrote:

> I took a look at
> https://github.com/JodaOrg/joda-time/blob/master/src/main/java/org/joda/time/DateTime.java
> Looks like the NPE came from line below:
>
> long instant = getChronology().days().add(getMillis(), days);
> Maybe catch the NPE and print out the value of currentDate to see if
> there is more clue ?
>
> Cheers
>
> On Tue, Nov 10, 2015 at 12:55 PM, Romain Sagean 
> wrote:
>
>> see below a more complete version of the code.
>> the firstDate (previously minDate) should not be null, I even added an extra 
>> "filter( _._2 != null)" before the flatMap and the error is still there.
>>
>> What I don't understand is why I have the error on dateSeq.las.plusDays and 
>> not on dateSeq.last.isBefore (in the condition).
>>
>> I also tried changing the allDates function to use a while loop but i got 
>> the same error.
>>
>>   def allDates(dateStart: DateTime, dateEnd: DateTime): Seq[DateTime] = {
>> var dateSeq = Seq(dateStart)
>> var currentDate = dateStart
>> while (currentDate.isBefore(dateEnd)){
>>   dateSeq = dateSeq :+ currentDate
>>   currentDate = currentDate.plusDays(1)
>> }
>> return dateSeq
>>   }
>>
>> val videoAllDates = events.select("player_id", "current_ts")
>>   .filter("player_id is not null")
>>   .filter("current_ts is not null")
>>   .map(row => (row.getString(0), timestampToDate(row.getString(1))))
>>   .filter(r => r._2.isAfter(minimumDate))
>>   .reduceByKey(minDateTime)
>>   .flatMapValues(firstDate => allDates(firstDate, endDate))
>>
>>
>> And the stack trace.
>>
>> 15/11/10 21:10:36 INFO MapOutputTrackerMasterActor: Asked to send map
>> output locations for shuffle 2 to sparkexecu...@r610-2.pro.hupi.loc:50821
>> 15/11/10 21:10:36 INFO MapOutputTrackerMaster: Size of output statuses
>> for shuffle 2 is 695 bytes
>> 15/11/10 21:10:36 INFO MapOutputTrackerMasterActor: Asked to send map
>> output locations for shuffle 1 to sparkexecu...@r610-2.pro.hupi.loc:50821
>> 15/11/10 21:10:36 INFO MapOutputTrackerMaster: Size of output statuses
>> for shuffle 1 is 680 bytes
>> 15/11/10 21:10:36 INFO TaskSetManager: Starting task 206.0 in stage 3.0
>> (TID 798, R610-2.pro.hupi.loc, PROCESS_LOCAL, 4416 bytes)
>> 15/11/10 21:10:36 WARN TaskSetManager: Lost task 205.0 in stage 3.0 (TID
>> 797, R610-2.pro.hupi.loc): java.lang.NullPointerException
>> at org.joda.time.DateTime.plusDays(DateTime.java:1070)
>> at Heatmap$.allDates(heatmap.scala:34)
>> at Heatmap$$anonfun$12.apply(heatmap.scala:97)
>> at Heatmap$$anonfun$12.apply(heatmap.scala:97)
>> at
>> org.apache.spark.rdd.PairRDDFunctions$$anonfun$flatMapValues$1$$anonfun$apply$16.apply(PairRDDFunctions.scala:686)
>> at
>> org.apache.spark.rdd.PairRDDFunctions$$anonfun$flatMapValues$1$$anonfun$apply$16.apply(PairRDDFunctions.scala:685)
>> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
>> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>> at
>> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125)
>> at
>> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:160)
>> at
>> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
>> at
>> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>> at
>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>> at
>> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
>> at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:159)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>> at
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>> at
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>> at
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>> at
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
>> at
>> 

Re: NullPointerException with joda time

2015-11-10 Thread Ted Yu
Can you show the stack trace for the NPE ?

Which release of Spark are you using ?

Cheers

On Tue, Nov 10, 2015 at 8:20 AM, romain sagean 
wrote:

> Hi community,
> I try to apply the function below during a flatMapValues or a map but I
> get a nullPointerException with the plusDays(1). What did I miss ?
>
> def allDates(dateSeq: Seq[DateTime], dateEnd: DateTime): Seq[DateTime] = {
> if (dateSeq.last.isBefore(dateEnd)){
>   allDates(dateSeq:+ dateSeq.last.plusDays(1), dateEnd)
> } else {
>   dateSeq
> }
>   }
>
> val videoAllDates = .select("player_id", "mindate").flatMapValues( minDate
> => allDates(Seq(minDate), endDate))
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: NullPointerException with joda time

2015-11-10 Thread Romain Sagean
See below a more complete version of the code.
The firstDate (previously minDate) should not be null; I even added an
extra "filter( _._2 != null)" before the flatMap and the error is
still there.

What I don't understand is why I have the error on
dateSeq.last.plusDays and not on dateSeq.last.isBefore (in the
condition).

I also tried changing the allDates function to use a while loop but I
got the same error.

  def allDates(dateStart: DateTime, dateEnd: DateTime): Seq[DateTime] = {
var dateSeq = Seq(dateStart)
var currentDate = dateStart
while (currentDate.isBefore(dateEnd)){
  dateSeq = dateSeq :+ currentDate
  currentDate = currentDate.plusDays(1)
}
return dateSeq
  }

val videoAllDates = events.select("player_id", "current_ts")
  .filter("player_id is not null")
  .filter("current_ts is not null")
  .map(row => (row.getString(0), timestampToDate(row.getString(1))))
  .filter(r => r._2.isAfter(minimumDate))
  .reduceByKey(minDateTime)
  .flatMapValues(firstDate => allDates(firstDate, endDate))
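
A sketch of the workaround mentioned elsewhere in the thread (replacing Joda's DateTime
with java.util.Date/Calendar, the switch described later in this thread as making the
NPE go away):

    import java.util.{Calendar, Date}

    def allDates(dateStart: Date, dateEnd: Date): Seq[Date] = {
      val cal = Calendar.getInstance()
      cal.setTime(dateStart)
      var dates = Seq.empty[Date]
      while (cal.getTime.before(dateEnd)) {
        dates = dates :+ cal.getTime      // getTime returns a fresh Date each call
        cal.add(Calendar.DAY_OF_MONTH, 1)
      }
      dates
    }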


And the stack trace.

15/11/10 21:10:36 INFO MapOutputTrackerMasterActor: Asked to send map
output locations for shuffle 2 to sparkexecu...@r610-2.pro.hupi.loc:50821
15/11/10 21:10:36 INFO MapOutputTrackerMaster: Size of output statuses for
shuffle 2 is 695 bytes
15/11/10 21:10:36 INFO MapOutputTrackerMasterActor: Asked to send map
output locations for shuffle 1 to sparkexecu...@r610-2.pro.hupi.loc:50821
15/11/10 21:10:36 INFO MapOutputTrackerMaster: Size of output statuses for
shuffle 1 is 680 bytes
15/11/10 21:10:36 INFO TaskSetManager: Starting task 206.0 in stage 3.0
(TID 798, R610-2.pro.hupi.loc, PROCESS_LOCAL, 4416 bytes)
15/11/10 21:10:36 WARN TaskSetManager: Lost task 205.0 in stage 3.0 (TID
797, R610-2.pro.hupi.loc): java.lang.NullPointerException
at org.joda.time.DateTime.plusDays(DateTime.java:1070)
at Heatmap$.allDates(heatmap.scala:34)
at Heatmap$$anonfun$12.apply(heatmap.scala:97)
at Heatmap$$anonfun$12.apply(heatmap.scala:97)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$flatMapValues$1$$anonfun$apply$16.apply(PairRDDFunctions.scala:686)
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$flatMapValues$1$$anonfun$apply$16.apply(PairRDDFunctions.scala:685)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
at
org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125)
at
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:160)
at
org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
at
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:159)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

15/11/10 

Re: NullPointerException with joda time

2015-11-10 Thread Ted Yu
I took a look at
https://github.com/JodaOrg/joda-time/blob/master/src/main/java/org/joda/time/DateTime.java
Looks like the NPE came from line below:

long instant = getChronology().days().add(getMillis(), days);
Maybe catch the NPE and print out the value of currentDate to see if there
is any more of a clue?

Cheers

On Tue, Nov 10, 2015 at 12:55 PM, Romain Sagean 
wrote:

> see below a more complete version of the code.
> the firstDate (previously minDate) should not be null, I even added an extra 
> "filter( _._2 != null)" before the flatMap and the error is still there.
>
> What I don't understand is why I have the error on dateSeq.las.plusDays and 
> not on dateSeq.last.isBefore (in the condition).
>
> I also tried changing the allDates function to use a while loop but i got the 
> same error.
>
>   def allDates(dateStart: DateTime, dateEnd: DateTime): Seq[DateTime] = {
> var dateSeq = Seq(dateStart)
> var currentDate = dateStart
> while (currentDate.isBefore(dateEnd)){
>   dateSeq = dateSeq :+ currentDate
>   currentDate = currentDate.plusDays(1)
> }
> return dateSeq
>   }
>
> val videoAllDates = events.select("player_id", "current_ts")
>   .filter("player_id is not null")
>   .filter("current_ts is not null")
>   .map(row => (row.getString(0), timestampToDate(row.getString(1))))
>   .filter(r => r._2.isAfter(minimumDate))
>   .reduceByKey(minDateTime)
>   .flatMapValues(firstDate => allDates(firstDate, endDate))
>
>
> And the stack trace.
>
> 15/11/10 21:10:36 INFO MapOutputTrackerMasterActor: Asked to send map
> output locations for shuffle 2 to sparkexecu...@r610-2.pro.hupi.loc:50821
> 15/11/10 21:10:36 INFO MapOutputTrackerMaster: Size of output statuses for
> shuffle 2 is 695 bytes
> 15/11/10 21:10:36 INFO MapOutputTrackerMasterActor: Asked to send map
> output locations for shuffle 1 to sparkexecu...@r610-2.pro.hupi.loc:50821
> 15/11/10 21:10:36 INFO MapOutputTrackerMaster: Size of output statuses for
> shuffle 1 is 680 bytes
> 15/11/10 21:10:36 INFO TaskSetManager: Starting task 206.0 in stage 3.0
> (TID 798, R610-2.pro.hupi.loc, PROCESS_LOCAL, 4416 bytes)
> 15/11/10 21:10:36 WARN TaskSetManager: Lost task 205.0 in stage 3.0 (TID
> 797, R610-2.pro.hupi.loc): java.lang.NullPointerException
> at org.joda.time.DateTime.plusDays(DateTime.java:1070)
> at Heatmap$.allDates(heatmap.scala:34)
> at Heatmap$$anonfun$12.apply(heatmap.scala:97)
> at Heatmap$$anonfun$12.apply(heatmap.scala:97)
> at
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$flatMapValues$1$$anonfun$apply$16.apply(PairRDDFunctions.scala:686)
> at
> org.apache.spark.rdd.PairRDDFunctions$$anonfun$flatMapValues$1$$anonfun$apply$16.apply(PairRDDFunctions.scala:685)
> at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
> at
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:125)
> at
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:160)
> at
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
> at
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
> at
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
> at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:159)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> at org.apache.spark.rdd.UnionRDD.compute(UnionRDD.scala:87)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> 

Re: NullPointerException when cache DataFrame in Java (Spark1.5.1)

2015-10-29 Thread Romi Kuntsman
Did you try to cache a DataFrame with just a single row?
Do your rows have any columns with null values?
Can you post a code snippet here on how you load/generate the dataframe?
Does dataframe.rdd.cache work?
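
A minimal sketch of those checks (`df` is a placeholder for the DataFrame being cached):

    df.limit(1).cache().count()    // does caching a single row already fail?
    df.na.drop().cache().count()   // do null-bearing rows make a difference?
    df.rdd.cache().count()         // does caching the underlying RDD work?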

*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com

On Thu, Oct 29, 2015 at 4:33 AM, Zhang, Jingyu 
wrote:

> It is not a problem to use JavaRDD.cache() for 200M data (all Objects read
> form Json Format). But when I try to use DataFrame.cache(), It shown
> exception in below.
>
> My machine can cache 1 G data in Avro format without any problem.
>
> 15/10/29 13:26:23 INFO GeneratePredicate: Code generated in 154.531827 ms
>
> 15/10/29 13:26:23 INFO GenerateUnsafeProjection: Code generated in
> 27.832369 ms
>
> 15/10/29 13:26:23 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID
> 1)
>
> java.lang.NullPointerException
>
> at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
>
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(
> DelegatingMethodAccessorImpl.java:43)
>
> at java.lang.reflect.Method.invoke(Method.java:497)
>
> at
> org.apache.spark.sql.SQLContext$$anonfun$9$$anonfun$apply$1$$anonfun$apply$2.apply(
> SQLContext.scala:500)
>
> at
> org.apache.spark.sql.SQLContext$$anonfun$9$$anonfun$apply$1$$anonfun$apply$2.apply(
> SQLContext.scala:500)
>
> at scala.collection.TraversableLike$$anonfun$map$1.apply(
> TraversableLike.scala:244)
>
> at scala.collection.TraversableLike$$anonfun$map$1.apply(
> TraversableLike.scala:244)
>
> at scala.collection.IndexedSeqOptimized$class.foreach(
> IndexedSeqOptimized.scala:33)
>
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>
> at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>
> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
>
> at org.apache.spark.sql.SQLContext$$anonfun$9$$anonfun$apply$1.apply(
> SQLContext.scala:500)
>
> at org.apache.spark.sql.SQLContext$$anonfun$9$$anonfun$apply$1.apply(
> SQLContext.scala:498)
>
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>
> at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:389)
>
> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>
> at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(
> InMemoryColumnarTableScan.scala:127)
>
> at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(
> InMemoryColumnarTableScan.scala:120)
>
> at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:278
> )
>
> at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
>
> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
>
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
>
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38
> )
>
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38
> )
>
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>
> at org.apache.spark.scheduler.Task.run(Task.scala:88)
>
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>
> at java.util.concurrent.ThreadPoolExecutor.runWorker(
> ThreadPoolExecutor.java:1142)
>
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(
> ThreadPoolExecutor.java:617)
>
> at java.lang.Thread.run(Thread.java:745)
>
> 15/10/29 13:26:23 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1,
> localhost): java.lang.NullPointerException
>
> at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
>
>
> Thanks,
>
>
> Jingyu
>
> This message and its attachments may contain legally privileged or
> confidential information. It is intended solely for the named addressee. If
> you are not the addressee indicated in this message or responsible for
> delivery of the message to the addressee, you may not copy or deliver this
> message or its attachments to anyone. Rather, you should permanently delete
> this message and its attachments and kindly notify the sender by reply
> e-mail. Any content of this message and its attachments which does not
> relate to the official business of the sending company must be taken not to
> have been sent or endorsed by that company or any of its related entities.
> No warranty is made that the e-mail or attachments are free from computer
> virus or other defect.


Re: NullPointerException when cache DataFrame in Java (Spark1.5.1)

2015-10-29 Thread Zhang, Jingyu
Thanks Romi,

I resized the dataset to 7 MB; however, the code still throws the
NullPointerException.

Did you try to cache a DataFrame with just a single row?

Yes, I tried, but same problem.

Do your rows have any columns with null values?

No, I had filtered out null values before caching the dataframe.

Can you post a code snippet here on how you load/generate the dataframe?

Sure, Here is the working code 1:

JavaRDD<PixelObject> pixels = pixelsStr.map(new PixelGenerator()).cache();

System.out.println(pixels.count()); // 3000-4000 rows

Working code 2:

JavaRDD<PixelObject> pixels = pixelsStr.map(new PixelGenerator());

DataFrame schemaPixel = sqlContext.createDataFrame(pixels, PixelObject.class);

DataFrame totalDF1 = schemaPixel.select(schemaPixel.col("domain"))
    .filter("'domain' is not null").limit(500);

System.out.println(totalDF1.count());


BUT, after changing limit(500) to limit(1000), the code reports a
NullPointerException.


JavaRDD<PixelObject> pixels = pixelsStr.map(new PixelGenerator());

DataFrame schemaPixel = sqlContext.createDataFrame(pixels, PixelObject.class);

DataFrame totalDF = schemaPixel.select(schemaPixel.col("domain"))
    .filter("'domain' is not null").limit(1000);

System.out.println(totalDF.count()); // problem at this line

15/10/29 18:56:28 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks
have all completed, from pool

15/10/29 18:56:28 INFO TaskSchedulerImpl: Cancelling stage 0

15/10/29 18:56:28 INFO DAGScheduler: ShuffleMapStage 0 (count at
X.java:113) failed in 3.764 s

15/10/29 18:56:28 INFO DAGScheduler: Job 0 failed: count at XXX.java:113,
took 3.862207 s

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage
0.0 (TID 0, localhost): java.lang.NullPointerException

at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
Does dataframe.rdd.cache work?

No, I tried but same exception.

Thanks,

Jingyu

On 29 October 2015 at 17:38, Romi Kuntsman  wrote:

> Did you try to cache a DataFrame with just a single row?
> Do you rows have any columns with null values?
> Can you post a code snippet here on how you load/generate the dataframe?
> Does dataframe.rdd.cache work?
>
> *Romi Kuntsman*, *Big Data Engineer*
> http://www.totango.com
>
> On Thu, Oct 29, 2015 at 4:33 AM, Zhang, Jingyu 
> wrote:
>
>> It is not a problem to use JavaRDD.cache() for 200M data (all Objects
>> read form Json Format). But when I try to use DataFrame.cache(), It shown
>> exception in below.
>>
>> My machine can cache 1 G data in Avro format without any problem.
>>
>> 15/10/29 13:26:23 INFO GeneratePredicate: Code generated in 154.531827 ms
>>
>> 15/10/29 13:26:23 INFO GenerateUnsafeProjection: Code generated in
>> 27.832369 ms
>>
>> 15/10/29 13:26:23 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID
>> 1)
>>
>> java.lang.NullPointerException
>>
>> at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
>>
>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(
>> DelegatingMethodAccessorImpl.java:43)
>>
>> at java.lang.reflect.Method.invoke(Method.java:497)
>>
>> at
>> org.apache.spark.sql.SQLContext$$anonfun$9$$anonfun$apply$1$$anonfun$apply$2.apply(
>> SQLContext.scala:500)
>>
>> at
>> org.apache.spark.sql.SQLContext$$anonfun$9$$anonfun$apply$1$$anonfun$apply$2.apply(
>> SQLContext.scala:500)
>>
>> at scala.collection.TraversableLike$$anonfun$map$1.apply(
>> TraversableLike.scala:244)
>>
>> at scala.collection.TraversableLike$$anonfun$map$1.apply(
>> TraversableLike.scala:244)
>>
>> at scala.collection.IndexedSeqOptimized$class.foreach(
>> IndexedSeqOptimized.scala:33)
>>
>> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>>
>> at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>>
>> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
>>
>> at org.apache.spark.sql.SQLContext$$anonfun$9$$anonfun$apply$1.apply(
>> SQLContext.scala:500)
>>
>> at org.apache.spark.sql.SQLContext$$anonfun$9$$anonfun$apply$1.apply(
>> SQLContext.scala:498)
>>
>> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>>
>> at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:389)
>>
>> at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
>>
>> at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(
>> InMemoryColumnarTableScan.scala:127)
>>
>> at org.apache.spark.sql.columnar.InMemoryRelation$$anonfun$3$$anon$1.next(
>> InMemoryColumnarTableScan.scala:120)
>>
>> at org.apache.spark.storage.MemoryStore.unrollSafely(
>> MemoryStore.scala:278)
>>
>> at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171
>> )
>>
>> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
>>
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:262)
>>
>> at org.apache.spark.rdd.MapPartitionsRDD.compute(
>> MapPartitionsRDD.scala:38)
>>
>> at 

Re: NullPointerException when cache DataFrame in Java (Spark1.5.1)

2015-10-29 Thread Romi Kuntsman
>
> BUT, after change limit(500) to limit(1000). The code report
> NullPointerException.
>
I had a similar situation, and the problem was with a certain record.
Try to find which records are returned when you limit to 1000 but not
returned when you limit to 500.

Could it be an NPE thrown from PixelObject?
Are you running Spark with master=local, so it's running inside your IDE
and you can see the errors from the driver and worker?
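
One way to hunt for such a record (a sketch using `schemaPixel` from the original post;
note that limit() without an ordering does not guarantee the first 500 rows are a
subset of the first 1000, so treat the diff only as a hint):

    val first500  = schemaPixel.select("domain").limit(500)
    val first1000 = schemaPixel.select("domain").limit(1000)
    val suspects  = first1000.except(first500)
    suspects.show(500)   // eyeball the extra rows for nulls or malformed values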


*Romi Kuntsman*, *Big Data Engineer*
http://www.totango.com

On Thu, Oct 29, 2015 at 10:04 AM, Zhang, Jingyu 
wrote:

> Thanks Romi,
>
> I resize the dataset to 7MB, however, the code show NullPointerException
>  exception as well.
>
> Did you try to cache a DataFrame with just a single row?
>
> Yes, I tried. But, Same problem.
> .
> Do you rows have any columns with null values?
>
> No, I had filter out null values before cache the dataframe.
>
> Can you post a code snippet here on how you load/generate the dataframe?
>
> Sure, Here is the working code 1:
>
> JavaRDD pixels = pixelsStr.map(new PixelGenerator()).cache();
>
> System.out.println(pixels.count()); // 3000-4000 rows
>
> Working code 2:
>
> JavaRDD pixels = pixelsStr.map(new PixelGenerator());
>
> DataFrame schemaPixel = sqlContext.createDataFrame(pixels, PixelObject.
> class);
>
> DataFrame totalDF1 = 
> schemaPixel.select(schemaPixel.col("domain")).filter("'domain'
> is not null").limit(500);
>
> System.out.println(totalDF1.count());
>
>
> BUT, after change limit(500) to limit(1000). The code report
> NullPointerException.
>
>
> JavaRDD pixels = pixelsStr.map(new PixelGenerator());
>
> DataFrame schemaPixel = sqlContext.createDataFrame(pixels, PixelObject.
> class);
>
> DataFrame totalDF = 
> schemaPixel.select(schemaPixel.col("domain")).filter("'domain'
> is not null").limit(*1000*);
>
> System.out.println(totalDF.count()); // problem at this line
>
> 15/10/29 18:56:28 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks
> have all completed, from pool
>
> 15/10/29 18:56:28 INFO TaskSchedulerImpl: Cancelling stage 0
>
> 15/10/29 18:56:28 INFO DAGScheduler: ShuffleMapStage 0 (count at
> X.java:113) failed in 3.764 s
>
> 15/10/29 18:56:28 INFO DAGScheduler: Job 0 failed: count at XXX.java:113,
> took 3.862207 s
>
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
> in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage
> 0.0 (TID 0, localhost): java.lang.NullPointerException
>
> at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
> Does dataframe.rdd.cache work?
>
> No, I tried but same exception.
>
> Thanks,
>
> Jingyu
>
> On 29 October 2015 at 17:38, Romi Kuntsman  wrote:
>
>> Did you try to cache a DataFrame with just a single row?
>> Do you rows have any columns with null values?
>> Can you post a code snippet here on how you load/generate the dataframe?
>> Does dataframe.rdd.cache work?
>>
>> *Romi Kuntsman*, *Big Data Engineer*
>> http://www.totango.com
>>
>> On Thu, Oct 29, 2015 at 4:33 AM, Zhang, Jingyu 
>> wrote:
>>
>>> It is not a problem to use JavaRDD.cache() for 200M data (all Objects
>>> read form Json Format). But when I try to use DataFrame.cache(), It shown
>>> exception in below.
>>>
>>> My machine can cache 1 G data in Avro format without any problem.
>>>
>>> 15/10/29 13:26:23 INFO GeneratePredicate: Code generated in 154.531827 ms
>>>
>>> 15/10/29 13:26:23 INFO GenerateUnsafeProjection: Code generated in
>>> 27.832369 ms
>>>
>>> 15/10/29 13:26:23 ERROR Executor: Exception in task 0.0 in stage 1.0
>>> (TID 1)
>>>
>>> java.lang.NullPointerException
>>>
>>> at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
>>>
>>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(
>>> DelegatingMethodAccessorImpl.java:43)
>>>
>>> at java.lang.reflect.Method.invoke(Method.java:497)
>>>
>>> at
>>> org.apache.spark.sql.SQLContext$$anonfun$9$$anonfun$apply$1$$anonfun$apply$2.apply(
>>> SQLContext.scala:500)
>>>
>>> at
>>> org.apache.spark.sql.SQLContext$$anonfun$9$$anonfun$apply$1$$anonfun$apply$2.apply(
>>> SQLContext.scala:500)
>>>
>>> at scala.collection.TraversableLike$$anonfun$map$1.apply(
>>> TraversableLike.scala:244)
>>>
>>> at scala.collection.TraversableLike$$anonfun$map$1.apply(
>>> TraversableLike.scala:244)
>>>
>>> at scala.collection.IndexedSeqOptimized$class.foreach(
>>> IndexedSeqOptimized.scala:33)
>>>
>>> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>>>
>>> at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
>>>
>>> at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
>>>
>>> at org.apache.spark.sql.SQLContext$$anonfun$9$$anonfun$apply$1.apply(
>>> SQLContext.scala:500)
>>>
>>> at org.apache.spark.sql.SQLContext$$anonfun$9$$anonfun$apply$1.apply(
>>> SQLContext.scala:498)
>>>
>>> at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>>>
>>> at 

Re: NullPointerException inside RDD when calling sc.textFile

2015-07-23 Thread Akhil Das
Did you try:

val data = indexed_files.groupByKey

val modified_data = data.map { a =>
  var name = a._2.mkString(",")
  (a._1, name)
}

modified_data.foreach { a =>
  var file = sc.textFile(a._2)
  println(file.count)
}
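
Since SparkContext is only usable on the driver (it is not available inside RDD
closures running on executors), a driver-side variant might look like this sketch,
reusing the names from the snippet above:

    val fileGroups: Array[(String, String)] =
      data.map { case (prefix, paths) => (prefix, paths.mkString(",")) }.collect()

    fileGroups.foreach { case (prefix, pathList) =>
      val group = sc.textFile(pathList)   // textFile accepts a comma-separated path list
      println(prefix + " -> " + group.count() + " lines")
    }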


Thanks
Best Regards

On Wed, Jul 22, 2015 at 2:18 AM, MorEru hsb.sh...@gmail.com wrote:

 I have a number of CSV files and need to combine them into a RDD by part of
 their filenames.

 For example, for the below files
 $ ls
 20140101_1.csv  20140101_3.csv  20140201_2.csv  20140301_1.csv
 20140301_3.csv 20140101_2.csv  20140201_1.csv  20140201_3.csv

 I need to combine files with names 20140101*.csv into a RDD to work on it
 and so on.

 I am using sc.wholeTextFiles to read the entire directory and then grouping
 the filenames by their patterns to form a string of filenames. I am then
 passing the string to sc.textFile to open the files as a single RDD.

 This is the code I have -

 val files = sc.wholeTextFiles("*.csv")
 val indexed_files = files.map(a => (a._1.split("_")(0), a._1))
 val data = indexed_files.groupByKey

 data.map { a =>
   var name = a._2.mkString(",")
   (a._1, name)
 }

 data.foreach { a =>
   var file = sc.textFile(a._2)
   println(file.count)
 }

 And I get SparkException - NullPointerException when I try to call
 textFile.
 The error stack refers to an Iterator inside the RDD. I am not able to
 understand the error -

 15/07/21 15:37:37 INFO TaskSchedulerImpl: Removed TaskSet 65.0, whose tasks
 have all completed, from pool
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 1
 in
 stage 65.0 failed 4 times, most recent failure: Lost task 1.3 in stage 65.0
 (TID 115, 10.132.8.10): java.lang.NullPointerException
 at
 $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(console:33)
 at
 $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(console:32)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at

 org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:870)
 at

 org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:870)
 at

 org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1765)
 at

 org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1765)

 However, when I do sc.textFile(data.first._2).count in the spark shell, I
 am
 able to form the RDD and able to retrieve the count.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-inside-RDD-when-calling-sc-textFile-tp23943.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: NullPointerException with functions.rand()

2015-06-12 Thread Ted Yu
Created PR and verified the example given by Justin works with the change:
https://github.com/apache/spark/pull/6793

Cheers

On Wed, Jun 10, 2015 at 7:15 PM, Ted Yu yuzhih...@gmail.com wrote:

 Looks like the NPE came from this line:
   @transient protected lazy val rng = new XORShiftRandom(seed +
 TaskContext.get().partitionId())

 Could TaskContext.get() be null ?

 On Wed, Jun 10, 2015 at 6:15 PM, Justin Yip yipjus...@prediction.io
 wrote:

 Hello,

 I am using 1.4.0 and found the following weird behavior.

 This case works fine:

 scala> sc.parallelize(Seq((1,2), (3, 100))).toDF.withColumn("index",
 rand(30)).show()
 +--+---+-------------------+
 |_1| _2|              index|
 +--+---+-------------------+
 | 1|  2| 0.6662967911724369|
 | 3|100|0.35734504984676396|
 +--+---+-------------------+

 However, when I use sqlContext.createDataFrame instead, I get a NPE:

 scala> sqlContext.createDataFrame(Seq((1,2), (3,
 100))).withColumn("index", rand(30)).show()
 java.lang.NullPointerException
 at
 org.apache.spark.sql.catalyst.expressions.RDG.rng$lzycompute(random.scala:39)
 at org.apache.spark.sql.catalyst.expressions.RDG.rng(random.scala:39)
 ..


 Does any one know why?

 Thanks.

 Justin

 --
 View this message in context: NullPointerException with functions.rand()
 http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-with-functions-rand-tp23267.html
 Sent from the Apache Spark User List mailing list archive
 http://apache-spark-user-list.1001560.n3.nabble.com/ at Nabble.com.





Re: NullPointerException SQLConf.setConf

2015-06-07 Thread Cheng Lian
Are you calling hiveContext.sql within an RDD.map closure or something
similar? If so, the call actually happens on the executor side, but the
HiveContext only exists on the driver side.
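
A hedged sketch of the driver-side alternative (ZONE, YEAR, MONTH and the table names
come from the code quoted below; `zValues` is an assumed driver-local collection of the
z values being looped over):

    zValues.foreach { zz =>
      hiveContext.sql(
        "INSERT OVERWRITE table 4dim partition (zone=" + ZONE + ",z=" + zz +
        ",year=" + YEAR + ",month=" + MONTH + ") " +
        "select date, hh, x, y, height, u, v, w, ph, phb, t, p, pb, qvapor, qgraup, " +
        "qnice, qnrain, tke_pbl, el_pbl from table_4Dim where z=" + zz)
    }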


Cheng

On 6/4/15 3:45 PM, patcharee wrote:

Hi,

I am using Hive 0.14 and Spark 0.13. I get a
java.lang.NullPointerException when inserting into Hive. Any
suggestions please.


hiveContext.sql("INSERT OVERWRITE table 4dim partition (zone=" + ZONE
+ ",z=" + zz + ",year=" + YEAR + ",month=" + MONTH + ") " +
"select date, hh, x, y, height, u, v, w, ph, phb, t, p, pb, " +
"qvapor, qgraup, qnice, qnrain, tke_pbl, el_pbl from table_4Dim where " +
"z=" + zz);


java.lang.NullPointerException
at org.apache.spark.sql.SQLConf.setConf(SQLConf.scala:196)
at org.apache.spark.sql.SQLContext.setConf(SQLContext.scala:74)
at 
org.apache.spark.sql.hive.HiveContext.hiveconf$lzycompute(HiveContext.scala:251)
at 
org.apache.spark.sql.hive.HiveContext.hiveconf(HiveContext.scala:250)

at org.apache.spark.sql.hive.HiveContext.sql(HiveContext.scala:95)
at 
no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$main$3$$anonfun$apply$1.apply(LoadWrfIntoHiveOptReduce1.scala:110)
at 
no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$main$3$$anonfun$apply$1.apply(LoadWrfIntoHiveOptReduce1.scala:107)

at scala.collection.immutable.Range.foreach(Range.scala:141)
at 
no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$main$3.apply(LoadWrfIntoHiveOptReduce1.scala:107)
at 
no.uni.computing.etl.LoadWrfIntoHiveOptReduce1$$anonfun$main$3.apply(LoadWrfIntoHiveOptReduce1.scala:107)
at 
org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:806)
at 
org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:806)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1511)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1511)

at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

at java.lang.Thread.run(Thread.java:744)

Best,
Patcharee

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org





-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: NullPointerException in ApplicationMaster

2015-02-25 Thread Zhan Zhang
Look at the trace again. It is a very weird error. The SparkSubmit is running
on the client side, but YarnClusterSchedulerBackend is supposed to be running
in the YARN AM.

I suspect you are running the cluster in yarn-client mode, but in
JavaSparkContext you set "yarn-cluster". As a result, the Spark context
initiates YarnClusterSchedulerBackend instead of YarnClientSchedulerBackend,
which I think is the root cause.
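
A sketch of that setup (Scala against the Java API; "my-app" is a placeholder): leave
the master out of the code so that whatever --master value spark-submit passes takes
effect.

    import org.apache.spark.SparkConf
    import org.apache.spark.api.java.JavaSparkContext

    val conf = new SparkConf().setAppName("my-app")   // no setMaster("yarn-cluster") here
    val jsc  = new JavaSparkContext(conf)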

Thanks.

Zhan Zhang

On Feb 25, 2015, at 1:53 PM, Zhan Zhang zzh...@hortonworks.com wrote:

Hi Mate,

When you initialize the JavaSparkContext, you don’t need to specify the mode 
“yarn-cluster”. I suspect that is the root cause.

Thanks.

Zhan Zhang

On Feb 25, 2015, at 10:12 AM, gulyasm mgulya...@gmail.com wrote:

JavaSparkContext.




Re: NullPointerException

2014-12-31 Thread Josh Rosen
Which version of Spark are you using?

On Wed, Dec 31, 2014 at 10:24 PM, rapelly kartheek kartheek.m...@gmail.com
wrote:

 Hi,
 I get this following Exception when I submit spark application that
 calculates the frequency of characters in a file. Especially, when I
 increase the size of data, I face this problem.

 Exception in thread Thread-47 org.apache.spark.SparkException: Job
 aborted due to stage failure: Task 11.0:10 failed 4 times, most recent
 failure: Exception failure in TID 295 on host s1:
 java.lang.NullPointerException
 org.apache.spark.storage.BlockManager.org
 $apache$spark$storage$BlockManager$$replicate(BlockManager.scala:786)
 org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:752)
 org.apache.spark.storage.BlockManager.put(BlockManager.scala:574)
 org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:108)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
 org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
 org.apache.spark.scheduler.Task.run(Task.scala:51)

 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)

 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 java.lang.Thread.run(Thread.java:745)
 Driver stacktrace:
 at org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
 at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
 at scala.Option.foreach(Option.scala:236)
 at
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633)
 at
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
 at akka.actor.ActorCell.invoke(ActorCell.scala:456)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
 at akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)


 Any help?

 Thank you!



Re: NullPointerException

2014-12-31 Thread rapelly kartheek
spark-1.0.0

On Thu, Jan 1, 2015 at 12:04 PM, Josh Rosen rosenvi...@gmail.com wrote:

 Which version of Spark are you using?

 On Wed, Dec 31, 2014 at 10:24 PM, rapelly kartheek 
 kartheek.m...@gmail.com wrote:

 Hi,
 I get this following Exception when I submit spark application that
 calculates the frequency of characters in a file. Especially, when I
 increase the size of data, I face this problem.

 Exception in thread Thread-47 org.apache.spark.SparkException: Job
 aborted due to stage failure: Task 11.0:10 failed 4 times, most recent
 failure: Exception failure in TID 295 on host s1:
 java.lang.NullPointerException
 org.apache.spark.storage.BlockManager.org
 $apache$spark$storage$BlockManager$$replicate(BlockManager.scala:786)

 org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:752)
 org.apache.spark.storage.BlockManager.put(BlockManager.scala:574)
 org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:108)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
 org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
 org.apache.spark.scheduler.Task.run(Task.scala:51)

 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)

 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 java.lang.Thread.run(Thread.java:745)
 Driver stacktrace:
 at org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
 at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
 at scala.Option.foreach(Option.scala:236)
 at
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633)
 at
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
 at akka.actor.ActorCell.invoke(ActorCell.scala:456)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
 at akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)


 Any help?

 Thank you!





Re: NullPointerException

2014-12-31 Thread Josh Rosen
It looks like 'null' might be selected as a block replication peer?
https://github.com/apache/spark/blob/v1.0.0/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L786

I know that we fixed some replication bugs in newer versions of Spark (such
as https://github.com/apache/spark/pull/2366), so it's possible that this
issue would be resolved by updating.  Can you try re-running your job with
a newer Spark version to see whether you still see the same error?

On Wed, Dec 31, 2014 at 10:35 PM, rapelly kartheek kartheek.m...@gmail.com
wrote:

 spark-1.0.0

 On Thu, Jan 1, 2015 at 12:04 PM, Josh Rosen rosenvi...@gmail.com wrote:

 Which version of Spark are you using?

 On Wed, Dec 31, 2014 at 10:24 PM, rapelly kartheek 
 kartheek.m...@gmail.com wrote:

 Hi,
 I get this following Exception when I submit spark application that
 calculates the frequency of characters in a file. Especially, when I
 increase the size of data, I face this problem.

 Exception in thread Thread-47 org.apache.spark.SparkException: Job
 aborted due to stage failure: Task 11.0:10 failed 4 times, most recent
 failure: Exception failure in TID 295 on host s1:
 java.lang.NullPointerException
 org.apache.spark.storage.BlockManager.org
 $apache$spark$storage$BlockManager$$replicate(BlockManager.scala:786)

 org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:752)
 org.apache.spark.storage.BlockManager.put(BlockManager.scala:574)

 org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:108)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
 org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
 org.apache.spark.scheduler.Task.run(Task.scala:51)

 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)

 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 java.lang.Thread.run(Thread.java:745)
 Driver stacktrace:
 at org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
 at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
 at scala.Option.foreach(Option.scala:236)
 at
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633)
 at
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
 at akka.actor.ActorCell.invoke(ActorCell.scala:456)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
 at akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)


 Any help?

 Thank you!






Re: NullPointerException

2014-12-31 Thread rapelly kartheek
Ok. Let me try out on a newer version.

Thank you!!

On Thu, Jan 1, 2015 at 12:17 PM, Josh Rosen rosenvi...@gmail.com wrote:

 It looks like 'null' might be selected as a block replication peer?
 https://github.com/apache/spark/blob/v1.0.0/core/src/main/scala/org/apache/spark/storage/BlockManager.scala#L786

 I know that we fixed some replication bugs in newer versions of Spark
 (such as https://github.com/apache/spark/pull/2366), so it's possible
 that this issue would be resolved by updating.  Can you try re-running your
 job with a newer Spark version to see whether you still see the same error?

 On Wed, Dec 31, 2014 at 10:35 PM, rapelly kartheek 
 kartheek.m...@gmail.com wrote:

 spark-1.0.0

 On Thu, Jan 1, 2015 at 12:04 PM, Josh Rosen rosenvi...@gmail.com wrote:

 Which version of Spark are you using?

 On Wed, Dec 31, 2014 at 10:24 PM, rapelly kartheek 
 kartheek.m...@gmail.com wrote:

 Hi,
 I get this following Exception when I submit spark application that
 calculates the frequency of characters in a file. Especially, when I
 increase the size of data, I face this problem.

 Exception in thread Thread-47 org.apache.spark.SparkException: Job
 aborted due to stage failure: Task 11.0:10 failed 4 times, most recent
 failure: Exception failure in TID 295 on host s1:
 java.lang.NullPointerException
 org.apache.spark.storage.BlockManager.org
 $apache$spark$storage$BlockManager$$replicate(BlockManager.scala:786)

 org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:752)

 org.apache.spark.storage.BlockManager.put(BlockManager.scala:574)

 org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:108)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
 org.apache.spark.rdd.FilteredRDD.compute(FilteredRDD.scala:34)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
 org.apache.spark.scheduler.Task.run(Task.scala:51)

 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)

 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 java.lang.Thread.run(Thread.java:745)
 Driver stacktrace:
 at org.apache.spark.scheduler.DAGScheduler.org
 $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
 at
 scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at
 org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
 at
 org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
 at scala.Option.foreach(Option.scala:236)
 at
 org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633)
 at
 org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
 at akka.actor.ActorCell.invoke(ActorCell.scala:456)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
 at akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)


 Any help?

 Thank you!







Re: NullPointerException on cluster mode when using foreachPartition

2014-12-16 Thread Shixiong Zhu
Could you post the stack trace?


Best Regards,
Shixiong Zhu

2014-12-16 23:21 GMT+08:00 richiesgr richie...@gmail.com:

 Hi

 This time I need an expert.
 On 1.1.1, and only in cluster mode (standalone or EC2),
 when I use this code:

 countersPublishers.foreachRDD(rdd => {
   rdd.foreachPartition(partitionRecords => {
     partitionRecords.foreach(record => {
       // dbActorUpdater ! updateDBMessage(record)
       println(record)
     })
   })
 })

 I get an NPE (when I run this locally all is OK).

 If I use this
   countersPublishers.foreachRDD(rdd => rdd.collect().foreach(r =>
     dbActorUpdater ! updateDBMessage(r)))

 there is no problem. I think something is misconfigured.
 Thanks for the help




 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-on-cluster-mode-when-using-foreachPartition-tp20719.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: NullPointerException When Reading Avro Sequence Files

2014-12-15 Thread Simone Franzini
To me this looks like an internal error in the REPL. I am not sure what is
causing that.
Personally, I never use the REPL. Can you try typing up your program and
running it from an IDE or with spark-submit to see if you still get the same
error?

Simone Franzini, PhD

http://www.linkedin.com/in/simonefranzini

On Mon, Dec 15, 2014 at 4:54 PM, Cristovao Jose Domingues Cordeiro 
cristovao.corde...@cern.ch wrote:

  Sure, thanks:
 warning: there were 1 deprecation warning(s); re-run with -deprecation for
 details
 java.lang.IllegalStateException: Job in state DEFINE instead of RUNNING
 at org.apache.hadoop.mapreduce.Job.ensureState(Job.java:283)
 at org.apache.hadoop.mapreduce.Job.toString(Job.java:462)
 at
 scala.runtime.ScalaRunTime$.scala$runtime$ScalaRunTime$$inner$1(ScalaRunTime.scala:324)
 at scala.runtime.ScalaRunTime$.stringOf(ScalaRunTime.scala:329)
 at scala.runtime.ScalaRunTime$.replStringOf(ScalaRunTime.scala:337)
 at .init(console:10)
 at .clinit(console)
 at $print(console)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at
 org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:846)
 at
 org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1119)
 at
 org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:672)
 at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:703)
 at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:667)
 at
 org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:819)
 at
 org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:864)
 at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:776)
 at
 org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:619)
 at
 org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:627)
 at org.apache.spark.repl.SparkILoop.loop(SparkILoop.scala:632)
 at
 org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:959)
 at
 org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:907)
 at
 org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:907)
 at
 scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
 at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:907)
 at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1002)
 at org.apache.spark.repl.Main$.main(Main.scala:31)
 at org.apache.spark.repl.Main.main(Main.scala)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at
 org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:331)
 at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
 at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)




 Could something you omitted in your snippet be causing this exception?

  Cumprimentos / Best regards,
 Cristóvão José Domingues Cordeiro
 IT Department - 28/R-018
 CERN
--
 *From:* Simone Franzini [captainfr...@gmail.com]
 *Sent:* 15 December 2014 16:52

 *To:* Cristovao Jose Domingues Cordeiro
 *Subject:* Re: NullPointerException When Reading Avro Sequence Files

   Ok, I have no idea what that is. That appears to be an internal Spark
 exception. Maybe if you can post the entire stack trace it would give some
 more details to understand what is going on.

  Simone Franzini, PhD

 http://www.linkedin.com/in/simonefranzini

 On Mon, Dec 15, 2014 at 4:50 PM, Cristovao Jose Domingues Cordeiro 
 cristovao.corde...@cern.ch wrote:

  Hi,

 thanks for that.
 But yeah the 2nd line is an exception. jobread is not created.

  Cumprimentos / Best regards,
 Cristóvão José Domingues Cordeiro
 IT Department - 28/R-018
 CERN
--
 *From:* Simone Franzini [captainfr...@gmail.com]
 *Sent:* 15 December 2014 16:39

 *To:* Cristovao Jose Domingues Cordeiro
 *Subject:* Re: NullPointerException When Reading Avro Sequence Files

I did not mention the imports needed in my code. I think these are
 all of them:

  import org.apache.hadoop.mapreduce.Job
 import org.apache.hadoop.io.NullWritable
 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
 import

Re: NullPointerException When Reading Avro Sequence Files

2014-12-09 Thread Simone Franzini
Hi Cristovao,

I have seen a very similar issue that I have posted about in this thread:
http://apache-spark-user-list.1001560.n3.nabble.com/Kryo-NPE-with-Array-td19797.html
I think your main issue here is somewhat similar, in that the MapWrapper
Scala class is not registered. This gets registered by the Twitter
chill-scala AllScalaRegistrar class that you are currently not using.

As far as I understand, in order to use Avro with Spark, you also have to
use Kryo. This means you have to use the Spark KryoSerializer. This in turn
uses Twitter chill. I posted the basic code that I am using here:

http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-read-this-avro-file-using-spark-amp-scala-td19400.html#a19491

Maybe there is a simpler solution to your problem but I am not that much of
an expert yet. I hope this helps.
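For reference, a minimal sketch of that setup (assuming Spark 1.x with chill-scala on
the classpath; MyKryoRegistrator is just an illustrative name, use the fully qualified
name of whatever registrator you actually define):

import com.esotericsoftware.kryo.Kryo
import com.twitter.chill.AllScalaRegistrar
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

// Registrator that pulls in chill-scala's registrations (including MapWrapper)
class MyKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    new AllScalaRegistrar().apply(kryo)
    // register your Avro record classes here as well if needed
  }
}

val conf = new SparkConf()
  .setAppName("avro-with-kryo")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "MyKryoRegistrator")

The value passed to spark.kryo.registrator has to be the fully qualified class name of
the registrator you define.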

Simone Franzini, PhD

http://www.linkedin.com/in/simonefranzini

On Tue, Dec 9, 2014 at 8:50 AM, Cristovao Jose Domingues Cordeiro 
cristovao.corde...@cern.ch wrote:

  Hi Simone,

 thanks but I don't think that's it.
 I've tried several libraries within the --jar argument. Some do give what
 you said. But other times (when I put the right version I guess) I get the
 following:
 14/12/09 15:45:54 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID
 0)
 java.io.NotSerializableException:
 scala.collection.convert.Wrappers$MapWrapper
 at
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
 at
 java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377)


 Which is odd since I am reading an Avro file I wrote... with the same piece of
 code:
 https://gist.github.com/MLnick/5864741781b9340cb211

  Cumprimentos / Best regards,
 Cristóvão José Domingues Cordeiro
 IT Department - 28/R-018
 CERN
--
 *From:* Simone Franzini [captainfr...@gmail.com]
 *Sent:* 06 December 2014 15:48
 *To:* Cristovao Jose Domingues Cordeiro
 *Subject:* Re: NullPointerException When Reading Avro Sequence Files

   java.lang.IncompatibleClassChangeError: Found interface
 org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected

  That is a sign that you are mixing up versions of Hadoop. This is
 particularly an issue when dealing with AVRO. If you are using Hadoop 2,
 you will need to get the hadoop 2 version of avro-mapred. In Maven you
 would do this with the <classifier>hadoop2</classifier> tag.
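 For example (the version below is only a placeholder, match it to the Avro
 release you are actually on):

 <dependency>
   <groupId>org.apache.avro</groupId>
   <artifactId>avro-mapred</artifactId>
   <version>1.7.7</version>
   <classifier>hadoop2</classifier>
 </dependency>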

  Simone Franzini, PhD

 http://www.linkedin.com/in/simonefranzini

 On Fri, Dec 5, 2014 at 3:52 AM, cjdc cristovao.corde...@cern.ch wrote:

 Hi all,

 I've tried the above example on Gist, but it doesn't work (at least for
 me).
 Did anyone get this:
 14/12/05 10:44:40 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID
 0)
 java.lang.IncompatibleClassChangeError: Found interface
 org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
 at

 org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
 at
 org.apache.spark.rdd.NewHadoopRDD$$anon$1.init(NewHadoopRDD.scala:115)
 at
 org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:103)
 at
 org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:65)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 at
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
 at org.apache.spark.scheduler.Task.run(Task.scala:54)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
 at

 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
 at java.lang.Thread.run(Thread.java:745)
 14/12/05 10:44:40 ERROR ExecutorUncaughtExceptionHandler: Uncaught
 exception
 in thread Thread[Executor task launch worker-0,5,main]
 java.lang.IncompatibleClassChangeError: Found interface
 org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
 at

 org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
 at
 org.apache.spark.rdd.NewHadoopRDD$$anon$1.init(NewHadoopRDD.scala:115)
 at
 org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:103)
 at
 org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:65)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:229

Re: NullPointerException When Reading Avro Sequence Files

2014-12-09 Thread Simone Franzini
You can use this Maven dependency:

<dependency>
  <groupId>com.twitter</groupId>
  <artifactId>chill-avro</artifactId>
  <version>0.4.0</version>
</dependency>

Simone Franzini, PhD

http://www.linkedin.com/in/simonefranzini

On Tue, Dec 9, 2014 at 9:53 AM, Cristovao Jose Domingues Cordeiro 
cristovao.corde...@cern.ch wrote:

  Thanks for the reply!

 I have in fact tried your code. But I lack the Twitter chill package and I
 cannot find it online. So I am now trying this:
 http://spark.apache.org/docs/latest/tuning.html#data-serialization . But
 in case I can't do it, could you tell me where to get that Twitter package
 you used?

 Thanks

  Cumprimentos / Best regards,
 Cristóvão José Domingues Cordeiro
 IT Department - 28/R-018
 CERN
--
 *From:* Simone Franzini [captainfr...@gmail.com]
 *Sent:* 09 December 2014 16:42
 *To:* Cristovao Jose Domingues Cordeiro; user

 *Subject:* Re: NullPointerException When Reading Avro Sequence Files

   Hi Cristovao,

 I have seen a very similar issue that I have posted about in this thread:

 http://apache-spark-user-list.1001560.n3.nabble.com/Kryo-NPE-with-Array-td19797.html
  I think your main issue here is somewhat similar, in that the MapWrapper
 Scala class is not registered. This gets registered by the Twitter
 chill-scala AllScalaRegistrar class that you are currently not using.

  As far as I understand, in order to use Avro with Spark, you also have
 to use Kryo. This means you have to use the Spark KryoSerializer. This in
 turn uses Twitter chill. I posted the basic code that I am using here:


 http://apache-spark-user-list.1001560.n3.nabble.com/How-can-I-read-this-avro-file-using-spark-amp-scala-td19400.html#a19491

  Maybe there is a simpler solution to your problem but I am not that much
 of an expert yet. I hope this helps.

  Simone Franzini, PhD

 http://www.linkedin.com/in/simonefranzini

 On Tue, Dec 9, 2014 at 8:50 AM, Cristovao Jose Domingues Cordeiro 
 cristovao.corde...@cern.ch wrote:

  Hi Simone,

 thanks but I don't think that's it.
 I've tried several libraries within the --jar argument. Some do give what
 you said. But other times (when I put the right version I guess) I get the
 following:
 14/12/09 15:45:54 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID
 0)
 java.io.NotSerializableException:
 scala.collection.convert.Wrappers$MapWrapper
 at
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
 at
 java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1377)


 Which is odd since I am reading an Avro file I wrote... with the same piece of
 code:
 https://gist.github.com/MLnick/5864741781b9340cb211

  Cumprimentos / Best regards,
 Cristóvão José Domingues Cordeiro
 IT Department - 28/R-018
 CERN
--
 *From:* Simone Franzini [captainfr...@gmail.com]
 *Sent:* 06 December 2014 15:48
 *To:* Cristovao Jose Domingues Cordeiro
 *Subject:* Re: NullPointerException When Reading Avro Sequence Files

java.lang.IncompatibleClassChangeError: Found interface
 org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected

  That is a sign that you are mixing up versions of Hadoop. This is
 particularly an issue when dealing with AVRO. If you are using Hadoop 2,
 you will need to get the hadoop 2 version of avro-mapred. In Maven you
 would do this with the <classifier>hadoop2</classifier> tag.

  Simone Franzini, PhD

 http://www.linkedin.com/in/simonefranzini

 On Fri, Dec 5, 2014 at 3:52 AM, cjdc cristovao.corde...@cern.ch wrote:

 Hi all,

 I've tried the above example on Gist, but it doesn't work (at least for
 me).
 Did anyone get this:
 14/12/05 10:44:40 ERROR Executor: Exception in task 0.0 in stage 0.0
 (TID 0)
 java.lang.IncompatibleClassChangeError: Found interface
 org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
 at

 org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
 at
 org.apache.spark.rdd.NewHadoopRDD$$anon$1.init(NewHadoopRDD.scala:115)
 at
 org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:103)
 at
 org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:65)
 at
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
 at
 org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
 at
 org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
 at org.apache.spark.scheduler.Task.run(Task.scala:54)
 at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
 at

 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
 at

 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615

Re: NullPointerException When Reading Avro Sequence Files

2014-12-05 Thread cjdc
Hi all,

I've tried the above example on Gist, but it doesn't work (at least for me).
Did anyone get this:
14/12/05 10:44:40 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.IncompatibleClassChangeError: Found interface
org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
at
org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
at
org.apache.spark.rdd.NewHadoopRDD$$anon$1.init(NewHadoopRDD.scala:115)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:103)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:65)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
14/12/05 10:44:40 ERROR ExecutorUncaughtExceptionHandler: Uncaught exception
in thread Thread[Executor task launch worker-0,5,main]
java.lang.IncompatibleClassChangeError: Found interface
org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
at
org.apache.avro.mapreduce.AvroKeyInputFormat.createRecordReader(AvroKeyInputFormat.java:47)
at
org.apache.spark.rdd.NewHadoopRDD$$anon$1.init(NewHadoopRDD.scala:115)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:103)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:65)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
14/12/05 10:44:40 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times;
aborting job


Thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-when-reading-Avro-Sequence-files-tp10201p20456.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: NullPointerException on reading checkpoint files

2014-09-23 Thread RodrigoB
Hi TD,

This is actually an important requirement (recovery of shared variables) for
us as we need to spread some referential data across the Spark nodes on
application startup. I just bumped into this issue on Spark version 1.0.1. I
assume the latest one also doesn't include this capability. Are there any
plans to do so?

If not, could you give me your opinion on how difficult it would be to
implement this? If it's nothing too complex, I could consider contributing at
that level.

BTW, regarding recovery I have posted a topic on which I would very much
appreciate your comments on
http://apache-spark-user-list.1001560.n3.nabble.com/RDD-data-checkpoint-cleaning-td14847.html

tnks,
Rod



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-on-reading-checkpoint-files-tp7306p14882.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: NullPointerException on reading checkpoint files

2014-09-23 Thread Tathagata Das
This is actually very tricky, as there are two pretty big challenges that need
to be solved.
(i) Checkpointing for broadcast variables: unlike RDDs, broadcast variables
don't have checkpointing support (that is, you cannot write the content of a
broadcast variable to HDFS and recover it automatically when needed).
(ii) Remembering the checkpoint info of the broadcast vars used in every batch,
recovering those vars from the checkpoint info, and exposing this in the API
so that all the checkpointing/recovering can be done by Spark Streaming
seamlessly without the user's knowledge.

I have some thoughts on it, but nothing concrete yet. The first, that is,
broadcast checkpointing, should be straightforward, and may be rewarding
outside streaming.

TD

On Tue, Sep 23, 2014 at 4:22 PM, RodrigoB rodrigo.boav...@aspect.com
wrote:

 Hi TD,

 This is actually an important requirement (recovery of shared variables)
 for
 us as we need to spread some referential data across the Spark nodes on
 application startup. I just bumped into this issue on Spark version 1.0.1.
 I
 assume the latest one also doesn't include this capability. Are there any
 plans to do so.

 If not could you give me your opinion on how difficult would it be to
 implement this? If it's nothing too complex I could consider contributing
 on
 that level.

 BTW, regarding recovery I have posted a topic on which I would very much
 appreciate your comments on

 http://apache-spark-user-list.1001560.n3.nabble.com/RDD-data-checkpoint-cleaning-td14847.html

 tnks,
 Rod



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-on-reading-checkpoint-files-tp7306p14882.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: NullPointerException from '.count.foreachRDD'

2014-08-20 Thread anoldbrain
Looking at the source codes of DStream.scala


   /**
    * Return a new DStream in which each RDD has a single element generated
    * by counting each RDD of this DStream.
    */
   def count(): DStream[Long] = {
     this.map(_ => (null, 1L))
       .transform(_.union(context.sparkContext.makeRDD(Seq((null, 0L)), 1)))
       .reduceByKey(_ + _)
       .map(_._2)
   }

transform is the line throwing the NullPointerException. Can anyone give
some hints as to what would cause _ to be null (it is indeed null)? This only
happens when there is no data to process.

When there's data, no NullPointerException is thrown, and all the
processing/counting proceeds correctly.

I am using my custom InputDStream. Could it be that this is the source of
the problem?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-from-count-foreachRDD-tp2066p12461.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: NullPointerException from '.count.foreachRDD'

2014-08-20 Thread Shao, Saisai
Hi,

I don't think there's an NPE issue when using DStream.count() even when there is no
data fed into Spark Streaming. I tested using Kafka in my local settings; both cases
are OK, with and without data consumed.

Actually you can see the details in ReceiverInputDStream: even when there is no data
in a batch duration, it will generate an empty BlockRDD, so the map() and
transform() in the count() operator will never hit an NPE. I think the problem
may lie in your customized InputDStream; you should make sure it generates an
empty RDD even when there is no data to feed in.
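For example, a minimal sketch of a custom InputDStream that always hands back an RDD
(fetchRecords below is a hypothetical stand-in for however you pull your data):

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{StreamingContext, Time}
import org.apache.spark.streaming.dstream.InputDStream

class MyInputDStream(ssc_ : StreamingContext) extends InputDStream[String](ssc_) {
  override def start(): Unit = {}
  override def stop(): Unit = {}

  override def compute(validTime: Time): Option[RDD[String]] = {
    val records: Seq[String] = fetchRecords(validTime)   // hypothetical helper
    // Always hand back an RDD; when the batch is empty this is an empty RDD,
    // not None, so count()/transform() downstream never see a missing RDD.
    Some(ssc_.sparkContext.parallelize(records))
  }

  private def fetchRecords(validTime: Time): Seq[String] = Seq.empty   // placeholder
}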

Thanks
Jerry

-Original Message-
From: anoldbrain [mailto:anoldbr...@gmail.com] 
Sent: Wednesday, August 20, 2014 4:13 PM
To: u...@spark.incubator.apache.org
Subject: Re: NullPointerException from '.count.foreachRDD'

Looking at the source codes of DStream.scala


   /**
    * Return a new DStream in which each RDD has a single element generated
    * by counting each RDD of this DStream.
    */
   def count(): DStream[Long] = {
     this.map(_ => (null, 1L))
       .transform(_.union(context.sparkContext.makeRDD(Seq((null, 0L)), 1)))
       .reduceByKey(_ + _)
       .map(_._2)
   }

transform is the line throwing the NullPointerException. Can anyone give some 
hints as what would cause _ to be null (it is indeed null)? This only happens 
when there is no data to process.

When there's data, no NullPointerException is thrown, and all the 
processing/counting proceeds correctly.

I am using my custom InputDStream. Could it be that this is the source of the 
problem?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-from-count-foreachRDD-tp2066p12461.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: NullPointerException from '.count.foreachRDD'

2014-08-20 Thread anoldbrain
Thank you for the reply. I had implemented my InputDStream to return None when
there's no data. After changing it to return an empty RDD, the exception is
gone.

I am curious as to why all other processing worked correctly with my old,
incorrect implementation, with or without data. My actual code, without the
count() part, did glom and then foreachRDD.

I have StatsReportListener registered, and when there was no data, there was
no StatsReportListener output. I thought this was Spark's smart logic to
avoid launching workers when there was no data; I wouldn't have thought it was
actually an indication that my InputDStream implementation was wrong. On
the other hand, why use the return type Option if None should not be used at
all?

Thanks for helping solve my problem.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-from-count-foreachRDD-tp2066p12476.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: NullPointerException when connecting from Spark to a Hive table backed by HBase

2014-08-19 Thread Yin Huai
Seems https://issues.apache.org/jira/browse/SPARK-2846 is the jira tracking
this issue.


On Mon, Aug 18, 2014 at 6:26 PM, cesararevalo ce...@zephyrhealthinc.com
wrote:

 Thanks, Zhan for the follow up.

 But, do you know how I am supposed to set that table name on the jobConf? I
 don't have access to that object from my client driver?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-when-connecting-from-Spark-to-a-Hive-table-backed-by-HBase-tp12284p12331.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: NullPointerException when connecting from Spark to a Hive table backed by HBase

2014-08-19 Thread Cesar Arevalo
Thanks! Yeah, it may be related to that. I'll check out that pull request
that was sent and hopefully that fixes the issue. I'll let you know; after
fighting with this issue yesterday I had decided to just leave it to the
side and return to it later, so it may take me a while to get back to you.

-Cesar


On Tue, Aug 19, 2014 at 2:04 PM, Yin Huai huaiyin@gmail.com wrote:

 Seems https://issues.apache.org/jira/browse/SPARK-2846 is the jira
 tracking this issue.


 On Mon, Aug 18, 2014 at 6:26 PM, cesararevalo ce...@zephyrhealthinc.com
 wrote:

 Thanks, Zhan for the follow up.

 But, do you know how I am supposed to set that table name on the jobConf?
 I
 don't have access to that object from my client driver?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-when-connecting-from-Spark-to-a-Hive-table-backed-by-HBase-tp12284p12331.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org





-- 
Cesar Arevalo
Software Engineer ❘ Zephyr Health
450 Mission Street, Suite #201 ❘ San Francisco, CA 94105
m: +1 415-571-7687 ❘ s: arevalocesar | t: @zephyrhealth
https://twitter.com/zephyrhealth
o: +1 415-529-7649 ❘ f: +1 415-520-9288
http://www.zephyrhealth.com


Re: NullPointerException when connecting from Spark to a Hive table backed by HBase

2014-08-18 Thread Akhil Das
Looks like your hiveContext is null. Have a look at this documentation.
https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables

Thanks
Best Regards


On Mon, Aug 18, 2014 at 12:09 PM, Cesar Arevalo ce...@zephyrhealthinc.com
wrote:

 Hello:

 I am trying to setup Spark to connect to a Hive table which is backed by
 HBase, but I am running into the following NullPointerException:

 scala> val hiveCount = hiveContext.sql("select count(*) from
 dataset_records").collect().head.getLong(0)
 14/08/18 06:34:29 INFO ParseDriver: Parsing command: select count(*) from
 dataset_records
 14/08/18 06:34:29 INFO ParseDriver: Parse Completed
 14/08/18 06:34:29 INFO HiveMetaStore: 0: get_table : db=default
 tbl=dataset_records
 14/08/18 06:34:29 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_table :
 db=default tbl=dataset_records
 14/08/18 06:34:30 INFO MemoryStore: ensureFreeSpace(160296) called with
 curMem=0, maxMem=280248975
 14/08/18 06:34:30 INFO MemoryStore: Block broadcast_0 stored as values in
 memory (estimated size 156.5 KB, free 267.1 MB)
 14/08/18 06:34:30 INFO SparkContext: Starting job: collect at
 SparkPlan.scala:85
 14/08/18 06:34:31 WARN DAGScheduler: Creating new stage failed due to
 exception - job: 0
 java.lang.NullPointerException
 at org.apache.hadoop.hbase.util.Bytes.toBytes(Bytes.java:502)
  at
 org.apache.hadoop.hive.hbase.HiveHBaseTableInputFormat.getSplits(HiveHBaseTableInputFormat.java:418)
 at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:179)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
  at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
  at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
 at scala.Option.getOrElse(Option.scala:120)




 This is happening from the master on spark, I am running hbase version
 hbase-0.98.4-hadoop1 and hive version 0.13.1. And here is how I am running
 the spark shell:

 bin/spark-shell --driver-class-path
 

Re: NullPointerException when connecting from Spark to a Hive table backed by HBase

2014-08-18 Thread Cesar Arevalo
Nope, it is NOT null. Check this out:

scala> hiveContext == null
res2: Boolean = false


And thanks for sending that link, but I had already looked at it. Any other
ideas?

I looked through some of the relevant Spark Hive code and I'm starting to
think this may be a bug.

-Cesar



On Mon, Aug 18, 2014 at 12:00 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 Looks like your hiveContext is null. Have a look at this documentation.
 https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables

 Thanks
 Best Regards


 On Mon, Aug 18, 2014 at 12:09 PM, Cesar Arevalo ce...@zephyrhealthinc.com
  wrote:

 Hello:

 I am trying to setup Spark to connect to a Hive table which is backed by
 HBase, but I am running into the following NullPointerException:

 scala val hiveCount = hiveContext.sql(select count(*) from
 dataset_records).collect().head.getLong(0)
 14/08/18 06:34:29 INFO ParseDriver: Parsing command: select count(*) from
 dataset_records
 14/08/18 06:34:29 INFO ParseDriver: Parse Completed
 14/08/18 06:34:29 INFO HiveMetaStore: 0: get_table : db=default
 tbl=dataset_records
 14/08/18 06:34:29 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_table
 : db=default tbl=dataset_records
 14/08/18 06:34:30 INFO MemoryStore: ensureFreeSpace(160296) called with
 curMem=0, maxMem=280248975
 14/08/18 06:34:30 INFO MemoryStore: Block broadcast_0 stored as values in
 memory (estimated size 156.5 KB, free 267.1 MB)
 14/08/18 06:34:30 INFO SparkContext: Starting job: collect at
 SparkPlan.scala:85
 14/08/18 06:34:31 WARN DAGScheduler: Creating new stage failed due to
 exception - job: 0
 java.lang.NullPointerException
 at org.apache.hadoop.hbase.util.Bytes.toBytes(Bytes.java:502)
  at
 org.apache.hadoop.hive.hbase.HiveHBaseTableInputFormat.getSplits(HiveHBaseTableInputFormat.java:418)
 at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:179)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
  at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
  at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
 at scala.Option.getOrElse(Option.scala:120)




 This is happening from the master on spark, I am running hbase version
 hbase-0.98.4-hadoop1 and hive version 0.13.1. And here is how I am running
 the spark shell:

 bin/spark-shell --driver-class-path
 

Re: NullPointerException when connecting from Spark to a Hive table backed by HBase

2014-08-18 Thread Akhil Das
Then it's definitely a jar conflict. Can you try removing this jar from the
classpath: /opt/spark-poc/lib_managed/jars/org.spark-project.hive/hive-exec/
hive-exec-0.12.0.jar

Thanks
Best Regards


On Mon, Aug 18, 2014 at 12:45 PM, Cesar Arevalo ce...@zephyrhealthinc.com
wrote:

 Nope, it is NOT null. Check this out:

 scala hiveContext == null
 res2: Boolean = false


 And thanks for sending that link, but I had already looked at it. Any
 other ideas?

 I looked through some of the relevant Spark Hive code and I'm starting to
 think this may be a bug.

 -Cesar



 On Mon, Aug 18, 2014 at 12:00 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 Looks like your hiveContext is null. Have a look at this documentation.
 https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables

 Thanks
 Best Regards


 On Mon, Aug 18, 2014 at 12:09 PM, Cesar Arevalo 
 ce...@zephyrhealthinc.com wrote:

 Hello:

 I am trying to setup Spark to connect to a Hive table which is backed by
 HBase, but I am running into the following NullPointerException:

 scala val hiveCount = hiveContext.sql(select count(*) from
 dataset_records).collect().head.getLong(0)
 14/08/18 06:34:29 INFO ParseDriver: Parsing command: select count(*)
 from dataset_records
 14/08/18 06:34:29 INFO ParseDriver: Parse Completed
 14/08/18 06:34:29 INFO HiveMetaStore: 0: get_table : db=default
 tbl=dataset_records
 14/08/18 06:34:29 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_table
 : db=default tbl=dataset_records
 14/08/18 06:34:30 INFO MemoryStore: ensureFreeSpace(160296) called with
 curMem=0, maxMem=280248975
 14/08/18 06:34:30 INFO MemoryStore: Block broadcast_0 stored as values
 in memory (estimated size 156.5 KB, free 267.1 MB)
 14/08/18 06:34:30 INFO SparkContext: Starting job: collect at
 SparkPlan.scala:85
 14/08/18 06:34:31 WARN DAGScheduler: Creating new stage failed due to
 exception - job: 0
 java.lang.NullPointerException
 at org.apache.hadoop.hbase.util.Bytes.toBytes(Bytes.java:502)
  at
 org.apache.hadoop.hive.hbase.HiveHBaseTableInputFormat.getSplits(HiveHBaseTableInputFormat.java:418)
 at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:179)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
  at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
  at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
 at scala.Option.getOrElse(Option.scala:120)




 This is happening from the master on spark, I am running hbase version
 hbase-0.98.4-hadoop1 and hive version 0.13.1. And here is how I am running
 the spark shell:

 bin/spark-shell --driver-class-path
 

Re: NullPointerException when connecting from Spark to a Hive table backed by HBase

2014-08-18 Thread Cesar Arevalo
I removed the JAR that you suggested but now I get another error when I try
to create the HiveContext. Here is the error:

scala> val hiveContext = new HiveContext(sc)
error: bad symbolic reference. A signature in HiveContext.class refers to
term ql
in package org.apache.hadoop.hive which is not available.
It may be completely missing from the current classpath,
[omitted more of the stack trace for readability...]


Best,
-Cesar


On Mon, Aug 18, 2014 at 12:47 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:

 Then definitely its a jar conflict. Can you try removing this jar from the
 class path /opt/spark-poc/lib_managed/jars/org.
 spark-project.hive/hive-exec/hive-exec-0.12.0.jar

 Thanks
 Best Regards


 On Mon, Aug 18, 2014 at 12:45 PM, Cesar Arevalo ce...@zephyrhealthinc.com
  wrote:

 Nope, it is NOT null. Check this out:

 scala hiveContext == null
 res2: Boolean = false


 And thanks for sending that link, but I had already looked at it. Any
 other ideas?

 I looked through some of the relevant Spark Hive code and I'm starting to
 think this may be a bug.

 -Cesar



 On Mon, Aug 18, 2014 at 12:00 AM, Akhil Das ak...@sigmoidanalytics.com
 wrote:

 Looks like your hiveContext is null. Have a look at this documentation.
 https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables

 Thanks
 Best Regards


 On Mon, Aug 18, 2014 at 12:09 PM, Cesar Arevalo 
 ce...@zephyrhealthinc.com wrote:

 Hello:

 I am trying to setup Spark to connect to a Hive table which is backed
 by HBase, but I am running into the following NullPointerException:

 scala val hiveCount = hiveContext.sql(select count(*) from
 dataset_records).collect().head.getLong(0)
 14/08/18 06:34:29 INFO ParseDriver: Parsing command: select count(*)
 from dataset_records
 14/08/18 06:34:29 INFO ParseDriver: Parse Completed
 14/08/18 06:34:29 INFO HiveMetaStore: 0: get_table : db=default
 tbl=dataset_records
 14/08/18 06:34:29 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_table
 : db=default tbl=dataset_records
 14/08/18 06:34:30 INFO MemoryStore: ensureFreeSpace(160296) called with
 curMem=0, maxMem=280248975
 14/08/18 06:34:30 INFO MemoryStore: Block broadcast_0 stored as values
 in memory (estimated size 156.5 KB, free 267.1 MB)
 14/08/18 06:34:30 INFO SparkContext: Starting job: collect at
 SparkPlan.scala:85
 14/08/18 06:34:31 WARN DAGScheduler: Creating new stage failed due to
 exception - job: 0
 java.lang.NullPointerException
 at org.apache.hadoop.hbase.util.Bytes.toBytes(Bytes.java:502)
  at
 org.apache.hadoop.hive.hbase.HiveHBaseTableInputFormat.getSplits(HiveHBaseTableInputFormat.java:418)
 at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:179)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
  at scala.Option.getOrElse(Option.scala:120)
 at org.apache.spark.rdd.RDD.partitions(RDD.scala:202)
  at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
 at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
  at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:202)
 at scala.Option.getOrElse(Option.scala:120)




 This is happening from the master on spark, I am running hbase version
 hbase-0.98.4-hadoop1 and hive version 0.13.1. And here is how I am running
 the spark shell:

 bin/spark-shell --driver-class-path
 

Re: NullPointerException when connecting from Spark to a Hive table backed by HBase

2014-08-18 Thread Zhan Zhang
Looks like hbaseTableName is null, probably caused by incorrect configuration.


String hbaseTableName = jobConf.get(HBaseSerDe.HBASE_TABLE_NAME);
setHTable(new HTable(HBaseConfiguration.create(jobConf), 
Bytes.toBytes(hbaseTableName)));

Here is the definition:

  public static final String HBASE_TABLE_NAME = "hbase.table.name";

Thanks.

Zhan Zhang


On Aug 17, 2014, at 11:39 PM, Cesar Arevalo ce...@zephyrhealthinc.com wrote:

 HadoopRDD


-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.


Re: NullPointerException when connecting from Spark to a Hive table backed by HBase

2014-08-18 Thread cesararevalo
Thanks, Zhan, for the follow-up.

But do you know how I am supposed to set that table name on the jobConf? I
don't have access to that object from my client driver.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-when-connecting-from-Spark-to-a-Hive-table-backed-by-HBase-tp12284p12331.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: NullPointerException When Reading Avro Sequence Files

2014-07-21 Thread Sparky
For those curious, I used the JavaSparkContext and got access to an
AvroSequenceFile (a wrapper around SequenceFile) using the following:

file = sc.newAPIHadoopFile("hdfs path to my file",
    AvroSequenceFileInputFormat.class, AvroKey.class, AvroValue.class,
    new Configuration())



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-when-reading-Avro-Sequence-files-tp10201p10305.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: NullPointerException When Reading Avro Sequence Files

2014-07-19 Thread Sparky
I see Spark is using AvroRecordReaderBase, which is used to grab Avro
Container Files, which is different from Sequence Files.  If anyone is using
Avro Sequence Files with success and has an example, please let me know.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-when-reading-Avro-Sequence-files-tp10201p10233.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: NullPointerException When Reading Avro Sequence Files

2014-07-19 Thread Sparky
To be more specific, I'm working with a system that stores data in
org.apache.avro.hadoop.io.AvroSequenceFile format.  An AvroSequenceFile is
"A wrapper around a Hadoop SequenceFile that also supports reading and
writing Avro data."

It seems that Spark does not support this out of the box.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-when-reading-Avro-Sequence-files-tp10201p10234.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: NullPointerException When Reading Avro Sequence Files

2014-07-19 Thread Nick Pentreath
I got this working locally a little while ago when playing around with
AvroKeyInputFile: https://gist.github.com/MLnick/5864741781b9340cb211
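In case the gist link goes stale, the core of that approach looks roughly like this
(the path and record type are placeholders, and sc is assumed to be your SparkContext):

import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable

// Read an Avro container file through the new Hadoop API
val records = sc.newAPIHadoopFile(
  "hdfs://path/to/file.avro",
  classOf[AvroKeyInputFormat[GenericRecord]],
  classOf[AvroKey[GenericRecord]],
  classOf[NullWritable])

println(records.count())   // keys are AvroKey[GenericRecord], values are NullWritable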

But not sure about AvroSequenceFile. Any chance you have an example
datafile or records?



On Sat, Jul 19, 2014 at 11:00 AM, Sparky gullo_tho...@bah.com wrote:

 To be more specific, I'm working with a system that stores data in
 org.apache.avro.hadoop.io.AvroSequenceFile format.  An AvroSequenceFile is
 A wrapper around a Hadoop SequenceFile that also supports reading and
 writing Avro data.

 It seems that Spark does not support this out of the box.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-when-reading-Avro-Sequence-files-tp10201p10234.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: NullPointerException When Reading Avro Sequence Files

2014-07-19 Thread Sparky
Thanks for the gist.  I'm just now learning about Avro.  I think when you use
a DataFileWriter you are writing to an Avro container (which is different
from an Avro sequence file).  I have a system where data was written to an
HDFS sequence file using AvroSequenceFile.Writer (which is a wrapper around
SequenceFile).

I'll put together an example of the problem so others can better understand
what I'm talking about.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-when-reading-Avro-Sequence-files-tp10201p10237.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: NullPointerException When Reading Avro Sequence Files

2014-07-18 Thread aaronjosephs
I think you probably want to use `AvroSequenceFileOutputFormat` with
`newAPIHadoopFile`. I'm not even sure that in hadoop you would use
SequenceFileInput format to read an avro sequence file



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-when-reading-Avro-Sequence-files-tp10201p10203.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: NullPointerException When Reading Avro Sequence Files

2014-07-18 Thread Sparky
Thanks for responding.  I tried using the newAPIHadoopFile method and got an
IOException with the message "Not a data file".

If anyone has an example of this working I'd appreciate your input or
examples.  

What I entered at the repl and what I got back are below:

val myAvroSequenceFile = sc.newAPIHadoopFile("hdfs://my url",
  classOf[AvroKeyInputFormat[GenericRecord]], classOf[AvroKey[GenericRecord]],
  classOf[NullWritable])

scala> myAvroSequenceFile.first()
14/07/18 17:02:38 INFO FileInputFormat: Total input paths to process : 1
14/07/18 17:02:38 INFO SparkContext: Starting job: first at console:19
14/07/18 17:02:38 INFO DAGScheduler: Got job 0 (first at console:19) with
1 output partitions (allowLocal=true)
14/07/18 17:02:38 INFO DAGScheduler: Final stage: Stage 0(first at
console:19)
14/07/18 17:02:38 INFO DAGScheduler: Parents of final stage: List()
14/07/18 17:02:38 INFO DAGScheduler: Missing parents: List()
14/07/18 17:02:38 INFO DAGScheduler: Computing the requested partition
locally
14/07/18 17:02:38 INFO NewHadoopRDD: Input split: hdfs:my url
14/07/18 17:02:38 WARN AvroKeyInputFormat: Reader schema was not set. Use
AvroJob.setInputKeySchema() if desired.
14/07/18 17:02:38 INFO AvroKeyInputFormat: Using a reader schema equal to
the writer schema.
14/07/18 17:02:38 INFO DAGScheduler: Failed to run first at console:19
org.apache.spark.SparkDriverExecutionException: Execution error
at
org.apache.spark.scheduler.DAGScheduler.runLocallyWithinThread(DAGScheduler.scala:585)
at
org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:563)
Caused by: java.io.IOException: Not a data file.
at 
org.apache.avro.file.DataFileStream.initialize(DataFileStream.java:105)
at org.apache.avro.file.DataFileReader.init(DataFileReader.java:97)
at
org.apache.avro.mapreduce.AvroRecordReaderBase.createAvroFileReader(AvroRecordReaderBase.java:180)
at
org.apache.avro.mapreduce.AvroRecordReaderBase.initialize(AvroRecordReaderBase.java:90)
at 
org.apache.spark.rdd.NewHadoopRDD$$anon$1.init(NewHadoopRDD.scala:114)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:100)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:62)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:261)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
at
org.apache.spark.scheduler.DAGScheduler.runLocallyWithinThread(DAGScheduler.scala:578)
... 1 more



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-when-reading-Avro-Sequence-files-tp10201p10204.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: NullPointerException on ExternalAppendOnlyMap

2014-07-02 Thread Andrew Or
Hi Konstantin,

Thanks for reporting this. This happens because there are null keys in your
data. In general, Spark should not throw null pointer exceptions, so this
is a bug. I have fixed this here: https://github.com/apache/spark/pull/1288.

For now, you can work around this by special-handling your null keys before
passing your key-value pairs to a combine operator (e.g. groupBy,
reduceBy). For instance, rdd.map { case (k, v) => if (k == null)
(SPECIAL_VALUE, v) else (k, v) }.
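A self-contained sketch of that workaround (the sentinel string is arbitrary; pick one
that cannot collide with a real key):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._   // pair-RDD implicits on older Spark versions

val sc = new SparkContext(new SparkConf().setAppName("null-key-workaround").setMaster("local[2]"))

val rdd = sc.parallelize(Seq((null.asInstanceOf[String], 1), ("a", 2), ("a", 3)))

val SPECIAL_VALUE = "__NULL_KEY__"   // sentinel standing in for null keys
val safe = rdd.map { case (k, v) => if (k == null) (SPECIAL_VALUE, v) else (k, v) }

// Combine operators now never see a null key; e.g. sum per key:
safe.reduceByKey(_ + _).collect().foreach(println)   // prints (a,5) and (__NULL_KEY__,1)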

Best,
Andrew




2014-07-02 10:22 GMT-07:00 Konstantin Kudryavtsev 
kudryavtsev.konstan...@gmail.com:

 Hi all,

 I catch very confusing exception running Spark 1.0 on HDP2.1
 During save rdd as text file I got:


 14/07/02 10:11:12 WARN TaskSetManager: Loss was due to 
 java.lang.NullPointerException
 java.lang.NullPointerException
   at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.org$apache$spark$util$collection$ExternalAppendOnlyMap$ExternalIterator$$getMorePairs(ExternalAppendOnlyMap.scala:254)
   at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator$$anonfun$3.apply(ExternalAppendOnlyMap.scala:237)
   at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator$$anonfun$3.apply(ExternalAppendOnlyMap.scala:236)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap$ExternalIterator.init(ExternalAppendOnlyMap.scala:236)
   at 
 org.apache.spark.util.collection.ExternalAppendOnlyMap.iterator(ExternalAppendOnlyMap.scala:218)
   at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:162)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
   at 
 org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
   at 
 org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
   at org.apache.spark.scheduler.Task.run(Task.scala:51)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:744)


 Do you have any idea what is it? how can I debug this issue or perhaps
 access another log?


 Thank you,
 Konstantin Kudryavtsev



Re: NullPointerException on reading checkpoint files

2014-06-12 Thread Kiran
I am also seeing a similar problem when trying to continue a job using a saved
checkpoint. Can somebody help in solving this problem?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/NullPointerException-on-reading-checkpoint-files-tp7306p7507.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.