Write Parquet File with spark-streaming with Spark 1.3

2015-03-25 Thread richiesgr
Hi

I've succeeded in writing a Kafka stream to a Parquet file with Spark 1.2, but I can't
make it work with Spark 1.3.
In streaming I can't use saveAsParquetFile(), because I can't add data to
an existing Parquet file.

I know that it's possible to stream data directly into Parquet.
Could you help me with a small sample of which API I need to use?
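
In case it clarifies what I'm after, here is a minimal sketch of the kind of code
I'd like to end up with, assuming the Spark 1.3 DataFrame save(path, source, mode)
overload with SaveMode.Append (kafkaStream, Event and parseEvent are placeholders,
not my real code):

import org.apache.spark.sql.{SQLContext, SaveMode}

// Placeholder record type; parseEvent (not shown) maps a raw Kafka message to it
case class Event(id: String, value: Long)

val sqlContext = new SQLContext(ssc.sparkContext)
import sqlContext.implicits._

kafkaStream.foreachRDD { rdd =>
  val df = rdd.map(parseEvent).toDF()
  // SaveMode.Append writes new part-files under the same Parquet directory,
  // so each batch accumulates instead of overwriting the previous ones
  df.save("hdfs:///data/events.parquet", "parquet", SaveMode.Append)
}

Is something along these lines the intended way to do it in 1.3?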

Thanks






NullPointerException on cluster mode when using foreachPartition

2014-12-16 Thread richiesgr
Hi

This time I need an expert.
On 1.1.1, and only in cluster mode (standalone or EC2),
when I use this code:

countersPublishers.foreachRDD(rdd => {
  rdd.foreachPartition(partitionRecords => {
    partitionRecords.foreach(record => {
      // dbActorUpdater ! updateDBMessage(record)
      println(record)
    })
  })
})

I get an NPE (when I run it locally, everything is OK).

If I use this instead:

  countersPublishers.foreachRDD(rdd => rdd.collect().foreach(r =>
    dbActorUpdater ! updateDBMessage(r)))

there is no problem. I think something is misconfigured.
Thanks for the help.
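
In case it helps narrow this down, here is the per-partition pattern I'm considering,
on the assumption that the NPE comes from capturing the driver-side actor in the
closure (DBUpdater.create() is a hypothetical stand-in for whatever builds the
updater on the executor):

countersPublishers.foreachRDD(rdd => {
  rdd.foreachPartition(partitionRecords => {
    // build the updater here, on the executor, instead of referencing the
    // driver-side dbActorUpdater from inside the closure
    val updater = DBUpdater.create()
    partitionRecords.foreach(record => updater.update(record))
    updater.close()
  })
})

Does that look like the right direction, or is the NPE coming from somewhere else?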







Driver fail with out of memory exception

2014-09-14 Thread richiesgr
Hi

I've written a job (I think it's not very complicated, only one reduceByKey), but the
driver JVM always hangs with an OOM, killing the worker of course. How can I know
what is running on the driver and what is running on the workers, and how can I
debug the memory problem?
I've already used the --driver-memory 4g parameter to give it more memory, but
nothing helps; it always fails.
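
For context, this is roughly how I understand the driver/worker split, sketched with
placeholder names (events and key are not from my actual job):

// transformations such as map and reduceByKey are shipped to the executors;
// the driver only builds the DAG and schedules the tasks
val counts = events.map(e => (e.key, 1L)).reduceByKey(_ + _)

// actions that return data, e.g. collect() or countByKey(), pull the whole
// result into the driver JVM and are a classic cause of driver OOMs
// counts.collect()

// writing results out from the executors keeps the driver heap small
counts.saveAsTextFile("hdfs:///output/counts")

Is that understanding correct, and how do I tell which side is actually blowing up?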

Thanks






Re: How to scale more consumer to Kafka stream

2014-09-11 Thread richiesgr
Thanks to all.
I'm going to check both solutions.






How to scale large kafka topic

2014-09-08 Thread richiesgr
Hi

I'm building an application that reads events from a Kafka stream. In production
we have 5 consumers that share 10 partitions.
But with Spark Streaming's Kafka integration, the master acts as the consumer and
then distributes the tasks to the workers, so I can have only 1 master acting as a
consumer, but I need more, because only 1 consumer means lag.

Do you have any idea what I can do? Another interesting point: the master
is not loaded at all, it never gets above 10% CPU.
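
One approach I'm considering (a sketch only, not yet tested) is to start several
Kafka receivers and union their streams, so the 10 partitions are not all pulled
through a single consumer; ssc, zkQuorum, groupId and topicMap stand in for my
actual setup:

import org.apache.spark.streaming.kafka.KafkaUtils

// one receiver per stream, then union them into a single DStream
val numReceivers = 5
val kafkaStreams = (1 to numReceivers).map { _ =>
  KafkaUtils.createStream(ssc, zkQuorum, groupId, topicMap)
}
val unified = ssc.union(kafkaStreams)

Would that actually spread the consumption across the workers, or am I missing
something?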

I've tried to increase queued.max.message.chunks on the Kafka client to
read more records, thinking it would speed up the reads, but I only get:

ERROR consumer.ConsumerFetcherThread:
[ConsumerFetcherThread-SparkEC2_ip-10-138-59-194.ec2.internal-1410182950783-5c49c8e8-0-174167372],
Error in fetch Name: FetchRequest; Version: 0; CorrelationId: 73; ClientId:
SparkEC2-ConsumerFetcherThread-SparkEC2_ip-10-138-59-194.ec2.internal-1410182950783-5c49c8e8-0-174167372;
ReplicaId: -1; MaxWait: 100 ms; MinBytes: 1 bytes; RequestInfo: [IA2,7] ->
PartitionFetchInfo(929838589,1048576),[IA2,6] ->
PartitionFetchInfo(929515796,1048576),[IA2,9] ->
PartitionFetchInfo(929577946,1048576),[IA2,8] ->
PartitionFetchInfo(930751599,1048576),[IA2,2] ->
PartitionFetchInfo(926457704,1048576),[IA2,5] ->
PartitionFetchInfo(930774385,1048576),[IA2,0] ->
PartitionFetchInfo(929913213,1048576),[IA2,3] ->
PartitionFetchInfo(929268891,1048576),[IA2,4] ->
PartitionFetchInfo(929949877,1048576),[IA2,1] ->
PartitionFetchInfo(930063114,1048576)
java.lang.OutOfMemoryError: Java heap space

Does anyone have ideas?
Thanks






Re: NoSuchElementException: key not found when changing the window lenght and interval in Spark Streaming

2014-07-10 Thread richiesgr
Hi

I get exactly the same problem here. Have you found the cause?
Thanks





Error with Stream Kafka Kryo

2014-07-09 Thread richiesgr
Hi

My setup is local mode standalone, Spark 1.0.0 release version, Scala 2.10.4.

I made a job that receives serialized objects from a Kafka broker. The objects
are serialized using Kryo.
The code:

val sparkConf = new SparkConf().setMaster("local[4]").setAppName("SparkTest")
  .set("spark.serializer", "org.apache.spark.serializer.JavaSerializer")
  .set("spark.kryo.registrator", "com.inneractive.fortknox.kafka.EventDetailRegistrator")

val ssc = new StreamingContext(sparkConf, Seconds(20))
ssc.checkpoint(checkpoint)

val topicMap = topic.split(",").map((_, partitions)).toMap

// Create a stream using my decoder EventKryoEncoder
val events = KafkaUtils.createStream[String, EventDetails,
  StringDecoder, EventKryoEncoder](ssc, kafkaMapParams,
  topicMap, StorageLevel.MEMORY_AND_DISK_SER).map(_._2)

val data = events.map(e => (e.getPublisherId, 1L))
val counter = data.reduceByKey(_ + _)
counter.print()

ssc.start()
ssc.awaitTermination()

When I run this, I get:
java.lang.IllegalStateException: unread block data
at
java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2421)
~[na:1.7.0_60]
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1382)
~[na:1.7.0_60]
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
~[na:1.7.0_60]
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
~[na:1.7.0_60]
at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
~[na:1.7.0_60]
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
~[na:1.7.0_60]
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
~[na:1.7.0_60]
at
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
~[spark-core_2.10-1.0.0.jar:1.0.0]
at
org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125)
~[spark-core_2.10-1.0.0.jar:1.0.0]
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
~[spark-core_2.10-1.0.0.jar:1.0.0]
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
~[scala-library-2.10.4.jar:na]
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
~[scala-library-2.10.4.jar:na]
at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
~[spark-core_2.10-1.0.0.jar:1.0.0]
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:96)
~[spark-core_2.10-1.0.0.jar:1.0.0]
at
org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:95)
~[spark-core_2.10-1.0.0.jar:1.0.0]
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582)
~[spark-core_2.10-1.0.0.jar:1.0.0]
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582)
~[spark-core_2.10-1.0.0.jar:1.0.0]
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
~[spark-core_2.10-1.0.0.jar:1.0.0]
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
~[spark-core_2.10-1.0.0.jar:1.0.0]
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
~[spark-core_2.10-1.0.0.jar:1.0.0]
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
~[spark-core_2.10-1.0.0.jar:1.0.0]
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
~[spark-core_2.10-1.0.0.jar:1.0.0]
at org.apache.spark.scheduler.Task.run(Task.scala:51)
~[spark-core_2.10-1.0.0.jar:1.0.0]
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
~[spark-core_2.10-1.0.0.jar:1.0.0]
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
[na:1.7.0_60]
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
[na:1.7.0_60]
at java.lang.Thread.run(Thread.java:745) [na:1.7.0_60]

I've checked that my decoder is working; I can trace that the deserialization
is OK, so Spark must be getting ready-to-use objects.

My setup works if I use JSON instead of Kryo-serialized objects.
Thanks for any help, because I don't know what to do next.
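
One more thing I notice in my own configuration (just a guess, not a confirmed fix):
spark.serializer is still the JavaSerializer even though EventDetails is only
Kryo-serializable, so the MEMORY_AND_DISK_SER storage level and the shuffle behind
reduceByKey may be trying to Java-serialize those objects. The variant I plan to try:

val sparkConf = new SparkConf()
  .setMaster("local[4]")
  .setAppName("SparkTest")
  // use Kryo for Spark's own serialization (stored blocks and shuffle data),
  // not only inside the Kafka decoder
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "com.inneractive.fortknox.kafka.EventDetailRegistrator")

Does that sound plausible, or does the unread block data error point at something else?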





