Write Parquet file with Spark Streaming on Spark 1.3
Hi,

I've succeeded in writing a Kafka stream to a Parquet file with Spark 1.2, but I can't make it work with Spark 1.3. In streaming I can't use saveAsParquetFile() because I can't append data to an existing Parquet file. I know it's possible to stream data directly into Parquet; could you help me with a small sample of which API I need to use?

Thanks
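A minimal sketch of one possible approach on Spark 1.3, assuming each micro-batch can be converted to a DataFrame and that appending every batch into a single Parquet directory with SaveMode.Append is acceptable; the Event case class, the output path, and buildEventStream are illustrative, not the original code:

  import org.apache.spark.SparkConf
  import org.apache.spark.sql.{SQLContext, SaveMode}
  import org.apache.spark.streaming.dstream.DStream
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  // Hypothetical record type; replace with your own schema.
  case class Event(publisherId: String, count: Long)

  object ParquetStreamingSketch {
    def main(args: Array[String]): Unit = {
      val ssc = new StreamingContext(new SparkConf().setAppName("ParquetStreamingSketch"), Seconds(20))
      val sqlContext = new SQLContext(ssc.sparkContext)
      import sqlContext.implicits._

      val events: DStream[Event] = buildEventStream(ssc) // e.g. KafkaUtils.createStream(...).map(...)

      events.foreachRDD { rdd =>
        if (!rdd.isEmpty()) {
          // Append this micro-batch to one Parquet directory instead of
          // calling saveAsParquetFile(), which cannot add to an existing file.
          rdd.toDF().save("hdfs:///data/events_parquet", "parquet", SaveMode.Append)
        }
      }

      ssc.start()
      ssc.awaitTermination()
    }

    // Placeholder: wire up the Kafka stream here.
    def buildEventStream(ssc: StreamingContext): DStream[Event] = ???
  }

Note that each appended batch becomes additional Parquet part files in that directory, so small batch intervals produce many small files.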
NullPointerException in cluster mode when using foreachPartition
Hi,

This time I need an expert. On 1.1.1, and only on a cluster (standalone or EC2), when I use this code:

  countersPublishers.foreachRDD(rdd => {
    rdd.foreachPartition(partitionRecords => {
      partitionRecords.foreach(record => {
        //dbActorUpdater ! updateDBMessage(record)
        println(record)
      })
    })
  })

I get an NPE. (When I run this locally, all is OK.) If I use this instead:

  countersPublishers.foreachRDD(rdd => rdd.collect().foreach(r => dbActorUpdater ! updateDBMessage(r)))

there is no problem. I think something is misconfigured.

Thanks for help
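A minimal sketch of the usual pattern when a partition-level closure needs to write to an external system, assuming the NPE comes from a driver-side reference (such as the actor behind dbActorUpdater) being captured in the closure and not being available on the executors; DbClient and updateDB are illustrative stand-ins, not the poster's API:

  // Hypothetical stand-in for whatever dbActorUpdater wraps.
  class DbClient extends Serializable {
    def updateDB(record: Any): Unit = println(record) // replace with the real write
    def close(): Unit = ()
  }

  countersPublishers.foreachRDD { rdd =>
    rdd.foreachPartition { partitionRecords =>
      // Build the client on the executor; a driver-only reference captured
      // in this closure is not usable (often null) on the workers.
      val client = new DbClient()
      partitionRecords.foreach(record => client.updateDB(record))
      client.close()
    }
  }

The collect() variant works because the foreach then runs on the driver, where dbActorUpdater exists, but it pulls every record back to the driver.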
Driver fails with out-of-memory exception
Hi,

I've written a job (not very complicated, I think: only one reduceByKey), but the driver JVM always hangs with an OOM, killing the worker of course. How can I know what is running on the driver and what is running on the workers, and how can I debug the memory problem? I've already used the --driver-memory 4g parameter to give it more memory, but nothing helps; it always fails.

Thanks
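One common cause, offered only as a guess since the job code isn't shown, is that the result of the reduceByKey is materialised on the driver (collect(), printing everything, building a large local map, etc.). A minimal sketch of keeping the result on the executors instead; the paths and field extraction are illustrative:

  import org.apache.spark.{SparkConf, SparkContext}

  object DriverMemorySketch {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("DriverMemorySketch"))

      val counts = sc.textFile("hdfs:///input/events")        // illustrative path
        .map(line => (line.split(",")(0), 1L))
        .reduceByKey(_ + _)

      // Driver-side OOM pattern: pulling the whole result into the driver JVM.
      // val all = counts.collect()

      // Keeps the data on the executors; only a small sample reaches the driver.
      counts.saveAsTextFile("hdfs:///output/counts")           // illustrative path
      counts.take(10).foreach(println)

      sc.stop()
    }
  }

As a rule of thumb, code inside transformations (map, reduceByKey, foreachPartition) runs on the executors, while code that handles the values returned by actions (collect, take) runs in the driver JVM.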
Re: How to scale more consumers on a Kafka stream
Thanks to all, I'm going to check both solutions.
How to scale a large Kafka topic
Hi,

I'm building an application that reads events from a Kafka stream. In production we have 5 consumers that share 10 partitions, but with Spark Streaming's Kafka receiver the master acts as the consumer and then distributes the tasks to the workers, so I can have only 1 master acting as a consumer. I need more, because only 1 consumer means lag. Do you have any idea what I can do?

Another interesting point: the master is not loaded at all; I can't get above 10% CPU. I've tried to increase queued.max.message.chunks on the Kafka client to read more records, thinking it would speed up the reads, but I only get:

  ERROR consumer.ConsumerFetcherThread: [ConsumerFetcherThread-SparkEC2_ip-10-138-59-194.ec2.internal-1410182950783-5c49c8e8-0-174167372], Error in fetch Name: FetchRequest; Version: 0; CorrelationId: 73; ClientId: SparkEC2-ConsumerFetcherThread-SparkEC2_ip-10-138-59-194.ec2.internal-1410182950783-5c49c8e8-0-174167372; ReplicaId: -1; MaxWait: 100 ms; MinBytes: 1 bytes; RequestInfo: [IA2,7] -> PartitionFetchInfo(929838589,1048576), [IA2,6] -> PartitionFetchInfo(929515796,1048576), [IA2,9] -> PartitionFetchInfo(929577946,1048576), [IA2,8] -> PartitionFetchInfo(930751599,1048576), [IA2,2] -> PartitionFetchInfo(926457704,1048576), [IA2,5] -> PartitionFetchInfo(930774385,1048576), [IA2,0] -> PartitionFetchInfo(929913213,1048576), [IA2,3] -> PartitionFetchInfo(929268891,1048576), [IA2,4] -> PartitionFetchInfo(929949877,1048576), [IA2,1] -> PartitionFetchInfo(930063114,1048576)
  java.lang.OutOfMemoryError: Java heap space

Does anyone have ideas?

Thanks
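A minimal sketch of the usual receiver-based workaround, which is to start several Kafka receivers (each its own consumer in the group) and union them; the ZooKeeper address, group id, topic name, receiver count, and repartition factor are all illustrative:

  import org.apache.spark.SparkConf
  import org.apache.spark.storage.StorageLevel
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.kafka.KafkaUtils

  object MultiReceiverSketch {
    def main(args: Array[String]): Unit = {
      val ssc = new StreamingContext(new SparkConf().setAppName("MultiReceiverSketch"), Seconds(10))

      // Several receivers, each consuming a share of the topic's partitions.
      val numReceivers = 5
      val streams = (1 to numReceivers).map { _ =>
        KafkaUtils.createStream(ssc, "zk-host:2181", "spark-consumers",
          Map("IA2" -> 2), StorageLevel.MEMORY_AND_DISK_SER)
      }

      // Union the receiver streams and spread the processing across the cluster.
      val unified = ssc.union(streams).repartition(20)
      unified.map(_._2).count().print()

      ssc.start()
      ssc.awaitTermination()
    }
  }

Each receiver occupies one core on a worker, so the cluster needs enough cores for the receivers plus the processing.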
Re: NoSuchElementException: key not found when changing the window length and interval in Spark Streaming
Hi, I get exactly the same problem here. Have you found the cause? Thanks
Error with Stream Kafka Kryo
Hi,

My setup uses local standalone mode, the Spark 1.0.0 release, and Scala 2.10.4. I made a job that receives serialized objects from a Kafka broker; the objects are serialized using Kryo. The code:

  val sparkConf = new SparkConf().setMaster("local[4]").setAppName("SparkTest")
    .set("spark.serializer", "org.apache.spark.serializer.JavaSerializer")
    .set("spark.kryo.registrator", "com.inneractive.fortknox.kafka.EventDetailRegistrator")
  val ssc = new StreamingContext(sparkConf, Seconds(20))
  ssc.checkpoint(checkpoint)
  val topicMap = topic.split(",").map((_, partitions)).toMap
  // Create a stream using my decoder EventKryoEncoder
  val events = KafkaUtils.createStream[String, EventDetails, StringDecoder, EventKryoEncoder](
    ssc, kafkaMapParams, topicMap, StorageLevel.MEMORY_AND_DISK_SER).map(_._2)
  val data = events.map(e => (e.getPublisherId, 1L))
  val counter = data.reduceByKey(_ + _)
  counter.print()
  ssc.start()
  ssc.awaitTermination()

When I run this I get:

  java.lang.IllegalStateException: unread block data
    at java.io.ObjectInputStream$BlockDataInputStream.setBlockDataMode(ObjectInputStream.java:2421) ~[na:1.7.0_60]
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1382) ~[na:1.7.0_60]
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) ~[na:1.7.0_60]
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) ~[na:1.7.0_60]
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) ~[na:1.7.0_60]
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) ~[na:1.7.0_60]
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) ~[na:1.7.0_60]
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63) ~[spark-core_2.10-1.0.0.jar:1.0.0]
    at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:125) ~[spark-core_2.10-1.0.0.jar:1.0.0]
    at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) ~[spark-core_2.10-1.0.0.jar:1.0.0]
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) ~[scala-library-2.10.4.jar:na]
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) ~[scala-library-2.10.4.jar:na]
    at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58) ~[spark-core_2.10-1.0.0.jar:1.0.0]
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:96) ~[spark-core_2.10-1.0.0.jar:1.0.0]
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:95) ~[spark-core_2.10-1.0.0.jar:1.0.0]
    at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582) ~[spark-core_2.10-1.0.0.jar:1.0.0]
    at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582) ~[spark-core_2.10-1.0.0.jar:1.0.0]
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) ~[spark-core_2.10-1.0.0.jar:1.0.0]
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) ~[spark-core_2.10-1.0.0.jar:1.0.0]
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) ~[spark-core_2.10-1.0.0.jar:1.0.0]
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158) ~[spark-core_2.10-1.0.0.jar:1.0.0]
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) ~[spark-core_2.10-1.0.0.jar:1.0.0]
    at org.apache.spark.scheduler.Task.run(Task.scala:51) ~[spark-core_2.10-1.0.0.jar:1.0.0]
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187) ~[spark-core_2.10-1.0.0.jar:1.0.0]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_60]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_60]
    at java.lang.Thread.run(Thread.java:745) [na:1.7.0_60]

I've checked that my decoder is working; I can trace that the deserialization is OK, so Spark must be getting ready-to-use objects. My setup works if I use JSON instead of Kryo-serialized objects.

Thanks for any help, because I don't know what to do next.
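One thing that stands out, offered as an observation rather than a confirmed fix: spark.serializer is set to JavaSerializer while a Kryo registrator is also configured, so the registrator has no effect on Spark's own serialization, and the stack trace shows the failure inside JavaDeserializationStream while the reduceByKey shuffle data is being read. A minimal sketch of a Kryo-consistent configuration, assuming EventDetails is the class that travels through the shuffle:

  import org.apache.spark.SparkConf

  val sparkConf = new SparkConf()
    .setMaster("local[4]")
    .setAppName("SparkTest")
    // Use Kryo for Spark's internal serialization so the registrator is actually applied.
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.kryo.registrator", "com.inneractive.fortknox.kafka.EventDetailRegistrator")

If EventDetails is not java.io.Serializable, it must be registered with Kryo (or made Serializable) before it can be shuffled by reduceByKey.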