Hi Charles,

I have created a very simplified job to illustrate the problem: https://github.com/ponkin/KafkaSnapshot (the relevant code is in https://github.com/ponkin/KafkaSnapshot/blob/master/src/main/scala/ru/ponkin/KafkaSnapshot.scala).

In short: persist may be working, but not the way I expected. I thought Spark would fetch all the data from the Kafka topic once and cache it in memory; instead, the RDD is recomputed every time I call saveAsObjectFile.
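Roughly, the job boils down to the following (a minimal sketch rather than the exact repo code, assuming Spark 1.x with the spark-streaming-kafka artifact; the broker address, topic name, offsets, and output paths are made up):

    import kafka.serializer.DefaultDecoder
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

    object KafkaSnapshotProblem {
      def main(args: Array[String]): Unit = {
        val context = new SparkContext(new SparkConf().setAppName("KafkaSnapshot"))
        val kafkaParams = Map("metadata.broker.list" -> "localhost:9092") // hypothetical broker
        val range = Array(OffsetRange("events", 0, 0L, 1000L))            // hypothetical topic/offsets

        val r = KafkaUtils.createRDD[Array[Byte], Array[Byte], DefaultDecoder, DefaultDecoder](
          context, kafkaParams, range)

        r.cache() // alias for persist(MEMORY_ONLY); lazy, nothing is fetched yet

        // The first action reads Kafka and should fill the cache; with
        // MEMORY_ONLY, partitions that do not fit in memory are dropped
        // and get re-read from the brokers by the next action.
        r.saveAsObjectFile("hdfs:///tmp/out1")
        r.saveAsObjectFile("hdfs:///tmp/out2") // observed here to hit Kafka again
      }
    }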
12.01.2016, 10:56, "charles li" <charles.up...@gmail.com>:
> cache is the default storage level of persist, and it is lazy (nothing is
> actually cached) until the first time the RDD is computed.
>
> On Tue, Jan 12, 2016 at 5:13 AM, ponkin <alexey.pon...@ya.ru> wrote:
>> Hi,
>>
>> Here is my use case: I have a Kafka topic. The job is fairly simple - it
>> reads the topic and saves the data to several HDFS paths.
>> I create the RDD with the following code:
>>
>> val r = KafkaUtils.createRDD[Array[Byte],Array[Byte],DefaultDecoder,DefaultDecoder](context, kafkaParams, range)
>>
>> Then I try to cache that RDD with
>>
>> r.cache()
>>
>> and save it to several HDFS locations. But it seems that the KafkaRDD
>> fetches data from the Kafka broker every time I call saveAsNewAPIHadoopFile.
>>
>> How can I cache the data from Kafka in memory?
>>
>> P.S. When I do a repartition first, it seems to work properly (Kafka is read
>> only once), but then Spark stores the shuffled data locally.
>> Is it possible to keep the data in memory?
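For reference, a minimal sketch of the fix Charles' point suggests (same hypothetical Spark 1.x setup as above): run one cheap action right after persisting so the cache is populated exactly once, and use a storage level that can spill to disk so partitions are not silently dropped.

    import kafka.serializer.DefaultDecoder
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

    object KafkaSnapshotCached {
      def main(args: Array[String]): Unit = {
        val context = new SparkContext(new SparkConf().setAppName("KafkaSnapshot"))
        val kafkaParams = Map("metadata.broker.list" -> "localhost:9092") // hypothetical broker
        val range = Array(OffsetRange("events", 0, 0L, 1000L))            // hypothetical topic/offsets

        val r = KafkaUtils.createRDD[Array[Byte], Array[Byte], DefaultDecoder, DefaultDecoder](
          context, kafkaParams, range)

        // MEMORY_AND_DISK_SER keeps partitions that do not fit in memory
        // on local disk instead of dropping them, so later actions never
        // need to go back to the brokers.
        r.persist(StorageLevel.MEMORY_AND_DISK_SER)
        r.count() // one action to read the topic once and fill the cache

        // Subsequent actions read the persisted blocks, not Kafka.
        r.saveAsObjectFile("hdfs:///tmp/out1")
        r.saveAsObjectFile("hdfs:///tmp/out2")

        r.unpersist()
        context.stop()
      }
    }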