Hi Charles,

I have created a very simplified job - https://github.com/ponkin/KafkaSnapshot - to 
illustrate the problem: 
https://github.com/ponkin/KafkaSnapshot/blob/master/src/main/scala/ru/ponkin/KafkaSnapshot.scala

In short - maybe the persist method is working, but not the way I expected.
I thought that Spark would fetch all the data from the Kafka topic once and cache it in 
memory; instead, the RDD is recomputed every time I call the saveAsObjectFile method.
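
For reference, here is roughly what I expected to be able to do (just a sketch - context,
kafkaParams and range are the same values as in the linked job, and the output paths are
placeholders):

  import kafka.serializer.DefaultDecoder
  import org.apache.spark.storage.StorageLevel
  import org.apache.spark.streaming.kafka.KafkaUtils

  val r = KafkaUtils.createRDD[Array[Byte], Array[Byte], DefaultDecoder, DefaultDecoder](
    context, kafkaParams, range)

  // Mark the RDD for caching; nothing is fetched from Kafka yet.
  r.persist(StorageLevel.MEMORY_ONLY)

  // Force a single full read of the topic so the partitions land in the block manager.
  r.count()

  // I expected these to reuse the cached blocks instead of re-reading Kafka.
  r.saveAsObjectFile("/tmp/snapshot/one")
  r.saveAsObjectFile("/tmp/snapshot/two")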



12.01.2016, 10:56, "charles li" <charles.up...@gmail.com>:
> cache() uses the default storage level of persist (MEMORY_ONLY), and it is lazy [ not 
> actually cached ] until the first time the RDD is computed.
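> For example (just a sketch, reusing the names from your snippet) - the cached copy only
> exists after the first action runs:
>
>   r.cache()                  // same as persist(StorageLevel.MEMORY_ONLY), still lazy
>   r.count()                  // first action: reads Kafka once and fills the cache
>   r.saveAsObjectFile(path)   // later actions should hit the cached blocks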
>
> On Tue, Jan 12, 2016 at 5:13 AM, ponkin <alexey.pon...@ya.ru> wrote:
>> Hi,
>>
>> Here is my use case:
>> I have a Kafka topic. The job is fairly simple - it reads the topic and saves the data 
>> to several HDFS paths.
>> I create the RDD with the following code:
>>  val r =  
>> KafkaUtils.createRDD[Array[Byte],Array[Byte],DefaultDecoder,DefaultDecoder](context,kafkaParams,range)
>> Then I try to cache that RDD with
>>  r.cache()
>> and then save it to several HDFS locations.
>> But it seems that KafkaRDD fetches data from the Kafka broker every time I 
>> call saveAsNewAPIHadoopFile.
>>
>> How can I cache data from Kafka in memory?
>>
>> P.S. When I repartition the RDD it seems to work properly (Kafka is read only 
>> once), but then Spark stores the shuffled data locally.
>> Is it possible to keep the data in memory?
>>
>> ----------------------------------------
>> View this message in context: [KafkaRDD]: rdd.cache() does not seem to work
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
