Re: Caching small Rdd's take really long time and Spark seems frozen

Guillermo Ortiz Thu, 23 Aug 2018 13:44:14 -0700

It's a complex DAG before the point where I cache the RDD: there are some joins,
filters and maps before caching the data, but most of the time it takes almost
no time. I could understand it if it always took the same time to process or
cache the data. Besides, it seems random, and there isn't any weird data in the
input.


Another test I tried was disabling caching, and I saw that all the
microbatches then take the same time, so it seems to be related to
caching these RDDs.

On Thu, 23 Aug 2018 at 15:29, Sonal Goyal (<sonalgoy...@gmail.com>)
wrote:

> How are these small RDDs created? Could the blockage be in computing
> them rather than in caching them?
>
> Thanks,
> Sonal
> Nube Technologies <http://www.nubetech.co>
>
> <http://in.linkedin.com/in/sonalgoyal>
>
>
>
> On Thu, Aug 23, 2018 at 6:38 PM, Guillermo Ortiz <konstt2...@gmail.com>
> wrote:
>
>> I use Spark with caching via the persist method. I have several RDDs that I
>> cache, but some of them are pretty small (about 300 KB). Most of the time it
>> works well and the whole job usually takes 1s, but sometimes it takes about
>> 40s to store 300 KB in the cache.
>>
>> If I go to Spark UI -> Cache, I can see the percentage increasing
>> up to 83% (250 KB) and then it stops for a while. If I check the event
>> timeline in the Spark UI, I can see that when this happens there is a node
>> where tasks take a very long time. This node could be any node in the
>> cluster; it's not always the same one.
>>
>> In the Spark executor logs I can see that it takes about 40s to
>> store 3.7 KB when this problem occurs:
>>
>>     INFO  2018-08-23 12:46:58 Logging.scala:54 - org.apache.spark.storage.BlockManager: Found block rdd_1705_23 locally
>>     INFO  2018-08-23 12:47:38 Logging.scala:54 - org.apache.spark.storage.memory.MemoryStore: Block rdd_1692_7 stored as bytes in memory (estimated size 3.7 KB, free 1048.0 MB)
>>     INFO  2018-08-23 12:47:38 Logging.scala:54 - org.apache.spark.storage.BlockManager: Found block rdd_1692_7 locally
>>
>> I have tried MEMORY_ONLY, MEMORY_AND_SER and so on, with the same
>> results. I have checked the disk I/O (although if I use MEMORY_ONLY I guess
>> that doesn't make sense) and I can't see any problem. This happens
>> randomly, but it could be in about 25% of the jobs.
>>
>> Any idea what could be happening?
>>
>
>
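
The executor log excerpt quoted above shows a 40-second jump between two
consecutive lines. A minimal sketch of how such gaps could be located in a
longer executor log, assuming every entry carries a "YYYY-MM-DD HH:MM:SS"
timestamp as in the excerpt. The function name find_gaps and the 10-second
threshold are hypothetical choices for illustration, not part of Spark or of
the original poster's setup:

```python
import re
from datetime import datetime

# Timestamp shape taken from the excerpt in this thread, e.g.
# "INFO  2018-08-23 12:46:58 Logging.scala:54 - <logger>: <message>"
TIMESTAMP = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")


def find_gaps(lines, threshold_s=10):
    """Return (gap_seconds, previous_line, current_line) for each pair of
    consecutive timestamped lines that are more than threshold_s apart.

    threshold_s is an arbitrary illustrative default."""
    gaps = []
    prev_ts, prev_line = None, None
    for line in lines:
        match = TIMESTAMP.search(line)
        if match is None:
            continue  # skip wrapped continuation lines without a timestamp
        ts = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        if prev_ts is not None:
            delta = (ts - prev_ts).total_seconds()
            if delta > threshold_s:
                gaps.append((delta, prev_line.strip(), line.strip()))
        prev_ts, prev_line = ts, line
    return gaps


# The three log entries quoted in this thread, each joined onto one line.
sample = [
    "INFO  2018-08-23 12:46:58 Logging.scala:54 - "
    "org.apache.spark.storage.BlockManager: Found block rdd_1705_23 locally",
    "INFO  2018-08-23 12:47:38 Logging.scala:54 - "
    "org.apache.spark.storage.memory.MemoryStore: Block rdd_1692_7 stored as "
    "bytes in memory (estimated size 3.7 KB, free 1048.0 MB)",
    "INFO  2018-08-23 12:47:38 Logging.scala:54 - "
    "org.apache.spark.storage.BlockManager: Found block rdd_1692_7 locally",
]

for gap, before, after in find_gaps(sample):
    print(f"{gap:.0f}s gap before: {after}")
```

Running this over the full log of the slow executor would list every pause
longer than the threshold together with the entry that follows it, which may
help tell whether the time is spent before the block is stored (computing the
partition) or elsewhere.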
