This is what happens when you create a DataFrame
<https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L430>:
in your case, rdd1.values().flatMap(fun) will be executed
<https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L127>
when you create the df. Can you run just rdd1.values().flatMap(fun).count(),
or a save, to check that that part executes without any problems?
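
For example, something along these lines (just a sketch; rdd1, fun and sqlCtx
are the objects from your code, and the output path below is only a placeholder):

    # run the transformation alone, without building a DataFrame
    rdd2 = rdd1.values().flatMap(fun)
    print(rdd2.count())                        # forces full execution
    # or dump it somewhere, e.g.
    # rdd2.saveAsTextFile("/tmp/rdd2_check")   # placeholder path

If that completes without errors, the problem is on the DataFrame side rather
than in your transformation.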

Thanks
Best Regards

On Sat, Jul 18, 2015 at 2:27 PM, Harit Vishwakarma <
harit.vishwaka...@gmail.com> wrote:

> Even if I remove the numpy calls (no matrices loaded), the same exception
> comes up.
> Can anyone tell me what createDataFrame does internally? Are there any
> alternatives to it?
>
> On Fri, Jul 17, 2015 at 6:43 PM, Akhil Das <ak...@sigmoidanalytics.com>
> wrote:
>
>> I suspect it's numpy filling up the memory.
>>
>> Thanks
>> Best Regards
>>
>> On Fri, Jul 17, 2015 at 5:46 PM, Harit Vishwakarma <
>> harit.vishwaka...@gmail.com> wrote:
>>
>>> 1. load 3 matrices of size ~ 10000 X 10000 using numpy.
>>> 2. rdd2 = rdd1.values().flatMap( fun )  # rdd1 has roughly 10^7 tuples
>>> 3. df = sqlCtx.createDataFrame(rdd2)
>>> 4. df.save() # in parquet format
>>>
>>> It throws an exception in the createDataFrame() call. I don't know what
>>> exactly it is creating: everything in memory? Or can I make it persist
>>> while it is being created?
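>>>
>>> For reference, the relevant lines look roughly like this (simplified
>>> sketch; the numpy loading, fun, and the save arguments are elided):
>>>
>>>     rdd2 = rdd1.values().flatMap(fun)   # rdd1 has roughly 10^7 tuples
>>>     df = sqlCtx.createDataFrame(rdd2)   # the exception is raised here
>>>     df.save(...)                        # parquet format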
>>>
>>> Thanks
>>>
>>>
>>> On Fri, Jul 17, 2015 at 5:16 PM, Akhil Das <ak...@sigmoidanalytics.com>
>>> wrote:
>>>
>>>> Can you paste the code? How much memory does your system have and how
>>>> big is your dataset? Did you try df.persist(StorageLevel.MEMORY_AND_DISK)?
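>>>>
>>>> For reference, that would look roughly like this (just a sketch, added
>>>> before the save):
>>>>
>>>>     from pyspark import StorageLevel
>>>>     df.persist(StorageLevel.MEMORY_AND_DISK)
>>>>     # ... then your existing df.save() call; the persist takes effect
>>>>     # when that action runs and lets partitions spill to disk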
>>>>
>>>> Thanks
>>>> Best Regards
>>>>
>>>> On Fri, Jul 17, 2015 at 5:14 PM, Harit Vishwakarma <
>>>> harit.vishwaka...@gmail.com> wrote:
>>>>
>>>>> Thanks.
>>>>> The code is running on a single machine,
>>>>> and that still doesn't answer my question.
>>>>>
>>>>> On Fri, Jul 17, 2015 at 4:52 PM, ayan guha <guha.a...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> You can bump up the number of partitions while creating the RDD you are
>>>>>> using for the DataFrame.
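>>>>>>
>>>>>> For example (sketch only; 200 is an arbitrary number, tune it to your
>>>>>> data, or pass a higher numSlices to sc.parallelize if that is how rdd1
>>>>>> is built):
>>>>>>
>>>>>>     rdd2 = rdd1.values().flatMap(fun).repartition(200)
>>>>>>     df = sqlCtx.createDataFrame(rdd2)
>>>>>>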
>>>>>> On 17 Jul 2015 21:03, "Harit Vishwakarma" <
>>>>>> harit.vishwaka...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I used the createDataFrame API of SQLContext in Python and am getting an
>>>>>>> OutOfMemoryException. I am wondering whether it creates the whole
>>>>>>> DataFrame in memory?
>>>>>>> I did not find any documentation describing the memory usage of Spark
>>>>>>> APIs.
>>>>>>> The documentation is nice, but a little more detail (especially on memory
>>>>>>> usage / data distribution etc.) would really help.
>>>>>>>
>>>>>>> --
>>>>>>> Regards
>>>>>>> Harit Vishwakarma
>>>>>>>
>>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Regards
>>>>> Harit Vishwakarma
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> Regards
>>> Harit Vishwakarma
>>>
>>>
>>
>
>
> --
> Regards
> Harit Vishwakarma
>
>
