This is what happens when you create a DataFrame: <https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala#L430>. In your case, rdd1.values().flatMap(fun) will be executed when you create the df: <https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L127>. Can you run just rdd1.values().flatMap(fun).count(), or a save, to see whether that part executes without any problems?
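Something like this minimal sketch could be used for that check; the SparkContext setup, the sample data, and fun below are placeholders standing in for your real RDD and function:

from pyspark import SparkContext

sc = SparkContext(appName="flatMapCheck")

# Placeholder key/value RDD; the real rdd1 has roughly 10^7 tuples.
rdd1 = sc.parallelize([(i, i) for i in range(1000)])

def fun(v):
    # Placeholder for the real flatMap function.
    return [v, v * 2]

# Materialize the flatMap on its own, without createDataFrame,
# to see whether this stage completes without memory problems.
print(rdd1.values().flatMap(fun).count())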
Thanks
Best Regards

On Sat, Jul 18, 2015 at 2:27 PM, Harit Vishwakarma <harit.vishwaka...@gmail.com> wrote:

> Even if I remove the numpy calls (no matrices loaded), the same exception is coming.
> Can anyone tell what createDataFrame does internally? Are there any alternatives for it?
>
> On Fri, Jul 17, 2015 at 6:43 PM, Akhil Das <ak...@sigmoidanalytics.com> wrote:
>
>> I suspect it's the numpy filling up memory.
>>
>> Thanks
>> Best Regards
>>
>> On Fri, Jul 17, 2015 at 5:46 PM, Harit Vishwakarma <harit.vishwaka...@gmail.com> wrote:
>>
>>> 1. Load 3 matrices of size ~10000 x 10000 using numpy.
>>> 2. rdd2 = rdd1.values().flatMap(fun)  # rdd1 has roughly 10^7 tuples
>>> 3. df = sqlCtx.createDataFrame(rdd2)
>>> 4. df.save()  # in parquet format
>>>
>>> It throws an exception in the createDataFrame() call. I don't know what exactly it is creating: everything in memory? Or can I make it persist simultaneously while it is being created?
>>>
>>> Thanks
>>>
>>> On Fri, Jul 17, 2015 at 5:16 PM, Akhil Das <ak...@sigmoidanalytics.com> wrote:
>>>
>>>> Can you paste the code? How much memory does your system have and how big is your dataset? Did you try df.persist(StorageLevel.MEMORY_AND_DISK)?
>>>>
>>>> Thanks
>>>> Best Regards
>>>>
>>>> On Fri, Jul 17, 2015 at 5:14 PM, Harit Vishwakarma <harit.vishwaka...@gmail.com> wrote:
>>>>
>>>>> Thanks,
>>>>> The code is running on a single machine.
>>>>> And it still doesn't answer my question.
>>>>>
>>>>> On Fri, Jul 17, 2015 at 4:52 PM, ayan guha <guha.a...@gmail.com> wrote:
>>>>>
>>>>>> You can bump up the number of partitions while creating the rdd you are using for the df.
>>>>>> On 17 Jul 2015 21:03, "Harit Vishwakarma" <harit.vishwaka...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I used the createDataFrame API of SqlContext in Python and am getting an OutOfMemoryException. I am wondering if it is creating the whole dataFrame in memory?
>>>>>>> I did not find any documentation describing the memory usage of Spark APIs. The documentation given is nice, but a little more detail (especially on memory usage / data distribution) would really help.
>>>>>>>
>>>>>>> --
>>>>>>> Regards
>>>>>>> Harit Vishwakarma
>>>>>
>>>>> --
>>>>> Regards
>>>>> Harit Vishwakarma
>>>
>>> --
>>> Regards
>>> Harit Vishwakarma
>
> --
> Regards
> Harit Vishwakarma
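As a rough sketch of the two suggestions that came up earlier in the thread (persisting the DataFrame with MEMORY_AND_DISK and bumping the number of partitions of the source rdd), something like the following might be worth trying. The sample data, fun, and the output path are placeholders; only the names rdd1, rdd2, sqlCtx and df mirror the original code:

from pyspark import SparkContext, StorageLevel
from pyspark.sql import SQLContext

sc = SparkContext(appName="createDataFrameMemory")
sqlCtx = SQLContext(sc)

# Placeholder key/value RDD; numSlices bumps the number of partitions
# so that each partition stays small.
rdd1 = sc.parallelize([(i, (i, i * 2)) for i in range(10000)], numSlices=200)

def fun(v):
    # Placeholder for the real flatMap function; it returns a list of
    # tuples so that createDataFrame can infer a schema.
    return [v]

rdd2 = rdd1.values().flatMap(fun)

df = sqlCtx.createDataFrame(rdd2)

# Allow Spark to spill the DataFrame to disk instead of keeping it all in memory.
df.persist(StorageLevel.MEMORY_AND_DISK)

# DataFrame.save in Spark 1.x writes parquet by default; the path is a placeholder.
df.save("/tmp/rdd2_parquet")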