I suspect it's numpy filling up the memory.

Thanks
Best Regards
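P.S. If numpy is the culprit, one mitigation is to memory-map the matrices instead of loading them fully into the driver. A minimal sketch, assuming the matrices live in .npy files (the file names below are hypothetical):

    import numpy as np

    # mmap_mode="r" keeps the data on disk and pages it in on demand, so the
    # three 10000 x 10000 float64 matrices (~800 MB each) are never all
    # resident in the driver process at once.
    m1 = np.load("matrix1.npy", mmap_mode="r")
    m2 = np.load("matrix2.npy", mmap_mode="r")
    m3 = np.load("matrix3.npy", mmap_mode="r")

    # Alternatively, if float32 precision suffices, casting halves the
    # footprint up front:
    # m1 = np.load("matrix1.npy").astype(np.float32)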
On Fri, Jul 17, 2015 at 5:46 PM, Harit Vishwakarma <harit.vishwaka...@gmail.com> wrote:

> 1. Load 3 matrices of size ~10000 x 10000 using numpy.
> 2. rdd2 = rdd1.values().flatMap(fun)  # rdd1 has roughly 10^7 tuples
> 3. df = sqlCtx.createDataFrame(rdd2)
> 4. df.save()  # in parquet format
>
> It throws an exception in the createDataFrame() call. I don't know what
> exactly it is creating. Everything in memory? Or can I make it persist to
> disk while it is being created?
>
> Thanks
>
> On Fri, Jul 17, 2015 at 5:16 PM, Akhil Das <ak...@sigmoidanalytics.com> wrote:
>
>> Can you paste the code? How much memory does your system have, and how
>> big is your dataset? Did you try df.persist(StorageLevel.MEMORY_AND_DISK)?
>>
>> Thanks
>> Best Regards
>>
>> On Fri, Jul 17, 2015 at 5:14 PM, Harit Vishwakarma <harit.vishwaka...@gmail.com> wrote:
>>
>>> Thanks. The code is running on a single machine, and it still doesn't
>>> answer my question.
>>>
>>> On Fri, Jul 17, 2015 at 4:52 PM, ayan guha <guha.a...@gmail.com> wrote:
>>>
>>>> You can bump up the number of partitions while creating the RDD you are
>>>> using for the DataFrame.
>>>>
>>>> On 17 Jul 2015 21:03, "Harit Vishwakarma" <harit.vishwaka...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I used the createDataFrame API of SQLContext in Python and am getting an
>>>>> OutOfMemoryException. I am wondering whether it creates the whole
>>>>> DataFrame in memory. I did not find any documentation describing the
>>>>> memory usage of Spark APIs. The documentation given is nice, but a
>>>>> little more detail (especially on memory usage, data distribution,
>>>>> etc.) would really help.
>>>>>
>>>>> --
>>>>> Regards
>>>>> Harit Vishwakarma
>>>
>>> --
>>> Regards
>>> Harit Vishwakarma
>
> --
> Regards
> Harit Vishwakarma
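Putting the thread's two suggestions together (more partitions, per ayan guha, and MEMORY_AND_DISK persistence, per Akhil Das), a rough PySpark 1.x sketch might look like the following. rdd1, fun, and the output path are the original poster's and are assumed here; the partition count is an arbitrary placeholder:

    from pyspark import StorageLevel

    # More partitions mean smaller tasks, so less data is materialized per
    # task at any one time. 200 is a placeholder; tune it for your data.
    rdd2 = rdd1.values().flatMap(fun).repartition(200)

    # Passing an explicit schema here would also skip the sampling pass that
    # createDataFrame otherwise runs to infer one.
    df = sqlCtx.createDataFrame(rdd2)

    # Spill partitions to disk instead of failing when they exceed memory.
    df.persist(StorageLevel.MEMORY_AND_DISK)

    df.save("output.parquet", "parquet")  # output path is hypothetical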