Hi Davies, I'm running 1.1.0. I'm now following this thread, which recommends setting the batchSize parameter to 1:
http://apache-spark-user-list.1001560.n3.nabble.com/pySpark-memory-usage-td3022.html
If that does not work I will install 1.2.1 or 1.3.
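For reference, this is roughly how I plan to set it up (a minimal, untested sketch; the app name is just a placeholder and the rest of my configuration is omitted):

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    # batchSize=1 serializes Python objects one at a time instead of in
    # large batches, which should lower the per-task memory footprint
    sc = SparkContext(appName="log-aggregation", batchSize=1)
    sqlContext = SQLContext(sc)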
Regards

On Wed, Mar 25, 2015 at 3:39 PM, Davies Liu <dav...@databricks.com> wrote:
> What's the version of Spark you are running?
>
> There is a bug in the SQL Python API [1]; it's fixed in 1.2.1 and 1.3.
>
> [1] https://issues.apache.org/jira/browse/SPARK-6055
>
> On Wed, Mar 25, 2015 at 10:33 AM, Eduardo Cusa
> <eduardo.c...@usmediaconsulting.com> wrote:
> > Hi Guys, I'm running the following function with spark-submit and the OS is
> > killing my process:
> >
> > def getRdd(self, date, provider):
> >     path = 's3n://' + AWS_BUCKET + '/' + date + '/*.log.gz'
> >     log2 = self.sqlContext.jsonFile(path)
> >     log2.registerTempTable('log_test')
> >     log2.cache()
>
> You only visit the table once, cache does not help here.
>
> >     out = self.sqlContext.sql("SELECT user, tax from log_test where provider = '"
> >         + provider + "' and country <> ''").map(lambda row: (row.user, row.tax))
> >     print "out1"
> >     return map((lambda (x, y): (x, list(y))),
> >         sorted(out.groupByKey(2000).collect()))
>
> 100 partitions (or less) will be enough for a 2 GB dataset.
>
> > The input dataset has 57 zip files (2 GB)
> >
> > The same process with a smaller dataset completed successfully
> >
> > Any ideas to debug are welcome.
> >
> > Regards
> > Eduardo
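Based on your notes, this is roughly the version I will try next (a rough, untested sketch: no cache(), fewer partitions for groupByKey, and the missing space before "and" in the WHERE clause fixed):

    def getRdd(self, date, provider):
        path = 's3n://' + AWS_BUCKET + '/' + date + '/*.log.gz'
        log = self.sqlContext.jsonFile(path)
        log.registerTempTable('log_test')
        # no cache(): the table is only read once
        query = ("SELECT user, tax FROM log_test "
                 "WHERE provider = '" + provider + "' AND country <> ''")
        out = self.sqlContext.sql(query).map(lambda row: (row.user, row.tax))
        # ~100 partitions should be enough for a ~2 GB input
        return map(lambda (user, taxes): (user, list(taxes)),
                   sorted(out.groupByKey(100).collect()))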