Hi Davies, I'm running 1.1.0.

Now I'm following this thread, which recommends setting the batchSize parameter to 1:


http://apache-spark-user-list.1001560.n3.nabble.com/pySpark-memory-usage-td3022.html
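
In case it helps, this is roughly how I plan to apply it -- a minimal sketch of
building the contexts by hand with batchSize=1 (the app name is just a
placeholder; my real job wires things up a bit differently):

  from pyspark import SparkContext
  from pyspark.sql import SQLContext

  # batchSize=1 serializes Python objects one at a time instead of in large
  # batches, trading some speed for lower peak memory in the Python workers
  # (the workaround suggested in the thread above)
  sc = SparkContext(appName="log-aggregation", batchSize=1)
  sqlContext = SQLContext(sc)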

If this does not work, I will install 1.2.1 or 1.3.

Regards

On Wed, Mar 25, 2015 at 3:39 PM, Davies Liu <dav...@databricks.com> wrote:

> What's the version of Spark you are running?
>
> There is a bug in the SQL Python API [1]; it's fixed in 1.2.1 and 1.3:
>
> [1] https://issues.apache.org/jira/browse/SPARK-6055
>
> On Wed, Mar 25, 2015 at 10:33 AM, Eduardo Cusa
> <eduardo.c...@usmediaconsulting.com> wrote:
> > Hi Guys, I'm running the following function with spark-submit and the OS
> > is killing my process:
> >
> >
> >   def getRdd(self, date, provider):
> >     path = 's3n://' + AWS_BUCKET + '/' + date + '/*.log.gz'
> >     log2 = self.sqlContext.jsonFile(path)
> >     log2.registerTempTable('log_test')
> >     log2.cache()
>
> You only read the table once, so cache() does not help here.
>
> >     out = self.sqlContext.sql(
> >         "SELECT user, tax from log_test where provider = '"
> >         + provider + "' and country <> ''"
> >     ).map(lambda row: (row.user, row.tax))
> >     print "out1"
> >     return map(lambda (x, y): (x, list(y)),
> >                sorted(out.groupByKey(2000).collect()))
>
> 100 partitions (or fewer) will be enough for a 2 GB dataset.
>
> >
> >
> > The input dataset has 57 gzipped log files (2 GB total).
> >
> > The same process with a smaller dataset completed successfully.
> >
> > Any ideas on how to debug this are welcome.
> >
> > Regards
> > Eduardo
> >
> >
>
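
For reference, a minimal sketch of getRdd with both suggestions above applied
(no cache(), and groupByKey(100) instead of 2000); it assumes AWS_BUCKET and
self.sqlContext are defined as in the original snippet and leaves the rest of
the logic unchanged:

  def getRdd(self, date, provider):
    # Read the day's gzipped JSON logs straight from S3
    path = 's3n://' + AWS_BUCKET + '/' + date + '/*.log.gz'
    log2 = self.sqlContext.jsonFile(path)
    log2.registerTempTable('log_test')
    # No cache(): the table is only read once
    out = self.sqlContext.sql(
        "SELECT user, tax from log_test where provider = '"
        + provider + "' and country <> ''"
    ).map(lambda row: (row.user, row.tax))
    # ~100 partitions should be plenty for a 2 GB input
    return map(lambda (x, y): (x, list(y)),
               sorted(out.groupByKey(100).collect()))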
