RE: ETL on pyspark

2014-02-25 Thread Adrian Mocanu
Matei Zaharia [mailto:matei.zaha...@gmail.com]
Sent: February-25-14 3:02 PM
To: user@spark.apache.org
Cc: u...@spark.incubator.apache.org
Subject: Re: ETL on pyspark

It will only move a file to the final directory when it's successfully finished writing it, so the file shouldn't have any
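The write-then-move behaviour described in this reply can be illustrated on a local filesystem. The following is only a minimal sketch of the general idea, not Spark's or Hadoop's actual commit code; the function name `write_then_publish` and the file names are hypothetical.

```python
import os
import tempfile


def write_then_publish(lines, final_path):
    """Write lines to a temporary file, then move it into place.

    Illustrative only: this mimics the write-then-move idea on a local
    filesystem and is not how Spark itself is implemented.
    """
    final_dir = os.path.dirname(os.path.abspath(final_path))
    # Create the temporary file in the destination directory so the final
    # move stays on a single filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=final_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp_file:
            for line in lines:
                tmp_file.write(line + "\n")
        # Only reached if every write succeeded.
        os.replace(tmp_path, final_path)
    except BaseException:
        # A failed write leaves nothing at final_path; discard the partial file.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With this pattern a reader of the final directory sees either the complete file or no file at all, which is why a failed and retried write should not leave a half-written file behind.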

Re: ETL on pyspark

2014-02-25 Thread Matei Zaharia
after recovery from the failure does
> it continue where it left off or will there be duplicates in the file?
>
> -A
> From: Matei Zaharia [mailto:matei.zaha...@gmail.com]
> Sent: February-24-14 4:20 PM
> To: u...@spark.incubator.apache.org
> Subject: Re: ETL on pyspark
>

RE: ETL on pyspark

2014-02-25 Thread Adrian Mocanu
on pyspark

collect() means to bring all the data back to the master node, and there might just be too much of it for that. How big is your file? If you can't bring it back to the master node try saveAsTextFile to write it out to a filesystem (in parallel).

Matei

On Feb 24, 2014, at 1:
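The choice between collect() and saveAsTextFile described in this message could be sketched roughly as follows. This is an illustration under stated assumptions, not code from the thread: the helper name `export_rdd`, the `max_collect` threshold, and the output directory are hypothetical, and `rdd` is assumed to be a PySpark RDD.

```python
def export_rdd(rdd, output_dir, max_collect=10000):
    """Return a small RDD's rows to the driver, or write a large one out.

    max_collect is a hypothetical threshold chosen for illustration.
    """
    # count() runs a job but returns only a single number to the driver.
    if rdd.count() <= max_collect:
        # Small enough: bring every row back to the driver as a list.
        return rdd.collect()
    # Too large: each partition is written in parallel as its own
    # part file under output_dir.
    rdd.saveAsTextFile(output_dir)
    return None
```

With PySpark installed, `rdd` might come from something like `sc.textFile(...)` on an existing SparkContext `sc`. Note that saveAsTextFile expects `output_dir` not to exist yet.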