I think you can collect the results on the driver through the RDD's toLocalIterator method and save them from the driver program, rather than writing them to files on each node's local disk and collecting them separately. If your data is small enough and you have enough cores/memory, try processing everything in local mode and writing the results locally.
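The driver-side pattern Sathish suggests can be sketched like this. The RDD here is a stand-in stub so the sketch is self-contained; on a real Spark RDD, toLocalIterator() yields rows lazily in the same way, so the driver holds roughly one partition in memory at a time instead of the whole result set:

```python
import csv
import io

# Stand-in for a Spark RDD so this sketch runs without a cluster.
# A real RDD's toLocalIterator() streams rows partition by partition,
# unlike collect(), which materializes everything on the driver at once.
class StubRDD:
    def __init__(self, partitions):
        self.partitions = partitions

    def toLocalIterator(self):
        for partition in self.partitions:
            yield from partition

rdd = StubRDD([[("a", 1), ("b", 2)], [("c", 3)]])

# Driver side: stream rows straight into a single local CSV.
buf = io.StringIO()
writer = csv.writer(buf)
for row in rdd.toLocalIterator():
    writer.writerow(row)
```

Because only the driver writes, the output lands in one place on one machine, sidestepping the worker-local _temporary problem entirely (at the cost of pulling all rows through the driver).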
-Sathish

On Fri, Aug 11, 2017 at 1:17 PM Steve Loughran <ste...@hortonworks.com> wrote:

> > On 10 Aug 2017, at 09:51, Hemanth Gudela <hemanth.gud...@qvantel.com> wrote:
> >
> > Yeah, installing HDFS in our environment is unfortunately going to take a lot of time (approvals/planning etc.). I will have to live with the local FS for now.
> > The other option I had already tried is collect() and sending everything to the driver node. But my data volume is too huge for the driver node to handle alone.
>
> NFS cross mount.
>
> > I'm now trying to split the data into multiple datasets, then collect each individual dataset and write it to the local FS on the driver node (this approach slows down the Spark job, but I hope it works).
>
> I doubt it. The job driver is in charge of committing work by renaming data under _temporary into the right place. Every operation which calls write() to save to an FS must have the same paths visible to all nodes in the Spark cluster.
>
> A cluster-wide filesystem of some form is mandatory, or you abandon write() and implement your own operations to save (partitioned) data.
>
> > Thank you,
> > Hemanth
> >
> > *From:* Femi Anthony <femib...@gmail.com>
> > *Date:* Thursday, 10 August 2017 at 11.24
> > *To:* Hemanth Gudela <hemanth.gud...@qvantel.com>
> > *Cc:* "user@spark.apache.org" <user@spark.apache.org>
> > *Subject:* Re: spark.write.csv is not able write files to specified path, but is writing to unintended subfolder _temporary/0/task_xxx folder on worker nodes
> >
> > Also, why are you trying to write results locally if you're not using a distributed file system? Spark is geared towards writing to a distributed file system. I would suggest trying collect() so the data is sent to the master and then doing a write if the result set isn't too big, or repartitioning before trying to write (though I suspect this won't really help). You really should install HDFS if that is possible.
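The commit step Steve describes can be illustrated with a toy, stdlib-only sketch (directory names are made up for the demo). Tasks write under _temporary, and the committer promotes that output by renaming it into the job directory; with only node-local filesystems, the driver never sees the _temporary directories written on other machines, so their files are never promoted:

```python
import os
import tempfile

# Task side: output is first written under _temporary/0/task_xxx.
out = tempfile.mkdtemp()
task_dir = os.path.join(out, "_temporary", "0", "task_0000")
os.makedirs(task_dir)
with open(os.path.join(task_dir, "part-r-00000"), "w") as f:
    f.write("a,1\n")

# Commit side: rename the task's files up into the job output directory.
# This rename only happens for paths the committing process can see,
# which is why worker-local output gets stranded under _temporary/0/.
for name in os.listdir(task_dir):
    os.rename(os.path.join(task_dir, name), os.path.join(out, name))
```

After the rename, part-r-00000 sits at the top of the output directory, which is the layout Hemanth sees on the master node but not on the workers.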
> > Sent from my iPhone
> >
> > On Aug 10, 2017, at 3:58 AM, Hemanth Gudela <hemanth.gud...@qvantel.com> wrote:
> >
> > Thanks for the reply, Femi!
> >
> > I'm writing the file like this -> myDataFrame.write.mode("overwrite").csv("myFilePath")
> > There are absolutely no errors/warnings after the write.
> >
> > The _SUCCESS file is created on the master node, but the problem of _temporary is noticed only on worker nodes.
> >
> > I know spark.write.csv works best with HDFS, but with the current setup I have in my environment, I have to deal with Spark writing to the node's local file system and not to HDFS.
> >
> > Regards,
> > Hemanth
> >
> > *From:* Femi Anthony <femib...@gmail.com>
> > *Date:* Thursday, 10 August 2017 at 10.38
> > *To:* Hemanth Gudela <hemanth.gud...@qvantel.com>
> > *Cc:* "user@spark.apache.org" <user@spark.apache.org>
> > *Subject:* Re: spark.write.csv is not able write files to specified path, but is writing to unintended subfolder _temporary/0/task_xxx folder on worker nodes
> >
> > Normally the *_temporary* directory gets deleted as part of the cleanup when the write is complete and a _SUCCESS file is created. I suspect that the writes are not being properly completed. How are you specifying the write? Any error messages in the logs?
> >
> > On Thu, Aug 10, 2017 at 3:17 AM, Hemanth Gudela <hemanth.gud...@qvantel.com> wrote:
> >
> > Hi,
> >
> > I'm running Spark in cluster mode with 4 nodes, and trying to write CSV files to each node's local path (*not HDFS*).
> > I'm using spark.write.csv to write the CSV files.
> >
> > *On the master node:*
> > spark.write.csv creates a folder with the CSV file name and writes many files with a part-r-000n suffix. This is okay for me; I can merge them later.
> > *But on worker nodes:*
> > spark.write.csv creates a folder with the CSV file name and writes many folders and files under _temporary/0/. This is not okay for me.
> > Could someone please suggest what might be going wrong in my settings, and how I can write CSV files to the specified folder and not to subfolders (_temporary/0/task_xxx) on the worker machines?
> >
> > Thank you,
> > Hemanth
>
> --
> http://www.femibyte.com/twiki5/bin/view/Tech/
> http://www.nextmatrix.com
> "Great spirits have always encountered violent opposition from mediocre minds." - Albert Einstein.
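For the part-file merge Hemanth mentions ("I can merge them later"), a plain-stdlib sketch on the driver can look like this. It only merges files the driver's filesystem can actually see, so it does not fix the worker-local _temporary problem; the output directory and file names below are made up for the demo:

```python
import glob
import os
import shutil
import tempfile

def merge_parts(csv_dir, dest):
    """Concatenate Spark's part-* CSV files from csv_dir into one file."""
    with open(dest, "w") as out:
        # sorted() keeps part-r-00000, part-r-00001, ... in order.
        for part in sorted(glob.glob(os.path.join(csv_dir, "part-*"))):
            with open(part) as f:
                shutil.copyfileobj(f, out)

# Demo: fake a Spark output folder with two part files, then merge.
d = tempfile.mkdtemp()
for i, line in enumerate(["a,1\n", "b,2\n"]):
    with open(os.path.join(d, f"part-r-{i:05d}"), "w") as f:
        f.write(line)

merged = os.path.join(d, "merged.csv")
merge_parts(d, merged)
```

This assumes the CSV was written without headers; with headers, each part file repeats the header row, which would need to be skipped for all but the first file.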