Thanks again, Piotr. It's good to know there are a number of options. Once again I'm glad I put all my workers on the same Ethernet switch, as unanticipated shuffling isn't so bad.

Sincerely,
Pete
On Mon, Sep 26, 2016 at 8:35 AM, Piotr Smoliński <piotr.smolinski...@gmail.com> wrote:

> Best, you should write to HDFS, or, when you test the product with no HDFS
> available, just create a shared filesystem (Windows shares, NFS, etc.)
> where the data will be written.
>
> You'll still end up with many files, but this time there will be only one
> directory tree.
>
> You may reduce the number of files by:
> * combining partitions on the same executor with a coalesce call
> * repartitioning the RDD (DataFrame or Dataset, depending on the API you use)
>
> The latter is useful when you write the data to a partitioned structure.
> Note that repartitioning is an explicit shuffle.
>
> If you want to have only a single file, you need to repartition the whole
> RDD to a single partition. Depending on the result data size, that may be
> something you want or do not want to do ;-)
>
> Regards,
> Piotr
>
> On Mon, Sep 26, 2016 at 2:30 PM, Peter Figliozzi <pete.figlio...@gmail.com> wrote:
>
>> Thank you Piotr, that's what happened. In fact, there are about 100
>> files on each worker node in a directory corresponding to the write.
>>
>> Any way to tone that down a bit (maybe 1 file per worker)? Or, write a
>> single file somewhere?
>>
>> On Mon, Sep 26, 2016 at 12:44 AM, Piotr Smoliński <piotr.smolinski...@gmail.com> wrote:
>>
>>> Hi Peter,
>>>
>>> The blank file _SUCCESS indicates a properly finished output operation.
>>>
>>> What is the topology of your application?
>>> I presume you write to a local filesystem and have more than one worker
>>> machine. In such a case Spark will write the result files for each
>>> partition (in the worker which holds it) and complete the operation by
>>> writing _SUCCESS in the driver node.
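For readers landing here from search, Piotr's suggestions above can be sketched in Scala roughly as follows. This is a minimal sketch, not code from the thread: it assumes an existing DataFrame `df`, and the paths and the `date` column are illustrative placeholders.

```scala
// Sketch only: assumes a running SparkSession and a DataFrame `df`.
// Paths and the `date` column are hypothetical, not from the thread.

// 1. coalesce: merge existing partitions without a full shuffle,
//    reducing the number of part-* files written.
df.coalesce(4).write.csv("hdfs:///tmp/out-coalesced")

// 2. repartition: an explicit shuffle into N partitions, useful
//    before writing a partitioned directory structure.
df.repartition(8)
  .write
  .partitionBy("date")            // hypothetical partition column
  .csv("hdfs:///tmp/out-by-date")

// Single output file: repartition the whole DataFrame to one
// partition first. Fine for small results, but all the data then
// flows through a single task.
df.repartition(1).write.csv("hdfs:///tmp/out-single")
```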
>>> Cheers,
>>> Piotr
>>>
>>> On Mon, Sep 26, 2016 at 4:56 AM, Peter Figliozzi <pete.figlio...@gmail.com> wrote:
>>>
>>>> Both
>>>>
>>>> df.write.csv("/path/to/foo")
>>>>
>>>> and
>>>>
>>>> df.write.format("com.databricks.spark.csv").save("/path/to/foo")
>>>>
>>>> result in a *blank* file called "_SUCCESS" under /path/to/foo.
>>>>
>>>> My df has stuff in it... I tried this with both my real df and a quick df
>>>> constructed from literals.
>>>>
>>>> Why isn't it writing anything?
>>>>
>>>> Thanks,
>>>>
>>>> Pete
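The behaviour Piotr explains (part files on the workers, only _SUCCESS visible on the driver) can be checked with a short sketch. Again, this assumes a DataFrame `df` on a multi-worker cluster; the paths are illustrative.

```scala
// Sketch: assumes a DataFrame `df` on a cluster with several workers.

// Roughly how many part-* files the write will produce:
println(df.rdd.getNumPartitions)

// With a local (non-shared) path, each worker writes its own
// partitions to its own disk; the driver only writes the empty
// _SUCCESS marker, which is why the directory looks blank there.
df.write.csv("file:///path/to/foo")

// Writing to storage every node can see keeps the whole output in
// one directory tree (part-* files plus _SUCCESS):
df.write.csv("hdfs:///path/to/foo")
```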