Thanks again, Piotr. It's good to know there are a number of options. Once again I'm glad I put all my workers on the same Ethernet switch, as unanticipated shuffling isn't so bad.

Sincerely,
Pete
On Mon, Sep 26, 2016 at 8:35 AM, Piotr Smoliński <piotr.smolinski...@gmail.com> wrote:

> Best, you should write to HDFS, or, when you test the product with no HDFS
> available, just create a shared filesystem (Windows shares, NFS, etc.)
> where the data will be written.
>
> You'll still end up with many files, but this time there will be only one
> directory tree.
>
> You may reduce the number of files by:
> * combining partitions on the same executor with a coalesce call
> * repartitioning the RDD (DataFrame or Dataset, depending on the API you use)
>
> The latter is useful when you write the data to a partitioned structure.
> Note that repartitioning is an explicit shuffle.
>
> If you want to have only a single file, you need to repartition the whole
> RDD to a single partition. Depending on the result data size, that may be
> something you want or do not want to do ;-)
>
> Regards,
> Piotr
>
> On Mon, Sep 26, 2016 at 2:30 PM, Peter Figliozzi <pete.figlio...@gmail.com> wrote:
>
>> Thank you Piotr, that's what happened. In fact, there are about 100
>> files on each worker node in a directory corresponding to the write.
>>
>> Any way to tone that down a bit (maybe 1 file per worker)? Or, write a
>> single file somewhere?
>>
>> On Mon, Sep 26, 2016 at 12:44 AM, Piotr Smoliński <piotr.smolinski...@gmail.com> wrote:
>>
>>> Hi Peter,
>>>
>>> The blank file _SUCCESS indicates a properly finished output operation.
>>>
>>> What is the topology of your application?
>>> I presume you write to a local filesystem and have more than one worker
>>> machine. In such a case Spark will write the result files for each
>>> partition (in the worker which holds it) and complete the operation by
>>> writing _SUCCESS in the driver node.
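For readers landing here from search, Piotr's suggestions above can be sketched in Scala roughly as follows. This is a minimal sketch, not code from the thread: it assumes an existing DataFrame `df`, and the paths and the `date` column are illustrative placeholders.

```scala
// Sketch only: assumes a running SparkSession and a DataFrame `df`.
// Paths and the `date` column are hypothetical, not from the thread.

// 1. coalesce: merge existing partitions without a full shuffle,
//    reducing the number of part-* files written.
df.coalesce(4).write.csv("hdfs:///tmp/out-coalesced")

// 2. repartition: an explicit shuffle into N partitions, useful
//    before writing a partitioned directory structure.
df.repartition(8)
  .write
  .partitionBy("date")            // hypothetical partition column
  .csv("hdfs:///tmp/out-by-date")

// Single output file: repartition the whole DataFrame to one
// partition first. Fine for small results, but all the data then
// flows through a single task.
df.repartition(1).write.csv("hdfs:///tmp/out-single")
```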
>>> Cheers,
>>> Piotr
>>>
>>> On Mon, Sep 26, 2016 at 4:56 AM, Peter Figliozzi <pete.figlio...@gmail.com> wrote:
>>>
>>>> Both
>>>>
>>>> df.write.csv("/path/to/foo")
>>>>
>>>> and
>>>>
>>>> df.write.format("com.databricks.spark.csv").save("/path/to/foo")
>>>>
>>>> result in a *blank* file called "_SUCCESS" under /path/to/foo.
>>>>
>>>> My df has stuff in it... I tried this with both my real df and a quick df
>>>> constructed from literals.
>>>>
>>>> Why isn't it writing anything?
>>>>
>>>> Thanks,
>>>>
>>>> Pete
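The behaviour Piotr explains (part files on the workers, only _SUCCESS visible on the driver) can be checked with a short sketch. Again, this assumes a DataFrame `df` on a multi-worker cluster; the paths are illustrative.

```scala
// Sketch: assumes a DataFrame `df` on a cluster with several workers.

// Roughly how many part-* files the write will produce:
println(df.rdd.getNumPartitions)

// With a local (non-shared) path, each worker writes its own
// partitions to its own disk; the driver only writes the empty
// _SUCCESS marker, which is why the directory looks blank there.
df.write.csv("file:///path/to/foo")

// Writing to storage every node can see keeps the whole output in
// one directory tree (part-* files plus _SUCCESS):
df.write.csv("hdfs:///path/to/foo")
```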