Re: Spark2: Deciphering saving text file name

Jason Nerothin Tue, 09 Apr 2019 10:06:08 -0700

Hi Subash,

Short answer: It’s effectively random.

Longer answer: In general the DataFrameWriter expects to receive data
from multiple partitions. Let’s say you were writing to ORC instead of text.

Even then, the path you specify is treated as a directory: the writer
creates a directory at that path and saves one of those funny-named files
per partition.
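
To make the pieces of such a name concrete, here is a minimal Scala sketch
that splits a part-file name into its components. It assumes the layout I
believe Spark 2.x uses (partition index, a UUID generated for the write job,
a per-task file counter, then the format/compression extension). The
PartFile and PartFiles names are hypothetical helpers of mine, not part of
Spark, and the layout may differ between Spark versions.

```scala
// Hypothetical helper, not part of Spark: describes one part-file name.
final case class PartFile(partition: Int, writeId: String, fileCounter: Int, extension: String)

object PartFiles {
  // Assumed layout: part-<partition>-<UUID>-c<counter><extension>
  private val Pattern =
    """part-(\d+)-([0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})-c(\d+)(\..*)?""".r

  // Returns None for anything that does not match the assumed layout
  // (for example marker files such as _SUCCESS).
  def parse(name: String): Option[PartFile] = name match {
    case Pattern(partition, writeId, counter, ext) =>
      // The optional extension group is null when the name has no extension.
      Some(PartFile(partition.toInt, writeId, counter.toInt, Option(ext).getOrElse("")))
    case _ => None
  }
}
```

Applied to the name from your example, PartFiles.parse yields partition 1,
write id 1e4c5369-6694-4012-894a-73b971fe1ab1, file counter 0 and extension
.txt.gz. The UUID is the "effectively random" part; the partition number is
the only piece that maps back to your data.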

Even longer: Assume you are running atop YARN (or Mesos or K8s...). In
this case, the resource manager is responsible for provisioning disk on
request, and it is the programmer’s responsibility to implement the
upstream business logic.

The implication is that it’s probably not a good idea to violate that
responsibility boundary: if you do, you are likely to break some implicit
assumptions that the YARN designers rely upon. For example (just making
this up): YARN might calculate available disk after each write action
completes.

HTH,
Jason

On Mon, Apr 8, 2019 at 19:55 Subash Prabakar <subashpraba...@gmail.com>
wrote:

> Hi,
> While saving in Spark2 as a text file, I see an encoded/hash value attached
> to the part files along with the part number. I am curious to know what that
> value is about?
>
> Example:
> ds.write.mode(SaveMode.Overwrite).option("compression","gzip").text(path)
>
> Produces,
> part-00001-1e4c5369-6694-4012-894a-73b971fe1ab1-c000.txt.gz
>
>
> 1e4c5369-6694-4012-894a-73b971fe1ab1-c000 => what is this value ?
>
> Are there any options available to remove this part, or is it attached for
> some reason?
>
> Thanks,
> Subash
>
-- 
Thanks,
Jason
