You can repartition your dataframe into 1 partition and all the data will land
into one partition. However, doing this is perilous because you will end up
with all your data on one node, and if you have too much data you will run out
of memory. In fact, anytime you are thinking about putting data in a single
file, you should ask yourself “Does this data fit into memory?”
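For instance, a minimal sketch of what this looks like (assuming an existing `SparkSession` named `spark` and a DataFrame named `df`; the paths are placeholders):

```scala
// All rows move to a single partition, i.e. onto a single executor.
// If the data does not fit in that executor's memory, the job fails.
df.repartition(1)
  .write
  .mode("overwrite")
  .csv("/tmp/single-file-output")

// coalesce(1) avoids a full shuffle but carries the same
// single-node memory risk: one partition, one writer.
df.coalesce(1)
  .write
  .mode("overwrite")
  .csv("/tmp/single-file-output-2")
```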
The reason why Spark is geared towards reading and writing data in a
partitioned manner is because fundamentally, partitioning data is how you scale
your applications. Partitioned data allows Spark (or really any application
that is designed to scale on a cluster) to read data in parallel, process it
process it, and write it out without any bottlenecks. Humans prefer all their
data in a single file or table, because humans have a limited ability to keep
track of a multitude of files. Grid-enabled software, by contrast, hates single
files, simply because there is no good way for two nodes to read a large file
without some sort of bottleneck.
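To make this concrete, here is a hedged sketch of the partitioned style Spark favors (the `events` DataFrame, its `date` column, and the HDFS path are assumptions for illustration):

```scala
// Write the data split into one subdirectory per date value; each
// subdirectory contains many part files that executors can read in parallel.
events.write
  .partitionBy("date")
  .parquet("hdfs:///data/events")

// Downstream readers touch only the part files they need, so no two
// nodes contend over a single large file.
val oneDay = spark.read
  .parquet("hdfs:///data/events")
  .where("date = '2017-12-30'")
```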
Imagine a data processing pipeline that starts with some sort of ingestion and
transformation at one end, which feeds into several analytical processes.
Usually there are humans at the end who are looking at the results of the
analytics. These humans love to get their analytics in a dashboard that gives
them a high-level view of the data. However, all the data processing systems
that go from input to analytics prefer their data to be cut up into bite-sized
chunks.
From: Christopher Piggott
Date: Saturday, December 30, 2017 at 3:45 PM
To: "user@spark.apache.org"
Subject: Converting binary files
I have been searching for examples, but not finding exactly what I need.
I am looking for the paradigm for using spark 2.2 to convert a bunch of binary
files into a bunch of different binary files. I'm starting with:
val files =
spark.sparkContext.binaryFiles("hdfs://1.2.3.4/input")
then convert them:
val converted = files.map { case (filename, content) => (filename,
convert(content)) }
but I don't really want to save by 'partition', I want to save the file using
the original name but in a different directory, e.g. "converted/*".
I'm not quite sure how I'm supposed to do this within the framework of what's
available to me in SparkContext. Do I need to do it myself using the HDFS api?
It would seem like this would be a pretty normal thing to do. Imagine for
instance I were saying take a bunch of binary files and compress them, and save
the compressed output to a different directory. I feel like I'm missing
something fundamental here.
--C
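One way to answer the question above: since Spark's built-in save methods write by partition, per-file output under the original names can be done by dropping to the HDFS API inside a `foreach`, exactly as the poster suspected. A sketch, assuming `convert` takes and returns an `Array[Byte]` and that the output directory is a sibling of the input (both paths are placeholders from the email):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// binaryFiles yields (path, PortableDataStream) pairs; each executor
// writes its own converted files directly via the Hadoop FileSystem API.
spark.sparkContext
  .binaryFiles("hdfs://1.2.3.4/input")
  .foreach { case (pathStr, stream) =>
    val converted: Array[Byte] = convert(stream.toArray())
    val inPath = new Path(pathStr)
    // Resolve the FileSystem from the input path so the right
    // HDFS configuration is picked up on the executor.
    val fs = inPath.getFileSystem(new Configuration())
    val outPath = new Path("hdfs://1.2.3.4/converted", inPath.getName)
    val out = fs.create(outPath)
    try out.write(converted) finally out.close()
  }
```

Note that this trades Spark's managed output committers for manual file handling: there is no atomic commit, so a failed task can leave partial files behind.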