Re: Can a Spark App run with spark-submit write pdf files to HDFS

William Briggs Tue, 09 Jun 2015 18:19:20 -0700

I don't know anything about your use case, so take this with a grain of
salt, but if you are operating at a scale that benefits from Spark, you
probably do not want to write your output records as individual files into
HDFS, which copes poorly with large numbers of small files. Spark has
built-in support for the Hadoop "SequenceFile" container format, which is a
more scalable way to write out your results. You could structure your RDD
transformations so that the final RDD is a PairRDD whose key is unique
(possibly what would otherwise have been the standalone file name) and
whose value is the payload (in this case, likely the byte array of the PDF
you generated).
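
To make the keyed-pair idea concrete, here is a minimal PySpark sketch,
assuming you already have a SparkContext `sc`. The helper names
`to_keyed_record` and `save_pdfs` and the output path are made up for
illustration; `saveAsSequenceFile` is the PySpark RDD method, and the Scala
API has a method of the same name. As far as I know, PySpark writes string
keys as Text and bytearray values as BytesWritable, but check that against
your Spark version.

```python
def to_keyed_record(doc_id, pdf_bytes):
    """Build one (key, value) pair: would-be file name -> raw PDF bytes."""
    # bytearray is used on the assumption that PySpark maps it to BytesWritable.
    return (f"{doc_id}.pdf", bytearray(pdf_bytes))


def save_pdfs(sc, docs, out_path):
    """Write an iterable of (doc_id, pdf_bytes) pairs as one SequenceFile dataset.

    `sc` is an existing SparkContext; `out_path` is a hypothetical HDFS
    location such as "hdfs:///tmp/pdfs-seq".
    """
    pairs = sc.parallelize(docs).map(lambda d: to_keyed_record(d[0], d[1]))
    pairs.saveAsSequenceFile(out_path)
```

The output is then one directory of part files rather than one HDFS file
per document, and each PDF can be looked up again by its key.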

It looks like PDFBox's "PDDocument" class allows you to save the document
to an OutputStream
<https://pdfbox.apache.org/docs/1.8.9/javadocs/org/apache/pdfbox/pdmodel/PDDocument.html#save(java.io.OutputStream)>,
so you could probably get away with saving to a ByteArrayOutputStream, and
snagging the bytes that comprise the final document. You can see more about
how to write SequenceFiles from Spark here
<https://spark.apache.org/docs/latest/programming-guide.html#actions>.
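
On the JVM, the pattern would be to pass a `ByteArrayOutputStream` to
`PDDocument.save(...)` and then call `toByteArray()` on it. Since PDFBox is
a Java library, the Python sketch below only illustrates the same
save-to-memory pattern, with `io.BytesIO` playing the role of
ByteArrayOutputStream. `capture_bytes` and `fake_pdf_save` are hypothetical
names, and `fake_pdf_save` is a stand-in for whatever actually renders your
document.

```python
import io


def capture_bytes(save):
    """Call `save(stream)` with an in-memory stream and return what it wrote."""
    buf = io.BytesIO()
    save(buf)
    return buf.getvalue()


def fake_pdf_save(stream):
    """Stand-in for a real document's save-to-stream method (not a valid PDF)."""
    stream.write(b"%PDF-1.4\n")
    stream.write(b"%%EOF\n")


pdf_bytes = capture_bytes(fake_pdf_save)
```

The resulting bytes can then serve as the value half of the key/value pair
that goes into the SequenceFile.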

As an aside, one hint that I have found helpful since I started working
with Spark: if your transformation requires classes that are expensive to
instantiate, you may want to look into mapPartitions, which lets you do the
setup once per partition instead of once per record. I haven't used PDFBox,
but it wouldn't surprise me to learn that there's some non-negligible
overhead involved.
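
A rough sketch of the mapPartitions pattern in PySpark terms, assuming an
existing RDD of (name, text) pairs. `ExpensiveRenderer` is a hypothetical
stand-in for whatever is costly to construct, and it counts its own
instantiations only so the effect is visible.

```python
class ExpensiveRenderer:
    """Hypothetical stand-in for an object that is costly to build."""

    instances = 0  # counts constructions, for illustration only

    def __init__(self):
        ExpensiveRenderer.instances += 1

    def render(self, text):
        # Placeholder "rendering": a real job would produce PDF bytes here.
        return bytearray(text.encode("utf-8"))


def render_partition(records):
    """Function for RDD.mapPartitions: one renderer per partition, reused per record."""
    renderer = ExpensiveRenderer()
    for name, text in records:
        yield name, renderer.render(text)


def render_all(rdd):
    """`rdd` is an existing RDD of (name, text) pairs."""
    return rdd.mapPartitions(render_partition)
```

With a plain `map`, the renderer would either be constructed for every
record or have to be serialized along with the closure; here each partition
builds its own instance on the executor and reuses it.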

Hope that helps,
Will

On Tue, Jun 9, 2015 at 5:57 PM, Richard Catlin <richard.m.cat...@gmail.com>
wrote:

> I would like to write pdf files using pdfbox to HDFS from my Spark
> application.  Can this be done?
