I don't know the details of your use case, so take this with a grain of salt, but if you are operating at a scale that benefits from Spark, you probably don't want to write each output record to HDFS as its own file. Spark has built-in support for the Hadoop SequenceFile container format, which is a more scalable way to write out your results. You could structure your RDD transformations so that the final RDD is a pair RDD whose key is unique (possibly what would otherwise have been the standalone file name) and whose value is the record's content (in this case, most likely the byte array of the PDF you generated).
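
To make the key/value idea a bit more concrete, here is a rough sketch in plain Java, with no Spark or PDFBox on the classpath. The names StreamWritable, toBytes and toRecord are made up for illustration, and the payload in main is a placeholder, not a real PDF. In a real job the StreamWritable would presumably wrap a call to PDDocument.save(OutputStream) (handling whatever exceptions that method declares), and toRecord would be called from whichever pair-producing map you use, so treat this as the shape of the data rather than a finished implementation.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap;
import java.util.Map;

class KeyedBytesSketch {

    // Hypothetical stand-in for anything that can serialize itself to a stream.
    // Assumption: a PDFBox document would be adapted to this via its save method.
    interface StreamWritable {
        void writeTo(OutputStream out) throws IOException;
    }

    // Capture everything the document writes into an in-memory byte array.
    static byte[] toBytes(StreamWritable doc) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try {
            doc.writeTo(buffer);
        } catch (IOException e) {
            // Rethrow unchecked so the method is easy to call from a lambda.
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Pair the would-be file name (the unique key) with the document bytes (the value).
    static Map.Entry<String, byte[]> toRecord(String fileName, StreamWritable doc) {
        return new AbstractMap.SimpleImmutableEntry<>(fileName, toBytes(doc));
    }

    public static void main(String[] args) {
        // Placeholder payload standing in for a generated PDF.
        StreamWritable placeholder =
            out -> out.write("%PDF-placeholder".getBytes(StandardCharsets.US_ASCII));

        Map.Entry<String, byte[]> record = toRecord("report-0001.pdf", placeholder);
        System.out.println(record.getKey() + " -> " + record.getValue().length + " bytes");
    }
}
```

Each (file name, bytes) pair produced this way corresponds to one record in the pair RDD, and it is that RDD you would write out as a SequenceFile instead of creating one HDFS file per document.
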

It looks like PDFBox's PDDocument class lets you save a document to an OutputStream <https://pdfbox.apache.org/docs/1.8.9/javadocs/org/apache/pdfbox/pdmodel/PDDocument.html#save(java.io.OutputStream)>, so you could probably save to a ByteArrayOutputStream and take the bytes that make up the final document as the value. You can read more about writing SequenceFiles from Spark here <https://spark.apache.org/docs/latest/programming-guide.html#actions>.

As an aside, one hint I have found helpful since I started working with Spark: if your transformation requires classes that are expensive to instantiate, look into mapPartitions, which lets you do the setup once per partition instead of once per record. I haven't used PDFBox, but it wouldn't surprise me to learn that there's some non-negligible overhead involved.

Hope that helps,
Will

On Tue, Jun 9, 2015 at 5:57 PM, Richard Catlin <richard.m.cat...@gmail.com> wrote:
> I would like to write pdf files using pdfbox to HDFS from my Spark
> application. Can this be done?