Best practices for dealing with large no of PDF files

unk1102 Mon, 23 Apr 2018 09:25:44 -0700

Hi, I need guidance on dealing with a large number of PDF files when using
Hadoop and Spark. Should I store them as binary files, read them with
sc.binaryFiles, and then convert them to text using PDF parsers like Apache
Tika or PDFBox? Or should I convert them to text with these parsers up front
and store the results as text files? With the second approach I am losing
colors, formatting, etc. Please guide.
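
For illustration, the first option (read the PDFs with sc.binaryFiles, then
parse on the executors) could be sketched roughly as below. This is only a
sketch under assumptions: it uses PySpark, and it swaps in the Python pypdf
library in place of Tika or PDFBox, which the post names but which are Java
libraries. The function names extract_partition, pdf_to_text and
build_text_rdd, and the input glob in the usage note, are hypothetical.

```python
import io


def extract_partition(records, parse):
    """Turn (path, raw_bytes) pairs into (path, text, error) triples.

    `parse` is any callable bytes -> str. A failure on one file is
    recorded in the third field instead of failing the whole task.
    """
    for path, raw in records:
        try:
            yield (path, parse(raw), None)
        except Exception as exc:  # one corrupt PDF should not kill the job
            yield (path, None, str(exc))


def pdf_to_text(raw):
    """Extract plain text from PDF bytes.

    Assumption: pypdf is installed on every executor. It stands in for
    Tika/PDFBox here; colors and layout are not preserved.
    """
    from pypdf import PdfReader  # imported lazily, only where parsing runs

    reader = PdfReader(io.BytesIO(raw))
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def build_text_rdd(sc, input_glob, parse=pdf_to_text, min_partitions=None):
    """Read PDFs as (path, bytes) pairs and parse them per partition."""
    pdfs = sc.binaryFiles(input_glob, minPartitions=min_partitions)
    return pdfs.mapPartitions(lambda part: extract_partition(part, parse))
```

With an existing SparkContext, a call might look like
build_text_rdd(sc, "hdfs:///data/pdfs/*.pdf"), where the path is a
placeholder. Because every record keeps the source path, the original PDFs
can stay in storage unchanged, and the extracted text only needs to serve
search or analysis rather than carry the formatting.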




--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

