Hi, I need guidance on handling a large number of PDF files with Hadoop
and Spark. Should I load them as binary files using sc.binaryFiles and then
convert them to text with a PDF parser such as Apache Tika or PDFBox? Or
should I convert them to text with these parsers first and store the results
as text files? With the second approach I am losing colors, formatting, etc.
Please guide.
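
To make the first option concrete, here is a rough PySpark sketch of reading
the PDFs with sc.binaryFiles and extracting text on the executors. It is only
an illustration under assumptions: it assumes the tika-python package (and a
Tika server it can reach) is available on every worker, and the function
names extract_text_with_tika, to_text_record and pdfs_to_text are
hypothetical, not part of Spark or Tika.

```python
from typing import Callable, Tuple


def extract_text_with_tika(pdf_bytes: bytes) -> str:
    """Extract plain text from one PDF held in memory.

    Assumption: the tika-python package is installed on the workers and
    can reach a Tika server. Any other parser could be swapped in.
    """
    # Imported lazily so the module loads even where tika is not installed.
    from tika import parser

    parsed = parser.from_buffer(pdf_bytes)
    # "content" can be None for PDFs with no extractable text (e.g. scans).
    return parsed.get("content") or ""


def to_text_record(
    record: Tuple[str, bytes],
    extract: Callable[[bytes], str] = extract_text_with_tika,
) -> Tuple[str, str]:
    """Turn one (path, raw bytes) pair into a (path, extracted text) pair."""
    path, content = record
    return path, extract(bytes(content))


def pdfs_to_text(sc, input_glob: str,
                 extract: Callable[[bytes], str] = extract_text_with_tika):
    """Read every file matching input_glob whole and map it to (path, text)."""
    # sc.binaryFiles yields one (path, content) record per file.
    return sc.binaryFiles(input_glob).map(
        lambda rec: to_text_record(rec, extract)
    )
```

With this layout the PDFs themselves stay in HDFS unchanged, so colors and
formatting are still available in the originals; only the derived text loses
them.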


