Hi,

I need guidance on handling a large number of PDF files with Hadoop and Spark. I see two options:

1. Keep the PDFs as binary files, read them with sc.binaryFiles, and then convert them to text with a PDF parser such as Apache Tika or PDFBox.
2. Convert the PDFs to text with one of these parsers up front and store the results as text files. In doing so, however, I lose colors, formatting, etc.

Which approach is preferable? Please guide.
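To make the sc.binaryFiles route concrete, here is a rough sketch of what I have in mind, written in PySpark. It is untested against a real cluster and rests on several assumptions: pypdf stands in for Tika/PDFBox (those are JVM libraries), the function names pdfs_to_text, parse_with_pypdf and run are made up for illustration, and the input and output paths are placeholders.

```python
import io
from typing import Callable, Iterable, Iterator, Tuple


def pdfs_to_text(
    records: Iterable[Tuple[str, bytes]],
    parse: Callable[[bytes], str],
) -> Iterator[Tuple[str, str]]:
    """Turn (path, raw_bytes) pairs into (path, text) pairs.

    A file that fails to parse yields an empty string, so that one
    corrupt PDF does not fail the whole task.
    """
    for path, raw in records:
        try:
            yield path, parse(raw)
        except Exception:
            yield path, ""


def parse_with_pypdf(raw: bytes) -> str:
    """Assumed parser: pypdf is a stand-in for Tika/PDFBox."""
    from pypdf import PdfReader  # imported lazily, on the executors

    reader = PdfReader(io.BytesIO(raw))
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def run(input_glob: str, output_dir: str) -> None:
    """Hypothetical driver; both arguments are placeholder paths."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("pdf-to-text").getOrCreate()
    # binaryFiles gives one (path, bytes) record per file.
    pdfs = spark.sparkContext.binaryFiles(input_glob)
    texts = pdfs.mapPartitions(
        lambda part: pdfs_to_text(part, parse_with_pypdf)
    )
    # Keep the path next to the text so each row points back to its PDF.
    spark.createDataFrame(texts, ["path", "text"]).write.parquet(output_dir)
```

I would call it as run("hdfs:///pdfs/*.pdf", "hdfs:///pdf_text"), where both locations are made up. The original PDFs would stay untouched in HDFS, so the path column should let me go back to the source file whenever colors or formatting matter.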