Hi, the problem is the number of files on Hadoop.

I deal with 50M PDF files. What I did was put them in an Avro table on HDFS, as a binary column. Then I read that column with Spark and push the bytes into PDFBox. Transforming 50M PDFs into text took 2 hours on a 5-machine cluster.

About colors and formatting: PDFBox should be able to extract that information, so you could emit HTML tags in your text output instead of plain text. That is some extra work, though.

2018-04-23 18:25 GMT+02:00 unk1102 <umesh.ka...@gmail.com>:

> Hi, I need guidance on dealing with a large number of PDF files when using
> Hadoop and Spark. Can I store them as binary files using sc.binaryFiles and
> then convert them to text using PDF parsers like Apache Tika or PDFBox, or
> should I convert them to text with these parsers and store them as text
> files? In doing so I am losing colors, formatting, etc. Please guide.
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
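
The Avro-plus-PDFBox pipeline described above might look roughly like this in Scala. This is only a sketch: the column names (`id`, `content`), the HDFS paths, and the output format are assumptions, not from the original post, and it uses the PDFBox 2.x API plus Spark's built-in Avro reader (Spark 2.4+ with the `spark-avro` module on the classpath):

```scala
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.text.PDFTextStripper
import org.apache.spark.sql.SparkSession

object PdfToText {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pdf-to-text").getOrCreate()
    import spark.implicits._

    // Read the Avro table; "id" and "content" (the binary PDF column)
    // are hypothetical column names used for illustration.
    val pdfs = spark.read.format("avro").load("hdfs:///data/pdfs.avro")

    val texts = pdfs.select("id", "content").as[(String, Array[Byte])]
      .map { case (id, bytes) =>
        // PDFBox 2.x: parse the raw bytes, then strip plain text.
        val doc = PDDocument.load(bytes)
        try {
          (id, new PDFTextStripper().getText(doc))
        } finally {
          doc.close()
        }
      }

    texts.toDF("id", "text").write.parquet("hdfs:///data/pdf_text")
    spark.stop()
  }
}
```

For color and formatting, the same per-record loop could use a `PDFTextStripper` subclass (e.g. overriding `writeString`) to wrap runs in HTML tags based on the font information PDFBox exposes, at the cost of the extra work mentioned above.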