Hi, the problem is the number of files on Hadoop.

I deal with 50M PDF files. What I did was put them in an Avro table on HDFS, as a binary column. Then I read that column with Spark and push the bytes into PDFBox. Transforming 50M PDFs into text took 2 hours on a 5-machine cluster.

About colors and formatting: PDFBox should be able to extract that information, so you could emit HTML tags in your text output instead of plain text. That is some extra work, though.

2018-04-23 18:25 GMT+02:00 unk1102 <umesh.ka...@gmail.com>:

> Hi, I need guidance on dealing with a large number of PDF files when using
> Hadoop and Spark. Can I store them as binary files using sc.binaryFiles and
> then convert them to text using PDF parsers like Apache Tika or PDFBox, or
> should I convert them to text with these parsers and store them as text
> files? In doing so I am losing colors, formatting, etc. Please guide.
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
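
The Avro-plus-PDFBox pipeline described above might look roughly like this in Scala. This is only a sketch: the column names (`id`, `content`), the HDFS paths, and the output format are assumptions, not from the original post, and it uses the PDFBox 2.x API plus Spark's built-in Avro reader (Spark 2.4+ with the `spark-avro` module on the classpath):

```scala
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.text.PDFTextStripper
import org.apache.spark.sql.SparkSession

object PdfToText {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pdf-to-text").getOrCreate()
    import spark.implicits._

    // Read the Avro table; "id" and "content" (the binary PDF column)
    // are hypothetical column names used for illustration.
    val pdfs = spark.read.format("avro").load("hdfs:///data/pdfs.avro")

    val texts = pdfs.select("id", "content").as[(String, Array[Byte])]
      .map { case (id, bytes) =>
        // PDFBox 2.x: parse the raw bytes, then strip plain text.
        val doc = PDDocument.load(bytes)
        try {
          (id, new PDFTextStripper().getText(doc))
        } finally {
          doc.close()
        }
      }

    texts.toDF("id", "text").write.parquet("hdfs:///data/pdf_text")
    spark.stop()
  }
}
```

For color and formatting, the same per-record loop could use a `PDFTextStripper` subclass (e.g. overriding `writeString`) to wrap runs in HTML tags based on the font information PDFBox exposes, at the cost of the extra work mentioned above.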