Re: Best practices for dealing with large no of PDF files

2018-04-23 Thread Nicolas Paris
Guys, here is the illustration: https://github.com/parisni/SparkPdfExtractor. Please add issues for any questions or improvement ideas. Enjoy. Cheers

Re: Best practices for dealing with large no of PDF files

2018-04-23 Thread unk1102
Thanks much Nicolas, really appreciate it.

Re: Best practices for dealing with large no of PDF files

2018-04-23 Thread Nicolas Paris
Sure, then let me recap the steps:

1. Load the PDFs from a local folder into an Avro file on HDFS
2. Load the Avro file in Spark as an RDD
3. Apply PDFBox to each PDF and return its content as a string
4. Write the result as one huge CSV file

(See the sketch below for steps 2-4.) That's some work for me to push all that, guys. I should find some time within 7 days, however.
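In the meantime, a minimal sketch of steps 2-4, assuming Spark with the spark-avro package and an Avro schema holding a "path" string field and a "content" bytes field (field names and paths are placeholders, and it uses the Dataset API rather than a raw RDD):

    import org.apache.pdfbox.pdmodel.PDDocument
    import org.apache.pdfbox.text.PDFTextStripper
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("pdf-to-text").getOrCreate()
    import spark.implicits._

    // Step 2: read the Avro table; each row holds a pdf path and its raw bytes.
    val pdfs = spark.read.format("avro").load("hdfs:///data/pdfs.avro")

    // Step 3: run PDFBox on each binary payload and keep the extracted text.
    val texts = pdfs.select("path", "content").as[(String, Array[Byte])]
      .map { case (path, bytes) =>
        val doc = PDDocument.load(bytes)
        try {
          // Flatten whitespace so each pdf becomes a single csv-friendly line.
          (path, new PDFTextStripper().getText(doc).replaceAll("\\s+", " "))
        } finally {
          doc.close()
        }
      }
      .toDF("path", "text")

    // Step 4: write the result as one big csv dataset.
    texts.write.option("header", "true").csv("hdfs:///data/pdf_texts")

The point of the Avro container is that Spark reads a handful of big files instead of millions of small ones.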

Re: Best practices for dealing with large no of PDF files

2018-04-23 Thread Deepak Sharma
Yes Nicolas. It would be a great help if you can push the code to GitHub and share the URL. Thanks, Deepak

Re: Best practices for dealing with large no of PDF files

2018-04-23 Thread unk1102
Hi Nicolas, thanks much for the guidance, it was very useful information. If you can push that code to GitHub and share the URL it would be a great help. Looking forward to it. If you can find time to push it early that would be an even greater help, as I have to finish a POC on this use case ASAP.

Re: Best practices for dealing with large no of PDF files

2018-04-23 Thread Nicolas Paris
2018-04-23 18:59 GMT+02:00 unk1102:
> Hi Nicolas thanks much for the reply. Do you have any sample code somewhere?

I have some open-source code. I could find time to push it to GitHub if needed.

> Do you just keep the PDFs in Avro binary all the time?

Yes, I store the PDFs as binary in Avro all the time.

Re: Best practices for dealing with large no of PDF files

2018-04-23 Thread unk1102
Hi Nicolas, thanks much for the reply. Do you have any sample code somewhere? Do you just keep the PDFs in Avro binary all the time? How often do you parse them into text using PDFBox? Is it on an on-demand basis, or do you always parse to text and keep the PDF binary in Avro as just an interim state?

Re: Best practices for dealing with large no of PDF files

2018-04-23 Thread Deepak Sharma
Is there any open-source code base to refer to for this kind of use case? Thanks, Deepak

Re: Best practices for dealing with large no of PDF files

2018-04-23 Thread Nicolas Paris
Hi,

The problem is the number of files on Hadoop: I deal with 50M PDF files. What I did is put them in an Avro table on HDFS, as a binary column. Then I read it with Spark and push that into PDFBox. Transforming 50M PDFs into text took 2 hours on a 5-computer cluster. About colors and formatting, I
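To make the loading step concrete, a rough sketch of packing a local folder of PDFs into one Avro container with the plain Avro Java API (file names, paths, and field names are placeholders):

    import java.io.File
    import java.nio.ByteBuffer
    import java.nio.file.Files

    import org.apache.avro.SchemaBuilder
    import org.apache.avro.file.DataFileWriter
    import org.apache.avro.generic.{GenericData, GenericDatumWriter, GenericRecord}

    // One record per pdf: its path plus the raw bytes as an Avro bytes field.
    val schema = SchemaBuilder.record("Pdf").fields()
      .requiredString("path")
      .requiredBytes("content")
      .endRecord()

    val writer = new DataFileWriter[GenericRecord](new GenericDatumWriter[GenericRecord](schema))
    writer.create(schema, new File("pdfs.avro"))

    // Pack every pdf under the local folder into the single Avro container.
    for (f <- new File("/data/pdfs").listFiles() if f.getName.endsWith(".pdf")) {
      val rec: GenericRecord = new GenericData.Record(schema)
      rec.put("path", f.getPath)
      rec.put("content", ByteBuffer.wrap(Files.readAllBytes(f.toPath)))
      writer.append(rec)
    }
    writer.close()
    // Then copy it up: hdfs dfs -put pdfs.avro /data/pdfs.avro

A few big Avro files instead of 50M small PDFs is what keeps the HDFS namenode happy.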

Best practices for dealing with large no of PDF files

2018-04-23 Thread unk1102
Hi, I need guidance on dealing with a large number of PDF files when using Hadoop and Spark. Can I store them as binary files using sc.binaryFiles and then convert them to text using PDF parsers like Apache Tika or PDFBox, or should I convert them into text using these parsers first and store them as text files? But in doing
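For the first option I was thinking of something along these lines (an untested sketch with PDFBox; paths are placeholders):

    import org.apache.pdfbox.pdmodel.PDDocument
    import org.apache.pdfbox.text.PDFTextStripper

    // Each record is (path, PortableDataStream); toArray() pulls the raw bytes.
    val texts = sc.binaryFiles("hdfs:///data/pdfs").map { case (path, stream) =>
      val doc = PDDocument.load(stream.toArray())
      try {
        (path, new PDFTextStripper().getText(doc))
      } finally {
        doc.close()
      }
    }
    texts.saveAsTextFile("hdfs:///data/pdf_texts")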