2018-04-23 18:59 GMT+02:00 unk1102 <umesh.ka...@gmail.com>:

> Hi Nicolas thanks much for the reply. Do you have any sample code
> somewhere?
>
I have some open-source code. I could find time to push it to GitHub if needed.

> Do your just keep pdf in avro binary all the time?

Yes, I store them. Actually, I did that once for 50M PDFs, and then for the daily 100K. Each run is archived on HDFS so that I can query them with Hive in a table backed by multiple Avro files.

> How often you parse into
> text using pdfbox?

Each time I improve my PDFBox extractor program. Say... once a year, maybe.

> Is it on demand basis or you always parse as text and
> keep pdf as binary in avro as just interim state?
>

It can be both. I also store them in an ORC file for another use case, with a web service on top of it to share the PDFs. That table is 4 TB and contains 50M PDFs. It gets merged every day with the new 100K PDFs, thanks to Hive MERGE and ORC ACID capabilities.
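
To give an idea of the daily merge step, here is a minimal sketch in Python that only builds the Hive MERGE statement as a string. It is not my actual job: the table names (`pdf_store`, `pdf_daily`), the key column (`pdf_id`) and the other columns are hypothetical, and it assumes the target is a transactional ORC table and that `columns` is listed in the target table's column order.

```python
from typing import Sequence


def build_daily_merge(target: str, staging: str, key: str,
                      columns: Sequence[str]) -> str:
    """Return a Hive MERGE statement that upserts `staging` into `target`.

    Rows whose key already exists in the target are updated; new keys are
    inserted. `columns` must follow the target table's column order, because
    Hive's INSERT VALUES clause is positional.
    """
    # Every column except the join key is overwritten on a match.
    updates = ", ".join(f"{c} = s.{c}" for c in columns if c != key)
    values = ", ".join(f"s.{c}" for c in columns)
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {staging} AS s\n"
        f"ON t.{key} = s.{key}\n"
        f"WHEN MATCHED THEN UPDATE SET {updates}\n"
        f"WHEN NOT MATCHED THEN INSERT VALUES ({values})"
    )


# Hypothetical table and column names, for illustration only.
statement = build_daily_merge(
    target="pdf_store",
    staging="pdf_daily",
    key="pdf_id",
    columns=["pdf_id", "content", "loaded_at"],
)
print(statement)
```

The resulting statement can then be submitted through whatever Hive client you already use (beeline, JDBC, and so on); the daily 100K PDFs would first be loaded into the staging table.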