Guys,
here is the illustration:
https://github.com/parisni/SparkPdfExtractor
Please add issues for any questions or improvement ideas.
Enjoy,
Cheers
2018-04-23 20:42 GMT+02:00 unk1102 :
> Thanks much Nicolas really appreciate it.
Thanks much Nicolas, really appreciate it.
--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
sure, then let me recap the steps:
1. load the PDFs from a local folder into Avro on HDFS
2. load the Avro in Spark as an RDD
3. apply PDFBox to each PDF binary and return the content as a string
4. write the result as one huge CSV file
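Steps 2–4 above could be sketched roughly as follows in Scala, assuming Spark's avro data source and PDFBox are on the classpath; the table location and the `path`/`content` column names are illustrative assumptions, not taken from the original code:

```scala
import java.io.ByteArrayInputStream
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.text.PDFTextStripper
import org.apache.spark.sql.SparkSession

object PdfRecap {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pdf-to-text").getOrCreate()
    import spark.implicits._

    // step 2: load the avro table (one row per pdf: a path and a binary column)
    val pdfs = spark.read.format("avro").load("hdfs:///pdfs.avro")

    // step 3: apply pdfbox to each pdf binary and return the content as a string
    val texts = pdfs.rdd.map { row =>
      val name  = row.getAs[String]("path")
      val bytes = row.getAs[Array[Byte]]("content")
      val doc   = PDDocument.load(new ByteArrayInputStream(bytes))
      try (name, new PDFTextStripper().getText(doc).replace("\n", " "))
      finally doc.close()
    }

    // step 4: write the result as one huge csv file
    texts.toDF("path", "text")
      .coalesce(1)
      .write.option("header", "true").csv("hdfs:///pdfs_text")
  }
}
```

Flattening newlines in the extracted text keeps one pdf per CSV row; dropping `coalesce(1)` would write one part file per partition instead of a single file.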
That's some work for me to push all that, guys. I should find some time
within 7 days though.
Yes Nicolas.
It would be a great help if you can push the code to GitHub and share the URL.
Thanks
Deepak
On Mon, Apr 23, 2018, 23:00 unk1102 wrote:
> Hi Nicolas thanks much for guidance it was very useful information if you
> can push that code to github and share url it would
Hi Nicolas, thanks much for the guidance, it was very useful information. If
you can push that code to GitHub and share the URL it would be a great help.
Looking forward. If you can find time to push early it would be an even
greater help, as I have to finish a POC on this use case ASAP.
2018-04-23 18:59 GMT+02:00 unk1102 :
> Hi Nicolas thanks much for the reply. Do you have any sample code
> somewhere?
>
I have some open-source code. I could find time to push it to GitHub if
needed.
> Do you just keep the pdf in avro binary all the time?
yes, I store them as binary in avro all the time.
Hi Nicolas, thanks much for the reply. Do you have any sample code somewhere?
Do you just keep the pdf in avro binary all the time? How often do you parse
it into text using pdfbox? Is it on an on-demand basis, or do you always parse
to text and keep the pdf binary in avro just as an interim state?
Is there any open-source code base to refer to for this kind of use case?
Thanks
Deepak
On Mon, Apr 23, 2018, 22:13 Nicolas Paris wrote:
> Hi
>
> Problem is number of files on hadoop;
>
> I deal with 50M pdf files. What I did is to put them in an avro table on
> hdfs,
Hi
The problem is the number of files on hadoop;
I deal with 50M pdf files. What I did is to put them in an avro table on
hdfs, as a binary column.
Then I read it with spark and push that into pdfbox.
Transforming 50M pdfs into text took 2 hours on a 5-computer cluster.
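The packing step that makes this work — turning millions of small local pdfs into one avro table with a binary column, instead of 50M tiny HDFS files — might look like this sketch. The local folder, HDFS path, and column names are assumptions for illustration:

```scala
import java.nio.file.{Files, Paths}
import org.apache.spark.sql.SparkSession
import scala.collection.JavaConverters._

// pack a local folder of pdfs into a single avro table on hdfs
object PackPdfs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pack-pdfs").getOrCreate()
    import spark.implicits._

    // collect the pdf paths from the local folder (driver-side listing)
    val files = Files.list(Paths.get("/data/pdfs")).iterator().asScala
      .filter(_.toString.endsWith(".pdf")).toSeq

    // one row per pdf: (path, raw bytes) — a binary column in one table
    // instead of millions of separate hdfs objects
    val rows = files.map(p => (p.toString, Files.readAllBytes(p)))

    spark.createDataset(rows).toDF("path", "content")
      .write.format("avro").save("hdfs:///pdfs.avro")
  }
}
```

Reading the pdfs on the driver is the simple version; for very large corpora the listing and reading would itself be distributed, but the resulting one-table layout is the point.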
About colors and formatting, I
Hi, I need guidance on dealing with a large number of pdf files when using
Hadoop and Spark. Can I store them as binary files using sc.binaryFiles and
then convert them to text using pdf parsers like Apache Tika or PDFBox, or
should I convert them into text using these parsers and store the result as
text files? But in doing
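For a modest number of files, the sc.binaryFiles route asked about here could be sketched as below (paths are illustrative). It yields one (path, PortableDataStream) pair per file, which is exactly why it struggles at 50M files: every pdf is a separate HDFS object, which is the small-files problem the avro approach in this thread avoids:

```scala
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.text.PDFTextStripper
import org.apache.spark.sql.SparkSession

object BinaryFilesPdf {
  def main(args: Array[String]): Unit = {
    val sc = SparkSession.builder().appName("binary-files").getOrCreate()
      .sparkContext

    // one (path, PortableDataStream) pair per file — fine for thousands
    // of pdfs, painful for millions
    val texts = sc.binaryFiles("hdfs:///pdfs/*.pdf").map {
      case (path, stream) =>
        val doc = PDDocument.load(stream.toArray())
        try (path, new PDFTextStripper().getText(doc))
        finally doc.close()
    }
    texts.saveAsTextFile("hdfs:///pdfs_text")
  }
}
```

So the trade-off in the question is really between per-file storage (simple, but NameNode-hostile at scale) and packing binaries into a container format like avro first.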