RE: Extracting Text from embedded images in PDF docs

2017-05-23 Thread Allison, Timothy B.
Subject: Re: Extracting Text from embedded images in PDF docs Hi Tim, Sure, once I get an initial PR ready I'll send an update and explain what I did for a start, and we will discuss it further

Re: Extracting Text from embedded images in PDF docs

2017-05-19 Thread Sergey Beryozkin
Beam, tho. :) -Original Message- From: Sergey Beryozkin [mailto:sberyoz...@gmail.com] Sent: Friday, May 19, 2017 12:40 PM To: user@tika.apache.org Subject: Re: Extracting Text from embedded images in PDF docs Hi Tim On 19/05/17 17:31, Allison, Timothy B. wrote: The autoscaling feature of Beam a

RE: Extracting Text from embedded images in PDF docs

2017-05-19 Thread Allison, Timothy B.
This is fantastic news! Let me know if I can help...I know _nothing_ about Beam, tho. :) -Original Message- From: Sergey Beryozkin [mailto:sberyoz...@gmail.com] Sent: Friday, May 19, 2017 12:40 PM To: user@tika.apache.org Subject: Re: Extracting Text from embedded images in PDF docs

Re: Extracting Text from embedded images in PDF docs

2017-05-19 Thread Sergey Beryozkin
ira/browse/BEAM-2328 It will take me a few more weeks to create a PR, Thanks, Sergey -Original Message- From: Sergey Beryozkin [mailto:sberyoz...@gmail.com] Sent: Friday, May 19, 2017 12:27 PM To: user@tika.apache.org Subject: Re: Extracting Text from embedded images in PDF docs Hi Chr

RE: Extracting Text from embedded images in PDF docs

2017-05-19 Thread Allison, Timothy B.
Message- From: Sergey Beryozkin [mailto:sberyoz...@gmail.com] Sent: Friday, May 19, 2017 12:27 PM To: user@tika.apache.org Subject: Re: Extracting Text from embedded images in PDF docs Hi Chris I'm getting nervous now, what will happen to me if it will not work out in the end :-). Though, it

Re: Extracting Text from embedded images in PDF docs

2017-05-19 Thread Chris Mattmann
💯 On 5/19/17, 9:27 AM, "Sergey Beryozkin" wrote: Hi Chris I'm getting nervous now, what will happen to me if it will not work out in the end :-). Though, it actually does work, for me at least :-) Cheers, Sergey On 19/05/17 17:23, Mattmann, Chris A (3010) wrote:

RE: Extracting Text from embedded images in PDF docs

2017-05-19 Thread Allison, Timothy B.
> Well, I'm trying to integrate Tika with Apache Beam,
Awesome! I saw two fantastic Beam talks at ApacheCon (two days ago?). I won't tell anyone. ;)

Re: Extracting Text from embedded images in PDF docs

2017-05-19 Thread Sergey Beryozkin
Hi Chris I'm getting nervous now, what will happen to me if it will not work out in the end :-). Though, it actually does work, for me at least :-) Cheers, Sergey On 19/05/17 17:23, Mattmann, Chris A (3010) wrote: Thanks Sergey what an awesome surprise you are the best! +

Re: Extracting Text from embedded images in PDF docs

2017-05-19 Thread Mattmann, Chris A (3010)
Thanks Sergey what an awesome surprise you are the best! ++ Chris Mattmann, Ph.D. Principal Data Scientist, Engineering Administrative Office (3010) Manager, NSF & Open Source Projects Formulation and Development Offices (8212

Re: Extracting Text from embedded images in PDF docs

2017-05-19 Thread Sergey Beryozkin
Hi Tim On 19/05/17 16:47, Allison, Timothy B. wrote: Yes I was asking about it as I thought it was confusing it did not work - I saw you following up on this possible issue in the other email... Y, I agree. That _should_ work. I'm doing some work with Tika now so it was of an immediate inte

RE: Extracting Text from embedded images in PDF docs

2017-05-19 Thread Allison, Timothy B.
> Yes I was asking about it as I thought it was confusing it did not work - I saw you following up on this possible issue in the other email...
Y, I agree. That _should_ work.
> I'm doing some work with Tika now so it was of an immediate interest to me...
Yay! What are you working on?
> Sure. By

Re: Extracting Text from embedded images in PDF docs

2017-05-19 Thread Sergey Beryozkin
On 19/05/17 16:25, Allison, Timothy B. wrote: and when is "extractInlineImages" actually effective ? Not sure I understand the question exactly? If the question is "why didn't extractInlineImages work on a specific document"? That's probably a bug or could be user error in the configuratio

RE: Extracting Text from embedded images in PDF docs

2017-05-19 Thread Allison, Timothy B.
>> and when is "extractInlineImages" actually effective?
Not sure I understand the question exactly? If the question is "why didn't extractInlineImages work on a specific document"? That's probably a bug or could be user error in the configuration...either way, please follow up and help us so

RE: Extracting Text from embedded images in PDF docs

2017-05-19 Thread Allison, Timothy B.
rom embedded images in PDF docs Got it working. In case someone else hits the same issue, here is my config file... Well... That was obvious :D ocr_and_text David On 19 May 2017 at 10:59, David Pilato mailto

Re: Extracting Text from embedded images in PDF docs

2017-05-19 Thread Sergey Beryozkin
Extracting Text from embedded images in PDF docs Got it working. In case someone else hits the same issue, here is my config file... Well... That was obvious :D <properties> <parsers> <parser class="org.apache.tika.parser.DefaultParser"/> <parser cl

RE: Extracting Text from embedded images in PDF docs

2017-05-19 Thread Allison, Timothy B.
documentation so that you don’t waste an hour? From: David Pilato [mailto:da...@pilato.fr] Sent: Friday, May 19, 2017 5:55 AM To: user@tika.apache.org Subject: Re: Extracting Text from embedded images in PDF docs Got it working. In case someone else hits the same issue, here is my config file... Well... That

Re: Extracting Text from embedded images in PDF docs

2017-05-19 Thread Chris Mattmann
tika.apache.org" Subject: Re: Extracting Text from embedded images in PDF docs Got it working. In case someone else hits the same issue, here is my config file... Well... That was obvious :D ocr_and_text

Re: Extracting Text from embedded images in PDF docs

2017-05-19 Thread David Pilato
Got it working. In case someone else hits the same issue, here is my config file... Well... That was obvious :D ocr_and_text David > On 19 May 2017 at 10:59, David Pilato wrote: > > So I saw in debug mode tha
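
For anyone hitting the same issue, here is a minimal sketch of what such a tika-config.xml might look like. Only the properties/parsers skeleton, the DefaultParser entry and the value ocr_and_text appear in this thread. The PDFParser class name, the parameter names ocrStrategy and extractInlineImages, and their type attributes are assumptions based on Tika 1.x's PDFParser configuration, so check them against the Tika version you are running.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Keep the default parsers for every other file type (as in the thread) -->
    <parser class="org.apache.tika.parser.DefaultParser"/>
    <!-- Assumed: configure the PDF parser separately so OCR settings apply to PDFs -->
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <!-- Value taken from the thread; parameter name is an assumption -->
        <param name="ocrStrategy" type="string">ocr_and_text</param>
        <!-- Assumed parameter: the thread notes getExtractInlineImages() was false by default -->
        <param name="extractInlineImages" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
```

Tesseract still has to be installed and reachable for OCR to run, as described in the original post. Depending on the document, one of the two parameters may be enough on its own.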

Re: Extracting Text from embedded images in PDF docs

2017-05-19 Thread David Pilato
So I saw in debug mode that indeed config.getExtractInlineImages() is false so I'm going to check my config. :D David > On 18 May 2017 at 22:18, David Pilato wrote: > > Hey guys > > > First post here ;) > > I'm trying to play with OCR with Tika. I installed Tesseract and I can > extract

Extracting Text from embedded images in PDF docs

2017-05-18 Thread David Pilato
Hey guys First post here ;) I'm trying to play with OCR with Tika. I installed Tesseract and I can extract text from a PNG image. I created a PDF document with this image embedded and I'm trying now to extract the text out of it. I added this configuration but I guess I'm doing it wrong: