Hi Tim

Sure, once I have an initial PR ready I'll send an update explaining what I did as a start, and we can discuss it further.

Thanks, Sergey
On 19/05/17 19:12, Allison, Timothy B. wrote:
This is fantastic news!  Let me know if I can help...I know _nothing_ about Beam, tho.
:)

-----Original Message-----
From: Sergey Beryozkin [mailto:sberyoz...@gmail.com]
Sent: Friday, May 19, 2017 12:40 PM
To: user@tika.apache.org
Subject: Re: Extracting Text from embedded images in PDF docs

Hi Tim

On 19/05/17 17:31, Allison, Timothy B. wrote:
The autoscaling feature of Beam and the job stealing (not their term) look to be fantastic for Tika jobs.

Though, it actually does work, for me at least :-)
Have you tried the MockParser?  That's where the fun really begins.  Simulate an oom or permanent hang!

Thanks for the hint. The initial issue that will need to be handled is how
to adapt the SAX stream of events to the Beam Pipeline API, so for the moment
I'm using an internal ExecutorService and Queue to adapt.
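
For illustration only, the ExecutorService-plus-Queue idea can be sketched as below. This is not the code from the eventual PR: the class name SaxQueueAdapter, the method collectText and the END marker are made up for the example, and the plain JDK SAX parser stands in for Tika so that it runs without extra dependencies. The parser pushes text events onto a queue from a worker thread, and the caller pulls them off one by one, which is roughly what a pipeline source would need.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxQueueAdapter {
    // Hypothetical end-of-stream marker. It is compared by identity,
    // so it cannot collide with real text that happens to read "END".
    private static final String END = new String("END");

    // Parses the XML on a worker thread (push side) and returns the
    // non-blank text chunks pulled from the queue (pull side).
    public static List<String> collectText(String xml) throws Exception {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        ExecutorService executor = Executors.newSingleThreadExecutor();
        executor.submit(() -> {
            try {
                DefaultHandler handler = new DefaultHandler() {
                    @Override
                    public void characters(char[] ch, int start, int length) {
                        String text = new String(ch, start, length).trim();
                        if (!text.isEmpty()) {
                            queue.add(text);
                        }
                    }
                };
                SAXParserFactory.newInstance().newSAXParser()
                        .parse(new InputSource(new StringReader(xml)), handler);
            } catch (Exception e) {
                // Sketch only: a real adapter would pass the failure to the consumer.
            } finally {
                // Always signal the end so the consumer never blocks forever.
                queue.add(END);
            }
        });

        List<String> chunks = new ArrayList<>();
        try {
            for (String s = queue.take(); s != END; s = queue.take()) {
                chunks.add(s);
            }
        } finally {
            executor.shutdown();
        }
        return chunks;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(collectText("<doc><p>Hello</p><p>Beam</p></doc>"));
    }
}
```

In a Beam setting the consuming loop would presumably live inside a source or DoFn and emit each chunk as an element instead of collecting a list; that part is an assumption and is left out here.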

I've created
https://issues.apache.org/jira/browse/BEAM-2328

It will take me a few more weeks to create a PR.

Thanks, Sergey



-----Original Message-----
From: Sergey Beryozkin [mailto:sberyoz...@gmail.com]
Sent: Friday, May 19, 2017 12:27 PM
To: user@tika.apache.org
Subject: Re: Extracting Text from embedded images in PDF docs

Hi Chris

I'm getting nervous now, what will happen to me if it does not work
out in the end :-). Though, it actually does work, for me at least :-)

Cheers, Sergey
On 19/05/17 17:23, Mattmann, Chris A (3010) wrote:
Thanks Sergey, what an awesome surprise, you are the best!

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Principal Data Scientist, Engineering Administrative Office (3010)
Manager, NSF & Open Source Projects Formulation and Development
Offices (8212) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 180-503E, Mailstop: 180-503
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Director, Information Retrieval and Data Science Group (IRDS) Adjunct
Associate Professor, Computer Science Department University of
Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
On 5/19/17, 9:11 AM, "Sergey Beryozkin" <sberyoz...@gmail.com> wrote:

       Hi Tim
       On 19/05/17 16:47, Allison, Timothy B. wrote:
       >
       >> Yes I was asking about it as I thought it was confusing it did not work
       >> - I saw you following up on this possible issue in the other email...
       > Y, I agree.  That _should_ work.
       >
       >> I'm doing some work with Tika now so it was of an immediate interest to me...
       > Yay! What are you working on?
       >
       It was supposed to be a secret for a few weeks, but I'll let you know, just
       do not tell anyone please :-). Well, I'm trying to integrate Tika with
       Apache Beam, hoping to get something ready in a couple of weeks. If it
       won't make it into the Beam source then I'll create a standalone demo,
       and will share the link either way...
       >> Sure. By the way I was not complaining...
       > I didn't take it that way at all!  I apologize if anything I wrote came across that way.
       >
       Np, my apologies instead :-), I thought maybe I had asked it in a way that
       sounded like a 'why does it just not work' question, which would indeed
       be strange to hear from a Tika committer (nearly a veteran, I should say
       :-)).
Thanks, Sergey
       > Cheers,
       >
       >          Tim
       >
