Hi Tim
Sure, once I have an initial PR ready I'll send an update explaining
what I did as a start, and we can discuss it further.
Thanks, Sergey
On 19/05/17 19:12, Allison, Timothy B. wrote:
This is fantastic news! Let me know if I can help...I know _nothing_ about
Beam, tho.
:)
-----Original Message-----
From: Sergey Beryozkin [mailto:sberyoz...@gmail.com]
Sent: Friday, May 19, 2017 12:40 PM
To: user@tika.apache.org
Subject: Re: Extracting Text from embedded images in PDF docs
Hi Tim
On 19/05/17 17:31, Allison, Timothy B. wrote:
The autoscaling feature of Beam and the job stealing (not their term) look to
be fantastic for Tika jobs.
Though, it actually does work, for me at least :-)
Have you tried the MockParser? That's where the fun really begins. Simulate
an oom or permanent hang!
Thanks for the hint. The first issue that needs to be handled is how to
adapt the SAX stream of events to the Beam Pipeline API, so for the moment
I'm using an internal ExecutorService and Queue to adapt.
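
For readers following the thread, here is a minimal sketch of what such an adapter could look like. It is not the code from the eventual PR: the class name SaxQueueAdapter, the helper extract() and the END sentinel are hypothetical, and it uses the JDK's own SAX parser in place of Tika so that it runs without extra dependencies. The parser runs on an ExecutorService thread and pushes text into a BlockingQueue; the calling thread drains the queue, which is the point where a Beam source could hand records to the pipeline.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class SaxQueueAdapter {
    // Hypothetical end-of-stream marker; compared by identity, so document
    // text that happens to read "END" is never mistaken for it.
    private static final String END = new String("END");

    // SAX handler that turns push-style callbacks into queue entries.
    static class QueueingHandler extends DefaultHandler {
        private final BlockingQueue<String> queue;
        private final StringBuilder text = new StringBuilder();

        QueueingHandler(BlockingQueue<String> queue) {
            this.queue = queue;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // The parser may deliver one text node in several chunks.
            text.append(ch, start, length);
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            flush();
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            flush();
        }

        // Emit the buffered text, if any, as one queue entry.
        private void flush() {
            String s = text.toString().trim();
            text.setLength(0);
            if (!s.isEmpty()) {
                queue.add(s);
            }
        }
    }

    // Parses xml on a background thread and pulls the text fragments
    // from the queue on the calling thread.
    public static List<String> extract(String xml) throws Exception {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Future<?> parse = executor.submit(() -> {
                try {
                    SAXParserFactory.newInstance().newSAXParser()
                            .parse(new InputSource(new StringReader(xml)), new QueueingHandler(queue));
                } finally {
                    queue.add(END); // always unblock the consumer
                }
                return null;
            });
            List<String> out = new ArrayList<>();
            for (String s = queue.take(); s != END; s = queue.take()) {
                out.add(s);
            }
            parse.get(); // rethrows a parser failure, if there was one
            return out;
        } finally {
            executor.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extract("<doc><p>hello</p><p>world</p></doc>"));
    }
}
```

The queue here is unbounded to keep the sketch short; a bounded queue with put() would add back-pressure if the consumer is slower than the parser.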
I've created
https://issues.apache.org/jira/browse/BEAM-2328
It will take me a few more weeks to create a PR.
Thanks, Sergey
-----Original Message-----
From: Sergey Beryozkin [mailto:sberyoz...@gmail.com]
Sent: Friday, May 19, 2017 12:27 PM
To: user@tika.apache.org
Subject: Re: Extracting Text from embedded images in PDF docs
Hi Chris
I'm getting nervous now, what will happen to me if it does not work
out in the end :-). Though, it actually does work, for me at least :-)
Cheers, Sergey
On 19/05/17 17:23, Mattmann, Chris A (3010) wrote:
Thanks Sergey, what an awesome surprise, you are the best!
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Principal Data Scientist, Engineering Administrative Office (3010)
Manager, NSF & Open Source Projects Formulation and Development
Offices (8212) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 180-503E, Mailstop: 180-503
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Director, Information Retrieval and Data Science Group (IRDS) Adjunct
Associate Professor, Computer Science Department University of
Southern California, Los Angeles, CA 90089 USA
WWW: http://irds.usc.edu/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
On 5/19/17, 9:11 AM, "Sergey Beryozkin" <sberyoz...@gmail.com> wrote:
Hi Tim
On 19/05/17 16:47, Allison, Timothy B. wrote:
>
>> Yes I was asking about it as I thought it was confusing it did not work
>> - I saw you following up on this possible issue in the other email...
> Y, I agree. That _should_ work.
>
>> I'm doing some work with Tika now so it was of an immediate interest to me...
> Yay! What are you working on?
>
It was supposed to be a secret for a few weeks, but I'll let you know; do
not tell anyone please :-). Well, I'm trying to integrate Tika with
Apache Beam, hoping to get something ready in a couple of weeks. If it
doesn't make it into the Beam source then I'll create a standalone demo,
and will share the link either way...
>> Sure. By the way I was not complaining...
> I didn't take it that way at all! I apologize if anything I wrote came across that way.
>
Np, my apologies instead :-), I thought maybe I asked it in a way that
sounded like a 'why does it just not work' question, which would indeed
be strange to hear from a Tika committer (nearly a veteran, I should say
:-)).
Thanks, Sergey
> Cheers,
>
> Tim
>