Re: What may have changed in ODT parser in Tika 2

2022-06-18 Thread Sergey Beryozkin
Hi Tim

Thanks for this fix, confirmed all works now with 2.4.1

Cheers, Sergey

On Mon, May 2, 2022 at 6:43 PM Sergey Beryozkin 
wrote:

> Hi Tim
>
> I gave it another try and it looks like only the thumbnail file name is
> reported, `ToTextContentHandler` is used by default
>
> I can try again with 2.4.1 RC later
>
> Thanks, Sergey
>
>
> On Sat, Apr 30, 2022 at 2:08 PM Sergey Beryozkin 
> wrote:
>
>> Hi Tim
>>
>> Thanks for a quick fix, missed your answer yesterday,  will check soon
>> and let you know.
>>
>> Cheers Sergey
>>
>>
>> On Fri 29 Apr 2022, 16:49 Tim Allison,  wrote:
>>
>>> Hi Sergey,
>>>   That the thumbnail file name showed up in the stream is a bug I
>>> introduced in 2.3.x.  I missed it in the fix in 2.4.0 (TIKA-3711), but
>>> I just fixed it now (TIKA-3745).
>>>   Are you not seeing "Hello Quarkus" at all, or is it just not the
>>> only text -- contains vs equals?  I am seeing "Hello Quarkus" in at
>>> least the 2.4.0-rc1.
>>>
>>> On Fri, Apr 29, 2022 at 10:54 AM Sergey Beryozkin 
>>> wrote:
>>> >
>>> > Hi Tim, All
>>> >
>>> > I have a simple test reading a string content from an ODT doc failing,
>>> PDF,
>>> > Excel are good, but something is going on with the ODT parsing.
>>> >
>>> > quarkus.odt in
>>> >
>>> https://github.com/quarkiverse/quarkus-tika/blob/main/integration-tests/src/main/resources/
>>> > is expected to return a "Hello Quarkus" string
>>> >
>>> > but now the test fails with
>>> >
>>> > Expected: is "Hello Quarkus"
>>> >   Actual: Thumbnails/thumbnail.png.
>>> >
>>> > AutoDetectParser is used to parse, using a standard sequence
>>> >
>>> >
>>> https://github.com/quarkiverse/quarkus-tika/blob/main/runtime/src/main/java/io/quarkus/tika/TikaParser.java#L85
>>> >
>>> > Maybe it is an auto-detection issue, the media type which is used is
>>> here:
>>> >
>>> >
>>> https://github.com/quarkiverse/quarkus-tika/blob/main/integration-tests/src/test/java/io/quarkus/it/tika/TikaParserTest.java#L25
>>> >
>>> > Any hints will be appreciated
>>> >
>>> > Thanks, Sergey
>>>
>>
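
The "contains vs equals" question raised in the quoted reply can be
illustrated with a small sketch. It assumes, purely for illustration, that
the extracted text carries the thumbnail entry name ahead of the real
content; the sample string and the class and method names are hypothetical
and are not taken from Tika or from the Quarkus test.

```java
// Illustrative only: the sample string is an assumed example of extracted
// text with an unwanted leading entry, not captured Tika output.
final class ExtractedTextCheck {

    /** Strict check: passes only if the trimmed text is exactly the expected text. */
    static boolean equalsExpected(String extracted, String expected) {
        return extracted.trim().equals(expected);
    }

    /** Lenient check: passes if the expected text appears anywhere in the extracted text. */
    static boolean containsExpected(String extracted, String expected) {
        return extracted.contains(expected);
    }

    public static void main(String[] args) {
        String expected = "Hello Quarkus";
        String extracted = "Thumbnails/thumbnail.png\nHello Quarkus\n";
        System.out.println("equals:   " + equalsExpected(extracted, expected));
        System.out.println("contains: " + containsExpected(extracted, expected));
    }
}
```

A strict assertion of this kind fails as soon as any extra text reaches the
stream, while a lenient one would only fail if the expected text were
missing altogether, which is the distinction the question is probing.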


Re: Code review on use of cxf in Apache Tika?

2022-05-07 Thread Sergey Beryozkin
Hi Andriy

No problem at all, thanks a million for starting to look into it

Sergey

On Fri, May 6, 2022 at 2:06 AM Andriy Redko  wrote:

> Hey Sergey,
>
> My apologies, I was off last week and only now caught up with all the
> things,
> so let me understand the problem first. I have looked at [1] and saw that
> team
> has added programmatic SSL/TLS configuration [2]. However, another
> approach team is
> looking at is to use declarative cxf.xml to replace the programmatic one,
> is that
> an accurate description? And once tried, we have separate Jetty server
> spawn up?
>
> Thank you!
>
> [1] https://issues.apache.org/jira/browse/TIKA-3719
> [2]
> https://github.com/apache/tika/commit/c1c69dac4f5f948f38e0b198c3fdaad61a7d80be#diff-32fed2ec8d113792f680c2242ac6cb0cb67cfdd142660d993dbe92aaede00f6fR265
>
> Best Regards,
> Andriy Redko
>
>
> SB> Hey Andriy
>
> SB> Great stuff, glad to hear, it is a collection of JAX-RS endpoints
> backed up
> SB> by CXF, so the team needs some help to setup HTTPS, Basic (and possibly
> SB> bearer JWT token verification going forward), I can help with
> clarifying
> SB> some details related to JWT, CXF has everything related to it...
>
> SB> Cheers, Sergey
>
> SB> On Sat, Apr 23, 2022 at 8:02 PM Andriy Redko  wrote:
>
> >> Hi Tim & Sergey,
>
> >> Yeah, sure, happy to help here. I think I understood the problem, will
> try
> >> to
> >> look shortly on how to address that in context of Tika Server (I have
> never
> >> used the server-based deployment of Tika yet).
>
> >> Best Regards,
> >> Andriy Redko
>
> >> SB> Hi Tim
>
> >> SB> Apologies I'm totally occupied with Quarkus right now, I'm sorry it
> >> SB> consumes all the time.
> >> SB> Andriy, if you could help the Tika colleagues then it would be
> great,
> >> as
> >> SB> you've helped with integrating Tika in Apache CXF as well, recall
> how
> >> we
> >> SB> enjoyed the presentation about Tika at one of ASF Conferences :-).
>
> >> SB> Cheers, Sergey
>
> >> SB> On Thu, Apr 21, 2022 at 10:55 PM Tim Allison 
> >> wrote:
>
> >> >> Friends and colleagues,
>
> >> >>   Over on Apache Tika, our server has been using cxf for a long time.
> >> >> We've been very happy with its capabilities and robustness.  So,
> thank
> >> >> you!
>
> >> >>   Recently we were asked to add TLS, and we managed to do so
> >> >> programmatically[0]. The requestor on that issue noted that it would
> >> >> be great if we could use the regular cxf.xml file configuration
> >> >> process[1].  Further, the requestor noted that if he put a cxf.xml
> >> >> file on his class path, a separate jetty server was spun up.  Are
> >> >> there better ways we can use CXF and its configuration process?
> >> >>   This is how we're initializing the server [2].
>
> >> >>Thank you!
>
> >> >>   Best,
>
> >> >>  Tim
>
> >> >> [0]
> >> >>
> >>
> https://github.com/apache/tika/blob/TIKA-3719/tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/TikaServerProcess.java#L259
>
> >> >> [1]
> >> >>
> >>
> https://issues.apache.org/jira/browse/TIKA-3725?focusedCommentId=17526098=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17526098
>
> >> >> [2]
> >> >>
> >>
> https://github.com/apache/tika/blob/main/tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/TikaServerProcess.java#L234
>
>


Re: What may have changed in ODT parser in Tika 2

2022-05-02 Thread Sergey Beryozkin
Hi Tim

I gave it another try and it looks like only the thumbnail file name is
reported, `ToTextContentHandler` is used by default

I can try again with 2.4.1 RC later

Thanks, Sergey


On Sat, Apr 30, 2022 at 2:08 PM Sergey Beryozkin 
wrote:

> Hi Tim
>
> Thanks for a quick fix, missed your answer yesterday,  will check soon
> and let you know.
>
> Cheers Sergey
>
>
> On Fri 29 Apr 2022, 16:49 Tim Allison,  wrote:
>
>> Hi Sergey,
>>   That the thumbnail file name showed up in the stream is a bug I
>> introduced in 2.3.x.  I missed it in the fix in 2.4.0 (TIKA-3711), but
>> I just fixed it now (TIKA-3745).
>>   Are you not seeing "Hello Quarkus" at all, or is it just not the
>> only text -- contains vs equals?  I am seeing "Hello Quarkus" in at
>> least the 2.4.0-rc1.
>>
>> On Fri, Apr 29, 2022 at 10:54 AM Sergey Beryozkin 
>> wrote:
>> >
>> > Hi Tim, All
>> >
>> > I have a simple test reading a string content from an ODT doc failing,
>> PDF,
>> > Excel are good, but something is going on with the ODT parsing.
>> >
>> > quarkus.odt in
>> >
>> https://github.com/quarkiverse/quarkus-tika/blob/main/integration-tests/src/main/resources/
>> > is expected to return a "Hello Quarkus" string
>> >
>> > but now the test fails with
>> >
>> > Expected: is "Hello Quarkus"
>> >   Actual: Thumbnails/thumbnail.png.
>> >
>> > AutoDetectParser is used to parse, using a standard sequence
>> >
>> >
>> https://github.com/quarkiverse/quarkus-tika/blob/main/runtime/src/main/java/io/quarkus/tika/TikaParser.java#L85
>> >
>> > Maybe it is an auto-detection issue, the media type which is used is
>> here:
>> >
>> >
>> https://github.com/quarkiverse/quarkus-tika/blob/main/integration-tests/src/test/java/io/quarkus/it/tika/TikaParserTest.java#L25
>> >
>> > Any hints will be appreciated
>> >
>> > Thanks, Sergey
>>
>


Re: What may have changed in ODT parser in Tika 2

2022-04-30 Thread Sergey Beryozkin
Hi Tim

Thanks for a quick fix, missed your answer yesterday,  will check soon
and let you know.

Cheers Sergey


On Fri 29 Apr 2022, 16:49 Tim Allison,  wrote:

> Hi Sergey,
>   That the thumbnail file name showed up in the stream is a bug I
> introduced in 2.3.x.  I missed it in the fix in 2.4.0 (TIKA-3711), but
> I just fixed it now (TIKA-3745).
>   Are you not seeing "Hello Quarkus" at all, or is it just not the
> only text -- contains vs equals?  I am seeing "Hello Quarkus" in at
> least the 2.4.0-rc1.
>
> On Fri, Apr 29, 2022 at 10:54 AM Sergey Beryozkin 
> wrote:
> >
> > Hi Tim, All
> >
> > I have a simple test reading a string content from an ODT doc failing,
> PDF,
> > Excel are good, but something is going on with the ODT parsing.
> >
> > quarkus.odt in
> >
> https://github.com/quarkiverse/quarkus-tika/blob/main/integration-tests/src/main/resources/
> > is expected to return a "Hello Quarkus" string
> >
> > but now the test fails with
> >
> > Expected: is "Hello Quarkus"
> >   Actual: Thumbnails/thumbnail.png.
> >
> > AutoDetectParser is used to parse, using a standard sequence
> >
> >
> https://github.com/quarkiverse/quarkus-tika/blob/main/runtime/src/main/java/io/quarkus/tika/TikaParser.java#L85
> >
> > Maybe it is an auto-detection issue, the media type which is used is
> here:
> >
> >
> https://github.com/quarkiverse/quarkus-tika/blob/main/integration-tests/src/test/java/io/quarkus/it/tika/TikaParserTest.java#L25
> >
> > Any hints will be appreciated
> >
> > Thanks, Sergey
>


What may have changed in ODT parser in Tika 2

2022-04-29 Thread Sergey Beryozkin
Hi Tim, All

I have a simple test that reads string content from an ODT doc, and it is
failing. PDF and Excel are good, but something is going on with the ODT
parsing.

quarkus.odt in
https://github.com/quarkiverse/quarkus-tika/blob/main/integration-tests/src/main/resources/
is expected to return a "Hello Quarkus" string

but now the test fails with

Expected: is "Hello Quarkus"
  Actual: Thumbnails/thumbnail.png.

AutoDetectParser is used to parse, using a standard sequence

https://github.com/quarkiverse/quarkus-tika/blob/main/runtime/src/main/java/io/quarkus/tika/TikaParser.java#L85

Maybe it is an auto-detection issue, the media type which is used is here:

https://github.com/quarkiverse/quarkus-tika/blob/main/integration-tests/src/test/java/io/quarkus/it/tika/TikaParserTest.java#L25

Any hints will be appreciated

Thanks, Sergey


Re: How to deal with the recursive content in Tika 2

2022-04-29 Thread Sergey Beryozkin
That helped with the recursive parser test

Thanks, Sergey

On Thu, Apr 28, 2022 at 4:37 PM Sergey Beryozkin 
wrote:

> Great, will give it a try asap
>
> Cheers, Sergey
>
> On Thu, Apr 28, 2022 at 4:22 PM Tim Allison  wrote:
>
>> Give this a try:
>>
>> https://github.com/apache/tika/blob/main/tika-core/src/main/java/org/apache/tika/metadata/TikaCoreProperties.java#L60
>>
>> On Thu, Apr 28, 2022 at 11:12 AM Sergey Beryozkin 
>> wrote:
>> >
>> > Hi Tim, All
>> >
>> > We have a pending issue in Quarkus Tika to upgrade to Tika 2.
>> > One of the problems is that according to a user's comment the recursive
>> > content is treated somehow differently in Tika2, specifically, this
>> code:
>> >
>> >
>> https://github.com/quarkiverse/quarkus-tika/blob/main/runtime/src/main/java/io/quarkus/tika/TikaParser.java#L95
>> >
>> > attempts to get a collection of the parsed outer and embedded documents
>> by
>> > accessing them as
>> >
>> > metadata.get(AbstractRecursiveParserWrapperHandler.TIKA_CONTENT);
>> >
>> > What is the equivalent way to achieve the same with Tika 2 ?
>> >
>> > Thanks, Sergey
>>
>


Re: How to deal with the recursive content in Tika 2

2022-04-28 Thread Sergey Beryozkin
Great, will give it a try asap

Cheers, Sergey

On Thu, Apr 28, 2022 at 4:22 PM Tim Allison  wrote:

> Give this a try:
>
> https://github.com/apache/tika/blob/main/tika-core/src/main/java/org/apache/tika/metadata/TikaCoreProperties.java#L60
>
> On Thu, Apr 28, 2022 at 11:12 AM Sergey Beryozkin 
> wrote:
> >
> > Hi Tim, All
> >
> > We have a pending issue in Quarkus Tika to upgrade to Tika 2.
> > One of the problems is that according to a user's comment the recursive
> > content is treated somehow differently in Tika2, specifically, this code:
> >
> >
> https://github.com/quarkiverse/quarkus-tika/blob/main/runtime/src/main/java/io/quarkus/tika/TikaParser.java#L95
> >
> > attempts to get a collection of the parsed outer and embedded documents
> by
> > accessing them as
> >
> > metadata.get(AbstractRecursiveParserWrapperHandler.TIKA_CONTENT);
> >
> > What is the equivalent way to achieve the same with Tika 2 ?
> >
> > Thanks, Sergey
>


How to deal with the recursive content in Tika 2

2022-04-28 Thread Sergey Beryozkin
Hi Tim, All

We have a pending issue in Quarkus Tika to upgrade to Tika 2.
One of the problems is that, according to a user's comment, recursive
content is treated somehow differently in Tika 2; specifically, this code:

https://github.com/quarkiverse/quarkus-tika/blob/main/runtime/src/main/java/io/quarkus/tika/TikaParser.java#L95

attempts to get a collection of the parsed outer and embedded documents by
accessing them as

metadata.get(AbstractRecursiveParserWrapperHandler.TIKA_CONTENT);

What is the equivalent way to achieve the same with Tika 2?

Thanks, Sergey


Re: Code review on use of cxf in Apache Tika?

2022-04-23 Thread Sergey Beryozkin
Hey Andriy

Great stuff, glad to hear. It is a collection of JAX-RS endpoints backed
by CXF, so the team needs some help to set up HTTPS and Basic
authentication (and possibly bearer JWT token verification going forward).
I can help with clarifying some details related to JWT; CXF has everything
related to it...

Cheers, Sergey

On Sat, Apr 23, 2022 at 8:02 PM Andriy Redko  wrote:

> Hi Tim & Sergey,
>
> Yeah, sure, happy to help here. I think I understood the problem, will try
> to
> look shortly on how to address that in context of Tika Server (I have never
> used the server-based deployment of Tika yet).
>
> Best Regards,
> Andriy Redko
>
> SB> Hi Tim
>
> SB> Apologies I'm totally occupied with Quarkus right now, I'm sorry it
> SB> consumes all the time.
> SB> Andriy, if you could help the Tika colleagues then it would be great,
> as
> SB> you've helped with integrating Tika in Apache CXF as well, recall how
> we
> SB> enjoyed the presentation about Tika at one of ASF Conferences :-).
>
> SB> Cheers, Sergey
>
> SB> On Thu, Apr 21, 2022 at 10:55 PM Tim Allison 
> wrote:
>
> >> Friends and colleagues,
>
> >>   Over on Apache Tika, our server has been using cxf for a long time.
> >> We've been very happy with its capabilities and robustness.  So, thank
> >> you!
>
> >>   Recently we were asked to add TLS, and we managed to do so
> >> programmatically[0]. The requestor on that issue noted that it would
> >> be great if we could use the regular cxf.xml file configuration
> >> process[1].  Further, the requestor noted that if he put a cxf.xml
> >> file on his class path, a separate jetty server was spun up.  Are
> >> there better ways we can use CXF and its configuration process?
> >>   This is how we're initializing the server [2].
>
> >>Thank you!
>
> >>   Best,
>
> >>  Tim
>
> >> [0]
> >>
> https://github.com/apache/tika/blob/TIKA-3719/tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/TikaServerProcess.java#L259
>
> >> [1]
> >>
> https://issues.apache.org/jira/browse/TIKA-3725?focusedCommentId=17526098=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17526098
>
> >> [2]
> >>
> https://github.com/apache/tika/blob/main/tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/TikaServerProcess.java#L234
>
>


Re: Code review on use of cxf in Apache Tika?

2022-04-23 Thread Sergey Beryozkin
Hi Tim

Apologies, I'm totally occupied with Quarkus right now; I'm sorry it
consumes all the time.
Andriy, if you could help the Tika colleagues then it would be great, as
you've helped with integrating Tika in Apache CXF as well. Recall how we
enjoyed the presentation about Tika at one of the ASF conferences :-).

Cheers, Sergey

On Thu, Apr 21, 2022 at 10:55 PM Tim Allison  wrote:

> Friends and colleagues,
>
>   Over on Apache Tika, our server has been using cxf for a long time.
> We've been very happy with its capabilities and robustness.  So, thank
> you!
>
>   Recently we were asked to add TLS, and we managed to do so
> programmatically[0]. The requestor on that issue noted that it would
> be great if we could use the regular cxf.xml file configuration
> process[1].  Further, the requestor noted that if he put a cxf.xml
> file on his class path, a separate jetty server was spun up.  Are
> there better ways we can use CXF and its configuration process?
>   This is how we're initializing the server [2].
>
>Thank you!
>
>   Best,
>
>  Tim
>
> [0]
> https://github.com/apache/tika/blob/TIKA-3719/tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/TikaServerProcess.java#L259
>
> [1]
> https://issues.apache.org/jira/browse/TIKA-3725?focusedCommentId=17526098=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17526098
>
> [2]
> https://github.com/apache/tika/blob/main/tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/TikaServerProcess.java#L234
>


[OT] Looking for Apache POI help

2020-10-20 Thread Sergey Beryozkin
Hi All,
sorry for this off-topic post, it is a little bit relevant to Tika dev, but
only a little bit :-),

We are seeing good interest in making Apache POI work in our Quarkus
project, both as part of its Quarkus Tika integration and independently,
particularly in running it in a GraalVM native image; all works fine in
the regular JVM.
If someone with Apache POI experience is interested, please ping me
offline and I will link to the Quarkus issues/discussions.
We do quite well in Quarkus Tika for the mainstream formats like PDF, but
some formats like those supported via Apache POI are not very well covered.

Thanks, and sorry for the noise
Sergey


Re: Looking for a small PDF file with fontbox fonts

2020-10-03 Thread Sergey Beryozkin
Never mind, we've got the file :-)

Thanks, Sergey

On Fri, Oct 2, 2020 at 6:27 PM Sergey Beryozkin 
wrote:

> Hi All
>
> I'm looking for a small PDF file which I can copy to Quarkus to confirm a
> native GraalVM issue related to loading these resources:
>
>
> https://svn.apache.org/viewvc/pdfbox/trunk/fontbox/src/main/resources/org/apache/fontbox/cmap/
>
> I'd appreciate if someone could share a link
>
> Cheers, Sergey
>


Looking for a small PDF file with fontbox fonts

2020-10-02 Thread Sergey Beryozkin
Hi All

I'm looking for a small PDF file which I can copy to Quarkus to confirm a
native GraalVM issue related to loading these resources:

https://svn.apache.org/viewvc/pdfbox/trunk/fontbox/src/main/resources/org/apache/fontbox/cmap/

I'd appreciate it if someone could share a link

Cheers, Sergey


Re: [EXTERNAL] Tika 2.0 modularization

2020-08-19 Thread Sergey Beryozkin
Hi Tim

It looks good. Perfect.
Do you plan to have tika-parsers reuse the new modules as its dependencies?

Cheers, Sergey

On Tue, Aug 18, 2020 at 3:41 PM Tim Allison  wrote:

> If anyone has any time, please take a look here:
> https://github.com/apache/tika/tree/branch_2x/tika-parser-modules
>
> Does this basically look ok?
>
> I've put the integration tests in
> https://github.com/apache/tika/tree/branch_2x/tika-parser-integration-tests
> ... that doesn't build yet.
>
> I've flipped Bob's design so that the integration tests pull test files
> from the individual parser modules via test-jar.
>
> On Fri, Aug 14, 2020 at 3:30 PM Bob Paulin  wrote:
>
> > +1 excited about this.
> >
> > - Bob
> > On 8/14/2020 11:29 AM, Sergey Beryozkin wrote:
> >
> > +1 
> >
> > Cheers Sergey
> >
> > On Fri 14 Aug 2020, 18:26 Chris Mattmann,  <
> mattm...@apache.org> wrote:
> >
> >
> > Haha  I’m down and supportive!
> >
> >
> >
> > Time’s TIME FOR 2.x 
> >
> >
> >
> >
> >
> >
> >
> > From: Tim Allison
> > Reply-To: "dev@tika.apache.org" <dev@tika.apache.org>, "Allison, Tim (US
> > 174B-Affiliate)" <timothy.b.alli...@jpl.nasa.gov>
> > Date: Friday, August 14, 2020 at 6:06 AM
> > To: " "
> > Subject: [EXTERNAL] Tika 2.0 modularization
> >
> >
> >
> > All,
> >
> >   I _think_ I might have some time to start working on integrating Bob's
> >
> > work on the current main branch.  I'll have to ignore most of the
> incoming
> >
> > issues for a bit...unlike the last 4 years...this time I mean it. :)
> >
> >   Let me know if there are any objections to heading down this path now.
> >
> >
> >
> >Cheers,
> >
> >
> >
> >   Tim
> >
> >
> >
> >
> >
> >
>


Re: [EXTERNAL] Tika 2.0 modularization

2020-08-14 Thread Sergey Beryozkin
+1 

Cheers Sergey

On Fri 14 Aug 2020, 18:26 Chris Mattmann,  wrote:

> Haha  I’m down and supportive!
>
>
>
> Time’s TIME FOR 2.x 
>
>
>
>
>
>
>
> From: Tim Allison 
> Reply-To: "dev@tika.apache.org" , "Allison, Tim (US
> 174B-Affiliate)" 
> Date: Friday, August 14, 2020 at 6:06 AM
> To: "" 
> Subject: [EXTERNAL] Tika 2.0 modularization
>
>
>
> All,
>
>   I _think_ I might have some time to start working on integrating Bob's
>
> work on the current main branch.  I'll have to ignore most of the incoming
>
> issues for a bit...unlike the last 4 years...this time I mean it. :)
>
>   Let me know if there are any objections to heading down this path now.
>
>
>
>Cheers,
>
>
>
>   Tim
>
>
>
>


Re: Tika master branch not building

2020-04-08 Thread Sergey Beryozkin
Hi Lewis

Getting one of the latest releases should be fine. While I've been out of
touch with CXF recently, I can ask around for some version advice if
addressing this issue proves problematic, as the guys there deal with
security vulnerabilities seriously.
Cheers, Sergey

On Tue, Apr 7, 2020 at 10:44 PM Lewis John McGibbney 
wrote:

> I suspected this was the case folks :)
> I actually really like this idea.
> I'll take the action item to address this seeing as I pulled it up...
> seeing as I am also working on tika-server right now I'll also take the
> action item to address the vulnerable CXF deps.
> Thanks,
> Lewis
>
> On 2020/04/06 16:19:16, Tim Allison  wrote:
> > >We shouldn't have any at release time, but they will obviously creep in
> > between releases
> >
> > Except the time, where I did the release and was trying to build it for
> > updating the site, and this had already kicked in. :(
> >
> > Y, we can turn this to warn, as long as we run it with fail as part of
> the
> > release process.
> >
> > On Mon, Apr 6, 2020 at 9:59 AM Nick Burch  wrote:
> >
> > > On Mon, 6 Apr 2020, Eric Pugh wrote:
> > > > Maybe this needs better documentation, however this is a “works as
> > > > designed” feature!
> > > >
> > > > To avoid the build failing, run mvn package -Dossindex.fail=false
> > >
> > > Should we maybe have this set to false by default, and only enabled
> > > on release builds?
> > >
> > > (We shouldn't have any at release time, but they will obviously creep
> in
> > > between releases)
> > >
> > > Nick
> >
>


Re: [VOTE] Release Apache Tika 1.23 Candidate #2

2019-12-03 Thread Sergey Beryozkin
+1
Sergey

On Tue, Dec 3, 2019 at 9:44 AM Oleg Tikhonov  wrote:

> [x] +1 Release this package as Apache Tika 1.23
>
> Thanks,
> Oleg
>
> On Tue, Dec 3, 2019 at 5:15 AM Tim Allison  wrote:
>
> > A candidate for the Tika 1.23 release is available at:
> >   https://dist.apache.org/repos/dist/dev/tika/
> >
> > The release candidate is a zip archive of the sources in:
> >   https://github.com/apache/tika/tree/1.23-rc2/
> >
> > The SHA-512 checksum of the archive is
> >
> >
> >
> d6e91f6b29183f836ccb4faabb690c07f4c33408d846f3d93e65b780745ca8c1dd6bb7cea6c265e987a06c318cbea2fcedc4c7ca723c030da46bbcd3423b49cf.
> >
> > In addition, a staged maven repository is available here:
> >
> >
> >
> https://repository.apache.org/content/repositories/orgapachetika-1057/org/apache/tika
> >
> > Please vote on releasing this package as Apache Tika 1.23.
> > The vote is open for the next 72 hours and passes if a majority of at
> > least three +1 Tika PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Tika 1.23
> > [ ] -1 Do not release this package because...
> >
> > Here's my +1.
> >
> > Cheers,
> >
> >  Tim
> >
>


Re: Concern about tika-parsers' dependencies

2019-11-28 Thread Sergey Beryozkin
Hi

We have an issue assigned to me, I hope to complete Bob's modularization
effort asap

Sergey

On Thu, Nov 28, 2019 at 9:46 AM Mark Hissink Muller  wrote:

> Hi all,
>
> I would like to voice my concern about the amount of dependencies of
> org.apache.tika:tika-parsers:jar:1.22.
>
> I recently needed to detect a charset, and I found CharsetDetector from
> the tika-parsers jar. It seemed like a small and easy solution to zoom
> in on an encoding problem.
>
> After I started to have problems starting my application (Cannot run
> program "C:\opt\jdk-13\bin\java.exe" ... CreateProcess error=206), I
> discovered the dependencies below were added by
> org.apache.tika:tika-parsers:jar:1.22 (extract from mvn dependency:tree).
>
> I do note that the classpath for my application is already quite busy, but
> what tika-parsers added seems a bit over the top.
>
> Hope this helps.
>
> Best, Mark
>
>
> [INFO] +- org.apache.tika:tika-parsers:jar:1.22:compile
> [INFO] |  +- org.apache.tika:tika-core:jar:1.22:compile
> [INFO] |  +- org.glassfish.jaxb:jaxb-runtime:jar:2.3.2:compile
> [INFO] |  |  +- org.glassfish.jaxb:txw2:jar:2.3.2:compile
> [INFO] |  |  +- com.sun.istack:istack-commons-runtime:jar:3.0.8:compile
> [INFO] |  |  +- org.jvnet.staxex:stax-ex:jar:1.8.1:compile
> [INFO] |  |  \- com.sun.xml.fastinfoset:FastInfoset:jar:1.2.16:compile
> [INFO] |  +- com.sun.activation:jakarta.activation:jar:1.2.1:compile
> [INFO] |  +- xerces:xercesImpl:jar:2.12.0:compile
> [INFO] |  |  \- xml-apis:xml-apis:jar:1.4.01:compile
> [INFO] |  +- org.apache.commons:commons-lang3:jar:3.9:compile
> [INFO] |  +- javax.annotation:javax.annotation-api:jar:1.3.2:compile
> [INFO] |  +- org.gagravarr:vorbis-java-tika:jar:0.8:compile
> [INFO] |  +- org.tallison:jmatio:jar:1.5:compile
> [INFO] |  +- org.apache.james:apache-mime4j-core:jar:0.8.3:compile
> [INFO] |  +- org.apache.james:apache-mime4j-dom:jar:0.8.3:compile
> [INFO] |  +- org.apache.commons:commons-compress:jar:1.18:compile
> [INFO] |  +- org.tukaani:xz:jar:1.8:compile
> [INFO] |  +- com.epam:parso:jar:2.0.11:compile
> [INFO] |  +- org.brotli:dec:jar:0.1.2:compile
> [INFO] |  +- org.apache.pdfbox:pdfbox-tools:jar:2.0.16:compile
> [INFO] |  +- org.apache.pdfbox:jempbox:jar:1.8.16:compile
> [INFO] |  +- org.bouncycastle:bcmail-jdk15on:jar:1.62:compile
> [INFO] |  |  \- org.bouncycastle:bcpkix-jdk15on:jar:1.62:compile
> [INFO] |  +- org.bouncycastle:bcprov-jdk15on:jar:1.62:compile
> [INFO] |  +- org.apache.poi:poi-scratchpad:jar:4.0.1:compile
> [INFO] |  +- com.healthmarketscience.jackcess:jackcess:jar:3.0.1:compile
> [INFO] |  +-
> com.healthmarketscience.jackcess:jackcess-encrypt:jar:3.0.0:compile
> [INFO] |  +- org.ccil.cowan.tagsoup:tagsoup:jar:1.2.1:compile
> [INFO] |  +- org.ow2.asm:asm:jar:7.2-beta:compile
> [INFO] |  +- com.googlecode.mp4parser:isoparser:jar:1.1.22:compile
> [INFO] |  +- com.drewnoakes:metadata-extractor:jar:2.11.0:compile
> [INFO] |  |  \- com.adobe.xmp:xmpcore:jar:5.1.3:compile
> [INFO] |  +- de.l3s.boilerpipe:boilerpipe:jar:1.1.0:compile
> [INFO] |  +- com.rometools:rome:jar:1.12.1:compile
> [INFO] |  |  \- com.rometools:rome-utils:jar:1.12.1:compile
> [INFO] |  +- org.gagravarr:vorbis-java-core:jar:0.8:compile
> [INFO] |  +-
> com.googlecode.juniversalchardet:juniversalchardet:jar:1.0.3:compile
> [INFO] |  +- org.codelibs:jhighlight:jar:1.0.3:compile
> [INFO] |  +- com.pff:java-libpst:jar:0.8.1:compile
> [INFO] |  +- com.github.junrar:junrar:jar:4.0.0:compile
> [INFO] |  +- org.apache.cxf:cxf-rt-rs-client:jar:3.3.2:compile
> [INFO] |  |  +- org.apache.cxf:cxf-rt-transports-http:jar:3.3.2:compile
> [INFO] |  |  +- org.apache.cxf:cxf-core:jar:3.3.2:compile
> [INFO] |  |  |  +- com.fasterxml.woodstox:woodstox-core:jar:5.0.3:compile
> [INFO] |  |  |  |  \- org.codehaus.woodstox:stax2-api:jar:3.1.4:compile
> [INFO] |  |  |  +- org.apache.ws.xmlschema:xmlschema-core:jar:2.2.4:compile
> [INFO] |  |  |  \- org.glassfish.jaxb:jaxb-xjc:jar:2.3.2:compile
> [INFO] |  |  | +- org.glassfish.jaxb:xsom:jar:2.3.2:compile
> [INFO] |  |  | +- org.glassfish.jaxb:codemodel:jar:2.3.2:compile
> [INFO] |  |  | +- com.sun.xml.bind.external:rngom:jar:2.3.2:compile
> [INFO] |  |  | +- com.sun.xml.dtd-parser:dtd-parser:jar:1.4.1:compile
> [INFO] |  |  | +- com.sun.istack:istack-commons-tools:jar:3.0.8:compile
> [INFO] |  |  | |  \- org.apache.ant:ant:jar:1.10.5:compile
> [INFO] |  |  | | \- org.apache.ant:ant-launcher:jar:1.10.5:compile
> [INFO] |  |  | \-
> com.sun.xml.bind.external:relaxng-datatype:jar:2.3.2:compile
> [INFO] |  |  +- org.apache.cxf:cxf-rt-frontend-jaxrs:jar:3.3.2:compile
> [INFO] |  |  |  +- jakarta.ws.rs:jakarta.ws.rs-api:jar:2.1.6:compile
> [INFO] |  |  |  \- org.apache.cxf:cxf-rt-security:jar:3.3.2:compile
> [INFO] |  |  +- javax.xml.ws:jaxws-api:jar:2.3.1:compile
> [INFO] |  |  |  \- javax.xml.soap:javax.xml.soap-api:jar:1.4.0:compile
> [INFO] |  |  +- com.sun.activation:javax.activation:jar:1.2.0:compile
> [INFO] |  |  

Re: Our very own Sergey Beryozkin interviewed by jaxenter.com

2019-10-31 Thread Sergey Beryozkin
I should also of course mention Thejan Wijesinghe, who helped me to set up
my laptop and kept encouraging me during the talk by saying I had at least
40 mins left :-) (or maybe less :-)); we had a good conversation
afterwards.

Cheers, Sergey

On Thu, Oct 31, 2019 at 3:50 PM Sergey Beryozkin 
wrote:

> Hi Tim
>
> Thanks for sharing this link :-). I met Nick at the conference and he also
> suggested to share a link to the presentation, here it is:
>
> https://aceu19.apachecon.com/sites/aceu19.apachecon.com/files/2019-10/SergeyBeryozkin_ApacheTikaGoesNative_0.odp
>
> (and Nick suggested to share a link to the recorded presentation itself
> but I was shocked when I heard it myself, I thought I had no accent :-) lol
> ).
>
> One thing I wanted to say, is that while quite a bit of attention went to
> Quarkus, I've tried to show that Apache Tika was the main star :-). Both
> projects really helped me get back to the ASF after a tough year.
> I've previously integrated Tika into Apache CXF and Beam but I do feel the
> current Quarkus integration will be the best. We've already had some users
> trying it and my colleagues are supportive, so I'm glad about it. Hoping to
> find some time asap to actually do something useful for Tika :-)
>
> Cheers, Sergey
>
>
>
> On Thu, Oct 31, 2019 at 3:14 PM Tim Allison  wrote:
>
>> https://twitter.com/ThoHeller/status/1189737257411973120
>>
>>
>> https://jaxenter.com/apache-tika-data-driven-analytics-heart-modern-applications-163377.html
>>
>> Sergey,
>>
>> Thank you for all of your work on Tika and on the integration with Quarkus
>> and graalvm!
>>
>> Cheers,
>>
>> Tim
>>
>


Re: Our very own Sergey Beryozkin interviewed by jaxenter.com

2019-10-31 Thread Sergey Beryozkin
Hi Tim

Thanks for sharing this link :-). I met Nick at the conference and he also
suggested to share a link to the presentation, here it is:
https://aceu19.apachecon.com/sites/aceu19.apachecon.com/files/2019-10/SergeyBeryozkin_ApacheTikaGoesNative_0.odp

(and Nick suggested to share a link to the recorded presentation itself but
I was shocked when I heard it myself, I thought I had no accent :-) lol ).

One thing I wanted to say, is that while quite a bit of attention went to
Quarkus, I've tried to show that Apache Tika was the main star :-). Both
projects really helped me get back to the ASF after a tough year.
I've previously integrated Tika into Apache CXF and Beam but I do feel the
current Quarkus integration will be the best. We've already had some users
trying it and my colleagues are supportive, so I'm glad about it. Hoping to
find some time asap to actually do something useful for Tika :-)

Cheers, Sergey



On Thu, Oct 31, 2019 at 3:14 PM Tim Allison  wrote:

> https://twitter.com/ThoHeller/status/1189737257411973120
>
>
> https://jaxenter.com/apache-tika-data-driven-analytics-heart-modern-applications-163377.html
>
> Sergey,
>
> Thank you for all of your work on Tika and on the integration with Quarkus
> and graalvm!
>
> Cheers,
>
> Tim
>


Re: HTML to PDF conversion

2019-10-17 Thread Sergey Beryozkin
Hi Tim, All
Sure, I agree that Tika is not really about transformation etc.; that is
just not what I was suggesting, even though I started with a link to the
IHtmlToPdfTransformer. Let me just clarify one more time and I'll be happy to
move on. So, trying to put it in practical terms:
- create a tika-format-creator (or similarly named) module
- introduce a simple generic API (similar to the prototype API earlier in
the thread) for creating simple format specific docs, and document that it is
going to stay experimental for a while
- this API is not about transformation but for Tika users to create the
docs directly
- provide two implementations of this API for a start only, one for PDF,
another one for ODT. In time it may grow a bit to support a few more of the
most used formats, with no goal of supporting hundreds of formats. (This is
why I don't understand the maintenance concern :-) )

In the end the users would be able to use a Tika specific API to read docs
and, for some of the most used formats, to create them.
Tika's appeal is in having a uniform API for reading N formats, so the
users don't have to write code that switches between N format specific parser
APIs. But users working with Tika who also have the task of creating some
formats still have to go beyond Tika, ending up with semi-generic code
after all. That was the idea I tried to convey earlier in the thread...

Thanks all, Sergey


On Wed, Oct 16, 2019 at 5:07 PM Tim Allison  wrote:

> +1 to Ken’s earlier point about maintenance. Note Tika wouldn’t even build
> in Germany, and we only discovered that because of inviting Tilman. :D We
> have a huge amount of maintenance already...
>
> Checkout the incubating Daffodil project that aims to convert files to xml,
> validate them and then serialize back to original format.
>
> I do see a use for transform() and if we could use xhtml as an
> intermediary, then...maybe, but My inclination is w Ken.
>
> On Wed, Oct 16, 2019 at 11:50 AM Ken Krugler  wrote:
>
> > I can see the attraction of one API to convert XHTML to various formats.
> >
> > Though very quickly that simple API would become complex, as each target
> > format has its own conversion options.
> >
> > And if successful, we’d pull in even more 3rd party jars to handle that
> > conversion.
> >
> > Wonder if there’s a need for a new project called “Akit”, which focuses
> on
> > XHTML -> various formats :)
> >
> > — Ken
> >
> > > On Oct 16, 2019, at 5:05 AM, Sergey Beryozkin 
> > wrote:
> > >
> > > Ken, thanks for the feedback, I meant to reply to your comments,
> > >
> > > I suppose I really meant Tika offering a uniform API to create some
> > simple
> > > structured PDF/etc files.
> > > ContentCreator creator = ContentCreator.get("PDF");
> > > creator.addTitle("Introduction to Tika");
> > > creator.addText("");
> > > creator.addTable("tablename", new LinkedHashMap<>());
> > > creator.addAttachment(someImage);
> > > creator.complete();
> > >
> > > It would be consistent with the Tika approach on the read side.
> > >
> > > Cheers, Sergey
> > > On Mon, Oct 14, 2019 at 4:13 PM Ken Krugler 
> wrote:
> > >
> > >> If you’re suggesting ways to make it easier to use something like
> > >> YaHPConverter with Tika, definitely yes.
> > >>
> > >> If you’re talking about integrating this functionality…my personal
> view
> > is
> > >> no.
> > >>
> > >> I think Tika should focus on extracting content from documents, versus
> > >> format transformations.
> > >>
> > >> Tika is an attractive location for functionality like this, since it
> > sits
> > >> in the middle of a lot of data processing pipelines, but I worry
> about a
> > >> bloated code base, with corresponding challenges in maintenance and
> > support.
> > >>
> > >> Regards,
> > >>
> > >> — Ken
> > >>
> > >>
> > >>> On Oct 14, 2019, at 4:38 AM, Sergey Beryozkin 
> > >> wrote:
> > >>>
> > >>> Hi All
> > >>>
> > >>> I've seen a Quarkus user asking how to convert to PDF, and one of my
> > >>> colleagues pointed to
> > >>>
> > >>
> >
> http://www.allcolor.org/YaHPConverter/doc/org/allcolor/yahp/converter/IHtmlToPdfTransformer.html
> > >>>
> > >>> Does it make sense for Tika to offer something related to the text to
> > PDF
> > >>> (for a start, something on top of that transformer), and then may be
> > even
> > >>> for other formats ?
> > >>>
> > >>> Sergey
> > >>
> > >> --
> > >> Ken Krugler
> > >> http://www.scaleunlimited.com
> > >> custom big data solutions & training
> > >> Hadoop, Cascading, Cassandra & Solr
> > >>
> > >>
> >
> > --
> > Ken Krugler
> > http://www.scaleunlimited.com
> > custom big data solutions & training
> > Hadoop, Cascading, Cassandra & Solr
> >
> >
>


Re: HTML to PDF conversion

2019-10-16 Thread Sergey Beryozkin
Such an API would of course be limited in that only fairly simple
format specific content could be created, but many PDFs I've seen are very
simple, so I can imagine having, for example, a TikaPDFCreator implementation
of the ContentCreator interface which would just do some simple delegation to
PDFBox.

But anyway, plenty of tools exist for it...

Cheers, Sergey

On Wed, Oct 16, 2019 at 4:59 PM Sergey Beryozkin 
wrote:

> It was not what I was suggesting. My only proposal was about having a
> simple API (without an attempt to cover all the various format specific
> options at the API level) which would let Tika users quickly create format
> specific content without having to deal with the format specific libraries,
> exactly consistent what it does on the read side.
> I appreciate it can require some effort and by no means I'm pushing for it
>
> Sergey
>
> On Wed, Oct 16, 2019 at 4:50 PM Ken Krugler  wrote:
>
>> I can see the attraction of one API to convert XHTML to various formats.
>>
>> Though very quickly that simple API would become complex, as each target
>> format has its own conversion options.
>>
>> And if successful, we’d pull in even more 3rd party jars to handle that
>> conversion.
>>
>> Wonder if there’s a need for a new project called “Akit”, which focuses
>> on XHTML -> various formats :)
>>
>> — Ken
>>
>> > On Oct 16, 2019, at 5:05 AM, Sergey Beryozkin 
>> wrote:
>> >
>> > Ken, thanks for the feedback, I meant to reply to your comments,
>> >
>> > I suppose I really meant Tika offering a uniform API to create some
>> simple
>> > structured PDF/etc files.
>> > ContentCreator creator = ContentCreator.get("PDF");
>> > creator.addTitle("Introduction to Tika");
>> > creator.addText("");
>> > creator.addTable("tablename", new LinkedHashMap<>());
>> > creator.addAttachment(someImage);
>> > creator.complete();
>> >
>> > It would be consistent with the Tika approach on the read side.
>> >
>> > Cheers, Sergey
>> > On Mon, Oct 14, 2019 at 4:13 PM Ken Krugler 
>> wrote:
>> >
>> >> If you’re suggesting ways to make it easier to use something like
>> >> YaHPConverter with Tika, definitely yes.
>> >>
>> >> If you’re talking about integrating this functionality…my personal
>> view is
>> >> no.
>> >>
>> >> I think Tika should focus on extracting content from documents, versus
>> >> format transformations.
>> >>
>> >> Tika is an attractive location for functionality like this, since it
>> sits
>> >> in the middle of a lot of data processing pipelines, but I worry about
>> a
>> >> bloated code base, with corresponding challenges in maintenance and
>> support.
>> >>
>> >> Regards,
>> >>
>> >> — Ken
>> >>
>> >>
>> >>> On Oct 14, 2019, at 4:38 AM, Sergey Beryozkin 
>> >> wrote:
>> >>>
>> >>> Hi All
>> >>>
>> >>> I've seen a Quarkus user asking how to convert to PDF, and one of my
>> >>> colleagues pointed to
>> >>>
>> >>
>> http://www.allcolor.org/YaHPConverter/doc/org/allcolor/yahp/converter/IHtmlToPdfTransformer.html
>> >>>
>> >>> Does it make sense for Tika to offer something related to the text to
>> PDF
>> >>> (for a start, something on top of that transformer), and then may be
>> even
>> >>> for other formats ?
>> >>>
>> >>> Sergey
>> >>
>> >> --
>> >> Ken Krugler
>> >> http://www.scaleunlimited.com
>> >> custom big data solutions & training
>> >> Hadoop, Cascading, Cassandra & Solr
>> >>
>> >>
>>
>> --
>> Ken Krugler
>> http://www.scaleunlimited.com
>> custom big data solutions & training
>> Hadoop, Cascading, Cassandra & Solr
>>
>>


Re: HTML to PDF conversion

2019-10-16 Thread Sergey Beryozkin
It was not what I was suggesting. My only proposal was about having a
simple API (without an attempt to cover all the various format specific
options at the API level) which would let Tika users quickly create format
specific content without having to deal with the format specific libraries,
exactly consistent with what it does on the read side.
I appreciate it can require some effort and I'm by no means pushing for it.

Sergey

On Wed, Oct 16, 2019 at 4:50 PM Ken Krugler  wrote:

> I can see the attraction of one API to convert XHTML to various formats.
>
> Though very quickly that simple API would become complex, as each target
> format has its own conversion options.
>
> And if successful, we’d pull in even more 3rd party jars to handle that
> conversion.
>
> Wonder if there’s a need for a new project called “Akit”, which focuses on
> XHTML -> various formats :)
>
> — Ken
>
> > On Oct 16, 2019, at 5:05 AM, Sergey Beryozkin 
> wrote:
> >
> > Ken, thanks for the feedback, I meant to reply to your comments,
> >
> > I suppose I really meant Tika offering a uniform API to create some
> simple
> > structured PDF/etc files.
> > ContentCreator creator = ContentCreator.get("PDF");
> > creator.addTitle("Introduction to Tika");
> > creator.addText("");
> > creator.addTable("tablename", new LinkedHashMap<>());
> > creator.addAttachment(someImage);
> > creator.complete();
> >
> > It would be consistent with the Tika approach on the read side.
> >
> > Cheers, Sergey
> > On Mon, Oct 14, 2019 at 4:13 PM Ken Krugler  wrote:
> >
> >> If you’re suggesting ways to make it easier to use something like
> >> YaHPConverter with Tika, definitely yes.
> >>
> >> If you’re talking about integrating this functionality…my personal view
> is
> >> no.
> >>
> >> I think Tika should focus on extracting content from documents, versus
> >> format transformations.
> >>
> >> Tika is an attractive location for functionality like this, since it
> sits
> >> in the middle of a lot of data processing pipelines, but I worry about a
> >> bloated code base, with corresponding challenges in maintenance and
> support.
> >>
> >> Regards,
> >>
> >> — Ken
> >>
> >>
> >>> On Oct 14, 2019, at 4:38 AM, Sergey Beryozkin 
> >> wrote:
> >>>
> >>> Hi All
> >>>
> >>> I've seen a Quarkus user asking how to convert to PDF, and one of my
> >>> colleagues pointed to
> >>>
> >>
> http://www.allcolor.org/YaHPConverter/doc/org/allcolor/yahp/converter/IHtmlToPdfTransformer.html
> >>>
> >>> Does it make sense for Tika to offer something related to the text to
> PDF
> >>> (for a start, something on top of that transformer), and then may be
> even
> >>> for other formats ?
> >>>
> >>> Sergey
> >>
> >> --
> >> Ken Krugler
> >> http://www.scaleunlimited.com
> >> custom big data solutions & training
> >> Hadoop, Cascading, Cassandra & Solr
> >>
> >>
>
> --
> Ken Krugler
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr
>
>


Re: HTML to PDF conversion

2019-10-16 Thread Sergey Beryozkin
Hi Dave

Thanks, I was suggesting a more neutral approach

Cheers, Sergey

On Wed, Oct 16, 2019 at 3:50 PM Dave Fisher  wrote:

> Hi -
>
> You may want to take a look at Apache FOP which is part of the Apache XML
> Graphics project. My team had success with that in generating PDF from XML.
>
> Regards,
> Dave
>
> > On Oct 16, 2019, at 5:05 AM, Sergey Beryozkin 
> wrote:
> >
> > Ken, thanks for the feedback, I meant to reply to your comments,
> >
> > I suppose I really meant Tika offering a uniform API to create some
> simple
> > structured PDF/etc files.
> > ContentCreator creator = ContentCreator.get("PDF");
> > creator.addTitle("Introduction to Tika");
> > creator.addText("");
> > creator.addTable("tablename", new LinkedHashMap<>());
> > creator.addAttachment(someImage);
> > creator.complete();
> >
> > It would be consistent with the Tika approach on the read side.
> >
> > Cheers, Sergey
> > On Mon, Oct 14, 2019 at 4:13 PM Ken Krugler  wrote:
> >
> >> If you’re suggesting ways to make it easier to use something like
> >> YaHPConverter with Tika, definitely yes.
> >>
> >> If you’re talking about integrating this functionality…my personal view
> is
> >> no.
> >>
> >> I think Tika should focus on extracting content from documents, versus
> >> format transformations.
> >>
> >> Tika is an attractive location for functionality like this, since it
> sits
> >> in the middle of a lot of data processing pipelines, but I worry about a
> >> bloated code base, with corresponding challenges in maintenance and
> support.
> >>
> >> Regards,
> >>
> >> — Ken
> >>
> >>
> >>> On Oct 14, 2019, at 4:38 AM, Sergey Beryozkin 
> >> wrote:
> >>>
> >>> Hi All
> >>>
> >>> I've seen a Quarkus user asking how to convert to PDF, and one of my
> >>> colleagues pointed to
> >>>
> >>
> http://www.allcolor.org/YaHPConverter/doc/org/allcolor/yahp/converter/IHtmlToPdfTransformer.html
> >>>
> >>> Does it make sense for Tika to offer something related to the text to
> PDF
> >>> (for a start, something on top of that transformer), and then may be
> even
> >>> for other formats ?
> >>>
> >>> Sergey
> >>
> >> --
> >> Ken Krugler
> >> http://www.scaleunlimited.com
> >> custom big data solutions & training
> >> Hadoop, Cascading, Cassandra & Solr
> >>
> >>
>
>


Re: HTML to PDF conversion

2019-10-16 Thread Sergey Beryozkin
Ken, thanks for the feedback, I meant to reply to your comments,

I suppose I really meant Tika offering a uniform API to create some simple
structured PDF/etc files.
ContentCreator creator = ContentCreator.get("PDF");
creator.addTitle("Introduction to Tika");
creator.addText("");
creator.addTable("tablename", new LinkedHashMap<>());
creator.addAttachment(someImage);
creator.complete();

It would be consistent with the Tika approach on the read side.
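
For illustration only, the snippet above could be backed by an interface along these lines. This is a minimal sketch under stated assumptions, not an existing Tika API: `ContentCreator` is the hypothetical interface from this message, while `PlainTextCreator`, the `"TEXT"` format name, the `Map<String, List<String>>` table type (the generic arguments were lost in the archived message) and the `String` return type of `complete()` are all invented for the example. `addAttachment` is left out to keep it short.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical write-side API; nothing like this ships with Apache Tika. */
interface ContentCreator {
    void addTitle(String title);
    void addText(String text);
    void addTable(String name, Map<String, List<String>> columns);
    String complete();

    /** Looks up an implementation by format name (only "TEXT" in this sketch). */
    static ContentCreator get(String format) {
        if ("TEXT".equalsIgnoreCase(format)) {
            return new PlainTextCreator();
        }
        throw new IllegalArgumentException("Unsupported format: " + format);
    }
}

/** Illustrative implementation that renders everything as plain text. */
final class PlainTextCreator implements ContentCreator {
    private final StringBuilder out = new StringBuilder();

    @Override
    public void addTitle(String title) {
        out.append("# ").append(title).append('\n');
    }

    @Override
    public void addText(String text) {
        out.append(text).append('\n');
    }

    @Override
    public void addTable(String name, Map<String, List<String>> columns) {
        out.append("[table ").append(name).append("]\n");
        // One output line per column: "header: value, value, ..."
        for (Map.Entry<String, List<String>> column : columns.entrySet()) {
            out.append(column.getKey()).append(": ")
               .append(String.join(", ", column.getValue())).append('\n');
        }
    }

    @Override
    public String complete() {
        return out.toString();
    }
}

final class ContentCreatorDemo {
    static String build() {
        Map<String, List<String>> columns = new LinkedHashMap<>();
        columns.put("format", Arrays.asList("PDF", "ODT"));

        ContentCreator creator = ContentCreator.get("TEXT");
        creator.addTitle("Introduction to Tika");
        creator.addText("Tika reads many formats through one API.");
        creator.addTable("formats", columns);
        return creator.complete();
    }

    public static void main(String[] args) {
        System.out.print(build());
    }
}
```

A PDF or ODT implementation would plug in behind the same `get(...)` lookup and delegate to a format specific library, so the calling code stays the same.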

Cheers, Sergey
On Mon, Oct 14, 2019 at 4:13 PM Ken Krugler  wrote:

> If you’re suggesting ways to make it easier to use something like
> YaHPConverter with Tika, definitely yes.
>
> If you’re talking about integrating this functionality…my personal view is
> no.
>
> I think Tika should focus on extracting content from documents, versus
> format transformations.
>
> Tika is an attractive location for functionality like this, since it sits
> in the middle of a lot of data processing pipelines, but I worry about a
> bloated code base, with corresponding challenges in maintenance and support.
>
> Regards,
>
> — Ken
>
>
> > On Oct 14, 2019, at 4:38 AM, Sergey Beryozkin 
> wrote:
> >
> > Hi All
> >
> > I've seen a Quarkus user asking how to convert to PDF, and one of my
> > colleagues pointed to
> >
> http://www.allcolor.org/YaHPConverter/doc/org/allcolor/yahp/converter/IHtmlToPdfTransformer.html
> >
> > Does it make sense for Tika to offer something related to the text to PDF
> > (for a start, something on top of that transformer), and then may be even
> > for other formats ?
> >
> > Sergey
>
> --
> Ken Krugler
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr
>
>


Re: HTML to PDF conversion

2019-10-14 Thread Sergey Beryozkin
Hi All

Thanks for the comments;
Simone, Tilman, thanks for the links :-), shared them with our users [1]

Cheers, Sergey

[1]
https://quarkusio.zulipchat.com/#narrow/stream/187030-users/topic/Generate.20pdf.20endpoint

On Mon, Oct 14, 2019 at 5:57 PM Tilman Hausherr 
wrote:

> Am 14.10.2019 um 13:39 schrieb Sergey Beryozkin:
> > or on top of PDFBox ?
>
>
> This project on top of PDFBox converts HTML to PDF:
>
> https://github.com/danfickle/openhtmltopdf
>
>
> Tilman
>
>
>
> >
> > On Mon, Oct 14, 2019 at 12:38 PM Sergey Beryozkin 
> > wrote:
> >
> >> Hi All
> >>
> >> I've seen a Quarkus user asking how to convert to PDF, and one of my
> >> colleagues pointed to
> >>
> >>
> http://www.allcolor.org/YaHPConverter/doc/org/allcolor/yahp/converter/IHtmlToPdfTransformer.html
> >>
> >> Does it make sense for Tika to offer something related to the text to
> PDF
> >> (for a start, something on top of that transformer), and then may be
> even
> >> for other formats ?
> >>
> >> Sergey
> >>
>
>


Re: HTML to PDF conversion

2019-10-14 Thread Sergey Beryozkin
or on top of PDFBox ?

On Mon, Oct 14, 2019 at 12:38 PM Sergey Beryozkin 
wrote:

> Hi All
>
> I've seen a Quarkus user asking how to convert to PDF, and one of my
> colleagues pointed to
>
> http://www.allcolor.org/YaHPConverter/doc/org/allcolor/yahp/converter/IHtmlToPdfTransformer.html
>
> Does it make sense for Tika to offer something related to the text to PDF
> (for a start, something on top of that transformer), and then may be even
> for other formats ?
>
> Sergey
>


HTML to PDF conversion

2019-10-14 Thread Sergey Beryozkin
Hi All

I've seen a Quarkus user asking how to convert to PDF, and one of my
colleagues pointed to
http://www.allcolor.org/YaHPConverter/doc/org/allcolor/yahp/converter/IHtmlToPdfTransformer.html

Does it make sense for Tika to offer something for text to PDF conversion
(for a start, something on top of that transformer), and then maybe even
for other formats?

Sergey


Re: ApacheCon Europe 2019 talks which are relevant to Apache Tika

2019-10-13 Thread Sergey Beryozkin
Hi Myrle, All

Is a presentation template with the ApacheCon EU/Berlin theme available?
I'm going to use some other template otherwise, but I recall my Tika
colleagues used to prepare such templates for other ASF conferences. Nick,
maybe you have it? :-)

Thanks, Sergey


On Fri, Oct 4, 2019 at 6:11 PM  wrote:

> Dear Apache Tika committers,
>
> In a little over 2 weeks time, ApacheCon Europe is taking place in
> Berlin. Join us from October 22 to 24 for an exciting program and lovely
> get-together of the Apache Community.
>
> We are also planning a hackathon.  If your project is interested in
> participating, please enter yourselves here:
> https://cwiki.apache.org/confluence/display/COMDEV/Hackathon
>
> The following talks should be especially relevant for you:
>
>   * https://aceu19.apachecon.com/session/apache-tika-goes-native-graalvm-and-quarkus
>   * https://aceu19.apachecon.com/session/patterns-and-anti-patterns-running-apache-bigdata-projects-kubernetes
>   * https://aceu19.apachecon.com/session/open-source-big-data-tools-accelerating-physics-research-cern
>   * https://aceu19.apachecon.com/session/ui-dev-big-data-world-using-open-source
>   * https://aceu19.apachecon.com/session/data-driven-aiml-solutions-apache-software
>   * https://aceu19.apachecon.com/session/apache-beam-running-big-data-pipelines-python-and-go-spark
>   * https://aceu19.apachecon.com/session/maintaining-java-library-light-new-java-release-train
>
>
> Furthermore there will be a whole conference track on community topics:
> Learn how to motivate users to contribute patches, how the board of
> directors works, how to navigate the Incubator and much more: ApacheCon
> Europe 2019 Community track
> <https://aceu19.apachecon.com/sessions?track=42>
>
> Tickets are available here  –
> for Apache Committers we offer discounted tickets.  Prices will be going
> up on October 7th, so book soon.
>
> Please also help spread the word and make ApacheCon Europe 2019 a success!
>
> We’re looking forward to welcoming you at #ACEU19!
>
> Best,
>
> Your ApacheCon team
>
>


Re: build failure in master

2019-09-20 Thread Sergey Beryozkin
Is it the message digest signatures of some PDF content? Maybe it has to do
with some MessageDigest enhancements in Java 11? I haven't found anything
specific, but this is one possible line of inquiry. Or maybe the default
hashCode() has changed, which can also affect the collection hashCode() and
the total digest too.
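
As a quick way to check the digest side in isolation, here is a minimal sketch. It assumes the test compares an MD5 hex digest of the extracted image bytes (the 32 hex character values in the failure output are consistent with MD5, but the test source is not shown in this thread); `Md5Hex` is a name made up for the example. A digest computed this way depends only on the input bytes, so if the same bytes hash identically under Java 8 and Java 11, the difference would have to be in the extracted bytes rather than in MessageDigest.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Illustrative helper: lower-case hex MD5 of a byte array. */
final class Md5Hex {
    static String md5Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(data);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // Mask to an unsigned value so negative bytes print as two hex digits.
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform implementation.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        // Run under both JDKs and compare the printed value.
        System.out.println(md5Hex("abc".getBytes(StandardCharsets.UTF_8)));
    }
}
```

Running the same class under both JDKs against the bytes of the extracted image would show whether the inputs or the digest differ.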

Cheers, Sergey

On Fri, Sep 20, 2019 at 1:43 PM Tim Allison  wrote:

> PDFBox Colleagues,
>   Do you know of any diffs between Java 8 and 11 that would affect the
> extraction of images from PDFs?  Dan is getting a build failure
> because of a hash mismatch.
>   Thank you.
>
>Best,
>
>Tim
>
> On Fri, Sep 20, 2019 at 8:39 AM Dan Becker  wrote:
> >
> > When one installs "sudo apt install -y default-jre" on Ubuntu 16.04, the
> > Tika build will be successful with the following tools:
> > vagrant@ubuntu-xenial:~/tika$ javac -version
> > javac 1.8.0_222
> > vagrant@ubuntu-xenial:~/tika$ java -version
> > openjdk version "1.8.0_222"
> >
> > When one installs "sudo apt install -y default-jre" on Ubuntu 18.04, the
> > Tika build will FAIL with the following tools:
> > vagrant@ubuntu-bionic:~/tika$ javac -version
> > javac 11.0.4
> > vagrant@ubuntu-bionic:~/tika$ java -version
> > openjdk version "11.0.4" 2019-07-16
> >
> > When one installs "sudo apt install -y openjdk-8-jdk" on Ubuntu 18.04,
> the
> > Tika build will FAIL with the following tools (Note the compiler has been
> > changed to jdk 8, but java has not):
> > vagrant@ubuntu-bionic:~/tika$ javac -version
> > javac 1.8.0_222
> > vagrant@ubuntu-bionic:~/tika$ java -version
> > openjdk version "11.0.4" 2019-07-16
> >
> > If you switch the Java version with "echo 2 | sudo update-alternatives
> >  --config java" (Note "2" works for a clean bionic vagrant VM, but YMMV),
> > then the Tika build will be successful with the following tools:
> > vagrant@ubuntu-bionic:~/tika$ javac -version
> > javac 1.8.0_222
> > vagrant@ubuntu-bionic:~/tika$ java -version
> > openjdk version "1.8.0_222"
> >
> > I suspect that there is some new issue with running Tika under Open JDK
> > 11.0.4. I will continue to look for the root cause of that next week.
> >
> >
> > Dan
> > C: 301-524-8899
> >
> >
> > On Thu, Sep 19, 2019 at 1:00 PM Dan Becker  wrote:
> >
> > > It is a clean checkout and build with no local changes.
> > >
> > > I tested against the stock Ubuntu 16.04, and all the tests passed.  The
> > > only difference with the command sequence listed in the first email is
> > > "vagrant init ubuntu/xenial64".
> > >
> > > I retested Ubuntu 18.04 on a different host (different version of
> vagrant,
> > > etc), and I got the same error, so the problem does "repro".
> > >
> > > I will try to debug it further to determine the root cause.
> > >
> > > Dan
> > > C: 301-524-8899
> > >
> > >
> > > On Thu, Sep 19, 2019 at 5:07 AM Nick Burch 
> wrote:
> > >
> > >> On Wed, 18 Sep 2019, Dan Becker wrote:
> > >> > I am trying to build the master branch from Ubuntu 18.04, but I am
> > >> getting
> > >> > the following error:
> > >> >
> > >> > [ERROR] Tests run: 11, Failures: 1, Errors: 0, Skipped: 0, Time
> elapsed:
> > >> > 1.409 s <<< FAILURE! - in
> org.apache.tika.server.UnpackerResourceTest
> > >> > [ERROR] testPDFImages(org.apache.tika.server.UnpackerResourceTest)
> Time
> > >> > elapsed: 0.366 s  <<< FAILURE!
> > >> > org.junit.ComparisonFailure:
> > >> expected:<[7c2f14acbb737672a1245f4ceb50622a]>
> > >> > but was:<[58b8269d1a584b7e8c1adcb936123923]>
> > >> >at
> > >> >
> > >>
> org.apache.tika.server.UnpackerResourceTest.testPDFImages(UnpackerResourceTest.java:208)
> > >>
> > >> Have you made any local changes first? Anything that might've been
> merged
> > >> in locally?
> > >>
> > >> I'm building on Ubuntu 18.04 with Java 11, and the build completes
> fine
> > >> for me with no errors. Pretty sure some/most of our build servers are
> > >> Ubuntu too. So, not sure what's wrong for you...
> > >>
> > >> Nick
> > >>
> > >
>


[jira] [Updated] (TIKA-2945) AutoDetectParser should skip the content type detection if Metadata already has it

2019-09-13 Thread Sergey Beryozkin (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-2945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Beryozkin updated TIKA-2945:
---
Summary: AutoDetectParser should skip the content type detection if 
Metadata already has it  (was: AutoDetectParser should skip the conetnt type 
detection if Metadata already has it)

> AutoDetectParser should skip the content type detection if Metadata already 
> has it
> --
>
> Key: TIKA-2945
> URL: https://issues.apache.org/jira/browse/TIKA-2945
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Sergey Beryozkin
>    Assignee: Sergey Beryozkin
>Priority: Minor
> Fix For: 2.0.0, 1.23
>
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (TIKA-2943) Modularize tika-parsers

2019-09-13 Thread Sergey Beryozkin (Jira)
Sergey Beryozkin created TIKA-2943:
--

 Summary: Modularize tika-parsers
 Key: TIKA-2943
 URL: https://issues.apache.org/jira/browse/TIKA-2943
 Project: Tika
  Issue Type: Improvement
Reporter: Sergey Beryozkin
Assignee: Sergey Beryozkin
 Fix For: 2.0.0


This effort will be based on the work done by Bob at the [2.x 
branch|https://github.com/apache/tika/tree/2.x/tika-parser-modules] 





[jira] [Updated] (TIKA-2944) TikaConfig should support the parameters without XML type attribute

2019-09-13 Thread Sergey Beryozkin (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-2944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Beryozkin updated TIKA-2944:
---
Summary: TikaConfig should support the parameters without XML type 
attribute  (was: TikaConfig should support the parameters with the XML type 
attribute)

> TikaConfig should support the parameters without XML type attribute
> ---
>
> Key: TIKA-2944
> URL: https://issues.apache.org/jira/browse/TIKA-2944
> Project: Tika
>  Issue Type: Improvement
>  Components: config
>    Reporter: Sergey Beryozkin
>Assignee: Sergey Beryozkin
>Priority: Major
> Fix For: 2.0.0, 1.23
>
>






[jira] [Created] (TIKA-2945) AutoDetectParser should skip the conetnt type detection if Metadata already has it

2019-09-13 Thread Sergey Beryozkin (Jira)
Sergey Beryozkin created TIKA-2945:
--

 Summary: AutoDetectParser should skip the conetnt type detection 
if Metadata already has it
 Key: TIKA-2945
 URL: https://issues.apache.org/jira/browse/TIKA-2945
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Sergey Beryozkin
Assignee: Sergey Beryozkin
 Fix For: 2.0.0, 1.23








[jira] [Created] (TIKA-2944) TikaConfig should support the parameters with the XML type attribute

2019-09-13 Thread Sergey Beryozkin (Jira)
Sergey Beryozkin created TIKA-2944:
--

 Summary: TikaConfig should support the parameters with the XML 
type attribute
 Key: TIKA-2944
 URL: https://issues.apache.org/jira/browse/TIKA-2944
 Project: Tika
  Issue Type: Improvement
  Components: config
Reporter: Sergey Beryozkin
Assignee: Sergey Beryozkin
 Fix For: 2.0.0, 1.23








[jira] [Commented] (TIKA-2882) Parsers should not include HTTP client code

2019-09-13 Thread Sergey Beryozkin (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16929130#comment-16929130
 ] 

Sergey Beryozkin commented on TIKA-2882:


I'll create a dedicated issue so that I can link to it from the other 
sources, etc

> Parsers should not include HTTP client code
> ---
>
> Key: TIKA-2882
> URL: https://issues.apache.org/jira/browse/TIKA-2882
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.21
>Reporter: Jonathan Essex
>    Assignee: Sergey Beryozkin
>Priority: Major
>
> Folks, does it really make sense for a parser to have a REST client built in?
> The GROBID and NLTKNERecogniser parsers use the apache CXF client directly. 
>  
> Since I don't use CXF and my entire app is built on a different JAX-RS stack 
> this just dropped me straight into dependency hell.
> Surely it would make more sense to keep the parsers... well, parsers... and 
> build support for delegating parsing to other services into some higher level 
> in the stack (such as the server, where the CXF dependency is more benign). 
>  





[jira] [Commented] (TIKA-2882) Parsers should not include HTTP client code

2019-08-16 Thread Sergey Beryozkin (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16909117#comment-16909117
 ] 

Sergey Beryozkin commented on TIKA-2882:


OK, I've assigned to myself. Well, now that I'll have to do it, I have to say, 
my priority is to prepare to the Apache Con EU Tika talk well, but I'll give a 
try (and copy Bob's work from the 2.x branch :-) ) asap. Thanks  

> Parsers should not include HTTP client code
> ---
>
> Key: TIKA-2882
> URL: https://issues.apache.org/jira/browse/TIKA-2882
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.21
>Reporter: Jonathan Essex
>    Assignee: Sergey Beryozkin
>Priority: Major
>
> Folks, does it really make sense for a parser to have a REST client built in?
> The GROBID and NLTKNERecogniser parsers use the apache CXF client directly. 
>  
> Since I don't use CXF and my entire app is built on a different JAX-RS stack 
> this just dropped me straight into dependency hell.
> Surely it would make more sense to keep the parsers... well, parsers... and 
> build support for delegating parsing to other services into some higher level 
> in the stack (such as the server, where the CXF dependency is more benign). 
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Comment Edited] (TIKA-2882) Parsers should not include HTTP client code

2019-08-16 Thread Sergey Beryozkin (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16909117#comment-16909117
 ] 

Sergey Beryozkin edited comment on TIKA-2882 at 8/16/19 3:11 PM:
-

OK, I've assigned to myself. Well, now that I'll have to do it, I have to say, 
my priority is to prepare to the Apache Con EU Tika talk well, but I'll give it 
a try (and copy Bob's work from the 2.x branch :-) ) asap. Thanks  


was (Author: sergey_beryozkin):
OK, I've assigned to myself. Well, now that I'll have to do it, I have to say, 
my priority is to prepare to the Apache Con EU Tika talk well, but I'll give a 
try (and copy Bob's work from the 2.x branch :-) ) asap. Thanks  

> Parsers should not include HTTP client code
> ---
>
> Key: TIKA-2882
> URL: https://issues.apache.org/jira/browse/TIKA-2882
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.21
>Reporter: Jonathan Essex
>    Assignee: Sergey Beryozkin
>Priority: Major
>
> Folks, does it really make sense for a parser to have a REST client built in?
> The GROBID and NLTKNERecogniser parsers use the apache CXF client directly. 
>  
> Since I don't use CXF and my entire app is built on a different JAX-RS stack 
> this just dropped me straight into dependency hell.
> Surely it would make more sense to keep the parsers... well, parsers... and 
> build support for delegating parsing to other services into some higher level 
> in the stack (such as the server, where the CXF dependency is more benign). 
>  





[jira] [Assigned] (TIKA-2882) Parsers should not include HTTP client code

2019-08-16 Thread Sergey Beryozkin (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Beryozkin reassigned TIKA-2882:
--

Assignee: Sergey Beryozkin

> Parsers should not include HTTP client code
> ---
>
> Key: TIKA-2882
> URL: https://issues.apache.org/jira/browse/TIKA-2882
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.21
>Reporter: Jonathan Essex
>    Assignee: Sergey Beryozkin
>Priority: Major
>
> Folks, does it really make sense for a parser to have a REST client built in?
> The GROBID and NLTKNERecogniser parsers use the apache CXF client directly. 
>  
> Since I don't use CXF and my entire app is built on a different JAX-RS stack 
> this just dropped me straight into dependency hell.
> Surely it would make more sense to keep the parsers... well, parsers... and 
> build support for delegating parsing to other services into some higher level 
> in the stack (such as the server, where the CXF dependency is more benign). 
>  





[jira] [Commented] (TIKA-2882) Parsers should not include HTTP client code

2019-08-16 Thread Sergey Beryozkin (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16909101#comment-16909101
 ] 

Sergey Beryozkin commented on TIKA-2882:


Hi Tim

Can we consider giving it a go? Bob agrees to focus on the modules only, so
all we have to do to get it started is to create a few modules grouping the
specific parsers, and have the existing tika-parsers incorporate those new
modules. There should not even be any coding involved unless I'm missing
something. If you can create a quick PR, I can then test it with Quarkus, etc.
What do you think?

> Parsers should not include HTTP client code
> ---
>
> Key: TIKA-2882
> URL: https://issues.apache.org/jira/browse/TIKA-2882
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.21
>Reporter: Jonathan Essex
>Priority: Major
>
> Folks, does it really make sense for a parser to have a REST client built in?
> The GROBID and NLTKNERecogniser parsers use the apache CXF client directly. 
>  
> Since I don't use CXF and my entire app is built on a different JAX-RS stack 
> this just dropped me straight into dependency hell.
> Surely it would make more sense to keep the parsers... well, parsers... and 
> build support for delegating parsing to other services into some higher level 
> in the stack (such as the server, where the CXF dependency is more benign). 
>  





Re: Quarkus integration

2019-08-15 Thread Sergey Beryozkin
If someone from the large Tika team can give that extension a try, whenever
time allows, it would be super; it will help me improve that extension. If
you do decide to try, please post the feedback to
https://groups.google.com/forum/#!forum/quarkus-dev
or, if it fails miserably for your documents, maybe here first :-)
Cheers, Sergey

On Thu, Aug 15, 2019 at 3:15 PM Sergey Beryozkin 
wrote:

> Hi,
> The initial documentation is here:
> https://quarkus.io/guides/tika-guide
>
> Lots more to come over time, and we have already had users trying it (not
> many but hope to see more feedback from them soon)
> Sergey
>
> On Fri, May 10, 2019 at 6:04 PM Sergey Beryozkin 
> wrote:
>
>> I've managed to get the PDFParser running in the native mode, but I had
>> to delay the initialization of
>> org.apache.pdfbox.pdmodel.font.PDType1Font. This class has static
>> PDType1Font instances, one of them leading to
>> org.apache.fontbox.ttf.RAFDataStream, which opens a file handle, so Graal
>> cannot convert it to native code at build time; one needs to delay
>> the initialization of PDType1Font until run time.
>>
>> If we start from the PDF parser, the call path to RAFDataStream starts
>> from:
>>
>>
>> org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults(PDAcroForm.java:106)
>>  at
>> org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.<init>(PDAcroForm.java:93)
>>  at
>> org.apache.pdfbox.pdmodel.PDDocumentCatalog.getAcroForm(PDDocumentCatalog.java:108)
>>
>> org.apache.tika.parser.pdf.PDFParser.handleXFAOnly(PDFParser.java:534)
>>
>> I guess I may need to create a PR for PDFBox where RAFDataStream opens a
>> stream lazily, with a check like ensureOpen() being added to its read
>> methods...
>>
>> Sergey
>>
>> On Fri, May 3, 2019 at 1:22 PM Sergey Beryozkin 
>> wrote:
>>
>>> Yes, please add 'sergeyb', I've just assigned myself a CXF issue as
>>> 'sergeyb'. Sorry about these multiple ids, but indeed I'll try to keep
>>> using a single one.
>>>
>>> Thanks, Sergey
>>>
>>>
>>>
>>> On Fri, May 3, 2019 at 12:13 PM Tim Allison  wrote:
>>>
>>>> I can add 'sergeyb' if you'd prefer!
>>>>
>>>> On Fri, May 3, 2019 at 5:43 AM Sergey Beryozkin 
>>>> wrote:
>>>> >
>>>> > Though I might need to settle on the 'sergeyb' eventually since it is
>>>> my
>>>> > apache committer id.
>>>> > Thanks...
>>>> >
>>>> > On Fri, May 3, 2019 at 10:29 AM Sergey Beryozkin <
>>>> sberyoz...@gmail.com>
>>>> > wrote:
>>>> >
>>>> > > Oh, I forgot I had a 'sergey_beryozkin' id as well, this is not
>>>> good,
>>>> > > shows how long ago I did contribute :-) (did try sergey.beryozkin
>>>> though).
>>>> > >
>>>> > > Thanks for checking it, I've just assigned this issue to myself.
>>>> > > Cheers, Sergey
>>>> > >
>>>> > >
>>>> > > On Thu, May 2, 2019 at 6:08 PM Sergey Beryozkin <
>>>> sberyoz...@gmail.com>
>>>> > > wrote:
>>>> > >
>>>> > >> Hi Tim
>>>> > >>
>>>> > >> I can't assign
>>>> > >> https://issues.apache.org/jira/browse/TIKA-2862
>>>> > >>
>>>> > >> to myself, I used to be able to assign, I know I had some time
>>>> away from
>>>> > >> Tika, but I'm keen to return with few contributions :-)
>>>> > >> Please update my record for me to be able to assign the issues
>>>> again
>>>> > >>
>>>> > >> Cheers, Sergey
>>>> > >>
>>>> > >> On Tue, Apr 30, 2019 at 6:22 PM Sergey Beryozkin <
>>>> sberyoz...@gmail.com>
>>>> > >> wrote:
>>>> > >>
>>>> > >>> Hi Tim, All
>>>> > >>>
>>>> > >>> I've started working on integrating Tika with Quarkus [1]. The
>>>> main idea
>>>> > >>> is to be able to use Tika in the native image mode.
>>>> > >>> It's quite likely I'll start creating the PRs soon, to get the
>>>> native
>>>> > >>> image related issues resolved, these are related to some libraries
>>>> > >>> statically initializing FileDescriptors, etc.
>>>> > >>>
>>>> > >>> Thanks, Sergey
>>>> > >>>
>>>> > >>> [1]
>>>> > >>>
>>>> https://github.com/sberyozkin/quarkus/tree/tika_extension/extensions/tika
>>>> > >>> [2]
>>>> > >>>
>>>> https://github.com/sberyozkin/quarkus-quickstarts/tree/tika/getting-started-tika
>>>> > >>>
>>>> > >>>
>>>>
>>>


Re: Quarkus integration

2019-08-15 Thread Sergey Beryozkin
Hi,
The initial documentation is here:
https://quarkus.io/guides/tika-guide

Lots more to come over time, and we have already had users trying it (not
many but hope to see more feedback from them soon)
Sergey

On Fri, May 10, 2019 at 6:04 PM Sergey Beryozkin 
wrote:

> I've managed to get the PDFParser running in the native mode, but I had to
> delay the initialization of
> org.apache.pdfbox.pdmodel.font.PDType1Font. This class has static
> PDType1Font instances, one of them leading to
> org.apache.fontbox.ttf.RAFDataStream, which opens a file handle, so Graal
> cannot convert it to native code at build time; one needs to delay
> the initialization of PDType1Font until run time.
>
> If we start from the PDF parser, the call path to RAFDataStream starts
> from:
>
>
> org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults(PDAcroForm.java:106)
>  at
> org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.<init>(PDAcroForm.java:93)
>  at
> org.apache.pdfbox.pdmodel.PDDocumentCatalog.getAcroForm(PDDocumentCatalog.java:108)
>  org.apache.tika.parser.pdf.PDFParser.handleXFAOnly(PDFParser.java:534)
>
> I guess I may need to create a PR for PDFBox where RAFDataStream opens a
> stream lazily, with a check like ensureOpen() being added to its read
> methods...
>
> Sergey
>
> On Fri, May 3, 2019 at 1:22 PM Sergey Beryozkin 
> wrote:
>
>> Yes, please add 'sergeyb', I've just assigned myself a CXF issue as
>> 'sergeyb'. Sorry about these multiple ids, but indeed I'll try to keep
>> using a single one.
>>
>> Thanks, Sergey
>>
>>
>>
>> On Fri, May 3, 2019 at 12:13 PM Tim Allison  wrote:
>>
>>> I can add 'sergeyb' if you'd prefer!
>>>
>>> On Fri, May 3, 2019 at 5:43 AM Sergey Beryozkin 
>>> wrote:
>>> >
>>> > Though I might need to settle on the 'sergeyb' eventually since it is
>>> my
>>> > apache committer id.
>>> > Thanks...
>>> >
>>> > On Fri, May 3, 2019 at 10:29 AM Sergey Beryozkin
>>> > wrote:
>>> >
>>> > > Oh, I forgot I had a 'sergey_beryozkin' id as well, this is not good,
>>> > > shows how long ago I did contribute :-) (did try sergey.beryozkin
>>> though).
>>> > >
>>> > > Thanks for checking it, I've just assigned this issue to myself.
>>> > > Cheers, Sergey
>>> > >
>>> > >
>>> > > On Thu, May 2, 2019 at 6:08 PM Sergey Beryozkin <
>>> sberyoz...@gmail.com>
>>> > > wrote:
>>> > >
>>> > >> Hi Tim
>>> > >>
>>> > >> I can't assign
>>> > >> https://issues.apache.org/jira/browse/TIKA-2862
>>> > >>
>>> > >> to myself, I used to be able to assign, I know I had some time away
>>> from
>>> > >> Tika, but I'm keen to return with few contributions :-)
>>> > >> Please update my record for me to be able to assign the issues again
>>> > >>
>>> > >> Cheers, Sergey
>>> > >>
>>> > >> On Tue, Apr 30, 2019 at 6:22 PM Sergey Beryozkin <
>>> sberyoz...@gmail.com>
>>> > >> wrote:
>>> > >>
>>> > >>> Hi Tim, All
>>> > >>>
>>> > >>> I've started working on integrating Tika with Quarkus [1]. The
>>> main idea
>>> > >>> is to be able to use Tika in the native image mode.
>>> > >>> It's quite likely I'll start creating the PRs soon, to get the
>>> native
>>> > >>> image related issues resolved, these are related to some libraries
>>> > >>> statically initializing FileDescriptors, etc.
>>> > >>>
>>> > >>> Thanks, Sergey
>>> > >>>
>>> > >>> [1]
>>> > >>>
>>> https://github.com/sberyozkin/quarkus/tree/tika_extension/extensions/tika
>>> > >>> [2]
>>> > >>>
>>> https://github.com/sberyozkin/quarkus-quickstarts/tree/tika/getting-started-tika
>>> > >>>
>>> > >>>
>>>
>>
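
The lazy-open idea from the quoted message (a stream that opens its
underlying resource only on first read, guarded by an ensureOpen() check)
can be sketched as follows. This is a minimal illustration, not PDFBox code:
LazyDataStream and its Supplier-based opener are hypothetical names, and the
real RAFDataStream has a different API.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.function.Supplier;

// Hypothetical stand-in for a data stream that must not open its
// underlying resource when the object is constructed.
final class LazyDataStream {
    private final Supplier<InputStream> opener; // how to open the resource
    private InputStream in;                     // stays null until first read

    LazyDataStream(Supplier<InputStream> opener) {
        this.opener = opener;                   // nothing is opened here
    }

    boolean isOpen() {
        return in != null;
    }

    // Opens the underlying stream on first use only.
    private void ensureOpen() {
        if (in == null) {
            in = opener.get();
        }
    }

    // Reads one byte, opening the stream first if needed.
    int read() {
        ensureOpen();
        try {
            return in.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Closes the stream only if it was ever opened.
    void close() {
        if (in == null) {
            return;
        }
        try {
            in.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this shape, constructing the object (for example from a static
initializer) creates no file handle; the handle appears only when read() is
first called.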


[jira] [Commented] (TIKA-2910) Text extraction using Tika command line and Tika server differs

2019-08-12 Thread Sergey Beryozkin (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16905636#comment-16905636
 ] 

Sergey Beryozkin commented on TIKA-2910:


Hi [~talli...@apache.org], IMHO it should be fixed in the 1.x branch as well,
maybe with a property letting users enable or disable this fix at
runtime

> Text extraction using Tika command line and Tika server differs
> ---
>
> Key: TIKA-2910
> URL: https://issues.apache.org/jira/browse/TIKA-2910
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.21
>Reporter: Walter
>Priority: Major
>  Labels: newbie
> Attachments: CorpusP_25471990.xml
>
>
> When extracting TXT from the very same XML file using either Tika command 
> line utility or the Tika in server mode, the results differ.
> It looks as if PCDATA in deeper nested XML structures are just ignored and 
> only an empty line is returned.
> I assume both use the same base code. Are there any default settings that may 
> differ or can be set?
>  





Content-Type and AutoDetectParser

2019-08-09 Thread Sergey Beryozkin
Hi Tim, All

I'd like to do a small PR for the AutoDetectParser which will check the
Metadata for the Content-Type before trying to detect it itself, since on
the JAX-RS path the content type is often already known.
Does it make sense, or do you see some possible side effects?

Sergey
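
The proposed check could look roughly like the sketch below. It is
deliberately dependency-free: a plain Map stands in for Tika's Metadata and a
Function stands in for the detector, and ContentTypeResolver is a
hypothetical name, not an existing Tika class.

```java
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper; Map and Function are stand-ins for Tika's
// Metadata and detector so that the sketch compiles on its own.
final class ContentTypeResolver {
    static final String CONTENT_TYPE = "Content-Type";

    // Returns the declared content type if one is present, otherwise
    // falls back to detecting it from the content bytes.
    static String resolve(Map<String, String> metadata,
                          byte[] content,
                          Function<byte[], String> detector) {
        String declared = metadata.get(CONTENT_TYPE);
        if (declared != null && !declared.trim().isEmpty()) {
            return declared.trim();   // trust the caller-supplied type
        }
        return detector.apply(content);
    }
}
```

One trade-off of this ordering is that a wrong declared type is trusted and
detection never gets a chance to correct it.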


[jira] [Commented] (TIKA-2910) Text extraction using Tika command line and Tika server differs

2019-08-08 Thread Sergey Beryozkin (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16902845#comment-16902845
 ] 

Sergey Beryozkin commented on TIKA-2910:


Thanks, Tim may be away so let's wait till he is back

> Text extraction using Tika command line and Tika server differs
> ---
>
> Key: TIKA-2910
> URL: https://issues.apache.org/jira/browse/TIKA-2910
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.21
>Reporter: Walter
>Priority: Major
>  Labels: newbie
> Attachments: CorpusP_25471990.xml
>
>
> When extracting TXT from the very same XML file using either Tika command 
> line utility or the Tika in server mode, the results differ.
> It looks as if PCDATA in deeper nested XML structures are just ignored and 
> only an empty line is returned.
> I assume both use the same base code. Are there any default settings that may 
> differ or can be set?
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (TIKA-2910) Text extraction using Tika command line and Tika server differs

2019-08-08 Thread Sergey Beryozkin (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16902836#comment-16902836
 ] 

Sergey Beryozkin commented on TIKA-2910:


Hi [~akit] Can you please download the source and debug ? It can help

> Text extraction using Tika command line and Tika server differs
> ---
>
> Key: TIKA-2910
> URL: https://issues.apache.org/jira/browse/TIKA-2910
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.21
>Reporter: Walter
>Priority: Major
>  Labels: newbie
> Attachments: CorpusP_25471990.xml
>
>
> When extracting TXT from the very same XML file using either Tika command 
> line utility or the Tika in server mode, the results differ.
> It looks as if PCDATA in deeper nested XML structures are just ignored and 
> only an empty line is returned.
> I assume both use the same base code. Are there any default settings that may 
> differ or can be set?
>  





Better Tika performance with Quarkus Tika ?

2019-07-11 Thread Sergey Beryozkin
Hi Tim, All,

A simple Quarkus Tika extension I have been working on has made it into the
latest Quarkus 0.19.1 release:
https://github.com/quarkusio/quarkus/tree/0.19.1/extensions/tika

Happy about it and intend to grow it further. It is pretty basic at the
moment (AutoDetectParser + ToTextContentHandler). But what can be of
interest is that it can be run in the GraalVM native image (so it will be
all just a native executable AFAIK).
The basic test shows how to use it:

https://github.com/quarkusio/quarkus/tree/master/integration-tests/tika
(Simple PDF, ODT and text files are checked)

I'll work on a demo in the next couple of months, and I will present
it at ApacheCon EU in Oct. But if someone is interested in trying to run this
extension in the native mode, please let me know and I'll try to prepare
some simple instructions on how to quickly try it with some massive PDF, etc.

Cheers, Sergey


Re: Merge flow

2019-07-10 Thread Sergey Beryozkin
Thanks, I was just curious, hope to start contributing a bit more...

Sergey

On Wed, Jul 10, 2019 at 1:39 PM Tim Allison  wrote:

> Y.  Although sometimes I flip the order. :D
>
> If it matters or if I’m doing something wrong, let me know!
>
> On Wed, Jul 10, 2019 at 4:52 AM Sergey Beryozkin 
> wrote:
>
> > Hi Tim
> >
> > What is the current process for merging the fixes ? The fix goes to the
> > master first and then it is cherry-picked into the branch_1x ?
> >
> > Cheers, Sergey
> >
>


Merge flow

2019-07-10 Thread Sergey Beryozkin
Hi Tim

What is the current process for merging fixes? Does a fix go to
master first, and is it then cherry-picked into branch_1x?

Cheers, Sergey


Re: Tika 1.22?

2019-06-25 Thread Sergey Beryozkin
Sounds good

Thanks, Sergey

On Tue, Jun 25, 2019 at 3:45 PM Tim Allison  wrote:

> All,
>   The vote for the next version of PDFBox is under way.  I think we've
> had a number of useful upgrades since our last release.  Any
> objections to starting the release process for Tika 1.22 a week or so
> after we integrate PDFBox?
>
>  Cheers,
>
>   Tim
>


[jira] [Resolved] (TIKA-2896) NullPointerException in MimeTypesReader.releaseParser()

2019-06-18 Thread Sergey Beryozkin (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Beryozkin resolved TIKA-2896.

   Resolution: Fixed
Fix Version/s: 1.22

Thanks for the patch

> NullPointerException in MimeTypesReader.releaseParser()
> ---
>
> Key: TIKA-2896
> URL: https://issues.apache.org/jira/browse/TIKA-2896
> Project: Tika
>  Issue Type: Bug
>  Components: mime
>Affects Versions: 1.21
>Reporter: Eamonn Saunders
>Priority: Major
> Fix For: 1.22
>
>
> We have encountered a situation where the call to parser.reset() in the 
> following code snippet results in a NullPointerException.
> {code:java}
> private static void releaseParser(SAXParser parser) {
>     try {
>         parser.reset();
>     } catch (UnsupportedOperationException e) {
>         //ignore
>     }
> {code}
> releaseParser() is called in the finally block of MimeTypesReader.read()
> {code:java}
> public void read(InputStream stream) throws IOException, MimeTypeException {
>     SAXParser parser = null;
>     try {
>         parser = acquireSAXParser();
>         parser.parse(stream, this);
>     } catch (TikaException e) {
>         throw new MimeTypeException("Unable to create an XML parser", e);
>     } catch (SAXException e) {
>         throw new MimeTypeException("Invalid type configuration", e);
>     } finally {
>         releaseParser(parser);
>     }
> }
> {code}
> The parser variable will be null coming out of acquireSAXParser() if 
> acquireSAXParser() is called on a thread that is interrupted (i.e. the 
> InterruptedException is handled in the following code):
> {code:java}
> private static SAXParser acquireSAXParser()
>         throws TikaException {
>     while (true) {
>         SAXParser parser = null;
>         try {
>             READ_WRITE_LOCK.readLock().lock();
>             parser = SAX_PARSERS.poll(10, TimeUnit.MILLISECONDS);
>         } catch (InterruptedException e) {
>             throw new TikaException("interrupted while waiting for SAXParser", e);
>         } finally {
>             READ_WRITE_LOCK.readLock().unlock();
>         }
>         if (parser != null) {
>             return parser;
>         }
>     }
> }
> {code}
> A simple fix would be to check for null before calling releaseParser() in the 
> finally block.
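
A null guard of the kind suggested in the report could be sketched as below.
This is only an illustration and not necessarily the patch that was applied:
ParserRelease is a hypothetical class, and the guard is placed inside the
release method rather than at the call site in the finally block (either
position would avoid the NullPointerException).

```java
import javax.xml.parsers.SAXParser;

// Hypothetical holder class for the sketch; only the null guard matters.
final class ParserRelease {
    // Returns true if the parser was reset, false if there was nothing
    // to release or the parser does not support reset().
    static boolean releaseParser(SAXParser parser) {
        if (parser == null) {
            return false; // acquisition failed, e.g. the thread was interrupted
        }
        try {
            parser.reset();
            return true;
        } catch (UnsupportedOperationException e) {
            return false; // same "ignore" behaviour as the reported snippet
        }
    }
}
```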



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (TIKA-2882) Parsers should not include HTTP client code

2019-05-29 Thread Sergey Beryozkin (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16851032#comment-16851032
 ] 

Sergey Beryozkin edited comment on TIKA-2882 at 5/29/19 4:26 PM:
-

I will definitely support getting the tika-parser-modules idea in 2.0 
prioritized (please start a dev thread if you'd like, so that the categories 
and what goes where can be reviewed, etc). Maybe it is a simplistic view, but if 
we postpone the OSGI-ification aspects till later, then it is about moving 
specific parsers out of tika-parsers to the respective modules? Sorry if that 
is not the case :-). I can definitely help with testing, or with moving some 
parsers to a new module once I see how you do it for one of these modules :-) 


was (Author: sergey_beryozkin):
I will definitely support getting the tika-parser-modules idea in 2.0 
prioritized (please start a dev thread if you'd like, so that the categories 
can be reviewed what goes where etc). May be it is a simplistic view but if 
postpone the OSGI-ification aspects till later then it is about moving specific 
parsers out of tika-parsers to respective modules ? Sorry if it is not the case 
:-). I can definitely help with testing. Or moving some parsers to a new module 
once I see how you do it one of these modules :-) 

> Parsers should not include HTTP client code
> ---
>
> Key: TIKA-2882
> URL: https://issues.apache.org/jira/browse/TIKA-2882
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.21
>Reporter: Jonathan Essex
>Priority: Major
>
> Folks, does it really make sense for a parser to have a REST client built in?
> The GROBID and NLTKNERecogniser parsers use the apache CXF client directly. 
>  
> Since I don't use CXF and my entire app is built on a different JAX-RS stack 
> this just dropped me straight into dependency hell.
> Surely it would make more sense to keep the parsers... well, parsers... and 
> build support for delegating parsing to other services into some higher level 
> in the stack (such as the server, where the CXF dependency is more benign). 
>  





[jira] [Commented] (TIKA-2882) Parsers should not include HTTP client code

2019-05-29 Thread Sergey Beryozkin (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16851032#comment-16851032
 ] 

Sergey Beryozkin commented on TIKA-2882:


I will definitely support getting the tika-parser-modules idea in 2.0 
prioritized (please start a dev thread if you'd like, so that the categories 
and what goes where can be reviewed, etc). Maybe it is a simplistic view, but if 
we postpone the OSGI-ification aspects till later, then it is about moving 
specific parsers out of tika-parsers to the respective modules? Sorry if that 
is not the case :-). I can definitely help with testing, or with moving some 
parsers to a new module once I see how you do it for one of these modules :-) 

> Parsers should not include HTTP client code
> ---
>
> Key: TIKA-2882
> URL: https://issues.apache.org/jira/browse/TIKA-2882
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.21
>Reporter: Jonathan Essex
>Priority: Major
>
> Folks, does it really make sense for a parser to have a REST client built in?
> The GROBID and NLTKNERecogniser parsers use the apache CXF client directly. 
>  
> Since I don't use CXF and my entire app is built on a different JAX-RS stack 
> this just dropped me straight into dependency hell.
> Surely it would make more sense to keep the parsers... well, parsers... and 
> build support for delegating parsing to other services into some higher level 
> in the stack (such as the server, where the CXF dependency is more benign). 
>  





[jira] [Comment Edited] (TIKA-2882) Parsers should not include HTTP client code

2019-05-28 Thread Sergey Beryozkin (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849697#comment-16849697
 ] 

Sergey Beryozkin edited comment on TIKA-2882 at 5/28/19 1:39 PM:
-

I see, I was thinking of the 2.x branch :-)
Let's start with the https://github.com/apache/tika/tree/2.x/tika-parser-modules 
idea in 2.0 master?


was (Author: sergey_beryozkin):
I see, I was thinking of the 2.x branch :-)
Lets starts with the 
https://github.com/apache/tika/tree/2.x/tika-parser-modules idea in 2.0 master ?

> Parsers should not include HTTP client code
> ---
>
> Key: TIKA-2882
> URL: https://issues.apache.org/jira/browse/TIKA-2882
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.21
>Reporter: Jonathan Essex
>Priority: Major
>
> Folks, does it really make sense for a parser to have a REST client built in?
> The GROBID and NLTKNERecogniser parsers use the apache CXF client directly. 
>  
> Since I don't use CXF and my entire app is built on a different JAX-RS stack 
> this just dropped me straight into dependency hell.
> Surely it would make more sense to keep the parsers... well, parsers... and 
> build support for delegating parsing to other services into some higher level 
> in the stack (such as the server, where the CXF dependency is more benign). 
>  





[jira] [Commented] (TIKA-2882) Parsers should not include HTTP client code

2019-05-28 Thread Sergey Beryozkin (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849697#comment-16849697
 ] 

Sergey Beryozkin commented on TIKA-2882:


I see, I was thinking of the 2.x branch :-)
Let's start with the 
https://github.com/apache/tika/tree/2.x/tika-parser-modules idea in 2.0 master?

> Parsers should not include HTTP client code
> ---
>
> Key: TIKA-2882
> URL: https://issues.apache.org/jira/browse/TIKA-2882
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.21
>Reporter: Jonathan Essex
>Priority: Major
>
> Folks, does it really make sense for a parser to have a REST client built in?
> The GROBID and NLTKNERecogniser parsers use the apache CXF client directly. 
>  
> Since I don't use CXF and my entire app is built on a different JAX-RS stack 
> this just dropped me straight into dependency hell.
> Surely it would make more sense to keep the parsers... well, parsers... and 
> build support for delegating parsing to other services into some higher level 
> in the stack (such as the server, where the CXF dependency is more benign). 
>  





[jira] [Commented] (TIKA-2882) Parsers should not include HTTP client code

2019-05-28 Thread Sergey Beryozkin (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849593#comment-16849593
 ] 

Sergey Beryozkin commented on TIKA-2882:


[~talli...@apache.org], so as far as Tika 2.0 is concerned, would it make sense 
to start applying similar ideas in the 1.x line? Or to make the 2.0 branch 
master?

> Parsers should not include HTTP client code
> ---
>
> Key: TIKA-2882
> URL: https://issues.apache.org/jira/browse/TIKA-2882
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.21
>Reporter: Jonathan Essex
>Priority: Major
>
> Folks, does it really make sense for a parser to have a REST client built in?
> The GROBID and NLTKNERecogniser parsers use the apache CXF client directly. 
>  
> Since I don't use CXF and my entire app is built on a different JAX-RS stack 
> this just dropped me straight into dependency hell.
> Surely it would make more sense to keep the parsers... well, parsers... and 
> build support for delegating parsing to other services into some higher level 
> in the stack (such as the server, where the CXF dependency is more benign). 
>  





[jira] [Commented] (TIKA-2882) Parsers should not include HTTP client code

2019-05-27 Thread Sergey Beryozkin (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16848866#comment-16848866
 ] 

Sergey Beryozkin commented on TIKA-2882:


Oh, is it multipart? In that case maybe it has to be replaced with something 
neutral such as Apache HttpClient, or even manual multipart payload creation.
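
For the "manual multipart payload creation" option, a rough idea of how
little is involved: the sketch below assembles a single-part
multipart/form-data body by hand using only the JDK. MultipartBody and its
parameters are hypothetical names for illustration; the field name, file name
and boundary a real GROBID or NLTK endpoint expects are not stated in this
thread.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper that builds a single-part multipart/form-data body
// by hand, so no JAX-RS or HTTP client library is needed to create it.
final class MultipartBody {
    static final String CRLF = "\r\n";

    static byte[] build(String boundary, String fieldName,
                        String fileName, String contentType, byte[] data) {
        // Part headers, terminated by an empty line.
        String head = "--" + boundary + CRLF
                + "Content-Disposition: form-data; name=\"" + fieldName
                + "\"; filename=\"" + fileName + "\"" + CRLF
                + "Content-Type: " + contentType + CRLF
                + CRLF;
        // Closing delimiter after the part content.
        String tail = CRLF + "--" + boundary + "--" + CRLF;

        byte[] h = head.getBytes(StandardCharsets.US_ASCII);
        byte[] t = tail.getBytes(StandardCharsets.US_ASCII);

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(h, 0, h.length);
        out.write(data, 0, data.length);
        out.write(t, 0, t.length);
        return out.toByteArray();
    }
}
```

The resulting bytes would be sent as the request body with a header of the
form "Content-Type: multipart/form-data; boundary=<the same boundary>", using
whichever HTTP client the application already has.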

> Parsers should not include HTTP client code
> ---
>
> Key: TIKA-2882
> URL: https://issues.apache.org/jira/browse/TIKA-2882
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.21
>Reporter: Jonathan Essex
>Priority: Major
>
> Folks, does it really make sense for a parser to have a REST client built in?
> The GROBID and NLTKNERecogniser parsers use the apache CXF client directly. 
>  
> Since I don't use CXF and my entire app is built on a different JAX-RS stack 
> this just dropped me straight into dependency hell.
> Surely it would make more sense to keep the parsers... well, parsers... and 
> build support for delegating parsing to other services into some higher level 
> in the stack (such as the server, where the CXF dependency is more benign). 
>  





[jira] [Commented] (TIKA-2882) Parsers should not include HTTP client code

2019-05-26 Thread Sergey Beryozkin (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16848489#comment-16848489
 ] 

Sergey Beryozkin commented on TIKA-2882:


Please give it a try, I can help with the migration if needed

> Parsers should not include HTTP client code
> ---
>
> Key: TIKA-2882
> URL: https://issues.apache.org/jira/browse/TIKA-2882
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.21
>Reporter: Jonathan Essex
>Priority: Major
>
> Folks, does it really make sense for a parser to have a REST client built in?
> The GROBID and NLTKNERecogniser parsers use the apache CXF client directly. 
>  
> Since I don't use CXF and my entire app is built on a different JAX-RS stack 
> this just dropped me straight into dependency hell.
> Surely it would make more sense to keep the parsers... well, parsers... and 
> build support for delegating parsing to other services into some higher level 
> in the stack (such as the server, where the CXF dependency is more benign). 
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (TIKA-2882) Parsers should not include HTTP client code

2019-05-26 Thread Sergey Beryozkin (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16848485#comment-16848485
 ] 

Sergey Beryozkin commented on TIKA-2882:


Can you consider a PR where the CXF WebClient code is replaced by the JAX-RS 2.0 
client API?

> Parsers should not include HTTP client code
> ---
>
> Key: TIKA-2882
> URL: https://issues.apache.org/jira/browse/TIKA-2882
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.21
>Reporter: Jonathan Essex
>Priority: Major
>
> Folks, does it really make sense for a parser to have a REST client built in?
> The GROBID and NLTKNERecogniser parsers use the apache CXF client directly. 
>  
> Since I don't use CXF and my entire app is built on a different JAX-RS stack 
> this just dropped me straight into dependency hell.
> Surely it would make more sense to keep the parsers... well, parsers... and 
> build support for delegating parsing to other services into some higher level 
> in the stack (such as the server, where the CXF dependency is more benign). 
>  





[jira] [Resolved] (TIKA-2862) Make PDF Parser GraalVM native mode ready

2019-05-22 Thread Sergey Beryozkin (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Beryozkin resolved TIKA-2862.

Resolution: Not A Problem

The issue is at the PDFBox level, so it will be addressed in PDFBox; Tika 
will then get it as part of the regular PDFBox dependency update.


> Make PDF Parser GraalVM native mode ready 
> --
>
> Key: TIKA-2862
> URL: https://issues.apache.org/jira/browse/TIKA-2862
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.20
>Reporter: Sergey Beryozkin
>    Assignee: Sergey Beryozkin
>Priority: Major
>
> PDF Parser is not Graal native mode ready yet, the following is reported when 
> it is processed as part of Quarkus native mode build:
> {noformat}
> Error: Detected a FileDescriptor in the image heap. You can manually 
> delay class initialization to image run time by using the option 
> --delay-class-initialization-to-runtime=. ...
> Detailed message:
>  Trace:     object org.apache.fontbox.ttf.BufferedRandomAccessFile
>      object org.apache.fontbox.ttf.RAFDataStream
>      object org.apache.fontbox.ttf.TrueTypeFont
>      object org.apache.pdfbox.pdmodel.font.PDType1Font
>      method 
> org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults()
>  Call path from entry point to 
> org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults():
>  
>      at 
> org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults(PDAcroForm.java:106)
>      at 
> org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.(PDAcroForm.java:93)
>      at 
> org.apache.pdfbox.pdmodel.PDDocumentCatalog.getAcroForm(PDDocumentCatalog.java:108)
>      at org.apache.tika.parser.pdf.PDFParser.handleXFAOnly(PDFParser.java:534)
>      at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:164)
> {noformat} 
> See also 
> [https://medium.com/graalvm/understanding-class-initialization-in-graalvm-native-image-generation-d765b7e4d6ed]





Re: TXTParser in Tika 1.21

2019-05-21 Thread Sergey Beryozkin
Hi Tim

No, that is fine. In my initial experiment I chose only 3 parsers to try in
native mode, and TXTParser was one of them, so none of the other parsers
were registered for reflection during native image creation. This is why I
saw the tests passing in 'normal' Java mode (with all the parsers loaded)
but suddenly failing with 1.21. It is all still on my branch, so I just added
TextAndCSVParser to the native mode list and it all works again :-)
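
As an illustration only, the mismatch described above can be expressed as a
small check: compare the parser class names that were actually used against
the names on the native mode list. This is a sketch under assumptions, not
Tika or Quarkus code. NativeParserListCheck and unregistered() are
hypothetical names, how the "used" names are collected is left open, and the
three registered names are only a guess at what such a list might contain.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Tika or Quarkus.
class NativeParserListCheck {

    // Returns the parser class names that were used but are absent
    // from the list registered for native mode.
    static Set<String> unregistered(List<String> usedParsers,
                                    List<String> registeredParsers) {
        Set<String> result = new LinkedHashSet<>(usedParsers);
        result.removeAll(registeredParsers);
        return result;
    }

    public static void main(String[] args) {
        // Assumed contents of the native mode list (text, ODT, PDF).
        List<String> registered = Arrays.asList(
                "org.apache.tika.parser.txt.TXTParser",
                "org.apache.tika.parser.odf.OpenDocumentParser",
                "org.apache.tika.parser.pdf.PDFParser");
        // With 1.21, text/plain is handled by TextAndCSVParser instead.
        List<String> used = Arrays.asList(
                "org.apache.tika.parser.csv.TextAndCSVParser");
        // prints [org.apache.tika.parser.csv.TextAndCSVParser]
        System.out.println(unregistered(used, registered));
    }
}
```

A non-empty result in regular JVM tests would point at a class that still
needs to be added to the list before the native build.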

Sergey

On Mon, May 20, 2019 at 8:59 PM Tim Allison  wrote:

> Y, that was by design.  Not intended to surprise.  We'll be out w
> 1.21.1 or 1.22 soon enough if that's a breaking change... :(
>
> On Mon, May 20, 2019 at 11:12 AM Sergey Beryozkin 
> wrote:
> >
> > I don't really mind though :-) as it looks like both parsers can handle
> the
> > text content, the reason I had a test failure is that I was specifying
> the
> > class names for the native reflection (and I'd probably just use String
> for
> > a plain text :-)).
> >
> > Sergey
> >
> > On Mon, May 20, 2019 at 4:02 PM Sergey Beryozkin 
> > wrote:
> >
> > > org.apache.tika.parser.csv.TextAndCSVParser is being picked up for a
> > > text/plain content, is it expected ?
> > >
> > > Cheers, Sergey
> > >
> > > On Mon, May 20, 2019 at 2:42 PM Sergey Beryozkin  >
> > > wrote:
> > >
> > >> Hi Tim, All
> > >>
> > >> I've just spotted that one of my tests fails with 1.21 (it is only
> > >> specific to a GraalVM native mode, sorry did not check it earlier).
> The
> > >> test just parses a plain text file (other tests involving ODT and PDF
> are
> > >> fine).
> > >> Has something changed around the way the TXTParser is selected how it
> > >> parses the content ? I'll investigate but may be you might be aware of
> > >> something obvious :-)
> > >>
> > >> Sergey
> > >>
> > >>
> > >>
>


Re: TXTParser in Tika 1.21

2019-05-20 Thread Sergey Beryozkin
I don't really mind though :-) as it looks like both parsers can handle the
text content, the reason I had a test failure is that I was specifying the
class names for the native reflection (and I'd probably just use String for
a plain text :-)).

Sergey

On Mon, May 20, 2019 at 4:02 PM Sergey Beryozkin 
wrote:

> org.apache.tika.parser.csv.TextAndCSVParser is being picked up for a
> text/plain content, is it expected ?
>
> Cheers, Sergey
>
> On Mon, May 20, 2019 at 2:42 PM Sergey Beryozkin 
> wrote:
>
>> Hi Tim, All
>>
>> I've just spotted that one of my tests fails with 1.21 (it is only
>> specific to a GraalVM native mode, sorry did not check it earlier). The
>> test just parses a plain text file (other tests involving ODT and PDF are
>> fine).
>> Has something changed around the way the TXTParser is selected how it
>> parses the content ? I'll investigate but may be you might be aware of
>> something obvious :-)
>>
>> Sergey
>>
>>
>>


Re: TXTParser in Tika 1.21

2019-05-20 Thread Sergey Beryozkin
org.apache.tika.parser.csv.TextAndCSVParser is being picked up for
text/plain content; is that expected?

Cheers, Sergey

On Mon, May 20, 2019 at 2:42 PM Sergey Beryozkin 
wrote:

> Hi Tim, All
>
> I've just spotted that one of my tests fails with 1.21 (it is only
> specific to a GraalVM native mode, sorry did not check it earlier). The
> test just parses a plain text file (other tests involving ODT and PDF are
> fine).
> Has something changed around the way the TXTParser is selected how it
> parses the content ? I'll investigate but may be you might be aware of
> something obvious :-)
>
> Sergey
>
>
>


TXTParser in Tika 1.21

2019-05-20 Thread Sergey Beryozkin
Hi Tim, All

I've just spotted that one of my tests fails with 1.21 (only in GraalVM
native mode; sorry, I did not check it earlier). The test just
parses a plain text file (other tests involving ODT and PDF are fine).
Has something changed in the way the TXTParser is selected or in how it
parses the content? I'll investigate, but maybe you are aware of
something obvious :-)

Sergey


Re: [VOTE] Release Apache Tika 1.21 Candidate #2

2019-05-18 Thread Sergey Beryozkin
+1

Thanks, Sergey

On Sat, May 18, 2019 at 11:31 AM Tim Allison  wrote:

> Any fellow devs willing to vote? We have 2 votes so far. I should have time
> on Monday to run the release if the vote passes.
>
> Cheers,
> Tim
>
> On Tue, May 14, 2019 at 10:15 PM Tim Allison  wrote:
>
> > A candidate for the Tika 1.21 release is available at:
> >
> >   https://dist.apache.org/repos/dist/dev/tika/
> >
> > The release candidate is a zip archive of the sources in:
> >   https://github.com/apache/tika/tree/1.21-rc2/
> >
> > The SHA-512 checksum of the archive is:
> >
> >
> 67748553a44b3acb009f0e99ac595c5babfe04d4a75abd2efde614ca26f177c863f7aa598d6911a7b3ca146075c84ecdf0fc3c337d7145d050c889fb4cc4f14f
> >
> >
> > In addition, a staged maven repository is available here:
> >
> >
> https://repository.apache.org/content/repositories/orgapachetika-1048/org/apache/tika
> >
> >
> > Please vote on releasing this package as Apache Tika 1.21.
> >
> > The vote is open for the next 72 hours and passes if a majority of at
> > least three +1 Tika PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Tika 1.21
> > [ ] -1 Do not release this package because...
> >
> > Here's my +1.
> >
> > Cheers,
> >
> >   Tim
> >
>


[jira] [Commented] (TIKA-2862) Make PDF Parser GraalVM native mode ready

2019-05-14 Thread Sergey Beryozkin (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16839640#comment-16839640
 ] 

Sergey Beryozkin commented on TIKA-2862:


See https://github.com/apache/pdfbox/pull/69

> Make PDF Parser GraalVM native mode ready 
> --
>
> Key: TIKA-2862
> URL: https://issues.apache.org/jira/browse/TIKA-2862
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.20
> Reporter: Sergey Beryozkin
> Assignee: Sergey Beryozkin
> Priority: Major
>
> PDF Parser is not Graal native mode ready yet, the following is reported when 
> it is processed as part of Quarkus native mode build:
> {noformat}
> Error: Detected a FileDescriptor in the image heap. You can manually 
> delay class initialization to image run time by using the option 
> --delay-class-initialization-to-runtime=. ...
> Detailed message:
>  Trace:     object org.apache.fontbox.ttf.BufferedRandomAccessFile
>      object org.apache.fontbox.ttf.RAFDataStream
>      object org.apache.fontbox.ttf.TrueTypeFont
>      object org.apache.pdfbox.pdmodel.font.PDType1Font
>      method 
> org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults()
>  Call path from entry point to 
> org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults():
>  
>      at 
> org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults(PDAcroForm.java:106)
>      at 
> org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.(PDAcroForm.java:93)
>      at 
> org.apache.pdfbox.pdmodel.PDDocumentCatalog.getAcroForm(PDDocumentCatalog.java:108)
>      at org.apache.tika.parser.pdf.PDFParser.handleXFAOnly(PDFParser.java:534)
>      at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:164)
> {noformat} 
> See also 
> [https://medium.com/graalvm/understanding-class-initialization-in-graalvm-native-image-generation-d765b7e4d6ed]





[jira] [Updated] (TIKA-2862) Make PDF Parser GraalVM native mode ready

2019-05-14 Thread Sergey Beryozkin (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Beryozkin updated TIKA-2862:
---
Summary: Make PDF Parser GraalVM native mode ready   (was: Make PDF Parser 
Graal native mode ready )

> Make PDF Parser GraalVM native mode ready 
> --
>
> Key: TIKA-2862
> URL: https://issues.apache.org/jira/browse/TIKA-2862
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.20
> Reporter: Sergey Beryozkin
> Assignee: Sergey Beryozkin
> Priority: Major
>
> PDF Parser is not Graal native mode ready yet, the following is reported when 
> it is processed as part of Quarkus native mode build:
> {noformat}
> Error: Detected a FileDescriptor in the image heap. You can manually 
> delay class initialization to image run time by using the option 
> --delay-class-initialization-to-runtime=. ...
> Detailed message:
>  Trace:     object org.apache.fontbox.ttf.BufferedRandomAccessFile
>      object org.apache.fontbox.ttf.RAFDataStream
>      object org.apache.fontbox.ttf.TrueTypeFont
>      object org.apache.pdfbox.pdmodel.font.PDType1Font
>      method 
> org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults()
>  Call path from entry point to 
> org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults():
>  
>      at 
> org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults(PDAcroForm.java:106)
>      at 
> org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.(PDAcroForm.java:93)
>      at 
> org.apache.pdfbox.pdmodel.PDDocumentCatalog.getAcroForm(PDDocumentCatalog.java:108)
>      at org.apache.tika.parser.pdf.PDFParser.handleXFAOnly(PDFParser.java:534)
>      at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:164)
> {noformat} 
> See also 
> [https://medium.com/graalvm/understanding-class-initialization-in-graalvm-native-image-generation-d765b7e4d6ed]





Re: Quarkus integration

2019-05-10 Thread Sergey Beryozkin
I've managed to get the PDFParser running in native mode, but I had to
delay the initialization of org.apache.pdfbox.pdmodel.font.PDType1Font.
This class has static PDType1Font instances, and one of them leads to
org.apache.fontbox.ttf.RAFDataStream, which opens a file handle. Graal
cannot convert that to native code at build time, so the initialization
of PDType1Font has to be delayed until run time.
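
To illustrate why the static instances matter, here is a plain-JVM sketch.
It is not PDFBox code: EagerFonts, LazyFonts, FontResource and InitLog are
made-up names, and FontResource only counts how often it is created instead
of opening a file. A static field is assigned when its class is initialized,
so whatever its constructor opens is opened at that moment; a nested holder
class defers that work until first use. The sketch says nothing about how
the native image build itself decides what to initialize.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Counts how many "resources" have been created so far.
class InitLog {
    static final AtomicInteger RESOURCES_OPENED = new AtomicInteger();
}

// Hypothetical stand-in for an object that would hold a file handle.
class FontResource {
    FontResource() {
        InitLog.RESOURCES_OPENED.incrementAndGet();
    }
}

// The resource is created as soon as the class is initialized.
class EagerFonts {
    static final FontResource DEFAULT = new FontResource();

    static String describe() {
        return "eager";
    }
}

// The resource is created only when defaultFont() is first called.
class LazyFonts {
    private static class Holder {
        static final FontResource DEFAULT = new FontResource();
    }

    static String describe() {
        return "lazy";
    }

    static FontResource defaultFont() {
        return Holder.DEFAULT;
    }
}

class StaticInitDemo {
    public static void main(String[] args) {
        LazyFonts.describe();   // initializes LazyFonts, but not Holder
        System.out.println(InitLog.RESOURCES_OPENED.get()); // prints 0
        EagerFonts.describe();  // initializing EagerFonts creates the resource
        System.out.println(InitLog.RESOURCES_OPENED.get()); // prints 1
        LazyFonts.defaultFont(); // first use initializes Holder
        System.out.println(InitLog.RESOURCES_OPENED.get()); // prints 2
    }
}
```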

If we start from the PDF parser, the call path to RAFDataStream starts
from:

 at
org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults(PDAcroForm.java:106)
 at
org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.<init>(PDAcroForm.java:93)
 at
org.apache.pdfbox.pdmodel.PDDocumentCatalog.getAcroForm(PDDocumentCatalog.java:108)
 at
org.apache.tika.parser.pdf.PDFParser.handleXFAOnly(PDFParser.java:534)

I guess I may need to create a PR for PDFBox where RAFDataStream opens a
stream lazily, with a check like ensureOpen() being added to its read
methods...
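
The lazy-open idea can be sketched as follows. This is only an illustration
of the pattern, assuming a read-only stream backed by a RandomAccessFile; it
is not the actual RAFDataStream API, and LazyFileStream and isOpen() are
hypothetical names introduced here.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical stand-in for a file-backed data stream; not PDFBox code.
class LazyFileStream implements AutoCloseable {
    private final File file;
    private RandomAccessFile raf; // stays null until the first read

    LazyFileStream(File file) {
        // The constructor only records the path; no file handle is opened.
        this.file = file;
    }

    // Opens the underlying file on first use.
    private void ensureOpen() throws IOException {
        if (raf == null) {
            raf = new RandomAccessFile(file, "r");
        }
    }

    // Reports whether a file handle is currently held.
    boolean isOpen() {
        return raf != null;
    }

    // Reads one byte, opening the file first if needed.
    int read() throws IOException {
        ensureOpen();
        return raf.read();
    }

    @Override
    public void close() throws IOException {
        if (raf != null) {
            raf.close();
            raf = null;
        }
    }
}
```

With this shape, creating the object from a static initializer would not by
itself leave an open file descriptor behind; the handle only exists once a
read method has been called.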

Sergey

On Fri, May 3, 2019 at 1:22 PM Sergey Beryozkin 
wrote:

> Yes, please add 'sergeyb', I've just assigned myself a CXF issue as
> 'sergeyb'. Sorry about these multiple ids, but indeed I'll try to keep
> using a single one.
>
> Thanks, Sergey
>
>
>
> On Fri, May 3, 2019 at 12:13 PM Tim Allison  wrote:
>
>> I can add 'sergeyb' if you'd prefer!
>>
>> On Fri, May 3, 2019 at 5:43 AM Sergey Beryozkin 
>> wrote:
>> >
>> > Though I might need to settle on the 'sergeyb' eventually since it is my
>> > apache committer id.
>> > Thanks...
>> >
>> > On Fri, May 3, 2019 at 10:29 AM Sergey Beryozkin 
>> > wrote:
>> >
>> > > Oh, I forgot I had a 'sergey_beryozkin' id as well, this is not good,
>> > > shows how long ago I did contribute :-) (did try sergey.beryozkin
>> though).
>> > >
>> > > Thanks for checking it, I've just assigned this issue to myself.
>> > > Cheers, Sergey
>> > >
>> > >
>> > > On Thu, May 2, 2019 at 6:08 PM Sergey Beryozkin > >
>> > > wrote:
>> > >
>> > >> Hi Tim
>> > >>
>> > >> I can't assign
>> > >> https://issues.apache.org/jira/browse/TIKA-2862
>> > >>
>> > >> to myself, I used to be able to assign, I know I had some time away
>> from
>> > >> Tika, but I'm keen to return with few contributions :-)
>> > >> Please update my record for me to be able to assign the issues again
>> > >>
>> > >> Cheers, Sergey
>> > >>
>> > >> On Tue, Apr 30, 2019 at 6:22 PM Sergey Beryozkin <
>> sberyoz...@gmail.com>
>> > >> wrote:
>> > >>
>> > >>> Hi Tim, All
>> > >>>
>> > >>> I've started working on integrating Tika with Quarkus [1]. The main
>> idea
>> > >>> is to be able to use Tika in the native image mode.
>> > >>> It's quite likely I'll start creating the PRs soon, to get the
>> native
>> > >>> image related issues resolved, these are related to some libraries
>> > >>> statically initializing FileDescriptors, etc.
>> > >>>
>> > >>> Thanks, Sergey
>> > >>>
>> > >>> [1]
>> > >>>
>> https://github.com/sberyozkin/quarkus/tree/tika_extension/extensions/tika
>> > >>> [2]
>> > >>>
>> https://github.com/sberyozkin/quarkus-quickstarts/tree/tika/getting-started-tika
>> > >>>
>> > >>>
>>
>


[jira] [Updated] (TIKA-2862) Make PDF Parser Graal native mode ready

2019-05-10 Thread Sergey Beryozkin (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Beryozkin updated TIKA-2862:
---
Description: 
PDF Parser is not Graal native mode ready yet, the following is reported when 
it is processed as part of Quarkus native mode build:
{noformat}
Error: Detected a FileDescriptor in the image heap. You can manually delay 
class initialization to image run time by using the option 
--delay-class-initialization-to-runtime=. ...
Detailed message:
 Trace:     object org.apache.fontbox.ttf.BufferedRandomAccessFile
     object org.apache.fontbox.ttf.RAFDataStream
     object org.apache.fontbox.ttf.TrueTypeFont
     object org.apache.pdfbox.pdmodel.font.PDType1Font
     method 
org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults()
 Call path from entry point to 
org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults(): 
     at 
org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults(PDAcroForm.java:106)
     at 
org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.(PDAcroForm.java:93)
     at 
org.apache.pdfbox.pdmodel.PDDocumentCatalog.getAcroForm(PDDocumentCatalog.java:108)
     at org.apache.tika.parser.pdf.PDFParser.handleXFAOnly(PDFParser.java:534)
     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:164)
{noformat} 

See also 
[https://medium.com/graalvm/understanding-class-initialization-in-graalvm-native-image-generation-d765b7e4d6ed]

  was:
PDF Parser is not Graal native mode ready yet, the following is reported when 
it is processed as part of Quarkus native mode build:

Error: Detected a FileDescriptor in the image heap. You can manually delay 
class initialization to image run time by using the option 
--delay-class-initialization-to-runtime=. ...
Detailed message:
Trace:     object org.apache.fontbox.ttf.BufferedRandomAccessFile
    object org.apache.fontbox.ttf.RAFDataStream
    object org.apache.fontbox.ttf.TrueTypeFont
    object org.apache.pdfbox.pdmodel.font.PDType1Font
    method 
org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults()
Call path from entry point to 
org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults(): 
    at 
org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults(PDAcroForm.java:106)
    at 
org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.(PDAcroForm.java:93)
    at 
org.apache.pdfbox.pdmodel.PDDocumentCatalog.getAcroForm(PDDocumentCatalog.java:108)
    at org.apache.tika.parser.pdf.PDFParser.handleXFAOnly(PDFParser.java:534)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:164)

 

See also 
[https://medium.com/graalvm/understanding-class-initialization-in-graalvm-native-image-generation-d765b7e4d6ed]


> Make PDF Parser Graal native mode ready 
> 
>
> Key: TIKA-2862
> URL: https://issues.apache.org/jira/browse/TIKA-2862
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.20
> Reporter: Sergey Beryozkin
> Assignee: Sergey Beryozkin
> Priority: Major
>
> PDF Parser is not Graal native mode ready yet, the following is reported when 
> it is processed as part of Quarkus native mode build:
> {noformat}
> Error: Detected a FileDescriptor in the image heap. You can manually 
> delay class initialization to image run time by using the option 
> --delay-class-initialization-to-runtime=. ...
> Detailed message:
>  Trace:     object org.apache.fontbox.ttf.BufferedRandomAccessFile
>      object org.apache.fontbox.ttf.RAFDataStream
>      object org.apache.fontbox.ttf.TrueTypeFont
>      object org.apache.pdfbox.pdmodel.font.PDType1Font
>      method 
> org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults()
>  Call path from entry point to 
> org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults():
>  
>      at 
> org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults(PDAcroForm.java:106)
>      at 
> org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.(PDAcroForm.java:93)
>      at 
> org.apache.pdfbox.pdmodel.PDDocumentCatalog.getAcroForm(PDDocumentCatalog.java:108)
>      at org.apache.tika.parser.pdf.PDFParser.handleXFAOnly(PDFParser.java:534)
>      at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:164)
> {noformat} 
> See also 
> [https://medium.com/graalvm/understanding-class-initialization-in-graalvm-native-image-generation-d765b7e4d6ed]





[jira] [Comment Edited] (TIKA-2862) Make PDF Parser Graal native mode ready

2019-05-10 Thread Sergey Beryozkin (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16837432#comment-16837432
 ] 

Sergey Beryozkin edited comment on TIKA-2862 at 5/10/19 4:37 PM:
-

The call path from PDType1Font to RAFDataStream:

{noformat}
at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:132)
at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:87)
at 
org.apache.pdfbox.pdmodel.font.FileSystemFontProvider.readTrueTypeFont(FileSystemFontProvider.java:731)
at 
org.apache.pdfbox.pdmodel.font.FileSystemFontProvider.getTrueTypeFont(FileSystemFontProvider.java:696)
at 
org.apache.pdfbox.pdmodel.font.FileSystemFontProvider.access$200(FileSystemFontProvider.java:55)
at 
org.apache.pdfbox.pdmodel.font.FileSystemFontProvider$FSFontInfo.getFont(FileSystemFontProvider.java:132)
at 
org.apache.pdfbox.pdmodel.font.FontMapperImpl.findFont(FontMapperImpl.java:436)
at 
org.apache.pdfbox.pdmodel.font.FontMapperImpl.findFontBoxFont(FontMapperImpl.java:382)
at 
org.apache.pdfbox.pdmodel.font.FontMapperImpl.getFontBoxFont(FontMapperImpl.java:359)
at 
org.apache.pdfbox.pdmodel.font.PDType1Font.(PDType1Font.java:146)
at 
org.apache.pdfbox.pdmodel.font.PDType1Font.(PDType1Font.java:91)
{noformat}


was (Author: sergey_beryozkin):
The call path from PDType1Font to RAFDataStream:

{noformat}
17:31:09,714 ERROR [org.apa.pdf.pdm.fon.FileSystemFontProvider] Could not load 
font file: /usr/share/fonts/liberation/LiberationSans-Regular.ttf: 
java.lang.NullPointerException
at 
org.apache.fontbox.ttf.RAFDataStream.readSignedShort(RAFDataStream.java:77)
at 
org.apache.fontbox.ttf.TTFDataStream.read32Fixed(TTFDataStream.java:50)
at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:132)
at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:87)
at 
org.apache.pdfbox.pdmodel.font.FileSystemFontProvider.readTrueTypeFont(FileSystemFontProvider.java:731)
at 
org.apache.pdfbox.pdmodel.font.FileSystemFontProvider.getTrueTypeFont(FileSystemFontProvider.java:696)
at 
org.apache.pdfbox.pdmodel.font.FileSystemFontProvider.access$200(FileSystemFontProvider.java:55)
at 
org.apache.pdfbox.pdmodel.font.FileSystemFontProvider$FSFontInfo.getFont(FileSystemFontProvider.java:132)
at 
org.apache.pdfbox.pdmodel.font.FontMapperImpl.findFont(FontMapperImpl.java:436)
at 
org.apache.pdfbox.pdmodel.font.FontMapperImpl.findFontBoxFont(FontMapperImpl.java:382)
at 
org.apache.pdfbox.pdmodel.font.FontMapperImpl.getFontBoxFont(FontMapperImpl.java:359)
at 
org.apache.pdfbox.pdmodel.font.PDType1Font.(PDType1Font.java:146)
at 
org.apache.pdfbox.pdmodel.font.PDType1Font.(PDType1Font.java:91)
{noformat}

> Make PDF Parser Graal native mode ready 
> 
>
> Key: TIKA-2862
> URL: https://issues.apache.org/jira/browse/TIKA-2862
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.20
> Reporter: Sergey Beryozkin
> Assignee: Sergey Beryozkin
> Priority: Major
>
> PDF Parser is not Graal native mode ready yet, the following is reported when 
> it is processed as part of Quarkus native mode build:
> Error: Detected a FileDescriptor in the image heap. You can manually 
> delay class initialization to image run time by using the option 
> --delay-class-initialization-to-runtime=. ...
> Detailed message:
> Trace:     object org.apache.fontbox.ttf.BufferedRandomAccessFile
>     object org.apache.fontbox.ttf.RAFDataStream
>     object org.apache.fontbox.ttf.TrueTypeFont
>     object org.apache.pdfbox.pdmodel.font.PDType1Font
>     method 
> org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults()
> Call path from entry point to 
> org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults():
>  
>     at 
> org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults(PDAcroForm.java:106)
>     at 
> org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.(PDAcroForm.java:93)
>     at 
> org.apache.pdfbox.pdmodel.PDDocumentCatalog.getAcroForm(PDDocumentCatalog.java:108)
>     at org.apache.tika.parser.pdf.PDFParser.handleXFAOnly(PDFParser.java:534)
>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:164)
>  
> See also 
> [https://medium.com/graalvm/understanding-class-initialization-in-graalvm-native-image-generation-d765b7e4d6ed]





[jira] [Commented] (TIKA-2862) Make PDF Parser Graal native mode ready

2019-05-10 Thread Sergey Beryozkin (JIRA)


[ 
https://issues.apache.org/jira/browse/TIKA-2862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16837432#comment-16837432
 ] 

Sergey Beryozkin commented on TIKA-2862:


The call path from PDType1Font to RAFDataStream:

{noformat}
17:31:09,714 ERROR [org.apa.pdf.pdm.fon.FileSystemFontProvider] Could not load 
font file: /usr/share/fonts/liberation/LiberationSans-Regular.ttf: 
java.lang.NullPointerException
at 
org.apache.fontbox.ttf.RAFDataStream.readSignedShort(RAFDataStream.java:77)
at 
org.apache.fontbox.ttf.TTFDataStream.read32Fixed(TTFDataStream.java:50)
at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:132)
at org.apache.fontbox.ttf.TTFParser.parse(TTFParser.java:87)
at 
org.apache.pdfbox.pdmodel.font.FileSystemFontProvider.readTrueTypeFont(FileSystemFontProvider.java:731)
at 
org.apache.pdfbox.pdmodel.font.FileSystemFontProvider.getTrueTypeFont(FileSystemFontProvider.java:696)
at 
org.apache.pdfbox.pdmodel.font.FileSystemFontProvider.access$200(FileSystemFontProvider.java:55)
at 
org.apache.pdfbox.pdmodel.font.FileSystemFontProvider$FSFontInfo.getFont(FileSystemFontProvider.java:132)
at 
org.apache.pdfbox.pdmodel.font.FontMapperImpl.findFont(FontMapperImpl.java:436)
at 
org.apache.pdfbox.pdmodel.font.FontMapperImpl.findFontBoxFont(FontMapperImpl.java:382)
at 
org.apache.pdfbox.pdmodel.font.FontMapperImpl.getFontBoxFont(FontMapperImpl.java:359)
at 
org.apache.pdfbox.pdmodel.font.PDType1Font.(PDType1Font.java:146)
at 
org.apache.pdfbox.pdmodel.font.PDType1Font.(PDType1Font.java:91)
{noformat}

> Make PDF Parser Graal native mode ready 
> 
>
> Key: TIKA-2862
> URL: https://issues.apache.org/jira/browse/TIKA-2862
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.20
> Reporter: Sergey Beryozkin
> Assignee: Sergey Beryozkin
> Priority: Major
>
> PDF Parser is not Graal native mode ready yet, the following is reported when 
> it is processed as part of Quarkus native mode build:
> Error: Detected a FileDescriptor in the image heap. You can manually 
> delay class initialization to image run time by using the option 
> --delay-class-initialization-to-runtime=. ...
> Detailed message:
> Trace:     object org.apache.fontbox.ttf.BufferedRandomAccessFile
>     object org.apache.fontbox.ttf.RAFDataStream
>     object org.apache.fontbox.ttf.TrueTypeFont
>     object org.apache.pdfbox.pdmodel.font.PDType1Font
>     method 
> org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults()
> Call path from entry point to 
> org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults():
>  
>     at 
> org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults(PDAcroForm.java:106)
>     at 
> org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.(PDAcroForm.java:93)
>     at 
> org.apache.pdfbox.pdmodel.PDDocumentCatalog.getAcroForm(PDDocumentCatalog.java:108)
>     at org.apache.tika.parser.pdf.PDFParser.handleXFAOnly(PDFParser.java:534)
>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:164)
>  
> See also 
> [https://medium.com/graalvm/understanding-class-initialization-in-graalvm-native-image-generation-d765b7e4d6ed]





Re: Quarkus integration

2019-05-03 Thread Sergey Beryozkin
Yes, please add 'sergeyb', I've just assigned myself a CXF issue as
'sergeyb'. Sorry about these multiple ids, but indeed I'll try to keep
using a single one.

Thanks, Sergey



On Fri, May 3, 2019 at 12:13 PM Tim Allison  wrote:

> I can add 'sergeyb' if you'd prefer!
>
> On Fri, May 3, 2019 at 5:43 AM Sergey Beryozkin 
> wrote:
> >
> > Though I might need to settle on the 'sergeyb' eventually since it is my
> > apache committer id.
> > Thanks...
> >
> > On Fri, May 3, 2019 at 10:29 AM Sergey Beryozkin 
> > wrote:
> >
> > > Oh, I forgot I had a 'sergey_beryozkin' id as well, this is not good,
> > > shows how long ago I did contribute :-) (did try sergey.beryozkin
> though).
> > >
> > > Thanks for checking it, I've just assigned this issue to myself.
> > > Cheers, Sergey
> > >
> > >
> > > On Thu, May 2, 2019 at 6:08 PM Sergey Beryozkin 
> > > wrote:
> > >
> > >> Hi Tim
> > >>
> > >> I can't assign
> > >> https://issues.apache.org/jira/browse/TIKA-2862
> > >>
> > >> to myself, I used to be able to assign, I know I had some time away
> from
> > >> Tika, but I'm keen to return with few contributions :-)
> > >> Please update my record for me to be able to assign the issues again
> > >>
> > >> Cheers, Sergey
> > >>
> > >> On Tue, Apr 30, 2019 at 6:22 PM Sergey Beryozkin <
> sberyoz...@gmail.com>
> > >> wrote:
> > >>
> > >>> Hi Tim, All
> > >>>
> > >>> I've started working on integrating Tika with Quarkus [1]. The main
> idea
> > >>> is to be able to use Tika in the native image mode.
> > >>> It's quite likely I'll start creating the PRs soon, to get the native
> > >>> image related issues resolved, these are related to some libraries
> > >>> statically initializing FileDescriptors, etc.
> > >>>
> > >>> Thanks, Sergey
> > >>>
> > >>> [1]
> > >>>
> https://github.com/sberyozkin/quarkus/tree/tika_extension/extensions/tika
> > >>> [2]
> > >>>
> https://github.com/sberyozkin/quarkus-quickstarts/tree/tika/getting-started-tika
> > >>>
> > >>>
>


Re: Quarkus integration

2019-05-03 Thread Sergey Beryozkin
Though I might need to settle on 'sergeyb' eventually, since it is my
Apache committer id.
Thanks...

On Fri, May 3, 2019 at 10:29 AM Sergey Beryozkin 
wrote:

> Oh, I forgot I had a 'sergey_beryozkin' id as well, this is not good,
> shows how long ago I did contribute :-) (did try sergey.beryozkin though).
>
> Thanks for checking it, I've just assigned this issue to myself.
> Cheers, Sergey
>
>
> On Thu, May 2, 2019 at 6:08 PM Sergey Beryozkin 
> wrote:
>
>> Hi Tim
>>
>> I can't assign
>> https://issues.apache.org/jira/browse/TIKA-2862
>>
>> to myself, I used to be able to assign, I know I had some time away from
>> Tika, but I'm keen to return with few contributions :-)
>> Please update my record for me to be able to assign the issues again
>>
>> Cheers, Sergey
>>
>> On Tue, Apr 30, 2019 at 6:22 PM Sergey Beryozkin 
>> wrote:
>>
>>> Hi Tim, All
>>>
>>> I've started working on integrating Tika with Quarkus [1]. The main idea
>>> is to be able to use Tika in the native image mode.
>>> It's quite likely I'll start creating the PRs soon, to get the native
>>> image related issues resolved, these are related to some libraries
>>> statically initializing FileDescriptors, etc.
>>>
>>> Thanks, Sergey
>>>
>>> [1]
>>> https://github.com/sberyozkin/quarkus/tree/tika_extension/extensions/tika
>>> [2]
>>> https://github.com/sberyozkin/quarkus-quickstarts/tree/tika/getting-started-tika
>>>
>>>


Re: Quarkus integration

2019-05-03 Thread Sergey Beryozkin
Oh, I forgot I had a 'sergey_beryozkin' id as well, this is not good, shows
how long ago I did contribute :-) (did try sergey.beryozkin though).

Thanks for checking it, I've just assigned this issue to myself.
Cheers, Sergey


On Thu, May 2, 2019 at 6:08 PM Sergey Beryozkin 
wrote:

> Hi Tim
>
> I can't assign
> https://issues.apache.org/jira/browse/TIKA-2862
>
> to myself, I used to be able to assign, I know I had some time away from
> Tika, but I'm keen to return with few contributions :-)
> Please update my record for me to be able to assign the issues again
>
> Cheers, Sergey
>
> On Tue, Apr 30, 2019 at 6:22 PM Sergey Beryozkin 
> wrote:
>
>> Hi Tim, All
>>
>> I've started working on integrating Tika with Quarkus [1]. The main idea
>> is to be able to use Tika in the native image mode.
>> It's quite likely I'll start creating the PRs soon, to get the native
>> image related issues resolved, these are related to some libraries
>> statically initializing FileDescriptors, etc.
>>
>> Thanks, Sergey
>>
>> [1]
>> https://github.com/sberyozkin/quarkus/tree/tika_extension/extensions/tika
>> [2]
>> https://github.com/sberyozkin/quarkus-quickstarts/tree/tika/getting-started-tika
>>
>>


[jira] [Assigned] (TIKA-2862) Make PDF Parser Graal native mode ready

2019-05-03 Thread Sergey Beryozkin (JIRA)


 [ 
https://issues.apache.org/jira/browse/TIKA-2862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Beryozkin reassigned TIKA-2862:
--

Assignee: Sergey Beryozkin

> Make PDF Parser Graal native mode ready 
> 
>
> Key: TIKA-2862
> URL: https://issues.apache.org/jira/browse/TIKA-2862
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.20
> Reporter: Sergey Beryozkin
> Assignee: Sergey Beryozkin
> Priority: Major
>
> PDF Parser is not Graal native mode ready yet, the following is reported when 
> it is processed as part of Quarkus native mode build:
> Error: Detected a FileDescriptor in the image heap. You can manually 
> delay class initialization to image run time by using the option 
> --delay-class-initialization-to-runtime=. ...
> Detailed message:
> Trace:     object org.apache.fontbox.ttf.BufferedRandomAccessFile
>     object org.apache.fontbox.ttf.RAFDataStream
>     object org.apache.fontbox.ttf.TrueTypeFont
>     object org.apache.pdfbox.pdmodel.font.PDType1Font
>     method 
> org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults()
> Call path from entry point to 
> org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults():
>  
>     at 
> org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults(PDAcroForm.java:106)
>     at 
> org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.<init>(PDAcroForm.java:93)
>     at 
> org.apache.pdfbox.pdmodel.PDDocumentCatalog.getAcroForm(PDDocumentCatalog.java:108)
>     at org.apache.tika.parser.pdf.PDFParser.handleXFAOnly(PDFParser.java:534)
>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:164)
>  
> See also 
> [https://medium.com/graalvm/understanding-class-initialization-in-graalvm-native-image-generation-d765b7e4d6ed]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Quarkus integration

2019-05-02 Thread Sergey Beryozkin
Hi Tim

I can't assign
https://issues.apache.org/jira/browse/TIKA-2862

to myself. I used to be able to assign issues; I know I had some time away
from Tika, but I'm keen to return with a few contributions :-)
Please update my record so that I can assign issues again

Cheers, Sergey

On Tue, Apr 30, 2019 at 6:22 PM Sergey Beryozkin 
wrote:

> Hi Tim, All
>
> I've started working on integrating Tika with Quarkus [1]. The main idea
> is to be able to use Tika in the native image mode.
> It's quite likely I'll start creating the PRs soon, to get the native
> image related issues resolved, these are related to some libraries
> statically initializing FileDescriptors, etc.
>
> Thanks, Sergey
>
> [1]
> https://github.com/sberyozkin/quarkus/tree/tika_extension/extensions/tika
> [2]
> https://github.com/sberyozkin/quarkus-quickstarts/tree/tika/getting-started-tika
>
>


[jira] [Created] (TIKA-2862) Make PDF Parser Graal native mode ready

2019-05-02 Thread Sergey Beryozkin (JIRA)
Sergey Beryozkin created TIKA-2862:
--

 Summary: Make PDF Parser Graal native mode ready 
 Key: TIKA-2862
 URL: https://issues.apache.org/jira/browse/TIKA-2862
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.20
Reporter: Sergey Beryozkin


PDF Parser is not Graal native mode ready yet, the following is reported when 
it is processed as part of Quarkus native mode build:

Error: Detected a FileDescriptor in the image heap. You can manually delay 
class initialization to image run time by using the option 
--delay-class-initialization-to-runtime=. ...
Detailed message:
Trace:     object org.apache.fontbox.ttf.BufferedRandomAccessFile
    object org.apache.fontbox.ttf.RAFDataStream
    object org.apache.fontbox.ttf.TrueTypeFont
    object org.apache.pdfbox.pdmodel.font.PDType1Font
    method 
org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults()
Call path from entry point to 
org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults(): 
    at 
org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.verifyOrCreateDefaults(PDAcroForm.java:106)
    at 
org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm.<init>(PDAcroForm.java:93)
    at 
org.apache.pdfbox.pdmodel.PDDocumentCatalog.getAcroForm(PDDocumentCatalog.java:108)
    at org.apache.tika.parser.pdf.PDFParser.handleXFAOnly(PDFParser.java:534)
    at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:164)

 

See also 
[https://medium.com/graalvm/understanding-class-initialization-in-graalvm-native-image-generation-d765b7e4d6ed]
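
One common way to keep an expensive or file-backed object out of static initialization is the initialization-on-demand holder idiom, sketched below. This is only an illustration of deferring creation until first use: the Font class here is a hypothetical stand-in, not the PDFBox or FontBox type, and whether such a change is sufficient for a native image build also depends on how class initialization is configured.

```java
import java.util.concurrent.atomic.AtomicInteger;

class LazyFontSketch {

    // Counts how many times the (hypothetical) font resource was created.
    static final AtomicInteger LOADS = new AtomicInteger();

    // Hypothetical stand-in for an object that would open a file when built.
    static final class Font {
        Font() {
            LOADS.incrementAndGet();
        }
    }

    // The nested class is only initialized on the first access to INSTANCE,
    // so the Font is not created when LazyFontSketch itself is initialized.
    private static final class Holder {
        static final Font INSTANCE = new Font();
    }

    static Font defaultFont() {
        return Holder.INSTANCE;
    }

    public static void main(String[] args) {
        System.out.println("loads before first use: " + LOADS.get());
        Font first = defaultFont();
        Font second = defaultFont();
        System.out.println("loads after two calls: " + LOADS.get());
        System.out.println("same instance: " + (first == second));
    }
}
```

The holder keeps the accessor cheap and thread-safe without explicit locking, since the JVM guarantees that class initialization runs once.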





Quarkus integration

2019-04-30 Thread Sergey Beryozkin
Hi Tim, All

I've started working on integrating Tika with Quarkus [1]. The main idea is
to be able to use Tika in the native image mode.
It's quite likely I'll start creating the PRs soon, to get the native image
related issues resolved, these are related to some libraries statically
initializing FileDescriptors, etc.

Thanks, Sergey

[1]
https://github.com/sberyozkin/quarkus/tree/tika_extension/extensions/tika
[2]
https://github.com/sberyozkin/quarkus-quickstarts/tree/tika/getting-started-tika


Re: Integrating Tika with Apache Beam

2017-12-28 Thread Sergey Beryozkin

Hi All

A short update: my original TikaIO contribution changed a lot after
the round of reviews, but the good news is that it stayed a native Beam IO
component and will be available from Beam 2.3.0.
TikaIO will now return something called ParseResult, which includes the
complete String and metadata content, but also a Throwable if an
exception occurred while parsing some file.
Tika streaming is not utilized at the moment, but as soon as good
use cases emerge I'm sure the Beam community will be open to enhancing
TikaIO further...


Cheers. Sergey


On 21/09/17 18:54, Chris Mattmann wrote:

Thanks Sergey, feel free to CC me directly at mattm...@apache.org on the Beam 
thread.
My own 2c is that Tika’s “metadata” extraction can be any order, and with our 
tika-dl module
and the new feature extraction from multimedia files using Tensorflow and DL4j 
these are
perfect examples where the order/extraction doesn’t matter…



On 9/21/17, 2:52 AM, "Sergey Beryozkin" <sberyoz...@gmail.com> wrote:

 Hi Guys

 TikaIO is getting some serious attention now on the Beam dev list, and
 unfortunately it is not all about it being a great addition to Beam.

 The team is wondering what one can do with TikaIO vs someone just doing
 some custom Beam function.

 TikaIO, like any other bounded text reader, will produce the data in an
 ordered way, but the Beam runtime can make it totally unordered to the
 pipeline.

 I gave one example where we used the Tika output to save it all to
 Lucene (with the file name associated) and then search for the files
 which contain a certain word.

 Tim, Chris, others, if you have some interesting examples to share where
 it did not matter in which order Tika-produced data were made eventually
 available, then please let me know, or reply directly to a Beam dev
 thread titled "TikaIO concerns".

 Note, if Beam devs decide they don't want it then one option can be to
 create a tika-integrations/beam module and experiment there - I'm not
 saying it will need to be done but it's something that may be worth
 considering

 Sergey

 On 15/09/17 12:02, Sergey Beryozkin wrote:
 > Hi Chris
 >
 > thanks,
 >
 > at the moment TikaIO (originally renamed TikaReader as it can only read
 > but we renamed it to follow the convention) is a bounded reader, so you
 > can say ask it to read
 >
 > /files/*.pdf
 >
 > and it will read all the N files there, and will end the run.
 >
 > I'm not sure yet what is the best strategy to making it the unbounded
 > reader where it can continuously poll or be notified of the new files
 > becoming available...There are some ideas about scheduling the bounded
 > Beam pipelines, haven't looked yet...
 >
 > In the short term, the simplest solution would be simply to create a new
 > instance of TikaIO pipeline, and point it to the new temp folder where a
 > new batch of files has been dropped to.
 >
 > Thanks, Sergey
 > On 11/09/17 22:41, Mattmann, Chris A (3010) wrote:
 >> Amazing work, thank you Sergey!!
 >>
 >> 
++
 >>
 >> Chris Mattmann, Ph.D.
 >> Principal Data Scientist, Engineering Administrative Office (3010)
 >> Manager, NSF & Open Source Projects Formulation and Development
 >> Offices (8212)
 >> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
 >> Office: 180-503E, Mailstop: 180-503
 >> Email: chris.a.mattm...@nasa.gov
 >> WWW:  http://sunset.usc.edu/~mattmann/
 >> 
++
 >>
 >> Director, Information Retrieval and Data Science Group (IRDS)
 >> Adjunct Associate Professor, Computer Science Department
 >> University of Southern California, Los Angeles, CA 90089 USA
 >> WWW: http://irds.usc.edu/
 >> 
++
 >>
     >>
 >> On 9/11/17, 7:33 AM, "Allison, Timothy B." <talli...@mitre.org> wrote:
 >>
 >>  What great news!  Thank you, Sergey!!!
 >>  -Original Message-
 >>  From: Sergey Beryozkin [mailto:sberyoz...@gmail.com]
 >>  Sent: Monday, September 11, 2017 9:18 AM
 >>  To: Allison, Timothy B. <talli...@mitre.org>; dev@tika.apache.org
 >>  Subject: Re: Integrating Tika with Apache Beam
 >>  Hi Tim, All
 >>  It took it some time, but finally Beam TikaIO component is in its
 >> 2.

Tika Write Builder

2017-12-28 Thread Sergey Beryozkin

Hi All,

Right now Tika can help with reading from files in many formats.

Would it make sense to consider a new project, which would help users 
write into a user-preferred format, using a builder API ?


I realize different formats have different capabilities but I reckon 
some minimalistic API can be created which will work for many mainstream 
formats, with some of the API methods becoming optional for some other 
formats.


Example,

TikaWriter writer = TikaWriter.newInstance(Formats.PDF);
// start
writer.header("someheader").tableofContent(new TableOfContents(...));
// body
writer.asTable(new String[][]{...});
writer.attachment(new Image());
...

I guess this can become quite complicated, but maybe one can start with 
some easy Writer API which can be easily mapped to PDF/etc and then take 
it from there, slowly adding more methods, etc...


Just one idea for 2018 and new 2.0 master :-)

Thanks, Sergey

P.S I saw some related discussion at the Beam dev, about writing to 
different formats, and thought, may be something like that can be done 
for Tika, which might be of help in general but also complement in time 
the Beam TikaIO (which can only read at the moment)





Re: Tika 2 parsers

2017-10-25 Thread Sergey Beryozkin
As Tim indicated the 2.x line is not actively developed at the moment, 
but what is already there now is sufficient for the initial try (ex. 
with PDF/ODT parsers)


Sergey


On 25/10/17 08:30, Gethin James wrote:

I did have a look for the source, what branch is it?
https://github.com/apache/tika/tree/2.x doesn't seem to have been updated
since May.

On 24 October 2017 at 22:15, Sergey Beryozkin <sberyoz...@gmail.com> wrote:


I did try the modules in the earlier version of the CXF demo,

see the right panel,

https://github.com/apache/cxf/commit/c2ccecb23ba23497c95be89f9b37f38c69faba7a#diff-b5ed531ebf92978dcbcf1ac6cc6331c0

They should be available in the snapshot repo

Cheers, Sergey

On 24/10/17 19:45, Allison, Timothy B. wrote:


We'll switch master over to the 2.0 layout after our next release, which
should happen shortly after the release of PDFBox 2.0.8...roughly in the
next week for PDFBox, next month for Tika.

We have abandoned keeping the current 2.x up to date, and I was hoping
there would at least be a build here:
https://builds.apache.org/view/T/view/Tika/job/tika-2.x/, but there isn't a
clean build there.

So, unfortunately, for now, your best bet is to build it yourself from
source.  Sorry.



-Original Message-
From: Gethin James [mailto:gja...@nuxeo.com]
Sent: Tuesday, October 24, 2017 12:19 PM
To: dev@tika.apache.org
Subject: Tika 2 parsers

Hi, I am interested in trying the more modular approach of using the Tika
2 parsers.  Are the Tika 2 artifacts available in a maven repo somewhere?
Is there any documentation on how to use them or how they differ from Tika 1?

Thanks,
Gethin.






Re: Tika 2 parsers

2017-10-24 Thread Sergey Beryozkin

I did try the modules in the earlier version of the CXF demo,

see the right panel,

https://github.com/apache/cxf/commit/c2ccecb23ba23497c95be89f9b37f38c69faba7a#diff-b5ed531ebf92978dcbcf1ac6cc6331c0

They should be available in the snapshot repo

Cheers, Sergey
On 24/10/17 19:45, Allison, Timothy B. wrote:

We'll switch master over to the 2.0 layout after our next release, which should 
happen shortly after the release of PDFBox 2.0.8...roughly in the next week for 
PDFBox, next month for Tika.

We have abandoned keeping the current 2.x up to date, and I was hoping there 
would at least be a build here: 
https://builds.apache.org/view/T/view/Tika/job/tika-2.x/, but there isn't a 
clean build there.

So, unfortunately, for now, your best bet is to build it yourself from source.  
Sorry.



-Original Message-
From: Gethin James [mailto:gja...@nuxeo.com]
Sent: Tuesday, October 24, 2017 12:19 PM
To: dev@tika.apache.org
Subject: Tika 2 parsers

Hi, I am interested in trying the more modular approach of using the Tika 2 
parsers.  Are the Tika 2 artifacts available in a maven repo somewhere?  Is 
there any documentation on how to use them or how they differ from Tika 1?

Thanks,
Gethin.



[jira] [Resolved] (TIKA-2476) Metadata.toString always returns a trailing space

2017-10-11 Thread Sergey Beryozkin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Beryozkin resolved TIKA-2476.

Resolution: Fixed
  Assignee: Sergey Beryozkin

> Metadata.toString always returns a trailing space
> -
>
> Key: TIKA-2476
> URL: https://issues.apache.org/jira/browse/TIKA-2476
> Project: Tika
>  Issue Type: Improvement
>  Components: core
>Affects Versions: 1.16
>Reporter: Sergey Beryozkin
>    Assignee: Sergey Beryozkin
>Priority: Trivial
> Fix For: 1.17
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (TIKA-2472) Implement Metadata.hashCode

2017-10-11 Thread Sergey Beryozkin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Beryozkin resolved TIKA-2472.

Resolution: Fixed

> Implement Metadata.hashCode
> ---
>
> Key: TIKA-2472
> URL: https://issues.apache.org/jira/browse/TIKA-2472
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 1.16
>    Reporter: Sergey Beryozkin
>Assignee: Sergey Beryozkin
>Priority: Trivial
> Fix For: 1.17
>
>






[jira] [Commented] (TIKA-2472) Implement Metadata.hashCode

2017-10-10 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16198709#comment-16198709
 ] 

Sergey Beryozkin commented on TIKA-2472:


This is fixed now...

> Implement Metadata.hashCode
> ---
>
> Key: TIKA-2472
> URL: https://issues.apache.org/jira/browse/TIKA-2472
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 1.16
>    Reporter: Sergey Beryozkin
>Assignee: Sergey Beryozkin
>Priority: Trivial
> Fix For: 1.17
>
>






[jira] [Commented] (TIKA-2472) Implement Metadata.hashCode

2017-10-07 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16195792#comment-16195792
 ] 

Sergey Beryozkin commented on TIKA-2472:


Ken, thanks for the tip, makes sense to follow this path at the Metadata level 
as well

> Implement Metadata.hashCode
> ---
>
> Key: TIKA-2472
> URL: https://issues.apache.org/jira/browse/TIKA-2472
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 1.16
>    Reporter: Sergey Beryozkin
>Assignee: Sergey Beryozkin
>Priority: Trivial
> Fix For: 1.17
>
>






Re: [DISCUSS] Enable specific ContentHandler for tika-server

2017-10-06 Thread Sergey Beryozkin
Konstantin, by the way, if you are interested in having a good 
discussion to do with using the serialized lambdas then you will be 
welcome to comment on the relevant text in the Tika Concerns Beam 
thread, though may be Beam knows how to take care of the issues you 
raised...


Thanks, Sergey
On 06/10/17 18:27, Sergey Beryozkin wrote:

On 06/10/17 18:08, Konstantin Gribov wrote:

My +1 to this idea.

IMHO, second option is more flexible. I also like Nick's suggestion about
using default package for handlers and interpret dot-separated string as
fqcn. Solr does similar thing and it's very convenient to use (but they use
prefix `solr.` for their classes in predefined package and any other is
interpreted as fqcn).

I'll add that you could allow user to pass several comma-separated handlers
to allow build content-handler stack if user wants to.

I would disagree with Sergey about serialized lambdas for 2 reasons:
- it's useful only for java-clients;
- it could bring very nasty bugs leading to RCE class vulnerabilities, so
it's very controversial from security PoV.
Sure. I was not actually suggesting to use them in Tika natively, I only 
referred to it as the alternative mentioned in the context of the Beam 
integration work


Sergey


On Thu, Sep 28, 2017 at 11:35 PM Giuseppe Totaro <totarope...@gmail.com>
wrote:


Hi folks,

if I am not wrong, currently you cannot configure a specific ContentHandler
while using tika-server. I mean that you can configure your own parser [0]
but you cannot control which ContentHandler the parser leverages to extract
text and metadata (e.g., you cannot use PhoneExtractingContentHandler,
StandardsExtractingContentHandler, etc).
If it is correct, it would be nice to enable the use of specific
ContentHandlers within tika-server and I would like to discuss how to solve
this issue generally.

I propose two solutions:

    1. augment the TikaConfig class so that a specific ContentHandler can be
    used in tika-config.xml;
    2. determine the ContentHandler to use for parsing through HTTP headers,
    for example:
    curl -T filename.pdf http://localhost:9998/meta --header
    "X-Content-Handler: PhoneExtractingContentHandler"
    This should affect also the TikaResource.java class.

I look forward to having your feedback. I strongly believe that every user
who wants to use Tika as a service through tika-server and needs to extract
content and metadata like phone numbers, standard references, etc would be
very happy.

Thanks a lot,
Giuseppe




--
Sergey Beryozkin

Talend Community Coders
http://coders.talend.com/


[jira] [Commented] (TIKA-2472) Implement Metadata.hashCode

2017-10-06 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16195303#comment-16195303
 ] 

Sergey Beryozkin commented on TIKA-2472:


I've got a bit of shock with this code:
{code:java}
@Test
public void testIt() {
    // Two maps with equal keys and equal array contents, but distinct array instances
    Map<String, String[]> map1 = new HashMap<String, String[]>();
    map1.put("A", new String[] {"a"});
    Map<String, String[]> map2 = new HashMap<String, String[]>();
    map2.put("A", new String[] {"a"});

    // Arrays inherit identity-based equals/hashCode from Object
    System.out.println(map1.equals(map2));
    System.out.println(map1.hashCode() == map2.hashCode());
}
{code}
I see 'false' printed in both cases, which is obvious really, given that 
arrays use identity-based equals and hashCode.
Eugene, you are right, thanks for being on top of these changes, you'll make me 
a Java champion soon :-)

Guys, should we update Metadata to use List of Strings ? (though it is a sep 
issue)
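
For illustration, here is a minimal sketch of two ways around the identity comparison: comparing the String[] values with Arrays.equals / Arrays.hashCode, or storing List<String> values as suggested above. The class and method names are hypothetical and this is not the actual Metadata implementation.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ArrayValueEquality {

    // Compares two String[]-valued maps by array content rather than identity.
    static boolean contentEquals(Map<String, String[]> a, Map<String, String[]> b) {
        if (a.size() != b.size()) {
            return false;
        }
        for (Map.Entry<String, String[]> e : a.entrySet()) {
            if (!b.containsKey(e.getKey())
                    || !Arrays.equals(e.getValue(), b.get(e.getKey()))) {
                return false;
            }
        }
        return true;
    }

    // Content-based hash that does not depend on entry iteration order.
    static int contentHashCode(Map<String, String[]> m) {
        int h = 0;
        for (Map.Entry<String, String[]> e : m.entrySet()) {
            h += e.getKey().hashCode() ^ Arrays.hashCode(e.getValue());
        }
        return h;
    }

    public static void main(String[] args) {
        Map<String, String[]> map1 = new HashMap<>();
        map1.put("A", new String[] {"a"});
        Map<String, String[]> map2 = new HashMap<>();
        map2.put("A", new String[] {"a"});

        System.out.println(map1.equals(map2));          // false: arrays compared by identity
        System.out.println(contentEquals(map1, map2));  // true: arrays compared by content
        System.out.println(contentHashCode(map1) == contentHashCode(map2)); // true

        // With List<String> values the standard Map.equals is already content-based.
        Map<String, List<String>> list1 = new HashMap<>();
        list1.put("A", Arrays.asList("a"));
        Map<String, List<String>> list2 = new HashMap<>();
        list2.put("A", Arrays.asList("a"));
        System.out.println(list1.equals(list2));        // true
    }
}
```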

> Implement Metadata.hashCode
> ---
>
> Key: TIKA-2472
> URL: https://issues.apache.org/jira/browse/TIKA-2472
> Project: Tika
>      Issue Type: Improvement
>    Affects Versions: 1.16
>Reporter: Sergey Beryozkin
>Assignee: Sergey Beryozkin
>Priority: Trivial
> Fix For: 1.17
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Reopened] (TIKA-2472) Implement Metadata.hashCode

2017-10-06 Thread Sergey Beryozkin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Beryozkin reopened TIKA-2472:


With thanks to Eugene...

> Implement Metadata.hashCode
> ---
>
> Key: TIKA-2472
> URL: https://issues.apache.org/jira/browse/TIKA-2472
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 1.16
>    Reporter: Sergey Beryozkin
>Assignee: Sergey Beryozkin
>Priority: Trivial
> Fix For: 1.17
>
>






[jira] [Commented] (TIKA-2472) Implement Metadata.hashCode

2017-10-06 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16195275#comment-16195275
 ] 

Sergey Beryozkin commented on TIKA-2472:


I'd not qualify it as incorrect but as sub-optimal. And I know how the relevant 
Map hashCode is implemented - I copied that to ParseResult as a temp 
substitution (to be honest it does not really matter how hashCode or even 
equals are implemented if ParseResult will keep a file location which is the 
real key). That said I've no problems with making this code done better

> Implement Metadata.hashCode
> ---
>
> Key: TIKA-2472
> URL: https://issues.apache.org/jira/browse/TIKA-2472
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 1.16
>    Reporter: Sergey Beryozkin
>Assignee: Sergey Beryozkin
>Priority: Trivial
> Fix For: 1.17
>
>






Re: [DISCUSS] Enable specific ContentHandler for tika-server

2017-10-06 Thread Sergey Beryozkin

On 06/10/17 18:08, Konstantin Gribov wrote:

My +1 to this idea.

IMHO, second option is more flexible. I also like Nick's suggestion about
using default package for handlers and interpret dot-separated string as
fqcn. Solr does similar thing and it's very convenient to use (but they use
prefix `solr.` for their classes in predefined package and any other is
interpreted as fqcn).

I'll add that you could allow user to pass several comma-separated handlers
to allow build content-handler stack if user wants to.

I would disagree with Sergey about serialized lambdas for 2 reasons:
- it's useful only for java-clients;
- it could bring very nasty bugs leading to RCE class vulnerabilities, so
it's very controversial from security PoV.
Sure. I was not actually suggesting to use them in Tika natively, I only 
referred to it as the alternative mentioned in the context of the Beam 
integration work
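
To make the naming idea above concrete, here is a rough sketch of how a comma-separated header value could be turned into fully qualified class names: a name without a dot is looked up in a default package, and anything containing a dot is taken as an fqcn. The class name, the method and the choice of org.apache.tika.sax as the default package are assumptions for illustration only; nothing like this exists in tika-server as far as this thread shows.

```java
import java.util.ArrayList;
import java.util.List;

class HandlerNameResolver {

    // Assumed default package; the thread does not settle on one.
    static final String DEFAULT_PACKAGE = "org.apache.tika.sax";

    // Splits a header value such as "A, com.example.B" into class names,
    // prefixing simple names with the default package.
    static List<String> resolve(String headerValue) {
        List<String> names = new ArrayList<>();
        if (headerValue == null) {
            return names;
        }
        for (String raw : headerValue.split(",")) {
            String name = raw.trim();
            if (name.isEmpty()) {
                continue;
            }
            names.add(name.indexOf('.') >= 0 ? name : DEFAULT_PACKAGE + "." + name);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(
                resolve("PhoneExtractingContentHandler, com.example.MyHandler"));
        // prints [org.apache.tika.sax.PhoneExtractingContentHandler, com.example.MyHandler]
    }
}
```

Given the security concern raised above, a real implementation would probably check the resolved names against an allow-list before instantiating anything.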


Sergey


On Thu, Sep 28, 2017 at 11:35 PM Giuseppe Totaro 
wrote:


Hi folks,

if I am not wrong, currently you cannot configure a specific ContentHandler
while using tika-server. I mean that you can configure your own parser [0]
but you cannot control which ContentHandler the parser leverages to extract
text and metadata (e.g., you cannot use PhoneExtractingContentHandler,
StandardsExtractingContentHandler, etc).
If it is correct, it would be nice to enable the use of specific
ContentHandlers within tika-server and I would like to discuss how to solve
this issue generally.

I propose two solutions:

1. augment the TikaConfig class so that a specific ContentHandler can be
used in tika-config.xml;
2. determine the ContentHandler to use for parsing through HTTP headers,
for example:
curl -T filename.pdf http://localhost:9998/meta --header
"X-Content-Handler: PhoneExtractingContentHandler"
This should affect also the TikaResource.java class.

I look forward to having your feedback. I strongly believe that every user
who wants to use Tika as a service through tika-server and needs to extract
content and metadata like phone numbers, standard references, etc would be
very happy.

Thanks a lot,
Giuseppe



[jira] [Resolved] (TIKA-2472) Implement Metadata.hashCode

2017-10-02 Thread Sergey Beryozkin (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-2472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Beryozkin resolved TIKA-2472.

Resolution: Fixed

> Implement Metadata.hashCode
> ---
>
> Key: TIKA-2472
> URL: https://issues.apache.org/jira/browse/TIKA-2472
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 1.16
>    Reporter: Sergey Beryozkin
>Assignee: Sergey Beryozkin
>Priority: Trivial
> Fix For: 1.17
>
>






[jira] [Created] (TIKA-2472) Implement Metadata.hashCode

2017-10-02 Thread Sergey Beryozkin (JIRA)
Sergey Beryozkin created TIKA-2472:
--

 Summary: Implement Metadata.hashCode
 Key: TIKA-2472
 URL: https://issues.apache.org/jira/browse/TIKA-2472
 Project: Tika
  Issue Type: Improvement
Affects Versions: 1.16
Reporter: Sergey Beryozkin
Assignee: Sergey Beryozkin
Priority: Trivial
 Fix For: 1.17








Re: [DISCUSS] Enable specific ContentHandler for tika-server

2017-09-28 Thread Sergey Beryozkin

Hi

Option #1 is also good - a question how to pass a ContentHandler to a 
Beam function was open, and given that passing TikaConfig is needed 
anyway, having a way to specify a handler there can be handy too...


Cheers, Sergey
On 28/09/17 22:17, Chris Mattmann wrote:

I am +1 for this. Option #2 sounds like a slick way to handle this for me that 
would
remain back compat with tika-python which is of strong interest to me.

Cheers,
Chris




On 9/28/17, 1:35 PM, "Giuseppe Totaro"  wrote:

 Hi folks,

 if I am not wrong, currently you cannot configure a specific ContentHandler
 while using tika-server. I mean that you can configure your own parser [0]
 but you cannot control which ContentHandler the parser leverages to extract
 text and metadata (e.g., you cannot use PhoneExtractingContentHandler,
 StandardsExtractingContentHandler, etc).
 If it is correct, it would be nice to enable the use of specific
 ContentHandlers within tika-server and I would like to discuss how to solve
 this issue generally.

 I propose two solutions:

 1. augment the TikaConfig class so that a specific ContentHandler can be
 used in tika-config.xml;
 2. determine the ContentHandler to use for parsing through HTTP headers,
 for example:
 curl -T filename.pdf http://localhost:9998/meta --header
 "X-Content-Handler: PhoneExtractingContentHandler"
 This should affect also the TikaResource.java class.

 I look forward to having your feedback. I strongly believe that every user
 who wants to use Tika as a service through tika-server and needs to extract
 content and metadata like phone numbers, standard references, etc would be
 very happy.

 Thanks a lot,
 Giuseppe
 





Re: TikaIO concerns

2017-09-23 Thread Sergey Beryozkin

Please see comments below, and I'm positive this thread is nearly over :-)
On 22/09/17 22:49, Eugene Kirpichov wrote:

On Fri, Sep 22, 2017 at 2:20 PM Sergey Beryozkin <sberyoz...@gmail.com>
wrote:


Hi,
On 22/09/17 22:02, Eugene Kirpichov wrote:

Sure - with hundreds of different file formats and the abundance of weird /
malformed / malicious files in the wild, it's quite expected that sometimes
the library will crash.

Some kinds of issues are easier to address than others. We can catch
exceptions and return a ParseResult representing a failure to parse this
document. Addressing freezes and native JVM process crashes is much harder
and probably not necessary in the first version.

Sergey - I think, the moment you introduce ParseResult into the code, other
changes I suggested will follow "by construction":
- There'll be 1 ParseResult per document, containing filename, content and
metadata, since per discussion above it probably doesn't make sense to
deliver these in separate PCollection elements

I was still harboring the hope that may be using a container bean like
ParseResult (with the other changes you proposed) can somehow let us
stream from Tika into the pipeline.

If it is 1 ParseResult per document then it means that until Tika has
parsed all the document the pipeline will not see it.


This is correct, and this is the API I'm suggesting to start with, because
it's simple and sufficiently useful. I suggest to get into this state
first, and then deal with creating a separate API that allows to not hold
the entire parse result as a single PCollection element in memory. This
should work fine for cases when each document's parse result (not the input
document itself!) is up to a few hundred megabytes in size.

+1. I was thinking about it yesterday evening and I had to admit I 
had no real clue what I wanted to achieve with the document being 
streamed through the pipeline - partially because my Beam knowledge was 
still pretty limited but also because I had difficulties coming up 
with concrete use cases.

So yes, let's make the 'mainstream' case work well first.




I'm sorry if I may be starting to go in circles. But let me ask this.
How can a Beam user write a Beam function which will ensure the Tika
content pieces are seen ordered by the pipeline, without TikaIO ?


To answer this, I'd need you to clarify what you mean by "seen ordered by
the pipeline" - order is a very vague term when it comes to parallel
processing. What would you like the pipeline to compute that requires order
within a document, but does NOT require having the contents of a document
as a single String?
See above, I don't know :-). The case which I do like, and I'll work on 
a demo at a later stage in a dedicated branch, is what I described 
earlier. I would use, say, FileIO to get me a list of 1000s of matching PDFs, 
run that through Tika(IO), and I'd have a function which will output the 
list of matching PDFs (or other formats). Ex: if someone needs to find 
all the Word docs in a given online library which talk about some 
event. I think it won't matter in this case whether the ordering of the 
individual lines matters or not, we have a link to the file name and 
it's enough...


But I'll return to this favourite case of mine later :-)


Or are you asking simply how can users use Tika for arbitrary use cases
without TikaIO?


I thought later, I was really interested, was it important for any of 
Beam IO's consumers that the individual data chunks come ordered or not, 
and if it was, how that was achieved...Knowing that would help me/us to 
consider what can possibly be done at a later stage


If you'd like to talk about it later then it is OK...

Thanks for the help
Sergey





May be knowing that will help coming up with the idea how to generalize
somehow with the help of TikaIO ?


- Since you're returning a single value per document, there's no reason to
use a BoundedReader
- Likewise, there's no reason to use asynchronicity because you're not
delivering the result incrementally

I'd suggest to start the refactoring by removing the asynchronous codepath,
then converting from BoundedReader to ParDo or MapElements, then converting
from String to ParseResult.

This is a good plan, thanks, I guess at least for small documents it
should work well (unless I've misunderstood a ParseResult idea)

Thanks, Sergey


On Fri, Sep 22, 2017 at 12:10 PM Sergey Beryozkin <sberyoz...@gmail.com>
wrote:


Hi Tim, All
On 22/09/17 18:17, Allison, Timothy B. wrote:

Y, I think you have it right.


Tika library has a big problem with crashes and freezes


I wouldn't want to overstate it.  Crashes and freezes are exceedingly
rare, but when you are processing millions/billions of files in the wild
[1], they will happen.  We fix the problems or try to get our dependencies
to fix the problems when we can,

I only would like to add to this that IMHO it would be more correct to
state 

Re: TikaIO concerns

2017-09-22 Thread Sergey Beryozkin

Hi,
On 22/09/17 22:02, Eugene Kirpichov wrote:

Sure - with hundreds of different file formats and the abundance of weird /
malformed / malicious files in the wild, it's quite expected that sometimes
the library will crash.

Some kinds of issues are easier to address than others. We can catch
exceptions and return a ParseResult representing a failure to parse this
document. Addressing freezes and native JVM process crashes is much harder
and probably not necessary in the first version.

Sergey - I think, the moment you introduce ParseResult into the code, other
changes I suggested will follow "by construction":
- There'll be 1 ParseResult per document, containing filename, content and
metadata, since per discussion above it probably doesn't make sense to
deliver these in separate PCollection elements


I was still harboring the hope that maybe using a container bean like
ParseResult (with the other changes you proposed) could somehow let us
stream from Tika into the pipeline.


If it is 1 ParseResult per document, then the pipeline will not see the
document until Tika has parsed all of it.


I'm sorry if I'm starting to go in circles, but let me ask this:
how can a Beam user write a Beam function which ensures that the Tika
content pieces are seen by the pipeline in order, without TikaIO?


Maybe knowing that will help us come up with an idea of how to generalize
this with the help of TikaIO?



- Since you're returning a single value per document, there's no reason to
use a BoundedReader
- Likewise, there's no reason to use asynchronicity because you're not
delivering the result incrementally

I'd suggest to start the refactoring by removing the asynchronous codepath,
then converting from BoundedReader to ParDo or MapElements, then converting
from String to ParseResult.

This is a good plan, thanks. I guess it should work well at least for small
documents (unless I've misunderstood the ParseResult idea).


Thanks, Sergey


On Fri, Sep 22, 2017 at 12:10 PM Sergey Beryozkin <sberyoz...@gmail.com>
wrote:


Hi Tim, All
On 22/09/17 18:17, Allison, Timothy B. wrote:

Y, I think you have it right.


Tika library has a big problem with crashes and freezes


I wouldn't want to overstate it.  Crashes and freezes are exceedingly
rare, but when you are processing millions/billions of files in the wild
[1], they will happen.  We fix the problems or try to get our dependencies
to fix the problems when we can,

I would only like to add that IMHO it would be more correct to
state that it's not the Tika library's 'fault' that the crashes might occur.
Tika does its best to use the latest libraries that help it parse the
files, but there will always be some file out there that uses
some incomplete format-specific tag, etc., which may cause the specific
parser to spin - but Tika will include the updated parser library asap.

And with Beam's help, the crashes that can kill the Tika jobs completely
will probably become history...

Cheers, Sergey

but given our past history, I have no reason to believe that these
problems won't happen again.


Thank you, again!

Best,

  Tim

[1] Stuff on the internet or ... some of our users are forensics
examiners dealing with broken/corrupted files


P.S./FTR  
1) We've gathered a TB of data from CommonCrawl and we run regression
tests against this TB (thank you, Rackspace for hosting our vm!) to try to
identify these problems.
2) We've started a fuzzing effort to try to identify problems.
3) We added "tika-batch" for robust single box fileshare/fileshare
processing for our low volume users

4) We're trying to get the message out.  Thank you for working with us!!!

-Original Message-
From: Eugene Kirpichov [mailto:kirpic...@google.com.INVALID]
Sent: Friday, September 22, 2017 12:48 PM
To: d...@beam.apache.org
Cc: dev@tika.apache.org
Subject: Re: TikaIO concerns

Hi Tim,
  From what you're saying it sounds like the Tika library has a big
problem with crashes and freezes, and when applying it at scale (eg. in the
context of Beam) requires explicitly addressing this problem, eg. accepting
the fact that in many realistic applications some documents will just need
to be skipped because they are unprocessable? This would be first example
of a Beam IO that has this concern, so I'd like to confirm that my
understanding is correct.


On Fri, Sep 22, 2017 at 9:34 AM Allison, Timothy B. <talli...@mitre.org>
wrote:


Reuven,

Thank you!  This suggests to me that it is a good idea to integrate
Tika with Beam so that people don't have to 1) (re)discover the need
to make their wrappers robust and then 2) have to reinvent these
wheels for robustness.

For kicks, see William Palmer's post on his toe-stubbing efforts with
Hadoop [1].  He and other Tika users independently have wound up
carrying out exactly your recommendation for 1) below.

We have a MockParser that you can get to simulate regular exceptions,
OOMs an

Re: TikaIO concerns

2017-09-22 Thread Sergey Beryozkin

Hi Tim, All
On 22/09/17 18:17, Allison, Timothy B. wrote:

Y, I think you have it right.


Tika library has a big problem with crashes and freezes


I wouldn't want to overstate it.  Crashes and freezes are exceedingly rare, but 
when you are processing millions/billions of files in the wild [1], they will 
happen.  We fix the problems or try to get our dependencies to fix the problems 
when we can,


I would only like to add that IMHO it would be more correct to
state that it's not the Tika library's 'fault' that the crashes might occur.
Tika does its best to use the latest libraries that help it parse the
files, but there will always be some file out there that uses
some incomplete format-specific tag, etc., which may cause the specific
parser to spin - but Tika will include the updated parser library asap.


And with Beam's help, the crashes that can kill the Tika jobs completely
will probably become history...


Cheers, Sergey

but given our past history, I have no reason to believe that these problems 
won't happen again.

Thank you, again!

Best,

 Tim

[1] Stuff on the internet or ... some of our users are forensics examiners 
dealing with broken/corrupted files

P.S./FTR  
1) We've gathered a TB of data from CommonCrawl and we run regression tests 
against this TB (thank you, Rackspace for hosting our vm!) to try to identify 
these problems.
2) We've started a fuzzing effort to try to identify problems.
3) We added "tika-batch" for robust single box fileshare/fileshare processing 
for our low volume users
4) We're trying to get the message out.  Thank you for working with us!!!

-Original Message-
From: Eugene Kirpichov [mailto:kirpic...@google.com.INVALID]
Sent: Friday, September 22, 2017 12:48 PM
To: d...@beam.apache.org
Cc: dev@tika.apache.org
Subject: Re: TikaIO concerns

Hi Tim,
 From what you're saying it sounds like the Tika library has a big problem with 
crashes and freezes, and when applying it at scale (eg. in the context of Beam) 
requires explicitly addressing this problem, eg. accepting the fact that in 
many realistic applications some documents will just need to be skipped because 
they are unprocessable? This would be first example of a Beam IO that has this 
concern, so I'd like to confirm that my understanding is correct.

On Fri, Sep 22, 2017 at 9:34 AM Allison, Timothy B. 
wrote:


Reuven,

Thank you!  This suggests to me that it is a good idea to integrate
Tika with Beam so that people don't have to 1) (re)discover the need
to make their wrappers robust and then 2) have to reinvent these
wheels for robustness.

For kicks, see William Palmer's post on his toe-stubbing efforts with
Hadoop [1].  He and other Tika users independently have wound up
carrying out exactly your recommendation for 1) below.

We have a MockParser that you can get to simulate regular exceptions,
OOMs and permanent hangs by asking Tika to parse a  xml [2].


However if processing the document causes the process to crash, then
it will be retried.
Any ideas on how to get around this?

Thank you again.

Cheers,

Tim

[1]
http://openpreservation.org/blog/2014/03/21/tika-ride-characterising-web-content-nanite/
[2]
https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/test-documents/mock/example.xml



Re: TikaIO concerns

2017-09-22 Thread Sergey Beryozkin
 don't think the result of applying Tika to a single file can be encoded as a 
PCollection element.

Given both of these, I think that it's not possible to create a general-purpose 
TikaIO transform that will be better than manual invocation of Tika as a DoFn 
on the result of FileIO.readMatches().

However, looking at the examples at https://tika.apache.org/1.16/examples.html 
- almost all of the examples involve extracting a single String from each 
document. This use case, with the assumption that individual documents are 
small enough, can certainly be simplified and TikaIO could be a facade for 
doing just this.

E.g. TikaIO could:
- take as input a PCollection<ReadableFile>
- return a PCollection<KV<String, TikaIO.ParseResult>>, where ParseResult is a 
class with properties { String content, Metadata metadata }
- be configured by: a Parser (it implements Serializable so can be specified at pipeline 
construction time) and a ContentHandler whose toString() will go into "content". 
ContentHandler does not implement Serializable, so you can not specify it at construction time 
- however, you can let the user specify either its class (if it's a simple handler like a 
BodyContentHandler) or specify a lambda for creating the handler (SerializableFunction<Void, 
ContentHandler>), and potentially you can have a simpler facade for Tika.parseAsString() - 
e.g. call it TikaIO.parseAllAsStrings().

Example usage would look like:

   PCollection<KV<String, ParseResult>> parseResults =
       p.apply(FileIO.match().filepattern(...))
        .apply(FileIO.readMatches())
        .apply(TikaIO.parseAllAsStrings());

or:

 .apply(TikaIO.parseAll()
 .withParser(new AutoDetectParser())
 .withContentHandler(() -> new BodyContentHandler(new 
ToXMLContentHandler(

You could also have shorthands for letting the user avoid using FileIO directly 
in simple cases, for example:
 p.apply(TikaIO.parseAsStrings().from(filepattern))

This would of course be implemented as a ParDo or even MapElements, and you'll 
be able to share the code between parseAll and regular parse.

On Thu, Sep 21, 2017 at 7:38 AM Sergey Beryozkin 
<sberyoz...@gmail.com<mailto:sberyoz...@gmail.com>> wrote:
Hi Tim
On 21/09/17 14:33, Allison, Timothy B. wrote:

Thank you, Sergey.

My knowledge of Apache Beam is limited -- I saw Davor and Jean-Baptiste's talk 
at ApacheCon in Miami, and I was and am totally impressed, but I haven't had a 
chance to work with it yet.

  From my perspective, if I understand this thread (and I may not!), getting unordered 
text from _a given file_ is a non-starter for most applications.  The implementation 
needs to guarantee order per file, and the user has to be able to link the 
"extract" back to a unique identifier for the document.  If the current 
implementation doesn't do those things, we need to change it, IMHO.


Right now the Tika-related reader does not associate a given text fragment
with the file name, so a function looking at some text and trying to
find where it came from won't be able to do so.

So I asked how to do it in Beam, how to attach some context to a given
piece of data. I hope it can be done, and if not, then perhaps some
improvement can be applied.

Re the unordered text - yes - this is what we currently have with Beam +
TikaIO :-).

The use case I referred to earlier in this thread (upload PDFs, save
the possibly unordered text to Lucene with the file name 'attached', and let
users search for the files containing some words or phrases; this works
OK given that I can see the PDF parser, for example, reporting the lines)
can be supported OK with the current TikaIO (provided we find a way to
'attach' a file name to the flow).

I see, though, that supporting total ordering can be a big deal in other
cases. Eugene, can you please explain how it can be done? Is it
achievable in principle, without the users having to do some custom
coding?


To the question of -- why is this in Beam at all; why don't we let users call 
it if they want it?...

No matter how much we do to Tika, it will behave badly sometimes -- permanent 
hangs requiring kill -9 and OOMs to name a few.  I imagine folks using Beam -- 
folks likely with large batches of unruly/noisy documents -- are more likely to 
run into these problems than your average 
couple-of-thousand-docs-from-our-own-company user. So, if there are things we 
can do in Beam to prevent developers around the world from having to reinvent 
the wheel for defenses against these problems, then I'd be enormously grateful 
if we could put Tika into Beam.  That means:

1) a process-level timeout (because you can't actually kill a thread in Java)
2) a process-level restart on OOM
3) avoid trying to reprocess a badly behaving document
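
A minimal sketch of point 1, assuming the parse runs in a child process (for example a separate JVM) rather than in a thread. ProcessTimeoutSketch, awaitOrKill(), and the Outcome enum are hypothetical names used only for illustration; only the standard java.lang.Process API is used.

```java
import java.util.concurrent.TimeUnit;

final class ProcessTimeoutSketch {
    enum Outcome { COMPLETED, FAILED, TIMED_OUT }

    /**
     * Waits for a child process and kills it if it does not finish in time.
     * A non-zero exit code (for example after an OOM) is reported as FAILED.
     */
    static Outcome awaitOrKill(Process child, long timeout, TimeUnit unit) {
        try {
            if (!child.waitFor(timeout, unit)) {
                child.destroyForcibly(); // forced termination, since a hung parser will not stop on its own
                child.waitFor();         // wait until the child is really gone
                return Outcome.TIMED_OUT;
            }
            return child.exitValue() == 0 ? Outcome.COMPLETED : Outcome.FAILED;
        } catch (InterruptedException e) {
            child.destroyForcibly();
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for child", e);
        }
    }
}
```

With a real child JVM, the Process would come from ProcessBuilder.start(); the command line is left out here because it depends on how the parser is packaged. A caller could record TIMED_OUT and FAILED documents so they are not retried, which touches on point 3.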

If Beam automatically handles those problems, then I'd say, y, let users write 
their own code.  If there is so much as a single configuration knob (and it 
sounds like Beam is against complex configuration

Re: TikaIO concerns

2017-09-22 Thread Sergey Beryozkin
l of the examples
involve extracting a single String from each document. This use case, with
the assumption that individual documents are small enough, can certainly be
simplified and TikaIO could be a facade for doing just this.



E.g. TikaIO could:

- take as input a PCollection<ReadableFile>

- return a PCollection<KV<String, TikaIO.ParseResult>>, where ParseResult
is a class with properties { String content, Metadata metadata }

- be configured by: a Parser (it implements Serializable so can be
specified at pipeline construction time) and a ContentHandler whose
toString() will go into "content". ContentHandler does not implement
Serializable, so you can not specify it at construction time - however, you
can let the user specify either its class (if it's a simple handler like a
BodyContentHandler) or specify a lambda for creating the handler
(SerializableFunction<Void, ContentHandler>), and potentially you can have
a simpler facade for Tika.parseAsString() - e.g. call it
TikaIO.parseAllAsStrings().



Example usage would look like:

   PCollection<KV<String, ParseResult>> parseResults =
       p.apply(FileIO.match().filepattern(...))
        .apply(FileIO.readMatches())
        .apply(TikaIO.parseAllAsStrings());

or:

 .apply(TikaIO.parseAll()
 .withParser(new AutoDetectParser())
 .withContentHandler(() -> new BodyContentHandler(new
ToXMLContentHandler(

You could also have shorthands for letting the user avoid using FileIO
directly in simple cases, for example:

 p.apply(TikaIO.parseAsStrings().from(filepattern))



This would of course be implemented as a ParDo or even MapElements, and
you'll be able to share the code between parseAll and regular parse.



On Thu, Sep 21, 2017 at 7:38 AM Sergey Beryozkin <sberyoz...@gmail.com>
wrote:

Hi Tim
On 21/09/17 14:33, Allison, Timothy B. wrote:

Thank you, Sergey.

My knowledge of Apache Beam is limited -- I saw Davor and
Jean-Baptiste's talk at ApacheCon in Miami, and I was and am totally
impressed, but I haven't had a chance to work with it yet.

From my perspective, if I understand this thread (and I may not!),
getting unordered text from _a given file_ is a non-starter for most
applications.  The implementation needs to guarantee order per file, and
the user has to be able to link the "extract" back to a unique identifier
for the document.  If the current implementation doesn't do those things,
we need to change it, IMHO.



Right now the Tika-related reader does not associate a given text fragment
with the file name, so a function looking at some text and trying to
find where it came from won't be able to do so.

So I asked how to do it in Beam, how to attach some context to a given
piece of data. I hope it can be done, and if not, then perhaps some
improvement can be applied.

Re the unordered text - yes - this is what we currently have with Beam +
TikaIO :-).

The use case I referred to earlier in this thread (upload PDFs, save
the possibly unordered text to Lucene with the file name 'attached', and let
users search for the files containing some words or phrases; this works
OK given that I can see the PDF parser, for example, reporting the lines)
can be supported OK with the current TikaIO (provided we find a way to
'attach' a file name to the flow).

I see, though, that supporting total ordering can be a big deal in other
cases. Eugene, can you please explain how it can be done? Is it
achievable in principle, without the users having to do some custom
coding?


To the question of -- why is this in Beam at all; why don't we let users
call it if they want it?...

No matter how much we do to Tika, it will behave badly sometimes --
permanent hangs requiring kill -9 and OOMs to name a few.  I imagine folks
using Beam -- folks likely with large batches of unruly/noisy documents --
are more likely to run into these problems than your average
couple-of-thousand-docs-from-our-own-company user. So, if there are things
we can do in Beam to prevent developers around the world from having to
reinvent the wheel for defenses against these problems, then I'd be
enormously grateful if we could put Tika into Beam.  That means:

1) a process-level timeout (because you can't actually kill a thread in
Java)
2) a process-level restart on OOM
3) avoid trying to reprocess a badly behaving document

If Beam automatically handles those problems, then I'd say, y, let users
write their own code.  If there is so much as a single configuration knob
(and it sounds like Beam is against complex configuration...yay!) to get
that working in Beam, then I'd say, please integrate Tika into Beam.  From
a safety perspective, it is critical to keep the extraction process
entirely separate (jvm, vm, m, rack, data center!) from the
transformation+loading steps.  IMHO, very few devs realize this because
Tika works well lots of the time...which is why it is critical for us to
make it easy for people to get it right all of the time.

Re: TikaIO concerns

2017-09-21 Thread Sergey Beryozkin

Hi Eugene

Thank you, very helpful. Let me read it a few times before I work out
exactly what I need to clarify :-). Two questions so far:


On 21/09/17 21:40, Eugene Kirpichov wrote:

Thanks all for the discussion. It seems we have consensus that both
within-document order and association with the original filename are
necessary, but currently absent from TikaIO.

*Association with original file:*
Sergey - Beam does not *automatically* provide a way to associate an
element with the file it originated from: automatically tracking data
provenance is a known very hard research problem on which many papers have
been written, and obvious solutions are very easy to break. See related
discussion at
https://lists.apache.org/thread.html/32aab699db3901d9f0191ac7dbc0091b31cb8be85eee6349deaee671@%3Cuser.beam.apache.org%3E
  .

If you want the elements of your PCollection to contain additional
information, you need the elements themselves to contain this information:
the elements are self-contained and have no metadata associated with them
(beyond the timestamp and windows, universal to the whole Beam model).

*Order within a file:*
The only way to have any kind of order within a PCollection is to have the
elements of the PCollection contain something ordered, e.g. have a
PCollection<List>, where each List is for one file [I'm assuming
Tika, at a low level, works on a per-file basis?]. However, since TikaIO
can be applied to very large files, this could produce very large elements,
which is a bad idea. Because of this, I don't think the result of applying
Tika to a single file can be encoded as a PCollection element.

Given both of these, I think that it's not possible to create a
*general-purpose* TikaIO transform that will be better than manual
invocation of Tika as a DoFn on the result of FileIO.readMatches().

However, looking at the examples at
https://tika.apache.org/1.16/examples.html - almost all of the examples
involve extracting a single String from each document. This use case, with
the assumption that individual documents are small enough, can certainly be
simplified and TikaIO could be a facade for doing just this.

E.g. TikaIO could:
- take as input a PCollection<ReadableFile>
- return a PCollection<KV<String, TikaIO.ParseResult>>, where ParseResult
is a class with properties { String content, Metadata metadata }


And what is the 'String' in KV<String,...>, given that TikaIO.ParseResult
represents the content + (Tika) Metadata of the file, such as the author
name, etc.? Is it the file name?

- be configured by: a Parser (it implements Serializable so can be
specified at pipeline construction time) and a ContentHandler whose
toString() will go into "content". ContentHandler does not implement
Serializable, so you can not specify it at construction time - however, you
can let the user specify either its class (if it's a simple handler like a
BodyContentHandler) or specify a lambda for creating the handler
(SerializableFunction<Void, ContentHandler>), and potentially you can have
a simpler facade for Tika.parseAsString() - e.g. call it
TikaIO.parseAllAsStrings().
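
The "specify a lambda for creating the handler" idea from the last bullet can be sketched with the JDK alone, since ContentHandler is the SAX interface that ships with Java. HandlerFactory and HandlerFactoryDemo are hypothetical names (Beam's own SerializableFunction would play this role in a real transform), and DefaultHandler stands in for a Tika handler such as BodyContentHandler. The round trip through Java serialization only mimics what a runner might do when shipping a transform to workers.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.function.Supplier;
import org.xml.sax.ContentHandler;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical factory type: a Supplier that can be serialized with the transform. */
interface HandlerFactory extends Supplier<ContentHandler>, Serializable {}

final class HandlerFactoryDemo {
    /** Serializes and deserializes a value, standing in for shipping it to a worker. */
    @SuppressWarnings("unchecked")
    static <T extends Serializable> T roundTrip(T value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(value);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (T) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("serialization round trip failed", e);
        }
    }

    /** The factory is serializable even though the handlers it creates are not. */
    static HandlerFactory defaultFactory() {
        return () -> new DefaultHandler();
    }
}
```

The design point is that only the factory crosses the serialization boundary; each document then gets a fresh handler from factory.get(), so no handler state leaks between documents.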

Example usage would look like:

   PCollection<KV<String, ParseResult>> parseResults =
       p.apply(FileIO.match().filepattern(...))
        .apply(FileIO.readMatches())
        .apply(TikaIO.parseAllAsStrings());

or:

 .apply(TikaIO.parseAll()
 .withParser(new AutoDetectParser())
 .withContentHandler(() -> new BodyContentHandler(new
ToXMLContentHandler(

You could also have shorthands for letting the user avoid using FileIO
directly in simple cases, for example:
 p.apply(TikaIO.parseAsStrings().from(filepattern))

This would of course be implemented as a ParDo or even MapElements, and
you'll be able to share the code between parseAll and regular parse.

OK. What about the current source on master: should it be marked
Experimental until I manage to write something new with the above ideas
in mind? Or is there enough time before 2.2.0 gets released?


Thanks, Sergey

On Thu, Sep 21, 2017 at 7:38 AM Sergey Beryozkin <sberyoz...@gmail.com>
wrote:


Hi Tim
On 21/09/17 14:33, Allison, Timothy B. wrote:

Thank you, Sergey.

My knowledge of Apache Beam is limited -- I saw Davor and
Jean-Baptiste's talk at ApacheCon in Miami, and I was and am totally
impressed, but I haven't had a chance to work with it yet.

From my perspective, if I understand this thread (and I may not!),
getting unordered text from _a given file_ is a non-starter for most
applications.  The implementation needs to guarantee order per file, and
the user has to be able to link the "extract" back to a unique identifier
for the document.  If the current implementation doesn't do those things,
we need to change it, IMHO.



Right now Tika-related reader does not associate a given text fragment
with the file name, so a function looking at some text and trying to
find where it came from won't be able to do so.

S
