Re: Bump dependabot to weekly?

2024-04-29 Thread Eric Pugh
I like having less noise!  And if you don’t get today’s AWS update, well, it will 
show up in a few days anyway, given their relentless release cycle!


> On Apr 29, 2024, at 10:47 AM, Tilman Hausherr  wrote:
> 
> The positive side is that there are fewer interruptions.
> One negative side is that there seems to be a maximum number of updates it
> reports. Today it didn't report the AWS update, which was detected in the past.
> Tilman
> 
> On 29.04.2024 16:34, Tim Allison wrote:
>> The move to weekly dependabot has been a bit of a relief for me personally.
>> Our mailing list isn't clogged with daily dependabot updates (and yes, I know I
>> can apply a filter :/).
>> 
>> How is it working for everyone else?
>> 
>> On Wed, Apr 10, 2024 at 4:09 PM Tim Allison <talli...@apache.org> wrote:
>> 
>>>> you start deleting them reflexively out of your email!
>>> Not Tilman!!!
>>> 
>>> Let's move to weekly and see how that works?
>>> 
>>> On Wed, Apr 10, 2024 at 3:57 PM Eric Pugh <ep...@opensourceconnections.com>
>>> wrote:
>>>> Hence why I like the monthly unless it’s a special case….  The flood of
>>>> updates just means you start deleting them reflexively out of your email!
>>>> Now, if you have a dependency and you’re maybe actively working on it, and
>>>> it’s changing quickly, then that might be an argument for daily.
>>>>> On Apr 10, 2024, at 12:53 PM, Tilman Hausherr wrote:
>>>>> I'm fine with daily because this way we can learn ASAP if there are
>>>>> troubles with new dependency versions, although I'm now too busy.
>>>>> Tilman
>>>>> 
>>>>> 
>>>>> 
>>>>> -- Original Message --
>>>>> From: Tim Allison 
>>>>> Subject: Bump dependabot to weekly?
>>>>> Date: 10.04.2024, 18:08
>>>>> To:  
>>>>> 
>>>>> All,
>>>>>  Tilman has been doing heroic work keeping us up to date with
>>>>> dependabot's PRs. Given our pace of releases, would it make sense to
>>>>> back off to weekly updates?
>>>>>  Before running regression tests, we'd run the update plugin to make
>>>>> sure that we're up to date.
>>>>>  What do you think?
>>>>> 
>>>>>Best,
>>>>> 
>>>>> Tim
>>>>> 

___
Eric Pugh | Founder | OpenSource Connections, LLC | 434.466.1467 | 
http://www.opensourceconnections.com <http://www.opensourceconnections.com/> | 
My Free/Busy <http://tinyurl.com/eric-cal>  
Co-Author: Apache Solr Enterprise Search Server, 3rd Ed 
<https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>

This e-mail and all contents, including attachments, is considered to be 
Company Confidential unless explicitly stated otherwise, regardless of whether 
attachments are marked as such.



Re: Bump dependabot to weekly?

2024-04-10 Thread Eric Pugh
Hence why I like the monthly unless it’s a special case….  The flood of updates 
just means you start deleting them reflexively out of your email!   Now, if you 
have a dependency and you’re maybe actively working on it, and it’s changing 
quickly, then that might be an argument for daily.

> On Apr 10, 2024, at 12:53 PM, Tilman Hausherr  wrote:
> 
> I'm fine with daily because this way we can learn ASAP if there are troubles 
> with new dependency versions, although I'm now too busy.
> 
> Tilman 
> 
> 
> 
> -- Original Message --
> From: Tim Allison 
> Subject: Bump dependabot to weekly?
> Date: 10.04.2024, 18:08
> To:  
> 
> All,
>  Tilman has been doing heroic work keeping us up to date with
> dependabot's PRs. Given our pace of releases, would it make sense to
> back off to weekly updates?
>  Before running regression tests, we'd run the update plugin to make
> sure that we're up to date.
>  What do you think?
> 
>Best,
> 
> Tim
> 




Re: Bump dependabot to weekly?

2024-04-10 Thread Eric Pugh
Or even monthly?   Some projects release so frequently that you get many 
upgrades between release cycles, so it feels more treadmill-ish….   

On the Quepid project I changed it to run on the first day of the month, and 
that’s been plenty ;-).




> On Apr 10, 2024, at 12:08 PM, Tim Allison  wrote:
> 
> All,
>  Tilman has been doing heroic work keeping us up to date with
> dependabot's PRs. Given our pace of releases, would it make sense to
> back off to weekly updates?
>  Before running regression tests, we'd run the update plugin to make
> sure that we're up to date.
>  What do you think?
> 
>Best,
> 
>     Tim




Re: Document chunking

2024-04-09 Thread Eric Pugh
Your approach sounds great as well, Nick….

> On Apr 9, 2024, at 2:21 AM, Michael Wechner  wrote:
> 
> Thanks for sharing your approach!
> 
> Do you already have some code to share?
> 
> Today I read about https://github.com/infiniflow/ragflow which might also 
> have some interesting chunking approaches.
> 
> Thanks
> 
> Michael
> 
> Am 09.04.24 um 01:25 schrieb Nick Burch:
>> On Mon, 8 Apr 2024, Tim Allison wrote:
>>> Not sure we should jump on the bandwagon, but anything we can do to support 
>>> smart chunking would benefit us.
>>> 
>>> Could just be more integrations with parsers that turn out to be useful. I
>>> haven’t had much joy with some. Here’s one that I haven’t evaluated yet:
>>> https://github.com/Filimoa/open-parse
>> 
>> I played around with chunking a bit late last year, but owing to not getting 
>> any of the AI jobs I went for, I didn't get it beyond a rough prototype. I can 
>> say that most people are doing a terrible job in their out-of-the-box 
>> configs...
>> 
>> My current suggested (but not fully tested) approach is:
>>  * Define a range of chunk sizes that you'd like (min / ideal / max)
>>  * Parse as XHTML with Tika
>>  * Keep track of headings and table headers
>>  * Break on headings
>>  * If a chunk is too big, break on other elements (eg div or p)
>>  * If a chunk is too small, and near other small chunks, join them
>>  * Include 1-2 headings above the current one at the top,
>>    as a targeted bit of Table of Contents. (eg chunk starts on H3, put
>>    the H2 in as well)
>>  * If you broke up a huge table, repeat the table headers at the
>>    start of every chunk
>>  * When you're done chunking + adding bits back at the top, convert
>>    to markdown on output
>> 
>> Happy to explain more! But sadly lacking time right now to do much on that
>> 
>> Nick
> 
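
A rough sketch of the heading-based steps in Nick's list, for anyone who wants something concrete to poke at. This is not Nick's prototype and it does not call Tika: it assumes the XHTML has already been flattened into a list of blocks (headings and body text). The names HeadingChunker, Block, MIN_CHARS and MAX_CHARS, and the character limits themselves, are made up for illustration. Repeating table headers is left out, and headings are written with markdown-style markers.

```java
import java.util.ArrayList;
import java.util.List;

class HeadingChunker {
    /** One flattened XHTML element: level 1-6 for h1-h6, 0 for body text (p, div, ...). */
    record Block(int level, String text) {}

    // Hypothetical size limits in characters; tune for your embedding model.
    static final int MIN_CHARS = 30;
    static final int MAX_CHARS = 120;

    static List<String> chunk(List<Block> blocks) {
        List<String> chunks = new ArrayList<>();
        String[] open = new String[7];          // latest heading text seen at each level
        String context = "";                    // heading lines repeated at the top of a chunk
        StringBuilder body = new StringBuilder();
        for (Block b : blocks) {
            if (b.level() > 0) {
                emit(chunks, context, body);    // break on headings
                open[b.level()] = b.text();
                for (int l = b.level() + 1; l < open.length; l++) open[l] = null;
                context = contextFor(open, b.level());
            } else {
                // Chunk would get too big: break on the element boundary instead.
                if (body.length() > 0
                        && context.length() + body.length() + b.text().length() > MAX_CHARS) {
                    emit(chunks, context, body);
                }
                body.append(b.text()).append('\n');
            }
        }
        emit(chunks, context, body);
        return mergeSmall(chunks);
    }

    /** The nearest parent heading (if any) plus the current heading, as markdown lines. */
    static String contextFor(String[] open, int level) {
        StringBuilder sb = new StringBuilder();
        for (int l = level - 1; l >= 1; l--) {
            if (open[l] != null) {
                sb.append("#".repeat(l)).append(' ').append(open[l]).append('\n');
                break;
            }
        }
        return sb.append("#".repeat(level)).append(' ').append(open[level]).append('\n').toString();
    }

    /** Adds the pending body as a chunk; a heading with no body text of its own emits nothing. */
    static void emit(List<String> chunks, String context, StringBuilder body) {
        if (body.length() > 0) {
            chunks.add(context + body);
            body.setLength(0);
        }
    }

    /** Joins a small chunk onto the previous one when both are below MIN_CHARS. */
    static List<String> mergeSmall(List<String> chunks) {
        List<String> merged = new ArrayList<>();
        for (String c : chunks) {
            int last = merged.size() - 1;
            if (last >= 0 && merged.get(last).length() < MIN_CHARS && c.length() < MIN_CHARS) {
                merged.set(last, merged.get(last) + c);
            } else {
                merged.add(c);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Block> doc = List.of(
                new Block(1, "Guide"), new Block(2, "Install"),
                new Block(0, "Download the jar."), new Block(0, "Run it."),
                new Block(2, "Usage"), new Block(0, "Call detect."));
        chunk(doc).forEach(c -> System.out.println(c + "---"));
    }
}
```

Counting characters keeps the sketch dependency-free; a real version would more likely count tokens, and would need the min / ideal / max range from the first bullet rather than just two limits.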




Re: Replace baseline language detection in tika-server and tika-app in 3.x?

2024-04-08 Thread Eric Pugh
Time to move on?   Lucene 10 will be on 17+, Solr 10 will be on 17+, OpenNLP is 
already there….Java 11 is EOL and has been for a while….   

Any other file parsers that are being optimized to take advantage of the newer 
features that are in recent Java versions that we know about?   

> On Apr 8, 2024, at 7:02 AM, Tim Allison  wrote:
> 
> Sorry, more correctly:
> 
> OpenNLP is effectively EOL'd for our 3.x because OpenNLP >= 2.3.0
> requires Java 17 and our 3.x is still on 11.
> 
> On Mon, Apr 8, 2024 at 6:30 AM Tim Allison  wrote:
>> 
>> All,
>>  As Brian pointed out, optimaize is no longer maintained, and it has
>> some dependencies that have aged out. Should we replace our baseline
>> langdetect in tika-app and tika-server in 3.x?
>>  I'd say that we should go with our OpenNLP based language detection,
>> but that, too, is effectively EOL'd because OpenNLP >= 2.3.0 requires
>> Java 17.
>>  Thoughts?
>> 
>>Best,
>> 
>>Tim
>> 
>> -- Forwarded message -
>> From: Brian Laskey 
>> Date: Fri, Mar 8, 2024 at 2:38 PM
>> Subject: RE: Replacing full tika-app.jar to directly using tiki-core /
>> and parsers
>> To: u...@tika.apache.org 
>> 
>> 
>> Hi Tim
>> 
>> 
>> 
>> Thanks this is helpful.
>> 
>> 
>> 
>> For tika-app we found the dependency on org.apache.tika »
>> tika-langdetect-optimaize brings in some older 3rd party jars, and
>> unfortunately it appears that the com.optimaize.languagedetector »
>> language-detector 0.6 is unmaintained so its dependencies on
>> vulnerable versions of guava (18.0) cause us problems with security
>> scans. I could be wrong but I don’t believe we need this component for
>> our usage of just detect and parse?
>> 
>> 
>> 
>> We have a sort of microservice process (java based) which is ingesting
>> files parsed from tika. It was nice that we could separate the tika
>> process in its own heap space as a separate java process rather than
>> adding it to our app, but I suppose we could work around that
>> 
>> 
>> 
>> Thank you
>> 
>> Brian Laskey
>> 
>> 
>> 
>> From: Tim Allison 
>> Reply-To: "u...@tika.apache.org" 
>> Date: Friday, March 8, 2024 at 9:44 AM
>> To: "u...@tika.apache.org" 
>> Subject: [EXTERNAL] Re: Replacing full tika-app.jar to directly using
>> tiki-core / and parsers
>> 
>> 
>> 
>> Hi Brian,
>> 
>>  A few thoughts:
>> 
>> 
>> 
>> 1) tika-app is basically tika-core + tika-parsers-standard-package.
>> Which components are you trying to avoid? tika-serialization and
>> jackson? boilerpipecontenthandler and some of its dependencies? I ask,
>> because we could factor out a tika-app-core with no parsers in Tika
>> 3.x, which is what we do now with tika-server-core and
>> tika-server-standard.
>> 
>> 
>> 
>> 2) Unrelated, there are probably more efficient ways of running Tika
>> than calling it per file on the commandline. That is a robust option,
>> at least!
>> 
>> 
>> 
>> If all you want is detect and text extraction, and you want to run it
>> from the commandline, write two classes, whose main()s call:
>> 
>> System.out.println(new Tika().detect(new File(args[0])));
>> 
>> 
>> 
>> or
>> 
>> 
>> 
>> System.out.println(new Tika().parseToString(new File(args[0])));
>> 
>> 
>> 
>> On Thu, Mar 7, 2024 at 5:04 PM Brian Laskey  wrote:
>> 
>> Hello Tika community,
>> 
>> 
>> 
>> Our team is migrating away from usage of tika-app.jar (2.6 currently)
>> to something with more minimal third party dependencies which we can
>> control.
>> 
>> 
>> 
>> Is there any good documentation or pathway to describe how a team
>> could map the tika-app functionality we use to the same behavior using
>> just tika-core and tika-parsers-standard-package
>> 
>> (I assume)?
>> 
>> 
>> 
>> The tika-app functions we use today are:
>> 
>> 
>> 
>> Mime-type detection
>> 
>> java -jar tika-app.jar -d 
>> 
>> 
>> 
>> and
>> 
>> Text extraction attempts
>> 
>> java -jar tika-app.jar -t 
>> 
>> 
>> 
>> Is there a subset of tika parser jars we would need to include to have
>> equivalent functionality if we wrote our own wrapper main class?
>> 
>> 
>> 
>> Thank you,
>> 
>> Brian Laskey




Re: Support page?

2023-05-12 Thread Eric Pugh
Content definitely gets out of date.   That Solr page probably needs a bit more 
of a roll call….  I could imagine posting on that page a “We’re moving this 
page from Confluence to a new location, please speak up if you want to be 
included” notice, and using that as a filter.

As for the basic idea, though, I think it’s a great one.

Eric


> On May 12, 2023, at 11:20 AM, Ken Krugler  wrote:
> 
> Hi Tim,
> 
> In general it’s helpful for users of a project to be able to more easily find 
> (paid) help.
> 
> But there is the issue of stale information. For example, SemanticAnalyzer 
> <http://semanticanalyzer.info/> seems to be dead now. I’m pretty sure there 
> are more.
> 
> I wish there was an easy way to require everyone to re-register every year, 
> and drive this from some DB-centric backend.
> 
> Though that sounds like an Apache Infra thing :)
> 
> Though maybe a mark-down page in the Git repo could also work - haven’t spent 
> much time thinking about this...
> 
> — Ken
> 
> 
>> On May 12, 2023, at 5:50 AM, Tim Allison  wrote:
>> 
>> All,
>> I was chatting with Eric Pugh this morning, and he mentioned that
>> Tika doesn't have an equivalent to this page:
>> https://cwiki.apache.org/confluence/display/solr/Support
>> I realize there are sensitivities about corporate connections with
>> ASF projects. I'd want to copy the header pretty much literally about
>> no endorsements, etc.
>> What would you think of adding something similar to our wiki or our website?
>> 
>>   Best,
>> 
>>   Tim
> 
> --
> Ken Krugler
> http://www.scaleunlimited.com
> Custom big data solutions
> Flink, Pinot, Solr, Elasticsearch
> 
> 
> 

___
Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | 
http://www.opensourceconnections.com <http://www.opensourceconnections.com/> | 
My Free/Busy <http://tinyurl.com/eric-cal>  
Co-Author: Apache Solr Enterprise Search Server, 3rd Ed 
<https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>

This e-mail and all contents, including attachments, is considered to be 
Company Confidential unless explicitly stated otherwise, regardless of whether 
attachments are marked as such.



Re: docker versions?

2022-10-27 Thread Eric Pugh
Would it make sense to just release new versions of Tika when you want to 
release new versions of the Docker image?

> On Oct 27, 2022, at 3:55 PM, Tim Allison  wrote:
> 
> With TIKA-3906, we added a "docker version" number to the tika version,
> e.g. 2.5.0.1.  When the next version of Tika comes out, say 2.6.0, should
> we start with a four digit docker version, e.g. 2.6.0.0 for our docker
> releases or should we go back to three digits?
> 
> Thank you, all!
> 
> Cheers,
> 
>  Tim




Re: [DISCUSS] support for Java 8?

2022-03-25 Thread Eric Pugh
If Java 11 makes life easier, then shipping it for Tika 2 would make sense to 
me.   Java 8 is well….old….

> On Mar 25, 2022, at 12:04 PM, Tilman Hausherr  wrote:
> 
> Weak +1 for keeping java 8 because it's long term supported by Oracle.
> Tilman
> 
> Am 25.03.2022 um 15:46 schrieb Tim Allison:
>> All,
>>   I'm somewhat interested in moving to require Java 11 to clean up
>> some dependency stuff.  This is not a burning need.
>>  I wanted to get a sense from our community. Do we still need to
>> support 8?  If so, for how long?
>> 
>>   Cheers,
>> 
>> Tim
> 
> 




Re: [DRAFT] Dedicated ANNOUNCE for Tika 1.x EoL?

2022-02-10 Thread Eric Pugh
Likewise looks great.


> On Feb 10, 2022, at 8:05 PM, Lewis John McGibbney  wrote:
> 
> This looks great Tim.
> 
> On 2022/02/09 15:51:39 Tim Allison wrote:
>> What do you think?
>> 
>> Subject: [ANNOUNCE] Apache Tika 1.x End-Of-Life (EOL) announcement
>> 
>> The Apache Tika Project Team would like to inform you that the Apache Tika
>> 1.x branch is now in security-only maintenance until September 30, 2022.
>> After that date, we will not make updates or releases from our 1.x branch.
>> We will continue to make security fixes and security-related
>> dependency upgrades in our 1.x branch as necessary until September 30,
>> 2022.
>> 
>> We initially announced this on our website on December 16, 2021 with
>> the release of Tika 2.2.0: https://tika.apache.org/
>> 
>> Questions and Answers:
>> 
>> With the announcement of Tika 1.x EoL, what happens to
>> Tika 1.x resources?
>> 
>> All resources will stay where they are. Users will still
>> be able to download source code from our branch_1x branch from
>> github[1]; and published artifacts will remain available on
>> maven central and in the Apache archives[2].
>> 
>> [1] https://github.com/apache/tika/tree/branch_1x
>> [2] https://archive.apache.org/dist/tika/
>> 
>> Is there an immediate need to upgrade to Tika 2.x in my projects?
>> 
>> As of today, there aren't known vulnerabilities affecting the
>> soon-to-be-released Tika 1.28.1.  However, considering that there are
>> several breaking changes in the 2.x branch, we encourage making the
>> migration soon to allow time to adjust your client code as
>> necessary.  For up-to-date documentation on migrating to 2.x, see [3].
>> 
>> [3] https://cwiki.apache.org/confluence/display/TIKA/Migrating+to+Tika+2.0.0
>> 
>> My friends / colleagues and I would like to see Tika 1.x being
>> maintained after September 30, 2022. What can we do?
>> 
>> You may fork the existing source and support it on your own.
>> 
>> Kind regards
>> -
>> The Apache Tika Team
>> 
>> On Wed, Feb 9, 2022 at 10:06 AM Tim Allison  wrote:
>>> 
>>> +1
>>> 
>>> And here's a model for what it could look like:
>>> https://lists.apache.org/thread/zz3v90hd1ycrhfvy76n1crsn26sydhmq
>>> 
>>> On Wed, Feb 9, 2022 at 10:03 AM lewis john mcgibbney  
>>> wrote:
>>>> 
>>>> Hi dev@,
>>>> We have more than six months until the official EoL date for Tika 1.x.
>>>> Tim mentioned that some narrative was provided in the recent release
>>>> announcement but I think we could help ourselves by explicitly sending a
>>>> dedicated 1.x EoL ANNOUNCEMENT.
>>>> … this assumes that such an email would be moderated through.
>>>> I say it’s worth a bash.
>>>> Any comments?
>>>> Thanks
>>>> lewismc
>>>> --
>>>> http://home.apache.org/~lewismc/
>>>> http://people.apache.org/keys/committer/lewismc
>> 




Re: [DISCUSS] upgrading log4j to to log4j2 in Tika's 1.x branch

2021-12-13 Thread Eric Pugh
Isn’t the point of Tika 2 that we no longer work on Tika 1?   Does the 
Tika community have enough developer bandwidth to continue to maintain Tika 1 
while also pushing forward on Tika 2?

I worry that we’ll fall into the situation where people just end up using Tika 
1 forever, especially if it keeps getting new updates, which then encourages 
folks not to move to Tika 2.




> On Dec 13, 2021, at 2:49 PM, Tim Allison  wrote:
> 
> Sounds like 2 +1 to my -0. :D  I'll start working on this now.
> 
> On Mon, Dec 13, 2021 at 2:09 PM Nicholas DiPiazza
>  wrote:
>> 
>> I prefer upgrade to log4j2
>> 
>> On Mon, Dec 13, 2021, 12:05 PM Tim Allison  wrote:
>> 
>>> All,
>>>  I'm currently in the process of building the rc1 for Tika 2.x. On
>>> TIKA-3616, Luís Filipe Nassif asked if we could upgrade log4j to
>>> log4j2 in the 1.x branch.  I think we avoided that because it would be
>>> a breaking change(?).  There are security vulns in log4j and it hit
>>> EOL in August 2015.
>>>  Should we upgrade the Tika 1.x branch for log4j2?
>>> 
>>>  Best,
>>> 
>>>   Tim
>>> 
>>> 
>>> [1]
>>> https://issues.apache.org/jira/browse/TIKA-3616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17457595#comment-17457595
>>> 




Re: Proposed topics for next Tika meetups?

2021-11-09 Thread Eric Pugh
As far as question 1 goes, anything more than once is infinitely better ;-).   
I’d be happy with quarterly, as that would be a lot more than what we do today!

As far as question 2 goes, I think you could do an agenda of:

Get to Know the Users
Presentation
Open Discussion

You have tika-pipes, which I was interested in.   I’d love to learn how many 
folks use Tika in Solr as well and discuss what the future of Tika in Solr is.

> On Nov 9, 2021, at 2:00 PM, Tim Allison  wrote:
> 
> All,
>   Many thanks to those who attended today.  It was great to e-meet
> old friends and users from around the world.  Many thanks to Lewis
> McGibbney for getting the ball rolling on these.
>   Let's use this thread to discuss possible topics and scheduling for
> the next meetups?
> 
> Question 1: Pace...one a month or so?
> 
> Question 2: Topics?
> a) tika-pipes hands-on workshop
> b) get to know the users -- 5 minute go-around the room "this is how
> we use it; these are our pain points"
> c) ???
> 
>  Again, thank you!
> 
>   Best,
> 
>  Tim




[jira] [Created] (TIKA-3497) Update README for installing Tika Server as a service for 2.0 release

2021-07-24 Thread David Eric Pugh (Jira)
David Eric Pugh created TIKA-3497:
-

             Summary: Update README for installing Tika Server as a service for 2.0 release
                 Key: TIKA-3497
                 URL: https://issues.apache.org/jira/browse/TIKA-3497
             Project: Tika
          Issue Type: Improvement
          Components: server
    Affects Versions: 2.0.0-ALPHA
            Reporter: David Eric Pugh
             Fix For: 2.0.1


Some small tweaks after manually testing the scripts.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (TIKA-3495) parent-child in solr emitter doesn't seem to include parent id (_nest_parent_)

2021-07-23 Thread David Eric Pugh (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17386334#comment-17386334
 ] 

David Eric Pugh commented on TIKA-3495:
---

Looking at that JSON file you linked to, _nest_parent_ is of type text_simple, 
but in the docs it is listed as being a string type..?   I don't see _nest_path_ 
in what you shared in the attachment.

> parent-child in solr emitter doesn't seem to include parent id (_nest_parent_)
> --
>
>                 Key: TIKA-3495
>                 URL: https://issues.apache.org/jira/browse/TIKA-3495
>             Project: Tika
>          Issue Type: Bug
>          Components: tika-pipes
>    Affects Versions: 2.0.0
>            Reporter: Tim Allison
>            Priority: Minor
>         Attachments: Screenshot from 2021-07-23 11-21-38.png, Screenshot from 
> 2021-07-23 11-22-02.png, Screenshot from 2021-07-23 11-45-33.png
>
>
> I'm trying to draft examples of indexing parent-child relationships with the 
> Solr emitter on the tika-pipes wiki page.  With the latest 2.0.1-SNAPSHOT 
> build, I'm not seeing any of the following fields populated: _root_, 
> _nest_parent_, _nest_path_.
> Should these be auto-populated by Solrj or do we need to add these paths?





[jira] [Comment Edited] (TIKA-3495) parent-child in solr emitter doesn't seem to include parent id (_nest_parent_)

2021-07-23 Thread David Eric Pugh (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17386315#comment-17386315
 ] 

David Eric Pugh edited comment on TIKA-3495 at 7/23/21, 3:44 PM:
-

This area of Solr has been changing a bit.  According to 
[https://solr.apache.org/guide/8_9/indexing-nested-documents.html#example-indexing-syntax]
 it appears that you don't need to do anything in solrj other than nest them; 
however, is your schema set up properly with those fields?  It does look 
like it from the screenshots...  


was (Author: epugh):
This area of Solr has been changing a bit.  According to 
[https://solr.apache.org/guide/8_9/indexing-nested-documents.html#example-indexing-syntax]
 it appears that you don't need to do anything in solrj; however, is your 
schema set up properly with those fields?

> parent-child in solr emitter doesn't seem to include parent id (_nest_parent_)
> --
>
>                 Key: TIKA-3495
>                 URL: https://issues.apache.org/jira/browse/TIKA-3495
>             Project: Tika
>          Issue Type: Bug
>          Components: tika-pipes
>    Affects Versions: 2.0.0
>            Reporter: Tim Allison
>            Priority: Minor
>         Attachments: Screenshot from 2021-07-23 11-21-38.png, Screenshot from 
> 2021-07-23 11-22-02.png
>
>
> I'm trying to draft examples of indexing parent-child relationships with the 
> Solr emitter on the tika-pipes wiki page.  With the latest 2.0.1-SNAPSHOT 
> build, I'm not seeing any of the following fields populated: _root_, 
> _nest_parent_, _nest_path_.
> Should these be auto-populated by Solrj or do we need to add these paths?





[jira] [Commented] (TIKA-3495) parent-child in solr emitter doesn't seem to include parent id (_nest_parent_)

2021-07-23 Thread David Eric Pugh (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17386315#comment-17386315
 ] 

David Eric Pugh commented on TIKA-3495:
---

This area of Solr has been changing a bit.  According to 
[https://solr.apache.org/guide/8_9/indexing-nested-documents.html#example-indexing-syntax]
 it appears that you don't need to do anything in solrj; however, is your 
schema set up properly with those fields?






[jira] [Commented] (TIKA-1570) Seeking a stop method for better use with Apache Commons Daemon

2021-05-13 Thread David Eric Pugh (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17343965#comment-17343965
 ] 

David Eric Pugh commented on TIKA-1570:
---

The associated PR seems reasonable; it would be nice to have docs on using 
Commons Daemon.
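
For the stop-method question in the issue quoted below, here is a minimal sketch of the kind of start/stop pair that procrun can call. As far as I understand the procrun docs, jvm mode expects static void methods taking String[], and the start method should not return until stop has been called. ServiceLauncher is a hypothetical wrapper class, not something that ships with Tika, and the spot where a real wrapper would launch the server is only marked with a comment. The prunsrv line in the issue uses java mode, so its mode, class and method options would need adjusting to use something like this.

```java
import java.util.concurrent.CountDownLatch;

/** Hypothetical wrapper exposing separate start and stop entry points for procrun. */
class ServiceLauncher {
    // Released once when stop() is called; start() blocks on it.
    private static final CountDownLatch STOP_SIGNAL = new CountDownLatch(1);

    /** Start entry point: blocks until stop() is called. */
    public static void start(String[] args) throws InterruptedException {
        // Assumption: a real wrapper would start the server on its own thread here.
        STOP_SIGNAL.await();
        // Assumption: a real wrapper would shut the server down here before returning.
    }

    /** Stop entry point: releases the thread blocked in start(). */
    public static void stop(String[] args) {
        STOP_SIGNAL.countDown();
    }
}
```

The latch keeps the two entry points decoupled: stop() can be called from any thread, and calling it before start() has begun waiting is harmless because the latch is already released by then.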

> Seeking a stop method for better use with Apache Commons Daemon
> ---
>
>                 Key: TIKA-1570
>                 URL: https://issues.apache.org/jira/browse/TIKA-1570
>             Project: Tika
>          Issue Type: Improvement
>          Components: server
>    Affects Versions: 1.7
>            Reporter: Jason Borg
>            Priority: Minor
>
> I've got tika-server-1.7.jar from http://tika.apache.org/download.html
> I've downloaded v1.0.15 of the Windows binaries for Apache Commons Daemon 
> from http://commons.apache.org/proper/commons-daemon/binaries.html
> I can get Tika started as a service, but I can't determine what to use for a 
> stop method.
> prunsrv.exe //IS//tika-daemon --DisplayName "Tika Daemon" --Classpath 
> "C:\Tika Service\tika-server-1.7.jar" --StartClass 
> "org.apache.tika.server.TikaServerCli" --StopClass 
> "org.apache.tika.server.TikaServerCli" --StartMethod main --StopMethod main 
> --Description "Tika Daemon Windows Service" --StartMode java --StopMode java
> This starts, and works as I'd hope, but when trying to stop the service it 
> doesn't respond. Obviously org.apache.tika.server.TikaServerCli.main(String[] 
> args) isn't a suitable stop method, but I'm lost for alternatives.
> Using Daemon in exe mode works for start, but gives inconsistent results for 
> stop. Adding a stop method to Tika would be ideal.





[jira] [Commented] (TIKA-1570) Seeking a stop method for better use with Apache Commons Daemon

2021-05-13 Thread David Eric Pugh (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17343963#comment-17343963
 ] 

David Eric Pugh commented on TIKA-1570:
---

I might suggest trying to go down the docker on windows route...



[jira] [Commented] (TIKA-1570) Seeking a stop method for better use with Apache Commons Daemon

2021-05-13 Thread David Eric Pugh (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17343962#comment-17343962
 ] 

David Eric Pugh commented on TIKA-1570:
---

Unfortunately they are Linux only.   However, I have used NSSM (https://nssm.cc/) 
on Windows for running Java processes.



Re: high level parser module names in 2.x

2021-05-11 Thread Eric Pugh
Sounds good to me.

On Tue, May 11, 2021 at 9:33 AM Tim Allison  wrote:

> If there aren't objections, I'll make this change today or tomorrow.
>
> Cheers,
>
>Tim
>
> On Tue, Apr 20, 2021 at 10:57 AM Tim Allison  wrote:
> >
> > How about:
> >
> > standard
> > extended
> > ml (for machine learning)
> >
> > On Wed, Mar 10, 2021 at 10:37 AM Nick Burch 
> wrote:
> > >
> > > On Tue, 9 Mar 2021, Tim Allison wrote:
> > > > Would this be better?
> > > >
> > > > tika-parsers-basic
> > > > tika-parsers-complex
> > > > tika-parsers-¯\_(ツ)_/¯
> > >
> > > GStreamer has 4 levels of plugins, Base, Good, Ugly and Bad.
> Descriptions
> > > of what qualifies for what at
> https://gstreamer.freedesktop.org/modules/ .
> > > I can see developers getting upset if we sling their hard work into
> "bad"
> > > though, and I can see a lot of users avoiding it without checking the
> > > details, so maybe not one to follow exactly!
> > >
> > > I think 1-2 word descriptions are required, something like
> > > tika-parsers-networking-needed-medium or
> > > > tika-parsers-requiring-external-native-code just seems too lengthy and
> > > unwieldy.
> > >
> > > Anyone know any other open source projects with plugin collections,
> where
> > > we might be able to pinch ideas on groupings?
> > >
> > > Nick
>


Re: high level parser module names in 2.x

2021-03-09 Thread Eric Pugh
I’d like to see the discriminators on the parsers be more about the type of 
parser and what it’s going to drag along or impact my system with, whereas the 
current names reflect more the history of Tika’s evolution.

Starting with the descriptive paragraphs, here is some brainstorming of names:

with the exception of optional OCR, these
should be lightish weight dependencies in pure java with no
parsers/resources that require network calls.

—tika-parsers-files
—tika-parsers-alljava
—tika-parsers-local
—tika-parsers-simple
—tika-parsers-lightweight
—tika-parsers-aluminum

these can require native libs and/or have
heavier dependencies, including network calls.

—tika-parsers-heavy
—tika-parsers-complex
—tika-parsers-extended-dependencies
—tika-parsers-iron


anything goes. dl4j as a dependency, etc.

—tika-parsers-anything-goes
—tika-parsers-sandbox
—tika-parsers-deep
—tika-parsers-model-driven
—tika-parsers-lead




> On Mar 9, 2021, at 12:03 PM, Tim Allison  wrote:
> 
> All,
>  I was recently chatting about Tika 2.x with some Tika friends and
> they had some hesitation about the names for the three high level
> parser modules.
> 
> They are currently:
> 
> tika-parsers-classic
> tika-parsers-extended
> tika-parsers-advanced
> 
> The quibbles weren't with the delineation, but with the naming.
> 
> In my mind, this is what I've been thinking as definitions:
> 
> tika-parsers-classic -- with the exception of optional OCR, these
> should be lightish weight dependencies in pure java with no
> parsers/resources that require network calls.
> 
> tika-parsers-extended -- these can require native libs and/or have
> heavier dependencies, including network calls.
> 
> tika-parsers-advanced -- anything goes. dl4j as a dependency, etc.
> 
> Some options for classic-> basic, base, ...what else?
> 
> Any other recommendations for these names?  Thank you!
> 
> Best,
> 
>   Tim

___
Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | 
http://www.opensourceconnections.com <http://www.opensourceconnections.com/> | 
My Free/Busy <http://tinyurl.com/eric-cal>  
Co-Author: Apache Solr Enterprise Search Server, 3rd Ed 
<https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>

This e-mail and all contents, including attachments, is considered to be 
Company Confidential unless explicitly stated otherwise, regardless of whether 
attachments are marked as such.



Re: Config Tika Server

2021-01-18 Thread Eric Pugh
I’ve done two projects around this.

https://github.com/o19s/powerpoint-discovery-demo demonstrates hOCR plus 
converting PPTs to static images for web-friendlier (sorta!) highlighting in 
context.

https://github.com/o19s/pdf-discovery-demo/ is similar but newer, and does 
the same thing for PDFs; however, we use pdf.js to render the PDF natively in 
the browser.

Eric


> On Jan 18, 2021, at 8:48 AM, Tim Allison  wrote:
> 
> We aren’t currently extracting position in any formats. I _think_ it is
> straightforward to get coordinates from PDFs, but I’d have to look at the
> ppt/x apis for location.
> 
> What, specifically, are you trying to accomplish?
> 
> Tesseract in hocr mode does extract coordinates if that’s of any use...
> 
> On Mon, Jan 18, 2021 at 8:05 AM Nilton Monteiro 
> wrote:
> 
>> Hello, I would like to know if it's possible to extract the position of the
>> texts, tables, graphs, and pages in PPT files.
>> I tried Tika-python to parse the PPT file, but I did not find options to
>> get this information.
>> I understand that I need to configure Tika server to obtain that. Could you
>> please help me with that?
>> 
>> Thanks,
>> Nilton
>> 
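
To make the hOCR suggestion in this thread concrete, here is a minimal sketch of pulling word coordinates out of hOCR markup. It assumes word spans carry the class `ocrx_word` and a `title` attribute starting with `bbox x0 y0 x1 y1`, as Tesseract's hOCR output does; `HocrWordParser`, `hocr_words` and the `SAMPLE` markup are hypothetical names and data made up for illustration, not part of Tika or tika-python.

```python
import re
from html.parser import HTMLParser


class HocrWordParser(HTMLParser):
    """Collects (text, (x0, y0, x1, y1)) for each hOCR word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word span we are currently inside

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            match = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(g) for g in match.groups())

    def handle_data(self, data):
        # Attach the first non-blank text after a word span to its bbox.
        if self._bbox is not None and data.strip():
            self.words.append((data.strip(), self._bbox))
            self._bbox = None


def hocr_words(hocr):
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


# Made-up sample in the shape of Tesseract hOCR output.
SAMPLE = (
    "<span class='ocr_line' title='bbox 36 92 580 116'>"
    "<span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Hello</span> "
    "<span class='ocrx_word' title='bbox 109 92 179 116; x_wconf 93'>Tika</span>"
    "</span>"
)
```

Calling `hocr_words(SAMPLE)` yields each word with its pixel bounding box, which is the information needed for highlighting in context.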




[jira] [Commented] (TIKA-3258) Run OCR on PDFs with 'auto' mode as default in Tika 2.0.0

2021-01-06 Thread David Eric Pugh (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17259809#comment-17259809
 ] 

David Eric Pugh commented on TIKA-3258:
---

I'm thinking that this is a pointer towards two general approaches to using 
Tika.  

The first approach is the "Let me just grab it and use it, and hope it does the 
right thing" and this feature feels very much in line with that mode.  It's 
what I do when I have a few docs that I want to look at, often via the GUI app.

The second approach is the "I'm building an application using Tika, and I need 
to control what Tika does".   This is where scale, robustness, and control really 
matter.

Both use cases are ones I experience regularly, and I'd like to see both preserved.  
Indeed, I think it's important to make both use cases easier!

> Run OCR on PDFs with 'auto' mode as default in Tika 2.0.0
> -
>
> Key: TIKA-3258
> URL: https://issues.apache.org/jira/browse/TIKA-3258
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Major
>
> In Tika 1.x we currently have the fiddly mess that users have to configure 
> OCR of PDFs...it doesn't just work out of the box.  We did this initially 
> because of concerns (well, reality) of crazy resource consumption for some 
> PDFs that can have thousands of images per page that are stitched together to 
> make a reasonable composite.
> Since then, we've added option 2, which renders each page and then runs OCR 
> on that composite image rather than running OCR on each inline image...so 
> we'll only call tesseract once per page.  Second, we've added an 'auto' mode 
> that runs OCR only on pages that didn't have much text extracted.  While 
> there is plenty of room for improvement in the 'auto' heuristic, I think we 
> should move to running OCR automatically on PDFs as default in 2.0.0. 
> Under this proposal, users will now have to disable OCR if they have 
> tesseract installed but don't want to run it on PDFs.
> This will be a breaking change, and we'll make sure to document it early and 
> often in the "Breaking Changes" sections of the readme.txt.
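
As a rough illustration of the 'auto' idea described in the issue (run OCR only on pages that yielded little text), here is a minimal sketch. The character-count rule and the `min_chars` threshold are assumptions made for this example; they are not Tika's actual heuristic, and `pages_needing_ocr` is a hypothetical name.

```python
def pages_needing_ocr(page_texts, min_chars=10):
    """Return indexes of pages whose extracted text, ignoring whitespace,
    is shorter than min_chars."""
    return [
        index
        for index, text in enumerate(page_texts)
        if len("".join(text.split())) < min_chars
    ]
```

Pages that already produced plenty of text are skipped, so the expensive OCR call happens only where extraction came back nearly empty.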





Re: Expected private/secret keys in the source (TIKA-3205)

2020-09-29 Thread Eric Pugh
Tika: We break all the rules for good reasons!

> On Sep 29, 2020, at 4:21 PM, Nick Burch  wrote:
> 
> Hey All
> 
> Just a quick heads-up that for TIKA-3205 I generated a few new small private 
> keys (RSA, DSA, EC) and added them to the parser test documents folder, for 
> unit testing the new mime magics for keys and certificates. They're not 
> protecting or using anything.
> 
> One automated security scanning tool has already emailed me to warn that I 
> committed secrets (GitGuardian), and I think there's a chance others might do 
> too...
> 
> So before anyone else gets a notification and worries, I felt it best to give 
> everyone a heads-up that yes, there are private key files in the Tika source 
> tree, and yes, they are supposed to be there!
> 
> Cheers
> Nick




[jira] [Commented] (TIKA-3166) Actually maven-modularize the packages for 2.0

2020-08-20 Thread David Eric Pugh (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181262#comment-17181262
 ] 

David Eric Pugh commented on TIKA-3166:
---

I did a diff, and while I can't say that I read through it in detail etc, I 
didn't see anything that made me cringe!   Great to see forward movement.

> Actually maven-modularize the packages for 2.0
> --
>
> Key: TIKA-3166
> URL: https://issues.apache.org/jira/browse/TIKA-3166
> Project: Tika
>  Issue Type: Bug
>Reporter: Tim Allison
>Priority: Major
>
> Let's get around to maven-modularizing the packages for Tika 2.0 according to 
> [~bobpaulin]'s 2.x branch...and, maybe, maybe, ship Tika 2.0.0-ALPHA some 
> time soonish*?!?
> I'll start working on branch_2x.
> *soonish in Open Source Time





[jira] [Commented] (TIKA-3093) Enable tika-server to forward parse results to another endpoint

2020-04-24 Thread David Eric Pugh (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17091703#comment-17091703
 ] 

David Eric Pugh commented on TIKA-3093:
---

Out of curiosity, is this type of behavior, the "Let me chain a set of 
interactions together" something that already exists?   

Imagine System A is a CMS, System B is Tika, and System C is Solr...

What if I wanted to do something like "Send a request for a doc by id to system 
A, have it dig up doc by id in System A, then forward to System B for 
Extraction, and then forward to System C for storage"..   

Is there an already existing pattern for this that Tika could conform to?   It 
feels like a pipe of some kind...   
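
The "pipe" idea above can be sketched as plain function composition. This is only an illustration of the shape of the chain: `fetch_from_cms`, `extract_with_tika` and `store_in_solr` are hypothetical stand-ins backed by in-memory dictionaries, where real systems would make network calls.

```python
from functools import reduce


def pipe(*stages):
    """Compose stages left to right: each stage receives the previous output."""
    return lambda value: reduce(lambda acc, stage: stage(acc), stages, value)


# Hypothetical stand-ins for the three systems.
CMS = {"42": b"raw bytes of document 42"}
INDEX = {}


def fetch_from_cms(doc_id):  # System A: look the document up by id
    return {"id": doc_id, "raw": CMS[doc_id]}


def extract_with_tika(doc):  # System B: turn raw bytes into text
    return {"id": doc["id"], "text": doc["raw"].decode("utf-8")}


def store_in_solr(doc):  # System C: store the extracted text
    INDEX[doc["id"]] = doc["text"]
    return doc["id"]


index_by_id = pipe(fetch_from_cms, extract_with_tika, store_in_solr)
```

A single call such as `index_by_id("42")` then walks the document through all three stages in order.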



> Enable tika-server to forward parse results to another endpoint
> ---
>
> Key: TIKA-3093
> URL: https://issues.apache.org/jira/browse/TIKA-3093
> Project: Tika
>  Issue Type: Task
>Reporter: Tim Allison
>Priority: Major
>
> bq. I see the "send the results to a remote network service" thing as 
> probably being separate from the Content Handler.
> The above is from [~nick] on TIKA-2972.
> It would be useful to allow users to forward the results of parsing to 
> another endpoint.  For example, a user could specify a Solr 
> URL/update/json/docs handler or an elastic //_doc/<_id>
> We may want to allow users to do custom mapping before redirecting to another 
> URL, whitelisting/blacklisting of metadata keys, etc.
> I'd propose using /rmeta as the basis for this.
> cc [~ehatcher] and [~dadoonet].
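
The custom mapping step mentioned in the issue could look something like the following minimal sketch, which filters and renames metadata keys before the result is forwarded. The function name, its parameters, and the sample keys in the usage note are assumptions for illustration; nothing here reflects an actual tika-server option.

```python
def prepare_for_forwarding(metadata, allow=None, deny=(), rename=None):
    """Filter and rename metadata keys before sending them to another endpoint.

    allow:  if given, only these keys are kept.
    deny:   keys that are always dropped.
    rename: mapping from original key to the key the target expects.
    """
    rename = rename or {}
    result = {}
    for key, value in metadata.items():
        if key in deny:
            continue
        if allow is not None and key not in allow:
            continue
        result[rename.get(key, key)] = value
    return result
```

For example, with `deny={"pdf:producer"}` and `rename={"X-TIKA:content": "content"}`, a metadata record keeps its `Content-Type`, loses the producer field, and exposes the extracted text under `content`.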





Re: Issue with > 200% CPU after bulk usage

2020-04-16 Thread Eric Pugh
Does anyone have a good example of combining Tika with some sort of pool of 
Docker containers?   I think a lot of folks treat their Tika server like a pet, 
not like a cow.  
https://cloudscaling.com/blog/cloud-computing/the-history-of-pets-vs-cattle/

I wonder if we could ship some “recipes” that describe how to deploy a pool of 
Tikas.   If a Tika has been running at over 200% CPU for an hour, kill it and 
start the next.
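
A minimal sketch of the decision rule such a recipe might use follows. It only decides whether an instance should be recycled; how the CPU samples are collected and how the container is actually replaced are left out. The function name and the sample format are assumptions; the defaults mirror the "over 200% for 1 hour" figure above.

```python
def should_recycle(samples, cpu_limit=200.0, max_seconds=3600):
    """samples: list of (unix_time, cpu_percent) tuples, oldest first.

    True when the most recent unbroken run of samples above cpu_limit
    spans at least max_seconds.
    """
    if not samples:
        return False
    newest_time = samples[-1][0]
    hot_since = None
    # Walk backwards from the newest sample while the CPU stays above the limit.
    for timestamp, cpu in reversed(samples):
        if cpu <= cpu_limit:
            break
        hot_since = timestamp
    return hot_since is not None and newest_time - hot_since >= max_seconds
```

A supervisor loop could call this once per sampling interval for each instance and replace any instance for which it returns true.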



> On Apr 16, 2020, at 9:40 AM, Nick Burch  wrote:
> 
> On Wed, 15 Apr 2020, hans.mei...@avident-it.se wrote:
>> I have encountered an issue with Tika running locally on a box that the Java 
>> runtime goes up to over 200% CPU, after running a bulk load of documents 
>> over a couple of days, it is more than 3 million documents.
> 
> Can you do a thread dump to show what the JVM is doing?
> https://access.redhat.com/solutions/18178
> 
> Nick




Re: Tika master branch not building

2020-04-08 Thread Eric Pugh
If you bump cxf one version, then the complaints stop ;-)

https://github.com/apache/tika/pull/316/commits/ee93e6d3d91bdfc40f838556c13698ea5c78e936



> On Apr 8, 2020, at 5:35 AM, Sergey Beryozkin  wrote:
> 
> Hi Lewis
> 
> Getting one of the latest releases should be fine; while I've been out of
> touch with CXF recently, I can ask around for some version advice as the
> guys deal with the security vulnerabilities seriously there, if addressing
> this issue proves problematic
> Cheers, Sergey
> 
> On Tue, Apr 7, 2020 at 10:44 PM Lewis John McGibbney 
> wrote:
> 
>> I suspected this was the case folks :)
>> I actually really like this idea.
>> I'll take the action item to address this seeing as I pulled it up...
>> seeing as I am also working on tika-server right now I'll also take the
>> action item to address the vulnerable CXF deps.
>> Thanks,
>> Lewis
>> 
>> On 2020/04/06 16:19:16, Tim Allison  wrote:
>>>> We shouldn't have any at release time, but they will obviously creep in
>>> between releases
>>> 
>>> Except the time, where I did the release and was trying to build it for
>>> updating the site, and this had already kicked in. :(
>>> 
>>> Y, we can turn this to warn, as long as we run it with fail as part of
>> the
>>> release process.
>>> 
>>> On Mon, Apr 6, 2020 at 9:59 AM Nick Burch  wrote:
>>> 
>>>> On Mon, 6 Apr 2020, Eric Pugh wrote:
>>>>> Maybe this needs better documentation, however this is a “works as
>>>>> designed” feature!
>>>>> 
>>>>> To avoid the build failing, run mvn package -Dossindex.fail=false
>>>> 
>>>> Should we maybe have this set to false by default, and only enabled
>>>> on release builds?
>>>> 
>>>> (We shouldn't have any at release time, but they will obviously creep
>> in
>>>> between releases)
>>>> 
>>>> Nick
>>> 
>> 




[jira] [Commented] (TIKA-2368) Clean up SentimentParser dependencies

2020-04-06 Thread David Eric Pugh (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076673#comment-17076673
 ] 

David Eric Pugh commented on TIKA-2368:
---

I'm actually not sure I touched {{SentimentParser}}, as everything I did was 
in the {{tika-nlp}} project, which appears to only have an {{AgePredictor}} 
client.   That client is what drags in the large list of dependencies.   I 
ended up commenting on this ticket because of the comment in {{tika-nlp}} that 
said "fix me when TIKA-2368 is fixed".  

You are right though, that all the changes I made, well, we just keep the flag 
to say "don't alert on failure" ;)

> Clean up SentimentParser dependencies
> -
>
> Key: TIKA-2368
> URL: https://issues.apache.org/jira/browse/TIKA-2368
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Priority: Blocker
>
> Is there any way to avoid reliance on edu.usc.ir's sentiment-analysis-parser? 
>  I ask because:
> {noformat}
> [WARNING] sentiment-analysis-parser-0.1.jar, tika-parsers-1.15-SNAPSHOT.jar 
> define 1 overlapping classes: 
> [WARNING]   - org.apache.tika.parser.sentiment.analysis.SentimentParser
> [WARNING] tika-core-1.15-SNAPSHOT.jar, tika-translate-1.15-SNAPSHOT.jar 
> define 4 overlapping classes: 
> [WARNING]   - org.apache.tika.language.translate.DefaultTranslator$1
> [WARNING]   - org.apache.tika.language.translate.EmptyTranslator
> [WARNING]   - org.apache.tika.language.translate.DefaultTranslator
> [WARNING]   - org.apache.tika.language.translate.Translator
> {noformat}
> We should be ok keeping things as they are and excluding SentimentParser and 
> tika-translate, but can we easily move the code that's still in edu.usc.ir's 
> package into Tika?





[jira] [Commented] (TIKA-2368) Clean up SentimentParser dependencies

2020-04-06 Thread David Eric Pugh (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17076501#comment-17076501
 ] 

David Eric Pugh commented on TIKA-2368:
---

In [https://github.com/apache/tika/pull/316] I messed with the dependency list, 
and I think I got the scanning check to pass.  



Re: Tika master branch not building

2020-04-06 Thread Eric Pugh
Not sure I have an opinion, though I’ll note that many of the PRs in this 
project are driven by various vulnerabilities!  It’s a treadmill of updating 
dependencies considering how many there are!

For example, the current failure is fixed by bumping CXF to 3.3.6.

Also, Lewis, I think we are on the latest version of the plugin?
https://github.com/apache/tika/blob/master/tika-parent/pom.xml#L382 suggests we 
are on 3.1.0.

Eric


> On Apr 6, 2020, at 9:59 AM, Nick Burch  wrote:
> 
> On Mon, 6 Apr 2020, Eric Pugh wrote:
>> Maybe this needs better documentation, however this is a “works as designed” 
>> feature!
>> 
>> To avoid the build failing, run mvn package -Dossindex.fail=false
> 
> Should we maybe have this set to false by default, and only enabled on 
> release builds?
> 
> (We shouldn't have any at release time, but they will obviously creep in 
> between releases)
> 
> Nick




Re: Tika master branch not building

2020-04-06 Thread Eric Pugh
Maybe this needs better documentation, however this is a “works as designed” 
feature!

To avoid the build failing, run mvn package -Dossindex.fail=false

This caught me as well the first time, I wonder if there was a nice way of 
giving a helpful error message?   
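
One possible shape for such a helpful message, sketched as a small wrapper that inspects captured Maven output: it looks for the ossindex audit failure and returns a plain-language hint. The function name and the hint wording are assumptions for illustration; only the `-Dossindex.fail=false` flag comes from this thread.

```python
def audit_failure_hint(maven_output):
    """Return a hint if the output looks like an ossindex audit failure, else None."""
    if "ossindex-maven-plugin" in maven_output and "vulnerable components" in maven_output:
        return (
            "The build failed the dependency vulnerability audit, not compilation. "
            "Update the listed dependencies, or run with -Dossindex.fail=false "
            "to build anyway."
        )
    return None
```

The check is deliberately loose string matching, so it would need adjusting if the plugin's wording changes.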


> On Apr 6, 2020, at 2:19 AM, lewis john mcgibbney  wrote:
> 
> I'm also seeing a deprecation notice for the ossindex-maven-plugin as well
> 
> https://github.com/OSSIndex/ossindex-maven-plugin#deprecated-please-upgrade-to-ossindex-maven
> 
> Any info please folks?
> Thanks
> 
> On Sun, Apr 5, 2020 at 11:14 PM lewis john mcgibbney 
> wrote:
> 
>> Hi dev@,
>> Working on TIKA-3082, I just tried to build master branch
>> 
>> Downgrading my Java version to 1.8
>> java -version
>> java version "1.8.0_221"
>> Java(TM) SE Runtime Environment (build 1.8.0_221-b11)
>> Java HotSpot(TM) 64-Bit Server VM (build 25.221-b11, mixed mode)
>> 
>> [INFO] --- ossindex-maven-plugin:3.1.0:audit (audit-dependencies) @
>> tika-parsers ---
>> [INFO] Checking for vulnerabilities; 154 artifacts
>> [INFO] Exclude coordinates: []
>> [INFO] Exclude vulnerability identifiers: []
>> [INFO] CVSS-score threshold: 0.0
>> [INFO]
>> 
>> [INFO] Reactor Summary for Apache Tika 2.0.0-SNAPSHOT:
>> [INFO]
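>> [INFO] Apache Tika parent .............................. SUCCESS [  2.663 s]
>> [INFO] Apache Tika core ................................ SUCCESS [ 10.059 s]
>> [INFO] Apache Tika parsers ............................. FAILURE [  4.035 s]
>> [INFO] Apache Tika OSGi bundle ......................... SKIPPED
>> [INFO] Apache Tika XMP ................................. SKIPPED
>> [INFO] Apache Tika serialization ....................... SKIPPED
>> [INFO] Apache Tika batch ............................... SKIPPED
>> [INFO] Apache Tika language detection .................. SKIPPED
>> [INFO] Apache Tika application ......................... SKIPPED
>> [INFO] Apache Tika translate ........................... SKIPPED
>> [INFO] Apache Tika server .............................. SKIPPED
>> [INFO] Apache Tika fuzzing ............................. SKIPPED
>> [INFO] Apache Tika eval ................................ SKIPPED
>> [INFO] Apache Tika examples ............................ SKIPPED
>> [INFO] Apache Tika Java-7 Components ................... SKIPPED
>> [INFO] Apache Tika Deep Learning (powered by DL4J) ..... SKIPPED
>> [INFO] Apache Tika Natural Language Processing ......... SKIPPED
>> [INFO] Apache Tika ..................................... SKIPPED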
>> [INFO]
>> 
>> [INFO] BUILD FAILURE
>> [INFO]
>> 
>> [INFO] Total time:  17.641 s
>> [INFO] Finished at: 2020-04-05T23:08:02-07:00
>> [INFO]
>> 
>> [ERROR] Failed to execute goal
>> org.sonatype.ossindex.maven:ossindex-maven-plugin:3.1.0:audit
>> (audit-dependencies) on project tika-parsers: Detected 2 vulnerable
>> components:
>> [ERROR]   org.apache.cxf:cxf-core:jar:3.3.5:compile;
>> https://ossindex.sonatype.org/component/pkg:maven/org.apache.cxf/cxf-core@3.3.5
>> [ERROR] * [CVE-2020-1954] Apache CXF has the ability to integrate with
>> JMX by registering an Instrumentati... (5.3);
>> https://ossindex.sonatype.org/vuln/20bc51e8-29c6-4168-9326-ae0ed18e5d51
>> [ERROR]   org.apache.cxf:cxf-rt-frontend-jaxrs:jar:3.3.5:compile;
>> https://ossindex.sonatype.org/component/pkg:maven/org.apache.cxf/cxf-rt-frontend-jaxrs@3.3.5
>> [ERROR] * [CVE-2020-1954] Apache CXF has the ability to integrate with
>> JMX by registering an Instrumentati... (5.3);
>> https://ossindex.sonatype.org/vuln/20bc51e8-29c6-4168-9326-ae0ed18e5d51
>> [ERROR]
>> [ERROR] -> [Help 1]
>> [ERROR]
>> [ERROR] To see the full stack trace of the errors, re-run Maven with the
>> -e switch.
>> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
>> [ERROR]
>> [ERROR] For more information about the errors and possible solutions,
>> please read the following articles:
>> [ERROR] [Help 1]
>> http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException
>> [ERROR]
>> [ERROR] After correcting the problems, you can resume the build with the
>> command
>> [ERROR]   mvn  -rf :tika-parsers
>> 
>> I

[jira] [Commented] (TIKA-3075) Add an HTTP parser

2020-03-19 Thread David Eric Pugh (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17062619#comment-17062619
 ] 

David Eric Pugh commented on TIKA-3075:
---

Not sure I understand what this issue is about.  As in, being able to extract 
data from HTTP logs of some kind?

> Add an HTTP parser
> --
>
> Key: TIKA-3075
> URL: https://issues.apache.org/jira/browse/TIKA-3075
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Reporter: GGYF
>Priority: Major
>
> Add an HTTP parser that processes the http content





[jira] [Commented] (TIKA-3035) Tika-app --extract mode outputs to stderr instead of stdout

2020-02-25 Thread David Eric Pugh (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17044533#comment-17044533
 ] 

David Eric Pugh commented on TIKA-3035:
---

Tried it with tika-app-1.23.jar and it worked great.   

It feels to me like tika-app is somewhat overloaded...   Sometimes we use it 
to fire up the GUI app, other times we use it like a CLI.   But the CLI is kind 
of messy.   I wonder if we need a tika-cli project separate from tika-app where 
all the inputs and outputs are properly thought through?

> Tika-app --extract mode outputs to stderr instead of stdout
> ---
>
> Key: TIKA-3035
> URL: https://issues.apache.org/jira/browse/TIKA-3035
> Project: Tika
>  Issue Type: Bug
>  Components: app
>Affects Versions: 1.23
>Reporter: Soren Daugaard
>Priority: Major
>  Labels: app, extract
> Attachments: testPDF_childAttachments.pdf
>
>
> In version 1.23 of Tika I am noticing a problem using the extract 
> functionality. When extracting items from a file the "Extracting ... to ... " 
> output goes to {{stderr}} instead of {{stdout}}.  
> This problem is observed using the runnable jar `tika-app-1.23.jar` . 
> _*Example to re-create problem:*_
> Here we explode {{testPDF_childAttachments.pdf}} and redirect standard error 
> to {{/dev/null}}:
> {code:java}
> $ java -jar tika-app-1.23.jar --extract-dir=tika-test/out/ -z 
> testPDF_childAttachments.pdf 2> /dev/null
> {code}
> If I do not redirect stderr I see:
> {code:java}
> $ java -jar tika-app-1.23.jar --extract-dir=tika-test/out/ -z 
> testPDF_childAttachments.pdf
> INFO  As a convenience, TikaCLI has turned on extraction of
> inline images for the PDFParser (TIKA-2374).
> Aside from the -z option, this is not the default behavior
> in Tika generally or in tika-server.
> Jan 31, 2020 8:06:01 PM org.apache.tika.config.InitializableProblemHandler$3 
> handleInitializableProblem
> WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
> See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
> for optional dependencies.
> Jan 31, 2020 8:06:01 PM org.apache.tika.config.InitializableProblemHandler$3 
> handleInitializableProblem
> WARNING: Tesseract OCR is installed and will be automatically applied to 
> image files unless
> you've excluded the TesseractOCRParser from the default parser.
> Tesseract may dramatically slow down content extraction (TIKA-2359).
> As of Tika 1.15 (and prior versions), Tesseract is automatically called.
> In future versions of Tika, users may need to turn the TesseractOCRParser on 
> via TikaConfig.
> Jan 31, 2020 8:06:01 PM org.apache.tika.config.InitializableProblemHandler$3 
> handleInitializableProblem
> WARNING: org.xerial's sqlite-jdbc is not loaded.
> Please provide the jar on your classpath to parse sqlite files.
> See tika-parsers/pom.xml for the correct version.
> Extracting 'image0.jpg' (image/jpeg) to 
> tika-test/out/3975acae-089c-43ae-a3bc-04e4987a0282-image0.jpg
> Extracting 'image1.tif' (image/tiff) to 
> tika-test/out/8d11e4e3-735b-4b0b-9441-3ed4332c2f53-image1.tif
> WARN  No Unicode mapping for f_i (31) in font SCZFMD+HelveticaNeueLTStd-Roman
> Extracting 'Press Quality(1).joboptions' (text/plain) to 
> tika-test/out/28c3fb48-30ea-403b-8a35-252c8f692305-Press Quality(1).joboptions
> Extracting 'Unit10.doc' (application/msword) to 
> tika-test/out/008b9157-75f3-453b-bdfd-d5403c56891c-Unit10.doc
> {code}
> Using 1.22 I correctly see the extracted files in {{stdout}} when redirecting 
> {{stderr}}:
> {code:java}
> $ java -jar tika-app-1.22.jar --extract-dir=tika-test/out/ -z 
> testPDF_childAttachments.pdf 2> /dev/null
> Extracting 'image0.jpg' (image/jpeg) to 
> tika-test/out/4ec61a12-4e5f-4de3-bee8-fa15521c374a-image0.jpg
> Extracting 'image1.tif' (image/tiff) to 
> tika-test/out/004fbeb5-4b0e-4d35-8c50-23a420dccc99-image1.tif
> Extracting 'Press Quality(1).joboptions' (text/plain) to 
> tika-test/out/8f6174d1-f0c7-4143-990d-a922c2e9513a-Press Quality(1).joboptions
> Extracting 'Unit10.doc' (application/msword) to 
> tika-test/out/b2508bee-745d-4051-b927-0f5c31b97c1e-Unit10.doc
> {code}
>  
>  
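A quick way to check which stream a given line lands on is to discard one stream and count what survives on the other. The sketch below is only illustrative: `emit` is a hypothetical stand-in that prints one line to each stream, and `count_stdout` / `count_stderr` are helper names invented here, not part of Tika. Swapping `emit` for the real `java -jar tika-app-1.23.jar ...` command from the report should, if the report is accurate, show the "Extracting" lines counted under stderr.

```shell
#!/bin/sh
# Hypothetical stand-in for the Tika command: one line on stdout, one on stderr.
# Replace the body with the real `java -jar tika-app-... -z file.pdf` call.
emit() {
  echo "Extracting 'image0.jpg' (image/jpeg) to out/image0.jpg"
  echo "WARNING: J2KImageReader not loaded." >&2
}

# Count the lines that arrive on stdout (stderr is discarded).
count_stdout() { "$@" 2>/dev/null | wc -l | tr -d ' '; }

# Count the lines that arrive on stderr: stderr is pointed at the pipe first,
# then stdout is discarded, so only stderr reaches `wc`.
count_stderr() { "$@" 2>&1 >/dev/null | wc -l | tr -d ' '; }

echo "stdout lines: $(count_stdout emit)"
echo "stderr lines: $(count_stderr emit)"
```

The order of the redirections in `count_stderr` matters: writing `>/dev/null 2>&1` instead would discard both streams.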



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (TIKA-3035) Tika-app --extract mode outputs to stderr instead of stdout

2020-02-25 Thread David Eric Pugh (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17044406#comment-17044406
 ] 

David Eric Pugh commented on TIKA-3035:
---

Here is my command:

java -cp tika-app-1.23-SNAPSHOT.jar org.apache.tika.cli.TikaCLI 
--config=tika-config.xml --xmp --jsonRecursive --extract --pretty-print -x 
./files/alvarez20140715a.pdf

I went to try it out with a non snapshot version of tika-app, and 
tika.apache.org appears down ??!!??   



[jira] [Commented] (TIKA-3037) Tika Docs should highlight Tika-Server

2020-02-24 Thread David Eric Pugh (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17043796#comment-17043796
 ] 

David Eric Pugh commented on TIKA-3037:
---

[~tallison] did you see the gettingstarted.apt patch file?   I don't think it 
was applied to 
https://svn.apache.org/repos/asf/tika/site/src/site/apt/1.23/gettingstarted.apt 
or to a notional 
https://svn.apache.org/repos/asf/tika/site/src/site/apt/1.24/gettingstarted.apt 
file...

Also, Googling for "Tika Server" returns 
https://cwiki.apache.org/confluence/display/TIKA/TikaJAXRS.   What do you think 
about renaming that page to 
https://cwiki.apache.org/confluence/display/TIKA/TikaServer, and then I guess 
having a placeholder page at /TikaJAXRS?

> Tika Docs should highlight Tika-Server
> --
>
> Key: TIKA-3037
> URL: https://issues.apache.org/jira/browse/TIKA-3037
> Project: Tika
>  Issue Type: Improvement
>  Components: documentation
>Affects Versions: 1.23
>Reporter: David Eric Pugh
>Priority: Major
> Fix For: 1.24
>
> Attachments: gettingstarted.apt.patch
>
>
> Currently the Tika website and many of the project docs don't surface the 
> Tika Server project.  This is a ticket to track fixing those issues.





Re: [jira] [Commented] (TIKA-3040) PDF inline OCR: Exception while processing certain image (others in same PDF work)

2020-02-12 Thread Eric Pugh
n(ServiceInvokerInterceptor.java:59)
>>  at org.apache.cxf.interceptor.ServiceInvokerInterceptor.handleMessage(ServiceInvokerInterceptor.java:96)
>>  at org.apache.cxf.phase.PhaseInterceptorChain.doIntercept(PhaseInterceptorChain.java:308)
>>  at org.apache.cxf.transport.ChainInitiationObserver.onMessage(ChainInitiationObserver.java:121)
>>  at org.apache.cxf.transport.http.AbstractHTTPDestination.invoke(AbstractHTTPDestination.java:267)
>>  at org.apache.cxf.transport.http_jetty.JettyHTTPDestination.doService(JettyHTTPDestination.java:247)
>>  at org.apache.cxf.transport.http_jetty.JettyHTTPHandler.handle(JettyHTTPHandler.java:79)
>>  at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
>>  at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:235)
>>  at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1296)
>>  at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:190)
>>  at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1211)
>>  at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>>  at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:221)
>>  at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:127)
>>  at org.eclipse.jetty.server.Server.handle(Server.java:500)
>>  at org.eclipse.jetty.server.HttpChannel.lambda$handle$1(HttpChannel.java:386)
>>  at org.eclipse.jetty.server.HttpChannel.dispatch(HttpChannel.java:560)
>>  at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:378)
>>  at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:268)
>>  at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:311)
>>  at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103)
>>  at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:117)
>>  at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:782)
>>  at org.eclipse.jetty.util.thread.QueuedThreadPool$Runner.run(QueuedThreadPool.java:914)
>>  at java.base/java.lang.Thread.run(Thread.java:834)

___
Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | 
http://www.opensourceconnections.com <http://www.opensourceconnections.com/> | 
My Free/Busy <http://tinyurl.com/eric-cal>  
Co-Author: Apache Solr Enterprise Search Server, 3rd Ed 
<https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>

This e-mail and all contents, including attachments, is considered to be 
Company Confidential unless explicitly stated otherwise, regardless of whether 
attachments are marked as such.



Re: [EXTERNAL] Do we have a community supported approach for deploying Tika Server in production?

2020-02-06 Thread Eric Pugh
Dave, I pushed up TIKA-3039 with this change for your review and commit!


> On Feb 6, 2020, at 7:20 AM, Eric Pugh wrote:
> 
> Great!
> 
> 
>> On Feb 5, 2020, at 10:55 PM, David Meikle <da...@meikle.io> wrote:
>> 
>> Hi Eric,
>> 
>> +1 - I think we should drop that and rely on tika-docker instead.
>> 
>> I'm about to push more to it tonight, and then we could include it as a
>> sub-module in Tika to do regular development snapshots too.
>> 
>> Cheers,
>> Dave

Re: [EXTERNAL] Do we have a community supported approach for deploying Tika Server in production?

2020-02-06 Thread Eric Pugh
Great!


> On Feb 5, 2020, at 10:55 PM, David Meikle wrote:
> 
> Hi Eric,
> 
> +1 - I think we should drop that and rely on tika-docker instead.
> 
> I'm about to push more to it tonight, and then we could include it as a
> sub-module in Tika to do regular development snapshots too.
> 
> Cheers,
> Dave

[jira] [Commented] (TIKA-3038) Miredot license key expired

2020-02-05 Thread David Eric Pugh (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17030925#comment-17030925
 ] 

David Eric Pugh commented on TIKA-3038:
---

Also, the url for the plugin has changed from https to just http in the 
tika-server/pom.xml file:   http://nexus.qmino.com/content/repositories/miredot

> Miredot license key expired
> ---
>
> Key: TIKA-3038
> URL: https://issues.apache.org/jira/browse/TIKA-3038
> Project: Tika
>  Issue Type: Task
>  Components: documentation
>Affects Versions: 1.23
>    Reporter: David Eric Pugh
>Priority: Major
>
> I figured out why no Miredot API docs..  Key expired Jan 31st, 2020!
> https://issues.apache.org/jira/browse/TIKA-2253
> Do we think this is valuable enough documentation to keep? 





[jira] [Created] (TIKA-3038) Miredot license key expired

2020-02-05 Thread David Eric Pugh (Jira)
David Eric Pugh created TIKA-3038:
-

 Summary: Miredot license key expired
 Key: TIKA-3038
 URL: https://issues.apache.org/jira/browse/TIKA-3038
 Project: Tika
  Issue Type: Task
  Components: documentation
Affects Versions: 1.23
Reporter: David Eric Pugh


I figured out why no Miredot API docs..  Key expired Jan 31st, 2020!

https://issues.apache.org/jira/browse/TIKA-2253

Do we think this is valuable enough documentation to keep? 





[jira] [Commented] (TIKA-2253) Obtain new Miredot license key and upgrade plugin version in tika-server

2020-02-05 Thread David Eric Pugh (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-2253?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17030904#comment-17030904
 ] 

David Eric Pugh commented on TIKA-2253:
---

Hi all...The license has expired ;-)

> Obtain new Miredot license key and upgrade plugin version in tika-server
> 
>
> Key: TIKA-2253
> URL: https://issues.apache.org/jira/browse/TIKA-2253
> Project: Tika
>  Issue Type: Task
>  Components: documentation, server
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.15
>
>
> As per our recent mailing list conversation 
> http://www.mail-archive.com/dev%40tika.apache.org/msg20558.html our Miredot 
> license has expired.
> The kind folks over at Miredot have provided us with a new key; it is valid 
> until January 31st, 2020, after which we are free to request a new key.
> Thanks Miredot!
> PR coming.





[jira] [Commented] (TIKA-3037) Tika Docs should highlight Tika-Server

2020-02-05 Thread David Eric Pugh (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17030900#comment-17030900
 ] 

David Eric Pugh commented on TIKA-3037:
---

Okay, I've attached an SVN diff against the 1.23/gettingstarted.apt file as a 
patch.



[jira] [Updated] (TIKA-3037) Tika Docs should highlight Tika-Server

2020-02-05 Thread David Eric Pugh (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Eric Pugh updated TIKA-3037:
--
Attachment: gettingstarted.apt.patch



[jira] [Commented] (TIKA-3037) Tika Docs should highlight Tika-Server

2020-02-05 Thread David Eric Pugh (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17030862#comment-17030862
 ] 

David Eric Pugh commented on TIKA-3037:
---

Okay, in https://svn.apache.org/repos/asf/tika/site/src/site/apt/ I expected 
some sort of "trunk" or "per-release" dir, but it seems like the .apt files are 
copied per release.   I'll go ahead and make my edits against 
https://svn.apache.org/repos/asf/tika/site/src/site/apt/1.23/gettingstarted.apt,
 and then diff as a patch file.



[jira] [Commented] (TIKA-3037) Tika Docs should highlight Tika-Server

2020-02-05 Thread David Eric Pugh (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17030806#comment-17030806
 ] 

David Eric Pugh commented on TIKA-3037:
---

I put some edits into the wiki at 
https://cwiki.apache.org/confluence/display/TIKA/TikaJAXRS and would love 
another set of eyes...  Hopefully the approach I'm taking makes sense?



[jira] [Commented] (TIKA-3037) Tika Docs should highlight Tika-Server

2020-02-05 Thread David Eric Pugh (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17030766#comment-17030766
 ] 

David Eric Pugh commented on TIKA-3037:
---

Thanks [~nick]



[jira] [Commented] (TIKA-3037) Tika Docs should highlight Tika-Server

2020-02-05 Thread David Eric Pugh (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17030760#comment-17030760
 ] 

David Eric Pugh commented on TIKA-3037:
---

Another comment: the page 
https://cwiki.apache.org/confluence/display/TIKA/TikaJAXRS really should be 
https://cwiki.apache.org/confluence/display/TIKA/TikaServer. However, search 
engines know about https://cwiki.apache.org/confluence/display/TIKA/TikaJAXRS.  
Do we leave a placeholder page at TikaJAXRS pointing to TikaServer?



[jira] [Commented] (TIKA-3037) Tika Docs should highlight Tika-Server

2020-02-05 Thread David Eric Pugh (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17030748#comment-17030748
 ] 

David Eric Pugh commented on TIKA-3037:
---

So...   Where does the HTML for the website live?   What is the best way to 
supply patches?  I'd like to update the 
https://tika.apache.org/1.23/gettingstarted.html page.  I could download it, 
edit it, and attach that as a file?



Re: [EXTERNAL] Do we have a community supported approach for deploying Tika Server in production?

2020-02-05 Thread Eric Pugh
Following this thread, should we deprecate/remove the Tika Docker support that 
is in the Tika-server project?  

The `mvn dockerfile:build` command now relies on a plugin that is no longer 
supported according to https://github.com/spotify/dockerfile-maven, and it 
seems like the Tika-docker project is really the right place for this!

I’m thinking that this might help reduce the footprint of things we need to 
support.








> On Jan 9, 2020, at 12:08 AM, Chris Mattmann wrote:
> 
> +1
> 
> Note there is also a USC tika dockers repo where I put the data science stuff 
> too:
> 
> http://github.com/USCDataScience/tika-dockers
> 
> I’ll continue to push DL and ML Tika stuff there.
> 
> Cheers,
> Chris
> 
> From: Dave Meikle 
> Reply-To: "dev@tika.apache.org" 
> Date: Wednesday, January 8, 2020 at 2:18 PM
> To: "" 
> Subject: Re: [EXTERNAL] Do we have a community supported approach for 
> deploying Tika Server in production?
> 
> Hi Eric,
> 
> Will take a look. On a related note, I've created a new repos:
> https://github.com/apache/tika-docker
> 
> Thinking based on looking at the PRs and Issues on LogicalSpark
> docker-tikaserver, I'll create an updated docker file using what you've
> added here and look to publish builds to docker hub from that.
> 
> What do you think?
> 
> Cheers,
> Dave
> 
> On Wed, 8 Jan 2020 at 03:16, Eric Pugh wrote:
> 
> Hi all, I’ve gone ahead and added the -spawnChild property as a default
> when running Tika Server as a service.   I’d love some eyes on the PR, and
> if this looks good, get it committed.
> 
> Feedback welcome!
> 
> Eric
> 

[jira] [Created] (TIKA-3037) Tika Docs should highlight Tika-Server

2020-02-05 Thread David Eric Pugh (Jira)
David Eric Pugh created TIKA-3037:
-

 Summary: Tika Docs should highlight Tika-Server
 Key: TIKA-3037
 URL: https://issues.apache.org/jira/browse/TIKA-3037
 Project: Tika
  Issue Type: Improvement
  Components: documentation
Affects Versions: 1.23
Reporter: David Eric Pugh


Currently the Tika website and many of the project docs don't surface the Tika 
Server project.  This is a ticket to track fixing those issues.





Re: [EXTERNAL] Do we have a community supported approach for deploying Tika Server in production?

2020-01-07 Thread Eric Pugh
Hi all, I’ve gone ahead and added the -spawnChild property as a default when 
running Tika Server as a service.   I’d love some eyes on the PR, and if this 
looks good, get it committed.   

Feedback welcome!

Eric
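
For anyone wanting to try the same setting by hand, a minimal sketch of a launch wrapper follows. It assumes a jar named `tika-server-1.23.jar` in the current directory and port 9998 passed via `-p` (both placeholders), and the helper name `tika_server_cmd` is made up for illustration; none of this is taken from the PR. The function only prints the command so it can be inspected before it is run.

```shell
#!/bin/sh
# Assumed jar name and location; adjust to the actual install.
TIKA_SERVER_JAR="tika-server-1.23.jar"

# Print the launch command with -spawnChild always included,
# followed by any extra options supplied by the caller.
tika_server_cmd() {
  echo "java -jar $TIKA_SERVER_JAR -spawnChild $*"
}

# Show the command a service script might run.
tika_server_cmd -p 9998
```

Keeping the flag inside the wrapper means a service definition that calls it cannot accidentally start the server without it.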



> On Dec 17, 2019, at 12:53 PM, Eric Pugh wrote:
> 
> Cool.   
> 
> It’s the auto run that I really need, and the other part that I don’t think 
> I’ve tackled properly is the managing of logs…
> 
> I’m going to check with my project to see if they support Snap packages.
> 
> Eric
> 
> 
>> On Dec 16, 2019, at 5:10 PM, Tom Barber > <mailto:t...@spicule.co.uk>> wrote:
>> 
>> Just saw this fly by and FYI on Linux systems that support Snap packages 
>> (Ubuntu/Debian/Arch/Fedora etc) you can `snap install tika-server`. It doesn’t 
>> yet auto-run, I don’t believe, but you can just run `tika-server.run`, and 
>> adding an init script wouldn’t take 5 minutes.
>> 
>> Tom
>> 
>> On 16 December 2019 at 18:42:55, Eric Pugh (ep...@opensourceconnections.com 
>> <mailto:ep...@opensourceconnections.com>) wrote:
>> 
>>> Hi folks! 
>>> 
>>> I’ve got a mostly completed PR for having install scripts for Tika Server, 
>>> and I’m hoping a committer will take a look at the PR, and give feedback 
>>> (and ideally commit in time for 1.24!) 
>>> 
>>> A couple of things: 
>>> 
>>> 1) This was completely influenced by 
>>> https://lucene.apache.org/solr/guide/8_3/taking-solr-to-production.html#service-installation-script
>>>  
>>> <https://lucene.apache.org/solr/guide/8_3/taking-solr-to-production.html#service-installation-script><https://lucene.apache.org/solr/guide/8_3/taking-solr-to-production.html#service-installation-script
>>>  
>>> <https://lucene.apache.org/solr/guide/8_3/taking-solr-to-production.html#service-installation-script>>,
>>>  in fact I started with the Solr scripts. 
>>> 
>>> 2) I’ve deleted all the Solr specific aspects (I think), however there may 
>>> still be more to delete.  
>>> 
>>> 3) This requires a change to how we release Tika, previously we ship 
>>> tika-app.jar and Tika-eval.jar, and Tika-server.jar, and now, I think, we 
>>> want to add the tika-server-bin.tgz and tika-server-bin.zip binary 
>>> distributions. 
>>> 
>>> I’m happy to start writing accompanying “how to deploy Tika Server” docs if 
>>> this PR looks good! Or, please give input and I’ll make the updates.
>>> 
>>> Eric 
>>> 
>>> 
>>> > On Dec 12, 2019, at 2:39 PM, Eric Pugh >> > <mailto:ep...@opensourceconnections.com>> wrote: 
>>> >  
>>> > I’ve created this JIRA to track this work: 
>>> > https://issues.apache.org/jira/browse/TIKA-3010 
>>> > <https://issues.apache.org/jira/browse/TIKA-3010> 
>>> > <https://issues.apache.org/jira/browse/TIKA-3010 
>>> > <https://issues.apache.org/jira/browse/TIKA-3010>> 
>>> >  
>>> > And a WIP progress PR is at https://github.com/apache/tika/pull/305 
>>> > <https://github.com/apache/tika/pull/305> 
>>> > <https://github.com/apache/tika/pull/305 
>>> > <https://github.com/apache/tika/pull/305>> 
>>> >  
>>> > My thought is to put something together that mimics how we deploy Solr, 
>>> > and see how that works. I have a need for an install process that a 
>>> > general IT person can follow, who isn’t a Tika expert or a Docker user. 
>>> >  
>>> >  
>>> >  
>>> >  
>>> >> On Dec 4, 2019, at 12:28 PM, Chris Mattmann >> >> <mailto:mattm...@apache.org> <mailto:mattm...@apache.org 
>>> >> <mailto:mattm...@apache.org>>> wrote: 
>>> >>  
>>> >> Thanks for bringing this conversation up Eric. 
>>> >>  
>>> >>  
>>> >>  
>>> >> Historically if you look over the last 5 years, I think what you are 
>>> >> asking below has sort of already become the de facto 
>>> >> truth. Most people are in fact using Tika server, whether they are 
>>> >> individual devs, govvies, commercial folk and the like.  
>>> >>  
>>> >> Big, small and medium projects. Evidenced by the expansion of Tika APIs 
>>> >> into pretty much every PL I know and use of  
>>> >> actively today. 
>>> >>  
>>> >>  
>>> >>  
>>> &

Re: [EXTERNAL] Do we have a community supported approach for deploying Tika Server in production?

2019-12-17 Thread Eric Pugh
Cool.   

It’s the auto run that I really need, and the other part that I don’t think 
I’ve tackled properly is the managing of logs…

I’m going to check with my project to see if they support Snap packages.

Eric


> On Dec 16, 2019, at 5:10 PM, Tom Barber  wrote:
> 
> Just saw this fly by and FYI on Linux systems that support Snap packages 
> (Ubuntu/Debian/Arch/Fedora etc) you can `snap install tika-server`. It doesn’t 
> yet auto-run, I don’t believe, but you can just run `tika-server.run`, and 
> adding an init script wouldn’t take 5 minutes.
> 
> Tom
> 
> On 16 December 2019 at 18:42:55, Eric Pugh (ep...@opensourceconnections.com 
> <mailto:ep...@opensourceconnections.com>) wrote:
> 
>> Hi folks! 
>> 
>> I’ve got a mostly completed PR for having install scripts for Tika Server, 
>> and I’m hoping a committer will take a look at the PR, and give feedback 
>> (and ideally commit in time for 1.24!) 
>> 
>> A couple of things: 
>> 
>> 1) This was completely influenced by 
>> https://lucene.apache.org/solr/guide/8_3/taking-solr-to-production.html#service-installation-script
>>  
>> <https://lucene.apache.org/solr/guide/8_3/taking-solr-to-production.html#service-installation-script><https://lucene.apache.org/solr/guide/8_3/taking-solr-to-production.html#service-installation-script
>>  
>> <https://lucene.apache.org/solr/guide/8_3/taking-solr-to-production.html#service-installation-script>>,
>>  in fact I started with the Solr scripts. 
>> 
>> 2) I’ve deleted all the Solr specific aspects (I think), however there may 
>> still be more to delete.  
>> 
>> 3) This requires a change to how we release Tika, previously we ship 
>> tika-app.jar and Tika-eval.jar, and Tika-server.jar, and now, I think, we 
>> want to add the tika-server-bin.tgz and tika-server-bin.zip binary 
>> distributions. 
>> 
>> I’m happy to start writing accompanying “how to deploy Tika Server” docs if 
>> this PR looks good! Or, please give input and I’ll make the updates.
>> 
>> Eric 
>> 
>> 
>> > On Dec 12, 2019, at 2:39 PM, Eric Pugh > > <mailto:ep...@opensourceconnections.com>> wrote: 
>> >  
>> > I’ve created this JIRA to track this work: 
>> > https://issues.apache.org/jira/browse/TIKA-3010 
>> > <https://issues.apache.org/jira/browse/TIKA-3010> 
>> > <https://issues.apache.org/jira/browse/TIKA-3010 
>> > <https://issues.apache.org/jira/browse/TIKA-3010>> 
>> >  
>> > And a WIP progress PR is at https://github.com/apache/tika/pull/305 
>> > <https://github.com/apache/tika/pull/305> 
>> > <https://github.com/apache/tika/pull/305 
>> > <https://github.com/apache/tika/pull/305>> 
>> >  
>> > My thought is to put something together that mimics how we deploy Solr, 
>> > and see how that works. I have a need for an install process that a 
>> > general IT person can follow, who isn’t a Tika expert or a Docker user. 
>> >  
>> >  
>> >  
>> >  
>> >> On Dec 4, 2019, at 12:28 PM, Chris Mattmann > >> <mailto:mattm...@apache.org> <mailto:mattm...@apache.org 
>> >> <mailto:mattm...@apache.org>>> wrote: 
>> >>  
>> >> Thanks for bringing this conversation up Eric. 
>> >>  
>> >>  
>> >>  
>> >> Historically if you look over the last 5 years, I think what you are 
>> >> asking below has sort of already become the de facto 
>> >> truth. Most people are in fact using Tika server, whether they are 
>> >> individual devs, govvies, commercial folk and the like.  
>> >>  
>> >> Big, small and medium projects. Evidenced by the expansion of Tika APIs 
>> >> into pretty much every PL I know and use of  
>> >> actively today. 
>> >>  
>> >>  
>> >>  
>> >> Given that, we probably should update the main website docs to make this 
>> >> more prominent. The tika server docs on the 
>> >> wiki are pretty darn good. But they don’t get prime real estate. Would be 
>> >> wonderful if someone wants to update the  
>> >> website to make it more prominent. 
>> >>  
>> >>  
>> >>  
>> >> The downstream Tika Python lib that I maintain has tons of activity, is 
>> >> used by more than 350 projects, and relies solely 
>> >> on Tika-Server. My recommendation to the Solr folks (having created 7633) 
>> >> from the 2

Re: [EXTERNAL] Do we have a community supported approach for deploying Tika Server in production?

2019-12-16 Thread Eric Pugh
Hi folks!

I’ve got a mostly completed PR for having install scripts for Tika Server, and 
I’m hoping a committer will take a look at the PR, and give feedback (and 
ideally commit in time for 1.24!)

A couple of things:

1) This was completely influenced by the Solr service installation script 
(https://lucene.apache.org/solr/guide/8_3/taking-solr-to-production.html#service-installation-script). 
In fact, I started with the Solr scripts.

2) I’ve deleted all the Solr-specific aspects (I think); however, there may 
still be more to delete.

3) This requires a change to how we release Tika. Previously we shipped 
tika-app.jar, tika-eval.jar, and tika-server.jar; now, I think, we want 
to add the tika-server-bin.tgz and tika-server-bin.zip binary distributions.

I’m happy to start writing accompanying “how to deploy Tika Server” docs if 
this PR looks good!   Or, please give input and I’ll make the updates.

Eric


> On Dec 12, 2019, at 2:39 PM, Eric Pugh  
> wrote:
> 
> I’ve created this JIRA to track this work: 
> https://issues.apache.org/jira/browse/TIKA-3010 
> <https://issues.apache.org/jira/browse/TIKA-3010>
> 
> And a WIP progress PR is at https://github.com/apache/tika/pull/305 
> <https://github.com/apache/tika/pull/305>
> 
> My thought is to put something together that mimics how we deploy Solr, and 
> see how that works.   I have a need for an install process that a general IT 
> person can follow, who isn’t a Tika expert or a Docker user.
> 
> 
> 
> 
>> On Dec 4, 2019, at 12:28 PM, Chris Mattmann > <mailto:mattm...@apache.org>> wrote:
>> 
>> Thanks for bringing this conversation up Eric.
>> 
>> 
>> 
>> Historically if you look over the last 5 years, I think what you are asking 
>> below has sort of already become the de facto
>> truth. Most people are in fact using Tika server, whether they are 
>> individual devs, govvies, commercial folk and the like. 
>> 
>> Big, small and medium projects. Evidenced by the expansion of Tika APIs into 
>> pretty much every PL I know and use of 
>> actively today.
>> 
>> 
>> 
>> Given that, we probably should update the main website docs to make this 
>> more prominent. The tika server docs on the
>> wiki are pretty darn good. But they don’t get prime real estate. Would be 
>> wonderful if someone wants to update the 
>> website to make it more prominent.
>> 
>> 
>> 
>> The downstream Tika Python lib that I maintain has tons of activity, is used 
>> by more than 350 projects, and relies solely
>> on Tika-Server. My recommendation to the Solr folks (having created 7633) 
>> from the 2014 DARPA MEMEX days was to 
>> move towards Tika Server based SolrCell dep and that’s the right way to go 
>> IMO.
>> 
>> 
>> 
>> Chris
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> From: Eric Pugh > <mailto:ep...@opensourceconnections.com>>
>> Reply-To: "dev@tika.apache.org <mailto:dev@tika.apache.org>" 
>> mailto:dev@tika.apache.org>>
>> Date: Wednesday, December 4, 2019 at 12:24 PM
>> To: "tika-...@apache.org <mailto:tika-...@apache.org>" > <mailto:tika-...@apache.org>>
>> Subject: [EXTERNAL] Do we have a community supported approach for deploying 
>> Tika Server in production?
>> 
>> 
>> 
>> Hi all - Hoping this is a reasonable Tika-dev versus Tika-user question!
>> 
>> 
>> 
>> Over in Solr land there has been renewed discussion about streamlining what 
>> Solr is   
>> 
>> 
>> 
>> In regards to rich content extraction and the Tika project, it seems like 
>> the two ideas that continue to preserve the existing behavior are:
>> 
>> 
>> 
>> 1) To convert the ExtractingRequestHandler into a Package (Plugin) for Solr. 
>>   This slims down the standard Solr download, and *might* make it easier to 
>> update the version of Tika + dependent jars used?
>> 
>> 
>> 
>> 2) The second approach is to instead require Tika-Server to be running 
>> (https://issues.apache.org/jira/browse/SOLR-7633 
>> <https://issues.apache.org/jira/browse/SOLR-7633>) and just have Solr 
>> delegate the call to Tika-Server.
>> 
>> 
>> 
>> 
>> 
>> I was thinking about why I like option 1 better than 2, and I think it boils 
>> down to how mature the IT organization I am working with is.  Some IT 
>> organizations have large dev-ops teams, and are working at m

[jira] [Commented] (TIKA-3010) Tika needs service installation script

2019-12-12 Thread David Eric Pugh (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16995174#comment-16995174
 ] 

David Eric Pugh commented on TIKA-3010:
---

Made more progress.   Now, when you run the `package` goal on the tika-server 
project, it creates two assemblies, `tika-server-2.0.0-SNAPSHOT-bin.tar.gz` and 
`tika-server-2.0.0-SNAPSHOT-bin.zip` which contain the shaded `tika-server.jar` 
and the `./bin/tika` script to stop and start Tika Server.  

The next step, I think, is to introduce the `./bin/init.d` directory and the 
`install_tika_service.sh` script.
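
A quick way to sanity-check one of those assemblies is sketched below in Python. It only assumes what is stated above: the `.tar.gz` should contain the shaded `tika-server.jar` and a `bin/tika` script. The directory layout inside the archive is a guess, so the check matches on path suffixes, and `missing_entries` is a hypothetical helper name.

```python
import tarfile


def missing_entries(archive_file, required_suffixes=("tika-server.jar", "bin/tika")):
    """Return the required path suffixes that no entry in the .tar.gz ends with."""
    with tarfile.open(fileobj=archive_file, mode="r:gz") as tar:
        names = tar.getnames()
    # Match on suffixes because the top-level directory name is an assumption.
    return [
        suffix
        for suffix in required_suffixes
        if not any(name.endswith(suffix) for name in names)
    ]
```

To use it, open the built archive in binary mode, pass the file object to `missing_entries`, and treat a non-empty result as a packaging problem.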

> Tika needs service installation script 
> ---
>
>              Key: TIKA-3010
>              URL: https://issues.apache.org/jira/browse/TIKA-3010
>          Project: Tika
>       Issue Type: Improvement
>       Components: server
> Affects Versions: 1.23
>         Reporter: David Eric Pugh
>         Priority: Major
>
> With motion towards removing the tight integration of Tika into Solr, and the 
> fact that many folks deploy Tika-Server as a microservice, we should have a 
> community supported way of installing Tika.
> I'm thinking of something modeled on what Solr does: 
> https://lucene.apache.org/solr/guide/8_3/taking-solr-to-production.html#service-installation-script



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (TIKA-3010) Tika needs service installation script

2019-12-12 Thread David Eric Pugh (Jira)


 [ 
https://issues.apache.org/jira/browse/TIKA-3010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Eric Pugh updated TIKA-3010:
--
Flags: Patch,Important  (was: Important)

> Tika needs service installation script 
> ---
>
>              Key: TIKA-3010
>              URL: https://issues.apache.org/jira/browse/TIKA-3010
>          Project: Tika
>       Issue Type: Improvement
>       Components: server
> Affects Versions: 1.23
>         Reporter: David Eric Pugh
>         Priority: Major
>
> With motion towards removing the tight integration of Tika into Solr, and the 
> fact that many folks deploy Tika-Server as a microservice, we should have a 
> community supported way of installing Tika.
> I'm thinking of something modeled on what Solr does: 
> https://lucene.apache.org/solr/guide/8_3/taking-solr-to-production.html#service-installation-script



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: [EXTERNAL] Do we have a community supported approach for deploying Tika Server in production?

2019-12-12 Thread Eric Pugh
I’ve created this JIRA to track this work: 
https://issues.apache.org/jira/browse/TIKA-3010

And a WIP PR is at https://github.com/apache/tika/pull/305

My thought is to put something together that mimics how we deploy Solr, and see 
how that works.   I need an install process that a general IT person, who isn’t 
a Tika expert or a Docker user, can follow.




> On Dec 4, 2019, at 12:28 PM, Chris Mattmann  wrote:
> 
> Thanks for bringing this conversation up Eric.
> 
> 
> 
> Historically if you look over the last 5 years, I think what you are asking 
> below has sort of already become the de facto
> truth. Most people are in fact using Tika server, whether they are individual 
> devs, govvies, commercial folk and the like. 
> 
> Big, small and medium projects. Evidenced by the expansion of Tika APIs into 
> pretty much every PL I know and use of 
> actively today.
> 
> 
> 
> Given that, we probably should update the main website docs to make this more 
> prominent. The tika server docs on the
> wiki are pretty darn good. But they don’t get prime real estate. Would be 
> wonderful if someone wants to update the 
> website to make it more prominent.
> 
> 
> 
> The downstream Tika Python lib that I maintain has tons of activity, is used 
> by more than 350 projects, and relies solely
> on Tika-Server. My recommendation to the Solr folks (having created 7633) 
> from the 2014 DARPA MEMEX days was to 
> move towards Tika Server based SolrCell dep and that’s the right way to go 
> IMO.
> 
> 
> 
> Chris
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> 
> From: Eric Pugh  <mailto:ep...@opensourceconnections.com>>
> Reply-To: "dev@tika.apache.org <mailto:dev@tika.apache.org>" 
> mailto:dev@tika.apache.org>>
> Date: Wednesday, December 4, 2019 at 12:24 PM
> To: "tika-...@apache.org <mailto:tika-...@apache.org>"  <mailto:tika-...@apache.org>>
> Subject: [EXTERNAL] Do we have a community supported approach for deploying 
> Tika Server in production?
> 
> 
> 
> Hi all - Hoping this is a reasonable Tika-dev versus Tika-user question!
> 
> 
> 
> Over in Solr land there has been renewed discussion about streamlining what 
> Solr is   
> 
> 
> 
> In regards to rich content extraction and the Tika project, it seems like the 
> two ideas that continue to preserve the existing behavior are:
> 
> 
> 
> 1) To convert the ExtractingRequestHandler into a Package (Plugin) for Solr.  
>  This slims down the standard Solr download, and *might* make it easier to 
> update the version of Tika + dependent jars used?
> 
> 
> 
> 2) The second approach is to instead require Tika-Server to be running 
> (https://issues.apache.org/jira/browse/SOLR-7633) and just have Solr delegate 
> the call to Tika-Server.
> 
> 
> 
> 
> 
> I was thinking about why I like option 1 better than 2, and I think it boils 
> down to how mature the IT organization I am working with is.  Some IT 
> organizations have large dev-ops teams, and are working at major scale, and 
> managing a fleet of Tika-Server on Kubernetes with Load Balancer dynamically 
> scaling up and down is simple and second nature!  However, many organizations 
> aren’t like that.
> 
> 
> 
> So I guess what I’m asking is do we have a reasonable supported approach for 
> deploying Tika Server for non-tika savvy organizations?   I’m thinking about 
> Solr, and specifically the fact that Solr has a well defined set of Service 
> Installation scripts.   When I follow the directions in 
> https://lucene.apache.org/solr/guide/8_3/taking-solr-to-production.html#taking-solr-to-production
>  I can feel confident that when the server is rebooted, then Solr will come 
> back up!   Plus there is log rotation and all the rest.
> 
> 
> 
> In contrast, when I look at the Tika website, specifically the 
> https://tika.apache.org/1.22/gettingstarted.html page, the message is to run 
> Tika as a command line application, or embedded in your application.   
> 
> 
> 
> I’m wondering if Tika-Server needs to be made more prominent, and treated as 
> the “primary method of interacting with Tika”?   Do we need as a community to 
> focus more on Tika-Server?   In our getting started documentation, in our 
> usage documentation, and in our examples?
> 
> 
> 
> Do we need to create the equivalent of the Service Installation scripts for 
> Tika-Server?   
> 
> 
> 
> Wanted to stoke the discussion!
> 
> 
> 
> Eric
> 
> 
> 
> ___
> 
> Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1

[jira] [Created] (TIKA-3010) Tika needs service installation script

2019-12-12 Thread David Eric Pugh (Jira)
David Eric Pugh created TIKA-3010:
-

         Summary: Tika needs service installation script
             Key: TIKA-3010
             URL: https://issues.apache.org/jira/browse/TIKA-3010
         Project: Tika
      Issue Type: Improvement
      Components: server
Affects Versions: 1.23
        Reporter: David Eric Pugh


With motion towards removing the tight integration of Tika into Solr, and the 
fact that many folks deploy Tika-Server as a microservice, we should have a 
community supported way of installing Tika.

I'm thinking of something modeled on what Solr does: 
https://lucene.apache.org/solr/guide/8_3/taking-solr-to-production.html#service-installation-script



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Miredot documentation is missing for 1.23...

2019-12-12 Thread Eric Pugh
The https://tika.apache.org/1.23/miredot/ URL returns a 404.   It looks like 
https://tika.apache.org/1.22/miredot/ works.


___
Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | 
http://www.opensourceconnections.com <http://www.opensourceconnections.com/> | 
My Free/Busy <http://tinyurl.com/eric-cal>  
Co-Author: Apache Solr Enterprise Search Server, 3rd Ed 
<https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>

This e-mail and all contents, including attachments, is considered to be 
Company Confidential unless explicitly stated otherwise, regardless of whether 
attachments are marked as such.



Do we have a community supported approach for deploying Tika Server in production?

2019-12-04 Thread Eric Pugh
Hi all - Hoping this is a reasonable Tika-dev versus Tika-user question!

Over in Solr land there has been renewed discussion about streamlining what 
Solr is   

In regards to rich content extraction and the Tika project, it seems like the 
two ideas that continue to preserve the existing behavior are:

1) To convert the ExtractingRequestHandler into a Package (Plugin) for Solr.   
This slims down the standard Solr download, and *might* make it easier to 
update the version of Tika + dependent jars used?

2) The second approach is to instead require Tika-Server to be running 
(https://issues.apache.org/jira/browse/SOLR-7633) and just have Solr delegate 
the call to Tika-Server.
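
To make option 2 concrete, here is a rough Python sketch of what delegating a call to Tika-Server looks like from a client. It assumes a server on the default port 9998 and the `PUT /tika` endpoint returning plain text; the helper names are hypothetical, and this is not how Solr or the Tika Python lib actually implement it.

```python
import urllib.request

# Assumed: a Tika Server already running locally on its default port.
TIKA_URL = "http://localhost:9998/tika"


def build_extract_request(document_bytes, url=TIKA_URL):
    """Build the HTTP PUT request that asks Tika Server for extracted text."""
    return urllib.request.Request(
        url,
        data=document_bytes,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(document_bytes, url=TIKA_URL, timeout=60):
    """Send the document to Tika Server and return the extracted text."""
    request = build_extract_request(document_bytes, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

The client needs nothing beyond an HTTP library, which is the appeal of this option; the cost is that somebody has to keep the server process running.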


I was thinking about why I like option 1 better than 2, and I think it boils 
down to how mature the IT organization I am working with is.  Some IT 
organizations have large dev-ops teams, and are working at major scale, and 
managing a fleet of Tika-Server on Kubernetes with Load Balancer dynamically 
scaling up and down is simple and second nature!  However, many organizations 
aren’t like that.

So I guess what I’m asking is: do we have a reasonable, supported approach for 
deploying Tika Server for organizations that aren’t Tika-savvy?   I’m thinking 
about Solr, and specifically the fact that Solr has a well-defined set of 
Service Installation scripts.   When I follow the directions in 
https://lucene.apache.org/solr/guide/8_3/taking-solr-to-production.html#taking-solr-to-production
I can feel confident that when the server is rebooted, Solr will come 
back up!   Plus there is log rotation and all the rest.

In contrast, when I look at the Tika website, specifically the 
https://tika.apache.org/1.22/gettingstarted.html page, the message is to run 
Tika as a command line application, or embedded in your application.   

I’m wondering if Tika-Server needs to be made more prominent, and treated as 
the “primary method of interacting with Tika”?   Do we need as a community to 
focus more on Tika-Server?   In our getting started documentation, in our usage 
documentation, and in our examples?

Do we need to create the equivalent of the Service Installation scripts for 
Tika-Server?   

Wanted to stoke the discussion!

Eric




Re: regression tests for 1.23-rc1

2019-11-22 Thread Eric Pugh
I feel like you just experienced a wonderful lesson that we all periodically 
experience….  “Extracting data at scale”

I wonder, is there any way of coming up with heuristics to predict how long 
the process would take?  “Based on your settings, based on your doc types, 
based on sizes, based on historical records….   It will take 20 hours to run”…
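
As a strawman for that kind of heuristic, here is a small Python sketch that multiplies file counts per type by a historical average time per file. Everything in it is an assumption for illustration: the function name, the per-type averages and the worker count are made up, and it ignores OOMs, retries and per-file size.

```python
def estimate_hours(file_counts, seconds_per_file, workers=1):
    """Estimate wall-clock hours from per-type counts and historical averages."""
    total_seconds = sum(
        count * seconds_per_file[doc_type]
        for doc_type, count in file_counts.items()
    )
    # Assumes the work divides evenly across workers, which is optimistic.
    return total_seconds / workers / 3600.0


if __name__ == "__main__":
    # Hypothetical mix for a 500k-file run; the averages are invented numbers.
    counts = {"pdf": 300_000, "docx": 150_000, "html": 50_000}
    averages = {"pdf": 0.9, "docx": 0.4, "html": 0.1}
    print(f"Estimated run time: {estimate_hours(counts, averages, workers=4):.1f} hours")
```

A setting such as extracting images from PDFs could be modelled as a different per-type average, which would have flagged the longer run up front.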



> On Nov 22, 2019, at 8:25 AM, Tim Allison  wrote:
> 
> All,
>  I started the regression tests on a random set of 500k files.  I found
> this morning that it was _still_ going.  It turns out I had accidentally
> configured extract images for PDFs, which adds to the processing time and
> leads to more OOMs.
>  I restarted the regression tests this morning with that feature turned
> off.
> 
>   Best,
> 
>   Tim




Re: [EXTERNAL] Docker image along with 1.23?

2019-11-21 Thread Eric Pugh
That makes sense.   Having a robust Dockerfile, even if it isn’t published, is 
a great way of modeling best practices in running Tika in server mode.



> On Nov 21, 2019, at 3:26 AM, Nick Burch  wrote:
> 
> On Thu, 21 Nov 2019, Oleg Tikhonov wrote:
>> My question is more pragmatic.
>> What we put inside the Dockerfile, on which image it will be based on (say
>> Ubuntu) ...
>> What will contain an entrypoint? Tika Server? Should we "install" a
>> tesseract? Anything more?
> 
> If we want to be trendy, then Sergey Beryozkin did some cool stuff with 
> Quarkus and a GraalVM native image of Tika, video online at
> https://aceu19.apachecon.com/session/apache-tika-goes-native-graalvm-and-quarkus
> 
> I'd possibly suggest two dockerfiles (but not published images!), both based 
> on a fairly thin common Java base image (so probably ubuntu rather than 
> alpine). One with just Tika Server + tesseract + English tesseract data, one 
> with all the optional Tika dependencies (SQL native libraries etc) and 
> tesseract and all the available tesseract languages
> 
> Some other projects are currently leading the debate on ASF binary releases 
> that bundle the JVM, I'd suggest we wait for that to resolve before we think 
> about trying to publish pre-built images ourselves. Linking to images from 
> external organisations we trust should be fine though, eg similar to 
> http://httpd.apache.org/docs/current/platform/windows.html#down
> 
> Nick




Re: [EXTERNAL] Docker image along with 1.23?

2019-11-20 Thread Eric Pugh
I was thinking more of producing the actual image, so that others don’t have to 
go through the pain of compiling an image.   Having the Dockerfile made 
available as well does give a nice recipe for modifying the “official” image.   
I recently tested Tesseract 3 with the latest Tika, and I did it by tweaking 
the existing Dockerfile that LogicalSpark has published.

I don’t know how other projects at ASF handle the image publishing.




> On Nov 20, 2019, at 7:02 PM, Chris Mattmann  wrote:
> 
> Nick, TBH, I don’t get it. If we ship the “Dockerfile” we are simply shipping 
> a text file, 
> code. Under a license. If we create a “docker image” and then publish it to 
> the ASF 
> hub then I agree with you.
> 
> 
> 
> My suggestion and my interpretation of Tim’s is to ship a standard 
> “Dockerfile”. Do you
> agree with this? It should be air covered (as former VP, Legal, at least it 
> would have been
> with me). 
> 
> 
> 
> Cheers,
> 
> Chris
> 
> 
> 
> 
> 
> 
> 
> 
> 
> From: Nick Burch 
> Reply-To: "dev@tika.apache.org" 
> Date: Wednesday, November 20, 2019 at 3:57 PM
> To: "Allison, Timothy B (US 1760-Affiliate)" 
> Cc: "" 
> Subject: [EXTERNAL] Re: Docker image along with 1.23?
> 
> 
> 
> On Wed, 20 Nov 2019, Tim Allison wrote:
> 
> Eric Pugh recently asked on another channel if we had any plans to
> release an official docker image for 1.23.
> 
> Depending on what we put in the container, we do need to be a little 
> careful. There's "platform dependencies" under non-compatible licenses 
> that we can optionally use if people have installed them, which we 
> ourselves can't directly ship under ASF rules. (Tesseract is fine as 
> that's Apache Licensed, Java itself is trickier, see the Netbeans 
> discussions on legal-discuss@ and LEGAL jira)
> 
> Shipping an official docker container with the Tika Server on seems to me 
> to be a helpful step for users, but we just need to make sure we're 
> following ASF policies. (The Apache Software Foundation mission is to 
> "provide software for the public good", but source code is the main focus 
> for the mission, binaries are trickier!)
> 
> Nick
> 
> 
> 




Re: [EXTERNAL] Tika 1.23?

2019-11-20 Thread Eric Pugh
+1 from contributor

On Wed, Nov 20, 2019 at 12:09 PM Chris Mattmann  wrote:

> +1 ship it
>
>
>
>
>
>
>
> From: Tim Allison 
> Reply-To: "dev@tika.apache.org" , "Allison, Timothy
> B (US 1760-Affiliate)" 
> Date: Wednesday, November 20, 2019 at 9:07 AM
> To: "" 
> Subject: [EXTERNAL] Tika 1.23?
>
>
>
> All,
>
>   I've abandoned hope of getting the contenthandler factory configuration
>
> stuff into 1.23.  We've added some new mime types, upgraded POI and made a
>
> number of other useful changes.
>
>   WDYT about kicking off regression tests shortly?  Any blockers?
>
>
>
>   Best,
>
>
>
> Tim
>
>
>
>


Re: Grant write access to our wiki to Eric Pugh

2019-10-31 Thread Eric Pugh
Thanks Nick, I’ll dig a bit more on those two links.

If nothing else, I’d like to get the examples all up to 1.23.



> On Oct 31, 2019, at 9:16 AM, Nick Burch  <mailto:apa...@gagravarr.org>> wrote:
> 
> On Wed, 30 Oct 2019, Eric Pugh wrote:
>> I’ve been going through the Wiki a lot over the past three months, and I’d 
>> love to go through and clean out/update the old content.
> 
> Wonderful, thanks!
> 
> In case you're also feeling keen, the source for the website is
> https://svn.apache.org/repos/asf/tika/site/src/site/apt 
> <https://svn.apache.org/repos/asf/tika/site/src/site/apt> and the example 
> programs are https://svn.apache.org/repos/asf/tika/trunk/tika-example/src 
> <https://svn.apache.org/repos/asf/tika/trunk/tika-example/src>
> 
>> What do you think of me cloning a wiki page, making a wholesale set of 
>> edits, getting review of those edits from the community, and assuming it 
>> passes muster, then bringing the edits back to the original page?
> 
> As long as it's easy to review the changes, I'd say go for whatever makes 
> your life easier! If doing it in the wiki is best, your plan sounds good. If 
> you want to download the page markup, tweak offline, and share a diff, that's 
> likely fine too. Those who volunteer to do the hard work largely get to pick 
> their method :)
> 
> Thanks
> Nick

___
Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | 
http://www.opensourceconnections.com <http://www.opensourceconnections.com/> | 
My Free/Busy <http://tinyurl.com/eric-cal>  
Co-Author: Apache Solr Enterprise Search Server, 3rd Ed 
<https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>

This e-mail and all contents, including attachments, is considered to be 
Company Confidential unless explicitly stated otherwise, regardless of whether 
attachments are marked as such.



Re: Grant write access to our wiki to Eric Pugh

2019-10-30 Thread Eric Pugh
Thanks folks….

I’ve been going through the Wiki a lot over the past three months, and I’d love 
to go through and clean out/update the old content.  

What do you think of me cloning a wiki page, making a wholesale set of edits, 
getting review of those edits from the community, and assuming it passes 
muster, then bringing the edits back to the original page?   



Eric

> On Oct 29, 2019, at 7:00 PM, Ken Krugler  wrote:
> 
> 
>> On Oct 29, 2019, at 3:10 PM, Nick Burch > <mailto:apa...@gagravarr.org>> wrote:
>> 
>> On Tue, 29 Oct 2019, Tim Allison wrote:
>>> Anyone object if I grant write access to our wiki to Eric Pugh.  He slacked 
>>> me a request.
>> 
>> I'd almost be tempted to say that we should grant access to all ASF 
>> Committers to our wiki.
> 
> +1, CTR FTW :)
> 
> — Ken
> 
>> (Note - not all confluence users, as that includes fresh spammy sign-ups). As 
>> long as we get notifications of changes (which I think we still do 
>> post-migration?), so we can double check their changes, it should help other 
>> ASF project committers contribute.
>> 
>> Otherwise, for non-committers, I'd suggest we take an approach similar to 
>> the old-wiki one from incubator, which is to grant access to anyone who 
>> writes a vaguely sensible looking email to the list to request it (see above 
>> for change notifications double-check!)
>> 
>> Thanks
>> Nick
> 
> --
> Ken Krugler
> http://www.scaleunlimited.com <http://www.scaleunlimited.com/>
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr
> 




Re: reconfiguring ossindex-maven-plugin for releases?

2019-10-29 Thread Eric Pugh
I think so, especially if you documented that option in the README.  

I actually ran into the same issue when I ran “mvn eclipse:eclipse”, and had to 
add that parameter to get my Eclipse config files built!


> On Oct 29, 2019, at 11:40 AM, Tim Allison  wrote:
> 
> Or should we just require users to build w:  -Dossindex.fail=false
> 
> On Tue, Oct 29, 2019 at 11:38 AM Tim Allison  wrote:
> 
>> All,
>>  Now that we are using the ossindex-maven-plugin, there's an annoying
>> feature for folks trying to build earlier releases...namely they can't if a
>> new vulnerability has crept in since we made the release.
>>  Is there an elegant way to handle this?  My knuckle-dragger idea would be
>> to set it to "warn" for the tagged release as part of the release process,
>> and then turn it back to "fail the build" for our working branches.
>>  Any better ideas?
>> 
>>  Cheers,
>> 
>>  Tim
>> 




[jira] [Commented] (TIKA-2968) Display specific command for Tesseract if you are running in Verbose mode

2019-10-23 Thread David Eric Pugh (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-2968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957801#comment-16957801
 ] 

David Eric Pugh commented on TIKA-2968:
---

On a related note: if we want Verbose mode to display the command used to 
run an external parser like Tesseract, should this fix apply to all 
external parsers?

> Display specific command for Tesseract if you are running in Verbose mode
> -
>
> Key: TIKA-2968
> URL: https://issues.apache.org/jira/browse/TIKA-2968
> Project: Tika
> Issue Type: Improvement
> Components: cli, ocr
> Affects Versions: 1.22
> Reporter: David Eric Pugh
> Assignee: Tim Allison
> Priority: Minor
>
> I am attempting to write my own tika-config.xml to configure Tesseract, 
> leveraging what was done in TIKA-2705, instead of using the property file 
> method.   To help me understand what is happening I am running in --verbose 
> mode, so seeing the specific parameters sent to Tesseract would be useful.   
> This is to display the Tesseract parameters.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (TIKA-2968) Display specific command for Tesseract if you are running in Verbose mode

2019-10-23 Thread David Eric Pugh (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-2968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957799#comment-16957799
 ] 

David Eric Pugh commented on TIKA-2968:
---

Hey community, any chance of this being added for 1.23, or alternatively 
closing as not a good fix?

> Display specific command for Tesseract if you are running in Verbose mode
> -
>
> Key: TIKA-2968
> URL: https://issues.apache.org/jira/browse/TIKA-2968
> Project: Tika
> Issue Type: Improvement
> Components: cli, ocr
> Affects Versions: 1.22
> Reporter: David Eric Pugh
> Assignee: Tim Allison
> Priority: Minor
>
> I am attempting to write my own tika-config.xml to configure Tesseract, 
> leveraging what was done in TIKA-2705, instead of using the property file 
> method.   To help me understand what is happening I am running in --verbose 
> mode, so seeing the specific parameters sent to Tesseract would be useful.   
> This is to display the Tesseract parameters.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (TIKA-2971) Link to download OpenNLP models needs to be http not https

2019-10-22 Thread David Eric Pugh (Jira)
David Eric Pugh created TIKA-2971:
-

 Summary: Link to download OpenNLP models needs to be http not https
 Key: TIKA-2971
 URL: https://issues.apache.org/jira/browse/TIKA-2971
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.22
Reporter: David Eric Pugh


In running the tests, I noticed that the `ModelGetter.groovy` file was always 
complaining about downloading the models from Sourceforge.   The link that works 
is the HTTP version, not the HTTPS version that the groovy script has.   I 
notice as well that `get-models.sh` also uses HTTP URLs.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (TIKA-2624) Rendering PDFs for OCR with Tesseract uses different DPI than claimed

2019-10-22 Thread David Eric Pugh (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-2624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16957204#comment-16957204
 ] 

David Eric Pugh commented on TIKA-2624:
---

I am rereading this thread via JIRA rather than the GitHub PR, and since 2.0 
isn't imminent, it seems like merging this would be a good idea?

> Rendering PDFs for OCR with Tesseract uses different DPI than claimed
> -
>
> Key: TIKA-2624
> URL: https://issues.apache.org/jira/browse/TIKA-2624
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.17
> Reporter: Ewan Mellor
> Assignee: Tim Allison
> Priority: Major
>
> Tika has two properties in {{PDFParser.properties}} that control what happens 
> in AbstractPDF2XHTML when a PDF is rendered before being passed to Tesseract 
> for OCR.  These are {{ocrDPI}} (default 300) and {{ocrImageScale}} (default 
> 2.0).
> {{ocrDPI}} is passed to {{ImageIOUtil.writeImage}}, which uses it as the 
> metadata in the image (i.e. it doesn't control scaling at all, it's just an 
> advertised metadata field).
> {{ocrImageScale}} is passed to PDFBox's {{PDFRenderer.renderImage}}, which 
> uses it to specify the scale for rendering.  This value is such that 1.0 == 
> 72dpi, and therefore Tika's default is to request 144dpi for rendering.
> This means that Tika is asking PDFBox to render at 144dpi, and then 
> advertising 300dpi in the image metadata.  This makes no sense to me, and is 
> surely going to confuse Tesseract.
> Instead of doing this, we should remove {{ocrImageScale}}, and use the same 
> DPI value in both places.
> We should keep the existing default DPI value, since Tesseract is trained at 
> 300dpi by default, so this will mean that all stages between PDFRenderer and 
> Tesseract are defaulting to 300dpi.
> This change will have the side-effect that the temporary images between the 
> PDF rendering and Tesseract will be 4x larger (144dpi to 300dpi).  This will 
> have a memory and temporary disk space impact, but I think that it's still 
> best to have the whole pipeline using 300dpi.  People who have memory 
> constraints will need to reduce ocrDPI and make the corresponding changes on 
> the Tesseract side.
>  
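
The numbers in the description above can be checked with a few lines of arithmetic. This is only a sketch: the class and method names are made up for illustration and are not Tika or PDFBox API. The one fact taken from the issue is that a render scale of 1.0 corresponds to 72 dpi.

```java
// Worked arithmetic for the scale/DPI mismatch described in TIKA-2624.
// Assumption: OcrDpiMath and its methods are illustrative names only.
final class OcrDpiMath {
    // Per the issue text, a render scale of 1.0 corresponds to 72 dpi.
    static final double BASE_DPI = 72.0;

    // Effective rendering resolution for a given scale factor.
    static double scaleToDpi(double scale) {
        return scale * BASE_DPI;
    }

    // Scale factor needed to render at a given resolution.
    static double dpiToScale(double dpi) {
        return dpi / BASE_DPI;
    }

    // Ratio of pixel counts when the resolution changes; area grows with
    // the square of the linear resolution.
    static double pixelRatio(double fromDpi, double toDpi) {
        double linear = toDpi / fromDpi;
        return linear * linear;
    }

    public static void main(String[] args) {
        System.out.println("scale 2.0 renders at dpi: " + scaleToDpi(2.0));
        System.out.println("scale needed for 300 dpi: " + dpiToScale(300.0));
        System.out.println("pixel ratio 144 -> 300 dpi: " + pixelRatio(144.0, 300.0));
    }
}
```

Under these assumptions the default scale of 2.0 works out to 144 dpi, rendering at 300 dpi needs a scale a little above 4.16, and the rendered image carries a bit more than four times as many pixels, which matches the "4x larger" estimate in the description.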



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (TIKA-2970) Configuring Tesseract for OCR of PDF via Tika Config is not working

2019-10-20 Thread David Eric Pugh (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-2970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16955612#comment-16955612
 ] 

David Eric Pugh commented on TIKA-2970:
---

It's a work in progress, however here is a unit test: 
https://github.com/apache/tika/pull/291

> Configuring Tesseract for OCR of PDF via Tika Config is not working
> ---
>
> Key: TIKA-2970
> URL: https://issues.apache.org/jira/browse/TIKA-2970
> Project: Tika
> Issue Type: Improvement
> Components: ocr
> Affects Versions: 1.22
> Reporter: David Eric Pugh
> Priority: Critical
>
> Based on TIKA-2705, I thought I could eliminate the use of the properties 
> files for configuring PDF and OCR processing, and just use a tika-config.xml 
> file.
> I believe I have a unit test that demonstrates that if you need to override 
> the tesseract path for OCR, you end up always with the default Tesseract 
> configuration, which leads to Tika throwing an error: 
> https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/pdf/AbstractPDF2XHTML.java#L328
>
> In stepping through the code, it seems like every time we consult the context:
> ```
> TesseractOCRConfig tesseractConfig =
>         context.get(TesseractOCRConfig.class, DEFAULT_TESSERACT_CONFIG);
> ```
> We always get back the default.  The context never has our customized 
> TesseractOCRConfig!   Despite the fact that when we load up the TikaConfig in 
> the first case, I notice that we do create a TesseractOCRParser object WITH 
> the various parameters...   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (TIKA-2970) Configuring Tesseract for OCR of PDF via Tika Config is not working

2019-10-20 Thread David Eric Pugh (Jira)
David Eric Pugh created TIKA-2970:
-

 Summary: Configuring Tesseract for OCR of PDF via Tika Config is 
not working
 Key: TIKA-2970
 URL: https://issues.apache.org/jira/browse/TIKA-2970
 Project: Tika
  Issue Type: Improvement
  Components: ocr
Affects Versions: 1.22
Reporter: David Eric Pugh


Based on TIKA-2705, I thought I could eliminate the use of the properties files 
for configuring PDF and OCR processing, and just use a tika-config.xml file.

I believe I have a unit test that demonstrates that if you need to override the 
tesseract path for OCR, you end up always with the default Tesseract 
configuration, which leads to Tika throwing an error: 
https://github.com/apache/tika/blob/master/tika-parsers/src/main/java/org/apache/tika/parser/pdf/AbstractPDF2XHTML.java#L328
   

In stepping through the code, it seems like every time we consult the context:

```
TesseractOCRConfig tesseractConfig =
        context.get(TesseractOCRConfig.class, DEFAULT_TESSERACT_CONFIG);
```
We always get back the default.  The context never has our customized 
TesseractOCRConfig!   Despite the fact that when we load up the TikaConfig in 
the first case, I notice that we do create a TesseractOCRParser object WITH the 
various parameters...   
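
To make the symptom concrete, here is a minimal stand-alone sketch of a "get with default" lookup like the one quoted above. It is not a diagnosis of the actual bug. MiniContext and OcrConfig are hypothetical stand-ins for ParseContext and TesseractOCRConfig, reduced to the two calls that matter, so that the example runs without Tika on the classpath.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for ParseContext: set(Class, value) and
// get(Class, defaultValue) only.
final class MiniContext {
    private final Map<String, Object> entries = new HashMap<>();

    <T> void set(Class<T> key, T value) {
        entries.put(key.getName(), value);
    }

    <T> T get(Class<T> key, T defaultValue) {
        Object value = entries.get(key.getName());
        return value != null ? key.cast(value) : defaultValue;
    }
}

// Hypothetical stand-in for TesseractOCRConfig with a single setting.
final class OcrConfig {
    final String tesseractPath;

    OcrConfig(String tesseractPath) {
        this.tesseractPath = tesseractPath;
    }
}

final class ContextLookupDemo {
    static final OcrConfig DEFAULT_CONFIG = new OcrConfig("");

    // Same shape as the lookup quoted in the issue.
    static OcrConfig resolve(MiniContext context) {
        return context.get(OcrConfig.class, DEFAULT_CONFIG);
    }

    public static void main(String[] args) {
        MiniContext context = new MiniContext();
        // Nothing was placed in the context, so the default comes back,
        // no matter how any parser object elsewhere was configured.
        System.out.println(resolve(context) == DEFAULT_CONFIG);

        // Only an explicit set() makes the customized config visible.
        context.set(OcrConfig.class, new OcrConfig("/usr/local/bin"));
        System.out.println(resolve(context).tesseractPath);
    }
}
```

If the real code behaves like this sketch, a parser built with parameters from tika-config.xml would not by itself change what the context lookup returns. Whether that is what happens in Tika is exactly the open question in this issue.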



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (TIKA-2705) Allow configuration of TesseractOCRParser as we do for other parsers

2019-10-20 Thread David Eric Pugh (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-2705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16955515#comment-16955515
 ] 

David Eric Pugh commented on TIKA-2705:
---

I know this is marked as resolved, but I'm definitely not able to make this 
work.   It seems like the defaultConfig is always used, and the parameters 
are ignored...

> Allow configuration of TesseractOCRParser as we do for other parsers
> 
>
> Key: TIKA-2705
> URL: https://issues.apache.org/jira/browse/TIKA-2705
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Assignee: Tim Allison
> Priority: Major
> Fix For: 1.19, 2.0.0
>
>
> It would be handy to be able to configure tesseract via our regular 
> tika-config set up.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (TIKA-2969) Unit test for TesseractOCRParserTest.java has confusing behavior when Tesseract not on path

2019-10-20 Thread David Eric Pugh (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-2969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16955498#comment-16955498
 ] 

David Eric Pugh commented on TIKA-2969:
---

I noticed that when I run `mvn test` the output is:   

```
Tesseract executable isn't on the path, so skipping tests.  If Tesseract is 
installed in a custom location, please update TesseractOCRConfig.properties in 
src/test/resources.
[WARNING] Tests run: 13, Failures: 0, Errors: 0, Skipped: 6, Time elapsed: 
3.447 s - in org.apache.tika.parser.ocr.TesseractOCRParserTest
```

However, due to the use of canRun(), a test like testPDFOCR() appears to have 
completed, as it doesn't use the assumeTrue() concept.   Is the best fix to 
improve the warning message and change over to assumeTrue() everywhere?   I'd 
love some input.   https://github.com/apache/tika/pull/290

> Unit test for TesseractOCRParserTest.java has confusing behavior when 
> Tesseract not on path
> ---
>
> Key: TIKA-2969
> URL: https://issues.apache.org/jira/browse/TIKA-2969
> Project: Tika
> Issue Type: Improvement
> Components: ocr
> Affects Versions: 1.22
> Reporter: David Eric Pugh
> Priority: Minor
>
> Tesseract isn't installed on my path by default, so I have to set the 
> tesseractPath and tessdataPath properties.   In trying to sort things out I 
> ran the TesseractOCRParserTest and was shocked that it worked.   It wasn't 
> until I dug in more that I realized that the unit tests check with the 
> canRun() method, and then either don't run, with no feedback to the user, 
> or hit the assumeTrue() assert, which just stops the unit tests.   
> This issue is to make this test communicate better for the next person!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (TIKA-2969) Unit test for TesseractOCRParserTest.java has confusing behavior when Tesseract not on path

2019-10-20 Thread David Eric Pugh (Jira)
David Eric Pugh created TIKA-2969:
-

 Summary: Unit test for TesseractOCRParserTest.java has confusing 
behavior when Tesseract not on path
 Key: TIKA-2969
 URL: https://issues.apache.org/jira/browse/TIKA-2969
 Project: Tika
  Issue Type: Improvement
  Components: ocr
Affects Versions: 1.22
Reporter: David Eric Pugh


Tesseract isn't installed on my path by default, so I have to set the 
tesseractPath and tessdataPath properties.   In trying to sort things out I ran 
the TesseractOCRParserTest and was shocked that it worked.   It wasn't until I 
dug in more that I realized that the unit tests check with the canRun() method, 
and then either don't run, with no feedback to the user, or hit the 
assumeTrue() assert, which just stops the unit tests.   

This issue is to make this test communicate better for the next person!
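
The difference between the two patterns can be sketched without JUnit. Everything here is hypothetical: the tiny runner, the AssumptionFailed exception and the assume() helper only model how an assumeTrue()-style check lets a runner report a skip, while an early return looks like a pass.

```java
// Stand-alone sketch, no JUnit required. All names are illustrative.
final class SkipVersusPassDemo {
    enum Outcome { PASSED, SKIPPED }

    // Models the signal an assumption check sends to the test runner.
    static final class AssumptionFailed extends RuntimeException {
        AssumptionFailed(String message) {
            super(message);
        }
    }

    // Plays the role of an assumeTrue()-style precondition.
    static void assume(boolean condition, String message) {
        if (!condition) {
            throw new AssumptionFailed(message);
        }
    }

    // Minimal runner: a normal return is a pass, a failed assumption a skip.
    static Outcome run(Runnable test) {
        try {
            test.run();
            return Outcome.PASSED;
        } catch (AssumptionFailed skipped) {
            return Outcome.SKIPPED;
        }
    }

    // Pattern 1: guard with an early return. The runner cannot tell that
    // nothing was actually exercised.
    static void guardedTest(boolean canRun) {
        if (!canRun) {
            return;
        }
        // OCR assertions would go here.
    }

    // Pattern 2: state the precondition. The runner can report a skip.
    static void assumingTest(boolean canRun) {
        assume(canRun, "Tesseract not found on the path");
        // OCR assertions would go here.
    }

    public static void main(String[] args) {
        boolean canRun = false;
        System.out.println("guarded:  " + run(() -> guardedTest(canRun)));
        System.out.println("assuming: " + run(() -> assumingTest(canRun)));
    }
}
```

With Tesseract missing, the guarded variant is counted as passed and the assuming variant as skipped, which is the reporting difference this issue is about.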



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (TIKA-2968) Display specific command for Tesseract if you are running in Verbose mode

2019-10-18 Thread David Eric Pugh (Jira)
David Eric Pugh created TIKA-2968:
-

 Summary: Display specific command for Tesseract if you are running 
in Verbose mode
 Key: TIKA-2968
 URL: https://issues.apache.org/jira/browse/TIKA-2968
 Project: Tika
  Issue Type: Improvement
  Components: cli, ocr
Affects Versions: 1.22
Reporter: David Eric Pugh


I am attempting to write my own tika-config.xml to configure Tesseract, 
leveraging what was done in TIKA-2705, instead of using the property file 
method.   To help me understand what is happening I am running in --verbose 
mode, so seeing the specific parameters sent to Tesseract would be useful.   
This is to display the Tesseract parameters.
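
As a rough illustration of the request, the sketch below builds a Tesseract argument list and prints it when a verbose flag is set. It is not how TesseractOCRParser assembles its command. The class, the method names and the sample file names are assumptions. Only the general CLI shape (input, output base, -l for language, --psm for page segmentation mode) comes from Tesseract itself.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: not Tika's actual command construction.
final class VerboseCommandDemo {
    // Builds: tesseract <input> <outputBase> -l <language> --psm <mode>
    static List<String> buildCommand(String tesseractDir, String input,
                                     String outputBase, String language,
                                     int pageSegMode) {
        List<String> command = new ArrayList<>();
        // An empty directory means "find the executable on the PATH".
        command.add(tesseractDir.isEmpty() ? "tesseract" : tesseractDir + "/tesseract");
        command.add(input);
        command.add(outputBase);
        command.add("-l");
        command.add(language);
        command.add("--psm");
        command.add(Integer.toString(pageSegMode));
        return command;
    }

    // One-line rendering of the command for log output.
    static String describe(List<String> command) {
        return String.join(" ", command);
    }

    public static void main(String[] args) {
        boolean verbose = true;
        List<String> command =
                buildCommand("/usr/local/bin", "page.png", "page-out", "eng", 1);
        if (verbose) {
            // Diagnostics go to stderr so they never mix with parser output.
            System.err.println("Running: " + describe(command));
        }
        // The same list could be handed to new ProcessBuilder(command).
    }
}
```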



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Re: Questions

2019-09-10 Thread Eric Pugh
Hey Keith…

Your question #3 made me curious, as I thought GitHub was a mirror, but 
https://devclass.com/2019/04/30/apache-heads-to-github/ suggests GitHub is 
the authoritative repo.   https://tika.apache.org/contribute.html says the 
same thing…

So yes, I think the title does need updating.   Apache Spark’s GitHub 
description is “Apache Spark”, so ours could be “Apache Tika”.

Not sure I can answer 1.   

As far as 2, I find I typically use the tika-app jar unless I am carefully 
choosing which dependencies I want.

> On Sep 9, 2019, at 8:21 AM, Keith Bennett  wrote:
> 
> Hello, everyone. I am a Tika committer but have not been active for a long 
> time. I've been looking over the code and would appreciate if you could 
> answer some questions:
> 
> 1) There is a Jira issue (at 
> https://issues.apache.org/jira/browse/DRILL-6256?jql=text%20~%20%22readme%20java%207%22)
>  regarding the mention of Java 1.7 in the README 
> (https://github.com/apache/tika/blob/master/README.md). It was marked as 
> fixed, but I still see Java 7 mentioned. Tika should work with the most 
> recent versions of Java, right? Should we not update the readme accordingly? 
> I noticed that there is a "tika-java7" directory in the project consisting 
> solely of a TikaFileTypeDetector class. Can you help me understand what the 
> connection with Java version 7 is? Is it that Tika code should not use 
> features that were absent in Java 7 (such as lambdas)?
> 
> 2) I would like to bring "Rika" (https://github.com/ricn/rika), a Ruby 
> wrapper around Tika, up to date with respect to the dependency jar files 
> packaged with it. I thought I would check out the commit to which the 1.22 
> tag was attached, and do a fresh maven install, and use the files that were 
> installed ("~/.m2/repository/**/*jar"). Then again, Rika unconditionally 
> loads all the jar files; would it be faster to just use the jar file of the 
> Tika distribution (e.g. tika-app-1.22.jar) so that only one instead of n 
> files needs to be loaded? 
> 
> 3) The description for the Github repo at https://github.com/apache/tika says 
> "Tika Mirror". Is it really a mirror, or has it become the authoritative 
> source? (Given that I saw mentions of pull requests, I suspect the latter.) 
> If the latter, I suggest changing that text to something like "Tika 
> Authoritative Repository", as it is currently misleading.
> 
> Thanks,
> Keith
> 
> --
> Keith R. Bennett
> about.me/keithrbennett
> 




[jira] [Commented] (TIKA-2931) Tika CLI shouldn't log with System.out.println

2019-08-29 Thread Eric Pugh (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-2931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918723#comment-16918723
 ] 

Eric Pugh commented on TIKA-2931:
-

Okay, I've made a PR that fixes this problem, with a test.   
https://github.com/apache/tika/pull/281

The commit that I actually want is unfortunately slightly buried:

https://github.com/apache/tika/pull/281/commits/eb1d0f5449280ba40778e6e2b635213b6979d5cc

> Tika CLI shouldn't log with System.out.println
> --
>
> Key: TIKA-2931
> URL: https://issues.apache.org/jira/browse/TIKA-2931
> Project: Tika
> Issue Type: Improvement
> Reporter: Eric Pugh
> Assignee: Tim Allison
> Priority: Minor
>
> Running Tika-app on the command line, I expect the output on STDOUT to be a 
> single JSON response, with logging going to STDERR. That is what happens, 
> except that if you have an embedded image there is what I think is a stray 
> System.out.println:
> https://github.com/apache/tika/blob/72f4f9bd999569797360b16f92b02ea92216ac22/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java#L1054
> This causes my output to be a mix of regular text and JSON!  See below 
> example.
> Extracting 'image0.tif' (image/tiff) to ./image0.tif
> [
>   {
> "Author": "Federal Reserve Board",
> "Content-Length": "345888"
>   }
> ]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (TIKA-2931) Tika CLI shouldn't log with System.out.println

2019-08-28 Thread Eric Pugh (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-2931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16918137#comment-16918137
 ] 

Eric Pugh commented on TIKA-2931:
-

Looks like the TikaCLI test does rely on this behavior...

https://github.com/apache/tika/blob/72f4f9bd999569797360b16f92b02ea92216ac22/tika-app/src/test/java/org/apache/tika/cli/TikaCLITest.java#L406

> Tika CLI shouldn't log with System.out.println
> --
>
> Key: TIKA-2931
> URL: https://issues.apache.org/jira/browse/TIKA-2931
> Project: Tika
> Issue Type: Improvement
> Reporter: Eric Pugh
> Priority: Minor
>
> Running Tika-app on the command line, I expect the output on STDOUT to be a 
> single JSON response, with logging going to STDERR. That is what happens, 
> except that if you have an embedded image there is what I think is a stray 
> System.out.println:
> https://github.com/apache/tika/blob/72f4f9bd999569797360b16f92b02ea92216ac22/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java#L1054
> This causes my output to be a mix of regular text and JSON!  See below 
> example.
> [
>   {
> "Author": "Federal Reserve Board",
> "Content-Length": "345888"
>   }
> ]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (TIKA-2931) Tika CLI shouldn't log with System.out.println

2019-08-28 Thread Eric Pugh (Jira)
Eric Pugh created TIKA-2931:
---

 Summary: Tika CLI shouldn't log with System.out.println
 Key: TIKA-2931
 URL: https://issues.apache.org/jira/browse/TIKA-2931
 Project: Tika
  Issue Type: Improvement
Reporter: Eric Pugh


Running Tika-app on the command line, I expect the output on STDOUT to be a 
single JSON response, with logging going to STDERR. That is what happens, 
except that if you have an embedded image there is what I think is a stray 
System.out.println:

https://github.com/apache/tika/blob/72f4f9bd999569797360b16f92b02ea92216ac22/tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java#L1054

This causes my output to be a mix of regular text and JSON!  See below example.

Extracting 'image0.tif' (image/tiff) to ./image0.tif
[
  {
"Author": "Federal Reserve Board",
"Content-Length": "345888"
  }
]
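
A minimal sketch of the separation being asked for: status messages go to one stream and the JSON payload to another, so a consumer reading STDOUT sees only JSON. The class and method names are made up for illustration, the JSON is a shortened version of the example above, and this is not the code in TikaCLI.

```java
import java.io.PrintStream;

// Illustrative only: shows the stream separation, not TikaCLI itself.
final class StreamSeparationDemo {
    // Status messages go to the diagnostics stream, the payload to data.
    static void extractAndReport(PrintStream data, PrintStream diagnostics) {
        diagnostics.println("Extracting 'image0.tif' (image/tiff) to ./image0.tif");
        data.println("[{\"Author\": \"Federal Reserve Board\"}]");
    }

    public static void main(String[] args) {
        // On the command line: JSON on STDOUT, status text on STDERR.
        extractAndReport(System.out, System.err);
    }
}
```

Passing the streams in as parameters, rather than writing to System.out directly, also makes the behavior easy to check in a unit test.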




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


Re: TesseractOCRParserTest needed extra parameters to run...

2019-08-20 Thread Eric Pugh
I poked around at other parsers for Tika that require additional installation 
steps to see how they warn the user, like the GrobidNERecogniser class...   It 
turns out the way that is handled is by NOT having a unit test at all ;-(

 

> On Aug 20, 2019, at 10:46 AM, Eric Pugh  
> wrote:
> 
> In order to get the TesseractOCRParserTest to run, having installed Tesseract 
> on OSX using “brew install tesseract”, I had to be explicit about the paths.
> 
> Any thoughts on how we could convey to a user that they might need to tweak 
> the path to run the unit tests?  I was thinking about adding some sort of 
> messaging, but I don’t know if that is a pattern that we have in Tika with 
> these external dependencies?
> 
> Thoughts?
> 
> diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java
> index 9ebcee068..32db2c442 100644
> --- a/tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java
> +++ b/tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java
> @@ -51,6 +51,7 @@ public class TesseractOCRParserTest extends TikaTest {
>  
>      public static boolean canRun() {
>          TesseractOCRConfig config = new TesseractOCRConfig();
> +        config.setTesseractPath("/usr/local/bin");
>          TesseractOCRParserTest tesseractOCRTest = new TesseractOCRParserTest();
>          return tesseractOCRTest.canRun(config);
>      }
> @@ -164,6 +165,8 @@ public class TesseractOCRParserTest extends TikaTest {
>              BasicContentHandlerFactory.HANDLER_TYPE handlerType,
>              TesseractOCRConfig.OUTPUT_TYPE outputType) throws Exception {
>          TesseractOCRConfig config = new TesseractOCRConfig();
> +        config.setTesseractPath("/usr/local/bin");
> +        config.setTessdataPath("/usr/local/Cellar/tesseract/4.1.0/share/tessdata");
>          config.setOutputType(outputType);
>  
>          Parser parser = new RecursiveParserWrapper(new AutoDetectParser(),
> 




TesseractOCRParserTest needed extra parameters to run...

2019-08-20 Thread Eric Pugh
In order to get the TesseractOCRParserTest to run, having installed Tesseract 
on OSX using “brew install tesseract”, I had to be explicit about the paths.

Any thoughts on how we could convey to a user that they might need to tweak the 
path to run the unit tests?  I was thinking about adding some sort of 
messaging, but I don’t know if that is a pattern that we have in Tika with 
these external dependencies?

Thoughts?

diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java
index 9ebcee068..32db2c442 100644
--- a/tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java
+++ b/tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java
@@ -51,6 +51,7 @@ public class TesseractOCRParserTest extends TikaTest {
 
     public static boolean canRun() {
         TesseractOCRConfig config = new TesseractOCRConfig();
+        config.setTesseractPath("/usr/local/bin");
         TesseractOCRParserTest tesseractOCRTest = new TesseractOCRParserTest();
         return tesseractOCRTest.canRun(config);
     }
@@ -164,6 +165,8 @@ public class TesseractOCRParserTest extends TikaTest {
             BasicContentHandlerFactory.HANDLER_TYPE handlerType,
             TesseractOCRConfig.OUTPUT_TYPE outputType) throws Exception {
         TesseractOCRConfig config = new TesseractOCRConfig();
+        config.setTesseractPath("/usr/local/bin");
+        config.setTessdataPath("/usr/local/Cellar/tesseract/4.1.0/share/tessdata");
         config.setOutputType(outputType);
 
         Parser parser = new RecursiveParserWrapper(new AutoDetectParser(),
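
One way to avoid hard-coding machine-specific paths like the ones in the diff would be to look them up at run time and tell the user what was found. The sketch below is only an idea, not an existing Tika pattern: the property name tesseract.path and the environment variable TESSERACT_PATH are hypothetical.

```java
// Illustrative only: property and variable names are assumptions.
final class TesseractPathLookup {
    static final String PROPERTY = "tesseract.path";
    static final String ENV_VAR = "TESSERACT_PATH";

    // Precedence: explicit property, then environment value, then an empty
    // string, meaning "rely on whatever is on the PATH".
    static String resolve(String propertyValue, String envValue) {
        if (propertyValue != null && !propertyValue.isEmpty()) {
            return propertyValue;
        }
        if (envValue != null && !envValue.isEmpty()) {
            return envValue;
        }
        return "";
    }

    public static void main(String[] args) {
        String path = resolve(System.getProperty(PROPERTY), System.getenv(ENV_VAR));
        // Say which location is in use, so a skipped test is not a mystery.
        System.err.println(path.isEmpty()
                ? "No Tesseract path configured; relying on the PATH"
                : "Using Tesseract from " + path);
    }
}
```

The resolved value could then be passed to config.setTesseractPath(...) in the test, instead of the literal "/usr/local/bin".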



Re: Tika Tikka Masala Project

2019-03-11 Thread Eric Pugh
Thank you for sharing!   I learned some techniques as well.  Are you going to 
publish the steps you did, maybe in Github, so others can follow along?

Eric


> On Mar 10, 2019, at 3:41 PM, megan hazlett  wrote:
> 
> Hi Tika,
> I recently did an analysis of Tikka Masala recipes using Python's tika
> package. I've attached my Google Slides presentation that I shared with
> Chris Mattmann at NASA JPL.
>
> Enjoy your day!
> 
> https://docs.google.com/presentation/d/1bmAInwzNxMWUQVL-YrYpgFUI6XXRifDmXdoOCgYCbTI/edit?usp=sharing
> 
> Best,
> Megan Hazlett




Re: experiences with Tika in Docker

2017-06-01 Thread Eric Pugh
As the Tika project starts embracing more non-Java tools (I’m thinking of 
Tesseract for example), dockerizing your Tika setup becomes more and more 
valuable.   

For example, I run the tests for my application on my local Mac, as well as on 
CircleCI.   I have a dockerized Tika service that does the OCR stuff, and I 
know it works the same on both.   It’s less exciting if I’m in an “all Java” 
world.

 
> On Jun 1, 2017, at 7:55 AM, Allison, Timothy B. <talli...@mitre.org> wrote:
> 
> Thank you, Thejan!
> 
> -Original Message-
> From: Thejan Wijesinghe [mailto:thejan.k.wijesin...@gmail.com] 
> Sent: Wednesday, May 31, 2017 5:40 PM
> To: dev@tika.apache.org
> Subject: Re: experiences with Tika in Docker
> 
> Hi Tim,
> 
> I've used tika-server in Docker, but as a single instance only. Yes, its 
> ability to limit a container's resources with respect to memory & CPU on the 
> host machine is great; it gives us so much flexibility. We could enforce 
> hard/soft memory limits, and we could even control how many of the host 
> machine's CPU cycles it gets. Yes, it also limits the risks from arbitrary 
> code execution & XXE vulnerabilities. I already asked Prof. Chris Mattmann 
> about officially moving to Docker Hub. He said I need to send an email to 
> Apache infra asking about this. Unfortunately, I still haven't found the 
> time to write that email.
> 
> We already have multiple Dockerfiles in Tika: the Dockerfile in tika-server, 
> InceptionRestDockerfile, InceptionVideoRestDockerfile, and 
> Im2txtRestDockerfile (PR #180, for image captioning).
> 
> Part of my GSoC project is to unify the existing REST services such as object 
> recognition and image captioning. My idea is to unify all of those REST services 
> so that the user can start/terminate any REST service, and see its statistics, 
> through a web-based GUI. I'm expecting to use a combination of nginx (as the 
> reverse proxy server) & Docker to make it work. So obviously we will see 
> Docker much more often in Tika.
> 
> +1 for your idea of looking into hardening tika-server with the help 
> of Docker.
> 
> best,
> ThejanW
> 
> On Thu, Jun 1, 2017 at 1:03 AM, Allison, Timothy B. <talli...@mitre.org>
> wrote:
> 
>> Dave Meikle, Tom and All,
>> 
>>How many of us are using Tika in Docker?  If so, how exactly are 
>> you using it?  Single instance, swarm, Kubernetes, something else?  
>> People fear I/O hit with tika-server...what are your experiences?
>> I really like the ability to limit the number of CPUs in the Docker 
>> container.  If a single doc causes multithreaded gc to go nuts, that 
>> won't kill an entire machine.  This also cleanly limits the risk from 
>> XXE or arbitrary code execution, right?
>> 
>> If this is one of the ways of the future for big data, we might want 
>> to look into hardening tika-server (OOMs, timeouts).  What do you all think?
>> 
>>Cheers,
>> 
>>Tim
>> 
>> Timothy B. Allison, Ph.D.
>> Principal Artificial Intelligence Engineer Group Lead K83E/Human 
>> Language Technology The MITRE Corporation
>> 7515 Colshire Drive, McLean, VA  22102
>> 703-983-2473 (phone); 703-983-1379 (fax)
>> 
>> 





Re: Tika talk next week - help needed!

2017-05-16 Thread Eric Pugh
Nick,

It was great to read through 
http://events.linuxfoundation.org/sites/events/files/slides/WhatsNewWithApacheTika_1.pdf…
Wow there is a lot in Tika.

And I think that might be the one challenge with the talk structure, there is 
SOO much information.

I think I’d like to see “How is Tika actually architected” to support so many 
amazing use cases.   If this talk is meant for folks who don’t already know a 
lot about the project, then they might get overwhelmed by the long lists, 
such as all the file types it can handle.   Maybe change some of them to “here 
is an eye chart of logos, don’t actually read it” and consolidate some pages.




Eric

> On May 16, 2017, at 10:38 AM, Thamme Gowda <thammego...@apache.org> wrote:
> 
> Nick,
> Here are some pointers:
> 1. Image recognition using Tensorflow:
> https://wiki.apache.org/tika/TikaAndVision; Link to Paper:
> https://memex.jpl.nasa.gov/MFSEC17.pdf
> 2. Image Recognition using Deeplearning4j -
> https://wiki.apache.org/tika/TikaAndVisionDL4J
> 3. Sentiment Analysis using OpenNLP: https://github.com/apache/tika/pull/169
> 4. Video labeling using tensorflow image rec:
> https://wiki.apache.org/tika/TikaAndVisionVideo
> 5.  Named Entity Extraction using OpenNLP and CoreNLP:
> https://wiki.apache.org/tika/TikaAndNER
> 
> *Coming soon (Work in progress):*
> 6. Image Captioning (Image-to-Text) https://github.com/apache/tika/pull/180
> 
> Cheers,
> -Thamme
> 
> *--*
> *Thamme Gowda*
> TG | @thammegowda <https://twitter.com/thammegowda>
> ~Sent via somebody's Webmail server!
> 
> On Tue, May 16, 2017 at 6:59 AM, Chris Mattmann <mattm...@apache.org> wrote:
> 
>> Yep, literally take a look at the Tika wiki – there are examples aplenty and even
>> screen shots. Further, if you look at the MEMEX site under our new publications
>> section, there are a few examples (like the ICMR paper on forensics) that show it
>> in action.
>> 
>> http://memex.jpl.nasa.gov/#publications
>> 
>> 
>> 
>> On 5/16/17, 6:21 AM, "Konstantin Gribov" <gros...@gmail.com> wrote:
>> 
>>IIRC, image and video labeling basic support was added (Chris & Thamme
>>could you elaborate on that, please), TSD (TIKA-2309, time stamped data
>>envelope format) support, slf4j migration (ongoing on 2.x branch).
>> 
>>On Tue, 16 May 2017 at 16:06, Allison, Timothy B. <talli...@mitre.org> wrote:
>> 
>>> Doh!  Sorry for the delay...might add configuration of EncodingDetectors,
>>> but that's probably too far into the weeds?
>>> 
>>> -Original Message-
>>> From: Nick Burch [mailto:n...@apache.org]
>>> Sent: Sunday, May 14, 2017 11:34 AM
>>> To: dev@tika.apache.org
>>> Subject: Tika talk next week - help needed!
>>> 
>>> Hi All
>>> 
>>> Last year in Seville, I gave a talk on Tika entitled "Apache Tika - What’s
>>> new with 2.0?". For ApacheCon Miami next week, I've been roped into giving
>>> an updated version...
>>> 
>>> https://apachecon2017.sched.com/event/9zvD/apache-tika-whats-new-with-20-nick-burch-apache-software-foundation
>>> 
>>> My slides from Seville are available at:
>>> 
>>> http://events.linuxfoundation.org/sites/events/files/slides/WhatsNewWithApacheTika_1.pdf
>>> 
>>> Beyond updating the list of releases and parsers, and the slide
>>> background, what should I change?
>>> 
>>> Maybe some more on Tika eval? More details on some of the NLP / Entity
>>> Recognition / Image Recognition stuff? Some screenshots of that stuff?
>>> More on translation? Something else?
>>> 
>>> Ideas greatly appreciated! Good screenshots even more so :)
>>> 
>>> Cheers
>>> Nick
>>> 
>>--
>> 
>>Best regards,
>>Konstantin Gribov
>> 
>> 
>> 
>> 





[jira] [Created] (TIKA-2106) "hocr" case on Linux fails, but works on OSX. Related to TIKA-2093

2016-09-30 Thread Eric Pugh (JIRA)
Eric Pugh created TIKA-2106:
---

          Summary: "hocr" case on Linux fails, but works on OSX.  Related to TIKA-2093
              Key: TIKA-2106
              URL: https://issues.apache.org/jira/browse/TIKA-2106
          Project: Tika
       Issue Type: Bug
       Components: ocr
      Environment: Bug in Linux, but fine in OSX.
         Reporter: Eric Pugh


We pass an output type, either TXT or HOCR, to the Tesseract command line.   When 
we call the command line we lowercase it to "txt" or "hocr".  However, when we 
read the output back in, we don't lowercase it.  On OSX the constructed file 
path "output.HOCR" is actually found, but on Linux it isn't.  This patch 
lowercases HOCR to hocr and TXT to txt in the constructed file path.

I didn't write a unit test as I don't have a good Linux env to test it in, but 
I was able to put a patched version of the Tika Parser Jar into my Docker build 
to test that it works.
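
A minimal sketch of the idea behind the fix, not the actual patch: the class, 
enum, and method names below are hypothetical stand-ins rather than Tika's own 
code.  It derives the extension of the file to read back from the same 
lower-cased output type that is handed to Tesseract.  The original code 
presumably went unnoticed on OSX because the default macOS filesystem is 
case-insensitive, so "output.HOCR" and "output.hocr" resolve to the same file 
there, while on Linux they do not.

```java
import java.io.File;
import java.util.Locale;

public class OcrOutputPath {
    // Hypothetical stand-in for the parser's output type setting.
    enum OutputType { TXT, HOCR }

    // Tesseract is called with the lower-cased type ("txt" or "hocr"), so the
    // file we read back must be looked up with the same lower-cased extension.
    static File outputFile(File outputBase, OutputType type) {
        String ext = type.name().toLowerCase(Locale.ROOT);
        return new File(outputBase.getPath() + "." + ext);
    }

    public static void main(String[] args) {
        File base = new File("output");
        // Prints the name of the file that would be read back for HOCR.
        System.out.println(outputFile(base, OutputType.HOCR).getName());
    }
}
```

Using Locale.ROOT keeps the lower-casing independent of the JVM's default 
locale, which matters for names containing the letter "I" under some locales.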



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (TIKA-2093) Add hOCR output type to the TesseractOCRParser

2016-09-29 Thread Eric Pugh (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15534613#comment-15534613
 ] 

Eric Pugh edited comment on TIKA-2093 at 9/30/16 12:52 AM:
---

BTW, just got to updating my project with the latest 1.14-SNAPSHOT, and the 
hOCR process is working *great*.   Thanks for getting this patch in.

Not sure who marks things "Resolved", but from my perspective, it's Resolved.


was (Author: epugh):
BTW, just got to updating my project with the latest 1.14-SNAPSHOT, and the 
hOCR process is working *great*.   Thanks for getting this patch in.

> Add hOCR output type to the TesseractOCRParser
> --
>
>                 Key: TIKA-2093
>                 URL: https://issues.apache.org/jira/browse/TIKA-2093
>             Project: Tika
>          Issue Type: Improvement
>          Components: ocr
>    Affects Versions: 1.13
>            Reporter: Eric Pugh
>            Assignee: Tim Allison
>              Labels: easyfix, features, newbie
>             Fix For: 1.14
>
>
> I've tweaked the TesseractOCRParser and TesseractOCRConfig to add the "txt" 
> or "hocr" parameters that allow you to get specific outputs.  There are also 
> "pdf" and, in the next version of Tesseract, "tsv" outputs, but I didn't add 
> support for those.





[jira] [Commented] (TIKA-2093) Add hOCR output type to the TesseractOCRParser

2016-09-29 Thread Eric Pugh (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15534613#comment-15534613
 ] 

Eric Pugh commented on TIKA-2093:
-

BTW, just got to updating my project with the latest 1.14-SNAPSHOT, and the 
hOCR process is working *great*.   Thanks for getting this patch in.

> Add hOCR output type to the TesseractOCRParser
> --
>
>                 Key: TIKA-2093
>                 URL: https://issues.apache.org/jira/browse/TIKA-2093
>             Project: Tika
>          Issue Type: Improvement
>          Components: ocr
>    Affects Versions: 1.13
>            Reporter: Eric Pugh
>            Assignee: Tim Allison
>              Labels: easyfix, features, newbie
>             Fix For: 1.14
>
>
> I've tweaked the TesseractOCRParser and TesseractOCRConfig to add the "txt" 
> or "hocr" parameters that allow you to get specific outputs.  There are also 
> "pdf" and, in the next version of Tesseract, "tsv" outputs, but I didn't add 
> support for those.





[jira] [Commented] (TIKA-2093) Add hOCR output type to the TesseractOCRParser

2016-09-23 Thread Eric Pugh (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15516177#comment-15516177
 ] 

Eric Pugh commented on TIKA-2093:
-

Thanks for this, and for the addition of the HOCRPassthroughHandler.  I'll give it a 
test today; however, I suspect this is exactly what I need.

> Add hOCR output type to the TesseractOCRParser
> --
>
>                 Key: TIKA-2093
>                 URL: https://issues.apache.org/jira/browse/TIKA-2093
>             Project: Tika
>          Issue Type: Improvement
>          Components: ocr
>    Affects Versions: 1.13
>            Reporter: Eric Pugh
>            Assignee: Tim Allison
>              Labels: easyfix, features, newbie
>             Fix For: 1.14
>
>
> I've tweaked the TesseractOCRParser and TesseractOCRConfig to add the "txt" 
> or "hocr" parameters that allow you to get specific outputs.  There are also 
> "pdf" and, in the next version of Tesseract, "tsv" outputs, but I didn't add 
> support for those.





[jira] [Created] (TIKA-2093) Add hOCR output type to the TesseractOCRParser

2016-09-22 Thread Eric Pugh (JIRA)
Eric Pugh created TIKA-2093:
---

          Summary: Add hOCR output type to the TesseractOCRParser
              Key: TIKA-2093
              URL: https://issues.apache.org/jira/browse/TIKA-2093
          Project: Tika
       Issue Type: Improvement
       Components: ocr
 Affects Versions: 1.13
         Reporter: Eric Pugh
          Fix For: 1.14


I've tweaked the TesseractOCRParser and TesseractOCRConfig to add the "txt" or 
"hocr" parameters that allow you to get specific outputs.  There are also 
"pdf" and, in the next version of Tesseract, "tsv" outputs, but I didn't add 
support for those.
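
To illustrate what the new parameter amounts to, here is a rough sketch of 
passing the output type through to the Tesseract command line.  It is not the 
code in the patch: the class, enum, and method names are made up for 
illustration, and it assumes the lower-cased type is simply appended after the 
input and output-base arguments.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class TesseractCommandSketch {
    // Hypothetical mirror of the two output types named in this issue.
    enum OutputType { TXT, HOCR }

    // Builds: <tesseract> <input> <outputBase> <txt|hocr>
    static List<String> buildCommand(File tesseractDir, File input,
                                     File outputBase, OutputType type) {
        List<String> cmd = new ArrayList<>();
        cmd.add(new File(tesseractDir, "tesseract").getPath());
        cmd.add(input.getPath());
        cmd.add(outputBase.getPath());
        // The output type is passed lower-cased: "txt" or "hocr".
        cmd.add(type.name().toLowerCase(Locale.ROOT));
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = buildCommand(new File("/usr/local/bin"),
                new File("scan.png"), new File("out"), OutputType.HOCR);
        // A real caller would hand this list to ProcessBuilder; here we only print it.
        System.out.println(String.join(" ", cmd));
    }
}
```

Supporting "pdf" or "tsv" later would, under the same assumption, mean adding 
values to the enum and handling the resulting file on the way back in.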


