Following this thread, should we deprecate/remove the Tika Docker support that is in the Tika-server project?
The `mvn dockerfile:build` command now relies on a plugin that is no longer supported, according to https://github.com/spotify/dockerfile-maven, and it seems like the Tika-docker project is really the right place for this! I’m thinking that this might help reduce the footprint of things we need to support.

> On Jan 9, 2020, at 12:08 AM, Chris Mattmann <mattm...@apache.org> wrote:
>
> +1
>
> Note there is also a USC tika dockers repo where I put the data science stuff too:
>
> http://github.com/USCDataScience/tika-dockers
>
> I’ll continue to push DL and ML Tika stuff there.
>
> Cheers,
> Chris
>
> From: Dave Meikle <dmei...@apache.org>
> Reply-To: "dev@tika.apache.org" <dev@tika.apache.org>
> Date: Wednesday, January 8, 2020 at 2:18 PM
> To: "<dev@tika.apache.org>" <dev@tika.apache.org>
> Subject: Re: [EXTERNAL] Do we have a community supported approach for deploying Tika Server in production?
>
> > Hi Eric,
> >
> > Will take a look. On a related note, I've created a new repo:
> > https://github.com/apache/tika-docker
> >
> > Thinking based on looking at the PRs and Issues on LogicalSpark docker-tikaserver, I'll create an updated docker file using what you've added here and look to publish builds to docker hub from that.
> >
> > What do you think?
> >
> > Cheers,
> > Dave
> >
> > On Wed, 8 Jan 2020 at 03:16, Eric Pugh <ep...@opensourceconnections.com> wrote:
> >
> > Hi all, I’ve gone ahead and added the -spawnChild property as a default when running Tika Server as a service. I’d love some eyes on the PR, and if this looks good, get it committed.
> >
> > Feedback welcome!
> >
> > Eric
> >
> >> On Dec 17, 2019, at 12:53 PM, Eric Pugh <ep...@opensourceconnections.com> wrote:
> >>
> >> Cool.
> >>
> >> It’s the auto run that I really need, and the other part that I don’t think I’ve tackled properly is the managing of logs…
> >>
> >> I’m going to check with my project to see if they support Snap packages.
> >>
> >> Eric
> >>
> >>> On Dec 16, 2019, at 5:10 PM, Tom Barber <t...@spicule.co.uk> wrote:
> >>>
> >>> Just saw this fly by and FYI on Linux systems that support Snap packages (Ubuntu/Debian/Arch/Fedora etc) you can `snap install tika-server`. It doesn’t yet auto-run, I don’t believe, but you can just run `tika-server.run`, and adding an init script wouldn’t take 5 minutes.
> >>>
> >>> Tom
> >>>
> >>> On 16 December 2019 at 18:42:55, Eric Pugh (ep...@opensourceconnections.com) wrote:
> >>>
> >>>> Hi folks!
> >>>>
> >>>> I’ve got a mostly completed PR for having install scripts for Tika Server, and I’m hoping a committer will take a look at the PR, and give feedback (and ideally commit in time for 1.24!)
> >>>>
> >>>> A couple of things:
> >>>>
> >>>> 1) This was completely influenced by https://lucene.apache.org/solr/guide/8_3/taking-solr-to-production.html#service-installation-script; in fact, I started with the Solr scripts.
> >>>>
> >>>> 2) I’ve deleted all the Solr specific aspects (I think), however there may still be more to delete.
> >>>>
> >>>> 3) This requires a change to how we release Tika. Previously we shipped tika-app.jar, Tika-eval.jar, and Tika-server.jar, and now, I think, we want to add the tika-server-bin.tgz and tika-server-bin.zip binary distributions.
> >>>>
> >>>> I’m happy to start writing accompanying “how to deploy Tika Server” docs if this PR looks good! Or, please give input and I’ll make the updates.
> >>>>
> >>>> Eric
> >>>>
> >>>>> On Dec 12, 2019, at 2:39 PM, Eric Pugh <ep...@opensourceconnections.com> wrote:
> >>>>>
> >>>>> I’ve created this JIRA to track this work: https://issues.apache.org/jira/browse/TIKA-3010
> >>>>>
> >>>>> And a WIP PR is at https://github.com/apache/tika/pull/305
> >>>>>
> >>>>> My thought is to put something together that mimics how we deploy Solr, and see how that works. I have a need for an install process that a general IT person can follow, who isn’t a Tika expert or a Docker user.
> >>>>>
> >>>>>> On Dec 4, 2019, at 12:28 PM, Chris Mattmann <mattm...@apache.org> wrote:
> >>>>>>
> >>>>>> Thanks for bringing this conversation up Eric.
> >>>>>>
> >>>>>> Historically if you look over the last 5 years, I think what you are asking below has sort of already become the de facto truth. Most people are in fact using Tika server, whether they are individual devs, govvies, commercial folk and the like.
> >>>>>>
> >>>>>> Big, small and medium projects. Evidenced by the expansion of Tika APIs into pretty much every PL I know of and actively use today.
> >>>>>>
> >>>>>> Given that, we probably should update the main website docs to make this more prominent. The tika server docs on the wiki are pretty darn good. But they don’t get prime real estate. Would be wonderful if someone wants to update the website to make it more prominent.
> >>>>>>
> >>>>>> The downstream Tika Python lib that I maintain has tons of activity, is used by more than 350 projects, and relies solely on Tika-Server. My recommendation to the Solr folks (having created SOLR-7633) from the 2014 DARPA MEMEX days was to move towards Tika Server based SolrCell dep and that’s the right way to go IMO.
> >>>>>>
> >>>>>> Chris
> >>>>>>
> >>>>>> From: Eric Pugh <ep...@opensourceconnections.com>
> >>>>>> Reply-To: "dev@tika.apache.org" <dev@tika.apache.org>
> >>>>>> Date: Wednesday, December 4, 2019 at 12:24 PM
> >>>>>> To: "tika-...@apache.org" <tika-...@apache.org>
> >>>>>> Subject: [EXTERNAL] Do we have a community supported approach for deploying Tika Server in production?
> >>>>>>
> >>>>>> Hi all - Hoping this is a reasonable Tika-dev versus Tika-user question!
> >>>>>>
> >>>>>> Over in Solr land there has been renewed discussion about streamlining what Solr is....
> >>>>>>
> >>>>>> In regards to rich content extraction and the Tika project, it seems like the two ideas that continue to preserve the existing behavior are:
> >>>>>>
> >>>>>> 1) To convert the ExtractingRequestHandler into a Package (Plugin) for Solr. This slims down the standard Solr download, and *might* make it easier to update the version of Tika + dependent jars used?
> >>>>>>
> >>>>>> 2) The second approach is to instead require Tika-Server to be running (https://issues.apache.org/jira/browse/SOLR-7633) and just have Solr delegate the call to Tika-Server.
> >>>>>>
> >>>>>> I was thinking about why I like option 1 better than 2, and I think it boils down to how mature the IT organization I am working with is. Some IT organizations have large dev-ops teams, and are working at major scale, and managing a fleet of Tika-Server on Kubernetes with Load Balancer dynamically scaling up and down is simple and second nature! However, many organizations aren’t like that.
> >>>>>>
> >>>>>> So I guess what I’m asking is do we have a reasonable supported approach for deploying Tika Server for non-tika savvy organizations? I’m thinking about Solr, and specifically the fact that Solr has a well defined set of Service Installation scripts. When I follow the directions in https://lucene.apache.org/solr/guide/8_3/taking-solr-to-production.html#taking-solr-to-production I can feel confident that when the server is rebooted, then Solr will come back up! Plus there is log rotation and all the rest.
> >>>>>>
> >>>>>> In contrast, when I look at the Tika website, specifically the https://tika.apache.org/1.22/gettingstarted.htm page, the message is to run Tika as a command line application, or embedded in your application.
> >>>>>>
> >>>>>> I’m wondering if Tika-Server needs to be made more prominent, and treated as the “primary method of interacting with Tika”? Do we need as a community to focus more on Tika-Server? In our getting started documentation, in our usage documentation, and in our examples?
> >>>>>>
> >>>>>> Do we need to create the equivalent of the Service Installation scripts for Tika-Server?
> >>>>>>
> >>>>>> Wanted to stoke the discussion!
> >>>>>>
> >>>>>> Eric
> >>>>>>
> >>>>>> _______________________
> >>>>>> Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com | My Free/Busy <http://tinyurl.com/eric-cal>
> >>>>>> Co-Author: Apache Solr Enterprise Search Server, 3rd Ed <https://www.packtpub.com/big-data-and-business-intelligence/apache-solr-enterprise-search-server-third-edition-raw>
> >>>>>> This e-mail and all contents, including attachments, is considered to be Company Confidential unless explicitly stated otherwise, regardless of whether attachments are marked as such.
> >>>
> >>> Spicule Limited is registered in England & Wales. Company Number: 09954122. Registered office: First Floor, Telecom House, 125-135 Preston Road, Brighton, England, BN1 6AF. VAT No. 251478891.
> >>>
> >>> All engagements are subject to Spicule Terms and Conditions of Business. This email and its contents are intended solely for the individual to whom it is addressed and may contain information that is confidential, privileged or otherwise protected from disclosure, distributing or copying. Any views or opinions presented in this email are solely those of the author and do not necessarily represent those of Spicule Limited. The company accepts no liability for any damage caused by any virus transmitted by this email. If you have received this message in error, please notify us immediately by reply email before deleting it from your system. Service of legal notice cannot be effected on Spicule Limited by email.

_______________________
Eric Pugh | Founder & CEO | OpenSource Connections, LLC | 434.466.1467 | http://www.opensourceconnections.com | My Free/Busy <http://tinyurl.com/eric-cal>