[jira] [Commented] (TIKA-1334) Add presentation layer for results of each run

2017-05-03 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15994941#comment-15994941
 ] 

Tyler Palsulich commented on TIKA-1334:
---

The format should probably be in the form:

{noformat}
[
  {
"mime-type": "something",
"count": 1234,
"version": "a"
  },
  {
"mime-type": "something",
"count": 4321,
"version": "b"
  },
  ...
]
{noformat}
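
A minimal sketch of how a report in the proposed shape could be written out, using only the JDK. The class name RunResultsJson, the Entry type, and the sample values are hypothetical and not part of Tika; string escaping is deliberately minimal.

```java
import java.util.ArrayList;
import java.util.List;

public class RunResultsJson {

    /** One row of the proposed format: mime type, count, and the run ("version") it belongs to. */
    public static final class Entry {
        final String mimeType;
        final long count;
        final String version;

        public Entry(String mimeType, long count, String version) {
            this.mimeType = mimeType;
            this.count = count;
            this.version = version;
        }
    }

    /** Wraps a value in double quotes, escaping only backslashes and double quotes. */
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    /** Serializes the entries as a JSON array of objects with the three proposed keys. */
    public static String toJson(List<Entry> entries) {
        StringBuilder sb = new StringBuilder("[\n");
        for (int i = 0; i < entries.size(); i++) {
            Entry e = entries.get(i);
            sb.append("  {\n")
              .append("    \"mime-type\": ").append(quote(e.mimeType)).append(",\n")
              .append("    \"count\": ").append(e.count).append(",\n")
              .append("    \"version\": ").append(quote(e.version)).append("\n")
              .append("  }");
            if (i < entries.size() - 1) {
                sb.append(",");
            }
            sb.append("\n");
        }
        return sb.append("]").toString();
    }

    public static void main(String[] args) {
        List<Entry> entries = new ArrayList<>();
        entries.add(new Entry("application/pdf", 1234, "a"));
        entries.add(new Entry("application/pdf", 4321, "b"));
        System.out.println(toJson(entries));
    }
}
```

Keeping one object per (mime-type, version) pair, as in the proposal, lets a front end group or pivot the rows however it likes.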

> Add presentation layer for results of each run
> --
>
> Key: TIKA-1334
> URL: https://issues.apache.org/jira/browse/TIKA-1334
> Project: Tika
>  Issue Type: Sub-task
>  Components: cli, general, server
>Reporter: Tim Allison
> Attachments: static_stats.zip
>
>
> If I'm doing this, it'll probably be vintage mid-90s html.  If someone with 
> some .js kung-fu wants to take this, please do.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


Re: Squashing GitHub pull requests while merging

2016-05-07 Thread Tyler Palsulich
A contributor should be able to squash the commits in the pull request
before we merge into Tika. So, we don't need to mess up Tika's history.
Right?

Tyler
On May 6, 2016 8:41 PM, "Mattmann, Chris A (3980)" <
chris.a.mattm...@jpl.nasa.gov> wrote:

> Squashing messes up history and atm requires infra intervention, so I would
> suggest we stay away from it for now
>
> Sent from my iPhone
>
> > On May 6, 2016, at 2:20 PM, Ken Krugler 
> wrote:
> >
> > I was perusing https://wiki.apache.org/tika/UsingGit, and noticed that it
> > doesn’t talk about squashing a pull request’s commits while merging.
> >
> > This is described at https://mahout.apache.org/developers/github.html
> >
> > Isn't this something we’d want to do as well?
> >
> > Thanks,
> >
> > — Ken
> >
> > --
> > Ken Krugler
> > +1 530-210-6378
> > http://www.scaleunlimited.com
> > custom big data solutions & training
> > Hadoop, Cascading, Cassandra & Solr
> >
> >
> >
>


Re: JIRA issue?

2016-04-21 Thread Tyler Palsulich
Hi Ben,

Sorry for the inconvenience. The infrastructure team had to disable the
create and comment features of JIRA for many projects to mitigate spam.
Hopefully everything will be back up and running again soon.

Thanks for emailing.

Tyler
Hi,

I'd like to create an issue on the JIRA. When I visit
https://issues.apache.org/jira/browse/TIKA/ and hit Create I don't see Tika
as an option. I can only create issues for Zookeeper and other projects

Thanks,
Ben

--
about.me/benmccann


Re: [VOTE] Apache Tika 1.12 Release Candidate #1

2016-02-13 Thread Tyler Palsulich
A bit late to the party, but +1 from me.

Tyler

On Thu, Feb 4, 2016 at 1:44 PM, Lewis John Mcgibbney <
lewis.mcgibb...@gmail.com> wrote:

> Hi Chris,
> +1 to release this release candidate
> Thanks
> Lewis
>
> On Tue, Feb 2, 2016 at 4:24 PM, Lewis John Mcgibbney <
> lewis.mcgibb...@gmail.com> wrote:
>
> > Hi Chris,
> >
> > Signatures all good. Verified using the scripts apachestuff.
> > mvn install and all tests pass fine on MacOSX 10.9.5
> > Ran DRAT from master branch with following output
> >
> > Notes  Binaries  Archives  Standards  Apache  Generated  Unknown
> > 0      2         0         868        836     0          32
> > Issue filed in Jira to address and resolve the unknown's
> >
> > https://issues.apache.org/jira/browse/TIKA-1848
> >
> > On Thu, Jan 28, 2016 at 12:01 AM, 
> wrote:
> >
> >>
> >> A first candidate for the Tika 1.12 release is available at:
> >>
> >>   https://dist.apache.org/repos/dist/dev/tika/
> >>
> >> The release candidate is a zip archive of the sources in:
> >>
> >>
> https://git-wip-us.apache.org/repos/asf?p=tika.git;a=tag;h=203a26ba5e65db24
> >> 27f9e84bc4ff31e569ae661c
> >> <
> https://git-wip-us.apache.org/repos/asf?p=tika.git;a=tag;h=203a26ba5e65db2427f9e84bc4ff31e569ae661c
> >
> >>
> >>
> >> The SHA1 checksum of the archive is:
> >> 30e64645af643959841ac3bb3c41f7e64eba7e5f
> >>
> >> In addition, a staged maven repository is available here:
> >>
> >> https://repository.apache.org/content/repositories/orgapachetika-1015/
> >>
> >>
> >> Please vote on releasing this package as Apache Tika 1.12.
> >> The vote is open for the next 72 hours and passes if a majority of at
> >> least three +1 Tika PMC votes are cast.
> >>
> >> [ ] +1 Release this package as Apache Tika 1.12
> >> [ ] -1 Do not release this package because…
> >>
> >> Cheers,
> >> Chris
> >>
> >> P.S. Of course here is my +1.
> >>
> >>
>
>
> --
> *Lewis*
>


Re: [VOTE] Moving SCM to Git

2016-01-02 Thread Tyler Palsulich
Hi,

Just reiterating my +1 for the move. A huge benefit in my eyes is a reduced
barrier to entry for new developers and contributors.

Tyler
On Jan 2, 2016 4:34 PM, "Mattmann, Chris A (3980)" <
chris.a.mattm...@jpl.nasa.gov> wrote:

> One final note - this isn't a vote to make GitHub the canonical repo. In
> the future if Whimsy goes well I'd like to explore that but here I am
> simply proposing to use the ASF writeable Git repos (which happen to be
> mirrored to GH).
>
> Cheers,
> Chris
>
> Sent from my iPhone
>
> > On Jan 2, 2016, at 4:31 PM, Mattmann, Chris A (3980) <
> chris.a.mattm...@jpl.nasa.gov> wrote:
> >
> > Hey Ken,
> >
> > Projects have been using writeable git repos at the ASF since 2009-2010.
> The recent conversation at the foundation level was - should we allow
> GitHub as a canonical external repo and more broadly - is this possible in
> general? The Whimsy project is currently undergoing that experiment and
> it's going well but nothing official to report yet.
> >
> > Beyond that - projects can release from and use writeable Git repos.
> Some projects were getting around history by squashing commits ahead of the
> repo and getting around infra's checks on master (aka trunk) by using
> different main branch names but we're not in that boat.
> >
> > Cheers,
> > Chris
> >
> >
> > Sent from my iPhone
> >
> >> On Jan 2, 2016, at 3:47 PM, Ken Krugler <kkrugler_li...@transpac.com>
> wrote:
> >>
> >> Hi Chris,
> >>
> >> I'd be +1, but I don't have the essence of the "Re: git (Was:
> ASF/GitHub Findings of Fact / Statements of Principles)" thread on the
> Apache members list clearly in my mind.
> >>
> >> Specifically, while that thread was spinning merrily away, there were
> concerns about immutability when using git.
> >>
> >> E.g. one comment was...
> >>
> >>> releases must correspond to an immutable tag in a repository on ASF
> hardware.
> >>>
> >>> "Canonical" is needed for releases, and for IP provenance, so I'd
> augment the above with a second requirement: for each release tag, we must
> be able to establish the provenance of all files referenced by that tag.
> >>>
> >>> I believe that is the essence of the Foundation's requirements for
> version control. Both can be satisfied via svn or git. Git may require
> external sources to satisfy one or both of those requirements. svn
> inherently has the first nailed, and is much easier for provenance (there
> may be edge cases I'm missing offhand, but we know the ICLA/grant
> associated with each change leading up to the tagged release).
> >>
> >> Did it wind up as "projects can experiment with using git for official
> releases"?
> >>
> >> Thanks,
> >>
> >> -- Ken
> >>
> >>> From: Mattmann, Chris A (3980)
> >>> Sent: January 1, 2016 8:30:16pm PST
> >>> To: dev@tika.apache.org
> >>> Subject: [VOTE] Moving SCM to Git
> >>>
> >>> Hi Everyone,
> >>>
> >>> DISCUSS thread here: http://s.apache.org/wVE
> >>>
> >>> Time to officially VOTE on moving Tika to Git. I’ve made a wiki
> >>> page for our SCM explaining how to use Git at Apache, and how to
> >>> use it with Github, and how to use it even in a traditional SVN
> >>> sense. The page is here:
> >>>
> >>> https://wiki.apache.org/tika/UsingGit
> >>>
> >>>
> >>> I’ve also linked it from the main wiki page. I took the liberty
> >>> of updating the only other 2 pages on the wiki that referenced
> >>> SCM with (pending) Git instructions as well:
> >>>
> >>> https://wiki.apache.org/tika/DeveloperResources
> >>> https://wiki.apache.org/tika/ReleaseProcess
> >>>
> >>> From the DISCUSS thread it would seem the following members of
> >>> the community support this move:
> >>>
> >>> Chris Mattmann
> >>> Tyler Palsulich
> >>> Bob Paulin
> >>> Hong-Thai Nguyen
> >>>
> >>> Oleg Tikhonov
> >>> David Meikle
> >>>
> >>>
> >>> Given the above I’m going to count the above people as +1 in
> >>> this VOTE if I don’t hear otherwise.
> >>>
> >>> Nick Burch said he would be more supportive if there was a guide,
> >>> so I made one and updated the other wiki docs 

RE: NER Parser tests behind proxy?

2015-11-23 Thread Tyler Palsulich
Apologies if I missed a discussion about this earlier, but should we be
downloading a model by default?

Tyler
On Nov 23, 2015 8:03 AM, "Allison, Timothy B."  wrote:

> The problem comes down to: ModelGetter.groovy which is trying to grab:
> ${basedir}/src/test/resources/org/apache/tika/parser/ner/opennlp/ner-person.bin
>
> If we could build a small model (and I mean really small) and package it
> with Tika, we wouldn't have to worry about http connectivity outside of the
> usual maven stuff.
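
On the proxy question in this thread: in plain Java, a download can be routed through an explicit HTTP proxy with java.net.Proxy. The sketch below only shows that mechanism. The host proxy.example.com and port 8080 are placeholders, and whether ModelGetter.groovy or the gmaven plugin can be wired up this way (or honors the standard http.proxyHost / http.proxyPort system properties) is an assumption that has not been checked.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Proxy;
import java.net.URL;
import java.net.URLConnection;

public class ProxyDownloadSketch {

    /** Builds an HTTP proxy descriptor; createUnresolved avoids a DNS lookup here. */
    public static Proxy httpProxy(String host, int port) {
        return new Proxy(Proxy.Type.HTTP, InetSocketAddress.createUnresolved(host, port));
    }

    /** Prepares a connection that will go through the proxy; nothing is sent until it is used. */
    public static URLConnection open(URL url, Proxy proxy) throws IOException {
        return url.openConnection(proxy);
    }

    public static void main(String[] args) throws IOException {
        // Placeholder proxy settings; replace with the values for your network.
        Proxy proxy = httpProxy("proxy.example.com", 8080);
        URL model = new URL("http://opennlp.sourceforge.net/models-1.5/en-ner-person.bin");
        URLConnection conn = open(model, proxy);
        // Only describes the planned request; calling conn.getInputStream() would start the download.
        System.out.println("Would fetch " + conn.getURL() + " via " + proxy.address());
    }
}
```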
>
> -Original Message-
> From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov]
> Sent: Monday, November 23, 2015 10:52 AM
> To: dev@tika.apache.org
> Cc: ThammeGowda Narayanaswamy 
> Subject: Re: NER Parser tests behind proxy?
>
> Hey Tim,
>
> I’m not seeing these of course b/c I’m not behind a proxy. Thamme, any
> ideas?
>
> Cheers,
> Chris
>
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398) NASA Jet
> Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Adjunct Associate Professor, Computer Science Department University of
> Southern California, Los Angeles, CA 90089 USA
> ++
>
>
>
>
>
> -Original Message-
> From: "Allison, Timothy B." 
> Reply-To: "dev@tika.apache.org" 
> Date: Thursday, November 19, 2015 at 5:36 PM
> To: "dev@tika.apache.org" 
> Subject: NER Parser tests behind proxy?
>
> >My proxy is configured for git/maven/etc, but how do I configure it
> >within the test so that I don't get this?
> >
> >GET : http://opennlp.sourceforge.net/models-1.5/en-ner-person.bin -> tika-parsers\src\test\resources\org\apache\tika\parser\ner\opennlp\ner-person.bin
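> >[INFO] ------------------------------------------------------------------------
> >[INFO] Reactor Summary:
> >[INFO]
> >[INFO] Apache Tika parent ........................ SUCCESS [3.264s]
> >[INFO] Apache Tika core .......................... SUCCESS [44.470s]
> >[INFO] Apache Tika parsers ....................... FAILURE [1:56.462s]
> >[INFO] Apache Tika XMP ........................... SKIPPED
> >[INFO] Apache Tika serialization ................. SKIPPED
> >[INFO] Apache Tika batch ......................... SKIPPED
> >[INFO] Apache Tika application ................... SKIPPED
> >[INFO] Apache Tika OSGi bundle ................... SKIPPED
> >[INFO] Apache Tika translate ..................... SKIPPED
> >[INFO] Apache Tika server ........................ SKIPPED
> >[INFO] Apache Tika examples ...................... SKIPPED
> >[INFO] Apache Tika Java-7 Components ............. SKIPPED
> >[INFO] Apache Tika ............................... SKIPPED
> >[INFO] ------------------------------------------------------------------------
> >[INFO] BUILD FAILURE
> >[INFO] ------------------------------------------------------------------------
> >[INFO] Total time: 2:45.245s
> >[INFO] Finished at: Thu Nov 19 20:29:34 EST 2015
> >[INFO] Final Memory: 52M/482M
> >[INFO] ------------------------------------------------------------------------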
> >[ERROR] Failed to execute goal org.codehaus.groovy.maven:gmaven-plugin:1.0:execute (testSetup) on project tika-parsers: java.net.ConnectException: Connection refused: connect -> [Help 1]
> >org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.codehaus.groovy.maven:gmaven-plugin:1.0:execute (testSetup) on project tika-parsers: java.net.ConnectException: Connection refused: connect
> >   at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:217)
> >   at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
> >   at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
> >   at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:84)
> >   at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:59)
> >   at org.apache.maven.lifecycle.internal.LifecycleStarter.singleThreadedBuild(LifecycleStarter.java:183)
> >   at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:161)
> >   at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:320)
> >   at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:156)
> >   at 

Re: [DISCUSS] Moving to Git

2015-11-18 Thread Tyler Palsulich
+1 from me.

Tyler
On Nov 18, 2015 6:46 AM, "Mattmann, Chris A (3980)" <
chris.a.mattm...@jpl.nasa.gov> wrote:

> Hey Team,
>
> I propose we move to writeable git repos for Tika for our repository.
> I mostly interact with Git & Github nowadays even with Tika using the
> mirroring and PR interaction support.
>
> Thoughts?
>
> Cheers,
> Chris
>
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++
>
>
>
>


Re: Named Entity Recognition support in trunk

2015-11-18 Thread Tyler Palsulich
That's awesome! Great work.

Have we tried running any benchmarks?

Tyler
On Nov 18, 2015 6:42 AM, "Mattmann, Chris A (3980)" <
chris.a.mattm...@jpl.nasa.gov> wrote:

> Hey Folks,
>
> With the commit of TIKA-1787/GH-61 in trunk we now have full integration
> of Named Entity Recognition with Stanford NER/NLP and Apache OpenNLP.
> Will also look to see if we can integrate NLTK too. This is a *big
> deal* since NER is something we’ve always wanted to pull into Tika.
>
> Woot!
>
> Cheers,
> Chris
>
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++
>
>
>
>


Re: [VOTE] Apache Tika 1.11 Release Candidate #1

2015-10-22 Thread Tyler Palsulich
+1 from me -- builds, tests pass, sanity check files parse, and sums look
good. But, I get a warning that the signature is not certified with a
trusted signature.

Tyler
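
For anyone checking the sums by hand, a minimal JDK-only sketch of the SHA1 comparison is below. The class name Sha1Check and its command-line arguments are hypothetical; it is not part of the release tooling, and it does not cover the separate GPG signature check.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Sha1Check {

    static MessageDigest newSha1() {
        try {
            return MessageDigest.getInstance("SHA-1");
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is a required algorithm on every Java platform, so this should not happen.
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    /** Renders a digest as lowercase hex, two characters per byte. */
    static String toHex(byte[] digest) {
        StringBuilder sb = new StringBuilder(digest.length * 2);
        for (byte b : digest) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static String sha1Hex(byte[] data) {
        return toHex(newSha1().digest(data));
    }

    /** Streams the file through the digest so large archives are not loaded into memory. */
    public static String sha1Hex(Path file) throws IOException {
        MessageDigest md = newSha1();
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        return toHex(md.digest());
    }

    /** Compares the published checksum with the computed one, ignoring case and padding. */
    public static boolean matches(String expectedHex, String actualHex) {
        return expectedHex.trim().equalsIgnoreCase(actualHex);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: Sha1Check <archive> <expected-sha1>");
            return;
        }
        String actual = sha1Hex(Paths.get(args[0]));
        System.out.println((matches(args[1], actual) ? "OK " : "MISMATCH ") + actual);
    }
}
```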

On Wed, Oct 21, 2015 at 6:43 AM Allison, Timothy B. 
wrote:

> +0 (some regressions in ppt content)
>
> I just finished the batch comparison run on  ~1.8 million files in our
> govdocs1 and commoncrawl corpora comparing Tika 1.10 to 1.11-rc1.  As a
> caveat, the eval code is still in development and there may be bugs in the
> reports.
>
> Results are here:
> https://github.com/tballison/share/blob/master/tika_comparisons/tika_1_10_vs_1_11-rc1.zip
>
> Key reports:
> contents/content_diffs.csv (file had one corrupt row when viewing in
> Excel...manually deleted offending content)
> exceptions/newExceptionsInBByMimeTypeByStackTrace.csv (small handful)
> exceptions/fixedExceptionsInBByMimeType.csv  (none!)
> mimes/mime_diffs_A_to_B.csv
>
> On the positive side:
> From "mime_diffs_A_to_B.csv", it looks like we are catching more pdfs as
> pdfs (than text/xhtml) than we were...great!  We're identifying more files
> as images (jpeg, pict) than as xhtml, and, from a quick look, this appears
> to be an improvement.  We have at least 9 new x-hwp-v5 (great!).
>
> On the negative side:
>
> 1) We have a few regressions in ppt exceptions (six of the same aioobe).
> 2) We have regressions in ppt content (it looks like we're not adding a
> new line/word break where we need to).  The regressions are small per file,
> but they affect ~220 ppts out of ~1500 (~15%).
>
> Other than the regressions in ppt content, I'd be +1, but I don't think
> this is severe enough to warrant a re-spin.  Happy to look into a fix,
> though, if we want a re-spin...and even if we don't, I'll start looking
> into this asap.
>
> -Original Message-
> From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov]
> Sent: Monday, October 19, 2015 10:23 AM
> To: dev@tika.apache.org
> Cc: u...@tika.apache.org
> Subject: [VOTE] Apache Tika 1.11 Release Candidate #1
>
> Hi Folks,
>
> A first candidate for the Tika 1.11 release is available at:
>
>   https://dist.apache.org/repos/dist/dev/tika/
>
> The release candidate is a zip archive of the sources in:
>   http://svn.apache.org/repos/asf/tika/tags/1.11-rc1/
>
> The SHA1 checksum of the archive is
> d0dde7b3a4f1a2fb6ccd741552ea180dddab630a
>
> In addition, a staged maven repository is available here:
>
> https://repository.apache.org/content/repositories/orgapachetika-1014/
>
>
> Please vote on releasing this package as Apache Tika 1.11.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Tika PMC votes are cast.
>
> [ ] +1 Release this package as Apache Tika 1.11
> [ ] -1 Do not release this package because…
>
> Cheers,
> Chris
>
> P.S. Of course here is my +1.
>
>
>
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398) NASA Jet
> Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Adjunct Associate Professor, Computer Science Department University of
> Southern California, Los Angeles, CA 90089 USA
> ++
>
>
>
>


Re: Tika Tesseract configuration

2015-10-17 Thread Tyler Palsulich
Hi Aditya,

The wiki (https://wiki.apache.org/tika/TikaOCR) also has some good
information about setting up and configuring Tesseract.

Let me know if you have any questions.

Thanks,
Tyler

On Wed, Oct 14, 2015, 6:59 AM Aditya Dhulipala  wrote:

> Hi Tika devs,
>
> Scratch that previous email.
>
> I found the TesseractOCRConfig .properties file
>
> I was looking for it in the wrong location.
>
> Sorry for the confusion.
>
> Thanks!
> --
> Aditya
>
>
> adi
>
> On Wed, Oct 14, 2015 at 9:52 AM, Aditya Dhulipala 
> wrote:
>
>> Tika Devs!
>>
>> I'm trying to run Tika with Tesseract.
>> I finished installing tesseract and confirmed that it's working correctly.
>>
>> I ran an image against Tika server expecting that tesseractOCR would be
>> enabled by default.
>>
>> But I noticed that the extracted metadata didn't have OCR output.
>>
>> Is this because tesseract is disabled by default?
>>
>> Should there be a TesseractConfig.properties file somewhere? (I read
>> about this in the TesseractOCRParser source. But I didn't find this file
>> anywhere)
>>
>>
>>
>> Thanks!
>> --
>> Aditya
>>
>>
>>
>>
>


Re: svn commit: r1706077 - /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/gdal/TestGDALParser.java

2015-10-01 Thread Tyler Palsulich
Hi Chris,

It looks like these two lines are equivalent (assert not null versus assert
true not null). Right?

Tyler

On Wed, Sep 30, 2015, 9:45 AM   wrote:

> Author: mattmann
> Date: Wed Sep 30 16:45:32 2015
> New Revision: 1706077
>
> URL: http://svn.apache.org/viewvc?rev=1706077&view=rev
> Log:
> - Files isn't always present (just found test case on older version of
> GDAL)
>
> Modified:
>
> tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/gdal/TestGDALParser.java
>
> Modified:
> tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/gdal/TestGDALParser.java
> URL:
> http://svn.apache.org/viewvc/tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/gdal/TestGDALParser.java?rev=1706077&r1=1706076&r2=1706077&view=diff
>
> ==
> ---
> tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/gdal/TestGDALParser.java
> (original)
> +++
> tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/gdal/TestGDALParser.java
> Wed Sep 30 16:45:32 2015
> @@ -69,7 +69,7 @@ public class TestGDALParser extends Tika
>  assertNotNull(met);
>  assertNotNull(met.get("Driver"));
>  assertEquals(expectedDriver, met.get("Driver"));
> -assertNotNull(met.get("Files"));
> +assumeTrue(met.get("Files") != null);
>  assertNotNull(met.get("Coordinate System"));
>  assertEquals(expectedCoordinateSystem, met.get("Coordinate
> System"));
>  assertNotNull(met.get("Size"));
>
>
>


Re: svn commit: r1706077 - /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/gdal/TestGDALParser.java

2015-10-01 Thread Tyler Palsulich
Hi Chris,

Ah, got it. I misread assume as assert. Doh!

Tyler
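
For readers skimming the thread, the difference between the two calls is in what a failure means to the test runner: a failed assertion fails the test, while a failed assumption causes it to be skipped. The sketch below mimics that behavior with plain JDK code. It is not JUnit itself, and the names AssumptionFailed, Outcome, and run are hypothetical stand-ins.

```java
import java.util.HashMap;
import java.util.Map;

public class AssumeVsAssert {

    /** Hypothetical stand-in for the exception a test framework throws on a failed assumption. */
    public static class AssumptionFailed extends RuntimeException {
        public AssumptionFailed(String message) {
            super(message);
        }
    }

    public enum Outcome { PASSED, FAILED, SKIPPED }

    /** A failed assumption means "this environment cannot run the test". */
    public static void assumeTrue(boolean condition) {
        if (!condition) {
            throw new AssumptionFailed("assumption not met");
        }
    }

    /** A failed assertion means "the code under test is wrong". */
    public static void assertNotNull(Object value) {
        if (value == null) {
            throw new AssertionError("expected a non-null value");
        }
    }

    /** Tiny runner: maps the two failure kinds to different outcomes. */
    public static Outcome run(Runnable test) {
        try {
            test.run();
            return Outcome.PASSED;
        } catch (AssumptionFailed e) {
            return Outcome.SKIPPED;
        } catch (AssertionError e) {
            return Outcome.FAILED;
        }
    }

    public static void main(String[] args) {
        // Metadata without a "Files" entry, as reported for some GDAL versions.
        Map<String, String> met = new HashMap<>();
        met.put("Driver", "example-driver");

        System.out.println(run(() -> assertNotNull(met.get("Files"))));      // FAILED
        System.out.println(run(() -> assumeTrue(met.get("Files") != null))); // SKIPPED
        System.out.println(run(() -> assertNotNull(met.get("Driver"))));     // PASSED
    }
}
```

Note that in this model everything after a failed assumption is skipped too, so later checks in the same test method do not run when "Files" is absent.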

On Thu, Oct 1, 2015, 6:45 AM Mattmann, Chris A (3980) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> Hey Tyler,
>
> assertNotNull returns void whereas I needed something testable for
> assumeTrue (since apparently gdal doesn’t always print out the
> Files output on all systems and versions which I found out yesterday).
>
> Make sense?
>
> Cheers,
> Chris
>
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++
>
>
>
>
>
> -Original Message-
> From: Tyler Palsulich <tpalsul...@gmail.com>
> Reply-To: "dev@tika.apache.org" <dev@tika.apache.org>
> Date: Thursday, October 1, 2015 at 6:39 AM
> To: "dev@tika.apache.org" <dev@tika.apache.org>, "comm...@tika.apache.org"
> <comm...@tika.apache.org>
> Subject: Re: svn commit: r1706077 -
> /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/gdal/TestGDAL
> Parser.java
>
> >Hi Chris,
> >
> >It looks like these two lines are equivalent (assert not null versus
> >assert
> >true not null). Right?
> >
> >Tyler
> >
> >On Wed, Sep 30, 2015, 9:45 AM  <mattm...@apache.org> wrote:
> >
> >> Author: mattmann
> >> Date: Wed Sep 30 16:45:32 2015
> >> New Revision: 1706077
> >>
> >> URL: http://svn.apache.org/viewvc?rev=1706077=rev
> >> Log:
> >> - Files isn't always present (just found test case on older version of
> >> GDAL)
> >>
> >> Modified:
> >>
> >>
> >>tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/gdal/TestGDA
> >>LParser.java
> >>
> >> Modified:
> >>
> >>tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/gdal/TestGDA
> >>LParser.java
> >> URL:
> >>
> >>
> http://svn.apache.org/viewvc/tika/trunk/tika-parsers/src/test/java/org/ap
> >>ache/tika/parser/gdal/TestGDALParser.java?rev=1706077=1706076=17060
> >>77=diff
> >>
> >>
> >>=
> >>=
> >> ---
> >>
> >>tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/gdal/TestGDA
> >>LParser.java
> >> (original)
> >> +++
> >>
> >>tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/gdal/TestGDA
> >>LParser.java
> >> Wed Sep 30 16:45:32 2015
> >> @@ -69,7 +69,7 @@ public class TestGDALParser extends Tika
> >>  assertNotNull(met);
> >>  assertNotNull(met.get("Driver"));
> >>  assertEquals(expectedDriver, met.get("Driver"));
> >> -assertNotNull(met.get("Files"));
> >> +assumeTrue(met.get("Files") != null);
> >>  assertNotNull(met.get("Coordinate System"));
> >>  assertEquals(expectedCoordinateSystem, met.get("Coordinate
> >> System"));
> >>  assertNotNull(met.get("Size"));
> >>
> >>
> >>
>
>


[jira] [Commented] (TIKA-1743) NetworkParser can create Unbounded Number of Threads

2015-09-22 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14903878#comment-14903878
 ] 

Tyler Palsulich commented on TIKA-1743:
---

[Copied from the list]

This sounds like a great idea! We should make the size of the pool configurable 
with TikaConfig.

> NetworkParser can create Unbounded Number of Threads
> 
>
> Key: TIKA-1743
> URL: https://issues.apache.org/jira/browse/TIKA-1743
> Project: Tika
>  Issue Type: Bug
>Reporter: Bob Paulin
>
> The current NetworkParser class creates new instances of the Thread class 
> with each call to parse.  This could create an unbounded number of threads 
> created by this class.  I'd suggest replacing this logic with a 
> ThreadPoolExecutor and a configurable number of threads.  This will help 
> prevent creating an unbounded number of threads and allow the user to tune 
> performance to the hardware.
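
A rough illustration of the suggestion above, assuming nothing about NetworkParser's actual code: work is submitted to a fixed-size pool instead of starting a new Thread per call, so the number of worker threads never exceeds the configured size. The class name BoundedParseWorkers and the plain int poolSize parameter are hypothetical; in the proposal the size would come from TikaConfig.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class BoundedParseWorkers {

    /**
     * Runs taskCount dummy tasks on a pool of poolSize threads and returns how many
     * distinct worker threads were actually used. The result can never exceed poolSize.
     */
    public static int distinctWorkerThreads(int poolSize, int taskCount) {
        // newFixedThreadPool is backed by a ThreadPoolExecutor with a fixed thread count.
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        Set<String> threadNames = ConcurrentHashMap.newKeySet();
        List<Future<?>> pending = new ArrayList<>();
        try {
            for (int i = 0; i < taskCount; i++) {
                // Stand-in for one call to parse(): record which worker ran it.
                pending.add(pool.submit(() -> {
                    threadNames.add(Thread.currentThread().getName());
                }));
            }
            for (Future<?> f : pending) {
                f.get(); // wait for completion and surface any task failure
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for tasks", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("task failed", e);
        } finally {
            pool.shutdown();
        }
        return threadNames.size();
    }

    public static void main(String[] args) {
        int used = distinctWorkerThreads(4, 100);
        System.out.println("100 tasks ran on " + used + " thread(s), pool size 4");
    }
}
```

Excess tasks wait in the pool's queue rather than each getting a thread, which is the trade-off that makes the thread count tunable to the hardware.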



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [jira] [Created] (TIKA-1743) NetworkParser can create Unbounded Number of Threads

2015-09-22 Thread Tyler Palsulich
This sounds like a great idea! We should make the size of the pool
configurable with TikaConfig.

On Tue, Sep 22, 2015, 3:04 PM Bob Paulin (JIRA)  wrote:

> Bob Paulin created TIKA-1743:
> 
>
>  Summary: NetworkParser can create Unbounded Number of Threads
>  Key: TIKA-1743
>  URL: https://issues.apache.org/jira/browse/TIKA-1743
>  Project: Tika
>   Issue Type: Bug
> Reporter: Bob Paulin
>
>
> The current NetworkParser class creates new instances of the Thread class
> with each call to parse.  This could create an unbounded number of threads
> created by this class.  I'd suggest replacing this logic with a
> ThreadPoolExecutor and a configurable number of threads.  This will help
> prevent creating an unbounded number of threads and allow the user to tune
> performance to the hardware.
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
>


Re: [ANNOUNCE] Welcome Bob Paulin as Tika Committer + PMC Member

2015-09-16 Thread Tyler Palsulich
Welcome!

On Wed, Sep 16, 2015, 6:37 PM Allison, Timothy B. 
wrote:

> Welcome!  Great to have you on board!
>
> Cheers,
>
> Tim
>
> -Original Message-
> From: Bob Paulin [mailto:b...@bobpaulin.com]
> Sent: Wednesday, September 16, 2015 9:16 PM
> To: dev@tika.apache.org
> Subject: Re: [ANNOUNCE] Welcome Bob Paulin as Tika Committer + PMC Member
>
> Hi Tika Community,
>
> I'm an independent developer [1], speaker[2], podcaster[3], and Java User
> Group [4] leader from Chicago.  I specialize in modular development with
> OSGi and commit code to Apache Felix.  For fun I coach football and
> robotics.  I have 3 kids and 1 very understanding wife.  Excited to be a
> part of the Tika Community!
>
> - Bob Paulin
> [1] https://github.com/bobpaulin
> [2] http://www.slideshare.net/bobpaulin
> [3] http://www.javaoffheap.com/
> [4] http://www.meetup.com/ChicagoJUG/
>
> On 9/16/2015 7:05 PM, David Meikle wrote:
> > Hello All,
> >
> > Please welcome Bob Paulin as he joins us as the latest Tika committer
> and PMC Member.
> >
> > Bob, please feel free to say a bit about yourself as an introduction to
> the group.
> >
> > Welcome aboard,
> > Dave
> >
> >
> >
> >
> >
>
>


[jira] [Commented] (TIKA-1672) Integrate tika-java7 component

2015-08-30 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14722705#comment-14722705
 ] 

Tyler Palsulich commented on TIKA-1672:
---

Hmm. Maybe we should rename the module? Right now, it doesn't make sense to 
have a java7 component when the entire project depends on Java 7.

 Integrate tika-java7 component
 --

 Key: TIKA-1672
 URL: https://issues.apache.org/jira/browse/TIKA-1672
 Project: Tika
  Issue Type: Improvement
Reporter: Tyler Palsulich
 Fix For: 1.11


 Code requiring Java 7 doesn't need to be in a separate module now that 
 TIKA-1536 (upgrade to Java 7) is done.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [ANNOUNCE] Apache Tika 1.10 release

2015-08-08 Thread Tyler Palsulich
Thanks, Dave!

On Sat, Aug 8, 2015, 7:01 AM David Meikle dmei...@apache.org wrote:

 The Apache Tika project is pleased to announce the release of Apache Tika
 1.10. The release contents have been pushed out to the main Apache release
 site and to the Central sync, so the releases should be available as soon
 as the mirrors get the syncs.

 Apache Tika is a toolkit for detecting and extracting metadata and
 structured text content from various documents using existing parser
 libraries.

 Apache Tika 1.10 contains a number of improvements and bug fixes. Details
 can be found in the changes file:
 http://www.apache.org/dist/tika/CHANGES-1.10.txt 

 Apache Tika is available in source form from the following download page:
 http://www.apache.org/dyn/closer.cgi/tika/apache-tika-1.10-src.zip 

 Apache Tika is also available in binary form or for use using Maven 2 from
 the Central Repository: http://repo1.maven.org/maven2/org/apache/tika/ 

 In the initial 48 hours, the release may not be available on all mirrors.
 When downloading from a mirror site, please remember to verify the
 downloads using signatures found on the Apache site:
 https://people.apache.org/keys/group/tika.asc 

 For more information on Apache Tika, visit the project home page:
 http://tika.apache.org/

 -- David Meikle, on behalf of the Apache Tika community




Re: [VOTE] Apache Tika 1.10 Release Candidate #1

2015-08-04 Thread Tyler Palsulich
Everything looks good to me! +1

Thanks, Dave!

Tyler

On Tue, Aug 4, 2015, 6:48 AM Ken Krugler kkrugler_li...@transpac.com
wrote:

 +1

 Built on Mac, tested with Bixo.

 -- Ken

  From: David Meikle
  Sent: August 2, 2015 12:15:24am PDT
  To: dev@tika.apache.org; u...@tika.apache.org
  Subject: [VOTE] Apache Tika 1.10 Release Candidate #1
 
  Hi Everyone,
 
  A candidate for the Apache Tika 1.10 release is available at:
 
  https://dist.apache.org/repos/dist/dev/tika/
 
  The release candidate is a zip archive of the sources in:
 
  http://svn.apache.org/repos/asf/tika/tags/1.10-rc1/
 
  The SHA1 checksum of the archive is
 
  b1573adcb194e2c09b77eccc3b1edd16bd4ac67d.
 
  In addition, a staged maven repository is available here:
 
  https://repository.apache.org/content/repositories/orgapachetika-1013
 
 
  Please vote on releasing this package as Apache Tika 1.10.
  The vote is open for the next 72 hours and passes if a majority of at
 least
  three +1 Tika PMC votes are cast.
 
  [ ] +1 Release this package as Apache Tika 1.10
 
  [ ] -1 Do not release this package because...
 
  Here is my +1!
 
  Cheers,
  Dave

 --
 Ken Krugler
 +1 530-210-6378
 http://www.scaleunlimited.com
 custom big data solutions & training
 Hadoop, Cascading, Cassandra & Solr













[jira] [Commented] (TIKA-1362) Add GoogleTranslate implementation of Translation API

2015-07-10 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14623246#comment-14623246
 ] 

Tyler Palsulich commented on TIKA-1362:
---

If you have a pressing need for better configuration abilities for the Google 
Translator, feel free to open up a new issue and upload a patch! :) We'd be 
happy to help you get started. Check out the [contributing 
page|https://tika.apache.org/contribute.html] for some general information.

 Add GoogleTranslate implementation of Translation API
 -

 Key: TIKA-1362
 URL: https://issues.apache.org/jira/browse/TIKA-1362
 Project: Tika
  Issue Type: Bug
  Components: translation
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: 1.6


 Add an implementation of the Translation API that uses the Google Translate 
 v2 API and Apache CXF: 
 https://www.googleapis.com/language/translate/v2



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TIKA-1672) Integrate tika-java7 component

2015-07-02 Thread Tyler Palsulich (JIRA)
Tyler Palsulich created TIKA-1672:
-

 Summary: Integrate tika-java7 component
 Key: TIKA-1672
 URL: https://issues.apache.org/jira/browse/TIKA-1672
 Project: Tika
  Issue Type: Improvement
Reporter: Tyler Palsulich
 Fix For: 1.10


Code requiring Java 7 doesn't need to be in a separate module now that 
TIKA-1536 (upgrade to Java 7) is done.





[jira] [Resolved] (TIKA-1536) Upgrade compiler definition in pom's to Java 7

2015-07-02 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich resolved TIKA-1536.
---
Resolution: Fixed

Upgraded in  r1688779. Thanks, all. Will open a new issue regarding integrating 
tika-java7.

 Upgrade compiler definition in pom's to Java 7
 --

 Key: TIKA-1536
 URL: https://issues.apache.org/jira/browse/TIKA-1536
 Project: Tika
  Issue Type: Improvement
  Components: packaging
Affects Versions: 1.7
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.10

 Attachments: TIKA-1536.patch


 Since we committed TIKA-1423 it would appear through [mailing 
 list|http://www.mail-archive.com/dev%40tika.apache.org/msg11542.html] 
 commentary that there is a willingness to drop support for Java 1.6 in favour 
 of >= Java 1.7.
 This issue simply addresses this.





[jira] [Commented] (TIKA-1536) Upgrade compiler definition in pom's to Java 7

2015-06-29 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605772#comment-14605772
 ] 

Tyler Palsulich commented on TIKA-1536:
---

Yep, see http://apache.markmail.org/thread/7oubuh4hp6rdlbch.

 Upgrade compiler definition in pom's to Java 7
 --

 Key: TIKA-1536
 URL: https://issues.apache.org/jira/browse/TIKA-1536
 Project: Tika
  Issue Type: Improvement
  Components: packaging
Affects Versions: 1.7
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.10

 Attachments: TIKA-1536.patch


 Since we committed TIKA-1423 it would appear through [mailing 
 list|http://www.mail-archive.com/dev%40tika.apache.org/msg11542.html] 
 commentary that there is a willingness to drop support for Java 1.6 in favour 
 of >= Java 1.7.
 This issue simply addresses this.





[jira] [Closed] (TIKA-1481) TikaJAXRS get metadata calls give different results

2015-06-29 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich closed TIKA-1481.
-
Resolution: Not A Problem

Hi [~arbuzovada]. Sorry for the trouble! Did you make sure to respond to the 
automated response, confirming your subscription?

I'm closing this issue as not a problem. But, don't hesitate to let us know if 
you have any more issues.

 TikaJAXRS get metadata calls give different results
 ---

 Key: TIKA-1481
 URL: https://issues.apache.org/jira/browse/TIKA-1481
 Project: Tika
  Issue Type: Bug
  Components: server
Affects Versions: 1.6
 Environment: Windows 8, JDK 1.8
Reporter: Darya Arbuzova
Priority: Minor
 Attachments: sample.csv


 Hello!
 I'm trying to use Tika in server mode.
 I downloaded tika-server-1.6.jar from http://mirror.vorboss.net/apache/tika/.
 I have tried to get file metadata in 2 different ways (as explained here: 
 http://wiki.apache.org/tika/TikaJAXRS ):
 {{ curl -T sample.csv http://localhost:9998/meta --header "Content-Type: text/csv"}}
 {{Content-Encoding,windows-1252}}
 {{Content-Type,text/plain; charset=windows-1252}}
 and
 {{ curl -X PUT -d @sample.csv http://localhost:9998/meta --header "Content-Type: text/csv"}}
 {{Content-Encoding,ISO-8859-1}}
 {{Content-Type,text/plain; charset=ISO-8859-1}}
 How come they give different results in encoding if I call the same 
 {{http://localhost:9998/meta}}?
 What other differences could appear, and which is the preferable way to 
 get metadata?
 Many thanks!
 Best regards,
 Darya Arbuzova
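One documented difference between the two calls above: curl -T uploads the file bytes unchanged, while curl -d @file strips carriage returns and newlines before sending (--data-binary preserves them). Whether that alone explains the different charset guesses is an assumption. The sketch below only shows how the two request bodies differ; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CurlBodyDemo {
    // Body as sent by "curl -T file": the bytes are uploaded unchanged.
    static byte[] uploadBody(byte[] file) {
        return file.clone();
    }

    // Body as sent by "curl -d @file": CR and LF bytes are dropped.
    static byte[] dataBody(byte[] file) {
        byte[] out = new byte[file.length];
        int n = 0;
        for (byte b : file) {
            if (b != '\r' && b != '\n') {
                out[n++] = b;
            }
        }
        return Arrays.copyOf(out, n);
    }

    public static void main(String[] args) {
        byte[] csv = "a,b\r\n1,2\r\n".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("-T body length: " + uploadBody(csv).length);
        System.out.println("-d body length: " + dataBody(csv).length);
    }
}
```

For a CSV file the second form collapses all rows into one line, so the server sees different content even though the URL is the same.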





[jira] [Resolved] (TIKA-756) XMP output from Tika CLI

2015-06-29 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich resolved TIKA-756.
--
Resolution: Fixed

Marking this as Fixed, since there are a few more references to tika-parser 
components (see TikaToXMP). Feel free to reopen if you disagree.

 XMP output from Tika CLI
 

 Key: TIKA-756
 URL: https://issues.apache.org/jira/browse/TIKA-756
 Project: Tika
  Issue Type: New Feature
  Components: cli, metadata
Reporter: Jukka Zitting
Assignee: Jörg Ehrlich
  Labels: metadata, xmp
 Attachments: tika-xmp.patch, tika-xmp_styleAndHeader.patch


 It would be great if the Tika CLI could output metadata also in the XMP 
 format.





[jira] [Closed] (TIKA-1429) Unable to View a 9mb file even after setting a large Heap Size of 3GB while TIKA GUI

2015-06-29 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich closed TIKA-1429.
-
Resolution: Not A Problem

Closing this as not a problem. The file needs to be kept in memory for the GUI 
to work. So, the problem should be fixed with a higher limit.
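To confirm which heap limit the JVM actually received (for example after passing -Xmx), a small check like the following can help. This is a sketch using only JDK calls; the class name is hypothetical and it is not part of Tika.

```java
public class HeapInfo {
    // Convert a byte count to whole mebibytes.
    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory() is the upper bound the JVM will try to use for the heap
        System.out.println("max heap (MiB):  " + toMiB(rt.maxMemory()));
        // totalMemory() - freeMemory() approximates the heap currently in use
        System.out.println("used heap (MiB): " + toMiB(rt.totalMemory() - rt.freeMemory()));
    }
}
```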

 Unable to  View a 9mb file even after setting  a large Heap Size of 3GB  
 while TIKA GUI
 ---

 Key: TIKA-1429
 URL: https://issues.apache.org/jira/browse/TIKA-1429
 Project: Tika
  Issue Type: Bug
  Components: gui
Affects Versions: 1.6
 Environment: Windows 8
Reporter: Gautham Gowrishankar
Priority: Minor

 We seem to have found an issue while running the tika 1.6 jar as a GUI (-g 
 option). It seems to work for smaller .tsv files, but we are running into a GC 
 overhead exception when opening one of the files in your dataset. Strangely, 
 it seems to work with the -x option.
 There might be an issue at 
 org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:284). 
 Just bringing it to your notice.
 Below are the logs.
 =
 Exception in thread "AWT-EventQueue-0" java.lang.OutOfMemoryError: GC overhead limit exceeded
 at java.util.Arrays.copyOfRange(Unknown Source)
 at java.lang.String.<init>(Unknown Source)
 at java.lang.StringBuilder.toString(Unknown Source)
 at java.lang.StackTraceElement.toString(Unknown Source)
 at java.lang.String.valueOf(Unknown Source)
 at java.lang.StringBuilder.append(Unknown Source)
 at java.lang.Throwable.printStackTrace(Unknown Source)
 at java.lang.Throwable.printStackTrace(Unknown Source)
 at org.apache.tika.gui.TikaGUI.handleError(TikaGUI.java:351)
 at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:284)
 at org.apache.tika.gui.TikaGUI.actionPerformed(TikaGUI.java:238)
 at javax.swing.AbstractButton.fireActionPerformed(Unknown Source)
 at javax.swing.AbstractButton$Handler.actionPerformed(Unknown Source)
 at javax.swing.DefaultButtonModel.fireActionPerformed(Unknown Source)
 at javax.swing.DefaultButtonModel.setPressed(Unknown Source)
 at javax.swing.AbstractButton.doClick(Unknown Source)
 at javax.swing.plaf.basic.BasicMenuItemUI.doClick(Unknown Source)
 at javax.swing.plaf.basic.BasicMenuItemUI$Handler.mouseReleased(Unknown Source)
 at java.awt.Component.processMouseEvent(Unknown Source)
 at javax.swing.JComponent.processMouseEvent(Unknown Source)
 at java.awt.Component.processEvent(Unknown Source)
 at java.awt.Container.processEvent(Unknown Source)
 at java.awt.Component.dispatchEventImpl(Unknown Source)
 at java.awt.Container.dispatchEventImpl(Unknown Source)
 at java.awt.Component.dispatchEvent(Unknown Source)
 at java.awt.LightweightDispatcher.retargetMouseEvent(Unknown Source)
 at java.awt.LightweightDispatcher.processMouseEvent(Unknown Source)
 at java.awt.LightweightDispatcher.dispatchEvent(Unknown Source)
 at java.awt.Container.dispatchEventImpl(Unknown Source)
 at java.awt.Window.dispatchEventImpl(Unknown Source)
 at java.awt.Component.dispatchEvent(Unknown Source)
 at java.awt.EventQueue.dispatchEventImpl(Unknown Source)
 Exception in thread "AWT-EventQueue-0" java.lang.OutOfMemoryError: GC overhead limit exceeded
 at java.security.ProtectionDomain$1.doIntersectionPrivilege(Unknown Source)
 at java.security.ProtectionDomain$1.doIntersectionPrivilege(Unknown Source)
 at java.awt.EventQueue$4.run(Unknown Source)
 at java.awt.EventQueue$4.run(Unknown Source)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.security.ProtectionDomain$1.doIntersectionPrivilege(Unknown Source)
 at java.awt.EventQueue.dispatchEvent(Unknown Source)
 at java.awt.EventDispatchThread.pumpOneEventForFilters(Unknown Source)
 at java.awt.EventDispatchThread.pumpEventsForFilter(Unknown Source)
 at java.awt.EventDispatchThread.pumpEventsForHierarchy(Unknown Source)
 at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
 at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
 at java.awt.EventDispatchThread.run(Unknown Source)
 Exception in thread "AWT-EventQueue-0" java.lang.OutOfMemoryError: GC overhead limit exceeded
 at java.lang.StringBuilder.toString(Unknown Source)
 at com.sun.java.swing.plaf.windows.TMSchema$Part.getControlName(Unknown Source)
 at com.sun.java.swing.plaf.windows.XPStyle.isSkinDefined(Unknown 
 Source

[jira] [Commented] (TIKA-1493) Update for JAXRS page with details on passing password

2015-06-29 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605292#comment-14605292
 ] 

Tyler Palsulich commented on TIKA-1493:
---

Can someone familiar with the latest in passing a password to Tika server 
update the wiki page? Or, is setting the environment variable enough?

 Update for JAXRS page with details on passing password
 --

 Key: TIKA-1493
 URL: https://issues.apache.org/jira/browse/TIKA-1493
 Project: Tika
  Issue Type: Improvement
  Components: documentation
Reporter: Peter Bowyer
Priority: Minor
  Labels: documentation, newbie

 I signed up for a wiki account to make the edit, but the page is immutable :(
 It would be really helpful to put on https://wiki.apache.org/tika/TikaJAXRS 
 information about passing the password for encrypted PDFs into TikaJAXRS. In 
 Changelog.txt I discovered the TIKA_PASSWORD environment variable which has 
 worked for me, and it'd be nice to save others having to hunt around.
 I'd also like to know if there's a way to pass it in per-request (a HTTP 
 header? Useful when many different passwords) - not found anything in the 
 source code for that though.





[jira] [Closed] (TIKA-1552) Pdf document parser

2015-06-29 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich closed TIKA-1552.
-
Resolution: Not A Problem

Marking this as not a problem, since Adobe Reader also adds white space.

 Pdf document parser
 ---

 Key: TIKA-1552
 URL: https://issues.apache.org/jira/browse/TIKA-1552
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.7
Reporter: Konstantin
 Attachments: 2014_US_Federal_Budget.pdf, issue.jpg


 Hello,
 We found that when a PDF document has marked text inside a frame (table), 
 Tika inserts tabs between words after parsing.
 Original text from attached file:
 Provides $17.7 billion in discretionary funding for the National Aeronautics 
 and Space
 Parsed text (jira removed tabs, so i will add - symbols instead):
 •Provides - $17.7 - 
 billion-in-discretionary-funding-for-the-National-Aeronautics-and-Space
 Please take a look at the attached screenshot.
 On the left side is the parsed text in a text editor.
 Thank you.





[jira] [Closed] (TIKA-1452) parser.parse() throws exception after which the procesed file is not getting renamed/moved/deleted

2015-06-29 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich closed TIKA-1452.
-
Resolution: Not A Problem

I'm closing this as not a problem. But, please feel free to reopen if you're 
still having this issue!

 parser.parse() throws exception after which the procesed file is not getting 
 renamed/moved/deleted
 --

 Key: TIKA-1452
 URL: https://issues.apache.org/jira/browse/TIKA-1452
 Project: Tika
  Issue Type: Bug
  Components: detector, metadata, parser
Affects Versions: 1.6
 Environment: jre6
Reporter: Abhishek

 I am passing a file as an input stream to the parser.parse() method while using 
 the Apache Tika library to convert a file to text. The method throws an exception 
 (displayed below), but the input stream is closed successfully in the finally 
 block. Then, while renaming the file, the File.renameTo method from java.io 
 returns false. I am not able to rename/delete/move the file despite 
 successfully closing the inputStream. I am afraid another instance of the file is 
 created while parser.parse() processes the file, and it doesn't get 
 closed by the time the exception is thrown. Is that possible? If so, what should 
 I do to rename or delete the file?
 The exception thrown while checking the content type is
 java.lang.NoClassDefFoundError: Could not initialize class 
 com.adobe.xmp.impl.XMPMetaParser
 at com.adobe.xmp.XMPMetaFactory.parseFromBuffer(XMPMetaFactory.java:160)
 at com.adobe.xmp.XMPMetaFactory.parseFromBuffer(XMPMetaFactory.java:144)
 at com.drew.metadata.xmp.XmpReader.extract(XmpReader.java:106)
 at com.drew.imaging.jpeg.JpegMetadataReader.extractMetadataFromJpegSegmentReader(JpegMetadataReader.java:112)
 at com.drew.imaging.jpeg.JpegMetadataReader.readMetadata(JpegMetadataReader.java:71)
 at org.apache.tika.parser.image.ImageMetadataExtractor.parseJpeg(ImageMetadataExtractor.java:91)
 at org.apache.tika.parser.jpeg.JpegParser.parse(JpegParser.java:56)
 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
 at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:121) 
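Note that NoClassDefFoundError is an Error rather than an Exception, so a catch (Exception e) block will not see it, although finally and try-with-resources still run. The sketch below closes the stream before moving the file and uses Files.move, which throws a descriptive IOException where File.renameTo only returns false. It does not diagnose the report above: parse is a hypothetical placeholder for the real parser call, and the class name is made up.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class ParseThenMove {
    // Hypothetical placeholder for parser.parse(); fails the way the report describes.
    static void parse(InputStream in) {
        throw new NoClassDefFoundError("simulated");
    }

    // Parses the file, then moves it. Returns true if the parse succeeded.
    static boolean parseAndMove(Path source, Path target) throws IOException {
        boolean parsed = false;
        try (InputStream in = Files.newInputStream(source)) {
            parse(in);
            parsed = true;
        } catch (Exception | LinkageError e) {
            // LinkageError covers NoClassDefFoundError
            System.err.println("parse failed: " + e);
        }
        // The stream opened above is closed by the time we get here.
        Files.move(source, target, StandardCopyOption.REPLACE_EXISTING);
        return parsed;
    }

    public static void main(String[] args) throws IOException {
        Path source = Files.createTempFile("sample", ".jpg");
        Path target = source.resolveSibling(source.getFileName() + ".done");
        System.out.println("parsed: " + parseAndMove(source, target));
        System.out.println("moved:  " + Files.exists(target));
        Files.deleteIfExists(target);
    }
}
```

If the move still fails with this structure, the IOException message usually says which process or handle is holding the file.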





[jira] [Closed] (TIKA-1439) PDF embeded with document can not parse.

2015-06-29 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich closed TIKA-1439.
-
Resolution: Duplicate

 PDF embeded with document can not parse.
 

 Key: TIKA-1439
 URL: https://issues.apache.org/jira/browse/TIKA-1439
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.6
 Environment: windows7
Reporter: sunxingzhe
  Labels: pdfbox
 Attachments: PDF2XHTML.java_diff.html


 I inserted an Excel file into the PDF file,
 but Tika cannot extract the embedded Excel resources.
 The attached file PDF2XHTML.java_diff.html is the diff file.
 Please confirm it.





[jira] [Updated] (TIKA-1233) PDFBox can throw StringIndexOutOfBoundsException on some dates

2015-06-29 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich updated TIKA-1233:
--
Fix Version/s: (was: 1.6)
   1.10

 PDFBox can throw StringIndexOutOfBoundsException on some dates
 --

 Key: TIKA-1233
 URL: https://issues.apache.org/jira/browse/TIKA-1233
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.5
Reporter: Tim Allison
Priority: Trivial
  Labels: easyfix
 Fix For: 1.10


 PDFBOX's date parser can throw a StringIndexOutOfBoundsException if a date 
 string for parsing is empty or contains only spaces.  A few of my test pdfs 
 have this feature.
 Until PDFBOX-1803 is resolved, we can add an extra catch to prevent this from 
 causing problems in TIKA
 {noformat}
 @@ -171,6 +171,9 @@
              addMetadata(metadata, TikaCoreProperties.CREATED, info.getCreationDate());
          } catch (IOException e) {
              // Invalid date format, just ignore
 +        } catch (StringIndexOutOfBoundsException e){
 +            //remove after PDFBOX-1883 is fixed
 +            // Invalid date format, just ignore
          }
          try {
              Calendar modified = info.getModificationDate();
 @@ -178,6 +181,9 @@
              addMetadata(metadata, TikaCoreProperties.MODIFIED, modified);
          } catch (IOException e) {
              // Invalid date format, just ignore
 +        } catch (StringIndexOutOfBoundsException e){
 +            //remove after PDFBOX-1883 is fixed
 +            // Invalid date format, just ignore
          }
 {noformat}
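The guard in the patch can be illustrated without PDFBox. In this sketch, parseYear is a hypothetical stand-in for a date parser that assumes non-empty input; it is not PDFBox's actual code, and the class name is made up.

```java
public class SafeDate {
    // Hypothetical stand-in for a date parser that assumes at least four characters.
    static int parseYear(String date) {
        return Integer.parseInt(date.trim().substring(0, 4));
    }

    // Returns the parsed year, or null when the date string is empty, blank, or malformed.
    static Integer yearOrNull(String date) {
        try {
            return parseYear(date);
        } catch (StringIndexOutOfBoundsException e) {
            // Empty or too-short date string: ignore it, as the patch does
            return null;
        } catch (NumberFormatException e) {
            // Non-numeric year: also treated as an invalid date
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(yearOrNull("20150629"));
        System.out.println(yearOrNull("   "));
    }
}
```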





[jira] [Resolved] (TIKA-1585) Create Example Website with Form Submission

2015-06-29 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich resolved TIKA-1585.
---
Resolution: Fixed

Good idea, [~lewismc]. I added it to 
http://people.apache.org/~tpalsulich/tika.html. The server is down right now. 
If/when another one is started, we'll need to start it with the right CORS 
argument (http://people.apache.org) and I'll update the page with the right IP 
address.

 Create Example Website with Form Submission
 ---

 Key: TIKA-1585
 URL: https://issues.apache.org/jira/browse/TIKA-1585
 Project: Tika
  Issue Type: New Feature
  Components: example, server
Reporter: Tyler Palsulich
Assignee: Tyler Palsulich

 It would be great to have a website where we can direct people who ask what 
 Tika can do for [filetype] without needing them to actually download Tika.
 Some initial work to do that is 
 [here|http://tpalsulich.github.io/TikaExamples/].
 I'm far from a design guru, but I imagine the site as having a form where you 
 can upload a file at the top, checkboxes for if you want metadata, content, 
 or both, and a submit button. The request should be sent with AJAX and the 
 result should populate a {{div}}.
 One issue with AJAX requests is that Tika Server doesn't currently allow 
 Cross-Origin-Resource-Sharing (CORS). So, we either need to maintain a 
 slightly updated tika-server, or update the server to allow configuration.





[jira] [Commented] (TIKA-1536) Upgrade compiler definition in pom's to Java 7

2015-06-29 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605300#comment-14605300
 ] 

Tyler Palsulich commented on TIKA-1536:
---

Now that 1.9 is released, are there any blockers for upgrading to Java 1.7?

 Upgrade compiler definition in pom's to Java 7
 --

 Key: TIKA-1536
 URL: https://issues.apache.org/jira/browse/TIKA-1536
 Project: Tika
  Issue Type: Improvement
  Components: packaging
Affects Versions: 1.7
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
 Fix For: 1.10

 Attachments: TIKA-1536.patch


 Since we committed TIKA-1423 it would appear through [mailing 
 list|http://www.mail-archive.com/dev%40tika.apache.org/msg11542.html] 
 commentary that there is a willingness to drop support for Java 1.6 in favour 
 of >= Java 1.7.
 This issue simply addresses this.





Re: Troubleshooting guide

2015-06-24 Thread Tyler Palsulich
Looks good! Thanks, Nick.

Tyler

On Wed, Jun 24, 2015 at 2:42 PM Nick Burch apa...@gagravarr.org wrote:

 Hi All

 I've had a go at writing up a troubleshooting guide on the wiki, hopefully
 covering the main problems people face (content detected wrong, parser
 missing etc). It's linked from the front page and at
 https://wiki.apache.org/tika/Troubleshooting%20Tika

 Please expand and correct it as needed!

 Thanks
 Nick



Re: Configuring parsers and translators

2015-06-13 Thread Tyler Palsulich
It seems like there are two goals here, both aiming to centralize
configuration:

1. Provide an easy mechanism to configure which parsers to use when
(TIKA-1509).
2. Configure all individual parser parameters in Tika Config (not in, for
example, TesseractOCRConfig.properties) (TIKA-1508).

I'm also in favor of consolidating everything in Tika Config.

Tyler

On Mon, Jun 8, 2015 at 7:25 AM Allison, Timothy B. talli...@mitre.org
wrote:

 Tyler, I see your devil's advocate point.

 I strongly agree with Chris about the benefit of centralizing
 configuration and making it easy to dump and modify the TikaConfig file.

 Even though the TikaConfig file might get ugly, it would be far better to
 have everything nailed down there than searching through service
 loaders...IMHO.

 I opened TIKA-1508 a while ago and haven't had any time to work on
 it...this just deals with simple parameter settings for parsers, not the
 far more difficult/interesting stuff that we've discussed with composite
 parsers.

  My main worry with putting it all into config xml is that we accidentally
 end up re-inventing spring badly...

 Yeah, or re-inventing Solr's parameter loading as my example does... :(

 I think that basic parameter setting should at least be fairly trivial to
 code...time allowing...argh.


 -Original Message-
 From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov]
 Sent: Saturday, June 06, 2015 7:01 PM
 To: dev@tika.apache.org
 Subject: Re: Configuring parsers and translators

 Hey Tyler,

 I hear you, but balance that against all the hidden things here
 and there, and everywhere, that I constantly keep discovering and
 having to pore through lines of TikaConfig - service loaders, class
 loaders.

 When things work right - no problem. When something goes wrong;
 HUGE waste of time.

 Cheers,
 Chris

 ++
 Chris Mattmann, Ph.D.
 Chief Architect
 Instrument Software and Science Data Systems Section (398)
 NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
 Office: 168-519, Mailstop: 168-527
 Email: chris.a.mattm...@nasa.gov
 WWW:  http://sunset.usc.edu/~mattmann/
 ++
 Adjunct Associate Professor, Computer Science Department
 University of Southern California, Los Angeles, CA 90089 USA
 ++




 -Original Message-
 From: Tyler Palsulich tpalsul...@gmail.com
 Reply-To: dev@tika.apache.org dev@tika.apache.org
 Date: Saturday, June 6, 2015 at 3:59 PM
 To: dev@tika.apache.org dev@tika.apache.org
 Subject: Re: Configuring parsers and translators

 (Devil's advocate hat slightly on.) My one hesitation about putting it all
 into tika-config is that the default might get to be a monstrosity --
 difficult for new users to use.
 
 Tyler
 
 On Sat, Jun 6, 2015 at 3:48 PM Mattmann, Chris A (3980) 
 chris.a.mattm...@jpl.nasa.gov wrote:
 
  I think it would be great to have all this in the Tika Config.
 
  The one thing then is to provide an example default config and
  to make it *hugely* clear rather than all the levels of indirection
  that we currently have going on which makes it super hard when
  there is a config error (SPI, swallowing print messages, etc.)
 
 
  ++
  Chris Mattmann, Ph.D.
  Chief Architect
  Instrument Software and Science Data Systems Section (398)
  NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
  Office: 168-519, Mailstop: 168-527
  Email: chris.a.mattm...@nasa.gov
  WWW:  http://sunset.usc.edu/~mattmann/
  ++
  Adjunct Associate Professor, Computer Science Department
  University of Southern California, Los Angeles, CA 90089 USA
  ++
 
 
 
 
  -Original Message-
  From: Tyler Palsulich tpalsul...@gmail.com
  Reply-To: dev@tika.apache.org dev@tika.apache.org
  Date: Saturday, June 6, 2015 at 3:45 PM
  To: dev@tika.apache.org dev@tika.apache.org
  Subject: Re: Configuring parsers and translators
 
  Hi Nick,
  
  I've been mulling this over since you sent the first message. But, I'm
  afraid I don't have a good solution or developed ideas.
  
  I agree, it would be very nice to consolidate all configuration for all
  parsers in the server and app.
  
  Is it feasible to put everything into tika-config? Then Parser
  implementations would read the config to pull out their own configuration.
  Or, would it be better to keep some configuration separate? Documentation
  would be an issue if every parser defines its own metadata keys... But, it
  might be an improvement since we don't have free form properties and
  configuration files.
  
  Tyler
  
  On Sat, Jun 6, 2015 at 12:30 PM Nick Burch apa...@gagravarr.org
 wrote:
  
   Anyone have any thoughts on this?
  
   On Fri, 8 May

[jira] [Closed] (TIKA-1199) Tika extracts weird signs instead of text

2015-06-09 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich closed TIKA-1199.
-
Resolution: Not A Problem

 Tika extracts weird signs instead of text
 -

 Key: TIKA-1199
 URL: https://issues.apache.org/jira/browse/TIKA-1199
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
 Environment: MacOSX, Linux
Reporter: Marc Teutelink
 Attachments: gaat fout.pdf, 
 plain_text_tika_output_from_gaat_fout_pdf.txt, 
 structured_text_tika_output_from_gaat_fout_pdf.xml


 Tika extracts complete bogus text from the attached document. I have attached 
 the .PDF in question and also added the plain and structured text output from 
 Tika.





[jira] [Resolved] (TIKA-1630) Mention APK support in List of Supported Formats

2015-06-09 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich resolved TIKA-1630.
---
   Resolution: Fixed
Fix Version/s: 1.9
 Assignee: Tyler Palsulich

Bolded the "Please note" for version 1.9. Hopefully that will help clear things 
up.

[~flowlo], thank you for reporting this! Please let us know if you run into any 
other issues or have any other suggested improvements.

 Mention APK support in List of Supported Formats
 

 Key: TIKA-1630
 URL: https://issues.apache.org/jira/browse/TIKA-1630
 Project: Tika
  Issue Type: Improvement
  Components: documentation
Affects Versions: 1.8
Reporter: Lorenz Leutgeb
Assignee: Tyler Palsulich
Priority: Trivial
 Fix For: 1.9


 http://tika.apache.org/1.8/formats.html claims to offer a full list of 
 supported formats but does not mention support for APK files at all.
 I trusted that source and only found that Tika supports APK files and their 
 respective MIME types from looking at Tika's codebase, which is suboptimal.
 Please add APK files to that list as appropriate (at least include the MIME 
 type Tika understands).
 Consider reevaluating the list to find out whether other formats are missing 
 (this is not covered by this ticket).





Re: [VOTE] Release Apache Tika 1.9 Candidate #2

2015-06-09 Thread Tyler Palsulich
+1 from me. Thanks for running this, Chris!


Tyler

On Mon, Jun 8, 2015 at 1:11 PM Allison, Timothy B. talli...@mitre.org
wrote:

 +1

 Built in Windows and Linux.  Works on problems (that I caused!) in rc1.

 Let's make sure to include last Java 1.6 version in the release notes,
 if that's what we've decided.

 Thank you, Chris!

 Best,

Tim


 -Original Message-
 From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov]
 Sent: Saturday, June 06, 2015 9:47 PM
 To: dev@tika.apache.org
 Cc: u...@tika.apache.org
 Subject: [VOTE] Release Apache Tika 1.9 Candidate #2

 Hi Folks,

 A second candidate for the Tika 1.9 release is available at:

   https://dist.apache.org/repos/dist/dev/tika/

 The release candidate is a zip archive of the sources in:
   http://svn.apache.org/repos/asf/tika/tags/1.9-rc2/

 The SHA1 checksum of the archive is
 9b78c9e9ce9640b402b7fef8e30f3cdbe384f44c.
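A voter can check the downloaded archive against the published checksum before casting a vote. A minimal sketch using only JDK classes; the class name and the command-line arguments are hypothetical:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Sha1Check {
    // Hex-encode the SHA-1 digest of everything read from the stream.
    static String sha1Hex(InputStream in) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            md.update(buf, 0, n);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // True if the file's SHA-1 matches the expected value (case-insensitive).
    static boolean matches(Path archive, String expected)
            throws IOException, NoSuchAlgorithmException {
        try (InputStream in = Files.newInputStream(archive)) {
            return sha1Hex(in).equalsIgnoreCase(expected.trim());
        }
    }

    public static void main(String[] args) throws Exception {
        // args[0]: path to the downloaded archive; args[1]: checksum from the vote email
        boolean ok = matches(Paths.get(args[0]), args[1]);
        System.out.println(ok ? "checksum OK" : "checksum MISMATCH");
    }
}
```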

 In addition, a staged maven repository is available here:
 https://repository.apache.org/content/repositories/orgapachetika-1011/


 Please vote on releasing this package as Apache Tika 1.9.
 The vote is open for the next 72 hours and passes if a majority of at
 least three +1 Tika PMC votes are cast.

 [ ] +1 Release this package as Apache Tika 1.9
 [ ] -1 Do not release this package because…

 Cheers,
 Chris

 P.S. Of course here is my +1.


 ++
 Chris Mattmann, Ph.D.
 Chief Architect
 Instrument Software and Science Data Systems Section (398)
 NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
 Office: 168-519, Mailstop: 168-527
 Email: chris.a.mattm...@nasa.gov
 WWW:  http://sunset.usc.edu/~mattmann/
 ++
 Adjunct Associate Professor, Computer Science Department
 University of Southern California, Los Angeles, CA 90089 USA
 ++





Re: Configuring parsers and translators

2015-06-06 Thread Tyler Palsulich
Hi Nick,

I've been mulling this over since you sent the first message. But, I'm
afraid I don't have a good solution or developed ideas.

I agree, it would be very nice to consolidate all configuration for all
parsers in the server and app.

Is it feasible to put everything into tika-config? Then Parser
implementations would read the config to pull out their own configuration.
Or, would it be better to keep some configuration separate? Documentation
would be an issue if every parser defines its own metadata keys... But, it
might be an improvement since we don't have free form properties and
configuration files.

Tyler

On Sat, Jun 6, 2015 at 12:30 PM Nick Burch apa...@gagravarr.org wrote:

 Anyone have any thoughts on this?

 On Fri, 8 May 2015, Nick Burch wrote:
  Hi All
 
  This came up in TIKA-1623, but I thought it might be better brought out to
  the list for discussion
 
  To configure parsers on a per-document basis, such as setting PDF
  spacing tolerances, or telling Tesseract what language it should be
  OCRing for, we have the *Config objects. You create one of these, use
  the setters to configure it for your document, pop it onto the Parse
  context and it's used when processing your document
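The per-document pattern described above boils down to a lookup keyed by config class. The sketch below is a simplified, self-contained stand-in: Context and PdfConfig are hypothetical names, not Tika's ParseContext or PDFParserConfig, and the setter is invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class ContextDemo {
    // Hypothetical per-document settings object with one setter.
    static class PdfConfig {
        private float spacingTolerance = 0.5f;
        void setSpacingTolerance(float t) { spacingTolerance = t; }
        float getSpacingTolerance() { return spacingTolerance; }
    }

    // Minimal type-keyed context: one value per config class.
    static class Context {
        private final Map<Class<?>, Object> entries = new HashMap<>();

        <T> void set(Class<T> key, T value) {
            entries.put(key, value);
        }

        <T> T get(Class<T> key, T defaultValue) {
            Object value = entries.get(key);
            return value == null ? defaultValue : key.cast(value);
        }
    }

    // A parser looks up its own config and falls back to defaults if none was supplied.
    static float toleranceFor(Context context) {
        return context.get(PdfConfig.class, new PdfConfig()).getSpacingTolerance();
    }

    public static void main(String[] args) {
        Context context = new Context();
        System.out.println("default:      " + toleranceFor(context));

        PdfConfig config = new PdfConfig();
        config.setSpacingTolerance(0.3f);
        context.set(PdfConfig.class, config);
        System.out.println("per-document: " + toleranceFor(context));
    }
}
```

Because the lookup is keyed by class, each parser only sees its own config object and documents without one get the defaults.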
 
  To configure parsers and translators on a per-JVM basis, to apply to all
  documents processed, it's a bit less consistent. At least some look for
  a properties file with a specific name, usually in the tika namespace,
  and grab their settings / keys / etc out of that. At least some expect
  to find a *Config with their program path on it, even though that
  remains constant between documents. None of them support getting their
  settings from the Tika Config
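The properties-file approach mentioned above might look roughly like this. It is a sketch only: the class name and the keys tesseractPath and language are chosen for illustration, and a StringReader stands in for a file found on the classpath.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Properties;

public class JvmWideConfig {
    // Load settings from a properties source, falling back to built-in defaults.
    static Properties load(Reader source) throws IOException {
        Properties defaults = new Properties();
        defaults.setProperty("language", "eng");
        defaults.setProperty("tesseractPath", "");

        Properties props = new Properties(defaults);
        props.load(source);
        return props;
    }

    public static void main(String[] args) throws IOException {
        // Only one key is overridden; the other comes from the defaults.
        Properties props = load(new StringReader("tesseractPath=/usr/local/bin\n"));
        System.out.println("tesseractPath = " + props.getProperty("tesseractPath"));
        System.out.println("language      = " + props.getProperty("language"));
    }
}
```

Settings loaded this way apply to every document processed in the JVM, which is why they sit awkwardly next to per-document config objects.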
 
 
  As part of our evolution of parser preferences, we're moving towards
  people either being able to set their preferences in code, or being able
  to supply a Tika Config xml which sets their parser preferences or
  overrides certain bits of the default. The code option works for people
  who want to declare certain specific things, the Tika Config one gives
  the same functionality but allows a consistent and clean way to set it
  between Tika App, Tika Server and java code.
 
  Another related example is the External Parser support. Because you can
  have multiple External Parser instances in your setup, one per format /
  program, we look for all the
  org/apache/tika/parser/external/tika-external-parsers.xml files on the
  classpath, and create parser instances based on definitions in there
 
 
  What do we think about setting executable paths and keys/logins for
  parsers like OCR, Strings, Translators etc? Always on ParseContext?
  Properties? Custom xml config? Tika config xml? Other? Combination?
 
  Nick
 



[jira] [Commented] (TIKA-1652) Tika Server should allow config file override from the command line like Tika App

2015-06-06 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14575986#comment-14575986
 ] 

Tyler Palsulich commented on TIKA-1652:
---

I think this is a duplicate of TIKA-1426?

 Tika Server should allow config file override from the command line like Tika 
 App
 -

 Key: TIKA-1652
 URL: https://issues.apache.org/jira/browse/TIKA-1652
 Project: Tika
  Issue Type: Bug
  Components: server
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
 Fix For: 1.9


 Tika-app's TikaCLI allows a command line parameter, --config, to override the 
 Tika config at the command line. For whatever reason, Tika-server doesn't. It
 should, since the lack of one causes a different control flow for how things get created. I
 first saw this when testing the CTAKESParser (TIKA-1645) in Tika-server.
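
As a rough illustration of the kind of override the issue asks for, the sketch below picks a config path out of command-line arguments. It assumes a `--config=<file>` spelling; `ConfigArgSketch` and `findConfigPath` are hypothetical names, and this is not the code of TikaCLI or Tika Server.

```java
import java.util.Optional;

final class ConfigArgSketch {
    private static final String PREFIX = "--config=";

    // Returns the value of the last non-empty --config=<file> argument, if any.
    static Optional<String> findConfigPath(String[] args) {
        Optional<String> result = Optional.empty();
        for (String arg : args) {
            if (arg.startsWith(PREFIX) && arg.length() > PREFIX.length()) {
                result = Optional.of(arg.substring(PREFIX.length()));
            }
        }
        return result;
    }
}
```

A server entry point could call this once at startup and fall back to the default configuration when the result is empty.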



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Configuring parsers and translators

2015-06-06 Thread Tyler Palsulich
(Devil's advocate hat slightly on.) My one hesitation about putting it all
into tika-config is that the default might get to be a monstrosity --
difficult for new users to use.

Tyler

On Sat, Jun 6, 2015 at 3:48 PM Mattmann, Chris A (3980) 
chris.a.mattm...@jpl.nasa.gov wrote:

 I think it would be great to have all this in the Tika Config.

 The one thing then is to provide an example default config and
 to make it *hugely* clear rather than all the levels of indirection
 that we currently have going on which makes it super hard when
 there is a config error (SPI, swallowing print messages, etc.)


 ++
 Chris Mattmann, Ph.D.
 Chief Architect
 Instrument Software and Science Data Systems Section (398)
 NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
 Office: 168-519, Mailstop: 168-527
 Email: chris.a.mattm...@nasa.gov
 WWW:  http://sunset.usc.edu/~mattmann/
 ++
 Adjunct Associate Professor, Computer Science Department
 University of Southern California, Los Angeles, CA 90089 USA
 ++




 -Original Message-
 From: Tyler Palsulich tpalsul...@gmail.com
 Reply-To: dev@tika.apache.org dev@tika.apache.org
 Date: Saturday, June 6, 2015 at 3:45 PM
 To: dev@tika.apache.org dev@tika.apache.org
 Subject: Re: Configuring parsers and translators

 Hi Nick,
 
 I've been mulling this over since you sent the first message. But, I'm
 afraid I don't have a good solution or developed ideas.
 
 I agree, it would be very nice to consolidate all configuration for all
 parsers in the server and app.
 
 Is it feasible to put everything into tika-config? Then Parser
 implementations would read the config to pull out their own configuration.
 Or, would it be better to keep some configuration separate? Documentation
 would be an issue if every parser defines its own metadata keys... But, it
 might be an improvement since we don't have free form properties and
 configuration files.
 
 Tyler
 
 On Sat, Jun 6, 2015 at 12:30 PM Nick Burch apa...@gagravarr.org wrote:
 
  Anyone have any thoughts on this?
 
  On Fri, 8 May 2015, Nick Burch wrote:
   Hi All
  
   This came up in TIKA-1623, but I thought it might be better brought
   out to the list for discussion
  
   To configure parsers on a per-document basis, such as setting PDF
   spacing tolerances, or telling Tesseract what language it should be
   OCRing for, we have the *Config objects. You create one of these, use
   the setters to configure it for your document, pop it onto the Parse
   context and it's used when processing your document
  
   To configure parsers and translators on a per-JVM basis, to apply to
   all documents processed, it's a bit less consistent. At least some look
   for a properties file with a specific name, usually in the tika namespace,
   and grab their settings / keys / etc out of that. At least some expect
   to find a *Config with their program path on it, even though that
   remains constant between documents. None of them support getting their
   settings from the Tika Config
  
  
   As part of our evolution of parser preferences, we're moving towards
   people either being able to set their preferences in code, or being
   able to supply a Tika Config xml which sets their parser preferences or
   overrides certain bits of the default. The code option works for
   people who want to declare certain specific things, the Tika Config one gives
   the same functionality but allows a consistent and clean way to set it
   between Tika App, Tika Server and java code.
  
   Another related example is the External Parser support. Because you
   can have multiple External Parser instances in your setup, one per format
   / program, we look for all the
   org/apache/tika/parser/external/tika-external-parsers.xml files on the
   classpath, and create parser instances based on definitions in there
  
  
   What do we think about setting executable paths and keys/logins for
   parsers like OCR, Strings, Translators etc? Always on ParseContext?
   Properties? Custom xml config? Tika config xml? Other? Combination?
  
   Nick
  
 




Re: [DISCUSS] Thinking about completely refactoring the ExternalParser and using commons-exec

2015-05-25 Thread Tyler Palsulich
On Mon, May 25, 2015 at 4:05 PM, Nick Burch apa...@gagravarr.org wrote:

 On Mon, 25 May 2015, Mattmann, Chris A (3980) wrote:

 ExternalParser is way broke. I have some patches that somewhat fix it,
 but in doing so, I realized, why not just use commons-exec? I realize that
 this is another dependency into core, but commons-exec simplifies a lot of
 the stuff that's broke with ExternalParser (reading its streams, for one).


 Maybe we could push some or all of external parser into the tika-parsers
 module, so we don't have to add more dependencies into core?


What is the argument for having ExternalParser in core? Provide an
easy-to-extend class for downstream users to create their own external
parser?

Tyler
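
For context on the stream-reading problem mentioned in the quoted message, here is a minimal JDK-only sketch of running an external program and draining its output. It uses `ProcessBuilder`, not commons-exec and not Tika's ExternalParser; `ExternalRunSketch` and `run` are hypothetical names chosen for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.concurrent.TimeUnit;

final class ExternalRunSketch {
    // Runs a command, merges stderr into stdout, and returns everything it printed.
    static String run(List<String> command, long timeoutSeconds)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.redirectErrorStream(true); // single stream to drain, so a full stderr pipe cannot stall the child
        Process process = builder.start();
        process.getOutputStream().close(); // the child receives no stdin

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = process.getInputStream()) {
            byte[] buffer = new byte[4096];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        }
        // The timeout only covers the wait after the output stream has ended.
        if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            process.destroyForcibly();
            throw new IOException("Timed out: " + command);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

Note the limitation called out in the comment: a child that never closes its output would block the read loop, which is one of the cases a dedicated library or a reader thread would handle better.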


Re: Any reason we removed the links to other downstream Tika APIs off the main web site?

2015-05-20 Thread Tyler Palsulich
Hi Chris,

I may have botched the version of the index on the site (see the other
thread with Nick's comments.) I'll investigate more tonight or tomorrow, if
you don't beat me to it.

Tyler
On May 20, 2015 4:39 PM, Mattmann, Chris A (3980) 
chris.a.mattm...@jpl.nasa.gov wrote:

 Hey Folks,

 Before, we had links in the description of Tika that Tyler put in
 that included links to e.g., Tika Python and other downstream APIs.
 Would there be objection to putting those links back up, they
 seemed to have been removed? I created a wiki page on our Tika
 wiki with links to downstream API bindings. I would like to add
 the text back in, and then e.g., link to that wiki page. That
 OK?

 If I don’t hear objections in the next day or so I will add the
 link back in.

 Cheers,
 Chris








[jira] [Commented] (TIKA-1624) Syntax error in DOAP file release section

2015-05-20 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14553281#comment-14553281
 ] 

Tyler Palsulich commented on TIKA-1624:
---

Thanks, Ken. I published the file a few minutes ago.

 Syntax error in DOAP file release section
 -

 Key: TIKA-1624
 URL: https://issues.apache.org/jira/browse/TIKA-1624
 Project: Tika
  Issue Type: Bug
 Environment: 
 http://svn.apache.org/repos/asf/tika/site/src/site/resources/doap.rdf
Reporter: Sebb
Assignee: Ken Krugler

 DOAP files can contain details of multiple release Versions, however each 
 must be listed in a separate release section, for example:
 <release>
   <Version>
     <name>Apache XYZ</name>
     <created>2015-02-16</created>
     <revision>1.6.2</revision>
   </Version>
 </release>
 <release>
   <Version>
     <name>Apache XYZ</name>
     <created>2014-09-24</created>
     <revision>1.6.1</revision>
   </Version>
 </release>
 Please can the project DOAP be corrected accordingly?
 Thanks.





Re: Any reason we removed the links to other downstream Tika APIs off the main web site?

2015-05-20 Thread Tyler Palsulich
Hi Chris,

I just looked again. I don't think this was a versioning issue -- I
intentionally removed the links. I think the best place to add them would
be on the Getting Started page [0] (at the bottom). But, it might be better
to link directly to the wiki and make the link more prominent (not at the
very bottom)? That way, we reduce the amount of duplicated information.

On the other hand, I think it would be good to mention (on the front page)
the top level ways you can use Tika: Java, command line, server, GUI, and
wrappers in Python, Julia, and more.

Apologies for the confusion. I believe the versioning issues from the other
thread have been resolved.

Tyler

On Wed, May 20, 2015 at 5:54 PM, Tyler Palsulich tpalsul...@gmail.com
wrote:

 Hi Chris,

 I may have botched the version of the index on the site (see the other
 thread with Nick's comments.) I'll investigate more tonight or tomorrow, if
 you don't beat me to it.

 Tyler
 On May 20, 2015 4:39 PM, Mattmann, Chris A (3980) 
 chris.a.mattm...@jpl.nasa.gov wrote:

 Hey Folks,

 Before, we had links in the description of Tika that Tyler put in
 that included links to e.g., Tika Python and other downstream APIs.
 Would there be objection to putting those links back up, they
 seemed to have been removed? I created a wiki page on our Tika
 wiki with links to downstream API bindings. I would like to add
 the text back in, and then e.g., link to that wiki page. That
 OK?

 If I don’t hear objections in the next day or so I will add the
 link back in.

 Cheers,
 Chris








[jira] [Commented] (TIKA-1630) Mention APK support in List of Supported Formats

2015-05-20 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14553272#comment-14553272
 ] 

Tyler Palsulich commented on TIKA-1630:
---

That is a very good point. There is a paragraph on the formats page which 
explains in a little bit more detail:
bq. (Please note that Apache Tika is able to detect a much wider range of 
formats than those listed below, this page only documents those formats from 
which Tika is able to extract metadata and/or textual content)

Would it help if we included a link to the mimetypes file (which has all 
filetypes Tika can detect)?

 Mention APK support in List of Supported Formats
 

 Key: TIKA-1630
 URL: https://issues.apache.org/jira/browse/TIKA-1630
 Project: Tika
  Issue Type: Improvement
  Components: documentation
Affects Versions: 1.8
Reporter: Lorenz Leutgeb
Priority: Trivial

 http://tika.apache.org/1.8/formats.html claims to offer a full list of 
 supported formats but does not mention support for APK files at all.
 I trusted that source and only found that Tika supports APK files and their 
 respective MIME types from looking at Tika's codebase, which is suboptimal.
 Please add APK files to that list as appropriate (at least include the MIME 
 type Tika understands).
 Consider reevaluating the list to find out whether other formats are missing 
 (this is not covered by this ticket).





[jira] [Commented] (TIKA-1630) Mention APK support in List of Supported Formats

2015-05-14 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14544104#comment-14544104
 ] 

Tyler Palsulich commented on TIKA-1630:
---

Hi. Thanks for reporting this! Can you be a little more specific about which 
file is supported? What in the Tika codebase indicates support for APK formats? 
Also, just to be clear, are you referring to android application packages?

 Mention APK support in List of Supported Formats
 

 Key: TIKA-1630
 URL: https://issues.apache.org/jira/browse/TIKA-1630
 Project: Tika
  Issue Type: Improvement
  Components: documentation
Affects Versions: 1.8
Reporter: Lorenz Leutgeb
Priority: Trivial

 http://tika.apache.org/1.8/formats.html claims to offer a full list of 
 supported formats but does not mention support for APK files at all.
 I trusted that source and only found that Tika supports APK files and their 
 respective MIME types from looking at Tika's codebase, which is suboptimal.
 Please add APK files to that list as appropriate (at least include the MIME 
 type Tika understands).
 Consider reevaluating the list to find out whether other formats are missing 
 (this is not covered by this ticket).





Published Site Changes

2015-05-14 Thread Tyler Palsulich
Hi Everyone,

I was about to update the site for TIKA-1619 (checksums wrong on the site),
but found unpublished changes in the site. This is the status after
checking out the repo and running `mvn install`:

➜  site  svn status
M   publish/1.7/examples.html
M   publish/1.8/examples.html
M   publish/1.8/index.html
M   publish/1.9/examples.html
M   publish/doap.rdf
M   publish/plugin-management.html
X   src/examples-src

Not all of the changes are correct (e.g. make the list of contributors for
1.8 point to the list for 1.7). So, I don't want to commit all of the
changes. Maybe someone (probably me) didn't add site/src when committing to
site/publish?

I think the doap.rdf change was from r1678405
(http://svn.apache.org/viewvc?view=revision&revision=1678405). But, I don't
know about the others.

Anyone have any ideas/clean solutions before I check each page by hand and
redo any necessary 1.7/8/9 changes?

Thanks,
Tyler


[jira] [Commented] (TIKA-1624) Syntax error in DOAP file release section

2015-05-14 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14544150#comment-14544150
 ] 

Tyler Palsulich commented on TIKA-1624:
---

[~kkrugler], yes. I just updated the release instructions.

 Syntax error in DOAP file release section
 -

 Key: TIKA-1624
 URL: https://issues.apache.org/jira/browse/TIKA-1624
 Project: Tika
  Issue Type: Bug
 Environment: 
 http://svn.apache.org/repos/asf/tika/site/src/site/resources/doap.rdf
Reporter: Sebb
Assignee: Ken Krugler

 DOAP files can contain details of multiple release Versions, however each 
 must be listed in a separate release section, for example:
 <release>
   <Version>
     <name>Apache XYZ</name>
     <created>2015-02-16</created>
     <revision>1.6.2</revision>
   </Version>
 </release>
 <release>
   <Version>
     <name>Apache XYZ</name>
     <created>2014-09-24</created>
     <revision>1.6.1</revision>
   </Version>
 </release>
 Please can the project DOAP be corrected accordingly?
 Thanks.





Re: Translation API question

2015-05-05 Thread Tyler Palsulich
Hi Sergey,

Unfortunately, not yet. See TIKA-1328.

Tyler

On Tue, May 5, 2015 at 4:51 PM, Sergey Beryozkin sberyoz...@gmail.com
wrote:

 Hi All

 Is it possible to submit a document to the Translation API and get the
 translated words as a sequence of events ? For example, with a regular Tika
 API it is possible to submit a document and get the metadata and the data,
 and these data can be indexed, etc.

 What about submitting a document (for ex, French) to the translation API
 and getting a list of the words in English, so that they can be indexed.

 I'm thinking, may be one then can use a query to find all the documents in
 French that contain a given word as it reads in English. Example: find a
 French doc containing thanks, etc...

 Not sure how much sense it makes though :-)

 Cheers, Sergey



Re: Java 1.6 support for Tika 1.9?

2015-04-27 Thread Tyler Palsulich
I should have included the fact this is the last release planned to support
Java 1.6 in the announcement (as we talked about a while back). But, since
that has passed, should we just update the announcement on the website,
wait another release, or just drop Java 1.6 support when we release 1.9?

I could be persuaded to do any of the above.

Tyler

On Mon, Apr 27, 2015 at 1:30 PM, Konstantin Gribov gros...@gmail.com
wrote:

 As I remember, we thought about announcing some release as the last Java 6
 compatible one and giving Tika users some time to migrate. E.g., we can
 announce 1.10 as the last Java 6 release when releasing 1.9. IMHO, in that case it
 wouldn't be a sudden change for downstream project developers and Tika
 users.

 --
 Best regards,
 Konstantin Gribov

 On Mon, Apr 27, 2015 at 20:09, Allison, Timothy B. talli...@mitre.org wrote:

  Hi All,
 
I can't remember where we are on this.  Are we dropping support for
 Java
  1.6 in Tika 1.9?  If so, should we open an issue to integrate tika-java7
  into core, add diamond operators, catching multiple exceptions...
 anything
  else...?
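
For reference, the two Java 7 language features named above (the diamond operator and multi-catch) look like this in a toy example. `Java7Sketch` and `parseAll` are made-up names and have nothing to do with Tika's code or with tika-java7.

```java
import java.util.ArrayList;
import java.util.List;

final class Java7Sketch {
    // Parses each token as an int; tokens that are null or malformed are skipped.
    static List<Integer> parseAll(String[] tokens) {
        List<Integer> values = new ArrayList<>(); // diamond operator (Java 7)
        for (String token : tokens) {
            try {
                values.add(Integer.parseInt(token.trim()));
            } catch (NumberFormatException | NullPointerException e) { // multi-catch (Java 7)
                // skip this token
            }
        }
        return values;
    }
}
```

On Java 6 the same method needs the type arguments repeated on the right-hand side and two separate catch blocks.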
 
Or, do we want to wait for Tika 2.0 or Tika 1.10?
 
Best,
 
   Tim
 
 



Re: comparing Tika's file detect with other tools?

2015-04-22 Thread Tyler Palsulich
Hi Tim,

I do not know whether there would be licensing concerns. But, we do have
TIKA-289 to track merging magic bytes from `file` into Tika.

Tyler

On Wed, Apr 22, 2015 at 10:40 AM, Ken Krugler kkrugler_li...@transpac.com
wrote:

 Hi Tim,

 I don't believe there's any issue with comparing results.

 If you were looking at the source for file, then it gets more gray, but
 I think even that would be OK as long as you weren't copying code or
 directly re-implementing algorithms.

 -- Ken

  From: Allison, Timothy B.
  Sent: April 22, 2015 5:47:17am PDT
  To: dev@tika.apache.org
  Subject: comparing Tika's file detect with other tools?
 
  Would it be frowned upon to compare Tika's file detection with other
 tools, like file?  Any concerns about effectively reverse engineering
 (when we find that Tika is wrong) from a non-Apache project?
 
  Any other sensitivities I should be aware of?
 
  Best,
 
   Tim


 --
 Ken Krugler
 +1 530-210-6378
 http://www.scaleunlimited.com
 custom big data solutions & training
 Hadoop, Cascading, Cassandra & Solr








[jira] [Commented] (TIKA-1585) Create Example Website with Form Submission

2015-04-22 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14507259#comment-14507259
 ] 

Tyler Palsulich commented on TIKA-1585:
---

Is there an Apache hosted location we'd like to stand this up? If not, I'll 
close this issue off.

http://tpalsulich.github.io/TikaExamples/

 Create Example Website with Form Submission
 ---

 Key: TIKA-1585
 URL: https://issues.apache.org/jira/browse/TIKA-1585
 Project: Tika
  Issue Type: New Feature
  Components: example, server
Reporter: Tyler Palsulich
Assignee: Tyler Palsulich

 It would be great to have a website where we can direct people who ask what 
 Tika can do for [filetype] without needing them to actually download Tika.
 Some initial work to do that is 
 [here|http://tpalsulich.github.io/TikaExamples/].
 I'm far from a design guru, but I imagine the site as having a form where you 
 can upload a file at the top, checkboxes for if you want metadata, content, 
 or both, and a submit button. The request should be sent with AJAX and the 
 result should populate a {{div}}.
 One issue with AJAX requests is that Tika Server doesn't currently allow 
 Cross-Origin-Resource-Sharing (CORS). So, we either need to maintain a 
 slightly updated tika-server, or update the server to allow configuration.
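
To make the CORS point concrete, the sketch below computes the response headers a server could add for a given request origin. It assumes a configurable set of allowed origins, which is exactly the configuration the description says does not exist yet; `CorsSketch` and `corsHeaders` are hypothetical names, not part of tika-server.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

final class CorsSketch {
    // Returns the CORS response headers for a request origin,
    // or an empty map if the origin is missing or not allowed.
    static Map<String, String> corsHeaders(String origin, Set<String> allowedOrigins) {
        Map<String, String> headers = new LinkedHashMap<>();
        if (origin == null) {
            return headers;
        }
        if (allowedOrigins.contains("*")) {
            headers.put("Access-Control-Allow-Origin", "*");
        } else if (allowedOrigins.contains(origin)) {
            headers.put("Access-Control-Allow-Origin", origin);
            headers.put("Vary", "Origin"); // the response now depends on the request origin
        }
        return headers;
    }
}
```

A browser only exposes the AJAX response to the page when `Access-Control-Allow-Origin` matches the page's origin or is `*`, which is why the example site cannot call an unmodified server.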





Re: NUTCH-1994 and UCAR Dependencies

2015-04-21 Thread Tyler Palsulich
Hi Lewis,

I also tried upgrading Tika in Nutch. But, ran into the same issue (but,
udunits is found, as expected):

[ivy:retrieve] ::
[ivy:retrieve] ::  UNRESOLVED DEPENDENCIES ::
[ivy:retrieve] ::
[ivy:retrieve] :: edu.ucar#jj2000;5.2: not found
[ivy:retrieve] :: org.itadaki#bzip2;0.9.1: not found
[ivy:retrieve] ::

Thanks for pushing the dependencies out.

Tyler

On Tue, Apr 21, 2015 at 1:50 PM, Lewis John Mcgibbney 
lewis.mcgibb...@gmail.com wrote:

 Hi Folks,
 Whilst addressing NUTCH-1994, I've experienced a dependency problem
 (related to unpublished artifacts on Maven Central) which I am working
 through right now.
 When making the upgrade in Nutch, I get the following

 [ivy:resolve]   -- artifact edu.ucar#udunits;4.5.5!udunits.jar:
 [ivy:resolve]

 http://oss.sonatype.org/content/repositories/releases/edu/ucar/udunits/4.5.5/udunits-4.5.5.jar
 [ivy:resolve] ::
 [ivy:resolve] ::  UNRESOLVED DEPENDENCIES ::
 [ivy:resolve] ::
 [ivy:resolve] :: edu.ucar#jj2000;5.2: not found
 [ivy:resolve] :: org.itadaki#bzip2;0.9.1: not found
 [ivy:resolve] :: edu.ucar#udunits;4.5.5: not found
 [ivy:resolve] ::
 [ivy:resolve]
 [ivy:resolve] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS

 BUILD FAILED
 /usr/local/trunk_clean/build.xml:112: The following error occurred while
 executing this line:
 /usr/local/trunk_clean/src/plugin/build.xml:60: The following error
 occurred while executing this line:
 /usr/local/trunk_clean/src/plugin/build-plugin.xml:229: impossible to
 resolve dependencies:
 resolve failed - see output for details

 Total time: 17 seconds

 I've just this minute pushed the edu.ucar#udunits;4.5.5 artifacts so they
 will be available imminently. The remaining artifact at edu.ucar#jj2000;5.2
 has a corrupted POM which means that OSS Nexus will not accept it. I'll
 send a pull request further upstream for that ASAP.

 Finally, the BZIP dependency is a 3rd party dependency from another Org,
 Licensed under MIT license. So I will register interest to publish this
 dependency, push it, then we will be good to go.

 Lewis



 --
 *Lewis*



[jira] [Commented] (TIKA-1607) Introduce new HashMap<String, Object> data structure for persistence of Tika Metadata

2015-04-20 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14503778#comment-14503778
 ] 

Tyler Palsulich commented on TIKA-1607:
---

Good idea! What if you created a subclass of {{Metadata}} 
({{ExtendedMetadata}}?) which supports mapping to a {{List<Map<String, Object>>}}.
Then, when populating the metadata with a phone number, you can 
check if {{metadata instanceof ExtendedMetadata}} and respond accordingly.
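
A minimal sketch of that subclass idea follows. `SketchMetadata` is a simplified stand-in for Tika's `Metadata` (multi-valued strings only), and `SketchExtendedMetadata`, `PhoneRecorder`, and the `"number"` key are hypothetical names used only for illustration. It is written for a current JDK and ignores the Java 6 constraint discussed elsewhere on the list.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified stand-in for Tika's Metadata: multi-valued String entries only.
class SketchMetadata {
    private final Map<String, List<String>> values = new HashMap<>();

    void add(String name, String value) {
        values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    List<String> getValues(String name) {
        return values.getOrDefault(name, new ArrayList<>());
    }
}

// Hypothetical subclass that can also hold structured entries per key.
class SketchExtendedMetadata extends SketchMetadata {
    private final Map<String, List<Map<String, Object>>> structured = new HashMap<>();

    void addStructured(String name, Map<String, Object> entry) {
        structured.computeIfAbsent(name, k -> new ArrayList<>()).add(entry);
    }

    List<Map<String, Object>> getStructured(String name) {
        return structured.getOrDefault(name, new ArrayList<>());
    }
}

final class PhoneRecorder {
    // Always records the plain number; adds details only when the metadata can hold them.
    static void record(SketchMetadata metadata, String number, Map<String, Object> details) {
        metadata.add("phonenumbers", number);
        if (metadata instanceof SketchExtendedMetadata) {
            Map<String, Object> entry = new HashMap<>(details);
            entry.put("number", number);
            ((SketchExtendedMetadata) metadata).addStructured("phonenumbers", entry);
        }
    }
}
```

Existing callers that pass a plain metadata object keep the current String[]-style behavior, which is the backwards-compatibility property the suggestion is after.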

Any drastic changes would be a good candidate for Tika 2.0.

 Introduce new HashMap<String, Object> data structure for persistence of Tika 
 Metadata
 -

 Key: TIKA-1607
 URL: https://issues.apache.org/jira/browse/TIKA-1607
 Project: Tika
  Issue Type: Improvement
  Components: core, metadata
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Critical
 Fix For: 1.9


 I am currently working implementing more comprehensive extraction and 
 enhancement of the Tika support for Phone number extraction and metadata 
 modeling.
 Right now we utilize the String[] multivalued support available within Tika 
 to persist phone numbers as 
 {code}
 Metadata: String: String[]
 Metadata: phonenumbers: number1, number2, number3, ...
 {code}
 I would like to propose we extend multi-valued support outside of the 
 String[] paradigm by implementing a more abstract Collection of Objects such 
 that we could consider and implement the phone number use case as follows
 {code}
 Metadata: String:  List<HashMap<String,String>>
 {code}
 Where Object could be a Collection<HashMap<String/Property, String/int/long>> 
 e.g.
 {code}
 Metadata: phonenumbers: [(+162648743476: (LibPN-CountryCode : US), 
 (LibPN-NumberType: International), (etc: etc)...), (+1292611054: 
 (LibPN-CountryCode : UK), (LibPN-NumberType: International), (etc: etc)...), 
 (etc)] 
 {code}
 There are obvious backwards compatibility issues with this approach... 
 additionally it is a fundamental change to the core Metadata API. I hope that 
 the <String, Object> mapping however is flexible enough to allow me to model 
 Tika Metadata the way I want.
 Any comments folks? Thanks
 Lewis





[ANNOUNCE] Apache Tika 1.8 Released

2015-04-20 Thread Tyler Palsulich
The Apache Tika project is pleased to announce the release of Apache Tika
1.8. The release
contents have been pushed out to the main Apache release site and to the
Maven Central sync, so the releases should be available as soon as the
mirrors get the syncs.

Apache Tika is a toolkit for detecting and extracting metadata and
structured text content
from various documents using existing parser libraries.

Apache Tika 1.8 contains a number of improvements and bug fixes. Details
can be found in the changes file:
http://www.apache.org/dist/tika/CHANGES-1.8.txt

Apache Tika is available in source form from the following download page:
http://www.apache.org/dyn/closer.cgi/tika/apache-tika-1.8-src.zip

Apache Tika is also available in binary form or for use using Maven 2 from
the Central Repository: http://repo1.maven.org/maven2/org/apache/tika/

In the initial 48 hours, the release may not be available on all mirrors.
When downloading from a mirror site, please remember to verify the
downloads using signatures found on the Apache site:
https://people.apache.org/keys/group/tika.asc

For more information on Apache Tika, visit the project home page:
http://tika.apache.org/

-- Tyler Palsulich, on behalf of the Apache Tika community


[RESULT] [VOTE] Apache Tika 1.8 Release Candidate #2

2015-04-20 Thread Tyler Palsulich
Hi Everyone,

The VOTE to release Tika 1.8 RC #2 has passed with the following tally:

+1:
Chris Mattmann
Hong-Thai Nguyen
Konstantin Gribov
Lewis John Mcgibbney
Oleg Tikhonov
Tim Allison
Tyler Palsulich

±0:
None

-1:
None

I'll move forward with the release process now.

Thank you all for your VOTE and collaboration,
Tyler


Re: [VOTE] Apache Tika 1.8 Release Candidate #2

2015-04-20 Thread Tyler Palsulich
Thank you, Everyone! I'll move forward now.

Lewis, KEYS are here: https://people.apache.org/keys/group/tika.asc.

Of course, I'm also +1.

Tyler

On Mon, Apr 20, 2015 at 3:47 PM, Lewis John Mcgibbney 
lewis.mcgibb...@gmail.com wrote:

 Hi Folks,

 On Thu, Apr 16, 2015 at 2:42 PM, dev-digest-h...@tika.apache.org wrote:

 
   Hi Folks,
  
   A candidate for the Tika 1.8 release is available at:
 https://dist.apache.org/repos/dist/dev/tika/
  
   The release candidate is a zip archive of the sources in:
 http://svn.apache.org/repos/asf/tika/tags/1.8-rc2/
  
   The SHA1 checksum of the archive is
 5e22fee9079370398472e59082d171ae2d7fdd31.
  
   In addition, a staged maven repository is available here:
  
 https://repository.apache.org/content/repositories/orgapachetika-1009
  
   Please vote on releasing this package as Apache Tika 1.8. The vote is
  open
   for the next 72 hours and passes if a majority of at least three +1
 Tika
   PMC votes are cast.
 


 Where is the KEYS?
 All signatures are fine.
 Tests are A-OK.
 The remaining issue is TIKA-1616, which was patched and
 committed to trunk.
 IMHO this is not a blocker. We could probably release 1.9 in a shorter
 release cycle to accommodate the change.


  
   [X] +1 Release this package as Apache Tika 1.8


 I am +1 for releasing this as 1.8.
 Lewis



Re: [VOTE] Apache Tika 1.8 Release Candidate #2

2015-04-19 Thread Tyler Palsulich
Hi Ken,

Sorry for the delayed response. No, that patch is not included in this RC
(as I think you know, given your resolution of TIKA-1606).

Have a good night,
Tyler

On Sun, Apr 19, 2015 at 10:49 AM, Ken Krugler kkrugler_li...@transpac.com
wrote:

 Hi Tyler,

 Does this include Lewis's fix for
 https://issues.apache.org/jira/browse/TIKA-1606?

 It's a simple change (bumping the Guava version), but as seen this can
 have unexpected consequences.

 I'm fine either way.

 -- Ken

  From: Tyler Palsulich
  Sent: April 18, 2015 8:29:22pm PDT
  To: dev@tika.apache.org
  Subject: RE: [VOTE] Apache Tika 1.8 Release Candidate #2
 
  Hi Folks,
 
  If there are no blocking complaints (OSGi?) by Monday (a little longer
 than
  3 days, I realize), I'll mark this as passed and finish the release
 process.
 
  Of course, it's no problem for me to cut another RC, if it's needed.
 
  Have a great weekend!
  Tyler
  I've run into one problem while testing Tika 1.8 with Bixo
 
  It involves a dependency issue involving (of course) Guava, since that
  project loves to break their API :(
 
  The bixo-core jar has these transitive dependencies on various versions
 of
  Guava:
 
  Hadoop - 11.0.2
  Cascading - 14.0.1
  Tika-parsers - 10.0.1
 cdm - 17.0
 
  Everyone winds up using version 10.0.1 (note that Tika has a dependency
 on
  cdm, which wants to use 17.0)
 
  The problem is that Hadoop (for any recent version) uses an API from
  Guava's cache implementation that no longer exists:
 
 
 java.lang.NoSuchMethodError: com.google.common.cache.CacheBuilder.build(Lcom/google/common/cache/CacheLoader;)Lcom/google/common/cache/LoadingCache;
     at org.apache.hadoop.io.compress.CodecPool.createCache(CodecPool.java:62)
     at org.apache.hadoop.io.compress.CodecPool.<clinit>(CodecPool.java:74)
     at org.apache.hadoop.io.SequenceFile$Writer.close(SequenceFile.java:1272)
     at org.apache.hadoop.mapred.SequenceFileOutputFormat$1.close(SequenceFileOutputFormat.java:79)
 
  So what this means is that anyone trying to use Tika with Hadoop will
 need
  to play games with the class loader to get the older version of Guava -
  though that can cause other issues if Hadoop (or Cascading, etc) rely on
  anything that's only in the newer Guava API.
 
  Guava 10.0.1 was released about 3.5 years ago; 11.0.2 was from about 3
  years ago. So it seems like we should upgrade to at least 11.0.2
 
  But I don't know if this is enough of an issue to require another RC.
 
  -- Ken
 
  PS - I've created https://issues.apache.org/jira/browse/TIKA-1606 to
 track
  this.
 
 
  From: Tyler Palsulich
  Sent: April 13, 2015 10:56:29am PDT
  To: dev@tika.apache.org, u...@tika.apache.org
  Subject: [VOTE] Apache Tika 1.8 Release Candidate #2
 
  Hi Folks,
 
  A candidate for the Tika 1.8 release is available at:
   https://dist.apache.org/repos/dist/dev/tika/
 
  The release candidate is a zip archive of the sources in:
   http://svn.apache.org/repos/asf/tika/tags/1.8-rc2/
 
  The SHA1 checksum of the archive is
   5e22fee9079370398472e59082d171ae2d7fdd31.
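
For anyone checking the archive by hand, a SHA-1 comparison can be sketched with the JDK's `MessageDigest`. `Sha1Sketch`, `sha1Hex`, and `matches` are hypothetical helper names; a checksum match does not replace verifying the release signatures.

```java
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class Sha1Sketch {
    // Streams the input through SHA-1 and returns the digest as lowercase hex.
    static String sha1Hex(InputStream in) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-1");
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            digest.update(buffer, 0, n);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    // Compares the computed digest with an expected hex string, ignoring case and surrounding whitespace.
    static boolean matches(InputStream in, String expectedHex)
            throws IOException, NoSuchAlgorithmException {
        return sha1Hex(in).equalsIgnoreCase(expectedHex.trim());
    }
}
```

In practice the stream would come from the downloaded archive (for example via `java.nio.file.Files.newInputStream`), and the expected value would be the checksum quoted in the vote email.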
 
  In addition, a staged maven repository is available here:
   https://repository.apache.org/content/repositories/orgapachetika-1009
 
  Please vote on releasing this package as Apache Tika 1.8. The vote is
  open for the next 72 hours and passes if a majority of at least three +1
  Tika PMC votes are cast.
 
  [ ] +1 Release this package as Apache Tika 1.8
  [ ] ±0 I don't object to this release, but I haven't checked it
  [ ] -1 Do not release this package because...
 
  Thanks,
  Tyler


 --
 Ken Krugler
 +1 530-210-6378
 http://www.scaleunlimited.com
 custom big data solutions & training
 Hadoop, Cascading, Cassandra & Solr








RE: [VOTE] Apache Tika 1.8 Release Candidate #2

2015-04-18 Thread Tyler Palsulich
Hi Folks,

If there are no blocking complaints (OSGi?) by Monday (a little longer than
3 days, I realize), I'll mark this as passed and finish the release process.

Of course, it's no problem for me to cut another RC, if it's needed.

Have a great weekend!
Tyler

I've run into one problem while testing Tika 1.8 with Bixo.

It involves a dependency issue involving (of course) Guava, since that
project loves to break their API :(

The bixo-core jar has these transitive dependencies on various versions of
Guava:

Hadoop - 11.0.2
Cascading - 14.0.1
Tika-parsers - 10.0.1
cdm - 17.0

Everyone winds up using version 10.0.1 (note that Tika has a dependency on
cdm, which wants to use 17.0)

The problem is that Hadoop (for any recent version) uses an API from
Guava's cache implementation that doesn't exist in 10.0.1:

com.google.common.cache.CacheBuilder.build(Lcom/google/common/cache/CacheLoader;)Lcom/google/common/cache/LoadingCache;
java.lang.NoSuchMethodError:
com.google.common.cache.CacheBuilder.build(Lcom/google/common/cache/CacheLoader;)Lcom/google/common/cache/LoadingCache;
at
org.apache.hadoop.io.compress.CodecPool.createCache(CodecPool.java:62)
at
org.apache.hadoop.io.compress.CodecPool.<clinit>(CodecPool.java:74)
at
org.apache.hadoop.io.SequenceFile$Writer.close(SequenceFile.java:1272)
at
org.apache.hadoop.mapred.SequenceFileOutputFormat$1.close(SequenceFileOutputFormat.java:79)

So what this means is that anyone trying to use Tika with Hadoop will need
to play games with the class loader to get the older version of Guava -
though that can cause other issues if Hadoop (or Cascading, etc) rely on
anything that's only in the newer Guava API.

Guava 10.0.1 was released about 3.5 years ago; 11.0.2 was from about 3
years ago. So it seems like we should upgrade to at least 11.0.2

But I don't know if this is enough of an issue to require another RC.

-- Ken

PS - I've created https://issues.apache.org/jira/browse/TIKA-1606 to track
this.
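
A quick way to see which Guava a given classpath actually resolves is to ask
the JVM directly. The sketch below is only an illustration and not something
from this thread: the class name GuavaApiCheck and its describe helper are
hypothetical, and it relies solely on JDK reflection. It reports the return
type because the JVM links methods by their full descriptor, return type
included, which is why the trace above names LoadingCache.

```java
import java.lang.reflect.Method;
import java.security.CodeSource;

public class GuavaApiCheck {

    /**
     * Reports whether className has a public method with the given name and
     * parameter types, what it returns, and where the class was loaded from.
     * (Hypothetical helper, written for this illustration only.)
     */
    public static String describe(String className, String methodName, String... paramTypeNames) {
        try {
            Class<?> cls = Class.forName(className);
            Class<?>[] paramTypes = new Class<?>[paramTypeNames.length];
            for (int i = 0; i < paramTypeNames.length; i++) {
                paramTypes[i] = Class.forName(paramTypeNames[i]);
            }
            Method method = cls.getMethod(methodName, paramTypes);
            // A null code source means the class came from the JDK itself.
            CodeSource source = cls.getProtectionDomain().getCodeSource();
            String location = (source == null) ? "the JDK runtime" : String.valueOf(source.getLocation());
            return "present, returns " + method.getReturnType().getName() + ", loaded from " + location;
        } catch (ClassNotFoundException e) {
            return "class not found: " + e.getMessage();
        } catch (NoSuchMethodException e) {
            return "method missing";
        }
    }

    public static void main(String[] args) {
        // The method named in the NoSuchMethodError above.
        System.out.println(describe(
                "com.google.common.cache.CacheBuilder",
                "build",
                "com.google.common.cache.CacheLoader"));
    }
}
```

Run with the same classpath as the failing job, the printed location shows
which Guava jar won, and the return type shows whether it matches what the
caller was compiled against.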


 From: Tyler Palsulich
 Sent: April 13, 2015 10:56:29am PDT
 To: dev@tika.apache.org, u...@tika.apache.org
 Subject: [VOTE] Apache Tika 1.8 Release Candidate #2

 Hi Folks,

 A candidate for the Tika 1.8 release is available at:
   https://dist.apache.org/repos/dist/dev/tika/

 The release candidate is a zip archive of the sources in:
   http://svn.apache.org/repos/asf/tika/tags/1.8-rc2/

 The SHA1 checksum of the archive is
   5e22fee9079370398472e59082d171ae2d7fdd31.

 In addition, a staged maven repository is available here:
   https://repository.apache.org/content/repositories/orgapachetika-1009

 Please vote on releasing this package as Apache Tika 1.8. The vote is
open for the next 72 hours and passes if a majority of at least three +1
Tika PMC votes are cast.

 [ ] +1 Release this package as Apache Tika 1.8
 [ ] ±0 I don't object to this release, but I haven't checked it
 [ ] -1 Do not release this package because...

 Thanks,
 Tyler


--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr


[jira] [Closed] (TIKA-1266) Tika OSGI Bundle needs Bundle-ClassPath to work in Equinox

2015-04-16 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich closed TIKA-1266.
-
Resolution: Not A Problem

Thanks, [~bobpaulin]!

 Tika OSGI Bundle needs Bundle-ClassPath to work in Equinox
 --

 Key: TIKA-1266
 URL: https://issues.apache.org/jira/browse/TIKA-1266
 Project: Tika
  Issue Type: Improvement
  Components: packaging
Affects Versions: 1.4, 1.5
Reporter: pm

 The tika-bundle currently has the Embed-Dependency header filled with 
 embedded dependencies. 
 Embed-Dependency is not defined in the OSGi spec; Bundle-ClassPath is.
 Please add Bundle-ClassPath with list of embedded JAR names prefixed with 
 ., .



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[VOTE] Apache Tika 1.8 Release Candidate #2

2015-04-13 Thread Tyler Palsulich
Hi Folks,

A candidate for the Tika 1.8 release is available at:
  https://dist.apache.org/repos/dist/dev/tika/

The release candidate is a zip archive of the sources in:
  http://svn.apache.org/repos/asf/tika/tags/1.8-rc2/

The SHA1 checksum of the archive is
  5e22fee9079370398472e59082d171ae2d7fdd31.
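
For anyone who wants to check the archive against the checksum above without
an external tool, here is a minimal sketch using only the JDK. The class name
Sha1Check and the command-line argument order are assumptions made for this
example; they are not part of the Tika release process.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Sha1Check {

    /** Returns the SHA-1 digest of the file as lowercase hex. */
    public static String sha1Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-1");
        try (InputStream in = Files.newInputStream(file)) {
            byte[] buffer = new byte[8192];
            int read;
            // Stream the file so large archives are not held in memory.
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // args[0]: path to the downloaded archive (hypothetical usage)
        // args[1]: expected checksum copied from the vote email
        String actual = sha1Hex(Paths.get(args[0]));
        System.out.println(actual.equalsIgnoreCase(args[1])
                ? "checksum matches"
                : "MISMATCH, computed " + actual);
    }
}
```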

In addition, a staged maven repository is available here:
  https://repository.apache.org/content/repositories/orgapachetika-1009

Please vote on releasing this package as Apache Tika 1.8. The vote is open
for the next 72 hours and passes if a majority of at least three +1 Tika
PMC votes are cast.

[ ] +1 Release this package as Apache Tika 1.8
[ ] ±0 I don't object to this release, but I haven't checked it
[ ] -1 Do not release this package because...

Thanks,
Tyler


[jira] [Commented] (TIKA-1593) Doco: Broken link to Parser Quick Start Guide

2015-04-13 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492638#comment-14492638
 ] 

Tyler Palsulich commented on TIKA-1593:
---

See https://svn.apache.org/repos/asf/tika/site/src/site/apt/download.apt.vm -- 
you need the vm extension. Then, you can use 
{code}${project.parent.version}{code} to get the current version of the 
project. Then, when we update the site for a new release, you just have to 
change the version number in the site's pom.xml file.

I'll fix this right now.

 Doco: Broken link to Parser Quick Start Guide
 ---

 Key: TIKA-1593
 URL: https://issues.apache.org/jira/browse/TIKA-1593
 Project: Tika
  Issue Type: Bug
  Components: documentation
Affects Versions: 1.7
Reporter: Dan Rollo
Priority: Minor

 The Tika web page: https://tika.apache.org/contribute.html, under the 
 Section: New Parsers, Detectors and Mime Types, there is a link with the 
 text: Parser Quick Start Guide. The link URL is: 
 https://tika.apache.org/parser_guide.apt, and does not work. 
 The .apt extension seems odd. I don't know what the link should be.





[jira] [Resolved] (TIKA-1593) Doco: Broken link to Parser Quick Start Guide

2015-04-13 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich resolved TIKA-1593.
---
Resolution: Fixed
  Assignee: Tyler Palsulich

Fixed in r1673240. Thank you [~bhamail]! Please let us know if you find any 
more.

 Doco: Broken link to Parser Quick Start Guide
 ---

 Key: TIKA-1593
 URL: https://issues.apache.org/jira/browse/TIKA-1593
 Project: Tika
  Issue Type: Bug
  Components: documentation
Affects Versions: 1.7
Reporter: Dan Rollo
Assignee: Tyler Palsulich
Priority: Minor

 The Tika web page: https://tika.apache.org/contribute.html, under the 
 Section: New Parsers, Detectors and Mime Types, there is a link with the 
 text: Parser Quick Start Guide. The link URL is: 
 https://tika.apache.org/parser_guide.apt, and does not work. 
 The .apt extension seems odd. I don't know what the link should be.





[jira] [Comment Edited] (TIKA-1593) Doco: Broken link to Parser Quick Start Guide

2015-04-13 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492662#comment-14492662
 ] 

Tyler Palsulich edited comment on TIKA-1593 at 4/13/15 5:02 PM:


Fixed in r1673240 and r1673241. Thank you [~bhamail]! Please let us know if you 
find any more.


was (Author: tpalsulich):
Fixed in r1673240. Thank you [~bhamail]! Please let us know if you find any 
more.

 Doco: Broken link to Parser Quick Start Guide
 ---

 Key: TIKA-1593
 URL: https://issues.apache.org/jira/browse/TIKA-1593
 Project: Tika
  Issue Type: Bug
  Components: documentation
Affects Versions: 1.7
Reporter: Dan Rollo
Assignee: Tyler Palsulich
Priority: Minor

 The Tika web page: https://tika.apache.org/contribute.html, under the 
 Section: New Parsers, Detectors and Mime Types, there is a link with the 
 text: Parser Quick Start Guide. The link URL is: 
 https://tika.apache.org/parser_guide.apt, and does not work. 
 The .apt extension seems odd. I don't know what the link should be.





[jira] [Resolved] (TIKA-1600) Unable to parse ODT files because of failed to close temporary resources

2015-04-13 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich resolved TIKA-1600.
---
Resolution: Fixed
  Assignee: Hong-Thai Nguyen

Thanks, [~thaichat04]! I just updated it -- reformatted the ODF parsing files 
(they were all a bit odd with whitespace) and moved the test into the existing 
test file.

Marking this as fixed and will cut a new release shortly.

 Unable to parse ODT files because of failed to close temporary resources
 

 Key: TIKA-1600
 URL: https://issues.apache.org/jira/browse/TIKA-1600
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.8
 Environment: Windows
Reporter: Hong-Thai Nguyen
Assignee: Hong-Thai Nguyen
 Attachments: Manuel_koha.odt


 Many ODT files fail to parse because of this exception. A sample file is 
 attached.
 {code}
 Apache Tika was unable to parse the document
 at C:\Users\hong-thai.nguyen\Downloads\Manuel_koha.odt.
 The full exception stack trace is included below:
 org.apache.tika.exception.TikaException: Failed to close temporary resources
   at 
 org.apache.tika.io.TemporaryResources.dispose(TemporaryResources.java:152)
   at 
 org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:127)
   at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:342)
   at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:299)
   at org.apache.tika.gui.TikaGUI.actionPerformed(TikaGUI.java:256)
   at javax.swing.AbstractButton.fireActionPerformed(Unknown Source)
   at javax.swing.AbstractButton$Handler.actionPerformed(Unknown Source)
   at javax.swing.DefaultButtonModel.fireActionPerformed(Unknown Source)
   at javax.swing.DefaultButtonModel.setPressed(Unknown Source)
   at javax.swing.AbstractButton.doClick(Unknown Source)
   at javax.swing.plaf.basic.BasicMenuItemUI.doClick(Unknown Source)
   at javax.swing.plaf.basic.BasicMenuItemUI$Handler.mouseReleased(Unknown 
 Source)
   at java.awt.Component.processMouseEvent(Unknown Source)
   at javax.swing.JComponent.processMouseEvent(Unknown Source)
   at java.awt.Component.processEvent(Unknown Source)
   at java.awt.Container.processEvent(Unknown Source)
   at java.awt.Component.dispatchEventImpl(Unknown Source)
   at java.awt.Container.dispatchEventImpl(Unknown Source)
   at java.awt.Component.dispatchEvent(Unknown Source)
   at java.awt.LightweightDispatcher.retargetMouseEvent(Unknown Source)
   at java.awt.LightweightDispatcher.processMouseEvent(Unknown Source)
   at java.awt.LightweightDispatcher.dispatchEvent(Unknown Source)
   at java.awt.Container.dispatchEventImpl(Unknown Source)
   at java.awt.Window.dispatchEventImpl(Unknown Source)
   at java.awt.Component.dispatchEvent(Unknown Source)
   at java.awt.EventQueue.dispatchEventImpl(Unknown Source)
   at java.awt.EventQueue.access$400(Unknown Source)
   at java.awt.EventQueue$3.run(Unknown Source)
   at java.awt.EventQueue$3.run(Unknown Source)
   at java.security.AccessController.doPrivileged(Native Method)
   at java.security.ProtectionDomain$1.doIntersectionPrivilege(Unknown 
 Source)
   at java.security.ProtectionDomain$1.doIntersectionPrivilege(Unknown 
 Source)
   at java.awt.EventQueue$4.run(Unknown Source)
   at java.awt.EventQueue$4.run(Unknown Source)
   at java.security.AccessController.doPrivileged(Native Method)
   at java.security.ProtectionDomain$1.doIntersectionPrivilege(Unknown 
 Source)
   at java.awt.EventQueue.dispatchEvent(Unknown Source)
   at java.awt.EventDispatchThread.pumpOneEventForFilters(Unknown Source)
   at java.awt.EventDispatchThread.pumpEventsForFilter(Unknown Source)
   at java.awt.EventDispatchThread.pumpEventsForHierarchy(Unknown Source)
   at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
   at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
   at java.awt.EventDispatchThread.run(Unknown Source)
 Caused by: java.io.IOException: Could not delete temporary file 
 C:\Users\HONG-T~1.NGU\AppData\Local\Temp\apache-tika-2891340188156641845.tmp
   at 
 org.apache.tika.io.TemporaryResources$1.close(TemporaryResources.java:70)
   at 
 org.apache.tika.io.TemporaryResources.close(TemporaryResources.java:121)
   at 
 org.apache.tika.io.TemporaryResources.dispose(TemporaryResources.java:150)
   ... 42 more
 {code}
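
 The root cause is not stated in this report. One plausible trigger, offered
 here only as an assumption, is that Windows refuses to delete a file while a
 stream on it is still open. The sketch below shows the safe ordering with
 plain JDK calls; the class and method names are hypothetical and this is not
 Tika's TemporaryResources code.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class TempFileCleanup {

    /**
     * Writes data to a temporary file, closes the stream, then deletes the
     * file. Returns true if the file is gone afterwards.
     */
    public static boolean writeThenDelete(byte[] data) throws IOException {
        Path tmp = Files.createTempFile("apache-tika-", ".tmp");
        try (OutputStream out = Files.newOutputStream(tmp)) {
            out.write(data);
        } // the stream is closed here, before any delete is attempted
        Files.delete(tmp); // throws IOException if the delete fails
        return !Files.exists(tmp);
    }

    public static void main(String[] args) throws IOException {
        boolean removed = writeThenDelete("sample".getBytes(StandardCharsets.UTF_8));
        System.out.println("temporary file removed: " + removed);
    }
}
```

 Moving the delete inside the try block, while the stream is still open, is
 the ordering that would be expected to fail on Windows under this assumption.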





[jira] [Updated] (TIKA-1600) Unable to parse ODT files because of failed to close temporary resources

2015-04-13 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich updated TIKA-1600:
--
Priority: Blocker  (was: Major)

 Unable to parse ODT files because of failed to close temporary resources
 

 Key: TIKA-1600
 URL: https://issues.apache.org/jira/browse/TIKA-1600
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.8
 Environment: Windows
Reporter: Hong-Thai Nguyen
Assignee: Hong-Thai Nguyen
Priority: Blocker
 Attachments: Manuel_koha.odt


 Many ODT files fail to parse because of this exception. A sample file is 
 attached.
 {code}
 Apache Tika was unable to parse the document
 at C:\Users\hong-thai.nguyen\Downloads\Manuel_koha.odt.
 The full exception stack trace is included below:
 org.apache.tika.exception.TikaException: Failed to close temporary resources
   at 
 org.apache.tika.io.TemporaryResources.dispose(TemporaryResources.java:152)
   at 
 org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:127)
   at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:342)
   at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:299)
   at org.apache.tika.gui.TikaGUI.actionPerformed(TikaGUI.java:256)
   at javax.swing.AbstractButton.fireActionPerformed(Unknown Source)
   at javax.swing.AbstractButton$Handler.actionPerformed(Unknown Source)
   at javax.swing.DefaultButtonModel.fireActionPerformed(Unknown Source)
   at javax.swing.DefaultButtonModel.setPressed(Unknown Source)
   at javax.swing.AbstractButton.doClick(Unknown Source)
   at javax.swing.plaf.basic.BasicMenuItemUI.doClick(Unknown Source)
   at javax.swing.plaf.basic.BasicMenuItemUI$Handler.mouseReleased(Unknown 
 Source)
   at java.awt.Component.processMouseEvent(Unknown Source)
   at javax.swing.JComponent.processMouseEvent(Unknown Source)
   at java.awt.Component.processEvent(Unknown Source)
   at java.awt.Container.processEvent(Unknown Source)
   at java.awt.Component.dispatchEventImpl(Unknown Source)
   at java.awt.Container.dispatchEventImpl(Unknown Source)
   at java.awt.Component.dispatchEvent(Unknown Source)
   at java.awt.LightweightDispatcher.retargetMouseEvent(Unknown Source)
   at java.awt.LightweightDispatcher.processMouseEvent(Unknown Source)
   at java.awt.LightweightDispatcher.dispatchEvent(Unknown Source)
   at java.awt.Container.dispatchEventImpl(Unknown Source)
   at java.awt.Window.dispatchEventImpl(Unknown Source)
   at java.awt.Component.dispatchEvent(Unknown Source)
   at java.awt.EventQueue.dispatchEventImpl(Unknown Source)
   at java.awt.EventQueue.access$400(Unknown Source)
   at java.awt.EventQueue$3.run(Unknown Source)
   at java.awt.EventQueue$3.run(Unknown Source)
   at java.security.AccessController.doPrivileged(Native Method)
   at java.security.ProtectionDomain$1.doIntersectionPrivilege(Unknown 
 Source)
   at java.security.ProtectionDomain$1.doIntersectionPrivilege(Unknown 
 Source)
   at java.awt.EventQueue$4.run(Unknown Source)
   at java.awt.EventQueue$4.run(Unknown Source)
   at java.security.AccessController.doPrivileged(Native Method)
   at java.security.ProtectionDomain$1.doIntersectionPrivilege(Unknown 
 Source)
   at java.awt.EventQueue.dispatchEvent(Unknown Source)
   at java.awt.EventDispatchThread.pumpOneEventForFilters(Unknown Source)
   at java.awt.EventDispatchThread.pumpEventsForFilter(Unknown Source)
   at java.awt.EventDispatchThread.pumpEventsForHierarchy(Unknown Source)
   at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
   at java.awt.EventDispatchThread.pumpEvents(Unknown Source)
   at java.awt.EventDispatchThread.run(Unknown Source)
 Caused by: java.io.IOException: Could not delete temporary file 
 C:\Users\HONG-T~1.NGU\AppData\Local\Temp\apache-tika-2891340188156641845.tmp
   at 
 org.apache.tika.io.TemporaryResources$1.close(TemporaryResources.java:70)
   at 
 org.apache.tika.io.TemporaryResources.close(TemporaryResources.java:121)
   at 
 org.apache.tika.io.TemporaryResources.dispose(TemporaryResources.java:150)
   ... 42 more
 {code}





Re: [VOTE] Release Apache Tika 1.8 Candidate #1

2015-04-13 Thread Tyler Palsulich
Hi Folks,

Marking this VOTE as failed. Now that the above issues have been addressed,
I'll cut a new release.

Please let me know if you find any other blockers.

Thanks,
Tyler

On Mon, Apr 13, 2015 at 12:45 AM, Hong-Thai Nguyen 
hngu...@customermatrix.com wrote:

 Not yet, I'm investigating more on TIKA-1600 today.

 Hong-Thai

 -Message d'origine-
 De : Allison, Timothy B. [mailto:talli...@mitre.org]
 Envoyé : lundi 13 avril 2015 01:07
 À : dev@tika.apache.org
 Objet : RE: [VOTE] Release Apache Tika 1.8 Candidate #1

 I don't think we've solved TIKA-1600, yet, or have we?

 -Original Message-
 From: Tyler Palsulich [mailto:tpalsul...@gmail.com]
 Sent: Sunday, April 12, 2015 12:12 AM
 To: dev@tika.apache.org
 Subject: Re: [VOTE] Release Apache Tika 1.8 Candidate #1

 Are we ready for another RC? I'd like to make sure the above issues are
 (believed to be) settled before the next cut.

 Thanks,
 Tyler
 On Apr 10, 2015 4:55 PM, David Meikle loo...@gmail.com wrote:

 
   On 10 Apr 2015, at 11:38, Allison, Timothy B. talli...@mitre.org
  wrote:
  
I agree that the ODT issue might require a respin.  What do others
  think?
 
  +1 for re-spin.
 
  
   Unfortunately, there might be 2 odt docs (mime type:
  “application/vnd.oasis.opendocument.text”?) in govdocs1…so we wouldn't
  see that problem.
  
  
  
   I did do a comparison of 1.7 vs 1.8-rc1, and the results are here:
  
  
  https://github.com/tballison/share/blob/master/tika_comparisons/tika_1_7_v_1_8-rc1.zip
  
  
   I encourage folks (if you haven't, and if you care :) ) to take a
   look
  and see if you see something that I don’t.
 
  Thanks for this Tim.  About to get on a flight, so will check through
  on that.
 
  Cheers,
  Dave
 
 



Re: [VOTE] Release Apache Tika 1.8 Candidate #1

2015-04-11 Thread Tyler Palsulich
Are we ready for another RC? I'd like to make sure the above issues are
(believed to be) settled before the next cut.

Thanks,
Tyler
On Apr 10, 2015 4:55 PM, David Meikle loo...@gmail.com wrote:


  On 10 Apr 2015, at 11:38, Allison, Timothy B. talli...@mitre.org
 wrote:
 
   I agree that the ODT issue might require a respin.  What do others
 think?

 +1 for re-spin.

 
  Unfortunately, there might be 2 odt docs (mime type:
 “application/vnd.oasis.opendocument.text”?) in govdocs1…so we wouldn't see
 that problem.
 
 
 
  I did do a comparison of 1.7 vs 1.8-rc1, and the results are here:
 
 
 https://github.com/tballison/share/blob/master/tika_comparisons/tika_1_7_v_1_8-rc1.zip
 
 
  I encourage folks (if you haven't, and if you care :) ) to take a look
 and see if you see something that I don’t.

 Thanks for this Tim.  About to get on a flight, so will check through on
 that.

 Cheers,
 Dave




Re: [VOTE] Release Apache Tika 1.8 Candidate #1

2015-04-07 Thread Tyler Palsulich
CC'ing user@tika for visibility.

Tyler

On Tue, Apr 7, 2015 at 4:54 PM, Tyler Palsulich tpalsul...@apache.org
wrote:

 Hi Folks,

 A candidate for the Tika 1.8 release is available at:
   https://dist.apache.org/repos/dist/dev/tika/

 The release candidate is a zip archive of the sources in:
   http://svn.apache.org/repos/asf/tika/tags/1.8-rc1/

 The SHA1 checksum of the archive is
   ddeb3b43ca1c1ef346658a7005434019507e096f.

 In addition, a staged maven repository is available here:
   https://repository.apache.org/content/repositories/orgapachetika-1008

 Please vote on releasing this package as Apache Tika 1.8.
 The vote is open for the next 72 hours and passes if a majority of at
 least three +1 Tika PMC votes are cast.

 [ ] +1 Release this package as Apache Tika 1.8
 [ ] -1 Do not release this package because...

 Have a good night!
 Tyler



[VOTE] Release Apache Tika 1.8 Candidate #1

2015-04-07 Thread Tyler Palsulich
Hi Folks,

A candidate for the Tika 1.8 release is available at:
  https://dist.apache.org/repos/dist/dev/tika/

The release candidate is a zip archive of the sources in:
  http://svn.apache.org/repos/asf/tika/tags/1.8-rc1/

The SHA1 checksum of the archive is
  ddeb3b43ca1c1ef346658a7005434019507e096f.

In addition, a staged maven repository is available here:
  https://repository.apache.org/content/repositories/orgapachetika-1008

Please vote on releasing this package as Apache Tika 1.8.
The vote is open for the next 72 hours and passes if a majority of at
least three +1 Tika PMC votes are cast.

[ ] +1 Release this package as Apache Tika 1.8
[ ] -1 Do not release this package because...
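
One reading of the rule above, sketched for illustration only: the vote passes
when there are at least three +1 PMC votes and more +1 than -1 votes. The
class ReleaseVote and its method are hypothetical, and this interpretation of
"majority" is an assumption rather than something spelled out in this email.

```java
public class ReleaseVote {

    /**
     * Returns true if the vote passes under the assumed rule: at least three
     * binding +1 votes, and strictly more +1 than -1 votes.
     */
    public static boolean passes(int plusOnes, int minusOnes) {
        return plusOnes >= 3 && plusOnes > minusOnes;
    }

    public static void main(String[] args) {
        System.out.println(passes(3, 0)); // three +1, no -1
        System.out.println(passes(2, 0)); // too few +1 votes
    }
}
```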

Have a good night!
Tyler


[jira] [Closed] (TIKA-1592) It seems dbus and x11 server are invoked, and fails for some reason too

2015-04-03 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich closed TIKA-1592.
-
Resolution: Invalid

Closing as Invalid. Feel free to create additional issues if you run into other 
problems with Tika!

Thank you for updating with the solution! I'm glad you found it. :) (I'm also 
glad this wasn't a Tika issue... Ha.)

 It seems dbus and x11 server are invoked, and fails for some reason too
 ---

 Key: TIKA-1592
 URL: https://issues.apache.org/jira/browse/TIKA-1592
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.7
 Environment: CentOs 6.6, Java 1.7
Reporter: Michael Couck

 Exception running unit tests:
 GConf Error: Failed to contact configuration server; some possible causes are 
 that you need to enable TCP/IP networking for ORBit, or you have stale NFS 
 locks due to a system crash. See http://projects.gnome.org/gconf/ for 
 information. (Details -  1: Not running within active session)
 Is Tika trying to start an x11 server using dbus? Why? This breaks the unit 
 tests, the logging is a gig for each run, and even a 64 core server is 100% 
 cpu during the failure. I am completely confounded. Any ideas?





[jira] [Commented] (TIKA-1592) It seems dbus and x11 server are invoked, and fails for some reason too

2015-04-02 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393246#comment-14393246
 ] 

Tyler Palsulich commented on TIKA-1592:
---

I tried building ikube on a Mac, but I ran into multiple test failures.
{code}
Tests in error:
  analyze(ikube.analytics.weka.WekaClassifierIntegration)
  initializationError(ikube.action.rule.IsRemoteIndexCurrentIntegration)
  initializationError(ikube.analytics.weka.WekaForecastClassifierIntegration)
  initializationError(ikube.database.DataBaseIntegration)
  
initializationError(ikube.action.index.handler.database.TableResourceProviderIntegration)
  initializationError(ikube.web.service.AnalyzerIntegration)
  initializationError(ikube.analytics.AnalyticsServiceIntegration)
  initializationError(ikube.scheduling.SnapshotScheduleIntegration)
  initializationError(ikube.web.service.SearcherJsonIntegration)
  initializationError(ikube.scheduling.PruneScheduleIntegration)
  
initializationError(ikube.action.index.handler.email.IndexableEmailHandlerIntegration)
  
initializationError(ikube.action.index.handler.strategy.GeospatialEnrichmentStrategyIntegration)
  
initializationError(ikube.action.index.handler.filesystem.IndexableFilesystemHandlerIntegration)
  initializationError(ikube.web.service.SearcherXmlIntegration)
  initializationError(ikube.action.ResetIntegration)
  initializationError(ikube.action.index.handler.internet.SvnHandlerIntegration)
  initializationError(ikube.toolkit.DatabaseUtilitiesIntegration)
  initializationError(ikube.action.rule.RulesIntegration)
  initializationError(ikube.analytics.neuroph.NeurophAnalyzerIntegration)
  initializationError(ikube.database.EntityIntegration)
  initializationError(ikube.cluster.hzc.ClusterManagerCacheSearchIntegration)
  
initializationError(ikube.action.index.handler.database.IndexableTableHandlerIntegration)
{code}

Is Linux required?

Can you give some context of how you're using Tika in the failing unit test? 
Tika should not have any (or, really, there is very little) OS specific code. 
So, it doesn't make sense why something would try to start x11. But, a 
dependency could definitely be up to something fishy.

 It seems dbus and x11 server are invoked, and fails for some reason too
 ---

 Key: TIKA-1592
 URL: https://issues.apache.org/jira/browse/TIKA-1592
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.7
 Environment: CentOs 6.6, Java 1.7
Reporter: Michael Couck

 Exception running unit tests:
 GConf Error: Failed to contact configuration server; some possible causes are 
 that you need to enable TCP/IP networking for ORBit, or you have stale NFS 
 locks due to a system crash. See http://projects.gnome.org/gconf/ for 
 information. (Details -  1: Not running within active session)
 Is Tika trying to start an x11 server using dbus? Why? This breaks the unit 
 tests, the logging is a gig for each run, and even a 64 core server is 100% 
 cpu during the failure. I am completely confounded. Any ideas?





[jira] [Commented] (TIKA-1592) It seems dbus and x11 server are invoked, and fails for some reason too

2015-04-02 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393184#comment-14393184
 ] 

Tyler Palsulich commented on TIKA-1592:
---

Thanks for reporting this, [~michaelcouck]! Just to be clear, you're building 
Tika 1.7 from source?

Which test case causes this? After a quick {{grep}}, I don't see any gconf or 
dbus references (don't know why there would be any, off the top of my head...). 
When you say the logging is a gig, is that what is sent to stdout when doing 
{{mvn install}}? Or something else?

 It seems dbus and x11 server are invoked, and fails for some reason too
 ---

 Key: TIKA-1592
 URL: https://issues.apache.org/jira/browse/TIKA-1592
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.7
 Environment: CentOs 6.6, Java 1.7
Reporter: Michael Couck

 Exception running unit tests:
 GConf Error: Failed to contact configuration server; some possible causes are 
 that you need to enable TCP/IP networking for ORBit, or you have stale NFS 
 locks due to a system crash. See http://projects.gnome.org/gconf/ for 
 information. (Details -  1: Not running within active session)
 Is Tika trying to start an x11 server using dbus? Why? This breaks the unit 
 tests, the logging is a gig for each run, and even a 64 core server is 100% 
 cpu during the failure. I am completely confounded. Any ideas?





[jira] [Comment Edited] (TIKA-1592) It seems dbus and x11 server are invoked, and fails for some reason too

2015-04-02 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14393184#comment-14393184
 ] 

Tyler Palsulich edited comment on TIKA-1592 at 4/2/15 7:09 PM:
---

Thanks for reporting this, [~michaelcouck]! Just to be clear, you're building 
Tika 1.7 from source?

Which test case causes this? -After a quick {{grep}}, I don't see any gconf or 
dbus references (don't know why there would be any, off the top of my head...)- 
See {{grep}} output below. When you say the logging is a gig, is that what is 
sent to stdout when doing {{mvn install}}? Or something else?

{code}
➜  trunk  grep -Ri dbus .
Binary file ./tika-parsers/src/test/resources/test-documents/testTIFF.tif 
matches
Binary file ./tika-parsers/target/test-classes/test-documents/testTIFF.tif 
matches
Binary file ./tika-parsers/target/tika-parsers-1.8-SNAPSHOT-tests.jar matches
Binary file ./tika-server/target/tika-server-1.8-SNAPSHOT.jar matches
➜  trunk  grep -Ri gconf .
Binary file ./tika-app/target/tika-app-1.8-SNAPSHOT.jar matches
Binary file ./tika-server/target/tika-server-1.8-SNAPSHOT.jar matches
{code}


was (Author: tpalsulich):
Thanks for reporting this, [~michaelcouck]! Just to be clear, you're building 
Tika 1.7 from source?

Which test case causes this? After a quick {{grep}}, I don't see any gconf or 
dbus references (don't know why there would be any, off the top of my head...). 
When you say the logging is a gig, is that what is sent to stdout when doing 
{{mvn install}}? Or something else?

 It seems dbus and x11 server are invoked, and fails for some reason too
 ---

 Key: TIKA-1592
 URL: https://issues.apache.org/jira/browse/TIKA-1592
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.7
 Environment: CentOs 6.6, Java 1.7
Reporter: Michael Couck

 Exception running unit tests:
 GConf Error: Failed to contact configuration server; some possible causes are 
 that you need to enable TCP/IP networking for ORBit, or you have stale NFS 
 locks due to a system crash. See http://projects.gnome.org/gconf/ for 
 information. (Details -  1: Not running within active session)
 Is Tika trying to start an x11 server using dbus? Why? This breaks the unit 
 tests, the logging is a gig for each run, and even a 64 core server is 100% 
 cpu during the failure. I am completely confounded. Any ideas?





Re: Access Control Allow Origin

2015-04-01 Thread Tyler Palsulich
Thank you for the feedback!

I think there's an issue (don't remember the number) to be able to specify
a TikaConfig file for tika-server. So, I think that would be the ideal
place to put more complex CORS configuration.

Tyler

On Wed, Apr 1, 2015 at 6:02 AM, Sergey Beryozkin sberyoz...@gmail.com
wrote:

 Hi Tyler

 Sorry for the delay, I was off for the last few days.
 The change you made looks fine; the filter can check the annotations or can
 be configured directly (which is what you did).
 It might make sense to consider checking a (Java) properties resource as a
 possible future enhancement, as a CORS filter may have many properties.
 Maybe if '-cors' is provided, check a well-known class resource where all
 of the CORS properties are set; if it is absent, default to '*', otherwise
 work with Properties...
 The current approach works too; it might be tricky to extend it to support
 more properties, but it is great for a start.
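
 A minimal sketch of the properties idea, assuming a resource named
 tika-server-cors.properties and a key named cors.allow.origins. Both names
 are invented for this example, as is the CorsConfig class; none of them
 exist in tika-server. A missing resource falls back to '*'.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Properties;

public class CorsConfig {

    /**
     * Reads allowed origins from a properties stream. A null stream (resource
     * absent) or an empty value falls back to the wildcard "*".
     */
    public static List<String> loadAllowedOrigins(InputStream resource) throws IOException {
        if (resource == null) {
            return Collections.singletonList("*");
        }
        Properties props = new Properties();
        props.load(resource);
        // Hypothetical key; comma-separated list of origins.
        String raw = props.getProperty("cors.allow.origins", "*");
        List<String> origins = new ArrayList<>();
        for (String part : raw.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                origins.add(trimmed);
            }
        }
        return origins.isEmpty() ? Collections.singletonList("*") : origins;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical well-known classpath resource; null if it is absent.
        try (InputStream in = CorsConfig.class.getResourceAsStream("/tika-server-cors.properties")) {
            System.out.println(loadAllowedOrigins(in));
        }
    }
}
```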

 Thanks, Sergey





 On 27/03/15 18:56, Tyler Palsulich wrote:

 Thank you, Sergey! I didn't know about that feature. I am going to try to
 work up a patch this weekend which enables CORS. I'll let you know if I run
 into any issues.

 Thanks again,
 Tyler

 On Thu, Mar 26, 2015 at 2:39 AM, Mattmann, Chris A (3980) 
 chris.a.mattm...@jpl.nasa.gov wrote:



 ++
 Chris Mattmann, Ph.D.
 Chief Architect
 Instrument Software and Science Data Systems Section (398)
 NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
 Office: 168-519, Mailstop: 168-527
 Email: chris.a.mattm...@nasa.gov
 WWW:  http://sunset.usc.edu/~mattmann/
 ++
 Adjunct Associate Professor, Computer Science Department
 University of Southern California, Los Angeles, CA 90089 USA
 ++






 -Original Message-
 From: Tyler Palsulich tpalsul...@gmail.com
 Reply-To: dev@tika.apache.org dev@tika.apache.org
 Date: Tuesday, March 24, 2015 at 3:41 PM
 To: dev@tika.apache.org dev@tika.apache.org
 Subject: Access Control Allow Origin

  Hi Folks,

 I took a stab at creating an example website to submit a file to the form
 resource of our VM. See http://tpalsulich.github.io/TikaExamples/.

 If I try to use AJAX to submit the request to make the page prettier (see
 the script in the head of the page, with ev.preventDefault() commented
 out), I get the following error:

 XMLHttpRequest cannot load http://162.242.228.174:9998/tika/form. No
 'Access-Control-Allow-Origin' header is present on the requested resource.
 Origin 'http://tpalsulich.github.io' is therefore not allowed access. The
 response had HTTP status code 400.

 We can't allow the tika-server response header to accept * in general,
 since that isn't secure. So, would there be interest in including this sort
 of site on the VM? Then, the AJAX request won't be external and we won't
 have this error.

 The version button just takes you to the version resource on the VM
 (doesn't do anything with the file).

 Tyler








Re: Access Control Allow Origin

2015-04-01 Thread Tyler Palsulich
I'll change the option to -C right now. Just looked closer -- TIKA-1426 is
to provide a config for the server and app on the command line.

Tyler

On Wed, Apr 1, 2015 at 11:22 AM, Allison, Timothy B. talli...@mitre.org
wrote:

 Might be thinking of TIKA-944?

 Mind if we switch the CORS short option to -C and use -c for the tika
 config file?

 -Original Message-
 From: Tyler Palsulich [mailto:tpalsul...@gmail.com]
 Sent: Wednesday, April 01, 2015 11:13 AM
 To: dev@tika.apache.org
 Subject: Re: Access Control Allow Origin

 Thank you for the feedback!

 I think there's an issue (don't remember the number) to be able to specify
 a TikaConfig file for tika-server. So, I think that would be the ideal
 place to put more complex CORS configuration.

 Tyler

 On Wed, Apr 1, 2015 at 6:02 AM, Sergey Beryozkin sberyoz...@gmail.com
 wrote:

  Hi Tyler
 
  Sorry for the delay; I was off for the last few days.
  The change you made looks fine: the filter can check the annotations or can
  be configured directly (which is what you did).
  It might make sense to consider checking a (Java) properties resource as a
  possible future enhancement, as a CORS filter may have many properties.
  Maybe if '-cors' is provided, check a well-known class resource
  where all of the CORS properties are set; if it is absent, default to '*',
  otherwise work with Properties...
  The current approach works too. It might be tricky to extend it to support
  more properties, but it is great for a start.
 
  Thanks, Sergey
 
 
 
 
 
  On 27/03/15 18:56, Tyler Palsulich wrote:
 
  Thank you, Sergey! I didn't know about that feature. I am going to try to
  work up a patch this weekend which enables CORS. I'll let you know if I run
  into any issues.
 
  Thanks again,
  Tyler
 
  On Thu, Mar 26, 2015 at 2:39 AM, Mattmann, Chris A (3980) 
  chris.a.mattm...@jpl.nasa.gov wrote:
 
 
 
  ++
  Chris Mattmann, Ph.D.
  Chief Architect
  Instrument Software and Science Data Systems Section (398)
  NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
  Office: 168-519, Mailstop: 168-527
  Email: chris.a.mattm...@nasa.gov
  WWW:  http://sunset.usc.edu/~mattmann/
  ++
  Adjunct Associate Professor, Computer Science Department
  University of Southern California, Los Angeles, CA 90089 USA
  ++
 
 
 
 
 
 
  -Original Message-
  From: Tyler Palsulich tpalsul...@gmail.com
  Reply-To: dev@tika.apache.org dev@tika.apache.org
  Date: Tuesday, March 24, 2015 at 3:41 PM
  To: dev@tika.apache.org dev@tika.apache.org
  Subject: Access Control Allow Origin
 
   Hi Folks,
 
  I took a stab at creating an example website to submit a file to the form
  resource of our VM. See http://tpalsulich.github.io/TikaExamples/.
 
  If I try to use AJAX to submit the request to make the page prettier (see
  the script in the head of the page, with ev.preventDefault() commented
  out), I get the following error:
 
  XMLHttpRequest cannot load http://162.242.228.174:9998/tika/form. No
  'Access-Control-Allow-Origin' header is present on the requested resource.
  Origin 'http://tpalsulich.github.io' is therefore not allowed access. The
  response had HTTP status code 400.
 
  We can't allow the tika-server response header to accept * in general,
  since that isn't secure. So, would there be interest in including this sort
  of site on the VM? Then, the AJAX request won't be external and we won't
  have this error.
 
  The version button just takes you to the version resource on the VM
  (doesn't do anything with the file).
 
  Tyler
 
 
 
 
 
 



[jira] [Comment Edited] (TIKA-1585) Create Example Website with Form Submission

2015-04-01 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14390841#comment-14390841
 ] 

Tyler Palsulich edited comment on TIKA-1585 at 4/1/15 3:51 PM:
---

Done. It works. -I'll see if I can shut 9997 down right now.- Port 9997 is now 
closed.


was (Author: tpalsulich):
Done. It works. I'll see if I can shut 9997 down right now.

 Create Example Website with Form Submission
 ---

 Key: TIKA-1585
 URL: https://issues.apache.org/jira/browse/TIKA-1585
 Project: Tika
  Issue Type: New Feature
  Components: example, server
Reporter: Tyler Palsulich
Assignee: Tyler Palsulich

 It would be great to have a website where we can direct people who ask what 
 Tika can do for [filetype] without needing them to actually download Tika.
 Some initial work to do that is 
 [here|http://tpalsulich.github.io/TikaExamples/].
 I'm far from a design guru, but I imagine the site as having a form where you 
 can upload a file at the top, checkboxes for if you want metadata, content, 
 or both, and a submit button. The request should be sent with AJAX and the 
 result should populate a {{div}}.
 One issue with AJAX requests is that Tika Server doesn't currently allow 
 Cross-Origin-Resource-Sharing (CORS). So, we either need to maintain a 
 slightly updated tika-server, or update the server to allow configuration.





Re: [jira] [Commented] (TIKA-1330) Add robust tika-batch code

2015-04-01 Thread Tyler Palsulich
All tests are passing. Only issue I see is excessive logging. The Hudson
failure does just look like a hiccup.

Tyler

On Wed, Apr 1, 2015 at 2:55 PM, Allison, Timothy B. talli...@mitre.org
wrote:

 This looks like a Hudson hiccup.

 Tyler is seeing excessive logging:
 Running org.apache.tika.cli.TikaCLIBatchIntegrationTest
 INFO - about to start driver
 INFO - about to start driver

 Anyone else having problems building from a fresh trunk?


 -Original Message-
 From: Hudson (JIRA) [mailto:j...@apache.org]
 Sent: Wednesday, April 01, 2015 5:36 PM
 To: dev@tika.apache.org
 Subject: [jira] [Commented] (TIKA-1330) Add robust tika-batch code


 [
 https://issues.apache.org/jira/browse/TIKA-1330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14391512#comment-14391512
 ]

 Hudson commented on TIKA-1330:
 --

 ABORTED: Integrated in tika-trunk-jdk1.7 #596 (See [
 https://builds.apache.org/job/tika-trunk-jdk1.7/596/])
 TIKA-1330 flush stacktrace writers (tallison:
 http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670756)
 *
 /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/FileResourceConsumer.java
 *
 /tika/trunk/tika-batch/src/main/java/org/apache/tika/util/TikaExceptionFilter.java
 TIKA-1330 clean up logging in tika-batch and tika-app integration of
 tika-batch, take 2 (tallison:
 http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1670751)
 *
 /tika/trunk/tika-batch/src/main/java/org/apache/tika/batch/FileResourceConsumer.java


  Add robust tika-batch code
  --
 
  Key: TIKA-1330
  URL: https://issues.apache.org/jira/browse/TIKA-1330
  Project: Tika
   Issue Type: Sub-task
   Components: cli, general, server
 Reporter: Tim Allison
 Assignee: Tim Allison
  Attachments: TIKA-1330v1-patch.zip
 
 
  In my current design plan, I see creating a separate component
 tika-batch that includes a small bit of configurable code to run Tika
 against a large batch of documents.  This code should be robust against OOM
 and hangs, and it should have fairly robust logging.






[jira] [Updated] (TIKA-1558) Create a Parser Blacklist

2015-03-31 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich updated TIKA-1558:
--
Description: 
As talked about in TIKA-1555 and TIKA-1557, it would be nice to be able to 
disable Parsers without pulling their dependencies out. In some cases (e.g. 
disable all ExternalParsers), there may not be an easy way to exclude the 
dependencies via Maven.

-So, an initial design would be to include another file like 
{{META-INF/services/org.apache.tika.parser.Parser.blacklist}}. We create a new 
method {{ServiceLoader#loadServiceProviderBlacklist}}. Then, in 
{{ServiceLoader#loadServiceProviders}}, we remove all elements of the list that 
are assignable to an element in {{ServiceLoader#loadServiceProviderBlacklist}}.-

  was:
As talked about in TIKA-1555 and TIKA-1557, it would be nice to be able to 
disable Parsers without pulling their dependencies out. In some cases (e.g. 
disable all ExternalParsers), there may not be an easy way to exclude the 
dependencies via Maven.

So, an initial design would be to include another file like 
{{META-INF/services/org.apache.tika.parser.Parser.blacklist}}. We create a new 
method {{ServiceLoader#loadServiceProviderBlacklist}}. Then, in 
{{ServiceLoader#loadServiceProviders}}, we remove all elements of the list that 
are assignable to an element in {{ServiceLoader#loadServiceProviderBlacklist}}.


 Create a Parser Blacklist
 -

 Key: TIKA-1558
 URL: https://issues.apache.org/jira/browse/TIKA-1558
 Project: Tika
  Issue Type: New Feature
Reporter: Tyler Palsulich
Assignee: Tyler Palsulich
 Fix For: 1.8


 As talked about in TIKA-1555 and TIKA-1557, it would be nice to be able to 
 disable Parsers without pulling their dependencies out. In some cases (e.g. 
 disable all ExternalParsers), there may not be an easy way to exclude the 
 dependencies via Maven.
 -So, an initial design would be to include another file like 
 {{META-INF/services/org.apache.tika.parser.Parser.blacklist}}. We create a 
 new method {{ServiceLoader#loadServiceProviderBlacklist}}. Then, in 
 {{ServiceLoader#loadServiceProviders}}, we remove all elements of the list 
 that are assignable to an element in 
 {{ServiceLoader#loadServiceProviderBlacklist}}.-
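
For readers following the struck-through design above, the "remove everything assignable to a blacklisted class" step could look something like the sketch below. It is only an illustration of the filtering idea under stated assumptions: BlacklistFilter and removeBlacklisted are hypothetical names, not Tika's ServiceLoader code, and the example uses plain JDK classes in place of Parsers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; Tika's real ServiceLoader is not shown here.
final class BlacklistFilter {

    /** Keeps only the providers that are not instances of any blacklisted class. */
    static <T> List<T> removeBlacklisted(List<T> providers, List<Class<?>> blacklist) {
        List<T> kept = new ArrayList<>();
        for (T provider : providers) {
            boolean blocked = false;
            for (Class<?> banned : blacklist) {
                // True for the banned class itself and for all of its subclasses.
                if (banned.isAssignableFrom(provider.getClass())) {
                    blocked = true;
                    break;
                }
            }
            if (!blocked) {
                kept.add(provider);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Stand-ins for parsers: blacklisting Number removes Integer and Double too.
        List<Object> providers = Arrays.<Object>asList(1, 2.0, "text");
        List<Class<?>> blacklist = Arrays.<Class<?>>asList(Number.class);
        System.out.println(removeBlacklisted(providers, blacklist));
    }
}
```

Because the check uses isAssignableFrom, blacklisting a base class also excludes every subclass, which matches the behavior described for this issue.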





[jira] [Comment Edited] (TIKA-1558) Create a Parser Blacklist

2015-03-31 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1432#comment-1432
 ] 

Tyler Palsulich edited comment on TIKA-1558 at 3/31/15 9:41 PM:


-Above strategy added in r1661284. You can now blacklist Parsers by adding 
names to {{META-INF/services/org.apache.tika.parser.Parser.blacklist}} with the 
same format as the normal services file. If a class is blacklisted, all of its 
subclasses are automatically blacklisted.-

Edit: Service loading blacklisting disabled in r1670487. Use a custom 
TikaConfig like [this 
one|https://svn.apache.org/repos/asf/tika/trunk/tika-parsers/src/test/resources/org/apache/tika/config/TIKA-1558-blacklistsub.xml]
 to disable a Parser. Any subclasses of that Parser will also be excluded.


was (Author: tpalsulich):
Above strategy added in r1661284. You can now blacklist Parsers by adding names 
to {{META-INF/services/org.apache.tika.parser.Parser.blacklist}} with the same 
format as the normal services file. If a class is blacklisted, all of its 
subclasses are automatically blacklisted.

 Create a Parser Blacklist
 -

 Key: TIKA-1558
 URL: https://issues.apache.org/jira/browse/TIKA-1558
 Project: Tika
  Issue Type: New Feature
Reporter: Tyler Palsulich
Assignee: Tyler Palsulich
 Fix For: 1.8


 As talked about in TIKA-1555 and TIKA-1557, it would be nice to be able to 
 disable Parsers without pulling their dependencies out. In some cases (e.g. 
 disable all ExternalParsers), there may not be an easy way to exclude the 
 dependencies via Maven.
 -So, an initial design would be to include another file like 
 {{META-INF/services/org.apache.tika.parser.Parser.blacklist}}. We create a 
 new method {{ServiceLoader#loadServiceProviderBlacklist}}. Then, in 
 {{ServiceLoader#loadServiceProviders}}, we remove all elements of the list 
 that are assignable to an element in 
 {{ServiceLoader#loadServiceProviderBlacklist}}.-





Re: including refactored docs from govdocs1 in test suite

2015-03-30 Thread Tyler Palsulich
Can you copy the hyperlink into a new doc and change the URL? I have no
idea about including the modified version.

Tyler
On Mar 30, 2015 9:18 AM, Allison, Timothy B. talli...@mitre.org wrote:

 All,

   As part of TIKA-1512, I found that I can delete all of the contents,
 including the metadata, except for one hyperlink in two documents from
 govdocs1 and still get the proper behavior -- fail before fix, work after
 fix.

   These documents are in the public domain.

   Is it ok to include these modified documents in our test suite or should
 I avoid inclusion?

   Happy to avoid inclusion for the sake of a quick release of 1.8 and then
 we have time to discuss/determine way ahead... unless the answer is obvious.

  Best,

  Tim

 -Original Message-
 From: Allison, Timothy B. [mailto:talli...@mitre.org]
 Sent: Monday, March 30, 2015 7:03 AM
 To: dev@tika.apache.org
 Subject: RE: [DISCUSS] Tika 1.8 or 1.7.1

 Unless there are objections, I'd like these to be resolved before 1.8:

 TIKA-1584 -- I'll fix
 TIKA-1575 -- Resolved by Konstantin Gribov (thank you!)
 TIKA-1512 -- I'll put in a temporary fix so that we don't get IOOBEs, but
 I'll leave this open and do some more digging to see if we need to open a
 ticket at the POI level
 TIKA-1511 -- I'll remove provided for xerial

 TIKA-1549 -- We should thank Toke Eskildsen in CHANGES.txt, no?

 I'll have these fixes completed by noon EDT.  Should I run against
 govdocs1 before or after the RC?

 My last build of Tika app (a few days ago) ballooned to ~43MB, and that's
 before I add ~3MB for xerial.  Tika server is now ~48MB.  As of my last
 build, we are still including ~4MB of pdfs (README.NLDAS1.pdf and
 README.NLDAS2.pdf) from the GRIB(?) parser in the tika-app and tika-server
 jars.

 Best,

   Tim



 -Original Message-
 From: Tyler Palsulich [mailto:tpalsul...@gmail.com]
 Sent: Sunday, March 29, 2015 9:13 AM
 To: dev@tika.apache.org
 Subject: Re: [DISCUSS] Tika 1.8 or 1.7.1

 Once TIKA-1584 and TIKA-1575 are resolved, I'll work up an RC (unless
 something else pops up).

 Thank you everyone.

 Tyler
 On Mar 29, 2015 4:43 AM, Hong-Thai Nguyen thaicha...@gmail.com wrote:

  +1 for 1.8
 
  Hong-Thai
 
   On 28 Mar 2015, at 16:01, Tyler Palsulich tpalsul...@apache.org wrote:
  
   Hi Folks,
  
   Now that TIKA-1581 (JHighlight licensing issues) is resolved, we need to
   release a new version of Tika. I'll volunteer to be the release manager
   again.
  
   Should we release this as 1.8 or 1.7.1?
  
   Does anyone have any last minute issues they'd like to finish and see in
   Tika 1.X? I'd like to get the example working with CORS (TIKA-1585 and
   TIKA-1586). Any others?
  
   Have a good weekend,
   Tyler
 



[jira] [Commented] (TIKA-1587) ForkParser::setJavaCommand should take List<String>

2015-03-30 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386685#comment-14386685
 ] 

Tyler Palsulich commented on TIKA-1587:
---

Thank you for reporting this! It seems like a definite problem. Is there any 
way you can provide a patch?

 ForkParser::setJavaCommand should take List<String>
 ---

 Key: TIKA-1587
 URL: https://issues.apache.org/jira/browse/TIKA-1587
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.7
Reporter: Oleg Oshmyan

 ForkParser::setJavaCommand currently takes a string and splits it on 
 whitespace. This makes it impossible to use commands with paths that contain 
 spaces. In particular, it makes it impossible to reliably use 
 System.getProperty("java.home") in order to launch the same Java that the 
 current process is running in, because it might contain spaces. If it would 
 just take a List<String> and pass (a clone of) it directly to ProcessBuilder, 
 this wouldn't be a problem.
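
To make the reported problem concrete, here is a small sketch contrasting whitespace splitting with building the command as a list of arguments. The class JavaCommandExample and its methods are hypothetical and only mimic the behavior described in the report; they are not ForkParser's code. The process is built but never started.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; this is not ForkParser's implementation.
final class JavaCommandExample {

    /** Mimics splitting a command string on whitespace, as described in the report. */
    static List<String> splitOnWhitespace(String command) {
        return Arrays.asList(command.trim().split("\\s+"));
    }

    /** Builds the command as a list, so a path containing spaces stays one argument. */
    static List<String> javaCommand(String javaHome, String... jvmArgs) {
        List<String> command = new ArrayList<>();
        command.add(javaHome + File.separator + "bin" + File.separator + "java");
        command.addAll(Arrays.asList(jvmArgs));
        return command;
    }

    public static void main(String[] args) {
        String javaHome = System.getProperty("java.home"); // may contain spaces
        List<String> command = javaCommand(javaHome, "-Xmx32m");

        // Pass a copy of the list straight to ProcessBuilder (not started here).
        ProcessBuilder builder = new ProcessBuilder(new ArrayList<>(command));
        System.out.println("list form:  " + builder.command());

        // The same command flattened to a string and split again may break apart.
        String flattened = command.get(0) + " -Xmx32m";
        System.out.println("split form: " + splitOnWhitespace(flattened));
    }
}
```

With a java.home such as "C:\Program Files\Java\jre7", the split form would yield an executable name of "C:\Program", while the list form keeps the full path as a single argument.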





RE: including refactored docs from govdocs1 in test suite

2015-03-30 Thread Tyler Palsulich
Ah. I see.

In general, what is the goal with handling corrupted files? Extract as much
as possible and fail gracefully?

Tyler

On Mar 30, 2015 9:32 AM, Allison, Timothy B. talli...@mitre.org wrote:

 Unfortunately, no.  MSOffice fixes the document when I do that.

 -Original Message-
 From: Tyler Palsulich [mailto:tpalsul...@gmail.com]
 Sent: Monday, March 30, 2015 9:24 AM
 To: dev@tika.apache.org
 Subject: Re: including refactored docs from govdocs1 in test suite

 Can you copy the hyperlink into a new doc and change the URL? I have no
 idea about including the modified version.

 Tyler
 On Mar 30, 2015 9:18 AM, Allison, Timothy B. talli...@mitre.org wrote:

  All,
 
   As part of TIKA-1512, I found that I can delete all of the contents,
  including the metadata, except for one hyperlink in two documents from
  govdocs1 and still get the proper behavior -- fail before fix, work after
  fix.
 
   These documents are in the public domain.
 
   Is it ok to include these modified documents in our test suite or should
  I avoid inclusion?
 
   Happy to avoid inclusion for the sake of a quick release of 1.8 and then
  we have time to discuss/determine way ahead... unless the answer is obvious.
 
   Best,
 
   Tim
 
  -Original Message-
  From: Allison, Timothy B. [mailto:talli...@mitre.org]
  Sent: Monday, March 30, 2015 7:03 AM
  To: dev@tika.apache.org
  Subject: RE: [DISCUSS] Tika 1.8 or 1.7.1
 
  Unless there are objections, I'd like these to be resolved before 1.8:
 
  TIKA-1584 -- I'll fix
  TIKA-1575 -- Resolved by Konstantin Gribov (thank you!)
  TIKA-1512 -- I'll put in a temporary fix so that we don't get IOOBEs, but
  I'll leave this open and do some more digging to see if we need to open a
  ticket at the POI level
  TIKA-1511 -- I'll remove provided for xerial
 
  TIKA-1549 -- We should thank Toke Eskildsen in CHANGES.txt, no?
 
  I'll have these fixes completed by noon EDT.  Should I run against
  govdocs1 before or after the RC?
 
  My last build of Tika app (a few days ago) ballooned to ~43MB, and that's
  before I add ~3MB for xerial.  Tika server is now ~48MB.  As of my last
  build, we are still including ~4MB of pdfs (README.NLDAS1.pdf and
  README.NLDAS2.pdf) from the GRIB(?) parser in the tika-app and tika-server
  jars.
 
  Best,
 
Tim
 
 
 
  -Original Message-
  From: Tyler Palsulich [mailto:tpalsul...@gmail.com]
  Sent: Sunday, March 29, 2015 9:13 AM
  To: dev@tika.apache.org
  Subject: Re: [DISCUSS] Tika 1.8 or 1.7.1
 
  Once TIKA-1584 and TIKA-1575 are resolved, I'll work up an RC (unless
  something else pops up).
 
  Thank you everyone.
 
  Tyler
  On Mar 29, 2015 4:43 AM, Hong-Thai Nguyen thaicha...@gmail.com
wrote:
 
   +1 for 1.8
  
   Hong-Thai
  
    On 28 Mar 2015, at 16:01, Tyler Palsulich tpalsul...@apache.org wrote:
   
    Hi Folks,
   
    Now that TIKA-1581 (JHighlight licensing issues) is resolved, we need to
    release a new version of Tika. I'll volunteer to be the release manager
    again.
   
    Should we release this as 1.8 or 1.7.1?
   
    Does anyone have any last minute issues they'd like to finish and see in
    Tika 1.X? I'd like to get the example working with CORS (TIKA-1585 and
    TIKA-1586). Any others?
   
Have a good weekend,
Tyler
  
 


Re: [DISCUSS] Tika 1.8 or 1.7.1

2015-03-30 Thread Tyler Palsulich
I just remembered TIKA-1509 and TIKA-1558 -- testing now for blacklist
functionality through TIKA-1509. If that works, I'll back out TIKA-1558.

Tim, I think you should run govdocs from the RC, in case something changes
between your run and the cut.

Tyler

On Mon, Mar 30, 2015 at 10:17 AM, Allison, Timothy B. talli...@mitre.org
wrote:

 All,

 I've made the changes that I had hoped to.  Grib pdf exclusion remains for
 any takers.

 Let me know when I should initiate the run against govdocs1 to see if
 there are any surprises on that corpus with Tika 1.8.

 Best,

 Tim

 -Original Message-
 From: Allison, Timothy B. [mailto:talli...@mitre.org]
 Sent: Monday, March 30, 2015 7:03 AM
 To: dev@tika.apache.org
 Subject: RE: [DISCUSS] Tika 1.8 or 1.7.1

 Unless there are objections, I'd like these to be resolved before 1.8:

 TIKA-1584 -- I'll fix
 TIKA-1575 -- Resolved by Konstantin Gribov (thank you!)
 TIKA-1512 -- I'll put in a temporary fix so that we don't get IOOBEs, but
 I'll leave this open and do some more digging to see if we need to open a
 ticket at the POI level
 TIKA-1511 -- I'll remove provided for xerial

 TIKA-1549 -- We should thank Toke Eskildsen in CHANGES.txt, no?

 I'll have these fixes completed by noon EDT.  Should I run against
 govdocs1 before or after the RC?

 My last build of Tika app (a few days ago) ballooned to ~43MB, and that's
 before I add ~3MB for xerial.  Tika server is now ~48MB.  As of my last
 build, we are still including ~4MB of pdfs (README.NLDAS1.pdf and
 README.NLDAS2.pdf) from the GRIB(?) parser in the tika-app and tika-server
 jars.

 Best,

   Tim



 -Original Message-
 From: Tyler Palsulich [mailto:tpalsul...@gmail.com]
 Sent: Sunday, March 29, 2015 9:13 AM
 To: dev@tika.apache.org
 Subject: Re: [DISCUSS] Tika 1.8 or 1.7.1

 Once TIKA-1584 and TIKA-1575 are resolved, I'll work up an RC (unless
 something else pops up).

 Thank you everyone.

 Tyler
 On Mar 29, 2015 4:43 AM, Hong-Thai Nguyen thaicha...@gmail.com wrote:

  +1 for 1.8
 
  Hong-Thai
 
   On 28 Mar 2015, at 16:01, Tyler Palsulich tpalsul...@apache.org wrote:
  
   Hi Folks,
  
   Now that TIKA-1581 (JHighlight licensing issues) is resolved, we need to
   release a new version of Tika. I'll volunteer to be the release manager
   again.
  
   Should we release this as 1.8 or 1.7.1?
  
   Does anyone have any last minute issues they'd like to finish and see in
   Tika 1.X? I'd like to get the example working with CORS (TIKA-1585 and
   TIKA-1586). Any others?
  
   Have a good weekend,
   Tyler
 



[jira] [Comment Edited] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)

2015-03-30 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386906#comment-14386906
 ] 

Tyler Palsulich edited comment on TIKA-1584 at 3/30/15 4:05 PM:


Yup! The 1.8 release process should start this week. Ideally, it will hit the 
mirrors some time next week.

[edit: 1.8, not 1.7!]


was (Author: tpalsulich):
Yup! The 1.7 release process should start this week. Ideally, it will hit the 
mirrors some time next week.

 Tika 1.7 possible regression (nested attachment files not getting parsed)
 -

 Key: TIKA-1584
 URL: https://issues.apache.org/jira/browse/TIKA-1584
 Project: Tika
  Issue Type: Bug
  Components: server
Affects Versions: 1.7
Reporter: Rob Tulloh
Assignee: Tim Allison
Priority: Blocker
 Fix For: 1.8


 I tried to send this to the tika user list, but got a qmail failure so I am 
 opening a jira to see if I can get help with this.
 There appears to be a change in the behavior of tika since 1.5 (the last 
 version we have used). In 1.5, if we pass a file with content type of rfc822 
 which contains a zip that contains a docx file, the entire content would get 
 recursed and the text returned. In 1.7, tika only unwinds as far as the zip 
 file and ignores the content of the contained docx file. This is causing a 
 regression failure in our search tests because the contents of the docx file 
 are not found when searched for.
  
 We are testing with tika-server if this helps. If we ask the meta service to 
 just characterize the test data, it correctly determines the input is of type 
 rfc822. However, on extract, the contents of the attachment are not extracted 
 as expected.
 curl -X PUT -T test.eml -q -H Content-Type:application/octet-stream  
 http://localhost:9998/meta 2>/dev/null | grep Content-Type
 Content-Type,message/rfc822
 curl -X PUT -T test.eml -q -H Content-Type:application/octet-stream  
 http://localhost:9998/tika 2>/dev/null | grep docx
 sign.docx   <--- this is not expected, need contents of this extracted
 We can easily reproduce this problem with a simple eml file with an 
 attachment. Can someone please comment if this seems like a problem or 
 perhaps we need to change something in our call to get the old behavior?





[jira] [Updated] (TIKA-1575) Upgrade to PDFBox 1.8.9 when available

2015-03-29 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich updated TIKA-1575:
--
Fix Version/s: 1.8

 Upgrade to PDFBox 1.8.9 when available
 --

 Key: TIKA-1575
 URL: https://issues.apache.org/jira/browse/TIKA-1575
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Priority: Minor
 Fix For: 1.8

 Attachments: 005937.pdf.json, 005937_1_8_9-SNAPSHOT.pdf.json, 
 10-814_Appendix B_v3.pdf, 524276_719128_diffs.zip, 
 PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT.xlsx, 
 PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT_reports.zip, 
 PDFBox_1_8_8Vs1_8_9_20150316.zip, caught_ex_1_8_9.zip, 
 content_diffs_20150316.xlsx, diffs_1_8_9_multithread_vs_single_thread.xlsx, 
 reports_1_8_9_multithread_vs_single.zip


 The PDFBox community is about to release 1.8.9.  Let's use this issue to 
 track discussions before the release and to track Tika's upgrade to PDFBox 
 1.8.9





Re: [DISCUSS] Tika 1.8 or 1.7.1

2015-03-29 Thread Tyler Palsulich
Once TIKA-1584 and TIKA-1575 are resolved, I'll work up an RC (unless
something else pops up).

Thank you everyone.

Tyler
On Mar 29, 2015 4:43 AM, Hong-Thai Nguyen thaicha...@gmail.com wrote:

 +1 for 1.8

 Hong-Thai

  On 28 Mar 2015, at 16:01, Tyler Palsulich tpalsul...@apache.org wrote:
 
  Hi Folks,
 
  Now that TIKA-1581 (JHighlight licensing issues) is resolved, we need to
  release a new version of Tika. I'll volunteer to be the release manager
  again.
 
  Should we release this as 1.8 or 1.7.1?
 
  Does anyone have any last minute issues they'd like to finish and see in
  Tika 1.X? I'd like to get the example working with CORS (TIKA-1585 and
  TIKA-1586). Any others?
 
  Have a good weekend,
  Tyler



[jira] [Resolved] (TIKA-1579) Add file type to NetCDFParser

2015-03-29 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich resolved TIKA-1579.
---
Resolution: Fixed

 Add file type to NetCDFParser
 -

 Key: TIKA-1579
 URL: https://issues.apache.org/jira/browse/TIKA-1579
 Project: Tika
  Issue Type: Improvement
  Components: parser
Reporter: Ann Burgess
Assignee: Ann Burgess
 Attachments: TIKA-1579.abburgess.190315.patch.txt


 [~gostep] explains that there are three versions of NetCDF (classic format, 
 64-bit offset, and netCDF-4/HDF5 format). When opening an existing netCDF 
 file, the netCDF library will transparently detect its format, so we do not 
 need to adjust according to the detected format.
 That said, it would be good to know the file type, as each can have the .nc 
 extension.  This patch will add the file type to the metadata.
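
As a rough illustration of how the three formats differ on disk, the sketch below inspects the leading bytes of a file header. This is an assumption-laden example, not the attached patch: the patch presumably relies on the netCDF library's own detection, and NetcdfMagic is a hypothetical class. It checks only offset 0, and an HDF5 signature by itself does not prove a file is netCDF-4.

```java
// Hypothetical helper; not the method used in the TIKA-1579 patch.
final class NetcdfMagic {
    // HDF5 signature: 0x89 'H' 'D' 'F' '\r' '\n' 0x1A '\n'
    private static final byte[] HDF5_SIGNATURE =
            {(byte) 0x89, 'H', 'D', 'F', '\r', '\n', 0x1A, '\n'};

    /** Guesses the NetCDF variant from the first bytes of a file. */
    static String describe(byte[] header) {
        if (header == null || header.length < 4) {
            return "unknown";
        }
        // Classic and 64-bit offset files start with "CDF" plus a version byte.
        if (header[0] == 'C' && header[1] == 'D' && header[2] == 'F') {
            if (header[3] == 1) {
                return "classic";
            }
            if (header[3] == 2) {
                return "64-bit offset";
            }
            return "unknown";
        }
        // netCDF-4 files are HDF5 containers, so they carry the HDF5 signature.
        if (header.length >= HDF5_SIGNATURE.length) {
            for (int i = 0; i < HDF5_SIGNATURE.length; i++) {
                if (header[i] != HDF5_SIGNATURE[i]) {
                    return "unknown";
                }
            }
            return "netCDF-4/HDF5";
        }
        return "unknown";
    }

    public static void main(String[] args) {
        System.out.println(describe(new byte[] {'C', 'D', 'F', 1}));
        System.out.println(describe(new byte[] {'C', 'D', 'F', 2}));
        System.out.println(describe(HDF5_SIGNATURE));
    }
}
```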





Re: [DISCUSS] Tika 1.8 or 1.7.1

2015-03-28 Thread Tyler Palsulich
I'm also leaning toward 1.8. Especially given the newly identified
regression in TIKA-1584.

Tyler
On Mar 28, 2015 11:47 AM, Mattmann, Chris A (3980) 
chris.a.mattm...@jpl.nasa.gov wrote:

 Hi Tyler - I would VOTE for 1.8. Given the stuff associated
 with releasing (updating the website; sending emails; waiting
 periods, etc.) let’s ship all the updates we have too along
 with the jhighlight fix.

 Cheers,
 Chris

 ++
 Chris Mattmann, Ph.D.
 Chief Architect
 Instrument Software and Science Data Systems Section (398)
 NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
 Office: 168-519, Mailstop: 168-527
 Email: chris.a.mattm...@nasa.gov
 WWW:  http://sunset.usc.edu/~mattmann/
 ++
 Adjunct Associate Professor, Computer Science Department
 University of Southern California, Los Angeles, CA 90089 USA
 ++






 -Original Message-
 From: Tyler Palsulich tpalsul...@apache.org
 Reply-To: dev@tika.apache.org dev@tika.apache.org
 Date: Saturday, March 28, 2015 at 8:01 AM
 To: dev@tika.apache.org dev@tika.apache.org
 Subject: [DISCUSS] Tika 1.8 or 1.7.1

 Hi Folks,
 
 Now that TIKA-1581 (JHighlight licensing issues) is resolved, we need to
 release a new version of Tika. I'll volunteer to be the release manager
 again.
 
 Should we release this as 1.8 or 1.7.1?
 
 Does anyone have any last minute issues they'd like to finish and see in
 Tika 1.X? I'd like to get the example working with CORS (TIKA-1585 and
 TIKA-1586). Any others?
 
 Have a good weekend,
 Tyler




[jira] [Commented] (TIKA-1584) Tika 1.7 possible regression (nested attachment files not getting parsed)

2015-03-28 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385483#comment-14385483
 ] 

Tyler Palsulich commented on TIKA-1584:
---

We now have two major issues which need a quick release. So, I would say go for 
1.8. Tim, can you chime in on the current discuss thread?

 Tika 1.7 possible regression (nested attachment files not getting parsed)
 -

 Key: TIKA-1584
 URL: https://issues.apache.org/jira/browse/TIKA-1584
 Project: Tika
  Issue Type: Bug
  Components: server
Affects Versions: 1.7
Reporter: Rob Tulloh
Assignee: Tim Allison
Priority: Blocker

 I tried to send this to the tika user list, but got a qmail failure so I am 
 opening a jira to see if I can get help with this.
 There appears to be a change in the behavior of tika since 1.5 (the last 
 version we have used). In 1.5, if we pass a file with content type of rfc822 
 which contains a zip that contains a docx file, the entire content would get 
 recursed and the text returned. In 1.7, tika only unwinds as far as the zip 
 file and ignores the content of the contained docx file. This is causing a 
 regression failure in our search tests because the contents of the docx file 
 are not found when searched for.
  
 We are testing with tika-server if this helps. If we ask the meta service to 
 just characterize the test data, it correctly determines the input is of type 
 rfc822. However, on extract, the contents of the attachment are not extracted 
 as expected.
 curl -X PUT -T test.eml -q -H "Content-Type:application/octet-stream" http://localhost:9998/meta 2>/dev/null | grep Content-Type
 Content-Type,message/rfc822
 curl -X PUT -T test.eml -q -H "Content-Type:application/octet-stream" http://localhost:9998/tika 2>/dev/null | grep docx
 sign.docx   <--- this is not expected, need contents of this extracted
 We can easily reproduce this problem with a simple eml file with an 
 attachment. Can someone please comment if this seems like a problem or 
 perhaps we need to change something in our call to get the old behavior?
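
 For anyone reproducing this from Java rather than curl, the two requests
 above can be sketched with the JDK's own HTTP client (Java 11+). This is
 only an illustration: the class name TikaServerRequests and the helper
 put() are hypothetical, and the base URL and the /meta and /tika endpoints
 are simply copied from the curl commands in the report.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika: builds the same PUT requests as the curl calls.
final class TikaServerRequests {
    // Base URL copied from the curl commands in the report; adjust as needed.
    static final String BASE = "http://localhost:9998";

    /** Builds a PUT of the given bytes to BASE/endpoint as application/octet-stream. */
    static HttpRequest put(String endpoint, byte[] body) {
        return HttpRequest.newBuilder()
                .uri(URI.create(BASE + "/" + endpoint))
                .header("Content-Type", "application/octet-stream")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(body))
                .build();
    }

    public static void main(String[] args) {
        // Placeholder payload; a real test would read test.eml from disk.
        byte[] eml = "Subject: test\r\n\r\nbody".getBytes(StandardCharsets.US_ASCII);
        HttpRequest meta = put("meta", eml);
        HttpRequest text = put("tika", eml);
        // Only prints the requests; nothing is sent to a server here.
        System.out.println(meta.method() + " " + meta.uri());
        System.out.println(text.method() + " " + text.uri());
    }
}
```

 Sending either request is then a matter of passing it to
 HttpClient.newHttpClient().send(...) with a string body handler and
 searching the response for the text expected from the nested docx.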



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (TIKA-1585) Create Example Website with Form Submission

2015-03-28 Thread Tyler Palsulich (JIRA)
Tyler Palsulich created TIKA-1585:
-

 Summary: Create Example Website with Form Submission
 Key: TIKA-1585
 URL: https://issues.apache.org/jira/browse/TIKA-1585
 Project: Tika
  Issue Type: New Feature
  Components: example, server
Reporter: Tyler Palsulich
Assignee: Tyler Palsulich


It would be great to have a website where we can direct people who ask what 
Tika can do for [filetype] without needing them to actually download Tika.

Some initial work to do that is 
[here|http://tpalsulich.github.io/TikaExamples/].

I'm far from a design guru, but I imagine the site as having a form where you 
can upload a file at the top, checkboxes for if you want metadata, content, or 
both, and a submit button. The request should be sent with AJAX and the result 
should populate a {{div}}.

One issue with AJAX requests is that Tika Server doesn't currently allow 
Cross-Origin-Resource-Sharing (CORS). So, we either need to maintain a slightly 
updated tika-server, or update the server to allow configuration.





[jira] [Resolved] (TIKA-1526) ExternalParser should trap/ignore/workarround JDK-8047340 & JDK-8055301 so Turkish Tika users can still use non-external parsers

2015-03-28 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich resolved TIKA-1526.
---
Resolution: Fixed

Marking this as Fixed, per the above comments. [~thetaphi] or [~hossman] or 
anyone else, please reopen this if you find any other cases.

Thank you everyone for the help!

 ExternalParser should trap/ignore/workarround JDK-8047340 & JDK-8055301 so 
 Turkish Tika users can still use non-external parsers
 

 Key: TIKA-1526
 URL: https://issues.apache.org/jira/browse/TIKA-1526
 Project: Tika
  Issue Type: Wish
Reporter: Hoss Man

 the JDK has numerous pain points regarding the Turkish locale, posix_spawn 
 lowercasing being one of them...
 https://bugs.openjdk.java.net/browse/JDK-8047340
 https://bugs.openjdk.java.net/browse/JDK-8055301
 As of Tika 1.7, the TesseractOCRParser (which is an ExternalParser) is 
 enabled & configured by default in Tika, and uses ExternalParser.check to see 
 if tesseract is available -- but because of the JDK bug, this means that Tika 
 fails fast for Turkish users on BSD/UNIX variants (including MacOSX) like 
 so...
 {noformat}
   [junit4] Throwable #1: java.lang.Error: posix_spawn is not a supported 
 process launch mechanism on this platform.
   [junit4]   at java.lang.UNIXProcess$1.run(UNIXProcess.java:105)
   [junit4]   at java.lang.UNIXProcess$1.run(UNIXProcess.java:94)
   [junit4]   at java.security.AccessController.doPrivileged(Native 
 Method)
   [junit4]   at java.lang.UNIXProcess.<clinit>(UNIXProcess.java:92)
   [junit4]   at java.lang.ProcessImpl.start(ProcessImpl.java:130)
   [junit4]   at 
 java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
   [junit4]   at java.lang.Runtime.exec(Runtime.java:620)
   [junit4]   at java.lang.Runtime.exec(Runtime.java:485)
   [junit4]   at 
 org.apache.tika.parser.external.ExternalParser.check(ExternalParser.java:344)
   [junit4]   at 
 org.apache.tika.parser.ocr.TesseractOCRParser.hasTesseract(TesseractOCRParser.java:117)
   [junit4]   at 
 org.apache.tika.parser.ocr.TesseractOCRParser.getSupportedTypes(TesseractOCRParser.java:90)
   [junit4]   at 
 org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81)
   [junit4]   at 
 org.apache.tika.parser.DefaultParser.getParsers(DefaultParser.java:95)
   [junit4]   at 
 org.apache.tika.parser.CompositeParser.getSupportedTypes(CompositeParser.java:229)
   [junit4]   at 
 org.apache.tika.parser.CompositeParser.getParsers(CompositeParser.java:81)
   [junit4]   at 
 org.apache.tika.parser.CompositeParser.getParser(CompositeParser.java:209)
   [junit4]   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
   [junit4]   at 
 org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
 {noformat}
 ...unless they go out of their way to white list only the parsers they 
 need/want so TesseractOCRParser (and any other ExternalParsers) will never 
 even be check()ed.
 It would be nice if Tika's ExternalParser class added a similar 
 hack/workaround to what was done in SOLR-6387 to trap these types of errors. 
  In Solr we just propagate a better error explaining why Java hates the 
 Turkish language...
 {code}
 } catch (Error err) {
   if (err.getMessage() != null && (err.getMessage().contains("posix_spawn")
       || err.getMessage().contains("UNIXProcess"))) {
     log.warn("Error forking command due to JVM locale bug (see "
         + "https://issues.apache.org/jira/browse/SOLR-6387): " + err.getMessage());
     return "(error executing: " + cmd + ")";
   }
 }
 {code}
 ...but with Tika, it might be better for all ExternalParsers to just opt 
 out as if they don't recognize the filetype when they detect this type of 
 error from the check method (or perhaps it would be better if 
 AutoDetectParser handled this? ... I'm not really sure how it would best fit 
 into Tika's architecture)
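
 The "opt out" idea could look roughly like the sketch below. It is not
 the actual ExternalParser.check code: the class name ExternalCheck and
 the simplified check(String...) signature are made up for illustration,
 and it assumes Java 9+ for Redirect.DISCARD. The only part taken from the
 report is the test on the error message ("posix_spawn" / "UNIXProcess").

```java
import java.io.IOException;

// Hypothetical stand-in for an availability check; not Tika's real API.
final class ExternalCheck {
    /**
     * Returns true only if the command could be launched and exited with
     * status 0. Any launch failure, including the java.lang.Error raised
     * by the JDK locale bug, is reported as "tool not available".
     */
    static boolean check(String... command) {
        try {
            Process p = new ProcessBuilder(command)
                    .redirectErrorStream(true)                       // merge stderr into stdout
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)  // and drop both
                    .start();
            return p.waitFor() == 0;
        } catch (IOException e) {
            return false; // command not found or not executable
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (Error err) {
            String msg = err.getMessage();
            if (msg != null && (msg.contains("posix_spawn") || msg.contains("UNIXProcess"))) {
                return false; // JVM locale bug: opt out instead of failing fast
            }
            throw err; // unrelated errors still propagate
        }
    }

    public static void main(String[] args) {
        System.out.println("tesseract available: " + check("tesseract", "--version"));
    }
}
```

 With a check like this, a parser whose external tool cannot be launched
 would simply report no supported types, so the other parsers keep working.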





[jira] [Commented] (TIKA-1581) jhighlight license concerns

2015-03-28 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385337#comment-14385337
 ] 

Tyler Palsulich commented on TIKA-1581:
---

Hi [~kkrugler]. Thanks. The comment is now
bq. Tika-parsers component uses CDDL/LGPL dual-licensed dependency: jhighlight 
(https://github.com/codelibs/jhighlight)

If this looks good, I'll start a \[DISCUSS\] thread on the list about a new 
version.

 jhighlight license concerns
 ---

 Key: TIKA-1581
 URL: https://issues.apache.org/jira/browse/TIKA-1581
 Project: Tika
  Issue Type: Bug
Affects Versions: 1.7
Reporter: Karl Wright
 Fix For: 1.8


 jhighlight jar is a Tika dependency.  The Lucene team discovered that, while 
 it claims to be a CDDL/LGPL dual-license, some of its functionality is LGPL 
 only:
 {code}
 Solr's contrib/extraction contains jhighlight-1.0.jar which declares itself 
 as dual CDDL or LGPL license. However, some of its classes are distributed 
 only under LGPL, e.g.
 com.uwyn.jhighlight.highlighter.
   CppHighlighter.java
   GroovyHighlighter.java
   JavaHighlighter.java
   XmlHighlighter.java
 I downloaded the sources from Maven 
 (http://search.maven.org/remotecontent?filepath=com/uwyn/jhighlight/1.0/jhighlight-1.0-sources.jar)
  to confirm that, and also found this SVN repo: 
 http://svn.rifers.org/jhighlight/tags/release-1.0, though the project's 
 website seems to not exist anymore (https://jhighlight.dev.java.net/).
 I didn't find any direct usage of it in our code, so I guess it's probably 
 needed by a 3rd party dependency, such as Tika. Therefore if we e.g. omit it, 
 things will compile, but may fail at runtime.
 {code}
 Is it possible to remove this dependency for future releases, or allow only 
 optional inclusion of this package?  It is of concern to the ManifoldCF 
 project because we distribute a binary package that includes Tika and its 
 required dependencies, which currently includes jHighlight.





[jira] [Created] (TIKA-1586) Enable CORS on Tika Server

2015-03-28 Thread Tyler Palsulich (JIRA)
Tyler Palsulich created TIKA-1586:
-

 Summary: Enable CORS on Tika Server
 Key: TIKA-1586
 URL: https://issues.apache.org/jira/browse/TIKA-1586
 Project: Tika
  Issue Type: New Feature
  Components: server
Reporter: Tyler Palsulich
Assignee: Tyler Palsulich


Tika Server should allow configuration of CORS requests (for uses like 
TIKA-1585). See [this example|http://cxf.apache.org/docs/jax-rs-cors.html] from 
CXF for how to add it.

The only change from that site is that we will need to add a 
{{CrossOriginResourceSharingFilter}} as a provider.

Ideally, this is configurable (limit which resources have CORS, and which 
origins are allowed). But, I'm not thinking of any general methods of how to do 
that...
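
One possible shape for the "which origins are allowed" part is a plain
allow-list lookup, sketched below. This is not CXF's
CrossOriginResourceSharingFilter and not what was committed; CorsPolicy and
allowOriginHeader are hypothetical names, and treating "*" as "any origin"
is an assumption of the sketch.

```java
import java.util.Optional;
import java.util.Set;

// Hypothetical allow-list policy; independent of CXF and of Tika Server.
final class CorsPolicy {
    private final Set<String> allowedOrigins;

    CorsPolicy(Set<String> allowedOrigins) {
        this.allowedOrigins = Set.copyOf(allowedOrigins);
    }

    /** Value for Access-Control-Allow-Origin, or empty if the header should be omitted. */
    Optional<String> allowOriginHeader(String requestOrigin) {
        if (requestOrigin == null) {
            return Optional.empty(); // no Origin header: not a cross-origin request
        }
        if (allowedOrigins.contains("*")) {
            return Optional.of("*"); // assumption: "*" in the list means any origin
        }
        return allowedOrigins.contains(requestOrigin)
                ? Optional.of(requestOrigin)
                : Optional.empty();
    }

    public static void main(String[] args) {
        CorsPolicy policy = new CorsPolicy(Set.of("http://tpalsulich.github.io"));
        System.out.println(policy.allowOriginHeader("http://tpalsulich.github.io"));
        System.out.println(policy.allowOriginHeader("http://example.org"));
    }
}
```

Limiting CORS to particular resources could reuse the same idea with one
policy per resource path, though that is only a suggestion.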





[jira] [Resolved] (TIKA-1586) Enable CORS on Tika Server

2015-03-28 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich resolved TIKA-1586.
---
Resolution: Fixed

Fixed in r1669799.

 Enable CORS on Tika Server
 --

 Key: TIKA-1586
 URL: https://issues.apache.org/jira/browse/TIKA-1586
 Project: Tika
  Issue Type: New Feature
  Components: server
Reporter: Tyler Palsulich
Assignee: Tyler Palsulich

 Tika Server should allow configuration of CORS requests (for uses like 
 TIKA-1585). See [this example|http://cxf.apache.org/docs/jax-rs-cors.html] 
 from CXF for how to add it.
 The only change from that site is that we will need to add a 
 {{CrossOriginResourceSharingFilter}} as a provider.
 Ideally, this is configurable (limit which resources have CORS, and which 
 origins are allowed). But, I'm not thinking of any general methods of how to 
 do that...





[jira] [Commented] (TIKA-1585) Create Example Website with Form Submission

2015-03-28 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385411#comment-14385411
 ] 

Tyler Palsulich commented on TIKA-1585:
---

CORS work is now integrated. [~talli...@mitre.org], can you restart the server 
on 162.242.228.174:9998 with the "--cors http://tpalsulich.github.io" option?

Then, we can close off the 9997 port (my github.io site is querying 9997, 
though, so I'll need to update that).

Is there an official place we'd like to host the above site?

 Create Example Website with Form Submission
 ---

 Key: TIKA-1585
 URL: https://issues.apache.org/jira/browse/TIKA-1585
 Project: Tika
  Issue Type: New Feature
  Components: example, server
Reporter: Tyler Palsulich
Assignee: Tyler Palsulich

 It would be great to have a website where we can direct people who ask what 
 Tika can do for [filetype] without needing them to actually download Tika.
 Some initial work to do that is 
 [here|http://tpalsulich.github.io/TikaExamples/].
 I'm far from a design guru, but I imagine the site as having a form where you 
 can upload a file at the top, checkboxes for if you want metadata, content, 
 or both, and a submit button. The request should be sent with AJAX and the 
 result should populate a {{div}}.
 One issue with AJAX requests is that Tika Server doesn't currently allow 
 Cross-Origin-Resource-Sharing (CORS). So, we either need to maintain a 
 slightly updated tika-server, or update the server to allow configuration.





[DISCUSS] Tika 1.8 or 1.7.1

2015-03-28 Thread Tyler Palsulich
Hi Folks,

Now that TIKA-1581 (JHighlight licensing issues) is resolved, we need to
release a new version of Tika. I'll volunteer to be the release manager
again.

Should we release this as 1.8 or 1.7.1?

Does anyone have any last minute issues they'd like to finish and see in
Tika 1.X? I'd like to get the example working with CORS (TIKA-1585 and
TIKA-1586). Any others?

Have a good weekend,
Tyler


[jira] [Commented] (TIKA-1586) Enable CORS on Tika Server

2015-03-28 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385372#comment-14385372
 ] 

Tyler Palsulich commented on TIKA-1586:
---

Can someone take a look at the above PR and make sure I'm not doing anything 
bone-headed? Thanks!

 Enable CORS on Tika Server
 --

 Key: TIKA-1586
 URL: https://issues.apache.org/jira/browse/TIKA-1586
 Project: Tika
  Issue Type: New Feature
  Components: server
Reporter: Tyler Palsulich
Assignee: Tyler Palsulich

 Tika Server should allow configuration of CORS requests (for uses like 
 TIKA-1585). See [this example|http://cxf.apache.org/docs/jax-rs-cors.html] 
 from CXF for how to add it.
 The only change from that site is that we will need to add a 
 {{CrossOriginResourceSharingFilter}} as a provider.
 Ideally, this is configurable (limit which resources have CORS, and which 
 origins are allowed). But, I'm not thinking of any general methods of how to 
 do that...





[jira] [Closed] (TIKA-1354) ForkParser doesn't work in OSGI container

2015-03-27 Thread Tyler Palsulich (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich closed TIKA-1354.
-
   Resolution: Fixed
Fix Version/s: 1.7

Marking as Fixed.

 ForkParser doesn't work in OSGI container
 -

 Key: TIKA-1354
 URL: https://issues.apache.org/jira/browse/TIKA-1354
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.6
Reporter: Michal Hlavac
 Fix For: 1.7


 I can't find way to run ForkParser in OSGI container.





Enabling CORS

2015-03-27 Thread Tyler Palsulich
Hi Folks,

I'm trying to enable CORS on a few of Tika's Server resources. But, after
adding the pom.xml dependency and a

@CrossOriginResourceSharing(
    allowOrigins = {"url"}
)

annotation to the resources, the Access-Control-Allow-Origin header is
still not given.

Is there another configuration I need to add? Tika's server doesn't
currently have a bean configuration like at the bottom of the examples page
http://cxf.apache.org/docs/jax-rs-cors.html#JAX-RSCORS-Examples.

Thanks for any help,
Tyler

