Re: [DISCUSS] Tika 1.8 or 1.7.1

2015-03-29 Thread Hong-Thai Nguyen
+1 for 1.8

Hong-Thai

> On 28 Mar 2015, at 16:01, Tyler Palsulich  wrote:
> 
> Hi Folks,
> 
> Now that TIKA-1581 (JHighlight licensing issues) is resolved, we need to
> release a new version of Tika. I'll volunteer to be the release manager
> again.
> 
> Should we release this as 1.8 or 1.7.1?
> 
> Does anyone have any last minute issues they'd like to finish and see in
> Tika 1.X? I'd like to get the example working with CORS (TIKA-1585 and
> TIKA-1586). Any others?
> 
> Have a good weekend,
> Tyler


RE: [VOTE] Release Apache Tika 1.8 Candidate #1

2015-04-13 Thread Hong-Thai Nguyen
Not yet; I'm investigating TIKA-1600 further today.

Hong-Thai

-----Original Message-----
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Monday, April 13, 2015 01:07
To: dev@tika.apache.org
Subject: RE: [VOTE] Release Apache Tika 1.8 Candidate #1

I don't think we've solved TIKA-1600, yet, or have we?

-Original Message-
From: Tyler Palsulich [mailto:tpalsul...@gmail.com]
Sent: Sunday, April 12, 2015 12:12 AM
To: dev@tika.apache.org
Subject: Re: [VOTE] Release Apache Tika 1.8 Candidate #1

Are we ready for another RC? I'd like to make sure the above issues are 
(believed to be) settled before the next cut.

Thanks,
Tyler
On Apr 10, 2015 4:55 PM, "David Meikle"  wrote:

>
> > On 10 Apr 2015, at 11:38, Allison, Timothy B. wrote:
> >
> > I agree that the ODT issue might require a respin.  What do others think?
>
> +1 for re-spin.
>
> >
> > Unfortunately, there might be 2 odt docs (mime type:
> > “application/vnd.oasis.opendocument.text”?) in govdocs1…so we wouldn't
> > see that problem.
> >
> > I did do a comparison of 1.7 vs 1.8-rc1, and the results are here:
> >
> > https://github.com/tballison/share/blob/master/tika_comparisons/tika_1_7_v_1_8-rc1.zip
> >
> > I encourage folks (if you haven't, and if you care :) ) to take a look
> > and see if you see something that I don’t.
>
> Thanks for this Tim.  About to get on a flight, so will check through 
> on that.
>
> Cheers,
> Dave
>
>


RE: [VOTE] Apache Tika 1.8 Release Candidate #2

2015-04-14 Thread Hong-Thai Nguyen
Hi,

+1 for me.

Great work, Tyler !

Hong-Thai

-----Original Message-----
From: Tyler Palsulich [mailto:tpalsul...@apache.org]
Sent: Monday, April 13, 2015 19:56
To: dev@tika.apache.org; u...@tika.apache.org
Subject: [VOTE] Apache Tika 1.8 Release Candidate #2

Hi Folks,

A candidate for the Tika 1.8 release is available at:
  https://dist.apache.org/repos/dist/dev/tika/

The release candidate is a zip archive of the sources in:
  http://svn.apache.org/repos/asf/tika/tags/1.8-rc2/

The SHA1 checksum of the archive is
  5e22fee9079370398472e59082d171ae2d7fdd31.

In addition, a staged maven repository is available here:
  https://repository.apache.org/content/repositories/orgapachetika-1009

Please vote on releasing this package as Apache Tika 1.8. The vote is open for 
the next 72 hours and passes if a majority of at least three +1 Tika PMC votes 
are cast.

[ ] +1 Release this package as Apache Tika 1.8
[ ] ±0 I don't object to this release, but I haven't checked it
[ ] -1 Do not release this package because...

Thanks,
Tyler
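
For anyone checking the candidate before voting, the announced SHA1 checksum can be compared against a locally downloaded archive. The following is only a sketch using the JDK's own `MessageDigest`; the class name `Sha1Check` and the command-line arguments are illustrative assumptions, not part of the Tika release tooling.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class Sha1Check {
    /** Returns the lowercase hex SHA-1 digest of everything read from the stream. */
    static String sha1Hex(InputStream in) throws IOException, NoSuchAlgorithmException {
        MessageDigest digest = MessageDigest.getInstance("SHA-1");
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            digest.update(buffer, 0, read);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws Exception {
        // args[0]: path to the downloaded archive (hypothetical local file)
        // args[1]: checksum copied from the vote email
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            String actual = sha1Hex(in);
            System.out.println(actual.equalsIgnoreCase(args[1])
                    ? "checksum OK"
                    : "checksum MISMATCH: " + actual);
        }
    }
}
```

A matching digest only shows the download is intact; it says nothing about whether the build itself works.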


RE: Java 1.6 support for Tika 1.9?

2015-04-29 Thread Hong-Thai Nguyen
Hi folks,

I'm +1 for announcing the end of JDK 1.6 support in the upcoming 1.9 release.

FYI, we still have some legacy dependencies that appear to target only JDK 1.5
(*jdk15*):

$ mvn dependency:tree
[INFO] Scanning for projects...
[INFO]
[INFO] 
[INFO] Building Apache Tika parsers 1.9-SNAPSHOT
[INFO] 
[INFO]
[INFO] --- maven-dependency-plugin:2.1:tree (default-cli) @ tika-parsers ---
[INFO] org.apache.tika:tika-parsers:bundle:1.9-SNAPSHOT
[INFO] +- org.osgi:org.osgi.core:jar:4.0.0:provided
[INFO] +- org.apache.tika:tika-core:jar:1.9-SNAPSHOT:compile
[INFO] +- org.apache.tika:tika-core:test-jar:tests:1.9-SNAPSHOT:test
[INFO] +- org.gagravarr:vorbis-java-tika:jar:0.6:compile
[INFO] +- org.apache.felix:org.apache.felix.scr.annotations:jar:1.6.0:provided
[INFO] +- net.sourceforge.jmatio:jmatio:jar:1.0:compile
[INFO] +- org.apache.james:apache-mime4j-core:jar:0.7.2:compile
[INFO] +- org.apache.james:apache-mime4j-dom:jar:0.7.2:compile
[INFO] +- org.apache.commons:commons-compress:jar:1.9:compile
[INFO] +- org.tukaani:xz:jar:1.5:compile
[INFO] +- commons-codec:commons-codec:jar:1.9:compile
[INFO] +- org.apache.pdfbox:pdfbox:jar:1.8.9:compile
[INFO] |  +- org.apache.pdfbox:fontbox:jar:1.8.9:compile
[INFO] |  +- org.apache.pdfbox:jempbox:jar:1.8.9:compile
[INFO] |  \- commons-logging:commons-logging:jar:1.1.1:compile
[INFO] +- org.bouncycastle:bcmail-jdk15on:jar:1.52:compile
[INFO] |  \- org.bouncycastle:bcpkix-jdk15on:jar:1.52:compile
[INFO] +- org.bouncycastle:bcprov-jdk15on:jar:1.52:compile
[INFO] +- org.apache.poi:poi:jar:3.12-beta1:compile
[INFO] +- org.apache.poi:poi-scratchpad:jar:3.12-beta1:compile
[INFO] +- org.apache.poi:poi-ooxml:jar:3.12-beta1:compile
[INFO] |  \- org.apache.poi:poi-ooxml-schemas:jar:3.12-beta1:compile
[INFO] |     \- org.apache.xmlbeans:xmlbeans:jar:2.6.0:compile
[INFO] +- org.ccil.cowan.tagsoup:tagsoup:jar:1.2.1:compile
[INFO] +- org.ow2.asm:asm-debug-all:jar:4.1:compile
[INFO] +- com.googlecode.mp4parser:isoparser:jar:1.0.2:compile
[INFO] |  \- org.aspectj:aspectjrt:jar:1.8.0:compile
[INFO] +- com.drewnoakes:metadata-extractor:jar:2.8.0:compile
[INFO] |  \- com.adobe.xmp:xmpcore:jar:5.1.2:compile
[INFO] +- de.l3s.boilerpipe:boilerpipe:jar:1.1.0:compile
[INFO] +- rome:rome:jar:1.0:compile
[INFO] |  \- jdom:jdom:jar:1.0:compile
[INFO] +- org.gagravarr:vorbis-java-core:jar:0.6:compile
[INFO] +- com.googlecode.juniversalchardet:juniversalchardet:jar:1.0.3:compile
[INFO] +- org.codelibs:jhighlight:jar:1.0.2:compile
[INFO] +- com.pff:java-libpst:jar:0.8.1:compile
[INFO] +- com.github.junrar:junrar:jar:0.7:compile
[INFO] |  +- commons-logging:commons-logging-api:jar:1.1:compile
[INFO] |  \- org.apache.commons:commons-vfs2:jar:2.0:compile
[INFO] |     +- org.apache.maven.scm:maven-scm-api:jar:1.4:compile
[INFO] |     |  \- org.codehaus.plexus:plexus-utils:jar:1.5.6:compile
[INFO] |     \- org.apache.maven.scm:maven-scm-provider-svnexe:jar:1.4:compile
[INFO] |        +- org.apache.maven.scm:maven-scm-provider-svn-commons:jar:1.4:compile
[INFO] |        \- regexp:regexp:jar:1.3:compile
[INFO] +- org.xerial:sqlite-jdbc:jar:3.8.6:provided
[INFO] +- junit:junit:jar:4.11:test
[INFO] |  \- org.hamcrest:hamcrest-core:jar:1.3:test
[INFO] +- org.mockito:mockito-core:jar:1.7:test
[INFO] |  \- org.objenesis:objenesis:jar:1.0:test
[INFO] +- org.slf4j:slf4j-log4j12:jar:1.7.12:test
[INFO] |  \- log4j:log4j:jar:1.2.17:test
[INFO] +- edu.ucar:netcdf4:jar:4.5.5:compile
[INFO] |  +- net.jcip:jcip-annotations:jar:1.0:compile
[INFO] |  +- net.java.dev.jna:jna:jar:4.1.0:compile
[INFO] |  \- org.slf4j:slf4j-api:jar:1.7.12:compile
[INFO] +- edu.ucar:grib:jar:4.5.5:compile
[INFO] |  +- com.google.protobuf:protobuf-java:jar:2.5.0:compile
[INFO] |  +- org.jdom:jdom2:jar:2.0.4:compile
[INFO] |  +- org.jsoup:jsoup:jar:1.7.2:compile
[INFO] |  +- edu.ucar:jj2000:jar:5.2:compile
[INFO] |  \- org.itadaki:bzip2:jar:0.9.1:compile
[INFO] +- edu.ucar:cdm:jar:4.5.5:compile
[INFO] |  +- edu.ucar:udunits:jar:4.5.5:compile
[INFO] |  +- org.apache.httpcomponents:httpcore:jar:4.2.5:compile
[INFO] |  +- joda-time:joda-time:jar:2.2:compile
[INFO] |  +- org.quartz-scheduler:quartz:jar:2.2.0:compile
[INFO] |  |  \- c3p0:c3p0:jar:0.9.1.1:compile
[INFO] |  +- net.sf.ehcache:ehcache-core:jar:2.6.2:compile
[INFO] |  \- com.beust:jcommander:jar:1.35:compile
[INFO] +- edu.ucar:httpservices:jar:4.5.5:compile
[INFO] |  +- org.apache.httpcomponents:httpclient:jar:4.2.6:compile
[INFO] |  \- org.apache.httpcomponents:httpmime:jar:4.2.6:compile
[INFO] +- com.google.guava:guava:jar:11.0.2:compile
[INFO] |  \- com.google.code.findbugs:jsr305:jar:1.3.9:compile
[INFO] \- org.apache.commons:commons-csv:jar:1.0:compile
[INFO] 
[INFO] BUILD SUCCESS
[INFO] 
[INFO] Total time: 3.268s
[IN

Re: Java 1.6 support for Tika 1.9?

2015-04-29 Thread Hong-Thai Nguyen
Indeed, the artifact names bc*-jdk15on confused me; I thought that
these libs were for JDK 1.5 only.

Thanks,

On Wed, Apr 29, 2015 at 4:08 PM, Konstantin Gribov 
wrote:

> I don't see any, bouncycastle bc*-jdk15on deps are targeted for jdk 1.5-1.8
> (see https://www.bouncycastle.org/latest_releases.html).
>
> --
> Regards,
> Konstantin Gribov
>
> On Wed, 29 Apr 2015 at 16:43, Hong-Thai Nguyen wrote:
>
> > Hi folks,
> >
> > I'm +1 for announcing the end of JDK 1.6 support in the upcoming 1.9
> > release.
> >
> > FYI, we still have some legacy dependencies that appear to target only
> > JDK 1.5 (*jdk15*):
> >
> > $ mvn dependency:tree
> > [INFO] Scanning for projects...
> > [INFO]
> > [INFO]
> > 
> > [INFO] Building Apache Tika parsers 1.9-SNAPSHOT
> > [INFO]
> > 
> > [INFO]
> > [INFO] --- maven-dependency-plugin:2.1:tree (default-cli) @ tika-parsers
> > ---
> > [INFO] org.apache.tika:tika-parsers:bundle:1.9-SNAPSHOT
> > [INFO] +- org.osgi:org.osgi.core:jar:4.0.0:provided
> > [INFO] +- org.apache.tika:tika-core:jar:1.9-SNAPSHOT:compile
> > [INFO] +- org.apache.tika:tika-core:test-jar:tests:1.9-SNAPSHOT:test
> > [INFO] +- org.gagravarr:vorbis-java-tika:jar:0.6:compile
> > [INFO] +-
> > org.apache.felix:org.apache.felix.scr.annotations:jar:1.6.0:provided
> > [INFO] +- net.sourceforge.jmatio:jmatio:jar:1.0:compile
> > [INFO] +- org.apache.james:apache-mime4j-core:jar:0.7.2:compile
> > [INFO] +- org.apache.james:apache-mime4j-dom:jar:0.7.2:compile
> > [INFO] +- org.apache.commons:commons-compress:jar:1.9:compile
> > [INFO] +- org.tukaani:xz:jar:1.5:compile
> > [INFO] +- commons-codec:commons-codec:jar:1.9:compile
> > [INFO] +- org.apache.pdfbox:pdfbox:jar:1.8.9:compile
> > [INFO] |  +- org.apache.pdfbox:fontbox:jar:1.8.9:compile
> > [INFO] |  +- org.apache.pdfbox:jempbox:jar:1.8.9:compile
> > [INFO] |  \- commons-logging:commons-logging:jar:1.1.1:compile
> > [INFO] +- org.bouncycastle:bcmail-jdk15on:jar:1.52:compile
> > [INFO] |  \- org.bouncycastle:bcpkix-jdk15on:jar:1.52:compile
> > [INFO] +- org.bouncycastle:bcprov-jdk15on:jar:1.52:compile
> > [INFO] +- org.apache.poi:poi:jar:3.12-beta1:compile
> > [INFO] +- org.apache.poi:poi-scratchpad:jar:3.12-beta1:compile
> > [INFO] +- org.apache.poi:poi-ooxml:jar:3.12-beta1:compile
> > [INFO] |  \- org.apache.poi:poi-ooxml-schemas:jar:3.12-beta1:compile
> > [INFO] | \- org.apache.xmlbeans:xmlbeans:jar:2.6.0:compile
> > [INFO] +- org.ccil.cowan.tagsoup:tagsoup:jar:1.2.1:compile
> > [INFO] +- org.ow2.asm:asm-debug-all:jar:4.1:compile
> > [INFO] +- com.googlecode.mp4parser:isoparser:jar:1.0.2:compile
> > [INFO] |  \- org.aspectj:aspectjrt:jar:1.8.0:compile
> > [INFO] +- com.drewnoakes:metadata-extractor:jar:2.8.0:compile
> > [INFO] |  \- com.adobe.xmp:xmpcore:jar:5.1.2:compile
> > [INFO] +- de.l3s.boilerpipe:boilerpipe:jar:1.1.0:compile
> > [INFO] +- rome:rome:jar:1.0:compile
> > [INFO] |  \- jdom:jdom:jar:1.0:compile
> > [INFO] +- org.gagravarr:vorbis-java-core:jar:0.6:compile
> > [INFO] +-
> > com.googlecode.juniversalchardet:juniversalchardet:jar:1.0.3:compile
> > [INFO] +- org.codelibs:jhighlight:jar:1.0.2:compile
> > [INFO] +- com.pff:java-libpst:jar:0.8.1:compile
> > [INFO] +- com.github.junrar:junrar:jar:0.7:compile
> > [INFO] |  +- commons-logging:commons-logging-api:jar:1.1:compile
> > [INFO] |  \- org.apache.commons:commons-vfs2:jar:2.0:compile
> > [INFO] | +- org.apache.maven.scm:maven-scm-api:jar:1.4:compile
> > [INFO] | |  \- org.codehaus.plexus:plexus-utils:jar:1.5.6:compile
> > [INFO] | \-
> > org.apache.maven.scm:maven-scm-provider-svnexe:jar:1.4:compile
> > [INFO] |+-
> > org.apache.maven.scm:maven-scm-provider-svn-commons:jar:1.4:compile
> > [INFO] |\- regexp:regexp:jar:1.3:compile
> > [INFO] +- org.xerial:sqlite-jdbc:jar:3.8.6:provided
> > [INFO] +- junit:junit:jar:4.11:test
> > [INFO] |  \- org.hamcrest:hamcrest-core:jar:1.3:test
> > [INFO] +- org.mockito:mockito-core:jar:1.7:test
> > [INFO] |  \- org.objenesis:objenesis:jar:1.0:test
> > [INFO] +- org.slf4j:slf4j-log4j12:jar:1.7.12:test
> > [INFO] |  \- log4j:log4j:jar:1.2.17:test
> > [INFO] +- edu.ucar:netcdf4:jar:4.5.5:compile
> > [INFO] |  +- net.jcip:jcip-annotations:jar:1.0:compile
> > [INFO] |  +- net.java.dev.jna:jna:jar:4.1.0:compile
> > [INFO] |  \- org.slf4j:slf4j-api:jar:1.7.12:compile
> > [INFO] +- edu.ucar:grib:jar:4.5.5:compile
> 

Re: [VOTE] Release Apache Tika 1.9 Candidate #2

2015-06-09 Thread Hong-Thai Nguyen
Hi,

+1 for me too. Great work, Chris

Thanks,

On Tue, Jun 9, 2015 at 10:55 AM, Tyler Palsulich 
wrote:

> +1 from me. Thanks for running this, Chris!
>
>
> Tyler
>
> On Mon, Jun 8, 2015 at 1:11 PM Allison, Timothy B. 
> wrote:
>
> > +1
> >
> > Built in Windows and Linux.  Works on problems (that I caused!) in rc1.
> >
> > Let's make sure to include "last Java 1.6" version in the release notes,
> > if that's what we've decided.
> >
> > Thank you, Chris!
> >
> > Best,
> >
> >Tim
> >
> >
> > -Original Message-
> > From: Mattmann, Chris A (3980) [mailto:chris.a.mattm...@jpl.nasa.gov]
> > Sent: Saturday, June 06, 2015 9:47 PM
> > To: dev@tika.apache.org
> > Cc: u...@tika.apache.org
> > Subject: [VOTE] Release Apache Tika 1.9 Candidate #2
> >
> > Hi Folks,
> >
> > A second candidate for the Tika 1.9 release is available at:
> >
> >   https://dist.apache.org/repos/dist/dev/tika/
> >
> > The release candidate is a zip archive of the sources in:
> >   http://svn.apache.org/repos/asf/tika/tags/1.9-rc2/
> >
> > The SHA1 checksum of the archive is
> > 9b78c9e9ce9640b402b7fef8e30f3cdbe384f44c.
> >
> > In addition, a staged maven repository is available here:
> > https://repository.apache.org/content/repositories/orgapachetika-1011/
> >
> >
> > Please vote on releasing this package as Apache Tika 1.9.
> > The vote is open for the next 72 hours and passes if a majority of at
> > least three +1 Tika PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Tika 1.9
> > [ ] -1 Do not release this package because…
> >
> > Cheers,
> > Chris
> >
> > P.S. Of course here is my +1.
> >
> >
> > ++
> > Chris Mattmann, Ph.D.
> > Chief Architect
> > Instrument Software and Science Data Systems Section (398)
> > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> > Office: 168-519, Mailstop: 168-527
> > Email: chris.a.mattm...@nasa.gov
> > WWW:  http://sunset.usc.edu/~mattmann/
> > ++
> > Adjunct Associate Professor, Computer Science Department
> > University of Southern California, Los Angeles, CA 90089 USA
> > ++
> >
> >
> >
>



-- 
---
Hong-Thai NGUYEN
Tel.: 06 27 04 86 22


RE: [VOTE] Apache Tika 1.10 Release Candidate #1

2015-08-05 Thread Hong-Thai Nguyen
+1 for me
Built on Windows, tested with an internal corpus. There's no regression.
Better still, more ppt documents are converted than with 1.9.

Great job, David and others!

Thanks

Hong-Thai

-----Original Message-----
From: Tyler Palsulich [mailto:tpalsul...@gmail.com]
Sent: Tuesday, August 4, 2015 18:43
To: dev@tika.apache.org
Subject: Re: [VOTE] Apache Tika 1.10 Release Candidate #1

Everything looks good to me! +1

Thanks, Dave!

Tyler

On Tue, Aug 4, 2015, 6:48 AM Ken Krugler 
wrote:

> +1
>
> Built on Mac, tested with Bixo.
>
> -- Ken
>
> > From: David Meikle
> > Sent: August 2, 2015 12:15:24am PDT
> > To: dev@tika.apache.org; u...@tika.apache.org
> > Subject: [VOTE] Apache Tika 1.10 Release Candidate #1
> >
> > Hi Everyone,
> >
> > A candidate for the Apache Tika 1.10 release is available at:
> >
> > https://dist.apache.org/repos/dist/dev/tika/
> >
> > The release candidate is a zip archive of the sources in:
> >
> > http://svn.apache.org/repos/asf/tika/tags/1.10-rc1/
> >
> > The SHA1 checksum of the archive is
> >
> > b1573adcb194e2c09b77eccc3b1edd16bd4ac67d.
> >
> > In addition, a staged maven repository is available here:
> >
> > https://repository.apache.org/content/repositories/orgapachetika-1013
> >
> >
> > Please vote on releasing this package as Apache Tika 1.10.
> > The vote is open for the next 72 hours and passes if a majority of at
> > least three +1 Tika PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Tika 1.10
> >
> > [ ] -1 Do not release this package because...
> >
> > Here is my +1!
> >
> > Cheers,
> > Dave
>
> --
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr


Re: [DISCUSS] Moving to Git

2015-11-18 Thread Hong-Thai Nguyen
+1 for me

Thanks,

HT

On Wed, Nov 18, 2015 at 3:46 PM, Mattmann, Chris A (3980) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> Hey Team,
>
> I propose we move to writeable git repos for Tika for our repository.
> I mostly interact with Git & Github nowadays even with Tika using the
> mirroring and PR interaction support.
>
> Thoughts?
>
> Cheers,
> Chris
>
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++
>
>
>
>


-- 
---
Hong-Thai NGUYEN
Tel.: 06 27 04 86 22


Would like to become a committer

2013-07-31 Thread Hong-Thai Nguyen
Hi all,
I'm currently working at PolySpot, a provider of search engine solutions. Tika
is one of the components on the connector side, used to extract content from
many kinds of files.
We frequently have to upgrade Tika within our product, test it, and
occasionally fix Tika parsing bugs. In the meantime, we have to publish
temporary builds to our local repository while waiting for the next official
Tika release.

Given this synergy, I would like to become a committer on the Tika project.

Regards,

Hong-Thai Nguyen, PhD
R&D Engineer
DDI: +33 (0)1 77 75 73 15
Mob: +33 (0)6 27 04 86 22
Skype: thaichat04
hong-thai.ngu...@polyspot.com
79, rue du Faubourg Poissonnière
75009 Paris - France
Access map<http://g.co/maps/3e53>



RE: Apache tika installation issue

2013-09-27 Thread Hong-Thai Nguyen
Eclipse Juno already has the m2e plugin integrated; with Eclipse Indigo, you
must install it from the Eclipse Marketplace.
Alternatively, you can run 'mvn eclipse:clean eclipse:eclipse' to generate the
appropriate Eclipse files, then import them as normal Java projects into your
workspace.

Hope it helps.

Hong-Thai

-----Original Message-----
From: olegtikho...@gmail.com [mailto:olegtikho...@gmail.com] On Behalf Of Oleg
Tikhonov
Sent: Friday, September 27, 2013 10:52
To: dev@tika.apache.org
Subject: Re: Apache tika installation issue

Hi,

if you meant "how to import" Tika's project, then here are the steps:

1. In Eclipse --> File --> Import ...
2. Choose "Existing Maven Project", click Next;
3. Point to the Tika project by clicking the Browse button, say tika-core
4. Next, click on "Finish".

That's it.

Hope it helps.

BR,
Oleg





On Fri, Sep 27, 2013 at 9:48 AM, Mattmann, Chris A (398J) < 
chris.a.mattm...@jpl.nasa.gov> wrote:

> Dear Sudheer,
>
> Did you receive a reply to your question?
>
> Cheers,
> Chris
>
> ++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Adjunct Assistant Professor, Computer Science Department University of 
> Southern California, Los Angeles, CA 90089 USA
> ++
>
>
>
>
>
>
> -Original Message-
> From: sudheer y 
> Date: Tuesday, September 17, 2013 12:02 AM
> To: "dev-ow...@tika.apache.org" 
> Subject: Apache tika installation issue
>
> >Dear Experts,
> >
> >
> >Can you give step by step guide to install apache tika in eclipse 
> >using maven on windows.
> >
> >
> >
> >--
> >Thanks & Best Regards,
> >Sudheer Kumar Y
> >
> >Software Engineer
> >
> >DATAHUB SOFTWARE INDIA PVT LTD. | MAKING IT POSSIBLE
> >Mobile: +91 8143161684
> >
> >Email: sudhe...@datahubsoftware.in
> >WEB : www.datahubsoftware.com 
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
>
>


NonSequentialPDFParser

2013-12-02 Thread Hong-Thai Nguyen
Hi all,
NonSequentialPDFParser may improve parsing performance on PDF extraction by 45%.
Should we integrate it into Tika?
https://issues.apache.org/jira/browse/PDFBOX-1104

Thanks,

Hong-Thai



RE: NonSequentialPDFParser

2013-12-02 Thread Hong-Thai Nguyen
Maruan's latest comment clarifies things about this new parser:
https://issues.apache.org/jira/browse/PDFBOX-1787?focusedCommentId=13836591&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13836591


Hong-Thai


-----Original Message-----
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Monday, December 2, 2013 15:59
To: dev@tika.apache.org
Subject: RE: NonSequentialPDFParser

Does the speedup only help if you are trying to parse an individual page vs the 
entire document?  If so, is partial parsing a use case for Tika?  If this has 
the same performance on the full document as the regular parser, does it have 
lower memory overhead?

-----Original Message-----
From: Hong-Thai Nguyen [mailto:hong-thai.ngu...@polyspot.com] 
Sent: Monday, December 02, 2013 9:18 AM
To: dev@tika.apache.org
Subject: NonSequentialPDFParser

Hi all,
NonSequentialPDFParser may improve parsing performance on PDF extraction by 45%.
Should we integrate it into Tika?
https://issues.apache.org/jira/browse/PDFBOX-1104

Thanks,

Hong-Thai



Tika 1.5 release ?

2013-12-19 Thread Hong-Thai Nguyen
Hi,
Do you have any idea when we will release 1.5 ?

Thanks

Hong-Thai



Extract thumbnail from openxml office files

2014-01-08 Thread Hong-Thai Nguyen
Hi all,
I want to extract the thumbnail image embedded in Open XML office files.
Apparently, we can do it with openxml4j:
http://openxmldeveloper.org/blog/b/openxmldeveloper/archive/2006/11/21/openxmlandjava.aspx
The question is: should we include the thumbnail in the default metadata list
of the OOXML parsing result?


Thanks

Hong-Thai
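
To make the idea concrete: an OOXML file is a ZIP package, and Office commonly stores the preview image under an entry such as `docProps/thumbnail.jpeg`. The sketch below relies on that naming convention as an assumption (the authoritative location is declared in the package relationships) and uses only `java.util.zip`; it is neither the openxml4j API nor Tika code, and `OoxmlThumbnail` is a made-up name.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Locale;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

class OoxmlThumbnail {
    /**
     * Scans an OOXML package (a ZIP archive) and returns the bytes of the first
     * entry whose name starts with "docProps/thumbnail", or null if none exists.
     */
    static byte[] findThumbnail(InputStream ooxml) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(ooxml)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName().toLowerCase(Locale.ROOT);
                if (name.startsWith("docprops/thumbnail")) {
                    // Copy the current entry's content into memory.
                    ByteArrayOutputStream out = new ByteArrayOutputStream();
                    byte[] buffer = new byte[8192];
                    int read;
                    while ((read = zip.read(buffer)) != -1) {
                        out.write(buffer, 0, read);
                    }
                    return out.toByteArray();
                }
            }
        }
        return null; // no thumbnail entry found under the assumed name
    }
}
```

The image type is not inspected here; the entry's extension (jpeg, emf, wmf, ...) would have to be carried along if the caller needs a media type.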



RE: Extract thumbnail from openxml office files

2014-01-09 Thread Hong-Thai Nguyen
Hi Ray & all,

Searching the issue tracker, I found that an issue already exists:
https://issues.apache.org/jira/browse/TIKA-90
Maybe now is the time to implement it.

Thanks,

Hong-Thai

-----Original Message-----
From: Ray Gauss II [mailto:ray.ga...@alfresco.com]
Sent: Wednesday, January 8, 2014 11:49
To: dev@tika.apache.org
Subject: Re: Extract thumbnail from openxml office files

Hi Hong-Thai,

It’s certainly worth investigating.  Several other formats can have embedded 
thumbnails as well so we could implement a generic thumbnail property.

We could probably store as something like a Base64 encoded string, but we’d 
likely want to place limits on the size and may need a thumbnail internet media 
type field as well to assist in decoding.

Unless others feel differently, I would say open a JIRA where we could start 
discussing the design of such a feature.

Thanks!

Ray


On January 8, 2014 at 5:36:32 AM, Hong-Thai Nguyen 
(hong-thai.ngu...@polyspot.com) wrote:
>  
> Hi all,
> I want to extract the thumbnail image embedded in Open XML office files.
> Apparently, we can do it with openxml4j:
> http://openxmldeveloper.org/blog/b/openxmldeveloper/archive/2006/11/21/openxmlandjava.aspx
> The question is: should we include the thumbnail in the default metadata
> list of the OOXML parsing result?
>  
>  
> Thanks
>  
> Hong-Thai
>  
>  



RE: Extract thumbnail from openxml office files

2014-01-09 Thread Hong-Thai Nguyen
Hi Nick,
You're beginning a very interesting topic about the foundation of our metadata
concept :)
I agree with you that metadata is not the best place to store the thumbnail
result. Until now, our metadata has been a simple map of key:values. This
structure is not really flexible in some cases. For example, we may want to
store authors' information, where each author has a first name and a last name.
Ideally, we could have something like a struct:
Person:
  FirstName
  LastName

Another example is our future thumbnail. We could have a 'thumbnail' metadata
entry with a hierarchical structure like:
Thumbnail:
  Dimension
  Width
  Length
  MimeType
  Extension
  Pages
  Description

That needs a huge refactoring of our core model. Another solution is to keep
the thumbnail result as a List instead of a single value, where each element
is the thumbnail of one page. If the list has only 1 element, it means there's
only a thumbnail of the first page.

Hong-Thai
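
The difference between the flat key:value map and the structured alternatives can be sketched in plain Java. Everything here is hypothetical: `Person`, `Thumbnail`, the index-suffixed key names and the sample values are illustrations of the idea above, not Tika's `Metadata` class or any agreed design.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MetadataShapes {
    /** Hypothetical structured value: an author with two name parts. */
    static final class Person {
        final String firstName;
        final String lastName;

        Person(String firstName, String lastName) {
            this.firstName = firstName;
            this.lastName = lastName;
        }
    }

    /** Hypothetical thumbnail descriptor using some of the fields listed above. */
    static final class Thumbnail {
        final int width;
        final int length;
        final String mimeType;

        Thumbnail(int width, int length, String mimeType) {
            this.width = width;
            this.length = length;
            this.mimeType = mimeType;
        }
    }

    /**
     * Squeezes structured authors into a flat key:value map. Index-suffixed
     * keys are needed to keep each first name paired with its last name.
     */
    static Map<String, String> flatten(List<Person> authors) {
        Map<String, String> flat = new LinkedHashMap<>();
        for (int i = 0; i < authors.size(); i++) {
            flat.put("Person[" + i + "].FirstName", authors.get(i).firstName);
            flat.put("Person[" + i + "].LastName", authors.get(i).lastName);
        }
        return flat;
    }

    /** List alternative: element i is the thumbnail of page i + 1. */
    static boolean coversFirstPageOnly(List<Thumbnail> thumbnails) {
        return thumbnails.size() == 1;
    }

    public static void main(String[] args) {
        List<Person> authors = new ArrayList<>();
        authors.add(new Person("Ada", "Lovelace")); // made-up sample data
        authors.add(new Person("Alan", "Turing"));
        System.out.println(flatten(authors));

        List<Thumbnail> thumbnails = new ArrayList<>();
        thumbnails.add(new Thumbnail(256, 192, "image/jpeg"));
        System.out.println(coversFirstPageOnly(thumbnails));
    }
}
```

The flattening step shows why the flat map gets awkward: the pairing between fields survives only through a key-naming convention that every consumer has to know.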

-----Original Message-----
From: Nick Burch [mailto:apa...@gagravarr.org]
Sent: Thursday, January 9, 2014 12:11
To: dev@tika.apache.org
Subject: RE: Extract thumbnail from openxml office files

On Thu, 9 Jan 2014, Hong-Thai Nguyen wrote:
> By searching on issues, I found the issue already created: 
> https://issues.apache.org/jira/browse/TIKA-90

I'm not sure if the metadata is the right place to return this. Some formats 
offer a small thumbnail, others can offer a small thumbnail for every page, and 
at least one can include a full-size image of the first page.

Would we not be better off exposing these embedded renderings via the existing 
embedded resources handling, with some sort of handy way to identify what 
something is (eg this is a full-size PNG of page 1, this is a jpg thumbnail of 
page 3)?

Nick


RE: Extract thumbnail from openxml office files

2014-01-09 Thread Hong-Thai Nguyen
I'm convinced that using embedded resources is a better solution. Thanks, Nick.
@Matt, I wasn't aware that we had already discussed the metadata structure.
Interesting.

We should adapt the TIKA-90 title & description. I hope to get this work
started.

Hong-Thai


-----Original Message-----
From: Nick Burch [mailto:apa...@gagravarr.org]
Sent: Thursday, January 9, 2014 15:25
To: dev@tika.apache.org
Subject: RE: Extract thumbnail from openxml office files

On Thu, 9 Jan 2014, Hong-Thai Nguyen wrote:
> I agree with you that metadata is not the best place to store 
> thumbnail result. Until now, our metadata is simple map with 
> key:values. This structure is not really flexiable in some cases.

Currently, we have four kinds of "things" that we return for content:
  * Type
  * Metadata
  * Content, as xhtml
  * Any resources embedded in it (eg nested documents, images etc)

I'm not disputing that our Metadata setup could use some more work to make it 
richer (within reason!), what I'm not sure is that an expanded metadata system 
is the right place to put thumbnails and full-page renderings. 
Those feel a lot more like embedded resources to me

> An other example is for our futur thumbnail. If we can have a metadata 
> 'thumbnail' with hierarchical structure like:
>
> Thumbnail:
>   Dimension
>   Width
>   Length
>   MimeType
>   Extension
>   Pages
>   Description

If we returned the thumbnail as an embedded resource, you'd get the type + full 
metadata on the image (not just width/length), along with extension etc. If we 
had a common naming scheme for them, possibly with some custom metadata keys, 
we could return the page number it applies to, along with if it's a thumbnail 
or a full size rendering (some formats have one, the other, or both)

Are you able to explain how your scheme would be simpler and easier to use than 
returning them as embedded resources?

Nick


RE: Extract thumbnail from openxml office files

2014-01-09 Thread Hong-Thai Nguyen
Thanks a lot, Nick. That's a great reference. BTW, I may be wrong, but
thumbnail handling in Alfresco seems quite complex because Alfresco can call
external thumbnail generation with PDFBox or PDFRender. I'm defining the DoD
in TIKA-90 by retaining some of the main features from it.
Could you point me to an example of returning an embedded document in Tika
parsers?

Thanks

Hong-Thai


-----Original Message-----
From: Nick Burch [mailto:apa...@gagravarr.org]
Sent: Thursday, January 9, 2014 15:49
To: dev@tika.apache.org
Subject: RE: Extract thumbnail from openxml office files

On Thu, 9 Jan 2014, Hong-Thai Nguyen wrote:
> I'm convinced that using embedded resources is a better solution.

OK, sounds like we have a consensus and can go ahead with it, great!

One outstanding query is what name we should give to these when we return them 
as embedded resources, and if we should include a special key/value in the 
metadata that we send with them to identify them?

The source code for Alfresco has examples of extracting thumbnails and full 
images from a number of formats, along with tests. Firstly this could be a good 
source of inspiration of what formats to go for, and how to do it. Secondly, 
with a number of Alfrescans involved in the project, we might even be able to 
get the key bits of logic from the code + tests contributed into Tika, to speed 
things up :)

Nick


Passing to FEST for JUnit tests ?

2014-01-17 Thread Hong-Thai Nguyen
Dear all,

FEST (https://code.google.com/p/fest/ ) syntax is much more intuitive than
JUnit's. We would replace:
assertEquals("result", gettingResult());
by
assertThat(gettingResult()).isEqualTo("result");

We could progressively replace the assertions in our tests.

Hong-Thai
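
For readers who have not seen the fluent style: the assertion is chained off the object returned by `assertThat`. The following is a tiny stand-in written against the JDK only, to show the mechanism; it is not the FEST library, and `StringAssert`, `startsWith` and the `gettingResult()` placeholder are assumptions made for illustration.

```java
import java.util.Objects;

class FluentAssertSketch {
    /** Minimal fluent assertion object; a stand-in, NOT the FEST library. */
    static final class StringAssert {
        private final String actual;

        StringAssert(String actual) {
            this.actual = actual;
        }

        StringAssert isEqualTo(String expected) {
            if (!Objects.equals(actual, expected)) {
                throw new AssertionError("expected <" + expected + "> but was <" + actual + ">");
            }
            return this; // returning this is what allows chaining
        }

        StringAssert startsWith(String prefix) {
            if (actual == null || !actual.startsWith(prefix)) {
                throw new AssertionError("expected <" + actual + "> to start with <" + prefix + ">");
            }
            return this;
        }
    }

    static StringAssert assertThat(String actual) {
        return new StringAssert(actual);
    }

    /** Placeholder for the code under test. */
    static String gettingResult() {
        return "result";
    }

    public static void main(String[] args) {
        // Reads left to right: actual value first, then the expectations.
        assertThat(gettingResult()).isEqualTo("result").startsWith("res");
    }
}
```

The actual value always comes first, which avoids the expected/actual argument-order mix-ups that `assertEquals` allows.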



RE: Passing to FEST for JUnit tests ?

2014-01-20 Thread Hong-Thai Nguyen
Only the syntax is more fluent; nothing changes in your IDE.
More about Fest vs Junit: 
http://maciejwalkowiak.pl/blog/2012/03/23/better-unit-tests-with-fest-assert/

 
Hong-Thai


-----Original Message-----
From: Konstantin Gribov [mailto:gros...@gmail.com]
Sent: Saturday, January 18, 2014 09:01
To: dev@tika.apache.org
Subject: Re: Passing to FEST for JUnit tests ?

Does it give something more than just fluent interface? Does it integrate to 
IDEs as good as JUnit?

--
Best regards,
Konstantin Gribov.


2014/1/17 Hong-Thai Nguyen 

> Dear all,
>
> FEST (https://code.google.com/p/fest/ ) syntax is much more intuitive than
> JUnit's. We would replace:
> assertEquals("result", gettingResult());
> by
> assertThat(gettingResult()).isEqualTo("result");
>
> We could progressively replace the assertions in our tests.
>
> Hong-Thai
>
>


RE: [VOTE] Apache Tika 1.5 RC1

2014-02-05 Thread Hong-Thai Nguyen
+1 for me.
Just one remark: tika-java7 is not declared as a submodule, so its version is
still 1.5-SNAPSHOT.

Hong-Thai


-----Original Message-----
From: David Meikle [mailto:loo...@gmail.com]
Sent: Wednesday, February 5, 2014 02:59
To: dev@tika.apache.org
Cc: u...@tika.apache.org
Subject: [VOTE] Apache Tika 1.5 RC1

Hi Guys,

A candidate for the Tika 1.5 release is now available at:
http://people.apache.org/~dmeikle/tika-1.5-rc1/

The release candidate is a zip archive of the sources in:
http://svn.apache.org/repos/asf/tika/tags/1.5-rc1/

The SHA1 checksum of the archive is:
66adb7e73058da73a055a823bd61af48129c1179

A staged M2 repository can also be found on repository.apache.org here:
https://repository.apache.org/content/repositories/orgapachetika-1000

Please vote on releasing this package as Apache Tika 1.5.
The vote is open for the next 72 hours and passes if a majority of at least 
three +1 Tika PMC votes are cast.

   [ ] +1 Release this package as Apache Tika 1.5
   [ ] -1 Do not release this package because...

Here is my +1 for the release.

Cheers,
Dave


RE: [VOTE] Apache Tika 1.5 RC1

2014-02-05 Thread Hong-Thai Nguyen
Use 'mvn clean install -U'

Hong-Thai


-Original Message-
From: Julien Nioche [mailto:lists.digitalpeb...@gmail.com]
Sent: Wednesday, February 5, 2014 10:50
To: dev@tika.apache.org
Subject: Re: [VOTE] Apache Tika 1.5 RC1

Hi Dave

Am trying to compile from src and am getting

[ERROR] The build could not read 1 project -> [Help 1]
[ERROR]
[ERROR]   The project org.apache.tika:tika-java7:1.5-SNAPSHOT (/data/tika-1.5/tika-java7/pom.xml) has 1 error
[ERROR]     Non-resolvable parent POM: Could not find artifact org.apache.tika:tika-parent:pom:1.5-SNAPSHOT and 'parent.relativePath' points at wrong local POM @ line 25, column 11 -> [Help 2]
[ERROR]

*mvn -version*
Apache Maven 3.0.4
Maven home: /usr/share/maven
Java version: 1.7.0_21, vendor: Oracle Corporation
Java home: /usr/lib/jvm/java-7-openjdk-amd64/jre
Default locale: en_GB, platform encoding: UTF-8
OS name: "linux", version: "3.5.0-17-generic", arch: "amd64", family: "unix"

Am I missing something?

Julien



On 5 February 2014 01:59, David Meikle  wrote:

> Hi Guys,
>
> A candidate for the Tika 1.5 release is now available at:
> http://people.apache.org/~dmeikle/tika-1.5-rc1/
>
> The release candidate is a zip archive of the sources in:
> http://svn.apache.org/repos/asf/tika/tags/1.5-rc1/
>
> The SHA1 checksum of the archive is:
> 66adb7e73058da73a055a823bd61af48129c1179
>
> A staged M2 repository can also be found on repository.apache.org here:
> https://repository.apache.org/content/repositories/orgapachetika-1000
>
> Please vote on releasing this package as Apache Tika 1.5.
> The vote is open for the next 72 hours and passes if a majority of at 
> least three +1 Tika PMC votes are cast.
>
>[ ] +1 Release this package as Apache Tika 1.5
>[ ] -1 Do not release this package because...
>
> Here is my +1 for the release.
>
> Cheers,
> Dave




-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


RE: [VOTE] Apache Tika 1.5 RC2

2014-02-10 Thread Hong-Thai Nguyen
+1

Hong-Thai

-Original Message-
From: Julien Nioche [mailto:lists.digitalpeb...@gmail.com]
Sent: Monday, February 10, 2014 11:21
To: dev@tika.apache.org
Subject: Re: [VOTE] Apache Tika 1.5 RC2

Hi Dave,

+1 from me. Compiled fine on Linux Mint + tested Maven artefacts with
Behemoth and ran a parse without problems.

Thanks for doing this.

Julien


On 9 February 2014 22:53, Dave Meikle  wrote:

> Hi Guys,
>
> A new release candidate for the Tika 1.5 release is now available at:
> http://people.apache.org/~dmeikle/tika-1.5-rc2/
>
> This fixes the issues with the POM version numbers for tika-dotnet and
> tika-java7 in Tika 1.5 RC1.
>
> The release candidate is a zip archive of the sources in:
> http://svn.apache.org/repos/asf/tika/tags/1.5-rc2/
>
> The SHA1 checksum of the archive is:
> f9a3c04dc3d1ce27742d0db7b8c171bbd89063b6
>
> A staged M2 repository can also be found on repository.apache.org here:
> https://repository.apache.org/content/repositories/orgapachetika-1002
>
> Please vote on releasing this package as Apache Tika 1.5.
> The vote is open for the next 72 hours and passes if a majority of at 
> least three +1 Tika PMC votes are cast.
>
>[ ] +1 Release this package as Apache Tika 1.5
>[ ] -1 Do not release this package because...
>
> Here is my +1 for the release.
>
> Cheers,
> Dave
>



-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


RE: Build failure at trunk in org.apache.tika.server.UnpackerResourceTest

2014-02-26 Thread Hong-Thai Nguyen
It fails not only on Linux but on Windows too. However, the tests pass when run directly from the IDE.

Hong-Thai

-Original Message-
From: Allison, Timothy B. [mailto:talli...@mitre.org]
Sent: Wednesday, February 26, 2014 15:05
To: dev@tika.apache.org
Subject: RE: Build failure at trunk in
org.apache.tika.server.UnpackerResourceTest

Failure here too.  My last successful pull and build occurred Feb 19.

-Original Message-
From: Nick Burch [mailto:apa...@gagravarr.org] 
Sent: Tuesday, February 25, 2014 8:14 PM
To: dev@tika.apache.org
Subject: Re: Build failure at trunk in 
org.apache.tika.server.UnpackerResourceTest

On Tue, 25 Feb 2014, Ken Krugler wrote:
> Failed tests:   testText(org.apache.tika.server.UnpackerResourceTest)
>
>  testImageDOCX(org.apache.tika.server.UnpackerResourceTest): 
> expected:<[5516590467b069fa59397432677bad4d]> but 
> was:<[bfb451ca6aa8f5a5095afd5228034e6a]>
>  testImageXSL(org.apache.tika.server.UnpackerResourceTest): 
> expected:<[68ead8f4995a3555f48a2f738b2b0c3d]> but 
> was:<[55a8207752d0406ddf6966d618fe9132]>
>  testDocPictureNoOle(org.apache.tika.server.UnpackerResourceTest): 
> expected:<[b27a41d12c646d7fc4f3826cf8183c68]> but 
> was:<[5ef5e3afe31eabcff004df6080458e9b]>
>
> This is on a Mac OS X 10.8.5, java version "1.6.0_65"

Same here on Linux, OpenJDK 1.6.0_27

Nick


RE: Build failure at trunk in org.apache.tika.server.UnpackerResourceTest

2014-02-27 Thread Hong-Thai Nguyen
This is caused by two major bugs (https://issues.apache.org/jira/browse/COMPRESS-262
and https://issues.apache.org/jira/browse/COMPRESS-264) in commons-compress 1.7,
to which we upgraded recently.
The zip files returned by the server seem to be correct, but we are unable to read them. I
switched the test to use ZipFile instead of ZipFileInputStream to fix it.

I will push the fix soon.

Hong-Thai
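
To illustrate the kind of change described above, here is a rough sketch of reading an archive through a ZipFile, which uses the central directory at the end of the file, rather than streaming through it entry by entry. It is an assumption-laden stand-in: it uses the JDK's java.util.zip classes instead of commons-compress, and the class name, entry name and temp-file handling are invented for the example. The MD5 step mirrors the fact that the failing tests compared MD5 digests of unpacked entries.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Enumeration;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

// Sketch only: JDK zip classes stand in for the commons-compress ones.
final class ZipFileReadSketch {

    /** Writes a one-entry archive to a temp file so the example is self-contained. */
    static File writeSampleZip() {
        try {
            File zip = File.createTempFile("unpacker-sketch", ".zip");
            zip.deleteOnExit();
            try (ZipOutputStream out = new ZipOutputStream(new FileOutputStream(zip))) {
                out.putNextEntry(new ZipEntry("image1.txt")); // hypothetical entry name
                out.write("hello".getBytes(StandardCharsets.UTF_8));
                out.closeEntry();
            }
            return zip;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads every entry through ZipFile and returns entry name -> MD5 hex digest. */
    static Map<String, String> md5ByEntry(File zip) {
        Map<String, String> result = new LinkedHashMap<>();
        try (ZipFile zf = new ZipFile(zip)) {
            Enumeration<? extends ZipEntry> entries = zf.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                try (InputStream in = zf.getInputStream(entry)) {
                    result.put(entry.getName(), md5Hex(in));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    /** Consumes the stream and returns its MD5 digest as lower-case hex. */
    static String md5Hex(InputStream in) throws IOException {
        MessageDigest md5;
        try {
            md5 = MessageDigest.getInstance("MD5");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            md5.update(buffer, 0, read);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md5.digest()) {
            hex.append(String.format("%02x", b & 0xff));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        System.out.println(md5ByEntry(writeSampleZip()));
    }
}
```

Reading through a ZipFile needs the whole archive on disk (or otherwise seekable), which is usually fine in a test that has already received the complete server response.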


-Original Message-
From: David Meikle [mailto:loo...@gmail.com]
Sent: Wednesday, February 26, 2014 17:34
To: dev@tika.apache.org
Subject: Re: Build failure at trunk in
org.apache.tika.server.UnpackerResourceTest

Hi,

On 26 Feb 2014, at 14:57, Nick Burch  wrote:

> Is buildbot configured to build that module? Or does it perhaps skip the 
> server module?
> 
> Nick

The build is configured to build everything apart from the .NET module but does 
not appear to have triggered for the past few weeks using the @hourly SCM poll.

Cheers,
Dave


Tika 1.5 vs 1.4 testing

2014-03-03 Thread Hong-Thai Nguyen
Hi all,

I've run both versions on the same corpus. Here's the comparison:
||Tika||POI||PDFbox||Failed docs||
|1.4|3.9|1.8.1|92|
|1.5|3.10-beta2|1.8.4|182|
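
A per-format breakdown like the one that follows can be produced by grouping failures by file extension and exception message. The sketch below shows one possible way to do that; it is not the tool used for this report. The Failure class, the sample paths and the normalize() step (which strips identity hashes such as "@4d39a96c" so that identical errors group together) are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Sketch only: groups conversion failures as extension -> (message -> count).
final class FailureTallySketch {

    /** One failed conversion: the document path and the exception message. */
    static final class Failure {
        final String path;
        final String message;

        Failure(String path, String message) {
            this.path = path;
            this.message = message;
        }
    }

    /** Lower-cased file extension, or "" when the path has no dot. */
    static String extensionOf(String path) {
        int dot = path.lastIndexOf('.');
        return dot < 0 ? "" : path.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /** Drops identity hashes such as "@4d39a96c", which differ between runs. */
    static String normalize(String message) {
        return message.replaceAll("@[0-9a-f]+", "");
    }

    static Map<String, Map<String, Integer>> tally(List<Failure> failures) {
        Map<String, Map<String, Integer>> byExtension = new TreeMap<>();
        for (Failure f : failures) {
            String ext = extensionOf(f.path);
            Map<String, Integer> byMessage = byExtension.get(ext);
            if (byMessage == null) {
                byMessage = new TreeMap<>();
                byExtension.put(ext, byMessage);
            }
            String key = normalize(f.message);
            Integer count = byMessage.get(key);
            byMessage.put(key, count == null ? 1 : count + 1);
        }
        return byExtension;
    }

    public static void main(String[] args) {
        // Hypothetical sample data, not taken from the corpus in this thread.
        List<Failure> failures = Arrays.asList(
                new Failure("a.pdf", "Unable to extract PDF content"),
                new Failure("b.PDF", "Unexpected RuntimeException from Parser$1@4d39a96c"),
                new Failure("c.pdf", "Unexpected RuntimeException from Parser$1@1e59efa5"));
        System.out.println(tally(failures));
    }
}
```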

== TIKA 1.4 
- pdf (7)
   * (1) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
org.apache.tika.parser.ParserDecorator$1@4d39a96c
   * (3) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from 
org.apache.tika.parser.ParserDecorator$1@4d39a96c
   * (3) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: Unable to extract PDF content
- pptx (8)
   * (7) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: Error creating OOXML extractor
   * (1) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
org.apache.tika.parser.ParserDecorator$1@4db190a5
- doc (2)
   * (2) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
org.apache.tika.parser.ParserDecorator$1@6ddd7ea2
- ppt (40)
   * (39) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
org.apache.tika.parser.ParserDecorator$1@6ddd7ea2
   * (1) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from 
org.apache.tika.parser.ParserDecorator$1@6ddd7ea2
- xls (9)
   * (7) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
org.apache.tika.parser.ParserDecorator$1@6ddd7ea2
   * (2) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from 
org.apache.tika.parser.ParserDecorator$1@6ddd7ea2
- dwg (4)
   * (4) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: Unsupported AutoCAD drawing version: 
AC1014
- odp (2)
   * (2) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from 
org.apache.tika.parser.ParserDecorator$1@7286f080
- rtf (13)
   * (13) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
org.apache.tika.parser.ParserDecorator$1@455a7af4
- pps (5)
   * (5) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
org.apache.tika.parser.ParserDecorator$1@6ddd7ea2

== TIKA 1.5 
- pdf (16)
   * (10) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
org.apache.tika.parser.ParserDecorator$1@1e59efa5
   * (3) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from 
org.apache.tika.parser.ParserDecorator$1@1e59efa5
   * (3) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: Unable to extract PDF content
- pptx (19)
   * (7) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: Error creating OOXML extractor
   * (12) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
org.apache.tika.parser.ParserDecorator$1@2b195ebd
- doc (11)
   * (9) 
com.polyspot.document.converter.ConversionException: 
org.apache.tika.exception.TikaException: Unexpected RuntimeException from 
org.apache.tika.parser.ParserDecorator$1@7b796022
   * (2) 
com.polyspot.document.converter.ConversionException: org.xml.sax.SAXException: 
Namespace http://www.w3.org/1999/xhtml not declared
- ppt (47)
   * (46) 
com.polyspot.document.converter.ConversionE

Using guava on tika ?

2014-03-06 Thread Hong-Thai Nguyen
Hi all,

Guava (https://code.google.com/p/guava-libraries/) provides many utilities for
text, file, and collection manipulation. Should we use it in Tika?

Hong-Thai



RE: [ANNOUNCE] Apache Tika 1.5 Released

2014-03-06 Thread Hong-Thai Nguyen
Hi,
Could someone create the branch remotes/origin/1.5 on git?

Thanks

Hong-Thai


-Original Message-
From: David Meikle [mailto:loo...@gmail.com] On Behalf Of David Meikle
Sent: Wednesday, February 19, 2014 23:19
To: annou...@apache.org
Cc: dev@tika.apache.org; u...@tika.apache.org
Subject: [ANNOUNCE] Apache Tika 1.5 Released

The Apache Tika project is pleased to announce the release of Apache Tika 1.5. 
The release contents have been pushed out to the main Apache release site and 
to the Maven Central sync, so the releases should be available as soon as the 
mirrors get the syncs.

Apache Tika is a toolkit for detecting and extracting metadata and structured 
text content from various documents using existing parser libraries.

Apache Tika 1.5 contains a number of improvements and bug fixes. Details can be 
found in the changes file:
http://www.apache.org/dist/tika/CHANGES-1.5.txt

Apache Tika is available in source form from the following download page:
http://www.apache.org/dyn/closer.cgi/tika/apache-tika-1.5-src.zip

Apache Tika is also available in binary form or for use using Maven 2 from the 
Central Repository:
http://repo1.maven.org/maven2/org/apache/tika/

In the initial 48 hours, the release may not be available on all mirrors. When 
downloading from a mirror site, please remember to verify the downloads using 
signatures found on the Apache site:
https://people.apache.org/keys/group/tika.asc

For more information on Apache Tika, visit the project home page:
http://tika.apache.org/

-- Dave Meikle, on behalf of the Apache Tika community



RE: [ANNOUNCE] Apache Tika 1.5 Released

2014-03-06 Thread Hong-Thai Nguyen
I think users could maintain hotfixes based on a release branch while
waiting for the next release. We already have branches for old releases:
hong-thai.nguyen@HTN-PC /c/git/tika (trunk)
$ git branch -a
* trunk
  remotes/origin/0.1-incubating
  remotes/origin/0.10
  remotes/origin/0.2
  remotes/origin/0.3
  remotes/origin/0.4-rc1
  remotes/origin/0.4-rc2
  remotes/origin/0.5
  remotes/origin/0.6
  remotes/origin/0.7
  remotes/origin/0.8
  remotes/origin/0.9
  remotes/origin/0.x
  remotes/origin/1.2
  remotes/origin/1.3
  remotes/origin/1.4
  remotes/origin/HEAD -> origin/trunk
  remotes/origin/TIKA-204
  remotes/origin/trunk

Hong-Thai


-Original Message-
From: Jukka Zitting [mailto:jukka.zitt...@gmail.com]
Sent: Thursday, March 6, 2014 15:48
To: Tika Development
Subject: Re: [ANNOUNCE] Apache Tika 1.5 Released

Hi,

On Thu, Mar 6, 2014 at 8:27 AM, Hong-Thai Nguyen 
 wrote:
> Anyone can create branch remotes/origin/1.5  on git ?

Do we need a 1.5 branch?

BR,

Jukka Zitting


RE: Using guava on tika ?

2014-03-06 Thread Hong-Thai Nguyen
Thanks for the feedback.
There is nothing we can't do with our own code :) Guava is just a set of utilities
that make code clearer, shorter and sometimes faster.
I agree that this integration would bring in more dependencies and could create
conflicts in end-user applications. Let's leave things as they are for now.

Cheers,

Hong-Thai

-Original Message-
From: Jukka Zitting [mailto:jukka.zitt...@gmail.com]
Sent: Thursday, March 6, 2014 16:47
To: Tika Development
Subject: Re: Using guava on tika ?

Hi,

On Thu, Mar 6, 2014 at 6:54 AM, Nick Burch  wrote:
> On Thu, 6 Mar 2014, Hong-Thai Nguyen wrote:
>> Guava (https://code.google.com/p/guava-libraries/) provides many
>> utilities for text, file, and collection manipulation. Should we use it in Tika?
>
> Can you give an example of where using Guava would either simplify 
> some existing code, or improve its effectiveness, or permit something 
> we couldn't otherwise do?

Also, especially in tika-core we've explicitly avoided any external 
dependencies to keep it as simple and easy as possible to include as a 
dependency in client applications. We've even gone as far as including copies 
of some Commons IO classes in org.apache.tika.io instead of referring to 
commons-io as a dependency.

BR,

Jukka Zitting


RE: Add Outlook/PST files to supported formats on the web site?

2014-04-01 Thread Hong-Thai Nguyen
Yes, but only from 1.6: https://issues.apache.org/jira/browse/TIKA-623
I'm still finishing the work to return mails as extracted documents, as requested,
but we'll have this format in 1.6.

Hong-Thai


-Original Message-
From: Michael McCandless [mailto:luc...@mikemccandless.com]
Sent: Tuesday, April 1, 2014 13:42
To: dev@tika.apache.org
Subject: Add Outlook/PST files to supported formats on the web site?

We only seem to list mbox (Unix) email format:

https://tika.apache.org/1.5/formats.html

But Tika can also extract messages from Outlook's PST files?

Mike McCandless

http://blog.mikemccandless.com


Unable to commit SVN ?

2014-04-03 Thread Hong-Thai Nguyen
Hi Tika men,
I get a 500 error when committing to the Tika SVN. Does anyone else have the same problem?
POST request on '/repos/asf/!svn/me' failed: 500 Internal Server Error

Thanks,

Hong-Thai



RE: Tika VM Service

2014-04-10 Thread Hong-Thai Nguyen
Hi Tika members,

Thanks for this great initiative. I can see several possible use cases for
such a service:
1. Tika exploitation
We could set up a freely accessible Tika Server that parses documents coming
from public requests: a kind of demo or free-trial document parser for checking
how well Tika handles a user's particular documents. That makes sense because a
naive user would not have to download and install the latest snapshot build. We
should add some checks on incoming requests to reject abusive or spam requests.
This case provides a service similar to the any23.org site.

2. Tika parser development
"Tika users can do adhoc parsing" is a great idea. I think we could have an
"online IDE" for Tika parser development. There are two sub-scenarios:
2.1: Using existing parsers and adding new features (such as adding missing
parsed metadata or fixing bugs in the XHTML handler). This case does not need
any new library; the user can extend the relevant Parser and try it on test
documents. Using Groovy is one option, because it is a simple, Java-like language.
2.2: Creating a new parser. From parser development experience, a new parser
usually requires third-party libraries, so to build and run it with this online
service we would need to extend the classloader dynamically. If we really want
to support this use case, we could wrap the client's jars and classes as an OSGi
plugin, then load and execute it on the server side. I am not sure this scenario
makes much sense, since users can always check out, build, and develop a new
parser locally.

3. Tika parser library store
For some reasons (library incompatibility, license constraints ...), official
Tika cannot integrate some contributed parsers. This kind of service could store
those parsers so that anyone can download them and apply them in their own context.

In any case, this service requires resources and human effort to create and
maintain.

Hong-Thai

-Original Message-
From: Nick Burch [mailto:apa...@gagravarr.org]
Sent: Wednesday, April 9, 2014 06:32
To: dev@tika.apache.org
Subject: Re: Tika VM Service

On Tue, 8 Apr 2014, Lewis John Mcgibbney wrote:
> I would like to propose that we get a Tika service up and running on a VM.
> Tika users can do adhoc parsing, etc and can do this based on possibly 
> stable nightly SNAPSHOT's or alternatively based on the most recent 
> stable release.
> Preferably, the service should provide a list of parsers and also 
> MediaType's supported.

My vision of how this would work would be to use the Tika Server, with some 
extensions so that it self hosted some basic documentation. We're thinking of 
trying to start that tomorrow in the hackathon, any help / ideas / projects to 
crib off gratefully received!

Nick


RE: [DISCUSS] Nightly Jenkins Builds for Trunk

2014-05-20 Thread Hong-Thai Nguyen
And for Java 7 and above, we need a profile to activate building the 'tika-java7' module.

Hong-Thai

-Original Message-
From: Nick Burch [mailto:apa...@gagravarr.org]
Sent: Wednesday, May 14, 2014 18:30
To: dev@tika.apache.org
Subject: Re: [DISCUSS] Nightly Jenkins Builds for Trunk

On Wed, 14 May 2014, Lewis John Mcgibbney wrote:
> Right now in Jenkins (builds.apache.org) we don't seem to have a Tika
> project directory which contains the trunk build... it is just a free
> standing project buried under the mountain of jobs currently running
> on that box.

I believe that Buildbot is the main system being used for testing + nightly 
builds - http://ci.apache.org/builders/tika-trunk/

> Does anyone have an issue with me jumping on to the Jenkins job and 
> bringing it bang up to date with JDK7 (at least), provisioning a new 
> job for JDK8 until we get this stable and also publishing test output 
> for reference and review... finally running nightly builds which push 
> nightly SNAPSHOT's for consumption by developers?

Can you clarify what we'd get by using Jenkins instead of Buildbot? Is it 
easier to manage perhaps? Easier to setup for multiple Java versions?

As for JVM versions, we currently require 1.6, so we need to test on that.
Newer ones would be good too! But we mustn't lose the 1.6 which is our minimum
version...

Nick


Subcribe

2014-06-24 Thread Hong-Thai Nguyen
-- 
--
Hong-Thai


Build failed

2014-06-24 Thread Hong-Thai Nguyen
Hi all,
Sorry about my last mail, which was sent by mistake.

I'm unable to build the latest snapshot on my Windows machine. Any ideas?

Thanks


Tests in error:
  initializationError(org.apache.tika.bundle.BundleIT): Problem starting test container.

Tests run: 1, Failures: 0, Errors: 1, Skipped: 0

---
Test set: org.apache.tika.bundle.BundleIT
---
Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.711 sec <<< FAILURE!
initializationError(org.apache.tika.bundle.BundleIT)  Time elapsed: 0.004 sec  <<< ERROR!
org.ops4j.pax.exam.TestContainerException: Problem starting test container.
at org.ops4j.pax.exam.nat.internal.NativeTestContainer.start(NativeTestContainer.java:174)
at org.ops4j.pax.exam.spi.reactors.EagerSingleStagedReactor.<init>(EagerSingleStagedReactor.java:55)
at org.ops4j.pax.exam.spi.reactors.EagerSingleStagedReactorFactory.create(EagerSingleStagedReactorFactory.java:34)
at org.ops4j.pax.exam.spi.DefaultExamReactor.stage(DefaultExamReactor.java:83)
at org.ops4j.pax.exam.junit.JUnit4TestRunner.prepareReactor(JUnit4TestRunner.java:155)
at org.ops4j.pax.exam.junit.JUnit4TestRunner.<init>(JUnit4TestRunner.java:79)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at org.junit.internal.builders.AnnotatedBuilder.buildRunner(AnnotatedBuilder.java:29)
at org.junit.internal.builders.AnnotatedBuilder.runnerForClass(AnnotatedBuilder.java:21)
at org.junit.runners.model.RunnerBuilder.safeRunnerForClass(RunnerBuilder.java:59)
at org.junit.internal.builders.AllDefaultPossibilitiesBuilder.runnerForClass(AllDefaultPossibilitiesBuilder.java:26)
at org.junit.runners.model.RunnerBuilder.safeRunnerForClass(RunnerBuilder.java:59)
at org.junit.internal.requests.ClassRequest.getRunner(ClassRequest.java:26)
at org.apache.maven.surefire.junit4.JUnit4TestSet.execute(JUnit4TestSet.java:51)
at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:123)
at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:104)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.maven.surefire.util.ReflectionUtils.invokeMethodWithArray(ReflectionUtils.java:164)
at org.apache.maven.surefire.booter.ProviderFactory$ProviderProxy.invoke(ProviderFactory.java:110)
at org.apache.maven.surefire.booter.SurefireStarter.invokeProvider(SurefireStarter.java:175)
at org.apache.maven.surefire.booter.SurefireStarter.runSuitesInProcessWhenForked(SurefireStarter.java:107)
at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:68)
Caused by: org.osgi.framework.BundleException: Unable to cache bundle: mvn:org.junit/com.springsource.org.junit/4.4.0
at org.apache.felix.framework.Felix.installBundle(Felix.java:2703)
at org.apache.felix.framework.BundleContextImpl.installBundle(BundleContextImpl.java:165)
at org.apache.felix.framework.BundleContextImpl.installBundle(BundleContextImpl.java:138)
at org.ops4j.pax.exam.nat.internal.NativeTestContainer.installAndStartBundles(NativeTestContainer.java:272)
at org.ops4j.pax.exam.nat.internal.NativeTestContainer.start(NativeTestContainer.java:171)
... 27 more
Caused by: java.lang.IllegalStateException: Stream handler unavailable due to: null
at org.apache.felix.framework.URLHandlersStreamHandlerProxy.openConnection(URLHandlersStreamHandlerProxy.java:311)
at java.net.URL.openConnection(URL.java:945)
at org.apache.felix.framework.cache.JarRevision.initialize(JarRevision.java:150)
at org.apache.felix.framework.cache.JarRevision.<init>(JarRevision.java:77)
at org.apache.felix.framework.cache.BundleArchive.createRevisionFromLocation(BundleArchive.java:878)
at org.apache.felix.framework.cache.BundleArchive.reviseInternal(BundleArchive.java:550)
at org.apache.felix.framework.cache.BundleArchive.<init>(BundleArchive.java:153)
at org.apache.felix.framework.cache.BundleCache.create(BundleCache.java:277)
at org.apache.felix.framework.Felix.installBundle(Felix.java:2699)
... 31 more

--
Hong-Thai


Re: [DISCUSS] Give examples of Parser, Detector, and Translator usage

2014-08-07 Thread Hong-Thai Nguyen
Nice idea.

We could do more than samples: we could provide a Maven archetype for a parser,
detector or translator, a kind of template that quickly gives users a project in
which to develop a new one.

Regards,

Hong-Thai

> On 07 Aug 2014, at 18:56, Tyler Palsulich  wrote:
> 
> Hi All,
> 
> I think we should add some consolidated documentation on how to use Tika's
> Java API. It would be very helpful if we had short snippets of code that
> showed how exactly you can use Parser.parse(), for example. I think I
> remember a thread about testing example code a while back, but I'm not
> sure. We have some developer documentation on the site, but the user docs
> are somewhat lacking.
> 
> I can think of a few options:
> 
> *1) tika-example module*. This module would have example code of using each
> main interface of Tika. Simplicity and organization would be king, so new
> users can find exactly what they're looking for quickly. A big benefit of
> this is that unit tests would be baked in. I like this option. One downside
> is that reading source code in the browser is terrible (e.g. see [0]).
> 
> *2)* Examples section on the *wiki*. My impression is that the wiki is not
> as popular as the root website. And, it's also very easy to forget about
> and let go out of date. But, formatting and explanations would be pretty.
> 
> *3)* Examples section on the *website*. This has the benefit of pretty
> formatting and coloring, without the potential user having to check out the
> repo or view direct source in browser. Another benefit is this section
> would be perfect for showing how to use the tika-app jar.
> 
> Right now, I think the best option is a combination of 1 and 3. We get some
> end to end examples running in the tika-example module and short snippets
> of usage on an examples page of the website.
> 
> What do you guys think? What other options should we consider?
> 
> Tyler
> 
> [0] -
> http://svn.apache.org/repos/asf/tika/trunk/tika-core/src/main/java/org/apache/tika/parser/Parser.java


Re: [VOTE] Release Apache Tika 1.6 RC #2

2014-09-01 Thread Hong-Thai Nguyen
-1 from me, because tika-dotnet/pom.xml refers to the parent POM with a snapshot
version.

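<parent>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parent</artifactId>
  <version>1.6-SNAPSHOT</version>
  <relativePath>../tika-parent/pom.xml</relativePath>
</parent>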


On Mon, Sep 1, 2014 at 7:16 AM, Mattmann, Chris A (3980) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> Hi Folks,
>
> A candidate for the Tika 1.6 release is available at:
>
> http://people.apache.org/~mattmann/apache-tika-1.6/rc2/
>
> The release candidate is a zip archive of the sources in:
>
> http://svn.apache.org/repos/asf/tika/tags/1.6-rc2/
>
>
> The SHA1 checksum of the archive is
> 65644121446130fa29f1b62bcd75fb33344a6ba3.
>
> A Maven staging repository is at:
>
> https://repository.apache.org/content/repositories/orgapachetika-1004/
>
>
> Please vote on releasing this package as Apache Tika 1.6.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Tika PMC votes are cast.
>
> [ ] +1 Release this package as Apache Tika 1.6
> [ ] -1 Do not release this package because...
>
>
> Cheers,
> Chris
>
> P.S. Here is my +1!
>
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++
>
>
>
>
>


-- 
--
Hong-Thai


Re: [VOTE] Release Apache Tika 1.6 RC #2

2014-09-01 Thread Hong-Thai Nguyen
Hi Chris,

That seems fine to me, so I'm changing my vote to +1.

I tested this RC with our product and it works fine.

Hong-Thai

> On 01 Sep 2014, at 18:39, "Mattmann, Chris A (3980)" 
>  wrote:
> 
> Hi Hong-Thai,
> 
> tika-dotnet is not part of our standard build process, it's not
> even referenced in the pom.xml and isn't done yet?
> 
> How about we fix it in 1.7 but give this one a pass?
> 
> 
> Cheers,
> Chris
> 
> -Original Message-
> From: Hong-Thai Nguyen 
> Reply-To: "dev@tika.apache.org" 
> Date: Monday, September 1, 2014 6:53 AM
> To: "dev@tika.apache.org" 
> Subject: Re: [VOTE] Release Apache Tika 1.6 RC #2
> 
>> -1 for me because tika-dotnet/pom.xml refer to parent pom with a snapshot
>> version.
>> 
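>> <parent>
>>   <groupId>org.apache.tika</groupId>
>>   <artifactId>tika-parent</artifactId>
>>   <version>1.6-SNAPSHOT</version>
>>   <relativePath>../tika-parent/pom.xml</relativePath>
>> </parent>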
>> 
>> 
>> 
>> On Mon, Sep 1, 2014 at 7:16 AM, Mattmann, Chris A (3980) <
>> chris.a.mattm...@jpl.nasa.gov> wrote:
>> 
>>> Hi Folks,
>>> 
>>> A candidate for the Tika 1.6 release is available at:
>>> 
>>>http://people.apache.org/~mattmann/apache-tika-1.6/rc2/
>>> 
>>> The release candidate is a zip archive of the sources in:
>>> 
>>> http://svn.apache.org/repos/asf/tika/tags/1.6-rc2/
>>> 
>>> 
>>> The SHA1 checksum of the archive is
>>> 65644121446130fa29f1b62bcd75fb33344a6ba3.
>>> 
>>> A Maven staging repository is at:
>>> 
>>> https://repository.apache.org/content/repositories/orgapachetika-1004/
>>> 
>>> 
>>> Please vote on releasing this package as Apache Tika 1.6.
>>> The vote is open for the next 72 hours and passes if a majority of at
>>> least three +1 Tika PMC votes are cast.
>>> 
>>>[ ] +1 Release this package as Apache Tika 1.6
>>>[ ] -1 Do not release this package because...
>>> 
>>> 
>>> Cheers,
>>> Chris
>>> 
>>> P.S. Here is my +1!
>>> 
>>> ++
>>> Chris Mattmann, Ph.D.
>>> Chief Architect
>>> Instrument Software and Science Data Systems Section (398)
>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>> Office: 168-519, Mailstop: 168-527
>>> Email: chris.a.mattm...@nasa.gov
>>> WWW:  http://sunset.usc.edu/~mattmann/
>>> ++
>>> Adjunct Associate Professor, Computer Science Department
>>> University of Southern California, Los Angeles, CA 90089 USA
>>> ++
>> 
>> 
>> -- 
>> --
>> Hong-Thai
> 


NPE on all *.odt, odp, .ods documents

2014-09-11 Thread Hong-Thai Nguyen
Hi all,

I've tested conversion with Tika 1.6 on our corpus: all OpenOffice
document types fail with an NPE. A fix was made in
https://issues.apache.org/jira/browse/TIKA-1412, but it is only available from 1.7.
That's a fatal error for me.

Should we release a 1.6.1 with the TIKA-1412 fix?

Stack trace:
Caused by: com.polyspot.document.converter.ConversionException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.ParserDecorator$1@318e5904
at com.polyspot.document.converter.DocumentConverter.realizeTikaConversion(DocumentConverter.java:233)
at com.polyspot.document.converter.DocumentConverter.convert(DocumentConverter.java:127)
at com.polyspot.wscrawlers.PsDocConverter.getConvertedDocument(PsDocConverter.java:83)
... 22 more
Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.ParserDecorator$1@318e5904
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:246)
at com.polyspot.document.converter.DocumentConverter.realizeTikaConversion(DocumentConverter.java:225)
... 24 more
Caused by: java.lang.NullPointerException
at org.apache.tika.parser.odf.OpenDocumentParser.parse(OpenDocumentParser.java:161)
at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
... 25 more
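
In a trace like this one the informative part is the innermost "Caused by". As a small illustration, the sketch below walks an exception's cause chain down to its root. The class name is made up, and the exception types in main are plain JDK stand-ins for the wrapped Tika exceptions above.

```java
// Sketch only: finds the innermost cause of a wrapped exception.
final class RootCauseSketch {

    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        // Stop at a null cause, or at a self-referencing cause, to avoid looping forever.
        while (current.getCause() != null && current.getCause() != current) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // JDK exceptions standing in for ConversionException -> TikaException -> NPE.
        Exception wrapped = new RuntimeException("conversion failed",
                new IllegalStateException("unexpected RuntimeException from parser",
                        new NullPointerException()));
        System.out.println(rootCause(wrapped).getClass().getName());
    }
}
```

Logging the root cause's class and first stack frame next to each failed document makes it easier to match failures against known issues.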

-- 
--
Hong-Thai


Re: NPE on all *.odt, odp, .ods documents

2014-09-11 Thread Hong-Thai Nguyen
Hi Chris,

Sounds perfect to me.

Hong-Thai

> On 11 Sep 2014, at 15:56, "Mattmann, Chris A (3980)" 
>  wrote:
> 
> Hi Hong-Thai,
> 
> Sure, we can easily do a patch release that incorporates this.
> 
> Here would be the process:
> 
> 1. RM to create branch http://svn.apache.org/repos/asf/tika/branches/1.6
> from
> http://svn.apache.org/repos/asf/tika/tags/1.6-rc2
> 
> 
> 2. RM to apply TIKA-1412 to
> http://svn.apache.org/repos/asf/tika/branches/1.6
> 
> 3. RM to roll RC 1.6.1-rc1 out of
> http://svn.apache.org/repos/asf/tika/branches/1.6
> 
> I'll do the above 3 steps today if that sounds good?
> 
> Cheers,
> Chris
> 
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++
> 
> 
> 
> 
> 
> 
> -Original Message-
> From: Hong-Thai Nguyen 
> Reply-To: "dev@tika.apache.org" 
> Date: Thursday, September 11, 2014 5:21 AM
> To: "dev@tika.apache.org" 
> Subject: NPE on all *.odt, odp, .ods documents
> 
>> Hi all,
>> 
>> I've tested conversion with Tika 1.6 on our corpus; all OpenOffice
>> document types fail with an NPE. A fix has been made in
>> https://issues.apache.org/jira/browse/TIKA-1412, but it is only available
>> from 1.7. That's a fatal error for me.
>> 
>> Should we release a 1.6.1 with the fix for TIKA-1412?
>> 
>> Stack trace:
>> Caused by: com.polyspot.document.converter.ConversionException:
>> org.apache.tika.exception.TikaException: Unexpected RuntimeException from
>> org.apache.tika.parser.ParserDecorator$1@318e5904
>> at
>> com.polyspot.document.converter.DocumentConverter.realizeTikaConversion(Do
>> cumentConverter.java:233)
>> at
>> com.polyspot.document.converter.DocumentConverter.convert(DocumentConverte
>> r.java:127)
>> at
>> com.polyspot.wscrawlers.PsDocConverter.getConvertedDocument(PsDocConverter
>> .java:83)
>> ... 22 more
>> Caused by: org.apache.tika.exception.TikaException: Unexpected
>> RuntimeException from org.apache.tika.parser.ParserDecorator$1@318e5904
>> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:246)
>> at
>> com.polyspot.document.converter.DocumentConverter.realizeTikaConversion(Do
>> cumentConverter.java:225)
>> ... 24 more
>> Caused by: java.lang.NullPointerException
>> at
>> org.apache.tika.parser.odf.OpenDocumentParser.parse(OpenDocumentParser.jav
>> a:161)
>> at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
>> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
>> ... 25 more
>> 
>> -- 
>> --
>> Hong-Thai
> 


Re: NPE on all *.odt, odp, .ods documents

2014-09-11 Thread Hong-Thai Nguyen
I have no objection regarding the version naming policy :)

A 1.7 with 13 fixed issues is not bad:
https://issues.apache.org/jira/browse/TIKA-1393?jql=project%20%3D%20TIKA%20AND%20fixVersion%20%3D%201.7%20AND%20resolution%20%3D%20Fixed%20ORDER%20BY%20due%20ASC%2C%20priority%20DESC%2C%20created%20ASC

On Thu, Sep 11, 2014 at 5:06 PM, Mattmann, Chris A (3980) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> +1 let me know what you guys think I'll wait until tomorrow
> based on what people say.
>
> BTW, we don't have any x.y.z releases yet - should we just
> call this 1.7? That's probably just as easy?
>
>
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++
>
>
>
>
>
>
> -Original Message-
> From: "Allison, Timothy B." 
> Reply-To: "dev@tika.apache.org" 
> Date: Thursday, September 11, 2014 7:42 AM
> To: "dev@tika.apache.org" 
> Subject: RE: NPE on all *.odt, odp, .ods documents
>
> >Probably want to add TIKA-1411.
> >
> >Nick and all, anything else?
> >
> >-Original Message-
> >From: Hong-Thai Nguyen [mailto:thaicha...@gmail.com]
> >Sent: Thursday, September 11, 2014 10:10 AM
> >To: dev@tika.apache.org
> >Subject: Re: NPE on all *.odt, odp, .ods documents
> >
> >Hi Chris,
> >
> >Sounds perfect to me.
> >
> >Hong-Thai
> >
> >> On 11 Sep 2014, at 15:56, "Mattmann, Chris A (3980)"
> >> wrote:
> >>
> >> Hi Hong-Thai,
> >>
> >> Sure, we can easily do a patch release that incorporates this.
> >>
> >> Here would be the process:
> >>
> >> 1. RM to create branch
> http://svn.apache.org/repos/asf/tika/branches/1.6
> >> from
> >> http://svn.apache.org/repos/asf/tika/tags/1.6-rc2
> >>
> >>
> >> 2. RM to apply TIKA-1412 to
> >> http://svn.apache.org/repos/asf/tika/branches/1.6
> >>
> >> 3. RM to roll RC 1.6.1-rc1 out of
> >> http://svn.apache.org/repos/asf/tika/branches/1.6
> >>
> >> I'll do the above 3 steps today if that sounds good?
> >>
> >> Cheers,
> >> Chris
> >>
> >> ++
> >> Chris Mattmann, Ph.D.
> >> Chief Architect
> >> Instrument Software and Science Data Systems Section (398)
> >> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> >> Office: 168-519, Mailstop: 168-527
> >> Email: chris.a.mattm...@nasa.gov
> >> WWW:  http://sunset.usc.edu/~mattmann/
> >> ++
> >> Adjunct Associate Professor, Computer Science Department
> >> University of Southern California, Los Angeles, CA 90089 USA
> >> ++
> >>
> >>
> >>
> >>
> >>
> >>
> >> -Original Message-
> >> From: Hong-Thai Nguyen 
> >> Reply-To: "dev@tika.apache.org" 
> >> Date: Thursday, September 11, 2014 5:21 AM
> >> To: "dev@tika.apache.org" 
> >> Subject: NPE on all *.odt, odp, .ods documents
> >>
> >>> Hi all,
> >>>
> >>> I've tested conversion with Tika 1.6 on our corpus; all OpenOffice
> >>> document types fail with an NPE. A fix has been made in
> >>> https://issues.apache.org/jira/browse/TIKA-1412, but it is only
> >>> available from 1.7.
> >>> That's a fatal error for me.
> >>>
> >>> Should we release a 1.6.1 with the fix for TIKA-1412?
> >>>
> >>> Stack trace:
> >>> Caused by: com.polyspot.document.converter.ConversionException:
> >>> org.apache.tika.exception.TikaException: Unexpected RuntimeException
> >>>from
> >>> org.apache.tika.parser.ParserDecorator$1@318e5904
> >>> at
> >>>
> >>>com.polyspot.document.converter.DocumentConverter.realizeTikaConversion(
> >>>Do
>

Re: NPE on all *.odt, odp, .ods documents

2014-09-11 Thread Hong-Thai Nguyen
I was wrong to say that all OpenDocument files fail: some files pass, but
a lot of them fail with an NPE at OpenDocumentParser line 161.

I'm looking at OpenDocumentParser.java in 1.6. The bug comes from the block
at lines 126-130 when the input is a TikaInputStream (our case):
    if (container instanceof ZipFile) {
        zipFile = (ZipFile) container;
    } else if (tis.hasFile()) {
        zipFile = new ZipFile(tis.getFile());
    }

In some cases neither branch matches, so zipFile is never created.
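
To make the failure mode concrete, here is a minimal sketch of the same branching written against java.util.zip only. It is not the actual TIKA-1412 patch: the class and method names are made up for illustration, a plain `File` argument stands in for `tis.hasFile()`/`tis.getFile()`, and the streaming fallback is only one possible way to handle the case where no ZipFile is available.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical illustration; none of these names come from the Tika code base.
public class ZipFallbackSketch {

    // Mirrors the two branches quoted above. When the container is not a
    // ZipFile and no backing file exists, nothing is assigned: null is returned.
    public static ZipFile openZipFile(Object container, File file) throws IOException {
        if (container instanceof ZipFile) {
            return (ZipFile) container;
        } else if (file != null) { // stand-in for tis.hasFile()
            return new ZipFile(file);
        }
        return null;
    }

    // Lists entry names, streaming the archive when no ZipFile could be opened.
    public static List<String> entryNames(Object container, File file, InputStream stream)
            throws IOException {
        List<String> names = new ArrayList<>();
        ZipFile zipFile = openZipFile(container, file);
        if (zipFile == null) {
            // Without this guard, the zipFile.entries() call below would throw an NPE.
            try (ZipInputStream zip = new ZipInputStream(stream)) {
                ZipEntry entry;
                while ((entry = zip.getNextEntry()) != null) {
                    names.add(entry.getName());
                }
            }
            return names;
        }
        try {
            Enumeration<? extends ZipEntry> entries = zipFile.entries();
            while (entries.hasMoreElements()) {
                names.add(entries.nextElement().getName());
            }
        } finally {
            // Only close what this method opened itself.
            if (zipFile != container) {
                zipFile.close();
            }
        }
        return names;
    }

    // Builds a tiny in-memory archive with two ODF-style entry names (sample data).
    public static byte[] sampleArchive() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(bytes)) {
            out.putNextEntry(new ZipEntry("mimetype"));
            out.write("application/vnd.oasis.opendocument.text".getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
            out.putNextEntry(new ZipEntry("content.xml"));
            out.write("<doc/>".getBytes(StandardCharsets.UTF_8));
            out.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // No ZipFile container and no backing file: the streaming path is used.
        InputStream stream = new ByteArrayInputStream(sampleArchive());
        System.out.println(entryNames(null, null, stream));
    }
}
```

The point of the sketch is only the null check: any code path that reaches the entry lookup without one of the two branches having run needs either a fallback or an explicit error.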


For information, this bug really is fixed in 1.7-SNAPSHOT. Here is a
detailed comparison of the two versions on the same corpus:
1.6:
14-09-09 16:17:43 INFO  (DocumentConversionErrorPlugin.java : 115) [pool-2-thread-2] Summary of document conversion errors:
- pdf (7)
- pptx (10)
- doc (6)
- ppt (14)
- xls (9)
- dwg (4)
- odp (495)
- odt (839)
- pps (2)
- ods (1)

1.7-SNAPSHOT:
- pdf (7)
- pptx (10)
- doc (6)
- ppt (14)
- xls (9)
- dwg (4)
- odp (2)
- pps (2)


On Thu, Sep 11, 2014 at 8:55 PM, Ken Krugler 
wrote:

>
> > From: Hong-Thai Nguyen
> > Sent: September 11, 2014 5:21:41am PDT
> > To: dev@tika.apache.org
> > Subject: NPE on all *.odt, odp, .ods documents
> >
> > Hi all,
> >
> > I've tested conversion with Tika 1.6 on our corpus; all OpenOffice
> > document types fail with an NPE. A fix has been made in
> > https://issues.apache.org/jira/browse/TIKA-1412, but it is only available
> > from 1.7. That's a fatal error for me.
>
> I'm curious - don't we have unit tests for OpenOffice document types?
>
> If so, then why are they passing, but all docs tried by Hong-Thai fail?
>
> -- Ken
>
> >
> > Should we release a 1.6.1 with the fix for TIKA-1412?
> >
> > Stack trace:
> > Caused by: com.polyspot.document.converter.ConversionException:
> > org.apache.tika.exception.TikaException: Unexpected RuntimeException from
> > org.apache.tika.parser.ParserDecorator$1@318e5904
> > at
> >
> com.polyspot.document.converter.DocumentConverter.realizeTikaConversion(DocumentConverter.java:233)
> > at
> >
> com.polyspot.document.converter.DocumentConverter.convert(DocumentConverter.java:127)
> > at
> >
> com.polyspot.wscrawlers.PsDocConverter.getConvertedDocument(PsDocConverter.java:83)
> > ... 22 more
> > Caused by: org.apache.tika.exception.TikaException: Unexpected
> > RuntimeException from org.apache.tika.parser.ParserDecorator$1@318e5904
> > at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:246)
> > at
> >
> com.polyspot.document.converter.DocumentConverter.realizeTikaConversion(DocumentConverter.java:225)
> > ... 24 more
> > Caused by: java.lang.NullPointerException
> > at
> >
> org.apache.tika.parser.odf.OpenDocumentParser.parse(OpenDocumentParser.java:161)
> > at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
> > at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
> > ... 25 more
> >
> > --
> > --
> > Hong-Thai
>
>
> --
> Ken Krugler
> +1 530-210-6378
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Cassandra & Solr
>
>
>
>
>
>


-- 
--
Hong-Thai


RE: NPE on all *.odt, odp, .ods documents

2014-09-22 Thread Hong-Thai Nguyen
Hi,

I've added a test for this case in r1626706.
TIKA-1421 is currently blocking the release.

Hong-Thai

-Original Message-
From: Ken Krugler [mailto:kkrugler_li...@transpac.com] 
Sent: Thursday, September 11, 2014 23:07
To: dev@tika.apache.org
Subject: RE: NPE on all *.odt, odp, .ods documents


> From: Hong-Thai Nguyen
> Sent: September 11, 2014 1:40:08pm PDT
> To: dev@tika.apache.org
> Subject: Re: NPE on all *.odt, odp, .ods documents
> 
> I was wrong to say that all OpenDocument files fail: some files pass,
> but a lot of them fail with an NPE at OpenDocumentParser line 161.

OK, thanks for clarifying.

So I assume we now have a unit test that would fail without the fix, yes?

Thanks,

-- Ken

> 
> I'm looking at OpenDocumentParser.java in 1.6. The bug comes from the
> block at lines 126-130 when the input is a TikaInputStream (our case):
>     if (container instanceof ZipFile) {
>         zipFile = (ZipFile) container;
>     } else if (tis.hasFile()) {
>         zipFile = new ZipFile(tis.getFile());
>     }
> 
> In some cases neither branch matches, so zipFile is never created.
> 
> 
> For information, this bug really is fixed in 1.7-SNAPSHOT. Here is a
> detailed comparison of the two versions on the same corpus:
> 1.6:
> 14-09-09 16:17:43 INFO  (DocumentConversionErrorPlugin.java : 115) [pool-2-thread-2] Summary of document conversion errors:
> - pdf (7)
> - pptx (10)
> - doc (6)
> - ppt (14)
> - xls (9)
> - dwg (4)
> - odp (495)
> - odt (839)
> - pps (2)
> - ods (1)
> 
> 1.7-SNAPSHOT:
> - pdf (7)
> - pptx (10)
> - doc (6)
> - ppt (14)
> - xls (9)
> - dwg (4)
> - odp (2)
> - pps (2)
> 
> 
> On Thu, Sep 11, 2014 at 8:55 PM, Ken Krugler 
> 
> wrote:
> 
>> 
>>> From: Hong-Thai Nguyen
>>> Sent: September 11, 2014 5:21:41am PDT
>>> To: dev@tika.apache.org
>>> Subject: NPE on all *.odt, odp, .ods documents
>>> 
>>> Hi all,
>>> 
>>> I've tested conversion with Tika 1.6 on our corpus; all OpenOffice
>>> document types fail with an NPE. A fix has been made in
>>> https://issues.apache.org/jira/browse/TIKA-1412, but it is only available
>>> from 1.7. That's a fatal error for me.
>> 
>> I'm curious - don't we have unit tests for OpenOffice document types?
>> 
>> If so, then why are they passing, but all docs tried by Hong-Thai fail?
>> 
>> -- Ken
>> 
>>> 
>>> Should we release a 1.6.1 with the fix for TIKA-1412?
>>> 
>>> Stack trace:
>>> Caused by: com.polyspot.document.converter.ConversionException:
>>> org.apache.tika.exception.TikaException: Unexpected RuntimeException 
>>> from
>>> org.apache.tika.parser.ParserDecorator$1@318e5904
>>> at
>>> 
>> com.polyspot.document.converter.DocumentConverter.realizeTikaConversi
>> on(DocumentConverter.java:233)
>>> at
>>> 
>> com.polyspot.document.converter.DocumentConverter.convert(DocumentCon
>> verter.java:127)
>>> at
>>> 
>> com.polyspot.wscrawlers.PsDocConverter.getConvertedDocument(PsDocConv
>> erter.java:83)
>>> ... 22 more
>>> Caused by: org.apache.tika.exception.TikaException: Unexpected 
>>> RuntimeException from 
>>> org.apache.tika.parser.ParserDecorator$1@318e5904
>>> at 
>>> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:24
>>> 6)
>>> at
>>> 
>> com.polyspot.document.converter.DocumentConverter.realizeTikaConversi
>> on(DocumentConverter.java:225)
>>> ... 24 more
>>> Caused by: java.lang.NullPointerException at
>>> 
>> org.apache.tika.parser.odf.OpenDocumentParser.parse(OpenDocumentParse
>> r.java:161)
>>> at 
>>> org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91
>>> ) at 
>>> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:24
>>> 4)
>>> ... 25 more
>>> 
>>> --
>>> --
>>> Hong-Thai




--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr







Re: 1.7 release?

2014-10-16 Thread Hong-Thai Nguyen
Hi Andrzej,

We are eager for the 1.7 release too.
I'm hitting the TIKA-1422 compilation problem on my side. If anyone can build
successfully on Windows, I have no objection to releasing 1.7.

Thanks,

On Thu, Oct 16, 2014 at 10:51 AM, Andrzej Białecki  wrote:

> Hi,
>
> Any news on the 1.7 release? or at least a 1.6.1 release that includes the
> fix for broken ODF parsing…
>
> ---
> Best regards,
>
> Andrzej Bialecki
>
>


-- 
--
Hong-Thai


Re: svn commit: r1633325 - in /tika/trunk/tika-parsers/src: main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java test/java/org/apache/tika/parser/mail/RFC822ParserTest.java

2014-10-21 Thread Hong-Thai Nguyen
Hi Chris,

Yes, I made a mistake in this commit: I missed a renamed file and broke the
build. The next commit corrects it:
Revision: 161
Author: thaichat04
Date: Tuesday, October 21, 2014 11:47:54
Message:
TIKA-1422 - Fixing build & minor refactory of naming test class

Modified :
/tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/mail/RFC822ParserTest.java
Added :
/tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java
Deleted :
/tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/ocr/TesseractOCRTest.java

Please pull the latest again and tell me if it is OK.

Sorry

On Tue, Oct 21, 2014 at 3:49 PM, Mattmann, Chris A (3980) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> Hi Hong-Thai,
>
> These commits look strange to me - it looks like it subtracts the
> whole files (and the unit test removed the test file, renamed it,
> and then added what largely looks like the same file, back?)
>
> Any idea what's up?
>
> Cheers,
> Chris
>
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++
>
>
>
>
>
>
> -Original Message-
> From: "thaicha...@apache.org" 
> Reply-To: "dev@tika.apache.org" 
> Date: Tuesday, October 21, 2014 at 2:32 AM
> To: "comm...@tika.apache.org" 
> Subject: svn commit: r1633325 - in /tika/trunk/tika-parsers/src:
> main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
> test/java/org/apache/tika/parser/mail/RFC822ParserTest.java
>
> >Author: thaichat04
> >Date: Tue Oct 21 09:32:06 2014
> >New Revision: 1633325
> >
> >URL: http://svn.apache.org/r1633325
> >Log:
> >TIKA-1422 - Apply fix of [~olegt] in Windows
> >
> >Modified:
> >
> >tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/Tesseract
> >OCRParser.java
> >
> >tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/mail/RFC822Pa
> >rserTest.java
> >
> >Modified:
> >tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/Tesseract
> >OCRParser.java
> >URL:
> >
> http://svn.apache.org/viewvc/tika/trunk/tika-parsers/src/main/java/org/apa
> >che/tika/parser/ocr/TesseractOCRParser.java?rev=1633325&r1=1633324&r2=1633
> >325&view=diff
> >==
> >
> >---
> >tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/Tesseract
> >OCRParser.java (original)
> >+++
> >tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/Tesseract
> >OCRParser.java Tue Oct 21 09:32:06 2014
> >@@ -26,11 +26,11 @@ import java.io.IOException;
> > import java.io.InputStream;
> > import java.io.InputStreamReader;
> > import java.io.Reader;
> >+import java.util.ArrayList;
> > import java.util.HashSet;
> >+import java.util.List;
> > import java.util.Map;
> > import java.util.Set;
> >-import java.util.List;
> >-import java.util.ArrayList;
> > import java.util.concurrent.Callable;
> > import java.util.concurrent.ExecutionException;
> > import java.util.concurrent.FutureTask;
> >@@ -45,20 +45,23 @@ import org.apache.tika.io.TemporaryResou
> > import org.apache.tika.io.TikaInputStream;
> > import org.apache.tika.metadata.Metadata;
> > import org.apache.tika.mime.MediaType;
> >-import org.apache.tika.parser.Parser;
> > import org.apache.tika.parser.AbstractParser;
> > import org.apache.tika.parser.ParseContext;
> >+import org.apache.tika.parser.Parser;
> > import org.apache.tika.parser.external.ExternalParser;
> >+import org.apache.tika.parser.image.ImageParser;
> >+import org.apache.tika.parser.image.PSDParser;
> >+import org.apache.tika.parser.image.TiffParser;
> >+import org.apache.tika.parser.jpeg.JpegParser;
> > import org.apache.tika.sax.XHTMLContentHandler;
> > import org.xml.sax.ContentHandler;
> > import org.xml.sax.SAXException;
> >
> > /**
> >- * TesseractOCRParser powered by tesseract-ocr engine.
> >- * To enable this parser, create a {@link TesseractOCRConfig}
> >- * object and pass it through a ParseContext.
> >- * Tesseract-ocr must be installed and on system path or
> >- * the path to its root folder must be provided:
> >+ * TesseractOCRParser powered by tesseract-ocr engine. To enable this
> >parser,
> >+ * create a {@link TesseractOCRConfig} object and pass it through a
> >+ * ParseContext. Tesseract-ocr must be installed and on system path or
> >the path
> >+ * to its root folder must be provided:
> >  * 
> >  * TesseractOCRConfig config = new TesseractOCRConfig();
> >  * //Needed if tesseract is not on system path
> >@@ -69,226 +72,231 @@ import org

Move definitively from SVN to Git ?

2014-11-17 Thread Hong-Thai Nguyen
Hi all,

Git is used everywhere and offers many new features. Should we abandon the
SVN repo and move to Git permanently, to make it easier to apply fixes and
contributions?

Thanks,


--
Hong-Thai


Re: Move definitively from SVN to Git ?

2014-11-17 Thread Hong-Thai Nguyen
I didn't realize that we could commit/push directly into the Git repo. Can
we?

Cheers

On Mon, Nov 17, 2014 at 11:46 AM, Nick Burch  wrote:

> On Mon, 17 Nov 2014, Hong-Thai Nguyen wrote:
>
>> Git is used everywhere and offers many new features. Should we abandon the
>> SVN repo and move to Git permanently, to make it easier to apply fixes and
>> contributions?
>>
>
> We already have a git mirror - http://git.apache.org/tika.git/ - and a
> GitHub mirror which accepts pull requests - https://github.com/apache/tika
> . I believe that anyone who wants to work on Tika with Git is able to!
>
> Are you able to explain what we'd gain by making the change?
>
> Thanks
> Nick
>



-- 
--
Hong-Thai


Re: Move definitively from SVN to Git ?

2014-11-17 Thread Hong-Thai Nguyen
Yes, that's exactly what I'm doing. If we move to Git, we'll avoid all the
SVN steps. Anyway, this concerns committers only.

On Mon, Nov 17, 2014 at 12:08 PM, Nick Burch  wrote:

> On Mon, 17 Nov 2014, Hong-Thai Nguyen wrote:
>
>> I didn't realize that we could commit/push directly into the Git repo.
>> Can we?
>>
>
> Master source is still SVN. However, committers can (and at least some do)
> work on a clone of the Git repo, and use GitSVN to push their changes to
> the SVN repo as commits
>
> That lets you buffer commits offline if needed, and commit directly from
> your Git environment
>
> Nick
>



-- 
--
Hong-Thai


Re: svn commit: r1640017 - /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/TesseractOCRConfig.java

2014-11-17 Thread Hong-Thai Nguyen
Hi,

I've pushed a minor fix to pass this test on Windows.

Thanks,

On Mon, Nov 17, 2014 at 4:28 PM, Mattmann, Chris A (3980) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> +1, agreed, Dave would be nice to have one as a default.
>
> ++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++
>
>
>
>
>
>
> -Original Message-
> From: David Meikle 
> Reply-To: "dev@tika.apache.org" 
> Date: Monday, November 17, 2014 at 8:54 AM
> To: "dev@tika.apache.org" 
> Subject: Re: svn commit: r1640017 -
> /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/ocr/Tesseract
> OCRConfig.java
>
> >Hi Chris,
> >
> >> On 16 Nov 2014, at 19:14, Mattmann, Chris A (3980)
> >> wrote:
> >>
> >> Thanks, Dave. I think you forgot the default config file?
> >
> >Yup, forgot the tests and example config from my change!  Just committed
> >them.
> >
> >I wasn't initially planning on including a default config, thinking if you
> >dropped a properties file on the class path it would use that, otherwise
> >it would go for the defaults but should probably add one to be consistent
> >with the PDFParserConfig.
> >
> >Cheers,
> >Dave
>
>


-- 
--
Hong-Thai


Re: [VOTE] Apache Tika 1.7 Release

2015-01-08 Thread Hong-Thai Nguyen
Seems fine to me: +1

No big regression on our corpus test of 23K docs:

15-01-07 18:19:27 INFO  (DocumentConversionErrorPlugin.java : 116)
[pool-3-thread-1] Summary of document conversion errors:
- pdf (4)
* (2) org.apache.tika.exception.TikaException: TIKA-198: Illegal
IOException from org.apache.tika.parser.ParserDecorator$1@4b0b2006
* (1) org.apache.tika.exception.TikaException: Unexpected RuntimeException
from org.apache.tika.parser.ParserDecorator$1@4b0b2006
* (1) org.apache.tika.exception.TikaException: Unable to extract PDF content
- ps (3)
* (3) org.apache.tika.exception.TikaException: Unable to unpack document
stream
- pptx (10)
* (9) org.apache.tika.exception.TikaException: Error creating OOXML
extractor
* (1) org.apache.tika.exception.TikaException: Unexpected RuntimeException
from org.apache.tika.parser.ParserDecorator$1@45df8db8
- doc (6)
* (6) org.apache.tika.exception.TikaException: Unexpected RuntimeException
from org.apache.tika.parser.ParserDecorator$1@58797499
- ppt (14)
* (13) org.apache.tika.exception.TikaException: Unexpected RuntimeException
from org.apache.tika.parser.ParserDecorator$1@58797499
* (1) org.apache.tika.exception.TikaException: TIKA-198: Illegal
IOException from org.apache.tika.parser.ParserDecorator$1@58797499
- xls (9)
* (9) org.apache.tika.exception.TikaException: Unexpected RuntimeException
from org.apache.tika.parser.ParserDecorator$1@58797499
- vsd (3)
* (3) org.apache.tika.exception.TikaException: Unexpected RuntimeException
from org.apache.tika.parser.ParserDecorator$1@58797499
- odp (2)
* (2) org.apache.tika.exception.TikaException: TIKA-198: Illegal
IOException from org.apache.tika.parser.ParserDecorator$1@753ce4d8
- chm (1)
* (1) org.apache.tika.exception.TikaException: CHM file extract error:
extracted Length is wrong.
- dwg (4)
* (4) org.apache.tika.exception.TikaException: Unsupported AutoCAD drawing
version: AC1014
- pps (2)
* (2) org.apache.tika.exception.TikaException: Unexpected RuntimeException
from org.apache.tika.parser.ParserDecorator$1@58797499
- chw (1)
* (1) org.apache.tika.exception.TikaException: Unexpected RuntimeException
from org.apache.tika.parser.ParserDecorator$1@a0b8fca

Thanks, Tyler,

On Tue, Jan 6, 2015 at 7:59 AM, Tyler Palsulich 
wrote:

> Hi All,
>
> A candidate for the Tika 1.7 release is available at:
> https://dist.apache.org/repos/dist/dev/tika/
>
> The release candidate is a zip archive of the sources in:
> http://svn.apache.org/repos/asf/tika/tags/1.7-rc2/
>
> The SHA1 checksum of the archive is
> 0307a8367ae6f8b1103824fd11337fd89e24e6a4.
>
> In addition, a staged maven repository is available here:
>
>
> https://repository.apache.org/content/repositories/orgapachetika-1006/org/apache/tika/
>
> Please vote on releasing this package as Apache Tika 1.7.
>
> The vote is open for the next 72 hours and passes if a majority of at least
> three +1 Tika PMC votes are cast.
>
> [ ] +1 Release this package as Apache Tika 1.7
> [ ] -1 Do not release this package because...
>
> Thanks!
> Tyler
>
> P.S. Count this as my +1!
>



-- 
--
Hong-Thai


Re: [VOTE] Apache Tika 1.7 Release

2015-01-14 Thread Hong-Thai Nguyen
I've rerun some regression tests. They seem fine to me too, so +1.

Great job, Tyler!

On Fri, Jan 9, 2015 at 11:02 PM, Tyler Palsulich 
wrote:

> Hi All,
>
> A candidate for the Tika 1.7 release is available at:
> https://dist.apache.org/repos/dist/dev/tika/
>
> The release candidate is a zip archive of the sources in:
> http://svn.apache.org/repos/asf/tika/tags/1.7-rc3/
>
> The SHA1 checksum of the archive is
> b2190c267433e62c08560576ab7197e506bfdc11
>
> In addition, a staged maven repository is available here:
>
>
> https://repository.apache.org/content/repositories/orgapachetika-1007/org/apache/tika/
>
> Please vote on releasing this package as Apache Tika 1.7.
>
> The vote is open for the next 72 hours and passes if a majority of at least
> three +1 Tika PMC votes are cast.
>
> [ ] +1 Release this package as Apache Tika 1.7
> [ ] -1 Do not release this package because...
>
> Thanks!
> Tyler
>
> P.S. Here is my +1!
>



-- 
--
Hong-Thai


[jira] [Commented] (TIKA-1581) jhighlight license concerns

2015-03-20 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14371432#comment-14371432
 ] 

Hong-Thai Nguyen commented on TIKA-1581:


I've also contacted 'gbe...@uwyn.com', which seems to be his email. Let's wait
a few days for his feedback.
Otherwise, we can create an 'unshipped' module to group all parsers whose 
dependencies are not under the Apache license.

> jhighlight license concerns
> ---
>
> Key: TIKA-1581
> URL: https://issues.apache.org/jira/browse/TIKA-1581
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.7
>Reporter: Karl Wright
>
> jhighlight jar is a Tika dependency.  The Lucene team discovered that, while 
> it claims to be a CDDL/LGPL dual-license, some of its functionality is LGPL 
> only:
> {code}
> Solr's contrib/extraction contains jhighlight-1.0.jar which declares itself 
> as dual CDDL or LGPL license. However, some of its classes are distributed 
> only under LGPL, e.g.
> com.uwyn.jhighlight.highlighter.
>   CppHighlighter.java
>   GroovyHighlighter.java
>   JavaHighlighter.java
>   XmlHighlighter.java
> I downloaded the sources from Maven 
> (http://search.maven.org/remotecontent?filepath=com/uwyn/jhighlight/1.0/jhighlight-1.0-sources.jar)
>  to confirm that, and also found this SVN repo: 
> http://svn.rifers.org/jhighlight/tags/release-1.0, though the project's 
> website seems to not exist anymore (https://jhighlight.dev.java.net/).
> I didn't find any direct usage of it in our code, so I guess it's probably 
> needed by a 3rd party dependency, such as Tika. Therefore if we e.g. omit it, 
> things will compile, but may fail at runtime.
> {code}
> Is it possible to remove this dependency for future releases, or allow only 
> optional inclusion of this package?  It is of concern to the ManifoldCF 
> project because we distribute a binary package that includes Tika and its 
> required dependencies, which currently includes jHighlight.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (TIKA-1581) jhighlight license concerns

2015-03-20 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14371432#comment-14371432
 ] 

Hong-Thai Nguyen edited comment on TIKA-1581 at 3/20/15 3:10 PM:
-

I've also contacted 'gbe...@uwyn.com', which seems to be his email. Let's wait
a few days for his feedback.
Otherwise, we can create an 'unshipped' module to group all parsers whose 
dependencies are not under the Apache license.

[~steve_rowe], the forked version you mentioned doesn't change anything about
the original license terms of JHighlight.


was (Author: thaichat04):
I've also contacted 'gbe...@uwyn.com', which seems to be his email. Let's wait
a few days for his feedback.
Otherwise, we can create an 'unshipped' module to group all parsers whose 
dependencies are not under the Apache license.



[jira] [Comment Edited] (TIKA-1581) jhighlight license concerns

2015-03-20 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14371432#comment-14371432
 ] 

Hong-Thai Nguyen edited comment on TIKA-1581 at 3/20/15 3:36 PM:
-

I've also contacted 'gbe...@uwyn.com', which seems to be his email. Let's wait
a few days for his feedback.
Otherwise, we can create an 'unshipped' module to group all parsers whose 
dependencies are not under the Apache license.


was (Author: thaichat04):
I've also contacted 'gbe...@uwyn.com', which seems to be his email. Let's wait
a few days for his feedback.
Otherwise, we can create an 'unshipped' module to group all parsers whose 
dependencies are not under the Apache license.

[~steve_rowe], the forked version you mentioned doesn't change anything about
the original license terms of JHighlight.

> jhighlight license concerns
> ---
>
> Key: TIKA-1581
> URL: https://issues.apache.org/jira/browse/TIKA-1581
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.7
>Reporter: Karl Wright
>
> jhighlight jar is a Tika dependency.  The Lucene team discovered that, while 
> it claims to be a CDDL/LGPL dual-license, some of its functionality is LGPL 
> only:
> {code}
> Solr's contrib/extraction contains jhighlight-1.0.jar which declares itself 
> as dual CDDL or LGPL license. However, some of its classes are distributed 
> only under LGPL, e.g.
> com.uwyn.jhighlight.highlighter.
>   CppHighlighter.java
>   GroovyHighlighter.java
>   JavaHighlighter.java
>   XmlHighlighter.java
> I downloaded the sources from Maven 
> (http://search.maven.org/remotecontent?filepath=com/uwyn/jhighlight/1.0/jhighlight-1.0-sources.jar)
>  to confirm that, and also found this SVN repo: 
> http://svn.rifers.org/jhighlight/tags/release-1.0, though the project's 
> website seems to not exist anymore (https://jhighlight.dev.java.net/).
> I didn't find any direct usage of it in our code, so I guess it's probably 
> needed by a 3rd party dependency, such as Tika. Therefore if we e.g. omit it, 
> things will compile, but may fail at runtime.
> {code}
> Is it possible to remove this dependency for future releases, or allow only 
> optional inclusion of this package?  It is of concern to the ManifoldCF 
> project because we distribute a binary package that includes Tika and its 
> required dependencies, which currently includes jHighlight.
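
One way to double-check the report above about LGPL-only classes is to scan the 
file headers in the sources jar. What follows is only a rough sketch: the class 
name LicenseHeaderScan, the 15-line header window and the keyword matching are 
assumptions made for illustration, not something taken from the jhighlight 
sources or from Tika.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.Locale;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;

// Hypothetical helper: flags source files whose header mentions LGPL but not CDDL.
class LicenseHeaderScan {
    // Number of leading lines treated as the license header (an assumption).
    static final int HEADER_LINES = 15;

    /** Returns true if the header text mentions the LGPL but never the CDDL. */
    static boolean isLgplOnly(String header) {
        String h = header.toUpperCase(Locale.ROOT);
        boolean lgpl = h.contains("LGPL") || h.contains("LESSER GENERAL PUBLIC");
        boolean cddl = h.contains("CDDL") || h.contains("COMMON DEVELOPMENT AND DISTRIBUTION");
        return lgpl && !cddl;
    }

    /** Lists the .java entries of a sources jar whose header looks LGPL-only. */
    static List<String> findLgplOnly(String jarPath) throws IOException {
        List<String> flagged = new ArrayList<>();
        try (JarFile jar = new JarFile(jarPath)) {
            Enumeration<JarEntry> entries = jar.entries();
            while (entries.hasMoreElements()) {
                JarEntry entry = entries.nextElement();
                if (entry.isDirectory() || !entry.getName().endsWith(".java")) {
                    continue;
                }
                StringBuilder header = new StringBuilder();
                try (BufferedReader reader = new BufferedReader(
                        new InputStreamReader(jar.getInputStream(entry), StandardCharsets.UTF_8))) {
                    String line;
                    // Read only the first HEADER_LINES lines of each source file.
                    for (int i = 0; i < HEADER_LINES && (line = reader.readLine()) != null; i++) {
                        header.append(line).append('\n');
                    }
                }
                if (isLgplOnly(header.toString())) {
                    flagged.add(entry.getName());
                }
            }
        }
        return flagged;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java LicenseHeaderScan path/to/some-sources.jar
        for (String name : findLgplOnly(args[0])) {
            System.out.println(name);
        }
    }
}
```

A keyword scan like this can only point at candidates; each flagged header still 
needs to be read by a person before drawing any licensing conclusion.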





[jira] [Commented] (TIKA-1581) jhighlight license concerns

2015-03-27 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14383827#comment-14383827
 ] 

Hong-Thai Nguyen commented on TIKA-1581:


In r1669583, I switched to the latest jhighlight (1.0.2) and updated 
Notices.txt and SourceCodeParser.java to note that this library is 
dual-licensed under CDDL/LGPL.



[jira] [Updated] (TIKA-1581) jhighlight license concerns

2015-03-27 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1581:
---
Fix Version/s: 1.8



[jira] [Resolved] (TIKA-1581) jhighlight license concerns

2015-03-27 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen resolved TIKA-1581.

Resolution: Fixed



[jira] [Commented] (TIKA-1581) jhighlight license concerns

2015-03-30 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386900#comment-14386900
 ] 

Hong-Thai Nguyen commented on TIKA-1581:


And many thanks to [~kkrugler] for his investigation and his efforts to push 
the release of jhighlight 1.0.2.



[jira] [Commented] (TIKA-1600) Unable to parse ODT files because of failed to close temporary resources

2015-04-13 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492084#comment-14492084
 ] 

Hong-Thai Nguyen commented on TIKA-1600:


The root exception is an NPE thrown when parsing ODT files that have elements in a footnote:
{code}
java.lang.NullPointerException
    at org.apache.tika.parser.odf.OpenDocumentContentParser$OpenDocumentElementMappingContentHandler.startSpan(OpenDocumentContentParser.java:174)
    at org.apache.tika.parser.odf.OpenDocumentContentParser$OpenDocumentElementMappingContentHandler.startElement(OpenDocumentContentParser.java:287)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at org.apache.tika.parser.odf.NSNormalizerContentHandler.startElement(NSNormalizerContentHandler.java:69)
    at org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.startElement(AbstractSAXParser.java:501)
    at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.scanStartElement(XMLNSDocumentScannerImpl.java:400)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2756)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:647)
    at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
    at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
    at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
    at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
    at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
    at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
    at org.apache.tika.parser.odf.OpenDocumentContentParser.parseInternal(OpenDocumentContentParser.java:503)
    at org.apache.tika.parser.odf.OpenDocumentParser.handleZipEntry(OpenDocumentParser.java:187)
    at org.apache.tika.parser.odf.OpenDocumentParser.parse(OpenDocumentParser.java:164)
    at org.apache.tika.parser.odf.OpenDocumentParserTest.can_parse_odt_file(OpenDocumentParserTest.java:41)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
    at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
    at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)
{code}
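
To illustrate the kind of guard that would avoid an NPE like the one in the 
trace, here is a minimal SAX handler sketch. It is not the Tika code: the real 
cause at OpenDocumentContentParser.java:174 is not shown in this thread, and the 
sketch simply assumes that a style lookup returned null for a span. The class 
NullSafeSpanHandler, its style table and the emitted markers are all 
hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler, not the Tika implementation.
class NullSafeSpanHandler extends DefaultHandler {
    // Hypothetical style table: style name -> whether that style is bold.
    private final Map<String, Boolean> boldByStyle = new HashMap<>();
    // Records what the handler would emit, so the behaviour is easy to inspect.
    final List<String> emitted = new ArrayList<>();

    void defineStyle(String name, boolean bold) {
        boldByStyle.put(name, bold);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if (!"text:span".equals(qName)) {
            return;
        }
        String styleName = attrs.getValue("text:style-name");
        // The lookup yields null when the attribute is missing or the style
        // was never declared; both cases fall back to a plain span.
        Boolean bold = (styleName == null) ? null : boldByStyle.get(styleName);
        if (bold != null && bold) {
            emitted.add("<b>");
        } else {
            emitted.add("<span>");
        }
    }

    // Builds an attribute list with an optional text:style-name attribute.
    static Attributes styleAttr(String styleName) {
        AttributesImpl attrs = new AttributesImpl();
        if (styleName != null) {
            attrs.addAttribute("", "style-name", "text:style-name", "CDATA", styleName);
        }
        return attrs;
    }
}
```

The point of the sketch is only that a span referring to an unknown or absent 
style degrades to unstyled output instead of aborting the whole parse.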

It seems that style support for ODF was added recently, in 1.8:
{noformat}
Revision: 107
Author: tpalsulich
Date: Saturday 14 March 2015 0

[jira] [Created] (TIKA-1089) Tiak conversion failed on following documents

2013-02-26 Thread Hong-Thai Nguyen (JIRA)
Hong-Thai Nguyen created TIKA-1089:
--

 Summary: Tiak conversion failed on following documents
 Key: TIKA-1089
 URL: https://issues.apache.org/jira/browse/TIKA-1089
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.3
 Environment: windows, api
Reporter: Hong-Thai Nguyen


We are using Tika as our main converter of diverse file formats to text and 
HTML in a search engine.

We've collected some documents (46) that Tika cannot convert.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (TIKA-1089) Tiak conversion failed on following documents

2013-02-26 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1089:
---

Attachment: crawler.log

Attached a log of the exceptions.

> Tiak conversion failed on following documents
> -
>
> Key: TIKA-1089
> URL: https://issues.apache.org/jira/browse/TIKA-1089
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.3
> Environment: windows, api
>    Reporter: Hong-Thai Nguyen
>  Labels: test
> Attachments: crawler.log
>
>
> We are using Tika as our main converter of diverse file formats to text and 
> HTML in a search engine.
> We've collected some documents (46) that Tika cannot convert.



[jira] [Updated] (TIKA-1089) Tika conversion failed on following documents

2013-02-26 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1089:
---

Summary: Tika conversion failed on following documents  (was: Tiak 
conversion failed on following documents)

> Tika conversion failed on following documents
> -
>
> Key: TIKA-1089
> URL: https://issues.apache.org/jira/browse/TIKA-1089
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.3
> Environment: windows, api
>    Reporter: Hong-Thai Nguyen
>  Labels: test
> Attachments: crawler.log
>
>
> We are using Tika as our main converter of diverse file formats to text and 
> HTML in a search engine.
> We've collected some documents (46) that Tika cannot convert.



[jira] [Updated] (TIKA-1089) Tika conversion failed on following documents

2013-02-26 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1089:
---

Description: 
We are using Tika as our main converter of diverse file formats to text and 
HTML in a search engine.
We've collected some documents (46) that Tika cannot convert: 
http://www.mediafire.com/?60clr812lerx3gy

  was:
We are using Tika as our main converter of diverse file formats to text and 
HTML in a search engine.

We've collected some documents (46) that Tika cannot convert.


> Tika conversion failed on following documents
> -
>
> Key: TIKA-1089
> URL: https://issues.apache.org/jira/browse/TIKA-1089
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.3
> Environment: windows, api
>Reporter: Hong-Thai Nguyen
>  Labels: test
> Attachments: crawler.log
>
>
> We are using Tika as our main converter of diverse file formats to text and 
> HTML in a search engine.
> We've collected some documents (46) that Tika cannot convert: 
> http://www.mediafire.com/?60clr812lerx3gy



[jira] [Updated] (TIKA-1152) Process stucks on parsing of a CHM file

2013-07-23 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1152:
---

Attachment: eventcombmt.chm

> Process stucks on parsing of a CHM file
> ---
>
> Key: TIKA-1152
> URL: https://issues.apache.org/jira/browse/TIKA-1152
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.4
> Environment: Windows/Linux
>    Reporter: Hong-Thai Nguyen
>Priority: Critical
> Fix For: 1.5
>
> Attachments: eventcombmt.chm
>
>
> When parsing the attached CHM file (Microsoft Help file), the Java process 
> gets stuck.
> {code}
> Thread[main,5,main]
>   org.apache.tika.parser.chm.lzx.ChmLzxBlock.extractContent(ChmLzxBlock.java:203)
>   org.apache.tika.parser.chm.lzx.ChmLzxBlock.<init>(ChmLzxBlock.java:77)
>   org.apache.tika.parser.chm.core.ChmExtractor.extractChmEntry(ChmExtractor.java:338)
>   org.apache.tika.parser.chm.CHMDocumentInformation.getContent(CHMDocumentInformation.java:72)
>   org.apache.tika.parser.chm.CHMDocumentInformation.getText(CHMDocumentInformation.java:141)
>   org.apache.tika.parser.chm.CHM2XHTML.process(CHM2XHTML.java:34)
>   org.apache.tika.parser.chm.ChmParser.parse(ChmParser.java:51)
>   org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
>   org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   org.apache.tika.parser.AbstractParser.parse(AbstractParser.java:53)
>   com.polyspot.document.converter.DocumentConverter.realizeConversion(DocumentConverter.java:192)
>   com.polyspot.document.converter.DocumentConverter.convert(DocumentConverter.java:114)
>   com.polyspot.wscrawlers.PsDocConverter.getConvertedDocument(PsDocConverter.java:77)
>   com.polyspot.wscrawlers.Converter.getConvertedDocument(Converter.java:81)
>   com.polyspot.wscrawlers.AbstractConverter.getDirectConvertedDocument(AbstractConverter.java:139)
>   com.polyspot.connector.framework.convert.PES5ConversionService.convert(PES5ConversionService.java:43)
>   com.polyspot.connector.framework.convert.ConversionService.findDocumentSplitterAndCallConvert(ConversionService.java:362)
>   com.polyspot.connector.framework.convert.ConversionService.convertAndGenerateThumbnailForMasterFile(ConversionService.java:291)
>   com.polyspot.connector.framework.processors.ConvertAndMergeMasterFile.process(ConvertAndMergeMasterFile.java:40)
>   com.polyspot.connector.framework.processors.SequenceDocumentProcessor.process(SequenceDocumentProcessor.java:21)
>   com.polyspot.connector.framework.plugins.DocumentBuilderPlugin.computeDocument(DocumentBuilderPlugin.java:48)
>   com.polyspot.connector.framework.plugins.PluginsManager.computeDocument(PluginsManager.java:219)
>   com.polyspot.connector.framework.orchestrators.Orchestrator.processOutOfDateNode(Orchestrator.java:201)
>   com.polyspot.connector.framework.orchestrators.Orchestrator.processGrantedNode(Orchestrator.java:172)
>   com.polyspot.connector.framework.orchestrators.Orchestrator.processNode(Orchestrator.java:237)
>   com.polyspot.connector.framework.orchestrators.Orchestrator.synchronize(Orchestrator.java:288)
>   com.polyspot.connector.framework.orchestrators.OrchestratorMonoThreaded.requestSynchronizeOnceCreated(OrchestratorMonoThreaded.java:16)
>   com.polyspot.connector.framework.orchestrators.Orchestrator.requestSynchronize(Orchestrator.java:108)
>   com.polyspot.connector.framework.MonitoredNodeExecutor.requestChildExecution(MonitoredNodeExecutor.java:29)
>   com.polyspot.connector.knowledgetree.model.content.KnowledgeTreeDocumentMetadata.synchronizeAllChildren(KnowledgeTreeDocumentMetadata.java:98)
>   com.polyspot.connector.framework.orchestrators.Orchestrator.syncChildren(Orchestrator.java:311)
>   com.polyspot.connector.framework.orchestrators.Orchestrator.processGrantedNode(Orchestrator.java:177)
>   com.polyspot.connector.framework.orchestrators.Orchestrator.processNode(Orchestrator.java:237)
>   com.polyspot.connector.framework.orchestrators.Orchestrator.synchronize(Orchestrator.java:288)
>   com.polyspot.connector.framework.orchestrators.OrchestratorMonoThreaded.requestSynchronizeOnceCreated(OrchestratorMonoThreaded.java:16)
>   com.polyspot.connector.framework.orchestrators.Orchestrator.requestSynchronize(Orchestrat

[jira] [Created] (TIKA-1152) Process stucks on parsing of a CHM file

2013-07-23 Thread Hong-Thai Nguyen (JIRA)
Hong-Thai Nguyen created TIKA-1152:
--

 Summary: Process stucks on parsing of a CHM file
 Key: TIKA-1152
 URL: https://issues.apache.org/jira/browse/TIKA-1152
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.4
 Environment: Windows/Linux
Reporter: Hong-Thai Nguyen
Priority: Critical
 Fix For: 1.5
 Attachments: eventcombmt.chm

When parsing the attached CHM file (Microsoft Help file), the Java process 
gets stuck.

{code}
Thread[main,5,main]
  org.apache.tika.parser.chm.lzx.ChmLzxBlock.extractContent(ChmLzxBlock.java:203)
  org.apache.tika.parser.chm.lzx.ChmLzxBlock.<init>(ChmLzxBlock.java:77)
  org.apache.tika.parser.chm.core.ChmExtractor.extractChmEntry(ChmExtractor.java:338)
  org.apache.tika.parser.chm.CHMDocumentInformation.getContent(CHMDocumentInformation.java:72)
  org.apache.tika.parser.chm.CHMDocumentInformation.getText(CHMDocumentInformation.java:141)
  org.apache.tika.parser.chm.CHM2XHTML.process(CHM2XHTML.java:34)
  org.apache.tika.parser.chm.ChmParser.parse(ChmParser.java:51)
  org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
  org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
  org.apache.tika.parser.AbstractParser.parse(AbstractParser.java:53)
  com.polyspot.document.converter.DocumentConverter.realizeConversion(DocumentConverter.java:192)
  com.polyspot.document.converter.DocumentConverter.convert(DocumentConverter.java:114)
  com.polyspot.wscrawlers.PsDocConverter.getConvertedDocument(PsDocConverter.java:77)
  com.polyspot.wscrawlers.Converter.getConvertedDocument(Converter.java:81)
  com.polyspot.wscrawlers.AbstractConverter.getDirectConvertedDocument(AbstractConverter.java:139)
  com.polyspot.connector.framework.convert.PES5ConversionService.convert(PES5ConversionService.java:43)
  com.polyspot.connector.framework.convert.ConversionService.findDocumentSplitterAndCallConvert(ConversionService.java:362)
  com.polyspot.connector.framework.convert.ConversionService.convertAndGenerateThumbnailForMasterFile(ConversionService.java:291)
  com.polyspot.connector.framework.processors.ConvertAndMergeMasterFile.process(ConvertAndMergeMasterFile.java:40)
  com.polyspot.connector.framework.processors.SequenceDocumentProcessor.process(SequenceDocumentProcessor.java:21)
  com.polyspot.connector.framework.plugins.DocumentBuilderPlugin.computeDocument(DocumentBuilderPlugin.java:48)
  com.polyspot.connector.framework.plugins.PluginsManager.computeDocument(PluginsManager.java:219)
  com.polyspot.connector.framework.orchestrators.Orchestrator.processOutOfDateNode(Orchestrator.java:201)
  com.polyspot.connector.framework.orchestrators.Orchestrator.processGrantedNode(Orchestrator.java:172)
  com.polyspot.connector.framework.orchestrators.Orchestrator.processNode(Orchestrator.java:237)
  com.polyspot.connector.framework.orchestrators.Orchestrator.synchronize(Orchestrator.java:288)
  com.polyspot.connector.framework.orchestrators.OrchestratorMonoThreaded.requestSynchronizeOnceCreated(OrchestratorMonoThreaded.java:16)
  com.polyspot.connector.framework.orchestrators.Orchestrator.requestSynchronize(Orchestrator.java:108)
  com.polyspot.connector.framework.MonitoredNodeExecutor.requestChildExecution(MonitoredNodeExecutor.java:29)
  com.polyspot.connector.knowledgetree.model.content.KnowledgeTreeDocumentMetadata.synchronizeAllChildren(KnowledgeTreeDocumentMetadata.java:98)
  com.polyspot.connector.framework.orchestrators.Orchestrator.syncChildren(Orchestrator.java:311)
  com.polyspot.connector.framework.orchestrators.Orchestrator.processGrantedNode(Orchestrator.java:177)
  com.polyspot.connector.framework.orchestrators.Orchestrator.processNode(Orchestrator.java:237)
  com.polyspot.connector.framework.orchestrators.Orchestrator.synchronize(Orchestrator.java:288)
  com.polyspot.connector.framework.orchestrators.OrchestratorMonoThreaded.requestSynchronizeOnceCreated(OrchestratorMonoThreaded.java:16)
  com.polyspot.connector.framework.orchestrators.Orchestrator.requestSynchronize(Orchestrator.java:108)
  com.polyspot.connector.framework.MonitoredNodeExecutor.requestChildExecution(MonitoredNodeExecutor.java:29)
  com.polyspot.connector.knowledgetree.driver.db.DBKnowledgeTreeDriver.executeAllDocuments(DBKnowledgeTreeDriver.java:71)
  com.polyspot.connector.knowledgetree.driver.KnowledgeTreeDriver.executeAllDocuments(KnowledgeTreeDriver.java:107)
  com.polyspot.connector.knowledgetree.model.content.KnowledgeTreeContent.synchronizeAllChildren(KnowledgeTreeContent.java:28

[jira] [Updated] (TIKA-1152) Process stucks on parsing of a CHM file

2013-07-23 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1152:
---

Description: 
When parsing [the attached CHM file|^eventcombmt.chm] (Microsoft Help file), 
the Java process gets stuck.

{code}
Thread[main,5,main]
  org.apache.tika.parser.chm.lzx.ChmLzxBlock.extractContent(ChmLzxBlock.java:203)
  org.apache.tika.parser.chm.lzx.ChmLzxBlock.<init>(ChmLzxBlock.java:77)
  org.apache.tika.parser.chm.core.ChmExtractor.extractChmEntry(ChmExtractor.java:338)
  org.apache.tika.parser.chm.CHMDocumentInformation.getContent(CHMDocumentInformation.java:72)
  org.apache.tika.parser.chm.CHMDocumentInformation.getText(CHMDocumentInformation.java:141)
  org.apache.tika.parser.chm.CHM2XHTML.process(CHM2XHTML.java:34)
  org.apache.tika.parser.chm.ChmParser.parse(ChmParser.java:51)
  org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
  org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
  org.apache.tika.parser.AbstractParser.parse(AbstractParser.java:53)
  com.polyspot.document.converter.DocumentConverter.realizeConversion(DocumentConverter.java:192)
  com.polyspot.document.converter.DocumentConverter.convert(DocumentConverter.java:114)
  com.polyspot.wscrawlers.PsDocConverter.getConvertedDocument(PsDocConverter.java:77)
  com.polyspot.wscrawlers.Converter.getConvertedDocument(Converter.java:81)
  com.polyspot.wscrawlers.AbstractConverter.getDirectConvertedDocument(AbstractConverter.java:139)
  com.polyspot.connector.framework.convert.PES5ConversionService.convert(PES5ConversionService.java:43)
  com.polyspot.connector.framework.convert.ConversionService.findDocumentSplitterAndCallConvert(ConversionService.java:362)
  com.polyspot.connector.framework.convert.ConversionService.convertAndGenerateThumbnailForMasterFile(ConversionService.java:291)
  com.polyspot.connector.framework.processors.ConvertAndMergeMasterFile.process(ConvertAndMergeMasterFile.java:40)
  com.polyspot.connector.framework.processors.SequenceDocumentProcessor.process(SequenceDocumentProcessor.java:21)
  com.polyspot.connector.framework.plugins.DocumentBuilderPlugin.computeDocument(DocumentBuilderPlugin.java:48)
  com.polyspot.connector.framework.plugins.PluginsManager.computeDocument(PluginsManager.java:219)
  com.polyspot.connector.framework.orchestrators.Orchestrator.processOutOfDateNode(Orchestrator.java:201)
  com.polyspot.connector.framework.orchestrators.Orchestrator.processGrantedNode(Orchestrator.java:172)
  com.polyspot.connector.framework.orchestrators.Orchestrator.processNode(Orchestrator.java:237)
  com.polyspot.connector.framework.orchestrators.Orchestrator.synchronize(Orchestrator.java:288)
  com.polyspot.connector.framework.orchestrators.OrchestratorMonoThreaded.requestSynchronizeOnceCreated(OrchestratorMonoThreaded.java:16)
  com.polyspot.connector.framework.orchestrators.Orchestrator.requestSynchronize(Orchestrator.java:108)
  com.polyspot.connector.framework.MonitoredNodeExecutor.requestChildExecution(MonitoredNodeExecutor.java:29)
  com.polyspot.connector.knowledgetree.model.content.KnowledgeTreeDocumentMetadata.synchronizeAllChildren(KnowledgeTreeDocumentMetadata.java:98)
  com.polyspot.connector.framework.orchestrators.Orchestrator.syncChildren(Orchestrator.java:311)
  com.polyspot.connector.framework.orchestrators.Orchestrator.processGrantedNode(Orchestrator.java:177)
  com.polyspot.connector.framework.orchestrators.Orchestrator.processNode(Orchestrator.java:237)
  com.polyspot.connector.framework.orchestrators.Orchestrator.synchronize(Orchestrator.java:288)
  com.polyspot.connector.framework.orchestrators.OrchestratorMonoThreaded.requestSynchronizeOnceCreated(OrchestratorMonoThreaded.java:16)
  com.polyspot.connector.framework.orchestrators.Orchestrator.requestSynchronize(Orchestrator.java:108)
  com.polyspot.connector.framework.MonitoredNodeExecutor.requestChildExecution(MonitoredNodeExecutor.java:29)
  com.polyspot.connector.knowledgetree.driver.db.DBKnowledgeTreeDriver.executeAllDocuments(DBKnowledgeTreeDriver.java:71)
  com.polyspot.connector.knowledgetree.driver.KnowledgeTreeDriver.executeAllDocuments(KnowledgeTreeDriver.java:107)
  com.polyspot.connector.knowledgetree.model.content.KnowledgeTreeContent.synchronizeAllChildren(KnowledgeTreeContent.java:28)
  com.polyspot.connector.framework.orchestrators.Orchestrator.syncChildren(Orchestrator.java:311)
  com.polyspot.connector.framework.orchestrators.Orchestrator.processGrantedNode(Orchestrator.java:177)
  com.polyspot.connector.framework.orchestrators.Orchestrator.processNode(Orchestrator.java:237

[jira] [Updated] (TIKA-1152) Process stucks on parsing of a CHM file

2013-07-23 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1152:
---

Description: 
When parsing [the attached CHM file|^eventcombmt.chm] (Microsoft Help file), 
the Java process gets stuck.

{code}
Thread[main,5,main]
  org.apache.tika.parser.chm.lzx.ChmLzxBlock.extractContent(ChmLzxBlock.java:203)
  org.apache.tika.parser.chm.lzx.ChmLzxBlock.<init>(ChmLzxBlock.java:77)
  org.apache.tika.parser.chm.core.ChmExtractor.extractChmEntry(ChmExtractor.java:338)
  org.apache.tika.parser.chm.CHMDocumentInformation.getContent(CHMDocumentInformation.java:72)
  org.apache.tika.parser.chm.CHMDocumentInformation.getText(CHMDocumentInformation.java:141)
  org.apache.tika.parser.chm.CHM2XHTML.process(CHM2XHTML.java:34)
  org.apache.tika.parser.chm.ChmParser.parse(ChmParser.java:51)
  org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
  org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
  org.apache.tika.parser.AbstractParser.parse(AbstractParser.java:53)
  com.polyspot.document.converter.DocumentConverter.realizeConversion(DocumentConverter.java:192)
...
{code}

  was:
By parsing [the attachment CHM file|^eventcombmt.chm] (MS Microsoft Help 
Files), Java process stucks.

{code}
Thread[main,5,main]


org.apache.tika.parser.chm.lzx.ChmLzxBlock.extractContent(ChmLzxBlock.java:203)
org.apache.tika.parser.chm.lzx.ChmLzxBlock.<init>(ChmLzxBlock.java:77)

org.apache.tika.parser.chm.core.ChmExtractor.extractChmEntry(ChmExtractor.java:338)

org.apache.tika.parser.chm.CHMDocumentInformation.getContent(CHMDocumentInformation.java:72)

org.apache.tika.parser.chm.CHMDocumentInformation.getText(CHMDocumentInformation.java:141)
org.apache.tika.parser.chm.CHM2XHTML.process(CHM2XHTML.java:34)
org.apache.tika.parser.chm.ChmParser.parse(ChmParser.java:51)
org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
org.apache.tika.parser.AbstractParser.parse(AbstractParser.java:53)

com.polyspot.document.converter.DocumentConverter.realizeConversion(DocumentConverter.java:192)

com.polyspot.document.converter.DocumentConverter.convert(DocumentConverter.java:114)

com.polyspot.wscrawlers.PsDocConverter.getConvertedDocument(PsDocConverter.java:77)

com.polyspot.wscrawlers.Converter.getConvertedDocument(Converter.java:81)

com.polyspot.wscrawlers.AbstractConverter.getDirectConvertedDocument(AbstractConverter.java:139)

com.polyspot.connector.framework.convert.PES5ConversionService.convert(PES5ConversionService.java:43)

com.polyspot.connector.framework.convert.ConversionService.findDocumentSplitterAndCallConvert(ConversionService.java:362)

com.polyspot.connector.framework.convert.ConversionService.convertAndGenerateThumbnailForMasterFile(ConversionService.java:291)

com.polyspot.connector.framework.processors.ConvertAndMergeMasterFile.process(ConvertAndMergeMasterFile.java:40)

com.polyspot.connector.framework.processors.SequenceDocumentProcessor.process(SequenceDocumentProcessor.java:21)

com.polyspot.connector.framework.plugins.DocumentBuilderPlugin.computeDocument(DocumentBuilderPlugin.java:48)

com.polyspot.connector.framework.plugins.PluginsManager.computeDocument(PluginsManager.java:219)

com.polyspot.connector.framework.orchestrators.Orchestrator.processOutOfDateNode(Orchestrator.java:201)

com.polyspot.connector.framework.orchestrators.Orchestrator.processGrantedNode(Orchestrator.java:172)

com.polyspot.connector.framework.orchestrators.Orchestrator.processNode(Orchestrator.java:237)

com.polyspot.connector.framework.orchestrators.Orchestrator.synchronize(Orchestrator.java:288)

com.polyspot.connector.framework.orchestrators.OrchestratorMonoThreaded.requestSynchronizeOnceCreated(OrchestratorMonoThreaded.java:16)

com.polyspot.connector.framework.orchestrators.Orchestrator.requestSynchronize(Orchestrator.java:108)

com.polyspot.connector.framework.MonitoredNodeExecutor.requestChildExecution(MonitoredNodeExecutor.java:29)

com.polyspot.connector.knowledgetree.model.content.KnowledgeTreeDocumentMetadata.synchronizeAllChildren(KnowledgeTreeDocumentMetadata.java:98)

com.polyspot.connector.framework.orchestrators.Orchestrator.syncChildren(Orchestrator.java:311)

com.polyspot.connector.framework.orchestrators.Orchestrator.processGrantedNode(Orchestrator.java:177)

com.polyspot.connector.framework.orchestrators.Orchestrator.processNode(Orchestrator.java:237)

com.polyspot.connector.framework.orchestrators.Orchestrator.synchronize(Orchestrator.java:288

[jira] [Created] (TIKA-1153) Upgrade pdfbox to latest 1.8.2 version

2013-07-23 Thread Hong-Thai Nguyen (JIRA)
Hong-Thai Nguyen created TIKA-1153:
--

 Summary: Upgrade pdfbox to latest 1.8.2 version
 Key: TIKA-1153
 URL: https://issues.apache.org/jira/browse/TIKA-1153
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.4
 Environment: Windows/Linux
Reporter: Hong-Thai Nguyen
Priority: Critical
 Fix For: 1.5


The current version is 1.8.1.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (TIKA-1076) Upgrade to Apache POI 3.9

2013-07-23 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13716402#comment-13716402
 ] 

Hong-Thai Nguyen commented on TIKA-1076:


Strange, I already see POI 3.9 among the dependencies of Tika 1.4.

> Upgrade to Apache POI 3.9
> -
>
> Key: TIKA-1076
> URL: https://issues.apache.org/jira/browse/TIKA-1076
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.3
>Reporter: Nick Burch
> Fix For: 1.5
>
>
> We should upgrade to Apache POI 3.9, which is the latest version


[jira] [Updated] (TIKA-1152) Process loops infinitely on parsing of a CHM file

2013-07-23 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1152:
---

Summary: Process loops infinitely on parsing of a CHM file  (was: Process 
stucks on parsing of a CHM file)

> Process loops infinitely on parsing of a CHM file
> -
>
> Key: TIKA-1152
> URL: https://issues.apache.org/jira/browse/TIKA-1152
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.4
> Environment: Windows/Linux
>    Reporter: Hong-Thai Nguyen
>Priority: Critical
> Fix For: 1.5
>
> Attachments: eventcombmt.chm
>
>
> When parsing [the attached CHM file|^eventcombmt.chm] (Microsoft Help 
> file), the Java process gets stuck.
> {code}
> Thread[main,5,main]
>   
> org.apache.tika.parser.chm.lzx.ChmLzxBlock.extractContent(ChmLzxBlock.java:203)
>   org.apache.tika.parser.chm.lzx.ChmLzxBlock.<init>(ChmLzxBlock.java:77)
>   
> org.apache.tika.parser.chm.core.ChmExtractor.extractChmEntry(ChmExtractor.java:338)
>   
> org.apache.tika.parser.chm.CHMDocumentInformation.getContent(CHMDocumentInformation.java:72)
>   
> org.apache.tika.parser.chm.CHMDocumentInformation.getText(CHMDocumentInformation.java:141)
>   org.apache.tika.parser.chm.CHM2XHTML.process(CHM2XHTML.java:34)
>   org.apache.tika.parser.chm.ChmParser.parse(ChmParser.java:51)
>   org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
>   org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   org.apache.tika.parser.AbstractParser.parse(AbstractParser.java:53)
>   
> com.polyspot.document.converter.DocumentConverter.realizeConversion(DocumentConverter.java:192)
> ...
> {code}


[jira] [Commented] (TIKA-1152) Process loops infinitely on parsing of a CHM file

2013-07-23 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13716538#comment-13716538
 ] 

Hong-Thai Nguyen commented on TIKA-1152:


It's a bug in ChmLzxBlock.java triggered by this faulty file. I'm ready to push a fix, 
but I don't have an ASF account on the Tika project.

> Process loops infinitely on parsing of a CHM file
> -
>
> Key: TIKA-1152
> URL: https://issues.apache.org/jira/browse/TIKA-1152
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.4
> Environment: Windows/Linux
>Reporter: Hong-Thai Nguyen
>Priority: Critical
> Fix For: 1.5
>
> Attachments: eventcombmt.chm
>
>
> When parsing [the attached CHM file|^eventcombmt.chm] (Microsoft Help 
> file), the Java process gets stuck.
> {code}
> Thread[main,5,main]
>   
> org.apache.tika.parser.chm.lzx.ChmLzxBlock.extractContent(ChmLzxBlock.java:203)
>   org.apache.tika.parser.chm.lzx.ChmLzxBlock.<init>(ChmLzxBlock.java:77)
>   
> org.apache.tika.parser.chm.core.ChmExtractor.extractChmEntry(ChmExtractor.java:338)
>   
> org.apache.tika.parser.chm.CHMDocumentInformation.getContent(CHMDocumentInformation.java:72)
>   
> org.apache.tika.parser.chm.CHMDocumentInformation.getText(CHMDocumentInformation.java:141)
>   org.apache.tika.parser.chm.CHM2XHTML.process(CHM2XHTML.java:34)
>   org.apache.tika.parser.chm.ChmParser.parse(ChmParser.java:51)
>   org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
>   org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   org.apache.tika.parser.AbstractParser.parse(AbstractParser.java:53)
>   
> com.polyspot.document.converter.DocumentConverter.realizeConversion(DocumentConverter.java:192)
> ...
> {code}


[jira] [Comment Edited] (TIKA-1152) Process loops infinitely on parsing of a CHM file

2013-07-23 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13716538#comment-13716538
 ] 

Hong-Thai Nguyen edited comment on TIKA-1152 at 7/23/13 4:17 PM:
-

It's a bug in ChmLzxBlock.java triggered by this faulty file. The loop never exits when 
the block type does not match.
I'm ready to push a fix, but I don't have an ASF account on the Tika project.

  was (Author: thaichat04):
It's a bug on ChmLzxBlock.java on this fautly file. I'm ready to push a 
fix, but don't have ASF account on Tika project.
  
> Process loops infinitely on parsing of a CHM file
> -
>
> Key: TIKA-1152
> URL: https://issues.apache.org/jira/browse/TIKA-1152
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.4
> Environment: Windows/Linux
>Reporter: Hong-Thai Nguyen
>Priority: Critical
> Fix For: 1.5
>
> Attachments: eventcombmt.chm
>
>
> When parsing [the attached CHM file|^eventcombmt.chm] (Microsoft Help 
> file), the Java process gets stuck.
> {code}
> Thread[main,5,main]
>   
> org.apache.tika.parser.chm.lzx.ChmLzxBlock.extractContent(ChmLzxBlock.java:203)
>   org.apache.tika.parser.chm.lzx.ChmLzxBlock.<init>(ChmLzxBlock.java:77)
>   
> org.apache.tika.parser.chm.core.ChmExtractor.extractChmEntry(ChmExtractor.java:338)
>   
> org.apache.tika.parser.chm.CHMDocumentInformation.getContent(CHMDocumentInformation.java:72)
>   
> org.apache.tika.parser.chm.CHMDocumentInformation.getText(CHMDocumentInformation.java:141)
>   org.apache.tika.parser.chm.CHM2XHTML.process(CHM2XHTML.java:34)
>   org.apache.tika.parser.chm.ChmParser.parse(ChmParser.java:51)
>   org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
>   org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   org.apache.tika.parser.AbstractParser.parse(AbstractParser.java:53)
>   
> com.polyspot.document.converter.DocumentConverter.realizeConversion(DocumentConverter.java:192)
> ...
> {code}


[jira] [Commented] (TIKA-1076) Upgrade to Apache POI 3.9

2013-07-23 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13716543#comment-13716543
 ] 

Hong-Thai Nguyen commented on TIKA-1076:


OK, as this improvement has been done (3.9 is there), should we properly close 
this issue and create another one to handle the regression effects?

> Upgrade to Apache POI 3.9
> -
>
> Key: TIKA-1076
> URL: https://issues.apache.org/jira/browse/TIKA-1076
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.3
>Reporter: Nick Burch
> Fix For: 1.5
>
>
> We should upgrade to Apache POI 3.9, which is the latest version


[jira] [Updated] (TIKA-1152) Process loops infinitely on parsing of a CHM file

2013-07-29 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1152:
---

Attachment: ChmLzxBlock.java.patch

I attached a patch for this fix.

> Process loops infinitely on parsing of a CHM file
> -
>
> Key: TIKA-1152
> URL: https://issues.apache.org/jira/browse/TIKA-1152
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.4
> Environment: Windows/Linux
>    Reporter: Hong-Thai Nguyen
>Priority: Critical
> Fix For: 1.5
>
> Attachments: ChmLzxBlock.java.patch, eventcombmt.chm
>
>
> When parsing [the attached CHM file|^eventcombmt.chm] (Microsoft Help 
> file), the Java process gets stuck.
> {code}
> Thread[main,5,main]
>   
> org.apache.tika.parser.chm.lzx.ChmLzxBlock.extractContent(ChmLzxBlock.java:203)
>   org.apache.tika.parser.chm.lzx.ChmLzxBlock.<init>(ChmLzxBlock.java:77)
>   
> org.apache.tika.parser.chm.core.ChmExtractor.extractChmEntry(ChmExtractor.java:338)
>   
> org.apache.tika.parser.chm.CHMDocumentInformation.getContent(CHMDocumentInformation.java:72)
>   
> org.apache.tika.parser.chm.CHMDocumentInformation.getText(CHMDocumentInformation.java:141)
>   org.apache.tika.parser.chm.CHM2XHTML.process(CHM2XHTML.java:34)
>   org.apache.tika.parser.chm.ChmParser.parse(ChmParser.java:51)
>   org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
>   org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   org.apache.tika.parser.AbstractParser.parse(AbstractParser.java:53)
>   
> com.polyspot.document.converter.DocumentConverter.realizeConversion(DocumentConverter.java:192)
> ...
> {code}


[jira] [Comment Edited] (TIKA-1152) Process loops infinitely on parsing of a CHM file

2013-07-29 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13722457#comment-13722457
 ] 

Hong-Thai Nguyen edited comment on TIKA-1152 at 7/29/13 1:45 PM:
-

I attached a patch, [^ChmLzxBlock.java.patch], for this fix.

  was (Author: thaichat04):
I attached patch commit for this fix.
  
> Process loops infinitely on parsing of a CHM file
> -
>
> Key: TIKA-1152
> URL: https://issues.apache.org/jira/browse/TIKA-1152
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.4
> Environment: Windows/Linux
>    Reporter: Hong-Thai Nguyen
>Priority: Critical
> Fix For: 1.5
>
> Attachments: ChmLzxBlock.java.patch, eventcombmt.chm
>
>
> When parsing [the attached CHM file|^eventcombmt.chm] (Microsoft Help 
> file), the Java process gets stuck.
> {code}
> Thread[main,5,main]
>   
> org.apache.tika.parser.chm.lzx.ChmLzxBlock.extractContent(ChmLzxBlock.java:203)
>   org.apache.tika.parser.chm.lzx.ChmLzxBlock.<init>(ChmLzxBlock.java:77)
>   
> org.apache.tika.parser.chm.core.ChmExtractor.extractChmEntry(ChmExtractor.java:338)
>   
> org.apache.tika.parser.chm.CHMDocumentInformation.getContent(CHMDocumentInformation.java:72)
>   
> org.apache.tika.parser.chm.CHMDocumentInformation.getText(CHMDocumentInformation.java:141)
>   org.apache.tika.parser.chm.CHM2XHTML.process(CHM2XHTML.java:34)
>   org.apache.tika.parser.chm.ChmParser.parse(ChmParser.java:51)
>   org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
>   org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   org.apache.tika.parser.AbstractParser.parse(AbstractParser.java:53)
>   
> com.polyspot.document.converter.DocumentConverter.realizeConversion(DocumentConverter.java:192)
> ...
> {code}


[jira] [Comment Edited] (TIKA-1152) Process loops infinitely on parsing of a CHM file

2013-07-29 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13716538#comment-13716538
 ] 

Hong-Thai Nguyen edited comment on TIKA-1152 at 7/29/13 1:46 PM:
-

It's a bug in ChmLzxBlock.java triggered by this faulty file. The loop never exits when 
the block type does not match.
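
To illustrate the failure mode described in this comment, here is a minimal sketch in Java. It is not the code from ChmLzxBlock.java or from the attached patch: the class name BlockLoopSketch, the block-type codes, and the input layout are all hypothetical. It only shows the general pattern, where a decode loop makes progress for known block types and must reject an unknown type explicitly, because otherwise the loop condition never changes.

```java
/** Hypothetical sketch; not the actual Tika ChmLzxBlock code. */
final class BlockLoopSketch {
    // Hypothetical block-type codes, chosen for illustration only.
    static final int VERBATIM = 1;
    static final int ALIGNED = 2;
    static final int UNCOMPRESSED = 3;

    /**
     * Walks a sequence of block-type codes and returns how many blocks were consumed.
     * Each known type advances the position. An unknown type raises an exception
     * instead of leaving the position unchanged, which would make the loop spin forever.
     */
    static int consume(int[] blockTypes) {
        int pos = 0;
        while (pos < blockTypes.length) {
            switch (blockTypes[pos]) {
                case VERBATIM:
                case ALIGNED:
                case UNCOMPRESSED:
                    pos++; // progress is made only for recognized types
                    break;
                default:
                    // Without this branch, pos would never change for a bad type.
                    throw new IllegalStateException(
                            "Unexpected block type " + blockTypes[pos] + " at index " + pos);
            }
        }
        return pos;
    }
}
```

With this guard, a corrupt input fails fast with an exception that the caller can report, rather than hanging the thread.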

  was (Author: thaichat04):
It's a bug on ChmLzxBlock.java on this fautly file. It leave never loop 
when block type does not match.
I'm ready to push a fix, but don't have ASF account on Tika project.
  
> Process loops infinitely on parsing of a CHM file
> -
>
> Key: TIKA-1152
> URL: https://issues.apache.org/jira/browse/TIKA-1152
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.4
> Environment: Windows/Linux
>Reporter: Hong-Thai Nguyen
>Priority: Critical
> Fix For: 1.5
>
> Attachments: ChmLzxBlock.java.patch, eventcombmt.chm
>
>
> When parsing [the attached CHM file|^eventcombmt.chm] (Microsoft Help 
> file), the Java process gets stuck.
> {code}
> Thread[main,5,main]
>   
> org.apache.tika.parser.chm.lzx.ChmLzxBlock.extractContent(ChmLzxBlock.java:203)
>   org.apache.tika.parser.chm.lzx.ChmLzxBlock.<init>(ChmLzxBlock.java:77)
>   
> org.apache.tika.parser.chm.core.ChmExtractor.extractChmEntry(ChmExtractor.java:338)
>   
> org.apache.tika.parser.chm.CHMDocumentInformation.getContent(CHMDocumentInformation.java:72)
>   
> org.apache.tika.parser.chm.CHMDocumentInformation.getText(CHMDocumentInformation.java:141)
>   org.apache.tika.parser.chm.CHM2XHTML.process(CHM2XHTML.java:34)
>   org.apache.tika.parser.chm.ChmParser.parse(ChmParser.java:51)
>   org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
>   org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   org.apache.tika.parser.AbstractParser.parse(AbstractParser.java:53)
>   
> com.polyspot.document.converter.DocumentConverter.realizeConversion(DocumentConverter.java:192)
> ...
> {code}


[jira] [Created] (TIKA-1200) Upgrade pdfbox 1.8.3

2013-12-02 Thread Hong-Thai Nguyen (JIRA)
Hong-Thai Nguyen created TIKA-1200:
--

 Summary: Upgrade pdfbox 1.8.3
 Key: TIKA-1200
 URL: https://issues.apache.org/jira/browse/TIKA-1200
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.4
 Environment: all
Reporter: Hong-Thai Nguyen
Priority: Critical
 Fix For: 1.5


PDFBox has just released a new version, 1.8.3:
http://www.apache.org/dist/pdfbox/1.8.3/RELEASE-NOTES.txt



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Created] (TIKA-1201) Add option for switching to pdfbox NonSequentialPDFParser

2013-12-02 Thread Hong-Thai Nguyen (JIRA)
Hong-Thai Nguyen created TIKA-1201:
--

 Summary: Add option for switching to pdfbox NonSequentialPDFParser
 Key: TIKA-1201
 URL: https://issues.apache.org/jira/browse/TIKA-1201
 Project: Tika
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.4
 Environment: all
Reporter: Hong-Thai Nguyen
Priority: Critical


As discussed, we can improve PDF extraction by 45% with the new 
NonSequentialPDFParser, which also conforms more closely to the PDF specification. 
This parser will be integrated by default in PDFBox 2.0.

ref.: 
https://issues.apache.org/jira/browse/PDFBOX-1104
http://pdfbox.apache.org/ideas.html

We should provide an extended parser, or parameterize the current PDFParser, to call:
{code}
PDDocument.loadNonSeq(file, scratchFile);
{code}
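
As a rough illustration of the "parameterize the current parser" option, here is a minimal sketch in Java. It does not use Tika or PDFBox classes: the class PdfLoaderSwitchSketch, the flag name useNonSequentialParser, and the two loader functions are hypothetical stand-ins. In a real integration, the non-sequential loader would presumably wrap the loadNonSeq call quoted above and the other loader would wrap the existing load path.

```java
import java.util.function.Function;

/** Hypothetical sketch of a loader switch; none of these names come from Tika or PDFBox. */
final class PdfLoaderSwitchSketch {
    // Each loader maps a file path to a String that stands in for a loaded document.
    private final Function<String, String> sequentialLoader;
    private final Function<String, String> nonSequentialLoader;

    // Off by default, so existing behavior is unchanged unless a caller opts in.
    private boolean useNonSequentialParser = false;

    PdfLoaderSwitchSketch(Function<String, String> sequentialLoader,
                          Function<String, String> nonSequentialLoader) {
        this.sequentialLoader = sequentialLoader;
        this.nonSequentialLoader = nonSequentialLoader;
    }

    void setUseNonSequentialParser(boolean value) {
        this.useNonSequentialParser = value;
    }

    /** Loads the document at the given path with whichever loader the flag selects. */
    String load(String path) {
        Function<String, String> loader =
                useNonSequentialParser ? nonSequentialLoader : sequentialLoader;
        return loader.apply(path);
    }
}
```

Keeping the flag off by default means callers who do nothing see no change, which seems the safer choice while the new parser is still optional upstream.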




[jira] [Updated] (TIKA-1201) Add possibility for switching to pdfbox NonSequentialPDFParser

2013-12-02 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1201:
---

Summary: Add possibility for switching to pdfbox NonSequentialPDFParser  
(was: Add option for switching to pdfbox NonSequentialPDFParser)

> Add possibility for switching to pdfbox NonSequentialPDFParser
> --
>
> Key: TIKA-1201
> URL: https://issues.apache.org/jira/browse/TIKA-1201
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.4
> Environment: all
>    Reporter: Hong-Thai Nguyen
>Priority: Critical
>
> As discussed, we can improve PDF extraction by 45% with the new 
> NonSequentialPDFParser, which also conforms more closely to the PDF 
> specification. This parser will be integrated by default in PDFBox 2.0.
> ref.: 
> https://issues.apache.org/jira/browse/PDFBOX-1104
> http://pdfbox.apache.org/ideas.html
> We should provide an extended parser, or parameterize the current PDFParser, to call:
> {code}
> PDDocument.loadNonSeq(file, scratchFile);
> {code}




[jira] [Commented] (TIKA-1202) Refactor PDFParser to enable easier parameter setting

2013-12-04 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13838732#comment-13838732
 ] 

Hong-Thai Nguyen commented on TIKA-1202:


+1 for me.
Thanks

> Refactor PDFParser to enable easier parameter setting
> -
>
> Key: TIKA-1202
> URL: https://issues.apache.org/jira/browse/TIKA-1202
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.5
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Trivial
> Attachments: TIKA-1202.patch
>
>
> It would be handy to be able to set PDFParser parameters 
> (extractAnnotationText, etc) in a config file and via ParseContext.




[jira] [Commented] (TIKA-1205) Allow PDFParser to fallback to other parser if there is an exception

2013-12-11 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13845398#comment-13845398
 ] 

Hong-Thai Nguyen commented on TIKA-1205:


Just a (newbie) question: why limit this to PDFParser rather than any other parser?
I agree that a fallback is necessary when an exception occurs. But the worst case 
is an infinite loop while parsing a document.

To cover both cases, could we generalize this and handle exceptions and timeouts 
properly in a wrapper?
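
To make the wrapper idea concrete, here is a minimal sketch in Java using only java.util.concurrent. It is not an existing Tika API: the class GuardedParseSketch and the method runWithFallback are hypothetical, and each "attempt" is just a Callable standing in for one parser invocation. Each attempt gets a time budget, and the next attempt is tried if the current one throws or times out.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch of a generic timeout-and-fallback wrapper; not a Tika class. */
final class GuardedParseSketch {
    /**
     * Tries each attempt in order, allowing each at most timeoutMillis.
     * Returns the first successful result, or rethrows the last failure
     * if every attempt throws or times out.
     */
    static <T> T runWithFallback(List<Callable<T>> attempts, long timeoutMillis)
            throws Exception {
        ExecutorService pool = Executors.newCachedThreadPool(r -> {
            Thread t = new Thread(r, "guarded-parse");
            t.setDaemon(true); // a stuck attempt must not keep the JVM alive
            return t;
        });
        try {
            Exception last = new IllegalArgumentException("no attempts given");
            for (Callable<T> attempt : attempts) {
                Future<T> future = pool.submit(attempt);
                try {
                    return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
                } catch (TimeoutException e) {
                    // Requests interruption; code that ignores interrupts keeps running.
                    future.cancel(true);
                    last = e;
                } catch (ExecutionException e) {
                    last = e; // the attempt itself threw, so move on to the next one
                }
            }
            throw last;
        } finally {
            pool.shutdownNow();
        }
    }
}
```

One limitation of this approach: a timeout only stops waiting for the result. A tight loop that never checks the interrupt flag will keep consuming a thread, so guarding against that case fully would need the parse to run in a separate process.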

> Allow PDFParser to fallback to other parser if there is an exception
> 
>
> Key: TIKA-1205
> URL: https://issues.apache.org/jira/browse/TIKA-1205
> Project: Tika
>  Issue Type: Improvement
>  Components: parser
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Trivial
> Fix For: 1.5
>
>
> With TIKA-1201, there is now an option to use PDFBox's NonSequentialPDFParser 
> instead of the traditional parser for parsing PDF files.  Following the 
> description in PDFBOX-1199, it would be useful to allow fallback to the 
> classic parser if NonSequentialPDFParser throws an IOException.  For the sake 
> of symmetry, I propose a boolean useParserFallbackOnException parameter.  If 
> this parameter is true, and if Tika's PDFParser is using the classic parser, 
> Tika will fallback to the NonSequentialPDFParser if there is an IOException; 
> if this parameter is true and if Tika's PDFParser is using the 
> NonSequentialPDFParser it will fallback to the classic parser if there is an 
> IOException.
> Many thanks to Hong-Thai for championing the addition of the added 
> NonSequentialPDFParser capability in TIKA-1201, and many thanks to Timo for 
> PDFBox's NonSequentialPDFParser (PDFBOX-1199)!



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)


[jira] [Commented] (TIKA-1152) Process loops infinitely on parsing of a CHM file

2013-12-23 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855528#comment-13855528
 ] 

Hong-Thai Nguyen commented on TIKA-1152:


[~gagravarr], or anyone else, could you please have a look at the patch and integrate it 
into trunk before the 1.5 release?
Thanks

> Process loops infinitely on parsing of a CHM file
> -
>
> Key: TIKA-1152
> URL: https://issues.apache.org/jira/browse/TIKA-1152
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.4
> Environment: Windows/Linux
>    Reporter: Hong-Thai Nguyen
>Priority: Critical
> Fix For: 1.5
>
> Attachments: ChmLzxBlock.java.patch, eventcombmt.chm
>
>
> When parsing [the attached CHM file|^eventcombmt.chm] (Microsoft Help 
> file), the Java process gets stuck.
> {code}
> Thread[main,5,main]
>   
> org.apache.tika.parser.chm.lzx.ChmLzxBlock.extractContent(ChmLzxBlock.java:203)
>   org.apache.tika.parser.chm.lzx.ChmLzxBlock.<init>(ChmLzxBlock.java:77)
>   
> org.apache.tika.parser.chm.core.ChmExtractor.extractChmEntry(ChmExtractor.java:338)
>   
> org.apache.tika.parser.chm.CHMDocumentInformation.getContent(CHMDocumentInformation.java:72)
>   
> org.apache.tika.parser.chm.CHMDocumentInformation.getText(CHMDocumentInformation.java:141)
>   org.apache.tika.parser.chm.CHM2XHTML.process(CHM2XHTML.java:34)
>   org.apache.tika.parser.chm.ChmParser.parse(ChmParser.java:51)
>   org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
>   org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   org.apache.tika.parser.AbstractParser.parse(AbstractParser.java:53)
>   
> com.polyspot.document.converter.DocumentConverter.realizeConversion(DocumentConverter.java:192)
> ...
> {code}



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (TIKA-1152) Process loops infinitely on parsing of a CHM file

2013-12-27 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13857418#comment-13857418
 ] 

Hong-Thai Nguyen commented on TIKA-1152:


Thanks [~jukkaz], I've checked on trunk. It seems OK now.

> Process loops infinitely on parsing of a CHM file
> -
>
> Key: TIKA-1152
> URL: https://issues.apache.org/jira/browse/TIKA-1152
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.4
> Environment: Windows/Linux
>Reporter: Hong-Thai Nguyen
>Assignee: Jukka Zitting
>Priority: Critical
> Fix For: 1.5
>
> Attachments: ChmLzxBlock.java.patch, eventcombmt.chm
>
>
> When parsing [the attached CHM file|^eventcombmt.chm] (Microsoft Help 
> file), the Java process gets stuck.
> {code}
> Thread[main,5,main]
>   
> org.apache.tika.parser.chm.lzx.ChmLzxBlock.extractContent(ChmLzxBlock.java:203)
>   org.apache.tika.parser.chm.lzx.ChmLzxBlock.<init>(ChmLzxBlock.java:77)
>   
> org.apache.tika.parser.chm.core.ChmExtractor.extractChmEntry(ChmExtractor.java:338)
>   
> org.apache.tika.parser.chm.CHMDocumentInformation.getContent(CHMDocumentInformation.java:72)
>   
> org.apache.tika.parser.chm.CHMDocumentInformation.getText(CHMDocumentInformation.java:141)
>   org.apache.tika.parser.chm.CHM2XHTML.process(CHM2XHTML.java:34)
>   org.apache.tika.parser.chm.ChmParser.parse(ChmParser.java:51)
>   org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
>   org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   org.apache.tika.parser.AbstractParser.parse(AbstractParser.java:53)
>   
> com.polyspot.document.converter.DocumentConverter.realizeConversion(DocumentConverter.java:192)
> ...
> {code}




[jira] [Updated] (TIKA-1215) Regression: Unable parse a mp3 file on 1.5 which parsed successfully on 1.4

2013-12-27 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1215:
---

Attachment: Centres 080805@0650 RTBF Matin Première - A propos des rues de 
Dublin et Dubreucq.mp3

> Regression: Unable parse a mp3 file on 1.5 which parsed successfully on 1.4
> ---
>
> Key: TIKA-1215
> URL: https://issues.apache.org/jira/browse/TIKA-1215
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.5
>    Reporter: Hong-Thai Nguyen
>Priority: Critical
> Attachments: Centres 080805@0650 RTBF Matin Première - A propos des 
> rues de Dublin et Dubreucq.mp3
>
>
> With the attached file, 1.5 raises this exception on parsing. This file parses 
> without problems on 1.4.
> {code}
> ...
> Caused by: org.xml.sax.SAXException: Namespace http://www.w3.org/1999/xhtml 
> not declared
>   at 
> org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:62)
>   at 
> org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getQName(ToXMLContentHandler.java:68)
>   at 
> org.apache.tika.sax.ToXMLContentHandler.startElement(ToXMLContentHandler.java:148)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.element(XHTMLContentHandler.java:323)
>   at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:107)
>   at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at 
> com.polyspot.document.converter.DocumentConverter.realizeTikaConversion(DocumentConverter.java:221)
>   ... 15 more
> {code}




[jira] [Created] (TIKA-1215) Regression: Unable parse a mp3 file on 1.5 which parsed successfully on 1.4

2013-12-27 Thread Hong-Thai Nguyen (JIRA)
Hong-Thai Nguyen created TIKA-1215:
--

 Summary: Regression: Unable parse a mp3 file on 1.5 which parsed 
successfully on 1.4
 Key: TIKA-1215
 URL: https://issues.apache.org/jira/browse/TIKA-1215
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.5
Reporter: Hong-Thai Nguyen
Priority: Critical
 Attachments: Centres 080805@0650 RTBF Matin Première - A propos des 
rues de Dublin et Dubreucq.mp3

With the attached file, 1.5 raises this exception on parsing. This file parses 
without problems on 1.4.
{code}
...
Caused by: org.xml.sax.SAXException: Namespace http://www.w3.org/1999/xhtml not 
declared
at 
org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:62)
at 
org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getQName(ToXMLContentHandler.java:68)
at 
org.apache.tika.sax.ToXMLContentHandler.startElement(ToXMLContentHandler.java:148)
at 
org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
at 
org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
at 
org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
at 
org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
at 
org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
at 
org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
at 
org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
at 
org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
at 
org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284)
at 
org.apache.tika.sax.XHTMLContentHandler.element(XHTMLContentHandler.java:323)
at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:107)
at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at 
com.polyspot.document.converter.DocumentConverter.realizeTikaConversion(DocumentConverter.java:221)
... 15 more
{code}
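
For readers unfamiliar with SAX namespace handling, here is a minimal sketch of how an error like the one above can arise. It assumes that the serializing handler keeps a table of declared namespaces, filled by `startPrefixMapping`, and rejects any element whose namespace URI is missing from it. `PrefixTrackingHandler` and `NamespaceDemo` are hypothetical names used only for illustration; this is not Tika's actual `ToXMLContentHandler`, and only the JDK's `org.xml.sax` classes are used.

```java
import java.util.HashMap;
import java.util.Map;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical stand-in for a serializing handler; NOT Tika's ToXMLContentHandler.
class PrefixTrackingHandler extends DefaultHandler {
    // Namespace URI -> prefix, filled only by startPrefixMapping events.
    private final Map<String, String> prefixByUri = new HashMap<>();
    private final StringBuilder out = new StringBuilder();

    @Override
    public void startPrefixMapping(String prefix, String uri) {
        prefixByUri.put(uri, prefix);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        // Reject elements in a namespace that was never declared.
        if (uri != null && !uri.isEmpty() && !prefixByUri.containsKey(uri)) {
            throw new SAXException("Namespace " + uri + " not declared");
        }
        out.append('<').append(localName).append('>');
    }

    String output() {
        return out.toString();
    }
}

class NamespaceDemo {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Returns the serialized output, or the exception message if the event is rejected.
    static String emitParagraph(boolean declareNamespace) {
        PrefixTrackingHandler handler = new PrefixTrackingHandler();
        try {
            handler.startDocument();
            if (declareNamespace) {
                handler.startPrefixMapping("", XHTML);
            }
            handler.startElement(XHTML, "p", "p", new AttributesImpl());
            handler.endElement(XHTML, "p", "p");
            handler.endDocument();
            return handler.output();
        } catch (SAXException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(emitParagraph(true));
        System.out.println(emitParagraph(false));
    }
}
```

In this sketch the only difference between the two calls is whether `startPrefixMapping` was sent before the first element, which is the kind of event ordering worth checking in the real handler chain.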





[jira] [Commented] (TIKA-1215) Regression: Unable parse a mp3 file on 1.5 which parsed successfully on 1.4

2013-12-27 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13857542#comment-13857542
 ] 

Hong-Thai Nguyen commented on TIKA-1215:


I built from the latest trunk of git://git.apache.org/tika.git

> Regression: Unable parse a mp3 file on 1.5 which parsed successfully on 1.4
> ---
>
> Key: TIKA-1215
> URL: https://issues.apache.org/jira/browse/TIKA-1215
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.5
>Reporter: Hong-Thai Nguyen
>Priority: Critical
> Attachments: Centres 080805@0650 RTBF Matin Première - A propos des 
> rues de Dublin et Dubreucq.mp3
>
>
> With the attached file, 1.5 raises this exception on parsing. This file parses 
> without problems on 1.4.
> {code}
> ...
> Caused by: org.xml.sax.SAXException: Namespace http://www.w3.org/1999/xhtml 
> not declared
>   at 
> org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:62)
>   at 
> org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getQName(ToXMLContentHandler.java:68)
>   at 
> org.apache.tika.sax.ToXMLContentHandler.startElement(ToXMLContentHandler.java:148)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.element(XHTMLContentHandler.java:323)
>   at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:107)
>   at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at 
> com.polyspot.document.converter.DocumentConverter.realizeTikaConversion(DocumentConverter.java:221)
>   ... 15 more
> {code}





[jira] [Comment Edited] (TIKA-1215) Regression: Unable parse a mp3 file on 1.5 which parsed successfully on 1.4

2013-12-27 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13857542#comment-13857542
 ] 

Hong-Thai Nguyen edited comment on TIKA-1215 at 12/27/13 3:59 PM:
--

I built from the latest trunk of git://git.apache.org/tika.git and called Tika via the Java API


was (Author: thaichat04):
I built on latest trunk of git://git.apache.org/tika.git

> Regression: Unable parse a mp3 file on 1.5 which parsed successfully on 1.4
> ---
>
> Key: TIKA-1215
> URL: https://issues.apache.org/jira/browse/TIKA-1215
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.5
>Reporter: Hong-Thai Nguyen
>Priority: Critical
> Attachments: Centres 080805@0650 RTBF Matin Première - A propos des 
> rues de Dublin et Dubreucq.mp3
>
>
> With the attached file, 1.5 raises this exception on parsing. This file parses 
> without problems on 1.4.
> {code}
> ...
> Caused by: org.xml.sax.SAXException: Namespace http://www.w3.org/1999/xhtml 
> not declared
>   at 
> org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:62)
>   at 
> org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getQName(ToXMLContentHandler.java:68)
>   at 
> org.apache.tika.sax.ToXMLContentHandler.startElement(ToXMLContentHandler.java:148)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.element(XHTMLContentHandler.java:323)
>   at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:107)
>   at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at 
> com.polyspot.document.converter.DocumentConverter.realizeTikaConversion(DocumentConverter.java:221)
>   ... 15 more
> {code}





[jira] [Commented] (TIKA-1215) Regression: Unable parse a mp3 file on 1.5 which parsed successfully on 1.4

2014-01-02 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860246#comment-13860246
 ] 

Hong-Thai Nguyen commented on TIKA-1215:


[~davemeikle], here's a sample test that fails on this file:
{code}
package com.polyspot.document.converter;

import static org.fest.assertions.Assertions.assertThat;

import java.io.ByteArrayOutputStream;
import java.io.InputStream;

import org.apache.commons.io.IOUtils;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.CompositeParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.sax.ToHTMLContentHandler;
import org.apache.tika.sax.WriteOutContentHandler;
import org.junit.Before;
import org.junit.Test;
import org.xml.sax.ContentHandler;

public class Mp3ParserTest {

  private ConverterConfiguration config;
  private CompositeParser parser;

  @Before
  public void before() throws Exception {
    config = new ConverterConfiguration();
    config.setMimeToConverter("src/test/resources/mimeToConverter.xml");
    config.setSizeLimit(40);
    TikaConfig tikaConf = new TikaConfig(config.getMimeToConverter().trim());
    parser = (CompositeParser) tikaConf.getParser();
  }

  @Test
  public void can_parse_mp3_files() throws Exception {
    ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
    // Extract always HTML by default
    ToHTMLContentHandler toHtmlContentHandler =
        new ToHTMLContentHandler(outputStream, "UTF-8");
    WriteOutContentHandler handler =
        new WriteOutContentHandler(toHtmlContentHandler, (int) 400);
    ContentHandler bodyHandler = new BodyContentHandler(handler);

    InputStream input = getClass().getResourceAsStream("/mp3/test.mp3");
    try {
      ParseContext context = new ParseContext();   // parsing
      context.set(Parser.class, parser);
      parser.parse(input, bodyHandler, new Metadata(), context);
    } finally {
      IOUtils.closeQuietly(input);
    }

    String output = outputStream.toString("UTF-8");
    assertThat(output).isNotEmpty(); // failed
  }
}
{code}
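
The test wraps the HTML serializer in a `BodyContentHandler`, and the reported trace passes through `MatchingContentHandler`. One possible explanation, offered purely as an assumption and not confirmed anywhere in this thread, is that a body-only filter drops the `startPrefixMapping` event that arrives before `<body>`, so a stricter downstream handler later sees an element in an undeclared namespace. The sketch below reproduces that interaction with hypothetical classes (`StrictSink`, `BodyOnlyFilter`, `FilterDemo`) built only on the JDK's SAX interfaces; none of them are Tika classes.

```java
import java.util.HashSet;
import java.util.Set;

import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical strict sink: rejects elements whose namespace was never declared.
class StrictSink extends DefaultHandler {
    private final Set<String> declared = new HashSet<>();
    int elements = 0;

    @Override
    public void startPrefixMapping(String prefix, String uri) {
        declared.add(uri);
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        if (!uri.isEmpty() && !declared.contains(uri)) {
            throw new SAXException("Namespace " + uri + " not declared");
        }
        elements++;
    }
}

// Hypothetical filter that forwards only the descendants of <body>.
// forwardMappings controls whether prefix mappings seen outside <body> pass through.
class BodyOnlyFilter extends DefaultHandler {
    private final ContentHandler next;
    private final boolean forwardMappings;
    private int depthInBody = 0; // 0 = outside body, 1 = directly inside body

    BodyOnlyFilter(ContentHandler next, boolean forwardMappings) {
        this.next = next;
        this.forwardMappings = forwardMappings;
    }

    @Override
    public void startPrefixMapping(String prefix, String uri) throws SAXException {
        if (forwardMappings || depthInBody > 0) {
            next.startPrefixMapping(prefix, uri);
        }
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        if (depthInBody > 0) {
            next.startElement(uri, local, qName, atts);
            depthInBody++;
        } else if ("body".equals(local)) {
            depthInBody = 1; // the body element itself is not forwarded
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        if (depthInBody > 1) {
            next.endElement(uri, local, qName);
        }
        if (depthInBody > 0) {
            depthInBody--;
        }
    }
}

class FilterDemo {
    static final String XHTML = "http://www.w3.org/1999/xhtml";

    // Sends html/body/p through the filter; returns "ok:<count>" or the error message.
    static String run(boolean forwardMappings) {
        StrictSink sink = new StrictSink();
        ContentHandler h = new BodyOnlyFilter(sink, forwardMappings);
        Attributes none = new AttributesImpl();
        try {
            h.startPrefixMapping("", XHTML);
            h.startElement(XHTML, "html", "html", none);
            h.startElement(XHTML, "body", "body", none);
            h.startElement(XHTML, "p", "p", none);
            h.endElement(XHTML, "p", "p");
            h.endElement(XHTML, "body", "body");
            h.endElement(XHTML, "html", "html");
            return "ok:" + sink.elements;
        } catch (SAXException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(run(true));
        System.out.println(run(false));
    }
}
```

If the real cause is along these lines, a fix would need to make sure namespace declarations reach the serializer even when the surrounding elements are filtered out.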

> Regression: Unable parse a mp3 file on 1.5 which parsed successfully on 1.4
> ---
>
> Key: TIKA-1215
> URL: https://issues.apache.org/jira/browse/TIKA-1215
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.5
>Reporter: Hong-Thai Nguyen
>Priority: Critical
> Attachments: Centres 080805@0650 RTBF Matin Première - A propos des 
> rues de Dublin et Dubreucq.mp3
>
>
> With the attached file, 1.5 raises this exception on parsing. This file parses 
> without problems on 1.4.
> {code}
> ...
> Caused by: org.xml.sax.SAXException: Namespace http://www.w3.org/1999/xhtml 
> not declared
>   at 
> org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:62)
>   at 
> org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getQName(ToXMLContentHandler.java:68)
>   at 
> org.apache.tika.sax.ToXMLContentHandler.startElement(ToXMLContentHandler.java:148)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.element(XHTMLContentHandler.java:323)
>   at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:107)
>   at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at 
> com.polyspot.document.converter.DocumentConverter.realizeTikaConversion(DocumentConverter.java:221)
>   ... 15 more
> {code}





[jira] [Updated] (TIKA-1215) Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4

2014-01-02 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1215:
---

Summary: Regression: Unable to parse a mp3 file on 1.5 which parsed 
successfully on 1.4  (was: Regression: Unable parse a mp3 file on 1.5 which 
parsed successfully on 1.4)

> Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4
> --
>
> Key: TIKA-1215
> URL: https://issues.apache.org/jira/browse/TIKA-1215
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.5
>    Reporter: Hong-Thai Nguyen
>Priority: Critical
> Attachments: Centres 080805@0650 RTBF Matin Première - A propos des 
> rues de Dublin et Dubreucq.mp3
>
>
> With the attached file, 1.5 raises this exception on parsing. This file parses 
> without problems on 1.4.
> {code}
> ...
> Caused by: org.xml.sax.SAXException: Namespace http://www.w3.org/1999/xhtml 
> not declared
>   at 
> org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:62)
>   at 
> org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getQName(ToXMLContentHandler.java:68)
>   at 
> org.apache.tika.sax.ToXMLContentHandler.startElement(ToXMLContentHandler.java:148)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.element(XHTMLContentHandler.java:323)
>   at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:107)
>   at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at 
> com.polyspot.document.converter.DocumentConverter.realizeTikaConversion(DocumentConverter.java:221)
>   ... 15 more
> {code}





[jira] [Comment Edited] (TIKA-1215) Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4

2014-01-02 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860246#comment-13860246
 ] 

Hong-Thai Nguyen edited comment on TIKA-1215 at 1/2/14 3:11 PM:


[~davemeikle], here's a sample test that fails on this file:
{code}
package com.polyspot.document.converter;

import static org.fest.assertions.Assertions.assertThat;

import java.io.ByteArrayOutputStream;
import java.io.InputStream;

import org.apache.commons.io.IOUtils;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.CompositeParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.sax.ToHTMLContentHandler;
import org.apache.tika.sax.WriteOutContentHandler;
import org.junit.Before;
import org.junit.Test;
import org.xml.sax.ContentHandler;

public class Mp3ParserTest {

  private ConverterConfiguration config;
  private CompositeParser parser;

  @Before
  public void before() throws Exception {
config = new ConverterConfiguration();
config.setMimeToConverter("src/test/resources/mimeToConverter.xml");
config.setSizeLimit(40);
TikaConfig tikaConf = new TikaConfig(config.getMimeToConverter().trim());
parser = (CompositeParser) tikaConf.getParser();
  }

  @Test
  public void can_parse_mp3_files() throws Exception {
ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
ToHTMLContentHandler toHtmlContentHandler = new ToHTMLContentHandler(outputStream, "UTF-8"); // Extract always HTML by default
WriteOutContentHandler handler = new WriteOutContentHandler(toHtmlContentHandler, (int) 400);
ContentHandler bodyHandler = new BodyContentHandler(handler);

InputStream input = getClass().getResourceAsStream("/mp3/test.mp3");
try {
  ParseContext context = new ParseContext(); // parsing
  context.set(Parser.class, parser);
  Metadata metadata = new Metadata();
  metadata.add(Metadata.RESOURCE_NAME_KEY, "12345");
  metadata.add(Metadata.CONTENT_TYPE, "audio/mpeg");
  parser.parse(input, bodyHandler, metadata, context);
} finally {
  IOUtils.closeQuietly(input);
}

String output = outputStream.toString("UTF-8");

assertThat(output).isNotEmpty(); // failed
  }

}
{code}

Here's the stack trace:
{noformat}
org.xml.sax.SAXException: Namespace http://www.w3.org/1999/xhtml not declared
at 
org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:62)
at 
org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getQName(ToXMLContentHandler.java:68)
at 
org.apache.tika.sax.ToXMLContentHandler.startElement(ToXMLContentHandler.java:148)
at 
org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
at 
org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
at 
org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
at 
org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
at 
org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
at 
org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
at 
org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
at 
org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
at 
org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284)
at 
org.apache.tika.sax.XHTMLContentHandler.element(XHTMLContentHandler.java:323)
at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:107)
at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at 
com.polyspot.document.converter.Mp3ParserTest.can_parse_mp3_files(Mp3ParserTest.java:49)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.inv

[jira] [Comment Edited] (TIKA-1215) Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4

2014-01-02 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860246#comment-13860246
 ] 

Hong-Thai Nguyen edited comment on TIKA-1215 at 1/2/14 3:12 PM:


[~davemeikle], here's a sample test that fails on this file with 1.5-SNAPSHOT 
but passes on 1.4:
{code}
package com.polyspot.document.converter;

import static org.fest.assertions.Assertions.assertThat;

import java.io.ByteArrayOutputStream;
import java.io.InputStream;

import org.apache.commons.io.IOUtils;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.CompositeParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.sax.ToHTMLContentHandler;
import org.apache.tika.sax.WriteOutContentHandler;
import org.junit.Before;
import org.junit.Test;
import org.xml.sax.ContentHandler;

public class Mp3ParserTest {

  private ConverterConfiguration config;
  private CompositeParser parser;

  @Before
  public void before() throws Exception {
config = new ConverterConfiguration();
config.setMimeToConverter("src/test/resources/mimeToConverter.xml");
config.setSizeLimit(40);
TikaConfig tikaConf = new TikaConfig(config.getMimeToConverter().trim());
parser = (CompositeParser) tikaConf.getParser();
  }

  @Test
  public void can_parse_mp3_files() throws Exception {
ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
ToHTMLContentHandler toHtmlContentHandler = new ToHTMLContentHandler(outputStream, "UTF-8"); // Extract always HTML by default
WriteOutContentHandler handler = new WriteOutContentHandler(toHtmlContentHandler, (int) 400);
ContentHandler bodyHandler = new BodyContentHandler(handler);

InputStream input = getClass().getResourceAsStream("/mp3/test.mp3");
try {
  ParseContext context = new ParseContext(); // parsing
  context.set(Parser.class, parser);
  Metadata metadata = new Metadata();
  metadata.add(Metadata.RESOURCE_NAME_KEY, "12345");
  metadata.add(Metadata.CONTENT_TYPE, "audio/mpeg");
  parser.parse(input, bodyHandler, metadata, context);
} finally {
  IOUtils.closeQuietly(input);
}

String output = outputStream.toString("UTF-8");

assertThat(output).isNotEmpty(); // failed
  }

}
{code}

Here's the stack trace:
{noformat}
org.xml.sax.SAXException: Namespace http://www.w3.org/1999/xhtml not declared
at 
org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:62)
at 
org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getQName(ToXMLContentHandler.java:68)
at 
org.apache.tika.sax.ToXMLContentHandler.startElement(ToXMLContentHandler.java:148)
at 
org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
at 
org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
at 
org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
at 
org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
at 
org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
at 
org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
at 
org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
at 
org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
at 
org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284)
at 
org.apache.tika.sax.XHTMLContentHandler.element(XHTMLContentHandler.java:323)
at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:107)
at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at 
com.polyspot.document.converter.Mp3ParserTest.can_parse_mp3_files(Mp3ParserTest.java:49)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:

[jira] [Comment Edited] (TIKA-1215) Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4

2014-01-02 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13860246#comment-13860246
 ] 

Hong-Thai Nguyen edited comment on TIKA-1215 at 1/2/14 5:20 PM:


[~davemeikle], here's a sample test that fails on this file with 1.5-SNAPSHOT 
but passes on 1.4:
{code}
package com.polyspot.document.converter;

import static org.fest.assertions.Assertions.assertThat;

import java.io.ByteArrayOutputStream;
import java.io.InputStream;

import org.apache.commons.io.IOUtils;
import org.apache.tika.config.TikaConfig;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.CompositeParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.sax.ToHTMLContentHandler;
import org.apache.tika.sax.WriteOutContentHandler;
import org.junit.Before;
import org.junit.Test;
import org.xml.sax.ContentHandler;

public class Mp3ParserTest {
  private CompositeParser parser;

  @Before
  public void before() throws Exception {
TikaConfig tikaConf = new TikaConfig();
parser = (CompositeParser) tikaConf.getParser();
  }

  @Test
  public void can_parse_mp3_files() throws Exception {
ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
ToHTMLContentHandler toHtmlContentHandler = new ToHTMLContentHandler(outputStream, "UTF-8"); // Extract always HTML by default
WriteOutContentHandler handler = new WriteOutContentHandler(toHtmlContentHandler, (int) 400);
ContentHandler bodyHandler = new BodyContentHandler(handler);

InputStream input = getClass().getResourceAsStream("/mp3/test.mp3");
try {
  ParseContext context = new ParseContext(); // parsing
  context.set(Parser.class, parser);
  Metadata metadata = new Metadata();
  metadata.add(Metadata.RESOURCE_NAME_KEY, "12345");
  metadata.add(Metadata.CONTENT_TYPE, "audio/mpeg");
  parser.parse(input, bodyHandler, metadata, context);
} finally {
  IOUtils.closeQuietly(input);
}

String output = outputStream.toString("UTF-8");

assertThat(output).isNotEmpty(); // failed

System.out.println(output);
  }
}
{code}

Here's the stack trace:
{noformat}
org.xml.sax.SAXException: Namespace http://www.w3.org/1999/xhtml not declared
at 
org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:62)
at 
org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getQName(ToXMLContentHandler.java:68)
at 
org.apache.tika.sax.ToXMLContentHandler.startElement(ToXMLContentHandler.java:148)
at 
org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
at 
org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
at 
org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
at 
org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
at 
org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
at 
org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
at 
org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
at 
org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
at 
org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284)
at 
org.apache.tika.sax.XHTMLContentHandler.element(XHTMLContentHandler.java:323)
at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:107)
at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at 
com.polyspot.document.converter.Mp3ParserTest.can_parse_mp3_files(Mp3ParserTest.java:49)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at 
org.junit.internal.runners.mod

[jira] [Commented] (TIKA-1216) parse method of Mp3Parser doesn't work for few mp3 files

2014-01-07 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13864202#comment-13864202
 ] 

Hong-Thai Nguyen commented on TIKA-1216:


I've tested this file with a simple test case. The problem seems to be 
identical to TIKA-1215.

> parse method of Mp3Parser doesn't work for few mp3 files
> 
>
> Key: TIKA-1216
> URL: https://issues.apache.org/jira/browse/TIKA-1216
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.4
> Environment: Windows 7 ultimate 32-bit OS, Java 1.7
>Reporter: Sumeet Gorab
>Priority: Blocker
>  Labels: patch
> Attachments: 05 - Dharti - Sarkaaran [www.DJMaza.Com].mp3
>
>
> I tried to parse an MP3 file, but the parse method of the Mp3Parser class is 
> not able to parse it. The parse method does not complete its execution; there 
> is some issue in that method.





[jira] [Updated] (TIKA-1215) Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4

2014-01-07 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1215:
---

Attachment: TIKA-1215-fix-prefix-namespaces.patch

I've attached a fix with a test for this issue. Please review and commit it 
soon. Thanks

> Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4
> --
>
> Key: TIKA-1215
> URL: https://issues.apache.org/jira/browse/TIKA-1215
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.5
>    Reporter: Hong-Thai Nguyen
>Priority: Critical
> Attachments: Centres 080805@0650 RTBF Matin Première - A propos des 
> rues de Dublin et Dubreucq.mp3, TIKA-1215-fix-prefix-namespaces.patch
>
>
> With the attached file, 1.5 raises this exception on parsing. This file parses 
> without problems on 1.4.
> {code}
> ...
> Caused by: org.xml.sax.SAXException: Namespace http://www.w3.org/1999/xhtml 
> not declared
>   at 
> org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getPrefix(ToXMLContentHandler.java:62)
>   at 
> org.apache.tika.sax.ToXMLContentHandler$ElementInfo.getQName(ToXMLContentHandler.java:68)
>   at 
> org.apache.tika.sax.ToXMLContentHandler.startElement(ToXMLContentHandler.java:148)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.xpath.MatchingContentHandler.startElement(MatchingContentHandler.java:60)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)
>   at 
> org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:264)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:254)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:284)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.element(XHTMLContentHandler.java:323)
>   at org.apache.tika.parser.mp3.Mp3Parser.parse(Mp3Parser.java:107)
>   at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:91)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at 
> com.polyspot.document.converter.DocumentConverter.realizeTikaConversion(DocumentConverter.java:221)
>   ... 15 more
> {code}





[jira] [Comment Edited] (TIKA-1216) parse method of Mp3Parser doesn't work for few mp3 files

2014-01-07 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13864202#comment-13864202
 ] 

Hong-Thai Nguyen edited comment on TIKA-1216 at 1/7/14 3:57 PM:


I tested this file with a simple test case. The problem appears to be 
identical to TIKA-1215, and a patch has been submitted on that issue.
It is waiting for review and commit.

Thanks


was (Author: thaichat04):
I tested this file with a simple test case. The problem appears to be 
identical to TIKA-1215.

> parse method of Mp3Parser doesn't work for few mp3 files
> 
>
> Key: TIKA-1216
> URL: https://issues.apache.org/jira/browse/TIKA-1216
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.4
> Environment: Windows 7 ultimate 32-bit OS, Java 1.7
>Reporter: Sumeet Gorab
>Priority: Blocker
>  Labels: patch
> Attachments: 05 - Dharti - Sarkaaran [www.DJMaza.Com].mp3
>
>
> I tried to parse an MP3 file, but the parse method of the Mp3Parser class is 
> not able to parse it. The parse method never completes its execution; there 
> is some issue in that method.





[jira] [Commented] (TIKA-90) Allow thumbnails as document metadata

2014-01-09 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-90?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13866498#comment-13866498
 ] 

Hong-Thai Nguyen commented on TIKA-90:
--

This would be useful for Office Open XML and OpenOffice files, and for some 
other formats that embed a thumbnail.

> Allow thumbnails as document metadata
> -
>
> Key: TIKA-90
> URL: https://issues.apache.org/jira/browse/TIKA-90
> Project: Tika
>  Issue Type: New Feature
>  Components: general
>Reporter: Jukka Zitting
>
> It would be nice if parser components could produce thumbnail images and 
> other non-string metadata when parsing documents.
> To do this, we could either generalize the current Metadata methods, or 
> introduce new methods for handling such non-string metadata.





[jira] [Updated] (TIKA-1215) Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4

2014-01-13 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1215:
---

Attachment: tika-1215-without-wildcard.patch

[~gagravarr], my code style differs from the Apache convention; apologies 
for that.
I have attached a new patch file that contains only the changes.

Thanks


> Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4
> --
>
> Key: TIKA-1215
> URL: https://issues.apache.org/jira/browse/TIKA-1215
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.5
>Reporter: Hong-Thai Nguyen
>Priority: Critical
> Attachments: Centres 080805@0650 RTBF Matin Première - A propos des 
> rues de Dublin et Dubreucq.mp3, TIKA-1215-fix-prefix-namespaces.patch, 
> tika-1215-without-wildcard.patch
>
>





[jira] [Commented] (TIKA-1215) Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4

2014-01-13 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13869590#comment-13869590
 ] 

Hong-Thai Nguyen commented on TIKA-1215:


[~talli...@apache.org], here is the XML of the input to parse:
{noformat}
http://www.w3.org/1999/xhtml";>Matin Première - Tour des régions 
080806
RTBF - La Première
Speech
101698.914
XXX - 
A propos du contrat de quartier rues Dublin/Dubreucq
{noformat}

I think this regression came from TIKA-1070:
{code}
currentElement = currentElement.parent;
{code}

The parentElement of the element in question is null, so getPrefix() raises 
an exception. That differs from the behavior in 1.4.
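
To illustrate the failure mode described in this comment, here is a minimal, 
self-contained sketch. It is not Tika's ToXMLContentHandler code: the class 
NamespaceLookupSketch, its nested ElementInfo, and the lookup() helper are 
hypothetical names used only for illustration. It assumes a prefix lookup 
that walks up a chain of parent elements and fails once it runs out of 
parents without finding a declaration for the namespace.

```java
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.SAXException;

public class NamespaceLookupSketch {

    /** Hypothetical stand-in for a handler's per-element namespace scope. */
    static final class ElementInfo {
        final ElementInfo parent;            // null when there is no enclosing scope
        final Map<String, String> prefixes;  // namespace URI -> prefix declared here

        ElementInfo(ElementInfo parent, Map<String, String> prefixes) {
            this.parent = parent;
            this.prefixes = prefixes;
        }

        /** Looks for a prefix here, then in each enclosing scope in turn. */
        String getPrefix(String uri) throws SAXException {
            String prefix = prefixes.get(uri);
            if (prefix != null) {
                return prefix;
            }
            if (parent != null) {
                return parent.getPrefix(uri);
            }
            if (uri == null || uri.isEmpty()) {
                return "";  // no namespace, so no prefix is needed
            }
            // No parent left to ask: the namespace was never declared in scope.
            throw new SAXException("Namespace " + uri + " not declared");
        }
    }

    /** Returns the lookup result, or the exception message on failure. */
    static String lookup(ElementInfo element, String uri) {
        try {
            return "prefix='" + element.getPrefix(uri) + "'";
        } catch (SAXException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        String xhtml = "http://www.w3.org/1999/xhtml";

        // Scope chain intact: the outer element declares the default namespace.
        Map<String, String> declared = new HashMap<>();
        declared.put(xhtml, "");
        ElementInfo outer = new ElementInfo(null, declared);
        ElementInfo inner = new ElementInfo(outer, new HashMap<>());
        System.out.println(lookup(inner, xhtml));

        // Scope chain cut short: no parent and no local declaration.
        ElementInfo orphan = new ElementInfo(null, new HashMap<>());
        System.out.println(lookup(orphan, xhtml));
    }
}
```

In this sketch the second lookup fails with the same kind of message as the 
stack trace, because the element has no parent scope that declares the XHTML 
namespace. Whether that is exactly what happens inside Tika 1.5 is not 
confirmed here.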

> Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4
> --
>
> Key: TIKA-1215
> URL: https://issues.apache.org/jira/browse/TIKA-1215
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.5
>Reporter: Hong-Thai Nguyen
>Priority: Critical
> Attachments: Centres 080805@0650 RTBF Matin Première - A propos des 
> rues de Dublin et Dubreucq.mp3, TIKA-1215-fix-prefix-namespaces.patch, 
> tika-1215-without-wildcard.patch
>
>





[jira] [Commented] (TIKA-1215) Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4

2014-01-14 Thread Hong-Thai Nguyen (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13870573#comment-13870573
 ] 

Hong-Thai Nguyen commented on TIKA-1215:


Great catch. Thanks, [~jukkaz].

> Regression: Unable to parse a mp3 file on 1.5 which parsed successfully on 1.4
> --
>
> Key: TIKA-1215
> URL: https://issues.apache.org/jira/browse/TIKA-1215
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.5
>Reporter: Hong-Thai Nguyen
>Priority: Critical
> Attachments: Centres 080805@0650 RTBF Matin Première - A propos des 
> rues de Dublin et Dubreucq.mp3, TIKA-1215-fix-prefix-namespaces.patch, 
> tika-1215-without-wildcard.patch
>
>




