[jira] [Commented] (TIKA-1850) Tika erroneously detects some versions of jQuery as "text/html"

2016-02-04 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15132249#comment-15132249
 ] 

Nick Burch commented on TIKA-1850:
--

It's showing up for me in the snapshots repo - see 
https://repository.apache.org/content/groups/snapshots/org/apache/tika/tika-parsers/1.13-SNAPSHOT/

1.12 is being voted on now, but the commits for this were made after the 1.12 
release candidates were cut. Unless the RCs have to be re-created, expect it 
in 1.13, in 2-4 months.

> Tika erroneously detects some versions of jQuery as "text/html"
> ---
>
> Key: TIKA-1850
> URL: https://issues.apache.org/jira/browse/TIKA-1850
> Project: Tika
>  Issue Type: Bug
>  Components: detector
>Affects Versions: 1.11
> Environment: {code}
> ProductName:  Mac OS X
> ProductVersion:   10.11.3
> BuildVersion: 15D21
> {code}
>Reporter: Boris Slobodin
>
> This sets the wrong {{Content-Type}} on S3 as a result, for example, when 
> using s3_website and breaks some browsers like IE.
> {code}
> ➜  wget https://ajax.googleapis.com/ajax/libs/jquery/1.7.1/jquery.min.js -O 
> jquery-1.7.1.min.js
> --2016-02-02 15:21:33--  
> https://ajax.googleapis.com/ajax/libs/jquery/1.7.1/jquery.min.js
> Resolving ajax.googleapis.com... 216.58.193.106, 2607:f8b0:400a:801::200a
> Connecting to ajax.googleapis.com|216.58.193.106|:443... connected.
> HTTP request sent, awaiting response... 200 OK
> Length: unspecified [text/javascript]
> Saving to: 'jquery-1.7.1.min.js'
> jquery-1.7.1.min.js    [  <=>  ]  91.67K   323KB/s   in 0.3s
> 2016-02-02 15:21:33 (323 KB/s) - 'jquery-1.7.1.min.js' saved [93868]
> {code}
> {code}
> ➜  wget https://ajax.googleapis.com/ajax/libs/jquery/1.12.0/jquery.min.js -O 
> jquery-1.12.0.min.js
> --2016-02-02 15:22:10--  
> https://ajax.googleapis.com/ajax/libs/jquery/1.12.0/jquery.min.js
> Resolving ajax.googleapis.com... 216.58.193.106, 2607:f8b0:400a:801::200a
> Connecting to ajax.googleapis.com|216.58.193.106|:443... connected.
> HTTP request sent, awaiting response... 200 OK
> Length: unspecified [text/javascript]
> Saving to: 'jquery-1.12.0.min.js'
> jquery-1.12.0.min.js   [  <=>  ]  95.08K  --.-KB/s   in 0.03s
> 2016-02-02 15:22:10 (3.30 MB/s) - 'jquery-1.12.0.min.js' saved [97362]
> {code}
> {code}
> ➜  wget https://ajax.googleapis.com/ajax/libs/jquery/2.2.0/jquery.min.js -O 
> jquery-2.2.0.min.js
> --2016-02-02 15:22:24--  
> https://ajax.googleapis.com/ajax/libs/jquery/2.2.0/jquery.min.js
> Resolving ajax.googleapis.com... 216.58.193.106, 2607:f8b0:400a:801::200a
> Connecting to ajax.googleapis.com|216.58.193.106|:443... connected.
> HTTP request sent, awaiting response... 200 OK
> Length: unspecified [text/javascript]
> Saving to: 'jquery-2.2.0.min.js'
> jquery-2.2.0.min.js    [  <=>  ]  83.58K  --.-KB/s   in 0.02s
> 2016-02-02 15:22:24 (3.39 MB/s) - 'jquery-2.2.0.min.js' saved [85589]
> {code}
> {color:red}{{jquery-1.7.1.min.js}}{color}
> {code}
> ➜  java -jar tika-app-1.11.jar --detect jquery-1.7.1.min.js
> text/html
> {code}
> {color:green}{{jquery-1.12.0.min.js}}{color}
> {code}
> ➜  java -jar tika-app-1.11.jar --detect jquery-1.12.0.min.js
> application/javascript
> {code}
> {color:green}{{jquery-2.2.0.min.js}}{color}
> {code}
> ➜  java -jar tika-app-1.11.jar --detect jquery-2.2.0.min.js
> application/javascript
> {code}
> Thank you.
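
Until a release with the fix is available, one possible workaround for the upload case described above is to trust the file extension for a few well-known static asset types and only fall back to the sniffed type otherwise. The following is a minimal sketch under that assumption; it is not part of Tika or s3_website, and the class name `ContentTypeOverride`, the extension table, and the `sniffed` argument are all hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper (not a Tika API): prefer a fixed extension mapping
// over a sniffed content type for a few static asset types.
final class ContentTypeOverride {
    private static final Map<String, String> BY_EXTENSION = new HashMap<>();
    static {
        BY_EXTENSION.put("js", "application/javascript");
        BY_EXTENSION.put("css", "text/css");
        BY_EXTENSION.put("json", "application/json");
    }

    // Returns the mapped type for a known extension, else the sniffed type.
    static String resolve(String fileName, String sniffed) {
        int dot = fileName.lastIndexOf('.');
        if (dot >= 0 && dot < fileName.length() - 1) {
            String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
            String mapped = BY_EXTENSION.get(ext);
            if (mapped != null) {
                return mapped;
            }
        }
        return sniffed;
    }

    public static void main(String[] args) {
        // The misdetected file from the report keeps a JavaScript type.
        System.out.println(resolve("jquery-1.7.1.min.js", "text/html"));
        // Unknown extensions still use whatever the detector reported.
        System.out.println(resolve("index.html", "text/html"));
    }
}
```

The trade-off is that a mislabeled file extension now wins over content sniffing, so this only makes sense for assets whose names you control.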



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: [VOTE] Apache Tika 1.12 Release Candidate #1

2016-02-04 Thread Lewis John Mcgibbney
Hi Chris,
+1 to release this release candidate
Thanks
Lewis

On Tue, Feb 2, 2016 at 4:24 PM, Lewis John Mcgibbney <
lewis.mcgibb...@gmail.com> wrote:

> Hi Chris,
>
> Signatures all good. Verified using the apachestuff scripts.
> mvn install and all tests pass fine on MacOSX 10.9.5
> Ran DRAT from master branch with following output
>
> Notes  Binaries  Archives  Standards  Apache  Generated  Unknown
> 0      2         0         868        836     0          32
> Issue filed in Jira to address and resolve the unknowns:
>
> https://issues.apache.org/jira/browse/TIKA-1848
>
> On Thu, Jan 28, 2016 at 12:01 AM,  wrote:
>
>>
>> A first candidate for the Tika 1.12 release is available at:
>>
>>   https://dist.apache.org/repos/dist/dev/tika/
>>
>> The release candidate is a zip archive of the sources in:
>>
>> https://git-wip-us.apache.org/repos/asf?p=tika.git;a=tag;h=203a26ba5e65db2427f9e84bc4ff31e569ae661c
>> 
>>
>>
>> The SHA1 checksum of the archive is:
>> 30e64645af643959841ac3bb3c41f7e64eba7e5f
>>
>> In addition, a staged maven repository is available here:
>>
>> https://repository.apache.org/content/repositories/orgapachetika-1015/
>>
>>
>> Please vote on releasing this package as Apache Tika 1.12.
>> The vote is open for the next 72 hours and passes if a majority of at
>> least three +1 Tika PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Tika 1.12
>> [ ] -1 Do not release this package because…
>>
>> Cheers,
>> Chris
>>
>> P.S. Of course here is my +1.
>>
>>


-- 
*Lewis*
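
For anyone checking the release candidate by hand, the SHA1 value in the quoted announcement can be compared against the downloaded archive with the JDK alone. This is a minimal sketch, not the project's release tooling; the class name `Sha1Check` and the command-line arguments (archive path, expected checksum) are assumptions.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: compute a SHA-1 digest and compare it with the
// checksum published in a release announcement.
final class Sha1Check {
    // Returns the lowercase hex SHA-1 digest of the given bytes.
    static String sha1Hex(byte[] data) {
        MessageDigest md;
        try {
            md = MessageDigest.getInstance("SHA-1");
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is a required algorithm on every Java platform.
            throw new IllegalStateException("SHA-1 not available", e);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest(data)) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: Sha1Check <archive> <expected-sha1>");
            return;
        }
        // Reads the whole archive into memory, which is fine for a source zip.
        String actual = sha1Hex(Files.readAllBytes(Paths.get(args[0])));
        System.out.println(actual.equalsIgnoreCase(args[1]) ? "OK" : "MISMATCH: " + actual);
    }
}
```

A checksum match only shows the download is intact; the signature check mentioned in the vote is still what ties the archive to the release manager.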


[jira] [Commented] (TIKA-1851) Tika 2.0 - Move test resources from core to test-resources

2016-02-04 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15133624#comment-15133624
 ] 

Ken Krugler commented on TIKA-1851:
---

Hi [~talli...@apache.org] - I'm also getting a local build failure with 2.0, 
but the output I see is:

[ERROR] Failed to execute goal 
org.codehaus.groovy.maven:gmaven-plugin:1.0:execute (testSetup) on project 
tika-test-resources: startup failed, 
/Users/kenkrugler/git/tika/tika-test-resources/src/test/resources/org/apache/tika/parser/ner/opennlp/ModelGetter.groovy:
 23: unable to resolve class org.apache.commons.io.IOUtils
[ERROR] @ line 23, column 1.
[ERROR] import org.apache.commons.io.IOUtils

Is this something different than the issue you ran into?

And any way to work around it? I'd like to port the language detector changes.
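
One possible way to sidestep the unresolved class, assuming the script only uses IOUtils to copy a downloaded stream to disk (an assumption; the contents of ModelGetter.groovy are not shown in this thread), is to do the copy with the JDK itself. The class and method names below are hypothetical, and the same calls can be made from Groovy.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical stand-in for a stream-to-file copy done with commons-io.
final class StreamCopy {
    // Copies the stream to target, replacing any existing file.
    // Returns the number of bytes written.
    static long copyTo(InputStream in, Path target) {
        try {
            return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("model", ".bin");
        byte[] data = "dummy model bytes".getBytes(StandardCharsets.UTF_8);
        long written = copyTo(new ByteArrayInputStream(data), tmp);
        System.out.println(written + " bytes written to " + tmp);
        Files.delete(tmp);
    }
}
```

This removes the dependency rather than fixing the plugin classpath, so it only helps if nothing else in the script needs commons-io.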


> Tika 2.0 - Move test resources from core to test-resources
> --
>
> Key: TIKA-1851
> URL: https://issues.apache.org/jira/browse/TIKA-1851
> Project: Tika
>  Issue Type: Sub-task
>Reporter: Tim Allison
>Assignee: Tim Allison
>Priority: Trivial
> Fix For: 2.0
>
>
> Let's try to move resources that are used for testing to the test-resources 
> module if possible: MockParser, DummyParser, TikaTest and the unit tests for 
> MockParser.  That should also allow us to drop the test-jar goal in 
> tika-core.  Anything else?
> Haven't actually tried this yet; there may be surprises.





Re: Tika 2.0 and language detection

2016-02-04 Thread Mattmann, Chris A (3980)
Hey Ken,

This is fine. I wanted to get going with our Julia/MIT-LL Text.jl based
detector and with turning LanguageIdentifier into an interface. Trevor
(CC’ed) and I are working on it, but I'm not sure where we’re at, and it
shouldn’t be a blocker to moving forward.

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++





-Original Message-
From: Ken Krugler 
Reply-To: "dev@tika.apache.org" 
Date: Thursday, February 4, 2016 at 12:23 PM
To: "tika-...@lucene.apache.org" 
Subject: Tika 2.0 and language detection

>Hi all,
>
>Over at https://issues.apache.org/jira/browse/TIKA-1723, Tim & I have
>been discussing whether to focus these pending changes on the 2.0 branch,
>and leave 1.x as-is.
>
>As part of that, we could do a cut-and-run in 2.0, and not spend the time
>to port the current (Tika 1.x) language detector code.
>
>I'm in favor of that approach, as I think leveraging the new detector
>project(s) gives us faster & more accurate results over more languages.
>
>But we're posting to the more general audience here, to gather input on
>things that we might not be considering.
>
>Thanks,
>
>-- Ken
>
>
>
>--
>Ken Krugler
>+1 530-210-6378
>http://www.scaleunlimited.com
>custom big data solutions & training
>Hadoop, Cascading, Cassandra & Solr
>
>
>
>
>



[jira] [Commented] (TIKA-1851) Tika 2.0 - Move test resources from core to test-resources

2016-02-04 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15133629#comment-15133629
 ] 

Ken Krugler commented on TIKA-1851:
---

I'm also curious why we have Groovy code and shell scripts inside of 
src/test/resources (e.g. in 
tika-test-resources/src/test/resources/org/apache/tika/parser/ner/opennlp/). 
Shouldn't Groovy code be in src/test/groovy (or src/test/java would also work)? 
Not sure about shell files, maybe that's OK as a resource.



[jira] [Commented] (TIKA-1824) Tika 2.0 - Create Initial Parser Modules

2016-02-04 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15132503#comment-15132503
 ] 

Tim Allison commented on TIKA-1824:
---

bq.  Thanks so much for the feedback, these are great things to be discussing.

Yes, yes, indeed.  Thank you, [~kkrugler], [~rgauss], and of course 
[~bobpaulin]!

Consensus for now...keep as is?  Sounds good to me.

bq. so I was considering creating projects with a bundle suffix that would 
embed the dependencies individually as tika-bundle did...

Interesting.  So, OSGi aside for the following (sorry), for those with, um, 
challenged development environments (i.e. medical/financial fields where you 
might only be allowed to bring in publicly released jars), users who only 
wanted to parse PDFs, say, could then grab tika-core.jar, the tika-batch.jar, 
the orig-tika-app.jar and the tika-parser-pdf-bundle.jar and be able to parse 
PDFs?  That would be awesome from the standpoint of several use cases I've 
seen.  Did I get this right?  What do others think?



> Tika 2.0 -  Create Initial Parser Modules
> -
>
> Key: TIKA-1824
> URL: https://issues.apache.org/jira/browse/TIKA-1824
> Project: Tika
>  Issue Type: Improvement
>Affects Versions: 2.0
>Reporter: Bob Paulin
>Assignee: Bob Paulin
>
> Create initial break down of parser modules.





[jira] [Created] (TIKA-1853) Upgrade to POI 3.14-final when available

2016-02-04 Thread Tim Allison (JIRA)
Tim Allison created TIKA-1853:
-

 Summary: Upgrade to POI 3.14-final when available
 Key: TIKA-1853
 URL: https://issues.apache.org/jira/browse/TIKA-1853
 Project: Tika
  Issue Type: Improvement
Reporter: Tim Allison
Priority: Minor


Should be out soonish.





[jira] [Commented] (TIKA-1851) Tika 2.0 - Move test resources from core to test-resources

2016-02-04 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15132561#comment-15132561
 ] 

Lewis John McGibbney commented on TIKA-1851:


Are we using the most recent OSGi/Felix dependencies?
I haven't looked at the build myself, however I assume that this may be a 
permissions issue with accessing the bundle cache. Does anyone know where this 
cache is stored by default?



[jira] [Commented] (TIKA-1851) Tika 2.0 - Move test resources from core to test-resources

2016-02-04 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15132549#comment-15132549
 ] 

Tim Allison commented on TIKA-1851:
---

[~bobpaulin], any chance you could look into why we've been getting the 
following in 2.x's tika-bundle since we moved to git 
(https://builds.apache.org/job/tika-2.x/14/)?  

I'm sure the move didn't do it, but thanks to [~lewismc] for aiming Hudson at 
git, this has made this issue ever so much clearer. :)

{noformat}
java.lang.Exception: Unable to lock bundle cache: java.nio.channels.OverlappingFileLockException
    at org.apache.felix.framework.cache.BundleCache.<init>(BundleCache.java:176)
    at org.apache.felix.framework.Felix.init(Felix.java:689)
    at org.apache.felix.framework.Felix.init(Felix.java:624)
    at org.ops4j.pax.exam.nat.internal.NativeTestContainer.start(NativeTestContainer.java:176)
    at org.ops4j.pax.exam.spi.reactors.AllConfinedStagedReactor.invoke(AllConfinedStagedReactor.java:79)
    at org.ops4j.pax.exam.junit.impl.ProbeRunner$2.evaluate(ProbeRunner.java:267)
{noformat}
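
For background on the exception in that trace: per the JDK documentation, OverlappingFileLockException is raised when the current JVM already holds a lock overlapping the requested region of a file. The sketch below reproduces that with the JDK alone. It does not diagnose the tika-bundle failure, and the class name `LockDemo` and the temp file are stand-ins, not anything from Felix or Pax Exam.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical demo: a second lock on the same file region from the same
// JVM is rejected with OverlappingFileLockException.
final class LockDemo {
    // Returns true if the second lock attempt is rejected.
    static boolean secondLockRejected() {
        try {
            Path file = Files.createTempFile("cache", ".lock");
            try (FileChannel ch = FileChannel.open(file, StandardOpenOption.WRITE);
                 FileLock first = ch.tryLock()) {
                try {
                    ch.tryLock(); // same region, same JVM, first lock still held
                    return false;
                } catch (OverlappingFileLockException expected) {
                    return true;
                }
            } finally {
                // Runs after the lock and channel above have been closed.
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("second lock rejected: " + secondLockRejected());
    }
}
```

If the build failure follows the same pattern, it would point at two users of the cache inside one JVM rather than at file permissions, but that is a guess until someone looks at the build.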



[jira] [Resolved] (TIKA-1851) Tika 2.0 - Move test resources from core to test-resources

2016-02-04 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-1851.
---
Resolution: Fixed



[jira] [Commented] (TIKA-1851) Tika 2.0 - Move test resources from core to test-resources

2016-02-04 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15132593#comment-15132593
 ] 

Tim Allison commented on TIKA-1851:
---

Dunno, but I should have mentioned that I'm getting this when I try to build 
2.x locally, too; this isn't just a Hudson issue.   



[jira] [Reopened] (TIKA-1851) Tika 2.0 - Move test resources from core to test-resources

2016-02-04 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison reopened TIKA-1851:
---

wrong reason for resolving...need to fix



[jira] [Commented] (TIKA-1723) Integrate language-detector into Tika

2016-02-04 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15132613#comment-15132613
 ] 

Tim Allison commented on TIKA-1723:
---

Agreed on the ease of building the new ld framework in 2.0.  

Given Mike's comparison of Tika and langdetect 
[here|http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html],
 even though it is now dated, I'd be willing to mothball our language detector 
in 2.x (i.e. leave it in 1.x, and if we need to resurrect it we can). 
 That said, I didn't write that code, and I know that [~toke] on TIKA-1549 has 
since dramatically improved our speed.

This is certainly a large enough issue to invite feedback from the entire 
community.  Do we want to drop our language detection code in 2.x?  Or is there 
a good reason to keep it?



> Integrate language-detector into Tika
> -
>
> Key: TIKA-1723
> URL: https://issues.apache.org/jira/browse/TIKA-1723
> Project: Tika
>  Issue Type: Improvement
>  Components: languageidentifier
>Affects Versions: 1.11
>Reporter: Ken Krugler
>Assignee: Ken Krugler
>Priority: Minor
> Attachments: TIKA-1723-2.patch, TIKA-1723-3.patch, TIKA-1723.patch, 
> TIKA-1723v2.patch
>
>
> The language-detector project at 
> https://github.com/optimaize/language-detector is faster, has more languages 
> (70 vs 13) and better accuracy than the built-in language detector.
> This is a stab at integrating it, with some initial findings. There are a 
> number of issues this raises, especially if [~chrismattmann] moves forward 
> with turning language detection into a pluggable extension point.
> I'll add comments with results below.
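
To make the "pluggable extension point" idea in the description above concrete, here is a minimal sketch of what such an interface could look like. It is an illustration only, not Tika's API: the names `LanguageDetector` and `DetectorRegistry`, the first-match-wins policy, and the toy detectors in `main` are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical extension point: each implementation wraps one detection library.
interface LanguageDetector {
    // Returns a language code, or empty if this detector cannot decide.
    Optional<String> detect(String text);
}

// Hypothetical registry that consults detectors in registration order.
final class DetectorRegistry {
    private final List<LanguageDetector> detectors = new ArrayList<>();

    DetectorRegistry register(LanguageDetector detector) {
        detectors.add(detector);
        return this;
    }

    // The first detector that returns a result wins.
    Optional<String> detect(String text) {
        for (LanguageDetector detector : detectors) {
            Optional<String> result = detector.detect(text);
            if (result.isPresent()) {
                return result;
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Toy detectors for illustration; real ones would delegate to a library.
        DetectorRegistry registry = new DetectorRegistry()
                .register(t -> t.contains(" the ") ? Optional.of("en") : Optional.empty())
                .register(t -> t.contains(" der ") ? Optional.of("de") : Optional.empty());
        System.out.println(registry.detect("where is the file"));
        System.out.println(registry.detect("wo ist der Hund"));
    }
}
```

Registration is manual here to keep the sketch short. A real extension point would also need to settle how detectors are discovered and how confidence scores from different libraries are compared.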





[jira] [Commented] (TIKA-1851) Tika 2.0 - Move test resources from core to test-resources

2016-02-04 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15132542#comment-15132542
 ] 

Hudson commented on TIKA-1851:
--

UNSTABLE: Integrated in tika-2.x #18 (See 
[https://builds.apache.org/job/tika-2.x/18/])
TIKA-1851:factor out test resources that used to be in core to (tallison: rev 
afb6cf2630b5006091b9862df661efa1d1ac1593)
* tika-parser-modules/tika-parser-pdf-module/pom.xml
* 
tika-parser-modules/tika-parser-scientific-module/src/test/java/org/apache/tika/parser/gdal/TestGDALParser.java
* tika-core/src/test/java/org/apache/tika/parser/mock/MockParser.java
* tika-core/src/test/java/org/apache/tika/parser/DummyParser.java
* tika-parser-modules/tika-parser-database-module/pom.xml
* tika-parsers/src/test/java/org/apache/tika/parser/mock/MockParserTest.java
* tika-test-resources/pom.xml
* tika-parser-modules/tika-parser-multimedia-module/pom.xml
* tika-parent/pom.xml
* tika-batch/pom.xml
* tika-parser-modules/tika-parser-package-module/pom.xml
* tika-server/pom.xml
* tika-parsers/pom.xml
* tika-test-resources/src/main/java/org/apache/tika/parser/mock/MockParser.java
* tika-parser-modules/tika-parser-advanced-module/pom.xml
* tika-parser-modules/tika-parser-office-module/pom.xml
* tika-test-resources/src/main/java/org/apache/tika/TikaTest.java
* tika-parser-modules/tika-parser-crypto-module/pom.xml
* tika-core/src/test/java/org/apache/tika/config/DummyExecutor.java
* tika-parser-modules/tika-parser-ebook-module/pom.xml
* 
tika-parser-modules/tika-parser-scientific-module/src/test/java/org/apache/tika/parser/dif/DIFParserTest.java
* tika-parser-modules/tika-parser-cad-module/pom.xml
* tika-parser-modules/pom.xml
* 
tika-test-resources/src/main/java/org/apache/tika/config/AbstractTikaConfigTest.java
* tika-parser-modules/tika-parser-scientific-module/pom.xml
* tika-parser-modules/tika-parser-journal-module/pom.xml
* tika-parser-modules/tika-parser-code-module/pom.xml
* tika-core/pom.xml
* tika-parser-modules/tika-parser-text-module/pom.xml
* tika-parser-modules/tika-parser-web-module/pom.xml
* tika-core/src/test/java/org/apache/tika/config/TikaConfigTest.java
* tika-core/src/test/java/org/apache/tika/config/DummyParser.java
* 
tika-test-resources/src/test/java/org/apache/tika/parser/mock/MockParserTest.java
* tika-core/src/test/java/org/apache/tika/config/AbstractTikaConfigTest.java
* tika-core/src/test/java/org/apache/tika/TikaTest.java
* 
tika-parser-modules/tika-parser-multimedia-module/src/test/java/org/apache/tika/parser/ocr/TesseractOCRConfigTest.java




[jira] [Resolved] (TIKA-1851) Tika 2.0 - Move test resources from core to test-resources

2016-02-04 Thread Tim Allison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved TIKA-1851.
---
Resolution: Invalid

Moved shared test resources to test-resources and did some other very small 
test clean ups.

One oddity, which I'm not sure how to handle (or whether it is even a problem), 
is that tika-test-resources now has junit in full scope (not just test).  My 
thinking is that all other modules will use tika-test-resources in test scope 
only, so this won't be a problem. If there's a cleaner way of handling this, 
please let me know.

Had to duplicate a small bit of code from AbstractTikaConfigTest in tika-core 
to avoid circular dependency, but now we have clean(er) separation of tika-core 
and tika-test-resources.  My goal was to get to something akin to Lucene's 
lucene-test-framework.

After working through our unit tests across a variety of parsers recently, I 
see a large opportunity to refactor many tests and make more use of TikaTest 
functionality...that's for another ticket.



[jira] [Commented] (TIKA-1824) Tika 2.0 - Create Initial Parser Modules

2016-02-04 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15132507#comment-15132507
 ] 

Tim Allison commented on TIKA-1824:
---

Sorry, [~grossws], [~thaichat04] and [~lfcnassif] should have included you in 
the above! :)



[jira] [Created] (TIKA-1852) Tika 2.0 - clean up unit tests to rely more on TikaTest

2016-02-04 Thread Tim Allison (JIRA)
Tim Allison created TIKA-1852:
-

 Summary: Tika 2.0 - clean up unit tests to rely more on TikaTest
 Key: TIKA-1852
 URL: https://issues.apache.org/jira/browse/TIKA-1852
 Project: Tika
  Issue Type: Task
Reporter: Tim Allison
Priority: Trivial


Unit tests for different parsers often have different habits for accomplishing 
roughly the same things. In 2.0, it would be nice to clean up some unit tests 
and rely more on TikaTest, esp. now that we have a separate bundle for a 
test-framework.





[jira] [Commented] (TIKA-1836) Convertion DOC->TXT failed due to POI issue

2016-02-04 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15132895#comment-15132895
 ] 

Tim Allison commented on TIKA-1836:
---

Committed workaround to log rather than throw an exception in POI r1728547.  
Once the next version of POI is out and once we integrate that into Tika, this 
issue should be "fixed" at the Tika level.  The true fix would be to add 
parsing for that kind of record in POI...any takers?
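
The "log rather than throw" approach mentioned above follows a common pattern, sketched here in general form. This is not the POI change itself: the class `LenientReader`, the `?` prefix that marks an unsupported record, and the string records are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.logging.Logger;

// Hypothetical illustration of logging and skipping an unsupported record
// instead of failing the whole document.
final class LenientReader {
    private static final Logger LOG = Logger.getLogger(LenientReader.class.getName());

    // Strict parser: throws for a record kind it cannot handle.
    static String parseRecord(String raw) {
        if (raw.startsWith("?")) {
            throw new UnsupportedOperationException("unsupported record: " + raw);
        }
        return raw.trim();
    }

    // Lenient wrapper: logs the problem and keeps the records that did parse.
    static List<String> parseAll(List<String> raws) {
        List<String> parsed = new ArrayList<>();
        for (String raw : raws) {
            try {
                parsed.add(parseRecord(raw));
            } catch (UnsupportedOperationException e) {
                LOG.warning("Skipping record: " + e.getMessage());
            }
        }
        return parsed;
    }

    public static void main(String[] args) {
        System.out.println(parseAll(Arrays.asList(" title ", "?savedBy", "body")));
    }
}
```

The cost is that the skipped record's data is silently absent from the output, which is why a real fix would still need proper parsing for that record type.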

> Convertion DOC->TXT failed due to POI issue
> ---
>
> Key: TIKA-1836
> URL: https://issues.apache.org/jira/browse/TIKA-1836
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.11
> Environment: Distributor ID:  Ubuntu
> Description:  Ubuntu 12.04.5 LTS
> Release:  12.04
> Codename: precise
> java version "1.7.0_91"
> OpenJDK Runtime Environment (IcedTea 2.6.3) (7u91-2.6.3-0ubuntu0.12.04.1)
> OpenJDK 64-Bit Server VM (build 24.91-b01, mixed mode)
>Reporter: Jorge Spinsanti
> Attachments: test.doc
>
>
> When we try to convert DOC -> TXT, we get the following stack trace:
> {code}
> Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@1ddeedb6
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   ... 15 more
> Caused by: java.lang.UnsupportedOperationException: Non-extended character Pascal strings are not supported right now. Please, contact POI developers for update.
>   at org.apache.poi.hwpf.model.Sttb.fillFields(Sttb.java:82)
>   at org.apache.poi.hwpf.model.Sttb.<init>(Sttb.java:61)
>   at org.apache.poi.hwpf.model.SttbUtils.readSttbSavedBy(SttbUtils.java:52)
>   at org.apache.poi.hwpf.model.SavedByTable.<init>(SavedByTable.java:53)
>   at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:361)
>   at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:144)
>   at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:146)
>   at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
>   at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>   ... 22 more
> {code}





Tika 2.0 and language detection

2016-02-04 Thread Ken Krugler
Hi all,

Over at https://issues.apache.org/jira/browse/TIKA-1723, Tim & I have been 
discussing whether to focus these pending changes on the 2.0 branch, and leave 
1.x as-is.

As part of that, we could do a cut-and-run in 2.0, and not spend the time to 
port the current (Tika 1.x) language detector code.

I'm in favor of that approach, as I think leveraging the new detector 
project(s) gives us faster & more accurate results over more languages.

But we're posting to the more general audience here, to gather input on things 
that we might not be considering.

Thanks,

-- Ken



--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr







[jira] [Commented] (TIKA-1723) Integrate language-detector into Tika

2016-02-04 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15132961#comment-15132961
 ] 

Ken Krugler commented on TIKA-1723:
---

Good idea re gathering input - I just emailed the dev list.
