RE: Move definitively from SVN to Git ?

2014-11-19 Thread Ken Krugler

> From: Tyler Palsulich
> Sent: November 19, 2014 5:43:12pm PST
> To: dev@tika.apache.org
> Subject: Re: Move definitively from SVN to Git ?
> 
> On Mon, Nov 17, 2014 at 6:23 AM, Nick Burch  wrote:
>> 
>> Given that non-committers can already work with Git, could you explain
>> what committers would gain from the move to Git which would outweigh the
>> effort that SVN-using committers would have to expend with the move?
>> 
> 
> Applying patches from GitHub pull requests is kind of clunky... A
> contributor sends PR, we review, make changes, accept, download a diff,
> apply it, and svn commit, which is then mirrored back to GitHub.
> 
> What is our preferred way of getting new contributions? In my opinion, pull
> request and merge is better than an upload/download/apply of a patch file.
> On the other hand, it might be awkward to have all patches come in as pull
> requests if we're referring to them from
> https://issues.apache.org/jira/browse/tika.
> 
> Being able to work on separate branches for large changes (e.g. TIKA-1445
> and TIKA-1302) is very convenient.
> 
> What is the effort SVN-using committers would have to expend?
> 
> I don't mean to incite a VCS war. ;)

git v. svn is more like a brushfire that flares up every few months, at least 
on the @members list :)

-- Ken

--
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr







Re: Move definitively from SVN to Git ?

2014-11-19 Thread Tyler Palsulich
On Mon, Nov 17, 2014 at 6:23 AM, Nick Burch  wrote:
>
> Given that non-committers can already work with Git, could you explain
> what committers would gain from the move to Git which would outweigh the
> effort that SVN-using committers would have to expend with the move?
>

Applying patches from GitHub pull requests is kind of clunky... A
contributor sends PR, we review, make changes, accept, download a diff,
apply it, and svn commit, which is then mirrored back to GitHub.

What is our preferred way of getting new contributions? In my opinion, pull
request and merge is better than an upload/download/apply of a patch file.
On the other hand, it might be awkward to have all patches come in as pull
requests if we're referring to them from
https://issues.apache.org/jira/browse/tika.

Being able to work on separate branches for large changes (e.g. TIKA-1445
and TIKA-1302) is very convenient.

What is the effort SVN-using committers would have to expend?

I don't mean to incite a VCS war. ;)

Have a good night,
Tyler


[jira] [Commented] (TIKA-1302) Let's run Tika against a large batch of docs nightly

2014-11-19 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14218825#comment-14218825
 ] 

Tyler Palsulich commented on TIKA-1302:
---

I just got access to an HPC cluster at NYU. How are you running Tika against 
the govdocs corpus, Tim? I'm downloading it right now and would like to 
reproduce your results.

> Let's run Tika against a large batch of docs nightly
> 
>
> Key: TIKA-1302
> URL: https://issues.apache.org/jira/browse/TIKA-1302
> Project: Tika
>  Issue Type: Improvement
>  Components: cli, general, server
>Reporter: Tim Allison
>
> Many thanks to [~lewismc] for TIKA-1301!  Once we get nightly builds up and 
> running again, it might be fun to run Tika regularly against a large set of 
> docs and report metrics.
> One excellent candidate corpus is govdocs1: 
> http://digitalcorpora.org/corpora/files.
> Any other candidate corpora?  
> [~willp-bl], have anything handy you'd like to contribute? 
> [http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite]
>  ;) 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (TIKA-1484) Boilerpipe dependency is evil

2014-11-19 Thread Ben McCann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14218582#comment-14218582
 ] 

Ben McCann commented on TIKA-1484:
--

Yes, it turns out I can exclude Boilerpipe. I wasn't sure if I could at first 
because I wasn't sure how Tika was using it. I had to check out and read the 
source code to determine if this was a safe option.

I don't have any candidates for replacing it.

Maybe it could be moved to a separate boilerpipe-parser project?

> Boilerpipe dependency is evil
> -
>
> Key: TIKA-1484
> URL: https://issues.apache.org/jira/browse/TIKA-1484
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.6
>Reporter: Ben McCann
>
> The Boilerpipe project bundles inside it two classes from org.cyberneko.html. 
> We're already using NekoHTML in our project. Depending on which library shows 
> up on our classpath certain parts of our project will either work or not. I'd 
> really love it if Boilerpipe could be fixed or replaced with some other 
> library that is a better citizen.
> I see I'm not the first person to run into this as another Tika user has 
> filed a bug on the Boilerpipe project: 
> https://code.google.com/p/boilerpipe/issues/detail?id=62





[jira] [Commented] (TIKA-1484) Boilerpipe dependency is evil

2014-11-19 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14218553#comment-14218553
 ] 

Ken Krugler commented on TIKA-1484:
---

1. I assume you can exclude the Boilerpipe jar from the Tika dependency, as a 
work-around (though only if you don't need Boilerpipe). Or is that not working?

2. Do you have a candidate for replacing Boilerpipe?

3. Another possibility is that we create a facade that lets you plug in the 
implementation. This would let us remove the explicit dependency on Boilerpipe. 
Though anyone who's dealt with this and XML parsers understands that it can 
also cause pain and suffering.
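
A minimal sketch of the facade idea in point 3, under stated assumptions: the
names `BoilerplateRemover`, `PassThroughRemover` and `BoilerplateRemovers` are
hypothetical and are not part of Tika or Boilerpipe. It only illustrates how an
implementation could be discovered at runtime with the JDK's `ServiceLoader`,
so that the core has no compile-time dependency on any one library.

```java
import java.util.Iterator;
import java.util.ServiceLoader;

/** Hypothetical facade; the name and method are illustrative, not a Tika API. */
interface BoilerplateRemover {
    /** Returns the main text of an HTML document with boilerplate removed. */
    String extractMainText(String html);
}

/** Fallback used when no implementation is registered on the classpath. */
final class PassThroughRemover implements BoilerplateRemover {
    @Override
    public String extractMainText(String html) {
        return html; // no boilerplate removal: the input is returned unchanged
    }
}

final class BoilerplateRemovers {
    private BoilerplateRemovers() {}

    /** Picks the first registered implementation, or the pass-through fallback. */
    static BoilerplateRemover load() {
        Iterator<BoilerplateRemover> it =
                ServiceLoader.load(BoilerplateRemover.class).iterator();
        return it.hasNext() ? it.next() : new PassThroughRemover();
    }

    public static void main(String[] args) {
        BoilerplateRemover remover = load();
        System.out.println(remover.extractMainText("<p>hello</p>"));
    }
}
```

With this shape, a Boilerpipe-backed implementation could live in a separate
jar and register itself under `META-INF/services`; users who exclude that jar
would get the fallback instead of a classpath conflict.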



[jira] [Created] (TIKA-1484) Boilerpipe dependency is evil

2014-11-19 Thread Ben McCann (JIRA)
Ben McCann created TIKA-1484:


 Summary: Boilerpipe dependency is evil
 Key: TIKA-1484
 URL: https://issues.apache.org/jira/browse/TIKA-1484
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 1.6
Reporter: Ben McCann


The Boilerpipe project bundles inside it two classes from org.cyberneko.html. 
We're already using NekoHTML in our project. Depending on which library shows 
up on our classpath certain parts of our project will either work or not. I'd 
really love it if Boilerpipe could be fixed or replaced with some other library 
that is a better citizen.

I see I'm not the first person to run into this as another Tika user has filed 
a bug on the Boilerpipe project: 
https://code.google.com/p/boilerpipe/issues/detail?id=62





[jira] [Commented] (TIKA-1483) Create a general raw string parser

2014-11-19 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14218519#comment-14218519
 ] 

Luis Filipe Nassif commented on TIKA-1483:
--

Before someone asks: one key difference between this parser and the TextParser 
is that the intention here is to make a best effort to extract strings from 
non-text files, such as binaries and corrupted or unknown files (which can 
contain strings in different charsets).

> Create a general raw string parser
> --
>
> Key: TIKA-1483
> URL: https://issues.apache.org/jira/browse/TIKA-1483
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 1.6
>Reporter: Luis Filipe Nassif
>
> I think it would be very useful to add a general parser able to extract raw 
> strings from files (like the strings command), which could be used as the 
> fallback parser for all mimetypes that have no specific parser 
> implementation, like application/octet-stream. It could also be used as a 
> fallback for corrupt files whose parsing throws a TikaException.
> It must be configured with the script/language to be extracted from the files 
> (currently I have implemented one specific to Latin1).
> It can use heuristics to extract strings encoded with different charsets 
> within the same file, mainly the common ISO-8859-1, UTF-8 and UTF-16.
> What does the community think about that?
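
A rough sketch of the kind of extraction described above, assuming the
simplest case only: runs of printable ISO-8859-1 bytes of a minimum length,
similar in spirit to the `strings` command. The class name
`RawStringExtractor` and the printable-byte rule are assumptions made for
illustration; this is not the implementation proposed in the issue and it does
not attempt the multi-charset heuristics mentioned there.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: extracts runs of printable Latin1 bytes from raw data. */
final class RawStringExtractor {
    private RawStringExtractor() {}

    /** Printable ASCII, tab, or the printable upper half of ISO-8859-1. */
    static boolean isPrintableLatin1(int b) {
        return (b >= 0x20 && b <= 0x7E) || b == 0x09 || (b >= 0xA0 && b <= 0xFF);
    }

    /** Returns every run of at least minLength printable bytes, in file order. */
    static List<String> extract(byte[] data, int minLength) {
        List<String> result = new ArrayList<>();
        int start = -1; // start index of the current run, or -1 if none
        for (int i = 0; i <= data.length; i++) {
            boolean printable = i < data.length && isPrintableLatin1(data[i] & 0xFF);
            if (printable) {
                if (start < 0) {
                    start = i;
                }
            } else if (start >= 0) {
                // The run ended (or the data ended): keep it if long enough.
                if (i - start >= minLength) {
                    result.add(new String(data, start, i - start,
                            StandardCharsets.ISO_8859_1));
                }
                start = -1;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] data = {0x00, 'T', 'i', 'k', 'a', 0x01, 'a', 'b', 0x02, 'd', 'o', 'c', 'x'};
        System.out.println(extract(data, 4));
    }
}
```

The minimum length is the main tuning knob: a small value returns more noise
from binary data, a large one may drop short but meaningful strings.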





[jira] [Comment Edited] (TIKA-1481) TikaJAXRS get metadata calls give different results

2014-11-19 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14218505#comment-14218505
 ] 

Sergey Beryozkin edited comment on TIKA-1481 at 11/19/14 9:11 PM:
--

Hi Darya
It is something to do with the curl options. -T is effectively a form payload 
AFAIK, possibly a multipart/form-data one. The 2nd option is a direct body 
payload. Please use a tcp trace and see what is different.
By the way - it would be more beneficial for the community at large if you 
could ask the questions at the users list - the questions raised at JIRAs have 
a very low visibility, unless they do identify genuine issues
Thanks, Sergey


was (Author: sergey_beryozkin):
Hi Darya
It is something to do with the curl options. -T is effectively a form payload 
AFAIK, possibly a multipart/form-data one. The 2nd option is a direct body 
payload. Please use a tcp trace and see whta is different.
By the way - it would be more beneficial for the community at large if you 
could ask the questions at the users list - the questions raised at JIRAs have 
a very low visibility, unless they do identify genuine issue
Thanks, Sergey

> TikaJAXRS get metadata calls give different results
> ---
>
> Key: TIKA-1481
> URL: https://issues.apache.org/jira/browse/TIKA-1481
> Project: Tika
>  Issue Type: Bug
>  Components: server
>Affects Versions: 1.6
> Environment: Windows 8, JDK 1.8
>Reporter: Darya Arbuzova
>Priority: Minor
> Attachments: sample.csv
>
>
> Hello!
> I'm trying to use Tika in server mode.
> I downloaded tika-server-1.6.jar from http://mirror.vorboss.net/apache/tika/.
> I have tried to get file metadata in 2 different ways (as explained here: 
> http://wiki.apache.org/tika/TikaJAXRS ):
> {{> curl -T sample.csv http://localhost:9998/meta --header "Content-Type: 
> text/csv"}}
> {{"Content-Encoding","windows-1252"}}
> {{"Content-Type","text/plain; charset=windows-1252"}}
> and
> {{> curl -X PUT -d @sample.csv http://localhost:9998/meta --header 
> "Content-Type: text/csv"}}
> {{"Content-Encoding","ISO-8859-1"}}
> {{"Content-Type","text/plain; charset=ISO-8859-1"}}
> How come they give different results in encoding if I call the same 
> {{http://localhost:9998/meta}}?
> What other differences could appear, and which is the preferable way to 
> get metadata?
> Many thanks!
> Best regards,
> Darya Arbuzova
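
One possible explanation, not confirmed in this thread: according to the curl
documentation, `-T` uploads the file unchanged, whereas `-d @file` strips
carriage returns and newlines from the file before sending it
(`--data-binary @file` keeps them). The server would then receive different
bytes for the same file. That this is what changes the detected charset is an
assumption. The sketch below only mimics the documented stripping so the
difference in payload is visible; the class name `CurlPayloadSketch` is
hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical demo: compares a raw upload with a payload stripped of CR/LF. */
final class CurlPayloadSketch {
    private CurlPayloadSketch() {}

    /** Mimics the documented effect of curl's -d @file: CR and LF are dropped. */
    static byte[] stripLineBreaks(byte[] fileBytes) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte b : fileBytes) {
            if (b != '\r' && b != '\n') {
                out.write(b);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] raw = "a,b\r\nc,d\r\n".getBytes(StandardCharsets.US_ASCII);
        byte[] stripped = stripLineBreaks(raw);
        System.out.println("raw bytes:      " + raw.length);
        System.out.println("stripped bytes: " + stripped.length);
    }
}
```

If this is the cause, sending the file with `--data-binary @sample.csv` should
make the two requests carry the same body, which a tcp trace would confirm.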





[jira] [Comment Edited] (TIKA-1481) TikaJAXRS get metadata calls give different results

2014-11-19 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14218505#comment-14218505
 ] 

Sergey Beryozkin edited comment on TIKA-1481 at 11/19/14 9:12 PM:
--

Hi Darya
It is something to do with the curl options. -T is effectively a form payload 
AFAIK, possibly a multipart/form-data one. The 2nd option is a direct body 
payload. Please use a tcp trace and see what is different.
By the way - it would be more beneficial for the community at large if you 
could ask the questions at the users list - the questions raised at JIRAs have 
a very low visibility, unless they do identify genuine issues
Thanks, Sergey


was (Author: sergey_beryozkin):
Hi Darya
It is something to do with the curl options. -T is effectively a form payload 
AFAIK, possibly a multipart/form-data one. The 2nd option is a direct body 
payload. Please use a tcp trace and see what is different.
By the way - it would be more beneficial for the community at large if you 
could ask the questions at the users list - the questions raised at JIRAs have 
a very low visibility, unless they do identify genuine issue
Thanks, Sergey



[jira] [Commented] (TIKA-1481) TikaJAXRS get metadata calls give different results

2014-11-19 Thread Sergey Beryozkin (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14218505#comment-14218505
 ] 

Sergey Beryozkin commented on TIKA-1481:


Hi Darya
It is something to do with the curl options. -T is effectively a form payload 
AFAIK, possibly a multipart/form-data one. The 2nd option is a direct body 
payload. Please use a tcp trace and see what is different.
By the way - it would be more beneficial for the community at large if you 
could ask the questions at the users list - the questions raised at JIRAs have 
a very low visibility, unless they do identify genuine issues
Thanks, Sergey



[jira] [Commented] (TIKA-1473) Apache Tika is not working for .docx documents

2014-11-19 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14218324#comment-14218324
 ] 

Tim Allison commented on TIKA-1473:
---

Any chance you could cleanse the document of client-sensitive data and share it 
with us?  Are there attachments?

> Apache Tika is not working for .docx documents 
> ---
>
> Key: TIKA-1473
> URL: https://issues.apache.org/jira/browse/TIKA-1473
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.5, 1.6
>Reporter: Franco Catto
>Priority: Blocker
>
> I am using Apache Tika 1.6 to read different document files. 
> It reads pdf and old-format doc files, but when I try to read a docx file, 
> it gives me the following exception:
> org.apache.tika.exception.TikaException: Failed to close temporary resources 
> at org.apache.tika.io.TemporaryResources.dispose(TemporaryResources.java:152) 
> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:127) 
> ...
> The resource cannot be closed because it is still being used by the Java 
> process, certainly the OOXML parser.





[jira] [Commented] (TIKA-1473) Apache Tika is not working for .docx documents

2014-11-19 Thread Milan Zivkovic (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14218189#comment-14218189
 ] 

Milan Zivkovic commented on TIKA-1473:
--

Hi
I tried with a heap size of 6g (-Xmx6g -Xms6g).

Milan



[jira] [Commented] (TIKA-1473) Apache Tika is not working for .docx documents

2014-11-19 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14218120#comment-14218120
 ] 

Tyler Palsulich commented on TIKA-1473:
---

Hi,

Have you tried increasing the memory limit when you try to parse the document 
({{java -Xmx1000m -jar tika-app.jar}}, or something similar)?

Tyler



Re: svn commit: r1640535 - /tika/trunk/tika-server/src/main/java/org/apache/tika/server/TikaResource.java

2014-11-19 Thread David Meikle
Hey Guys,

> On 19 Nov 2014, at 17:09, Tyler Palsulich  wrote:
> 
> Found it! http://markmail.org/message/42nc64tdyhvzaril 
> 
> 
> Looks like javax, java, then other. I'll update the site today.

Sorry, this is a clean install and I didn’t update the settings (or notice).  Will 
set my config now and tidy up imports in recent commits.

Cheers,
Dave

[jira] [Commented] (TIKA-1483) Create a general raw string parser

2014-11-19 Thread Tyler Palsulich (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14218068#comment-14218068
 ] 

Tyler Palsulich commented on TIKA-1483:
---

Definitely agree. This would be really nice to have.



Re: svn commit: r1640535 - /tika/trunk/tika-server/src/main/java/org/apache/tika/server/TikaResource.java

2014-11-19 Thread Tyler Palsulich
Found it! http://markmail.org/message/42nc64tdyhvzaril

Looks like javax, java, then other. I'll update the site today.

Tyler

On Wed, Nov 19, 2014 at 10:54 AM, Nick Burch  wrote:

> On Wed, 19 Nov 2014, Tyler Palsulich wrote:
>
>> It looks like imports are being reordered here. I think we decided (can't
>> find an archive link right now) on java and javax imports before others.
>>
>
> Everything we wrote down is here:
> http://tika.apache.org/contribute.html#Code_Formatting
>
> Nothing there yet on that particular area, but I agree it'd be worth
> recording what we're generally doing!
>
> Nick
>


Re: svn commit: r1640535 - /tika/trunk/tika-server/src/main/java/org/apache/tika/server/TikaResource.java

2014-11-19 Thread Nick Burch

On Wed, 19 Nov 2014, Tyler Palsulich wrote:

> It looks like imports are being reordered here. I think we decided (can't
> find an archive link right now) on java and javax imports before others.


Everything we wrote down is here:
http://tika.apache.org/contribute.html#Code_Formatting

Nothing there yet on that particular area, but I agree it'd be worth 
recording what we're generally doing!


Nick


Re: svn commit: r1640535 - /tika/trunk/tika-server/src/main/java/org/apache/tika/server/TikaResource.java

2014-11-19 Thread Tyler Palsulich
On Wed, Nov 19, 2014 at 7:44 AM,  wrote:

> Author: dmeikle
> Date: Wed Nov 19 12:44:41 2014
> New Revision: 1640535
>
> URL: http://svn.apache.org/r1640535
> Log:
> TIKA-1477: Added new custom header to Tika resource override Tesseract OCR
> language
>
> Modified:
>
> tika/trunk/tika-server/src/main/java/org/apache/tika/server/TikaResource.java
>
> Modified: tika/trunk/tika-server/src/main/java/org/apache/tika/server/TikaResource.java
> URL: http://svn.apache.org/viewvc/tika/trunk/tika-server/src/main/java/org/apache/tika/server/TikaResource.java?rev=1640535&r1=1640534&r2=1640535&view=diff
>
> ==
> --- tika/trunk/tika-server/src/main/java/org/apache/tika/server/TikaResource.java (original)
> +++ tika/trunk/tika-server/src/main/java/org/apache/tika/server/TikaResource.java Wed Nov 19 12:44:41 2014
> @@ -17,35 +17,6 @@
>
>  package org.apache.tika.server;
>
> -import java.io.IOException;
> -import java.io.InputStream;
> -import java.io.OutputStream;
> -import java.io.OutputStreamWriter;
> -import java.io.Writer;
> -import java.util.Locale;
> -import java.util.Map;
> -import java.util.Set;
> -
> -import javax.mail.internet.ContentDisposition;
> -import javax.mail.internet.ParseException;
> -import javax.ws.rs.Consumes;
> -import javax.ws.rs.GET;
> -import javax.ws.rs.PUT;
> -import javax.ws.rs.Path;
> -import javax.ws.rs.Produces;
> -import javax.ws.rs.WebApplicationException;
> -import javax.ws.rs.core.Context;
> -import javax.ws.rs.core.HttpHeaders;
> -import javax.ws.rs.core.MultivaluedMap;
> -import javax.ws.rs.core.Response;
> -import javax.ws.rs.core.StreamingOutput;
> -import javax.ws.rs.core.UriInfo;
> -import javax.xml.transform.OutputKeys;
> -import javax.xml.transform.TransformerConfigurationException;
> -import javax.xml.transform.sax.SAXTransformerFactory;
> -import javax.xml.transform.sax.TransformerHandler;
> -import javax.xml.transform.stream.StreamResult;
> -
>  import org.apache.commons.logging.Log;
>  import org.apache.commons.logging.LogFactory;
>  import org.apache.cxf.jaxrs.ext.multipart.Attachment;
> @@ -63,14 +34,44 @@ import org.apache.tika.parser.AutoDetect
>  import org.apache.tika.parser.ParseContext;
>  import org.apache.tika.parser.Parser;
>  import org.apache.tika.parser.html.HtmlParser;
> +import org.apache.tika.parser.ocr.TesseractOCRConfig;
>  import org.apache.tika.sax.BodyContentHandler;
>  import org.apache.tika.sax.ExpandedTitleContentHandler;
>  import org.xml.sax.ContentHandler;
>  import org.xml.sax.SAXException;
>
> +import javax.mail.internet.ContentDisposition;
> +import javax.mail.internet.ParseException;
> +import javax.ws.rs.Consumes;
> +import javax.ws.rs.GET;
> +import javax.ws.rs.PUT;
> +import javax.ws.rs.Path;
> +import javax.ws.rs.Produces;
> +import javax.ws.rs.WebApplicationException;
> +import javax.ws.rs.core.Context;
> +import javax.ws.rs.core.HttpHeaders;
> +import javax.ws.rs.core.MultivaluedMap;
> +import javax.ws.rs.core.Response;
> +import javax.ws.rs.core.StreamingOutput;
> +import javax.ws.rs.core.UriInfo;
> +import javax.xml.transform.OutputKeys;
> +import javax.xml.transform.TransformerConfigurationException;
> +import javax.xml.transform.sax.SAXTransformerFactory;
> +import javax.xml.transform.sax.TransformerHandler;
> +import javax.xml.transform.stream.StreamResult;
> +import java.io.IOException;
> +import java.io.InputStream;
> +import java.io.OutputStream;
> +import java.io.OutputStreamWriter;
> +import java.io.Writer;
> +import java.util.Locale;
> +import java.util.Map;
> +import java.util.Set;
> +
>


It looks like imports are being reordered here. I think we decided (can't
find an archive link right now) on java and javax imports before others.

Tyler


[jira] [Comment Edited] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2014-11-19 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217965#comment-14217965
 ] 

Tim Allison edited comment on TIKA-1445 at 11/19/14 3:01 PM:
-

How about using the order of parsers as specified in TikaConfig?  That should 
accommodate 6 class files in different jars, no?

Via TikaConfig, we could also specify which subclass of a default composite 
parser to use.  I now see at least three use cases:
1) Tika classic: pick the first parser that applies and hope that it is the one 
you meant, ignore the others. :)
2) The use case we've been discussing, where each parser is additive.
3) A BackOffOnExceptionParser (TIKA-1483 got me thinking about this)
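
A minimal sketch of use case 3 above, assuming a simplified, hypothetical
`SimpleParser` interface rather than Tika's real `Parser` API (which works with
streams, content handlers and checked exceptions). It shows only the back-off
logic: try each parser in order and fall through to the next one when a parser
throws.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for a parser; this is not the Tika Parser interface. */
interface SimpleParser {
    String parse(byte[] data);
}

/** Tries each parser in order and backs off to the next one on an exception. */
final class BackOffOnExceptionSketch implements SimpleParser {
    private final List<SimpleParser> parsers;

    BackOffOnExceptionSketch(SimpleParser... parsers) {
        this.parsers = Arrays.asList(parsers);
    }

    @Override
    public String parse(byte[] data) {
        RuntimeException last = null;
        for (SimpleParser parser : parsers) {
            try {
                return parser.parse(data); // first success wins
            } catch (RuntimeException e) {
                last = e; // remember the failure and try the next parser
            }
        }
        if (last == null) {
            throw new IllegalStateException("no parsers configured");
        }
        throw last; // every parser failed: report the last failure
    }

    public static void main(String[] args) {
        SimpleParser failing = data -> {
            throw new IllegalStateException("corrupt file");
        };
        SimpleParser rawStrings = data -> "raw:" + data.length;
        System.out.println(new BackOffOnExceptionSketch(failing, rawStrings)
                .parse(new byte[3]));
    }
}
```

A raw string parser such as the one proposed in TIKA-1483 would be a natural
last entry in such a list.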

Wait, for Tika 2.0, couldn't we do all the class loading from TikaConfig?  We 
could also get rid of our one-off parser config hacks (like Solr):

{noformat}

  
2
something or other
  
  audio/basic
  audio/x-aiff
  audio/x-wav

{noformat}

We could specify a ChainingParser on the fly via config:
{noformat}

  org.apache.tika.parser.jpeg.JPegParser
  ...
  ...
  org.apache.tika.parser.ocr.TesseractOCR

  image/bmp
  image/gif
  image/png
  image/vnd.wap.wbmp
  image/x-icon
  image/x-ms-bmp
  image/x-xcf


{noformat}
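
As a rough illustration of the additive ChainingParser idea, the sketch below
runs every parser in the configured order against the same input and lets each
one add to a shared metadata map. The `MetadataParser` interface, the class
name `ChainingParserSketch` and the metadata keys are hypothetical; the real
Tika `Parser` and `Metadata` types are deliberately not used here.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a parser that contributes metadata entries. */
interface MetadataParser {
    void parse(byte[] data, Map<String, String> metadata);
}

/** Runs all parsers in order; each one adds to the same metadata map. */
final class ChainingParserSketch implements MetadataParser {
    private final List<MetadataParser> chain;

    ChainingParserSketch(MetadataParser... chain) {
        this.chain = Arrays.asList(chain);
    }

    @Override
    public void parse(byte[] data, Map<String, String> metadata) {
        for (MetadataParser parser : chain) {
            parser.parse(data, metadata); // later parsers see earlier entries
        }
    }

    public static void main(String[] args) {
        MetadataParser image = (data, m) -> m.put("Image-Width", "10");
        MetadataParser ocr = (data, m) -> m.put("X-OCR-Text", "hello");
        Map<String, String> metadata = new LinkedHashMap<>();
        new ChainingParserSketch(image, ocr).parse(new byte[0], metadata);
        System.out.println(metadata);
    }
}
```

A real version would also need a rule for conflicting keys and a way to give
each parser its own fresh view of the input stream.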


was (Author: talli...@mitre.org):
How about using the order of parsers as specified in TikaConfig?  That should 
accommodate 6 class files in different jars, no?

Via TikaConfig, we could also specify which subclass of a default composite 
parser to use.  I now see at least three use cases:
1) Tika classic: pick the first parser that applies and hope that it is the one 
you meant, ignore the others. :)
2) The use case we've been discussing, where each parser is additive.
3) A BackOffOnExceptionParser (TIKA-1483 got me thinking about this)

> Figure out how to add Image metadata extraction to Tesseract parser
> ---
>
> Key: TIKA-1445
> URL: https://issues.apache.org/jira/browse/TIKA-1445
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
> Fix For: 1.8
>
> Attachments: TIKA-1445.Mattmann.101214.patch.txt, 
> TIKA-1445.Palsulich.102614.patch, TIKA-1445_tallison_20141027.patch.txt, 
> TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch
>
>
> Now that Tesseract is the default image parser in Tika for many image types, 
> consider how to add back in the metadata extraction capabilities by the other 
> Image parsers.





[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2014-11-19 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217965#comment-14217965
 ] 

Tim Allison commented on TIKA-1445:
---

How about using the order of parsers as specified in TikaConfig?  That should 
accommodate 6 class files in different jars, no?

Via TikaConfig, we could also specify which subclass of a default composite 
parser to use.  I now see at least three use cases:
1) Tika classic: pick the first parser that applies and hope that it is the one 
you meant, ignore the others. :)
2) The use case we've been discussing, where each parser is additive.
3) A BackOffOnExceptionParser (TIKA-1483 got me thinking about this)



[jira] [Closed] (TIKA-1471) OOM with corrupt PDF file

2014-11-19 Thread Alan Burlison (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Burlison closed TIKA-1471.
---
   Resolution: Done
Fix Version/s: (was: 1.7)

Fixed in later versions of upstream PDFBox component

> OOM with corrupt PDF file
> -
>
> Key: TIKA-1471
> URL: https://issues.apache.org/jira/browse/TIKA-1471
> Project: Tika
>  Issue Type: Bug
>  Components: general
>Affects Versions: 1.6
> Environment: Linux, JVM 1.8.0_25-b17, 64-bit
>Reporter: Alan Burlison
>Priority: Blocker
>
> Use of PDFBox 1.8.6 by Tika 1.6 is causing OOM errors with corrupt PDF files, 
> due to a bug in PDFBox, see PDFBOX-2493. This makes Tika 1.6 unusable from 
> inside a long-running webapp and I've had to revert to Tika 1.5. Although 1.5 
> also throws errors with the corrupt file it does not cause OOM errors.





[jira] [Commented] (TIKA-1471) OOM with corrupt PDF file

2014-11-19 Thread Alan Burlison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217942#comment-14217942
 ] 

Alan Burlison commented on TIKA-1471:
-

I think you'd probably have to look at performance to see if there's any 
difference between calling it after every document or every N documents - I'm 
guessing every time is simplest.

I logged the issue just so you and people using Tika knew there was a potential 
OOM with Tika 1.6 (PDFBox 1.8.6). As you say, it's already fixed in later 
PDFBox releases, so I think there's nothing more to be done here. I'll mark it 
as resolved, and thanks again for your help.

> OOM with corrupt PDF file
> -
>
> Key: TIKA-1471
> URL: https://issues.apache.org/jira/browse/TIKA-1471
> Project: Tika
>  Issue Type: Bug
>  Components: general
>Affects Versions: 1.6
> Environment: Linux, JVM 1.8.0_25-b17, 64-bit
>Reporter: Alan Burlison
>Priority: Blocker
> Fix For: 1.7
>
>
> Use of PDFBox 1.8.6 by Tika 1.6 is causing OOM errors with corrupt PDF files, 
> due to a bug in PDFBox, see PDFBOX-2493. This makes Tika 1.6 unusable from 
> inside a long-running webapp and I've had to revert to Tika 1.5. Although 1.5 
> also throws errors with the corrupt file it does not cause OOM errors.





[jira] [Commented] (TIKA-1477) Add custom header to allow overriding of OCR language to be used in Tika Server

2014-11-19 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217913#comment-14217913
 ] 

Hudson commented on TIKA-1477:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #323 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/323/])
TIKA-1477: Added new custom header to Tika resource override Tesseract OCR 
language (dmeikle: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1640535)
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/TikaResource.java


> Add custom header to allow overriding of OCR language to be used in Tika 
> Server
> ---
>
> Key: TIKA-1477
> URL: https://issues.apache.org/jira/browse/TIKA-1477
> Project: Tika
>  Issue Type: Bug
>  Components: server
>Reporter: Dave Meikle
>Assignee: Dave Meikle
>Priority: Minor
> Fix For: 1.7
>
>
> The _TesseractOCRParser_ relies on different language models to accurately 
> OCR content written in different languages.  At present, the Tika Server 
> provides no way to specify additional specific languages without code changes.
> To enable clients to ask for processing to be performed using specific 
> language models, we should add an optional new custom HTTP header (e.g. 
> X-Tika-OCRLanguage) which will override the TesseractOCRConfig language value 
> and set it on the ParseContext for use during parsing.
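
A rough sketch of the override rule described above, assuming the request headers are available as a plain Map. The class name and the resolve() method are made up for illustration and are not the TikaResource code; the header name is the one suggested in the issue.

```java
import java.util.Map;

final class OcrLanguageHeader {
    /** Header name as suggested in the issue description. */
    static final String HEADER = "X-Tika-OCRLanguage";

    /**
     * Returns the header value when present and non-blank,
     * otherwise the configured default language.
     */
    static String resolve(Map<String, String> headers, String configuredDefault) {
        for (Map.Entry<String, String> e : headers.entrySet()) {
            // HTTP header names are case-insensitive.
            if (HEADER.equalsIgnoreCase(e.getKey())) {
                String value = e.getValue();
                if (value != null && !value.trim().isEmpty()) {
                    return value.trim();
                }
            }
        }
        return configuredDefault;
    }
}
```

The resolved value is what would then be set on the OCR config before parsing; a request without the header keeps the configured language.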





[jira] [Commented] (TIKA-1477) Add custom header to allow overriding of OCR language to be used in Tika Server

2014-11-19 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217865#comment-14217865
 ] 

Hudson commented on TIKA-1477:
--

SUCCESS: Integrated in tika-trunk-jdk1.6 #304 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.6/304/])
TIKA-1477: Added new custom header to Tika resource override Tesseract OCR 
language (dmeikle: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1640535)
* /tika/trunk/tika-server/src/main/java/org/apache/tika/server/TikaResource.java


> Add custom header to allow overriding of OCR language to be used in Tika 
> Server
> ---
>
> Key: TIKA-1477
> URL: https://issues.apache.org/jira/browse/TIKA-1477
> Project: Tika
>  Issue Type: Bug
>  Components: server
>Reporter: Dave Meikle
>Assignee: Dave Meikle
>Priority: Minor
> Fix For: 1.7
>
>
> The _TesseractOCRParser_ relies on different language models to accurately 
> OCR content written in different languages.  At present, the Tika Server 
> provides no way to specify additional specific languages without code changes.
> To enable clients to ask for processing to be performed using specific 
> language models, we should add an optional new custom HTTP header (e.g. 
> X-Tika-OCRLanguage) which will override the TesseractOCRConfig language value 
> and set it on the ParseContext for use during parsing.





[jira] [Commented] (TIKA-1471) OOM with corrupt PDF file

2014-11-19 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217838#comment-14217838
 ] 

Tim Allison commented on TIKA-1471:
---

Got it.  Thank _you_, [~alanbur].  [~jahewson] dug into the 
PDFont.clearResources() issue on PDFBOX-2200 and declared the static call safe 
even in a multithreaded environment.  The overall issue disappears with PDFBox 
2.0.

In Tika's PDFParser, we're now calling clearResources() after every 
document...I'm wondering if we should do it after every 1000 docs or so.  
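
For the "every N docs" idea, a minimal sketch follows. PeriodicCleanup is a hypothetical helper, not existing Tika code, and the cleanup action is passed in as a Runnable so the sketch does not depend on PDFBox.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Runs a cleanup action once every N processed documents. */
final class PeriodicCleanup {
    private final long interval;
    private final Runnable cleanup;
    private final AtomicLong processed = new AtomicLong();

    PeriodicCleanup(long interval, Runnable cleanup) {
        if (interval < 1) {
            throw new IllegalArgumentException("interval must be >= 1");
        }
        this.interval = interval;
        this.cleanup = cleanup;
    }

    /** Call after each document; returns true if the cleanup ran this time. */
    boolean documentDone() {
        if (processed.incrementAndGet() % interval == 0) {
            cleanup.run();
            return true;
        }
        return false;
    }
}
```

With an interval of 1 this is the current behaviour (cleanup after every document); with 1000 it would run after every thousandth document. Whether the larger interval is worth it would need measuring.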

Should we close this issue?  Any more work to do?  Thank you, again.

> OOM with corrupt PDF file
> -
>
> Key: TIKA-1471
> URL: https://issues.apache.org/jira/browse/TIKA-1471
> Project: Tika
>  Issue Type: Bug
>  Components: general
>Affects Versions: 1.6
> Environment: Linux, JVM 1.8.0_25-b17, 64-bit
>Reporter: Alan Burlison
>Priority: Blocker
> Fix For: 1.7
>
>
> Use of PDFBox 1.8.6 by Tika 1.6 is causing OOM errors with corrupt PDF files, 
> due to a bug in PDFBox, see PDFBOX-2493. This makes Tika 1.6 unusable from 
> inside a long-running webapp and I've had to revert to Tika 1.5. Although 1.5 
> also throws errors with the corrupt file it does not cause OOM errors.





[jira] [Commented] (TIKA-1482) ForkParser throws exceptions when process some large pdf files

2014-11-19 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217805#comment-14217805
 ] 

Tim Allison commented on TIKA-1482:
---

Alright, couldn't resist.  I get an OOM with -Xmx32m (the default value in 
ForkParser) with pure PDFBox 1.8.7.  If I bump that to 64m, all is well.

> ForkParser throws exceptions when process some large pdf files
> --
>
> Key: TIKA-1482
> URL: https://issues.apache.org/jira/browse/TIKA-1482
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.6
> Environment: Windows 7_x64 / JDK 1.7.0_17
>Reporter: Sean Zhao
>Priority: Critical
> Fix For: 1.6
>
> Attachments: SRCH-13412.pdf
>
>
> In Tika 1.6, ForkParser throws org.apache.tika.exception.TikaException , 
> message:Unexpected error in forked server process, when parsing some large 
> pdf files.  While tika 1.3 won't.





[jira] [Commented] (TIKA-1482) ForkParser throws exceptions when process some large pdf files

2014-11-19 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217800#comment-14217800
 ] 

Tim Allison commented on TIKA-1482:
---

To add to [~gagravarr]'s comment, if [~lfcnassif]'s recommendation doesn't 
work, before opening an issue on PDFBox's Jira, try grabbing PDFBox 1.8.7 
[here|https://pdfbox.apache.org/downloads.html] and run the app:
{noformat}
java -jar pdfbox-app-x.y.z.jar ExtractText [OPTIONS] <PDF file> [Text file]
{noformat}

> ForkParser throws exceptions when process some large pdf files
> --
>
> Key: TIKA-1482
> URL: https://issues.apache.org/jira/browse/TIKA-1482
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.6
> Environment: Windows 7_x64 / JDK 1.7.0_17
>Reporter: Sean Zhao
>Priority: Critical
> Fix For: 1.6
>
> Attachments: SRCH-13412.pdf
>
>
> In Tika 1.6, ForkParser throws org.apache.tika.exception.TikaException , 
> message:Unexpected error in forked server process, when parsing some large 
> pdf files.  While tika 1.3 won't.





[jira] [Commented] (TIKA-595) HtmlHandler does not support multivalue metadata

2014-11-19 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217799#comment-14217799
 ] 

Hudson commented on TIKA-595:
-

SUCCESS: Integrated in tika-trunk-jdk1.6 #303 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.6/303/])
TIKA-595: Adding Julien Nioche's patch to enable Multivalue Metadata for Html 
(dmeikle: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1640521)
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlHandler.java


> HtmlHandler does not support multivalue metadata
> 
>
> Key: TIKA-595
> URL: https://issues.apache.org/jira/browse/TIKA-595
> Project: Tika
>  Issue Type: Bug
>  Components: metadata, parser
>Affects Versions: 0.8
>Reporter: Lutz Pumpenmeier
>Assignee: Dave Meikle
>Priority: Minor
> Fix For: 1.7
>
> Attachments: TIKA-595.patch
>
>
> The HtmlHandler uses metadata.set(...), so META tags that occur more than 
> once are not handled correctly (DublinCore metadata can be set more than 
> once).
> The handler should use metadata.add(...) instead.
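
The difference between the two calls can be shown with a toy multi-valued store. ToyMetadata below is a made-up class for illustration, not Tika's Metadata, but it follows the same idea: set replaces, add appends.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy multi-valued metadata store; not Tika's Metadata class. */
final class ToyMetadata {
    private final Map<String, List<String>> values = new LinkedHashMap<>();

    /** Replaces any existing values for the name with a single value. */
    void set(String name, String value) {
        List<String> single = new ArrayList<>();
        single.add(value);
        values.put(name, single);
    }

    /** Appends a value, keeping the ones already stored under the name. */
    void add(String name, String value) {
        List<String> list = values.get(name);
        if (list == null) {
            list = new ArrayList<>();
            values.put(name, list);
        }
        list.add(value);
    }

    /** Returns a copy of all values stored under the name (possibly empty). */
    List<String> getValues(String name) {
        List<String> list = values.get(name);
        return list == null ? new ArrayList<String>() : new ArrayList<>(list);
    }
}
```

With two META tags of the same name, set() keeps only the last one, while add() keeps both.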





[jira] [Commented] (TIKA-595) HtmlHandler does not support multivalue metadata

2014-11-19 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217785#comment-14217785
 ] 

Hudson commented on TIKA-595:
-

SUCCESS: Integrated in tika-trunk-jdk1.7 #322 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/322/])
TIKA-595: Adding Julien Nioche's patch to enable Multivalue Metadata for Html 
(dmeikle: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1640521)
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlHandler.java


> HtmlHandler does not support multivalue metadata
> 
>
> Key: TIKA-595
> URL: https://issues.apache.org/jira/browse/TIKA-595
> Project: Tika
>  Issue Type: Bug
>  Components: metadata, parser
>Affects Versions: 0.8
>Reporter: Lutz Pumpenmeier
>Assignee: Dave Meikle
>Priority: Minor
> Fix For: 1.7
>
> Attachments: TIKA-595.patch
>
>
> The HtmlHandler uses metadata.set(...), so META tags that occur more than 
> once are not handled correctly (DublinCore metadata can be set more than 
> once).
> The handler should use metadata.add(...) instead.





[jira] [Commented] (TIKA-1446) CHM parser : wrong decompression of aligned blocks

2014-11-19 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217786#comment-14217786
 ] 

Hudson commented on TIKA-1446:
--

SUCCESS: Integrated in tika-trunk-jdk1.7 #322 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.7/322/])
Reverting incorrect commit whilst fixing test on TIKA-1446 (dmeikle: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1640520)
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlHandler.java
TIKA-1446: Updated test so it loads the test documents from the classpath 
(dmeikle: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1640518)
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlHandler.java
* 
/tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/chm/TestChmExtraction.java


> CHM parser : wrong decompression of aligned blocks
> --
>
> Key: TIKA-1446
> URL: https://issues.apache.org/jira/browse/TIKA-1446
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.7
>Reporter: Bin Hawking
>Priority: Critical
> Attachments: chm.zip
>
>
> If an embedded file contains aligned blocks, the parser outputs chaotic text 
> or empty text as to this file.
> I have fixed it myself, corrected decompressAlignedBlock() and its 
> preparation methods. Mostly this bug is due to misusing main tree/align 
> tree/length tree. And some tree is built wrong.





[jira] [Commented] (TIKA-1483) Create a general raw string parser

2014-11-19 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217747#comment-14217747
 ] 

Tim Allison commented on TIKA-1483:
---

+1.  It would be great to have something like this, especially if we could add 
language models eventually a la 
[la-strings|http://la-strings.sourceforge.net/].  We could also use this as a 
fallback parser in case there's an exception.

> Create a general raw string parser
> --
>
> Key: TIKA-1483
> URL: https://issues.apache.org/jira/browse/TIKA-1483
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 1.6
>Reporter: Luis Filipe Nassif
>
> I think it can be very useful adding a general parser able to extract raw 
> strings from files (like the strings command), which can be used as the 
> fallback parser for all mimetypes not having a specific parser 
> implementation, like application/octet-stream. It can also be used as a 
> fallback for corrupt files throwing a TikaException.
> It must be configured with the script/language to be extracted from the files 
> (currently I implemented one specific for Latin1).
> It can use heuristics to extract strings encoded with different charsets 
> within the same file, mainly the common ISO-8859-1, UTF8 and UTF16.
> What does the community think about that?
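
As a rough illustration of the "strings command" idea, the sketch below pulls runs of printable bytes out of arbitrary data. It is not the reporter's implementation: RawStrings is a made-up name, and it only recognises the ASCII subset (no Latin-1 high range, no UTF-8/UTF-16 heuristics).

```java
import java.util.ArrayList;
import java.util.List;

final class RawStrings {
    /**
     * Extracts runs of at least minLength printable ASCII bytes
     * (0x20..0x7E, plus tab) from the given data.
     */
    static List<String> extract(byte[] data, int minLength) {
        List<String> found = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (byte b : data) {
            int c = b & 0xFF;
            if ((c >= 0x20 && c <= 0x7E) || c == '\t') {
                run.append((char) c);
            } else {
                // A non-printable byte ends the current run.
                if (run.length() >= minLength) {
                    found.add(run.toString());
                }
                run.setLength(0);
            }
        }
        if (run.length() >= minLength) {
            found.add(run.toString());
        }
        return found;
    }
}
```

The minimum length plays the same role as in the strings command: it filters out short accidental runs in binary data.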





[jira] [Commented] (TIKA-595) HtmlHandler does not support multivalue metadata

2014-11-19 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217749#comment-14217749
 ] 

Julien Nioche commented on TIKA-595:


Thanks Dave!

> HtmlHandler does not support multivalue metadata
> 
>
> Key: TIKA-595
> URL: https://issues.apache.org/jira/browse/TIKA-595
> Project: Tika
>  Issue Type: Bug
>  Components: metadata, parser
>Affects Versions: 0.8
>Reporter: Lutz Pumpenmeier
>Assignee: Dave Meikle
>Priority: Minor
> Fix For: 1.7
>
> Attachments: TIKA-595.patch
>
>
> The HtmlHandler uses metadata.set(...), so META tags that occur more than 
> once are not handled correctly (DublinCore metadata can be set more than 
> once).
> The handler should use metadata.add(...) instead.





[jira] [Commented] (TIKA-1446) CHM parser : wrong decompression of aligned blocks

2014-11-19 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217742#comment-14217742
 ] 

Hudson commented on TIKA-1446:
--

SUCCESS: Integrated in tika-trunk-jdk1.6 #302 (See 
[https://builds.apache.org/job/tika-trunk-jdk1.6/302/])
Reverting incorrect commit whilst fixing test on TIKA-1446 (dmeikle: 
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1640520)
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlHandler.java
TIKA-1446: Updated test so it loads the test documents from the classpath 
(dmeikle: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1640518)
* 
/tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/html/HtmlHandler.java
* 
/tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/chm/TestChmExtraction.java


> CHM parser : wrong decompression of aligned blocks
> --
>
> Key: TIKA-1446
> URL: https://issues.apache.org/jira/browse/TIKA-1446
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.7
>Reporter: Bin Hawking
>Priority: Critical
> Attachments: chm.zip
>
>
> If an embedded file contains aligned blocks, the parser outputs chaotic text 
> or empty text as to this file.
> I have fixed it myself, corrected decompressAlignedBlock() and its 
> preparation methods. Mostly this bug is due to misusing main tree/align 
> tree/length tree. And some tree is built wrong.





[jira] [Resolved] (TIKA-595) HtmlHandler does not support multivalue metadata

2014-11-19 Thread Dave Meikle (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Meikle resolved TIKA-595.
--
Resolution: Fixed

Committed Julien Nioche's patch in r1640521. Thanks!

> HtmlHandler does not support multivalue metadata
> 
>
> Key: TIKA-595
> URL: https://issues.apache.org/jira/browse/TIKA-595
> Project: Tika
>  Issue Type: Bug
>  Components: metadata, parser
>Affects Versions: 0.8
>Reporter: Lutz Pumpenmeier
>Assignee: Dave Meikle
>Priority: Minor
> Fix For: 1.7
>
> Attachments: TIKA-595.patch
>
>
> The HtmlHandler uses metadata.set(...), so META tags that occur more than 
> once are not handled correctly (DublinCore metadata can be set more than 
> once).
> The handler should use metadata.add(...) instead.





[jira] [Updated] (TIKA-1483) Create a general raw string parser

2014-11-19 Thread Luis Filipe Nassif (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luis Filipe Nassif updated TIKA-1483:
-
Description: 
I think it can be very useful adding a general parser able to extract raw 
strings from files (like the strings command), which can be used as the 
fallback parser for all mimetypes not having a specific parser implementation, 
like application/octet-stream. It can also be used as a fallback for corrupt 
files throwing a TikaException.

It must be configured with the script/language to be extracted from the files 
(currently I implemented one specific for Latin1).
It can use heuristics to extract strings encoded with different charsets within 
the same file, mainly the common ISO-8859-1, UTF8 and UTF16.

What does the community think about that?

  was:
I think it can be very useful adding a general parser able to extract raw 
strings from files (like the strings command), which can be used as the 
fallback parser for all mimetypes not having a specific parser implementation, 
like application/octet-stream. It can also be used as a fallback for corrupt 
files throwing a TikaException.

It must be configured with the script/language to be extracted from the files 
(currently I implemented one specific for Latin1).
It can use heuristics to extract strings encoded with different charsets within 
the same file, maily ISO-8859-1, UTF8 and UTF16.

What the community think about that?


> Create a general raw string parser
> --
>
> Key: TIKA-1483
> URL: https://issues.apache.org/jira/browse/TIKA-1483
> Project: Tika
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 1.6
>Reporter: Luis Filipe Nassif
>
> I think it can be very useful adding a general parser able to extract raw 
> strings from files (like the strings command), which can be used as the 
> fallback parser for all mimetypes not having a specific parser 
> implementation, like application/octet-stream. It can also be used as a 
> fallback for corrupt files throwing a TikaException.
> It must be configured with the script/language to be extracted from the files 
> (currently I implemented one specific for Latin1).
> It can use heuristics to extract strings encoded with different charsets 
> within the same file, mainly the common ISO-8859-1, UTF8 and UTF16.
> What does the community think about that?





[jira] [Commented] (TIKA-1482) ForkParser throws exceptions when process some large pdf files

2014-11-19 Thread Luis Filipe Nassif (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217691#comment-14217691
 ] 

Luis Filipe Nassif commented on TIKA-1482:
--

You can increase the forked JVM max heap size, if there is RAM available.

> ForkParser throws exceptions when process some large pdf files
> --
>
> Key: TIKA-1482
> URL: https://issues.apache.org/jira/browse/TIKA-1482
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.6
> Environment: Windows 7_x64 / JDK 1.7.0_17
>Reporter: Sean Zhao
>Priority: Critical
> Fix For: 1.6
>
> Attachments: SRCH-13412.pdf
>
>
> In Tika 1.6, ForkParser throws org.apache.tika.exception.TikaException , 
> message:Unexpected error in forked server process, when parsing some large 
> pdf files.  While tika 1.3 won't.





Re: TIKA-1445 and having multiple Parsers (as many as needed) work on the same MediaType

2014-11-19 Thread David Meikle
Hi Guys,

> On 18 Nov 2014, at 16:52, Allison, Timothy B.  wrote:
> 
> Chris,
>  Thank you for moving this to the dev list.  This would be a fairly large 
> change, and the discussion is valuable.

Given the potential implications of the change, I am wondering if it is worth 
scheduling a Google Hangout / Conference Call / IRC session to chat through 
things once we have all had time to flesh our thoughts out?

I am happy to facilitate setting this up and documenting it (meeting notes), so 
we can include outputs on the list for further discussion and subsequent formal 
decision making with everyone involved.

Cheers,
Dave

[jira] [Commented] (TIKA-1445) Figure out how to add Image metadata extraction to Tesseract parser

2014-11-19 Thread Dave Meikle (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217685#comment-14217685
 ] 

Dave Meikle commented on TIKA-1445:
---

bq. Hey Guys, to be honest, the way I see that we solve the ServiceLoading 
problem is somehow to move away from it. Relying on the JVM to implicitly 
decide which parser to load based on ClassLoading is not scalable IMO. At 
worst, even capturing an ordered preference file that isn't ServiceLoading is 
1000x better IMO than relying on the JVM and the classpath. We need somehow to 
bring this logic into Tika (still thinking about how and will try to prototype 
something).

+1 - I think this is an example of something we will probably hit more and more as 
we further extend Tika, i.e. wanting multiple parsers to have an interest in 
and then parse content of the same mime type, and moving away from using the 
re-ordering approach seems like the only way to go here.

_ServiceLoading_ per se is not a problem, indeed this is a nice way to make it 
simple for external providers to be added, but I think we need to think about 
Parsers in a pipeline and allow users to customise the parsers that participate 
in the pipeline through positive exclusions via config.

The above is a big change, and I think if we went with something like this it 
would need to be a 2.x release of Tika.

I suspect the problem with clashing Metadata entries is not really there, as 
most parsers look for different keys, or in cases where they process common 
ones (e.g. title, size, description, etc.) they should hopefully be getting the 
same value anyway.  I think we could send the same Metadata object through 
the 'pipeline', adding any unique new value in for a key.
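
A small sketch of the "add any unique new value" rule, assuming a shared object is handed from parser to parser. SharedMetadata and addUnique() are illustrative names only, not Tika's Metadata API.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Toy metadata object passed along a parser pipeline; not Tika's Metadata. */
final class SharedMetadata {
    private final Map<String, Set<String>> values = new LinkedHashMap<>();

    /**
     * Adds the value under the key unless that exact value is already there.
     * Returns true if the value was new.
     */
    boolean addUnique(String key, String value) {
        Set<String> set = values.get(key);
        if (set == null) {
            set = new LinkedHashSet<>();
            values.put(key, set);
        }
        return set.add(value);
    }

    /** Returns a copy of the values for the key, in insertion order. */
    Set<String> get(String key) {
        Set<String> set = values.get(key);
        return set == null ? new LinkedHashSet<String>() : new LinkedHashSet<>(set);
    }
}
```

If two parsers report the same title, the second call is a no-op; if they disagree, both values are kept and the clash stays visible rather than being silently overwritten.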

Will join the party and try to flesh out thoughts on a branch.

bq. 3) It is a good idea to identify which parser produced each content with a 
 tag.

+1 - this will be really helpful.

> Figure out how to add Image metadata extraction to Tesseract parser
> ---
>
> Key: TIKA-1445
> URL: https://issues.apache.org/jira/browse/TIKA-1445
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Chris A. Mattmann
>Assignee: Chris A. Mattmann
> Fix For: 1.8
>
> Attachments: TIKA-1445.Mattmann.101214.patch.txt, 
> TIKA-1445.Palsulich.102614.patch, TIKA-1445_tallison_20141027.patch.txt, 
> TIKA-1445_tallison_v2_20141027.patch, TIKA-1445_tallison_v3_20141027.patch
>
>
> Now that Tesseract is the default image parser in Tika for many image types, 
> consider how to add back in the metadata extraction capabilities by the other 
> Image parsers.





[jira] [Updated] (TIKA-1358) Add support for newer iWork file formats

2014-11-19 Thread Fabian Lange (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Fabian Lange updated TIKA-1358:
---
Attachment: iwork13-testfiles-2014-11.zip

New iWork test files, re-saved with the latest iWork, which makes a zip out of 
them again.
They were opened, saved and closed without modification in Apple's iWork
applications as of November 2014. Versions of Pages, Numbers and Keynote
as follows:

* Numbers 3.5 (2109)
* Keynote 6.5 (2110)
* Pages 5.5.1 (2111)

> Add support for newer iWork file formats
> 
>
> Key: TIKA-1358
> URL: https://issues.apache.org/jira/browse/TIKA-1358
> Project: Tika
>  Issue Type: Wish
>  Components: parser
>Affects Versions: 1.5
>Reporter: Jelle Kastelein
>  Labels: newbie
> Attachments: iwork13-testdocs-zips.zip, iwork13-testfiles-2014-11.zip
>
>
> IWork 2013 uses a revised file format which replaces the xml files that hold 
> the content by .iwa files (a binary format). This file format is becoming 
> increasingly relevant as more and more people are using apple products. 
> However, it does not appear to work with the current IWorkPackageParser 
> (tested with several of the example .pages files one can get from the 
> iCloud). 





[jira] [Commented] (TIKA-1358) Add support for newer iWork file formats

2014-11-19 Thread Fabian Lange (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217612#comment-14217612
 ] 

Fabian Lange commented on TIKA-1358:


Yes. Again.
The file format supported right now is iWork08, which is a zip file with a 
certain structure.
Then there was the new format which used an Apple package, which almost 
everybody sees as a folder, but which some Apple apps convert to zip (like you 
mentioned Mail and Safari did). Now they still use the same folder structure 
but zip it up.

> Add support for newer iWork file formats
> 
>
> Key: TIKA-1358
> URL: https://issues.apache.org/jira/browse/TIKA-1358
> Project: Tika
>  Issue Type: Wish
>  Components: parser
>Affects Versions: 1.5
>Reporter: Jelle Kastelein
>  Labels: newbie
> Attachments: iwork13-testdocs-zips.zip
>
>
> IWork 2013 uses a revised file format which replaces the xml files that hold 
> the content by .iwa files (a binary format). This file format is becoming 
> increasingly relevant as more and more people are using apple products. 
> However, it does not appear to work with the current IWorkPackageParser 
> (tested with several of the example .pages files one can get from the 
> iCloud). 





[jira] [Commented] (TIKA-1482) ForkParser throws exceptions when process some large pdf files

2014-11-19 Thread Nick Burch (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14217609#comment-14217609
 ] 

Nick Burch commented on TIKA-1482:
--

That looks like a pdfbox issue

Can you try again with a recent nightly build of Tika? There's a slightly newer 
version of PDFBox in there. If the problem still remains with the latest 
Tika+PDFBox, it'll need reporting upstream to the Apache PDFBox project

> ForkParser throws exceptions when process some large pdf files
> --
>
> Key: TIKA-1482
> URL: https://issues.apache.org/jira/browse/TIKA-1482
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.6
> Environment: Windows 7_x64 / JDK 1.7.0_17
>Reporter: Sean Zhao
>Priority: Critical
> Fix For: 1.6
>
> Attachments: SRCH-13412.pdf
>
>
> In Tika 1.6, ForkParser throws org.apache.tika.exception.TikaException , 
> message:Unexpected error in forked server process, when parsing some large 
> pdf files.  While tika 1.3 won't.


