[jira] [Commented] (TIKA-1481) TikaJAXRS get metadata calls give different results

2014-11-24 Thread Darya Arbuzova (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224167#comment-14224167
 ] 

Darya Arbuzova commented on TIKA-1481:
--

Thank you, Sergey!

I was trying to find a better place to ask my questions, but I can't send 
anything to u...@tika.apache.org (the address given on the mailing lists page: 
http://tika.apache.org/mail-lists.html), and I didn't come across any Q&A-like 
pages. What do you mean by «users list»?
Sorry for posting in the wrong place.

Best regards,
Darya Arbuzova

> TikaJAXRS get metadata calls give different results
> ---
>
> Key: TIKA-1481
> URL: https://issues.apache.org/jira/browse/TIKA-1481
> Project: Tika
>  Issue Type: Bug
>  Components: server
>Affects Versions: 1.6
> Environment: Windows 8, JDK 1.8
>Reporter: Darya Arbuzova
>Priority: Minor
> Attachments: sample.csv
>
>
> Hello!
> I'm trying to use Tika in server mode.
> I downloaded tika-server-1.6.jar from http://mirror.vorboss.net/apache/tika/.
> I have tried to get file metadata in 2 different ways (as explained here: 
> http://wiki.apache.org/tika/TikaJAXRS ):
> {{> curl -T sample.csv http://localhost:9998/meta --header "Content-Type: text/csv"}}
> {{"Content-Encoding","windows-1252"}}
> {{"Content-Type","text/plain; charset=windows-1252"}}
> and
> {{> curl -X PUT -d @sample.csv http://localhost:9998/meta --header "Content-Type: text/csv"}}
> {{"Content-Encoding","ISO-8859-1"}}
> {{"Content-Type","text/plain; charset=ISO-8859-1"}}
> Why do the two calls report different encodings when both go to the same 
> {{http://localhost:9998/meta}} endpoint?
> What other differences could appear, and which is the preferable way to 
> get metadata?
> Many thanks!
> Best regards,
> Darya Arbuzova
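
One plausible cause, offered here as an assumption and not something confirmed in this thread: `curl -T` uploads the file bytes unchanged, while `curl -d @file` is documented to strip carriage returns and newlines when it reads the body from a file (`--data-binary @sample.csv` sends it unchanged). The server would then see different bytes for the same file, which could lead the encoding detection to a different guess. The sketch below only mimics that stripping; the class and method names are hypothetical and the sample content is made up.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper that mimics how `curl -d @file` preprocesses a file body.
class CurlDataSketch {

    // Returns a copy of the input with carriage returns and newlines removed,
    // which is what curl documents for -d/--data when reading from a file.
    static byte[] stripLineBreaks(byte[] raw) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(raw.length);
        for (byte b : raw) {
            if (b != '\r' && b != '\n') {
                out.write(b);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Made-up CSV content standing in for sample.csv.
        byte[] raw = "id,name\r\n1,caf\u00e9\r\n".getBytes(StandardCharsets.ISO_8859_1);
        byte[] viaDashD = stripLineBreaks(raw);
        System.out.println("curl -T would send " + raw.length + " bytes");
        System.out.println("curl -d would send " + viaDashD.length + " bytes");
    }
}
```

If the two byte counts differ for the real sample.csv, comparing `-T` against `--data-binary` would be a fairer test of whether the /meta endpoint itself is inconsistent.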



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-11-24 Thread JIRA

[ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223469#comment-14223469
 ] 

Andreas Lehmkühler edited comment on TIKA-1442 at 11/24/14 9:04 PM:


[~talli...@apache.org] Yes, build 
[145|http://repository.apache.org/content/repositories/snapshots/org/apache/pdfbox/pdfbox-app/1.8.8-SNAPSHOT/pdfbox-app-1.8.8-20141124.203353-145.jar]
 of 1.8.8 should include the latest changes

{quote}
https://builds.apache.org/job/PDFBox-trunk/1471/, I guess...?
{quote}
That one results in the latest 2.0.0 SNAPSHOT build number 
[724|http://repository.apache.org/content/repositories/snapshots/org/apache/pdfbox/pdfbox-app/2.0.0-SNAPSHOT/pdfbox-app-2.0.0-20141124.203546-724.jar]


was (Author: lehmi):
Yes, build 
[145|http://repository.apache.org/content/repositories/snapshots/org/apache/pdfbox/pdfbox-app/1.8.8-SNAPSHOT/pdfbox-app-1.8.8-20141124.203353-145.jar]
 of 1.8.8 should include the latest changes

{quote}
https://builds.apache.org/job/PDFBox-trunk/1471/, I guess...?
{quote}
That one results in the latest 2.0.0 SNAPSHOT build number 
[724|http://repository.apache.org/content/repositories/snapshots/org/apache/pdfbox/pdfbox-app/2.0.0-SNAPSHOT/pdfbox-app-2.0.0-20141124.203546-724.jar]

> Upgrade to PDFBox 1.8.8
> ---
>
> Key: TIKA-1442
> URL: https://issues.apache.org/jira/browse/TIKA-1442
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Assignee: Tim Allison
> Fix For: 1.8
>
> Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip
>
>
> Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 
> 1.8.8 as soon as it is ready.  I'm tempted to call this a blocker on Tika 
> 1.7.  Let's use this issue to carry on the discussion of regression testing 
> (if any further discussion is necessary) or any other prep that needs to 
> happen before 1.8.8's release.





[jira] [Comment Edited] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-11-24 Thread JIRA

[ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223469#comment-14223469
 ] 

Andreas Lehmkühler edited comment on TIKA-1442 at 11/24/14 9:03 PM:


Yes, build 
[145|http://repository.apache.org/content/repositories/snapshots/org/apache/pdfbox/pdfbox-app/1.8.8-SNAPSHOT/pdfbox-app-1.8.8-20141124.203353-145.jar]
 of 1.8.8 should include the latest changes

{quote}
https://builds.apache.org/job/PDFBox-trunk/1471/, I guess...?
{quote}
That one results in the latest 2.0.0 SNAPSHOT build number 
[724|http://repository.apache.org/content/repositories/snapshots/org/apache/pdfbox/pdfbox-app/2.0.0-SNAPSHOT/pdfbox-app-2.0.0-20141124.203546-724.jar]


was (Author: lehmi):
Yes, build 145 should include the latest changes

> Upgrade to PDFBox 1.8.8
> ---
>
> Key: TIKA-1442
> URL: https://issues.apache.org/jira/browse/TIKA-1442
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Assignee: Tim Allison
> Fix For: 1.8
>
> Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip
>
>
> Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 
> 1.8.8 as soon as it is ready.  I'm tempted to call this a blocker on Tika 
> 1.7.  Let's use this issue to carry on the discussion of regression testing 
> (if any further discussion is necessary) or any other prep that needs to 
> happen before 1.8.8's release.





[jira] [Comment Edited] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-11-24 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223460#comment-14223460
 ] 

Tim Allison edited comment on TIKA-1442 at 11/24/14 8:39 PM:
-

Thank you for PDFBOX-2520!  I'd put my $ on Jenkins.  I'll kick this off 
tomorrow EDT.

[~lehmi], https://builds.apache.org/job/PDFBox-trunk/1471/, I guess...?

Is there an easy way to tell when Maven is picking up the correct version from 
my 1.8.8-SNAPSHOT definition?
Currently, I'm getting: pdfbox-1.8.8-20141124.203308-145.jar ?
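
A timestamped snapshot file name follows Maven's `artifactId-baseVersion-yyyyMMdd.HHmmss-buildNumber.jar` layout, so the trailing number is the build number that can be compared against the one announced for the repository. The following is a minimal sketch of pulling those parts out of a file name; the `SnapshotName` class is hypothetical and not part of Maven or Tika.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for reading Maven's timestamped snapshot file names.
class SnapshotName {
    // artifactId-baseVersion-yyyyMMdd.HHmmss-buildNumber.jar
    private static final Pattern TIMESTAMPED =
            Pattern.compile("(.+)-(\\d+(?:\\.\\d+)*)-(\\d{8}\\.\\d{6})-(\\d+)\\.jar");

    final String artifactId;
    final String baseVersion;
    final String timestamp;
    final int buildNumber;

    private SnapshotName(String artifactId, String baseVersion, String timestamp, int buildNumber) {
        this.artifactId = artifactId;
        this.baseVersion = baseVersion;
        this.timestamp = timestamp;
        this.buildNumber = buildNumber;
    }

    // Splits a timestamped snapshot jar name into its parts, or fails loudly.
    static SnapshotName parse(String fileName) {
        Matcher m = TIMESTAMPED.matcher(fileName);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not a timestamped snapshot name: " + fileName);
        }
        return new SnapshotName(m.group(1), m.group(2), m.group(3), Integer.parseInt(m.group(4)));
    }

    public static void main(String[] args) {
        SnapshotName s = parse("pdfbox-1.8.8-20141124.203308-145.jar");
        System.out.println(s.artifactId + " " + s.baseVersion + "-SNAPSHOT, build " + s.buildNumber);
    }
}
```

Running Maven with `-U` forces it to re-check remote repositories for newer snapshots, which is one way to rule out a stale local copy.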


was (Author: talli...@mitre.org):
Thank you for PDFBOX-2520!  I'd put my $ on Jenkins.  I'll kick this off 
tomorrow EDT.

[~lehmi], pdfbox-1.8.8-20141124.203308-145.jar ?

> Upgrade to PDFBox 1.8.8
> ---
>
> Key: TIKA-1442
> URL: https://issues.apache.org/jira/browse/TIKA-1442
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Assignee: Tim Allison
> Fix For: 1.8
>
> Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip
>
>
> Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 
> 1.8.8 as soon as it is ready.  I'm tempted to call this a blocker on Tika 
> 1.7.  Let's use this issue to carry on the discussion of regression testing 
> (if any further discussion is necessary) or any other prep that needs to 
> happen before 1.8.8's release.





[jira] [Commented] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-11-24 Thread JIRA

[ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223469#comment-14223469
 ] 

Andreas Lehmkühler commented on TIKA-1442:
--

Yes, build 145 should include the latest changes

> Upgrade to PDFBox 1.8.8
> ---
>
> Key: TIKA-1442
> URL: https://issues.apache.org/jira/browse/TIKA-1442
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Assignee: Tim Allison
> Fix For: 1.8
>
> Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip
>
>
> Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 
> 1.8.8 as soon as it is ready.  I'm tempted to call this a blocker on Tika 
> 1.7.  Let's use this issue to carry on the discussion of regression testing 
> (if any further discussion is necessary) or any other prep that needs to 
> happen before 1.8.8's release.





[jira] [Comment Edited] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-11-24 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223460#comment-14223460
 ] 

Tim Allison edited comment on TIKA-1442 at 11/24/14 8:34 PM:
-

Thank you for PDFBOX-2520!  I'd put my $ on Jenkins.  I'll kick this off 
tomorrow EDT.

[~lehmi], pdfbox-1.8.8-20141124.203308-145.jar ?


was (Author: talli...@mitre.org):
Thank you for PDFBOX-2520!  I'd put my $ on Jenkins.  I'll kick this off 
tomorrow EDT.

> Upgrade to PDFBox 1.8.8
> ---
>
> Key: TIKA-1442
> URL: https://issues.apache.org/jira/browse/TIKA-1442
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Assignee: Tim Allison
> Fix For: 1.8
>
> Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip
>
>
> Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 
> 1.8.8 as soon as it is ready.  I'm tempted to call this a blocker on Tika 
> 1.7.  Let's use this issue to carry on the discussion of regression testing 
> (if any further discussion is necessary) or any other prep that needs to 
> happen before 1.8.8's release.





[jira] [Commented] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-11-24 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223460#comment-14223460
 ] 

Tim Allison commented on TIKA-1442:
---

Thank you for PDFBOX-2520!  I'd put my $ on Jenkins.  I'll kick this off 
tomorrow EDT.

> Upgrade to PDFBox 1.8.8
> ---
>
> Key: TIKA-1442
> URL: https://issues.apache.org/jira/browse/TIKA-1442
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Assignee: Tim Allison
> Fix For: 1.8
>
> Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip
>
>
> Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 
> 1.8.8 as soon as it is ready.  I'm tempted to call this a blocker on Tika 
> 1.7.  Let's use this issue to carry on the discussion of regression testing 
> (if any further discussion is necessary) or any other prep that needs to 
> happen before 1.8.8's release.





[jira] [Comment Edited] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-11-24 Thread JIRA

[ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223430#comment-14223430
 ] 

Andreas Lehmkühler edited comment on TIKA-1442 at 11/24/14 8:09 PM:


[~talli...@apache.org] I guess PDFBox 1.8.8-SNAPSHOT is ready for another TIKA 
test run. I've added the changes I had in mind and fixed the issue you 
mentioned on the list as well, see PDFBOX-2520.

P.S.: I don't know who'll be faster, Jenkins or you. So you might have to wait 
until the latest binaries are available if you don't use your own compiled 
version ;-)


was (Author: lehmi):
[~talli...@apache.org] I guess PDFBox 1.8.8-SNAPSHOT is ready for another TIKA 
test run. I've added the changes I had in mind and fixed the issue you 
mentioned on the list as well, see PDFBOX-2520.

> Upgrade to PDFBox 1.8.8
> ---
>
> Key: TIKA-1442
> URL: https://issues.apache.org/jira/browse/TIKA-1442
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Assignee: Tim Allison
> Fix For: 1.8
>
> Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip
>
>
> Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 
> 1.8.8 as soon as it is ready.  I'm tempted to call this a blocker on Tika 
> 1.7.  Let's use this issue to carry on the discussion of regression testing 
> (if any further discussion is necessary) or any other prep that needs to 
> happen before 1.8.8's release.





[jira] [Commented] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-11-24 Thread JIRA

[ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223430#comment-14223430
 ] 

Andreas Lehmkühler commented on TIKA-1442:
--

[~talli...@apache.org] I guess PDFBox 1.8.8-SNAPSHOT is ready for another TIKA 
test run. I've added the changes I had in mind and fixed the issue you 
mentioned on the list as well, see PDRFBOX-2520.

> Upgrade to PDFBox 1.8.8
> ---
>
> Key: TIKA-1442
> URL: https://issues.apache.org/jira/browse/TIKA-1442
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Assignee: Tim Allison
> Fix For: 1.8
>
> Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip
>
>
> Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 
> 1.8.8 as soon as it is ready.  I'm tempted to call this a blocker on Tika 
> 1.7.  Let's use this issue to carry on the discussion of regression testing 
> (if any further discussion is necessary) or any other prep that needs to 
> happen before 1.8.8's release.





[jira] [Comment Edited] (TIKA-1442) Upgrade to PDFBox 1.8.8

2014-11-24 Thread JIRA

[ 
https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223430#comment-14223430
 ] 

Andreas Lehmkühler edited comment on TIKA-1442 at 11/24/14 8:02 PM:


[~talli...@apache.org] I guess PDFBox 1.8.8-SNAPSHOT is ready for another TIKA 
test run. I've added the changes I had in mind and fixed the issue you 
mentioned on the list as well, see PDFBOX-2520.


was (Author: lehmi):
[~talli...@apache.org] I guess PDFBox 1.8.8-SNAPSHOT is ready for another TIKA 
test run. I've added the changes I had in mind and fixed the issue you 
mentioned on the list as well, see PDRFBOX-2520.

> Upgrade to PDFBox 1.8.8
> ---
>
> Key: TIKA-1442
> URL: https://issues.apache.org/jira/browse/TIKA-1442
> Project: Tika
>  Issue Type: Improvement
>Reporter: Tim Allison
>Assignee: Tim Allison
> Fix For: 1.8
>
> Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, 
> pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip
>
>
> Given the regressions we identified in PDFBox 1.8.7, we should upgrade to 
> 1.8.8 as soon as it is ready.  I'm tempted to call this a blocker on Tika 
> 1.7.  Let's use this issue to carry on the discussion of regression testing 
> (if any further discussion is necessary) or any other prep that needs to 
> happen before 1.8.8's release.





Re: Subsets of tika parsers redux

2014-11-24 Thread Mattmann, Chris A (3980)
Hey Nick,

This sounds like a great plan to me, good job to you
and Sergey. As for helping I'll try my best, but I'm not
an OSGi guru :)

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++






-Original Message-
From: Nick Burch 
Reply-To: "dev@tika.apache.org" 
Date: Sunday, November 23, 2014 at 6:12 PM
To: "dev@tika.apache.org" 
Subject: Subsets of tika parsers redux

>Hi All
>
>During ApacheCon, I had a chance to chat with Sergey about the "subset of
>Tika Parsers" issue that bubbles up from time to time. It seemed to work
>well, and I think we both now have a better idea of the other's needs and
>concerns, which is good :)
>
>As is shown on our list from time to time, but more commonly elsewhere,
>we have some users who are confused already by the split between tika-core
>and tika-parsers. Anything that fragments further is going to cause more
>issues for that kind of user.
>
>On the other hand, there are potential users out there who want just a
>handful of parsers, in a simple and easy and small way, who don't know a
>lot about Tika yet, and aren't perhaps Maven or OSGi gurus. Many of those
>are using OSGi, but not all.
>
>One suggested solution is to just document what dependencies of
>tika-parsers can be excluded at the maven level to disable certain
>parsers + shrink the resulting dependency tree. However, that requires
>manual updates, manual checking, and like our examples on the website risk
>getting out of date without automated checking.
>
>Discussion then turned to our move to get all the examples for the
>website into svn, with unit tests, and having the website pull those from
>svn on the fly to always get the latest tested version.
>
>
>That led to an idea. Not sure if it'll work yet, but...
>
>What about having multiple Tika OSGi bundles? Continue with the "full"
>bundle as now, but also have ones for "pdf", "microsoft office", "images"
>etc. OSGi users (eg CXF users) could then opt to depend on pdf+image if
>they only wanted a handful of parsers, or the full one as now.
>
>The smart bit - we have unit tests for these smaller bundles. These unit
>tests ensure that the desired parsers still work on their smaller bundle.
>These unit tests also ensure that unwanted parsers don't work, thus
>flagging up if extra dependencies have snuck through.
>
>Finally, we pull out the includes/excludes information that went into the
>bundle, and display that for non-OSGi users. A non-OSGi person wanting
>"tika with pdf only" could then look at what the tika-pdf-bundle does and
>doesn't use, and from that know what maven level dependencies to keep and
>which to exclude
>
>
>This new plan would mean having to tweak our build to support multiple
>bundles, and potentially tweaking our bundles so that you could load
>tika-pdf + tika-image and have those two play nicely together. It'd also
>need some new unit tests, and the work to figure out what to
>include/exclude for each of our handful of "common" cases. It should,
>however, deliver a way for OSGi and non-OSGi people to get just a subset
>if that's all they want.
>
>Can anyone see a flaw with this plan? Anyone see a better way? Anyone
>want to help? :)
>
>Nick



[jira] [Commented] (TIKA-1332) Create "eval" code

2014-11-24 Thread Tim Allison (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222947#comment-14222947
 ] 

Tim Allison commented on TIKA-1332:
---

In a personal communication, I asked [~sergey_beryozkin] for recommendations 
for handling static content in the JAX-RS framework.  For the UI component of 
the eval code -- how the user interacts with the results of the eval -- is 
there an easy equivalent in JAX-RS that lets the user browse a directory of 
files and click on the desired files to download them, as easily as one can 
with Jetty's ResourceHandler?

With permission, I'm posting/summarizing [~sergey_beryozkin]'s responses.  If 
anyone else has a recommendation for leveraging the JAX-RS framework for dynamic 
data while still using something as easy as Jetty's ResourceHandler for static 
content, please let us know.

Option 1: 
Handcode a JAX-RS handler that mimics Jetty's ResourceHandler
> That can be done easily enough with JAX-RS, though, if you'd like to explore
> this path; something like this, I guess:
>
{noformat}
 import java.io.File;
 import java.net.URI;
 import java.util.LinkedList;
 import java.util.List;
 import javax.ws.rs.GET;
 import javax.ws.rs.Path;
 import javax.ws.rs.PathParam;
 import javax.ws.rs.Produces;
 import javax.ws.rs.core.Context;
 import javax.ws.rs.core.Response;
 import javax.ws.rs.core.UriInfo;

 @Path("eval")
 public class TikaEvaluation {
   @Context
   private UriInfo ui;

   @GET
   @Path("list")
   @Produces("text/html")
   public Response getListOfResultURIs() {
     List<URI> uris = new LinkedList<URI>();
     // getResultFiles() is a helper left undefined in this sketch
     for (File f : getResultFiles()) {
       uris.add(ui.getAbsolutePathBuilder().path(f.getName()).build());
     }
     // the uris list now has a list of links to individual files
     // next we need to decide how to convert that to HTML
     // one option is to return the list as is and redirect that to
     // JSP, another option is to build a basic HTML string right here in
     // the method, another option is to register a MessageBodyWriter that
     // will convert the list into HTML
     // the individual links will be managed by getFile() method

     return Response.ok(uris).build();
   }

   @GET
   @Path("list/{name}")
   @Produces({"application/json", "multipart/mixed"})
   public Response getFile(@PathParam("name") String name) {
     ...
   }
 }
{noformat}
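
Of the conversion options named in the comments above, building a basic HTML string in the method is the simplest to illustrate. The sketch below is plain Java with no JAX-RS dependency; the `ResultListHtml` class and the example URIs are hypothetical, and it only shows the string-building step, not how the response would be wired up.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: turns a list of result URIs into a minimal HTML page.
class ResultListHtml {

    // Escapes the characters that are special in HTML text and attribute values.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    // Renders one <li><a href=...> entry per URI inside an unordered list.
    static String render(List<URI> uris) {
        StringBuilder html = new StringBuilder("<html><body><ul>\n");
        for (URI uri : uris) {
            String href = escape(uri.toString());
            html.append("<li><a href=\"").append(href).append("\">")
                .append(href).append("</a></li>\n");
        }
        return html.append("</ul></body></html>\n").toString();
    }

    public static void main(String[] args) {
        // Made-up result locations, for illustration only.
        List<URI> uris = Arrays.asList(
                URI.create("http://localhost:9998/eval/list/run1.json"),
                URI.create("http://localhost:9998/eval/list/run2.json"));
        System.out.print(render(uris));
    }
}
```

In the resource method, the returned string could be passed to `Response.ok(...)` in place of the raw list, which avoids needing a custom MessageBodyWriter for this one case.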

Option 2:
Run Jetty's ResourceHandler from the same embedded Jetty server that is hosting 
the JAX-RS code.
> This link would probably be the best one: [link| 
> https://git-wip-us.apache.org/repos/asf?p=cxf.git;a=blob_plain;f=distribution/src/main/release/samples/jax_rs/search/src/main/java/demo/jaxrs/search/server/Server.java;hb=HEAD]

> Tika JAX-RS server actually runs on top of Jetty right now too, but in
> this case we have a direct Jetty server setup.
>
> The server registers a CXF servlet and Jetty handlers too. The CXF servlet
> can also redirect to default handlers, such as a default handler for serving
> static content. This is not needed if the result files are accessible
> over a URI that does not overlap with the CXF servlet URI pattern.
> In fact, I wonder if a Tika JAX-RS style of registration may also do?
> If you register a CXF endpoint at /eval and the results are accessible
> over /results then it should work? Unless Jetty ContentHandler is not
> installed by default - then the linked-to code would def do :-)

> The only possible downside here is that, as far as consistent URI 
> space management is concerned, we'd have one part of it (the static 
> resources) controlled natively by Jetty and the rest by JAX-RS, so it 
> can be trickier to support searching the results or 
> enforcing common security rules (when/if needed).
> That said, maybe it is not a real concern; it can always be removed 
> in the future if needed.


Other options?


> Create "eval" code
> --
>
> Key: TIKA-1332
> URL: https://issues.apache.org/jira/browse/TIKA-1332
> Project: Tika
>  Issue Type: Sub-task
>  Components: cli, general, server
>Reporter: Tim Allison
>
> For this issue, we can start with code to gather statistics on each run (# of 
> exceptions per file type, most common exceptions per file type, number of 
> metadata items, total text extracted, etc).  We should also be able to 
> compare one run against another.  Going forward, there's plenty of room to 
> improve.





[jira] [Comment Edited] (TIKA-1473) Apache Tika is not working for .docx documents

2014-11-24 Thread Milan Zivkovic (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222927#comment-14222927
 ] 

Milan Zivkovic edited comment on TIKA-1473 at 11/24/14 11:56 AM:
-

Hi, 
Indeed I was using the FileInputStream, but if I wrap it with the 
TikaInputStream I get the same problem. 
{code}
public static void main( final String[] args ) throws IOException, TikaException {
    final String path = "path_to_file";
    final Metadata metadata = new Metadata();
    final InputStream is = TikaInputStream.get( Files.newInputStream( Paths.get( path ) ) );
    final String someText = TIKA.parseToString( is, metadata, MAX_CONTENT_LENGTH );
    System.out.println( someText );
}
{code}


was (Author: mzivkovic):
Hi, 
Indeed I was using the FileInputStream, but if I wrap it with the 
TikaInputStream I get the same problem. 
{code}
public static void main( final String[] args ) throws IOException, TikaException {
    final String path = "path_to_file";
    final Metadata metadata = new Metadata();
    InputStream is = TikaInputStream.get( Files.newInputStream( Paths.get( path ) ) );
    is = TikaInputStream.get( is );
    final String someText = TIKA.parseToString( is, metadata, MAX_CONTENT_LENGTH );
    System.out.println( someText );
}
{code}

> Apache Tika is not working for .docx documents 
> ---
>
> Key: TIKA-1473
> URL: https://issues.apache.org/jira/browse/TIKA-1473
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.5, 1.6
>Reporter: Franco Catto
>Priority: Blocker
>
> I am using Apache Tika 1.6 to read different document files. 
> It reads pdf and old-format doc files, but when I try to read a docx file, 
> it gives me the following exception:
> org.apache.tika.exception.TikaException: Failed to close temporary resources 
> at org.apache.tika.io.TemporaryResources.dispose(TemporaryResources.java:152) 
> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:127) 
> ...
> The resource can not be closed because it is still being used by the Java 
> Process, certainly the OOXML parser.





[jira] [Commented] (TIKA-1473) Apache Tika is not working for .docx documents

2014-11-24 Thread Milan Zivkovic (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222927#comment-14222927
 ] 

Milan Zivkovic commented on TIKA-1473:
--

Hi, 
Indeed I was using the FileInputStream, but if I wrap it with the 
TikaInputStream I get the same problem. 
{code}
public static void main( final String[] args ) throws IOException, TikaException {
    final String path = "path_to_file";
    final Metadata metadata = new Metadata();
    InputStream is = TikaInputStream.get( Files.newInputStream( Paths.get( path ) ) );
    is = TikaInputStream.get( is );
    final String someText = TIKA.parseToString( is, metadata, MAX_CONTENT_LENGTH );
    System.out.println( someText );
}
{code}

> Apache Tika is not working for .docx documents 
> ---
>
> Key: TIKA-1473
> URL: https://issues.apache.org/jira/browse/TIKA-1473
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.5, 1.6
>Reporter: Franco Catto
>Priority: Blocker
>
> I am using Apache Tika 1.6 to read different document files. 
> It reads pdf and old-format doc files, but when I try to read a docx file, 
> it gives me the following exception:
> org.apache.tika.exception.TikaException: Failed to close temporary resources 
> at org.apache.tika.io.TemporaryResources.dispose(TemporaryResources.java:152) 
> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:127) 
> ...
> The resource can not be closed because it is still being used by the Java 
> Process, certainly the OOXML parser.





Re: Subsets of tika parsers redux

2014-11-24 Thread Sergey Beryozkin

Hi Nick

Was good talking to you and thanks for initiating this thread.

It is an interesting idea, one that can lead to introducing 
finer-grained bundles but also providing a mechanism for the 
(auto-)generation of the import metadata required by each of the parser 
modules. Besides, introducing several smaller bundles that would group 
most popular formats is a good one on its own IMHO.


My doubt here is how many of those bundles we'd need to create and if it 
will make it easy for users to get a task like "Get a parser for the 
format A only, or parsers A and B formats only" done.


Are we talking about introducing a parser module per every supported 
format, and having tika-parsers depend on all of those modules, with 
every parser module becoming a bundle (a jar plus an entry in the 
manifest) ?


Thanks, Sergey


On 23/11/14 17:12, Nick Burch wrote:

Hi All

During ApacheCon, I had a chance to chat with Sergey about the "subset
of Tika Parsers" issue that bubbles up from time to time. It seemed to
work well, and I think we both now have a better idea of the other's
needs and concerns, which is good :)

As is shown on our list from time to time, but more commonly elsewhere,
we have some users who are confused already by the split between
tika-core and tika-parsers. Anything that fragments further is going to
cause more issues for that kind of user.

On the other hand, there are potential users out there who want just a
handful of parsers, in a simple and easy and small way, who don't know a
lot about Tika yet, and aren't perhaps Maven or OSGi gurus. Many of
those are using OSGi, but not all.

One suggested solution is to just document what dependencies of
tika-parsers can be excluded at the maven level to disable certain
parsers + shrink the resulting dependency tree. However, that requires
manual updates, manual checking, and like our examples on the website
risk getting out of date without automated checking.

Discussion then turned to our move to get all the examples for the
website into svn, with unit tests, and having the website pull those
from svn on the fly to always get the latest tested version.


That led to an idea. Not sure if it'll work yet, but...

What about having multiple Tika OSGi bundles? Continue with the "full"
bundle as now, but also have ones for "pdf", "microsoft office",
"images" etc. OSGi users (eg CXF users) could then opt to depend on
pdf+image if they only wanted a handful of parsers, or the full one as now.

The smart bit - we have unit tests for these smaller bundles. These unit
tests ensure that the desired parsers still work on their smaller
bundle. These unit tests also ensure that unwanted parsers don't work,
thus flagging up if extra dependencies have snuck through.

Finally, we pull out the includes/excludes information that went into
the bundle, and display that for non-OSGi users. A non-OSGi person
wanting "tika with pdf only" could then look at what the tika-pdf-bundle
does and doesn't use, and from that know what maven level dependencies
to keep and which to exclude


This new plan would mean having to tweak our build to support multiple
bundles, and potentially tweaking our bundles so that you could load
tika-pdf + tika-image and have those two play nicely together. It'd also
need some new unit tests, and the work to figure out what to
include/exclude for each of our handful of "common" cases. It should,
however, deliver a way for OSGi and non-OSGi people to get just a subset
if that's all they want.

Can anyone see a flaw with this plan? Anyone see a better way? Anyone
want to help? :)

Nick
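
The unit-test idea in Nick's message (wanted parsers work, unwanted ones do not) could be approximated at the classpath level with a check like the one below. This is a sketch under stated assumptions: the `BundleClasspathCheck` class is hypothetical, and treating `org.apache.pdfbox.pdmodel.PDDocument` and `org.apache.poi.hwpf.HWPFDocument` as markers for the PDF and Microsoft Office dependencies is an illustrative choice, not how any Tika bundle is tested today.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical check for a trimmed-down bundle: wanted classes must load,
// unwanted ones must not be reachable from the classpath.
class BundleClasspathCheck {

    // True if the named class can be found without initializing it.
    static boolean isLoadable(String className) {
        try {
            Class.forName(className, false, BundleClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    // Returns the unwanted classes that are nevertheless loadable.
    static List<String> leaked(List<String> unwanted) {
        List<String> found = new ArrayList<String>();
        for (String name : unwanted) {
            if (isLoadable(name)) {
                found.add(name);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Assumed marker classes for a "pdf only" subset.
        String wanted = "org.apache.pdfbox.pdmodel.PDDocument";
        List<String> unwanted = Arrays.asList("org.apache.poi.hwpf.HWPFDocument");
        System.out.println("wanted present: " + isLoadable(wanted));
        System.out.println("leaked dependencies: " + leaked(unwanted));
    }
}
```

A real test would more likely ask Tika to parse a sample file of each type, but a classpath probe like this is cheap and would flag extra dependencies early.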




[jira] [Resolved] (TIKA-672) Proper error handling in the CHM parser

2014-11-24 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen resolved TIKA-672.
---
Resolution: Fixed

Checked: no more System.err/System.out calls inside the CHM parser.

> Proper error handling in the CHM parser
> ---
>
> Key: TIKA-672
> URL: https://issues.apache.org/jira/browse/TIKA-672
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Jukka Zitting
>Priority: Minor
> Fix For: 1.7
>
>
> The new CHM parser (TIKA-245) swallows exceptions and uses System.err and 
> System.out prints to report problems in many places. We should change that to 
> properly throw exceptions as follows:
> - IOExceptions when the document stream can not be read
> - TikaExceptions when the stream can not be parsed
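
The pattern the issue asks for can be sketched as follows: read failures propagate as IOException, and content that cannot be parsed raises a dedicated exception instead of being printed and swallowed. This is an illustration only. `ParseFailedException` is a local stand-in for Tika's TikaException so the sketch compiles without Tika, and `ChmHeaderCheck` is hypothetical; the real CHM parser's classes and checks differ. It relies on CHM files starting with the ASCII signature "ITSF".

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Stand-in for org.apache.tika.exception.TikaException (assumption: used only
// so this sketch is self-contained).
class ParseFailedException extends Exception {
    ParseFailedException(String message) {
        super(message);
    }
}

// Hypothetical header check showing the throw-instead-of-print pattern.
class ChmHeaderCheck {

    // Throws IOException if the stream cannot be read, and
    // ParseFailedException if the bytes are not a CHM header.
    static void check(InputStream stream) throws IOException, ParseFailedException {
        byte[] sig = new byte[4];
        int total = 0;
        while (total < sig.length) {
            int n = stream.read(sig, total, sig.length - total);
            if (n < 0) {
                break; // end of stream before the signature was complete
            }
            total += n;
        }
        if (total < 4 || sig[0] != 'I' || sig[1] != 'T' || sig[2] != 'S' || sig[3] != 'F') {
            throw new ParseFailedException("Not a CHM stream: missing ITSF signature");
        }
    }

    public static void main(String[] args) throws IOException {
        try {
            check(new ByteArrayInputStream("nope".getBytes(StandardCharsets.US_ASCII)));
        } catch (ParseFailedException e) {
            System.out.println("caller sees: " + e.getMessage());
        }
    }
}
```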





[jira] [Updated] (TIKA-672) Proper error handling in the CHM parser

2014-11-24 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-672:
--
Fix Version/s: 1.7

> Proper error handling in the CHM parser
> ---
>
> Key: TIKA-672
> URL: https://issues.apache.org/jira/browse/TIKA-672
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Jukka Zitting
>Priority: Minor
> Fix For: 1.7
>
>
> The new CHM parser (TIKA-245) swallows exceptions and uses System.err and 
> System.out prints to report problems in many places. We should change that to 
> properly throw exceptions as follows:
> - IOExceptions when the document stream can not be read
> - TikaExceptions when the stream can not be parsed





[jira] [Updated] (TIKA-1448) CHM parser : defect in file extraction

2014-11-24 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1448:
---
Fix Version/s: 1.7

> CHM parser : defect in file extraction
> --
>
> Key: TIKA-1448
> URL: https://issues.apache.org/jira/browse/TIKA-1448
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.7
>Reporter: Bin Hawking
> Fix For: 1.7
>
>
> In the ChmBlockInfo class:
>     chmBlockInfo.setIniBlock(
>             (chmBlockInfo.startBlock - chmBlockInfo.startBlock)
>                     % (int) clcd.getResetInterval());
> always sets 0.
> According to the LZX algorithm, it should be:
>     chmBlockInfo.setIniBlock(
>             chmBlockInfo.startBlock
>                     - chmBlockInfo.startBlock % (int) clcd.getResetInterval());
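
To make the arithmetic in the report concrete, here is a small sketch comparing
the two expressions. The method names and the reset interval of 2 are
illustrative assumptions, not values taken from ChmBlockInfo or a real CHM file.

```java
final class IniBlockSketch {

    /** Reported form: the parenthesised difference is always 0, so the result is 0. */
    static int reportedIniBlock(int startBlock, int resetInterval) {
        return (startBlock - startBlock) % resetInterval;
    }

    /**
     * Proposed form: % binds tighter than -, so this rounds startBlock
     * down to the nearest multiple of resetInterval.
     */
    static int proposedIniBlock(int startBlock, int resetInterval) {
        return startBlock - startBlock % resetInterval;
    }

    public static void main(String[] args) {
        int resetInterval = 2; // illustrative value only
        for (int startBlock = 0; startBlock < 6; startBlock++) {
            System.out.printf("startBlock=%d reported=%d proposed=%d%n",
                    startBlock,
                    reportedIniBlock(startBlock, resetInterval),
                    proposedIniBlock(startBlock, resetInterval));
        }
    }
}
```

With the proposed form, decoding would start at the last block on a reset
boundary at or before startBlock, rather than always at block 0.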





[jira] [Updated] (TIKA-1446) CHM parser : wrong decompression of aligned blocks

2014-11-24 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1446:
---
Fix Version/s: 1.7

> CHM parser : wrong decompression of aligned blocks
> --
>
> Key: TIKA-1446
> URL: https://issues.apache.org/jira/browse/TIKA-1446
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.7
>Reporter: Bin Hawking
>Priority: Critical
> Fix For: 1.7
>
> Attachments: chm.zip
>
>
> If an embedded file contains aligned blocks, the parser outputs garbled or 
> empty text for that file.
> I have fixed it myself by correcting decompressAlignedBlock() and its 
> preparation methods. The bug is mostly due to misuse of the main tree, align 
> tree, and length tree. Some tree construction is also wrong.





[jira] [Updated] (TIKA-1430) CHM parser gets faulty text (fix found)

2014-11-24 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1430:
---
Fix Version/s: 1.7

> CHM parser gets faulty text (fix found)
> ---
>
> Key: TIKA-1430
> URL: https://issues.apache.org/jira/browse/TIKA-1430
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.5, 1.6
> Environment: Windows 7; JDK 7 or 8
>Reporter: Bin Hawking
>Priority: Critical
> Fix For: 1.7
>
>
> Tika extracts partially wrong text from CHM files, including the test files 
> tika-parsers/src/test/resources/test-documents/testChm*.chm
> I tried 1.6 and 1.5; both have the same problem. I wonder why no one 
> complained before? 
> I checked the source code, and the cause is obvious:
> When Tika decompresses LZX data, the first block is decoded correctly, but 
> for the second and later blocks Tika uses the previous block's content as 
> the compressed data. 
> See org.apache.tika.parser.chm.lzx.ChmLzxBlock:
> """
> if (prevBlock != null
>         && prevBlock.getState().getBlockLength() > prevBlock
>                 .getState().getBlockRemaining())
>     setChmSection(new ChmSection(prevBlock.getContent()));
>     // NOTE: the dataSegment to be decompressed is not kept
> else
>     setChmSection(new ChmSection(dataSegment));
> """
> My fix:
> 1. Add a prevcontent member variable to the ChmSection class, so that both 
> dataSegment and prevBlock.getContent() are kept in it.
> 2. In ChmLzxBlock.extractContent(), when invoking decompressBlock(), 
> pass ChmSection.prevcontent if it exists, instead of ChmSection.data.
> With this change, the CHM files I tried produced correct-looking text. 
> BTW, the unit test should be tougher: with this bug, a small text (the 
> first block) is still decompressed correctly.
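
Step 1 of the fix above could look roughly like the sketch below. Section is a
hypothetical stand-in for ChmSection, the block-length condition from the
quoted snippet is simplified to a null check, and no LZX decoding is attempted;
the sketch only shows that the compressed input survives when both arrays are
kept.

```java
final class SectionSketch {

    /** Hypothetical stand-in for ChmSection: holds both byte arrays. */
    static final class Section {
        final byte[] data;        // compressed bytes that still need decoding
        final byte[] prevContent; // decoded output of the previous block, may be null

        Section(byte[] data, byte[] prevContent) {
            this.data = data;
            this.prevContent = prevContent;
        }
    }

    /** Simplified version of the quoted branch: the compressed input is dropped. */
    static Section asReported(byte[] dataSegment, byte[] prevOutput) {
        if (prevOutput != null) {
            return new Section(prevOutput, null);
        }
        return new Section(dataSegment, null);
    }

    /** Step 1 of the proposed fix: keep both arrays so neither is lost. */
    static Section asProposed(byte[] dataSegment, byte[] prevOutput) {
        return new Section(dataSegment, prevOutput);
    }

    public static void main(String[] args) {
        byte[] compressed = {1, 2, 3};
        byte[] previous = {9, 9};
        System.out.println("reported keeps compressed input: "
                + (asReported(compressed, previous).data == compressed));
        System.out.println("proposed keeps compressed input: "
                + (asProposed(compressed, previous).data == compressed));
    }
}
```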





[jira] [Resolved] (TIKA-1430) CHM parser gets faulty text (fix found)

2014-11-24 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen resolved TIKA-1430.

Resolution: Fixed

> CHM parser gets faulty text (fix found)
> ---
>
> Key: TIKA-1430
> URL: https://issues.apache.org/jira/browse/TIKA-1430
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.5, 1.6
> Environment: Windows 7; JDK 7 or 8
>Reporter: Bin Hawking
>Priority: Critical
> Fix For: 1.7
>
>
> Tika extracts partially wrong text from CHM files, including the test files 
> tika-parsers/src/test/resources/test-documents/testChm*.chm
> I tried 1.6 and 1.5; both have the same problem. I wonder why no one 
> complained before? 
> I checked the source code, and the cause is obvious:
> When Tika decompresses LZX data, the first block is decoded correctly, but 
> for the second and later blocks Tika uses the previous block's content as 
> the compressed data. 
> See org.apache.tika.parser.chm.lzx.ChmLzxBlock:
> """
> if (prevBlock != null
>         && prevBlock.getState().getBlockLength() > prevBlock
>                 .getState().getBlockRemaining())
>     setChmSection(new ChmSection(prevBlock.getContent()));
>     // NOTE: the dataSegment to be decompressed is not kept
> else
>     setChmSection(new ChmSection(dataSegment));
> """
> My fix:
> 1. Add a prevcontent member variable to the ChmSection class, so that both 
> dataSegment and prevBlock.getContent() are kept in it.
> 2. In ChmLzxBlock.extractContent(), when invoking decompressBlock(), 
> pass ChmSection.prevcontent if it exists, instead of ChmSection.data.
> With this change, the CHM files I tried produced correct-looking text. 
> BTW, the unit test should be tougher: with this bug, a small text (the 
> first block) is still decompressed correctly.





[jira] [Updated] (TIKA-1447) CHM parser: wrong directory list

2014-11-24 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen updated TIKA-1447:
---
Fix Version/s: 1.7

> CHM parser: wrong directory list
> 
>
> Key: TIKA-1447
> URL: https://issues.apache.org/jira/browse/TIKA-1447
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.7
>Reporter: Bin Hawking
>Priority: Critical
> Fix For: 1.7
>
>
> The CHM parser produces a wrong directory list for a CHM file (e.g. 
> testChm2.chm in tika-parser's test-resources):
> 1. Duplicate entries (mostly from PMGI chunks, which should have been 
> ignored).
> 2. Invalid entries (usually with unreadable entry names).
> 3. Missing entries (sometimes similar to TIKA-1176).
> I have fixed it (to some degree) by using the PMGL header to find directory 
> chunks and their respective meaningful parts.
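
The chunk selection described above might be sketched as below. This is not the
committed fix: it assumes the directory has already been split into chunks, and
the class and method names are invented for illustration. It only shows the
signature test that keeps listing chunks ("PMGL") and skips index chunks
("PMGI"); locating the meaningful part inside each chunk is left out.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class DirectoryChunkSketch {

    private static final byte[] PMGL = "PMGL".getBytes(StandardCharsets.US_ASCII);

    /** True if the chunk starts with the 4-byte "PMGL" signature. */
    static boolean isListingChunk(byte[] chunk) {
        return chunk.length >= PMGL.length
                && Arrays.equals(Arrays.copyOf(chunk, PMGL.length), PMGL);
    }

    /** Keeps only listing chunks; index (PMGI) and unknown chunks are skipped. */
    static List<byte[]> listingChunks(List<byte[]> chunks) {
        List<byte[]> result = new ArrayList<>();
        for (byte[] chunk : chunks) {
            if (isListingChunk(chunk)) {
                result.add(chunk);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<byte[]> chunks = new ArrayList<>();
        chunks.add("PMGL-entries".getBytes(StandardCharsets.US_ASCII));
        chunks.add("PMGI-index".getBytes(StandardCharsets.US_ASCII));
        chunks.add("PMGL-more".getBytes(StandardCharsets.US_ASCII));
        System.out.println("listing chunks kept: " + listingChunks(chunks).size());
    }
}
```

Reading entries only from the chunks that pass this test avoids the duplicates
that come from treating index chunks as if they held directory entries.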





[jira] [Resolved] (TIKA-1448) CHM parser : defect in file extraction

2014-11-24 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen resolved TIKA-1448.

Resolution: Fixed

> CHM parser : defect in file extraction
> --
>
> Key: TIKA-1448
> URL: https://issues.apache.org/jira/browse/TIKA-1448
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.7
>Reporter: Bin Hawking
> Fix For: 1.7
>
>
> In the ChmBlockInfo class:
>     chmBlockInfo.setIniBlock(
>             (chmBlockInfo.startBlock - chmBlockInfo.startBlock)
>                     % (int) clcd.getResetInterval());
> always sets 0.
> According to the LZX algorithm, it should be:
>     chmBlockInfo.setIniBlock(
>             chmBlockInfo.startBlock
>                     - chmBlockInfo.startBlock % (int) clcd.getResetInterval());





[jira] [Resolved] (TIKA-1447) CHM parser: wrong directory list

2014-11-24 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen resolved TIKA-1447.

Resolution: Fixed

> CHM parser: wrong directory list
> 
>
> Key: TIKA-1447
> URL: https://issues.apache.org/jira/browse/TIKA-1447
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.7
>Reporter: Bin Hawking
>Priority: Critical
>
> The CHM parser produces a wrong directory list for a CHM file (e.g. 
> testChm2.chm in tika-parser's test-resources):
> 1. Duplicate entries (mostly from PMGI chunks, which should have been 
> ignored).
> 2. Invalid entries (usually with unreadable entry names).
> 3. Missing entries (sometimes similar to TIKA-1176).
> I have fixed it (to some degree) by using the PMGL header to find directory 
> chunks and their respective meaningful parts.





[jira] [Resolved] (TIKA-1446) CHM parser : wrong decompression of aligned blocks

2014-11-24 Thread Hong-Thai Nguyen (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hong-Thai Nguyen resolved TIKA-1446.

Resolution: Fixed

> CHM parser : wrong decompression of aligned blocks
> --
>
> Key: TIKA-1446
> URL: https://issues.apache.org/jira/browse/TIKA-1446
> Project: Tika
>  Issue Type: Bug
>Affects Versions: 1.7
>Reporter: Bin Hawking
>Priority: Critical
> Attachments: chm.zip
>
>
> If an embedded file contains aligned blocks, the parser outputs garbled or 
> empty text for that file.
> I have fixed it myself by correcting decompressAlignedBlock() and its 
> preparation methods. The bug is mostly due to misuse of the main tree, align 
> tree, and length tree. Some tree construction is also wrong.


