[jira] [Commented] (TIKA-1481) TikaJAXRS get metadata calls give different results
[ https://issues.apache.org/jira/browse/TIKA-1481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14224167#comment-14224167 ] Darya Arbuzova commented on TIKA-1481: -- Thank you, Sergey! I was trying to find a better place to ask my questions, but I can't send anything to u...@tika.apache.org (listed on the mailing list page: http://tika.apache.org/mail-lists.html) and didn't come across any Q&A-like pages. What do you mean by «users list»? Sorry for posting in the wrong place. Best regards, Darya Arbuzova > TikaJAXRS get metadata calls give different results > --- > > Key: TIKA-1481 > URL: https://issues.apache.org/jira/browse/TIKA-1481 > Project: Tika > Issue Type: Bug > Components: server >Affects Versions: 1.6 > Environment: Windows 8, JDK 1.8 >Reporter: Darya Arbuzova >Priority: Minor > Attachments: sample.csv > > > Hello! > I'm trying to use Tika in server mode. > I downloaded tika-server-1.6.jar from http://mirror.vorboss.net/apache/tika/. > I have tried to get file metadata in 2 different ways (as explained here: > http://wiki.apache.org/tika/TikaJAXRS ): > {{> curl -T sample.csv http://localhost:9998/meta --header "Content-Type: text/csv"}} > {{"Content-Encoding","windows-1252"}} > {{"Content-Type","text/plain; charset=windows-1252"}} > and > {{> curl -X PUT -d @sample.csv http://localhost:9998/meta --header "Content-Type: text/csv"}} > {{"Content-Encoding","ISO-8859-1"}} > {{"Content-Type","text/plain; charset=ISO-8859-1"}} > How come they give different results in encoding if I call the same > {{http://localhost:9998/meta}}? > What other differences could appear, and which is the preferable way to > get metadata? > Many thanks! > Best regards, > Darya Arbuzova -- This message was sent by Atlassian JIRA (v6.3.4#6332)
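One possible explanation for the differing results reported in TIKA-1481, offered as an assumption and not confirmed anywhere in this thread: the two curl invocations do not send the same bytes. The curl manual documents that `-T` uploads the file as-is, while `-d @file` strips carriage returns and newlines from the file before sending it (`--data-binary @file` keeps them). Charset detection would then run on different input. The Java sketch below only illustrates that byte-level difference; the class and method names are hypothetical and the sample CSV text is made up.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; this is neither Tika nor curl code.
class CurlDataSketch {

    // Mimics what curl's -d @file is documented to do: drop CR and LF bytes.
    public static byte[] stripLineBreaks(byte[] body) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte b : body) {
            if (b != '\r' && b != '\n') {
                out.write(b);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Made-up stand-in for the contents of sample.csv.
        byte[] asUploaded = "id,name\r\n1,alpha\r\n".getBytes(StandardCharsets.US_ASCII);
        byte[] asPosted = stripLineBreaks(asUploaded);
        System.out.println("curl -T would send " + asUploaded.length + " bytes");
        System.out.println("curl -d @file would send " + asPosted.length + " bytes");
    }
}
```

If this is indeed the cause, sending the body with `--data-binary @sample.csv` should make the second call behave like the first.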
[jira] [Comment Edited] (TIKA-1442) Upgrade to PDFBox 1.8.8
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223469#comment-14223469 ] Andreas Lehmkühler edited comment on TIKA-1442 at 11/24/14 9:04 PM: [~talli...@apache.org] Yes, build [145|http://repository.apache.org/content/repositories/snapshots/org/apache/pdfbox/pdfbox-app/1.8.8-SNAPSHOT/pdfbox-app-1.8.8-20141124.203353-145.jar] of 1.8.8 should include the latest changes {quote} https://builds.apache.org/job/PDFBox-trunk/1471/, I guess...? {quote} That one results in the latest 2.0.0 SNAPSHOT build number [724|http://repository.apache.org/content/repositories/snapshots/org/apache/pdfbox/pdfbox-app/2.0.0-SNAPSHOT/pdfbox-app-2.0.0-20141124.203546-724.jar] was (Author: lehmi): Yes, build [145|http://repository.apache.org/content/repositories/snapshots/org/apache/pdfbox/pdfbox-app/1.8.8-SNAPSHOT/pdfbox-app-1.8.8-20141124.203353-145.jar] of 1.8.8 should include the latest changes {quote} https://builds.apache.org/job/PDFBox-trunk/1471/, I guess...? {quote} That one results in the latest 2.0.0 SNAPSHOT build number [724|http://repository.apache.org/content/repositories/snapshots/org/apache/pdfbox/pdfbox-app/2.0.0-SNAPSHOT/pdfbox-app-2.0.0-20141124.203546-724.jar] > Upgrade to PDFBox 1.8.8 > --- > > Key: TIKA-1442 > URL: https://issues.apache.org/jira/browse/TIKA-1442 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison >Assignee: Tim Allison > Fix For: 1.8 > > Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, > pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, > pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip > > > Given the regressions we identified in PDFBox 1.8.7, we should upgrade to > 1.8.8 as soon as it is ready. I'm tempted to call this a blocker on Tika > 1.7. Let's use this issue to carry on the discussion of regression testing > (if any further discussion is necessary) or any other prep that needs to > happen before 1.8.8's release. 
[jira] [Comment Edited] (TIKA-1442) Upgrade to PDFBox 1.8.8
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223469#comment-14223469 ] Andreas Lehmkühler edited comment on TIKA-1442 at 11/24/14 9:03 PM: Yes, build [145|http://repository.apache.org/content/repositories/snapshots/org/apache/pdfbox/pdfbox-app/1.8.8-SNAPSHOT/pdfbox-app-1.8.8-20141124.203353-145.jar] of 1.8.8 should include the latest changes {quote} https://builds.apache.org/job/PDFBox-trunk/1471/, I guess...? {quote} That one results in the latest 2.0.0 SNAPSHOT build number [724|http://repository.apache.org/content/repositories/snapshots/org/apache/pdfbox/pdfbox-app/2.0.0-SNAPSHOT/pdfbox-app-2.0.0-20141124.203546-724.jar] was (Author: lehmi): Yes, build 145 should include the latest changes > Upgrade to PDFBox 1.8.8 > --- > > Key: TIKA-1442 > URL: https://issues.apache.org/jira/browse/TIKA-1442 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison >Assignee: Tim Allison > Fix For: 1.8 > > Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, > pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, > pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip > > > Given the regressions we identified in PDFBox 1.8.7, we should upgrade to > 1.8.8 as soon as it is ready. I'm tempted to call this a blocker on Tika > 1.7. Let's use this issue to carry on the discussion of regression testing > (if any further discussion is necessary) or any other prep that needs to > happen before 1.8.8's release. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TIKA-1442) Upgrade to PDFBox 1.8.8
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223460#comment-14223460 ] Tim Allison edited comment on TIKA-1442 at 11/24/14 8:39 PM: - Thank you for PDFBOX-2520! I'd put my $ on Jenkins. I'll kick this off tomorrow EDT. [~lehmi], https://builds.apache.org/job/PDFBox-trunk/1471/, I guess...? Is there an easy way to tell when Maven is picking up the correct version from my 1.8.8-SNAPSHOT definition? Currently, I'm getting: pdfbox-1.8.8-20141124.203308-145.jar ? was (Author: talli...@mitre.org): Thank you for PDFBOX-2520! I'd put my $ on Jenkins. I'll kick this off tomorrow EDT. [~lehmi], pdfbox-1.8.8-20141124.203308-145.jar ? > Upgrade to PDFBox 1.8.8 > --- > > Key: TIKA-1442 > URL: https://issues.apache.org/jira/browse/TIKA-1442 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison >Assignee: Tim Allison > Fix For: 1.8 > > Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, > pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, > pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip > > > Given the regressions we identified in PDFBox 1.8.7, we should upgrade to > 1.8.8 as soon as it is ready. I'm tempted to call this a blocker on Tika > 1.7. Let's use this issue to carry on the discussion of regression testing > (if any further discussion is necessary) or any other prep that needs to > happen before 1.8.8's release. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
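On the question of telling which snapshot Maven resolved: when a SNAPSHOT is deployed to a remote repository, Maven replaces `SNAPSHOT` in the file name with a deployment timestamp and a build number, as in the jar name quoted above. A minimal sketch, assuming that naming pattern, of pulling those parts out so the build number can be compared with the one reported on the PDFBox side. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of Maven, Tika or PDFBox.
class SnapshotNameSketch {

    // <artifact>-<version>-<yyyyMMdd.HHmmss>-<buildNumber>.jar
    private static final Pattern TIMESTAMPED =
            Pattern.compile("(.+)-(\\d{8}\\.\\d{6})-(\\d+)\\.jar");

    // Returns {artifact-and-version, timestamp, buildNumber}, or null if the
    // file name does not look like a timestamped snapshot.
    public static String[] parse(String fileName) {
        Matcher m = TIMESTAMPED.matcher(fileName);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] parts = parse("pdfbox-1.8.8-20141124.203308-145.jar");
        System.out.println("artifact/version: " + parts[0]);
        System.out.println("timestamp:        " + parts[1]);
        System.out.println("build number:     " + parts[2]);
    }
}
```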
[jira] [Commented] (TIKA-1442) Upgrade to PDFBox 1.8.8
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223469#comment-14223469 ] Andreas Lehmkühler commented on TIKA-1442: -- Yes, build 145 should include the latest changes > Upgrade to PDFBox 1.8.8 > --- > > Key: TIKA-1442 > URL: https://issues.apache.org/jira/browse/TIKA-1442 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison >Assignee: Tim Allison > Fix For: 1.8 > > Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, > pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, > pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip > > > Given the regressions we identified in PDFBox 1.8.7, we should upgrade to > 1.8.8 as soon as it is ready. I'm tempted to call this a blocker on Tika > 1.7. Let's use this issue to carry on the discussion of regression testing > (if any further discussion is necessary) or any other prep that needs to > happen before 1.8.8's release. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TIKA-1442) Upgrade to PDFBox 1.8.8
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223460#comment-14223460 ] Tim Allison edited comment on TIKA-1442 at 11/24/14 8:34 PM: - Thank you for PDFBOX-2520! I'd put my $ on Jenkins. I'll kick this off tomorrow EDT. [~lehmi], pdfbox-1.8.8-20141124.203308-145.jar ? was (Author: talli...@mitre.org): Thank you for PDFBOX-2520! I'd put my $ on Jenkins. I'll kick this off tomorrow EDT. > Upgrade to PDFBox 1.8.8 > --- > > Key: TIKA-1442 > URL: https://issues.apache.org/jira/browse/TIKA-1442 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison >Assignee: Tim Allison > Fix For: 1.8 > > Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, > pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, > pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip > > > Given the regressions we identified in PDFBox 1.8.7, we should upgrade to > 1.8.8 as soon as it is ready. I'm tempted to call this a blocker on Tika > 1.7. Let's use this issue to carry on the discussion of regression testing > (if any further discussion is necessary) or any other prep that needs to > happen before 1.8.8's release. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1442) Upgrade to PDFBox 1.8.8
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223460#comment-14223460 ] Tim Allison commented on TIKA-1442: --- Thank you for PDFBOX-2520! I'd put my $ on Jenkins. I'll kick this off tomorrow EDT. > Upgrade to PDFBox 1.8.8 > --- > > Key: TIKA-1442 > URL: https://issues.apache.org/jira/browse/TIKA-1442 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison >Assignee: Tim Allison > Fix For: 1.8 > > Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, > pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, > pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip > > > Given the regressions we identified in PDFBox 1.8.7, we should upgrade to > 1.8.8 as soon as it is ready. I'm tempted to call this a blocker on Tika > 1.7. Let's use this issue to carry on the discussion of regression testing > (if any further discussion is necessary) or any other prep that needs to > happen before 1.8.8's release. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TIKA-1442) Upgrade to PDFBox 1.8.8
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223430#comment-14223430 ] Andreas Lehmkühler edited comment on TIKA-1442 at 11/24/14 8:09 PM: [~talli...@apache.org] I guess PDFBox 1.8.8-SNAPSHOT is ready for another TIKA test run. I've added the changes I had in mind and fixed the issue you mentioned on the list as well, see PDFBOX-2520. P.S.: I don't know who'll be faster, Jenkins or you. So you might have to wait until the latest binaries are available if you don't use your own compiled version ;-) was (Author: lehmi): [~talli...@apache.org] I guess PDFBox 1.8.8-SNAPSHOT is ready for another TIKA test run. I've added the changes I had in mind and fixed the issue you mentioned on the list as well, see PDFBOX-2520. > Upgrade to PDFBox 1.8.8 > --- > > Key: TIKA-1442 > URL: https://issues.apache.org/jira/browse/TIKA-1442 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison >Assignee: Tim Allison > Fix For: 1.8 > > Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, > pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, > pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip > > > Given the regressions we identified in PDFBox 1.8.7, we should upgrade to > 1.8.8 as soon as it is ready. I'm tempted to call this a blocker on Tika > 1.7. Let's use this issue to carry on the discussion of regression testing > (if any further discussion is necessary) or any other prep that needs to > happen before 1.8.8's release. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1442) Upgrade to PDFBox 1.8.8
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223430#comment-14223430 ] Andreas Lehmkühler commented on TIKA-1442: -- [~talli...@apache.org] I guess PDFBox 1.8.8-SNAPSHOT is ready for another TIKA test run. I've added the changes I had in mind and fixed the issue you mentioned on the list as well, see PDRFBOX-2520. > Upgrade to PDFBox 1.8.8 > --- > > Key: TIKA-1442 > URL: https://issues.apache.org/jira/browse/TIKA-1442 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison >Assignee: Tim Allison > Fix For: 1.8 > > Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, > pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, > pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip > > > Given the regressions we identified in PDFBox 1.8.7, we should upgrade to > 1.8.8 as soon as it is ready. I'm tempted to call this a blocker on Tika > 1.7. Let's use this issue to carry on the discussion of regression testing > (if any further discussion is necessary) or any other prep that needs to > happen before 1.8.8's release. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (TIKA-1442) Upgrade to PDFBox 1.8.8
[ https://issues.apache.org/jira/browse/TIKA-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223430#comment-14223430 ] Andreas Lehmkühler edited comment on TIKA-1442 at 11/24/14 8:02 PM: [~talli...@apache.org] I guess PDFBox 1.8.8-SNAPSHOT is ready for another TIKA test run. I've added the changes I had in mind and fixed the issue you mentioned on the list as well, see PDFBOX-2520. was (Author: lehmi): [~talli...@apache.org] I guess PDFBox 1.8.8-SNAPSHOT is ready for another TIKA test run. I've added the changes I had in mind and fixed the issue you mentioned on the list as well, see PDRFBOX-2520. > Upgrade to PDFBox 1.8.8 > --- > > Key: TIKA-1442 > URL: https://issues.apache.org/jira/browse/TIKA-1442 > Project: Tika > Issue Type: Improvement >Reporter: Tim Allison >Assignee: Tim Allison > Fix For: 1.8 > > Attachments: pdfbox_1_8_6V1_8_8-SNAPSHOT.xlsx, > pdfbox_1_8_6V1_8_8-SNAPSHOTb.xlsx, pdfbox_1_8_6V1_8_8-SNAPSHOTc.xlsx, > pdfbox_1_8_6V1_8_8-SNAPSHOTc.zip > > > Given the regressions we identified in PDFBox 1.8.7, we should upgrade to > 1.8.8 as soon as it is ready. I'm tempted to call this a blocker on Tika > 1.7. Let's use this issue to carry on the discussion of regression testing > (if any further discussion is necessary) or any other prep that needs to > happen before 1.8.8's release. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Subsets of tika parsers redux
Hey Nick, This sounds like a great plan to me, good job to you and Sergey. As for helping I'll try my best, but I'm not an OSGI guru :) Cheers, Chris ++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop: 168-527 Email: chris.a.mattm...@nasa.gov WWW: http://sunset.usc.edu/~mattmann/ ++ Adjunct Associate Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++ -Original Message- From: Nick Burch Reply-To: "dev@tika.apache.org" Date: Sunday, November 23, 2014 at 6:12 PM To: "dev@tika.apache.org" Subject: Subsets of tika parsers redux >Hi All > >During ApacheCon, I had a chance to chat with Sergey about the "subset of >Tika Parsers" issue that bubbles up from time to time. It seemed to work >well, and I think we both now have a better idea of the other's needs and >concerns, which is good :) > >As is shown on our list from time to time, but more commonly elsewhere, >we >have some users who are confused already by the split between tika-core >and tika-parsers. Anything that fragments further is going to cause more >issues for that kind of user. > >On the other hand, there are potential users out there who want just a >handful of parsers, in a simple and easy and small way, who don't know a >lot about Tika yet, and aren't perhaps Maven or OSGi gurus. Many of those >are using OSGi, but not all. > >One suggested solution is to just document what dependencies of >tika-parsers can be excluded at the maven level to disable certain >parsers >+ shrink the resulting dependency tree. However, that requires manual >updates, manual checking, and like our examples on the website risk >getting out of date without automated checking.
> >Discussion then turned to our move to get all the examples for the >website >into svn, with unit tests, and having the website pull those from svn on >the fly to always get the latest tested version. > > >That led to an idea. Not sure if it'll work yet, but... > >What about having multiple Tika OSGi bundles? Continue with the "full" >bundle as now, but also have ones for "pdf", "microsoft office", "images" >etc. OSGi users (eg CXF users) could then opt to depend on pdf+image if >they only wanted a handful of parsers, or the full one as now. > >The smart bit - we have unit tests for these smaller bundles. These unit >tests ensure that the desired parsers still work on their smaller bundle. >These unit tests also ensure that unwanted parsers don't work, thus >flagging up if extra dependencies have snuck through. > >Finally, we pull out the includes/excludes information that went into the >bundle, and display that for non-OSGi users. A non-OSGi person wanting >"tika with pdf only" could then look at what the tika-pdf-bundle does and >doesn't use, and from that know what maven level dependencies to keep and >which to exclude > > >This new plan would mean having to tweak our build to support multiple >bundles, and potentially tweaking our bundles so that you could load >tika-pdf + tika-image and have those two play nicely together. It'd also >need some new unit tests, and the work to figure out what to >include/exclude for each of our handful of "common" cases. It should, >however, deliver a way for OSGi and non-OSGi people to get just a subset >if that's all they want. > >Can anyone see a flaw with this plan? Anyone see a better way? Anyone >want >to help? :) > >Nick
[jira] [Commented] (TIKA-1332) Create "eval" code
[ https://issues.apache.org/jira/browse/TIKA-1332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222947#comment-14222947 ]

Tim Allison commented on TIKA-1332:
---

In a personal communication, I asked [~sergey_beryozkin] for recommendations for handling static content in the JAX-RS framework. For the UI component of the eval code (how the user interacts with the results of the eval), is there an easy equivalent in JAX-RS that lets the user browse a directory of files and click on desired files for download as easily as one can with Jetty's ResourceHandler?

With permission, I'm posting/summarizing [~sergey_beryozkin]'s responses. If anyone else has a recommendation that leverages the JAX-RS framework for dynamic data and still uses something as easy as Jetty's ResourceHandler for static content, please let us know.

Option 1: Handcode a JAX-RS handler that mimics Jetty's ResourceHandler

> That can be done easily enough with JAX-RS if you'd like to explore
> this path, something like this I guess:
>
{noformat}
import java.io.File;
import java.net.URI;
import java.util.LinkedList;
import java.util.List;

import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.Produces;
import javax.ws.rs.core.Context;
import javax.ws.rs.core.Response;
import javax.ws.rs.core.UriInfo;

@Path("eval")
public class TikaEvaluation {

    @Context
    private UriInfo ui;

    @GET
    @Path("list")
    @Produces("text/html")
    public Response getListOfResultURIs() {
        List<URI> uris = new LinkedList<URI>();
        // getResultFiles() is not shown in the original snippet
        for (File f : getResultFiles()) {
            uris.add(ui.getAbsolutePathBuilder().path(f.getName()).build());
        }
        // uris now holds a list of links to individual files
        // next we need to decide how to convert that to HTML
        // one option is to return the list as is and redirect that to
        // JSP, another option is to build a basic HTML string right here in the
        // method, another option is to register a MessageBodyWriter that will
        // convert the list into HTML

        // the individual links will be managed by getFile() method
        return Response.ok(uris).build();
    }

    @GET
    @Path("list/{name}")
    @Produces({"application/json", "multipart/mixed"})
    public Response getFile(@PathParam("name") String name) {
        ...
} {noformat} Option 2: Run Jetty's ResourceHandler from the same embedded Jetty server that is hosting the JAX-RS code. > This link would probably be the best one: [link| > https://git-wip-us.apache.org/repos/asf?p=cxf.git;a=blob_plain;f=distribution/src/main/release/samples/jax_rs/search/src/main/java/demo/jaxrs/search/server/Server.java;hb=HEAD] > Tika JAX-RS server actually runs on top of Jetty right now too, but in > this case we have a direct Jetty server setup. > > The server registers a CXF servlet and Jetty handlers too. CXF servlet > also redirect to default handlers like a default handler for serving the > static content. This is not needed if the result files are accessible > over URI that does not overlap with a CXF servlet URI pattern. > In fact, I wonder if a Tika JAXRS style of the registration may also do > ? If you register a CXF endpoint at /eval and the results are accessible > over /results then it should work ? Unless Jetty ContentHandler is not > installed by default - then the linked to code would def do :-) > the only possible downside here is that as far as the consistent URI > space management is concerned we'd have one part of it (the static > resources) controlled natively by Jetty and the rest - by JAX-RS. so it > can be trickier to provide a support for searching the results, > enforcing the common security rules (when/if needed). > That said may be it is not of a real concern, it can always be removed > in the future if needed. Other options? > Create "eval" code > -- > > Key: TIKA-1332 > URL: https://issues.apache.org/jira/browse/TIKA-1332 > Project: Tika > Issue Type: Sub-task > Components: cli, general, server >Reporter: Tim Allison > > For this issue, we can start with code to gather statistics on each run (# of > exceptions per file type, most common exceptions per file type, number of > metadata items, total text extracted, etc). We should also be able to > compare one run against another. 
Going forward, there's plenty of room to > improve. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
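Of the three HTML options mentioned in the Option 1 code comments, the "build a basic HTML string right here in the method" one can be sketched with the JDK alone. This is a minimal sketch under stated assumptions, not code from the thread: the class name, the method name and the example base URI are hypothetical, and file names are assumed to be URI-safe and free of HTML special characters.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; it does not depend on JAX-RS, CXF or Jetty.
class ResultListingSketch {

    // Builds a bare HTML list with one link per result file.
    // Assumes names need neither URI encoding nor HTML escaping.
    public static String toHtml(URI base, List<String> fileNames) {
        StringBuilder html = new StringBuilder("<ul>\n");
        for (String name : fileNames) {
            URI link = base.resolve(name);
            html.append("<li><a href=\"").append(link).append("\">")
                .append(name).append("</a></li>\n");
        }
        return html.append("</ul>").toString();
    }

    public static void main(String[] args) {
        // The base URI is an assumed example; note the trailing slash.
        URI base = URI.create("http://localhost:9998/eval/list/");
        System.out.println(toHtml(base, Arrays.asList("run1.json", "run2.json")));
    }
}
```

In a resource method, the returned string could be passed to `Response.ok(...)` in place of the raw list, which would avoid the need for a custom MessageBodyWriter.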
[jira] [Comment Edited] (TIKA-1473) Apache Tika is not working for .docx documents
[ https://issues.apache.org/jira/browse/TIKA-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222927#comment-14222927 ] Milan Zivkovic edited comment on TIKA-1473 at 11/24/14 11:56 AM: - Hi, Indeed I was using the FileInputStream, but if I wrap it with the TikaInputStream I get the same problem. {code} public static void main( final String[] args ) throws IOException, TikaException { final String path = "path_to_file"; final Metadata metadata = new Metadata(); final InputStream is = TikaInputStream.get( Files.newInputStream( Paths.get( path ) ) ); final String someText = TIKA.parseToString( is, metadata, MAX_CONTENT_LENGTH ); System.out.println( someText ); } {code} was (Author: mzivkovic): Hi, Indeed I was using the FileInputStream, but if I wrap it with the TikaInputStream I get the same problem. {code} public static void main( final String[] args ) throws IOException, TikaException { final String path = "path_to_file"; final Metadata metadata = new Metadata(); InputStream is = TikaInputStream.get( Files.newInputStream( Paths.get( path ) ) ); is = TikaInputStream.get( is ); final String someText = TIKA.parseToString( is, metadata, MAX_CONTENT_LENGTH ); System.out.println( someText ); } {code} > Apache Tika is not working for .docx documents > --- > > Key: TIKA-1473 > URL: https://issues.apache.org/jira/browse/TIKA-1473 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.5, 1.6 >Reporter: Franco Catto >Priority: Blocker > > I am using Apache Tika 1.6 to read different document files. > It is reading pdf and old format doc files but when I try to read docx file, > it gives me following exception: > org.apache.tika.exception.TikaException: Failed to close temporary resources > at org.apache.tika.io.TemporaryResources.dispose(TemporaryResources.java:152) > at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:127) > ... 
> The resource can not be closed because it is still being used by the Java > Process, certainly the OOXML parser. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1473) Apache Tika is not working for .docx documents
[ https://issues.apache.org/jira/browse/TIKA-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14222927#comment-14222927 ] Milan Zivkovic commented on TIKA-1473: -- Hi, Indeed I was using the FileInputStream, but if I wrap it with the TikaInputStream I get the same problem. {code} public static void main( final String[] args ) throws IOException, TikaException { final String path = "path_to_file"; final Metadata metadata = new Metadata(); InputStream is = TikaInputStream.get( Files.newInputStream( Paths.get( path ) ) ); is = TikaInputStream.get( is ); final String someText = TIKA.parseToString( is, metadata, MAX_CONTENT_LENGTH ); System.out.println( someText ); } {code} > Apache Tika is not working for .docx documents > --- > > Key: TIKA-1473 > URL: https://issues.apache.org/jira/browse/TIKA-1473 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.5, 1.6 >Reporter: Franco Catto >Priority: Blocker > > I am using Apache Tika 1.6 to read different document files. > It is reading pdf and old format doc files but when I try to read docx file, > it gives me following exception: > org.apache.tika.exception.TikaException: Failed to close temporary resources > at org.apache.tika.io.TemporaryResources.dispose(TemporaryResources.java:152) > at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:127) > ... > The resource can not be closed because it is still being used by the Java > Process, certainly the OOXML parser. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: Subsets of tika parsers redux
Hi Nick Was good talking to you and thanks for initiating this thread. It is an interesting idea, one that can lead to introducing finer-grained bundles but also providing a mechanism for the (auto-)generation of the import metadata required by each of the parser modules. Besides, introducing several smaller bundles that would group most popular formats is a good one on its own IMHO. My doubt here is how many of those bundles we'd need to create and if it will make it easy for users to get a task like "Get a parser for the format A only, or parsers A and B formats only" done. Are we talking about introducing a parser module per every supported format, and having tika-parsers depend on all of those modules, with every parser module becoming a bundle (a jar plus an entry in the manifest) ? Thanks, Sergey On 23/11/14 17:12, Nick Burch wrote: Hi All During ApacheCon, I had a chance to chat with Sergey about the "subset of Tika Parsers" issue that bubbles up from time to time. It seemed to work well, and I think we both now have a better idea of the other's needs and concerns, which is good :) As is shown on our list from time to time, but more commonly elsewhere, we have some users who are confused already by the split between tika-core and tika-parsers. Anything that fragments further is going to cause more issues for that kind of user. On the other hand, there are potential users out there who want just a handful of parsers, in a simple and easy and small way, who don't know a lot about Tika yet, and aren't perhaps Maven or OSGi gurus. Many of those are using OSGi, but not all. One suggested solution is to just document what dependencies of tika-parsers can be excluded at the maven level to disable certain parsers + shrink the resulting dependency tree. However, that requires manual updates, manual checking, and like our examples on the website risk getting out of date without automated checking. 
Discussion then turned to our move to get all the examples for the website into svn, with unit tests, and having the website pull those from svn on the fly to always get the latest tested version. That led to an idea. Not sure if it'll work yet, but... What about having multiple Tika OSGi bundles? Continue with the "full" bundle as now, but also have ones for "pdf", "microsoft office", "images" etc. OSGi users (eg CXF users) could then opt to depend on pdf+image if they only wanted a handful of parsers, or the full one as now. The smart bit - we have unit tests for these smaller bundles. These unit tests ensure that the desired parsers still work on their smaller bundle. These unit tests also ensure that unwanted parsers don't work, thus flagging up if extra dependencies have snuck through. Finally, we pull out the includes/excludes information that went into the bundle, and display that for non-OSGi users. A non-OSGi person wanting "tika with pdf only" could then look at what the tika-pdf-bundle does and doesn't use, and from that know what maven level dependencies to keep and which to exclude. This new plan would mean having to tweak our build to support multiple bundles, and potentially tweaking our bundles so that you could load tika-pdf + tika-image and have those two play nicely together. It'd also need some new unit tests, and the work to figure out what to include/exclude for each of our handful of "common" cases. It should, however, deliver a way for OSGi and non-OSGi people to get just a subset if that's all they want. Can anyone see a flaw with this plan? Anyone see a better way? Anyone want to help? :) Nick
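The "unit tests ensure that unwanted parsers don't work" part of the plan could, on a plain classpath, start from a check like the one below. It is a minimal sketch and an assumption about how such a test might look, not something proposed in the thread: the helper name is hypothetical, and inside an OSGi container class visibility depends on the bundle's imports, so a real bundle test would need to run within the framework.

```java
// Hypothetical helper for a subset-bundle test; not part of Tika.
class ParserPresenceSketch {

    // True if the named class can be found by this class's class loader.
    // The class is located but not initialized.
    public static boolean isLoadable(String className) {
        try {
            Class.forName(className, false, ParserPresenceSketch.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The result depends on what is on the classpath when this runs.
        String pdfParser = "org.apache.tika.parser.pdf.PDFParser";
        System.out.println(pdfParser + " loadable: " + isLoadable(pdfParser));
    }
}
```

A "pdf only" test would then assert that the PDF parser is loadable and that the parser classes and third-party libraries of excluded formats are not.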
[jira] [Resolved] (TIKA-672) Proper error handling in the CHM parser
[ https://issues.apache.org/jira/browse/TIKA-672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong-Thai Nguyen resolved TIKA-672. --- Resolution: Fixed Check no more System.err/System.out inside CHM parser > Proper error handling in the CHM parser > --- > > Key: TIKA-672 > URL: https://issues.apache.org/jira/browse/TIKA-672 > Project: Tika > Issue Type: Bug > Components: parser >Reporter: Jukka Zitting >Priority: Minor > Fix For: 1.7 > > > The new CHM parser (TIKA-245) swallows exceptions and uses System.err and > System.out prints to report problems in many places. We should change that to > properly throw exceptions as follows: > - IOExceptions when the document stream can not be read > - TikaExceptions when the stream can not be parsed -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-672) Proper error handling in the CHM parser
[ https://issues.apache.org/jira/browse/TIKA-672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong-Thai Nguyen updated TIKA-672: -- Fix Version/s: 1.7 > Proper error handling in the CHM parser > --- > > Key: TIKA-672 > URL: https://issues.apache.org/jira/browse/TIKA-672 > Project: Tika > Issue Type: Bug > Components: parser >Reporter: Jukka Zitting >Priority: Minor > Fix For: 1.7 > > > The new CHM parser (TIKA-245) swallows exceptions and uses System.err and > System.out prints to report problems in many places. We should change that to > properly throw exceptions as follows: > - IOExceptions when the document stream can not be read > - TikaExceptions when the stream can not be parsed -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1448) CHM parser : defect in file extraction
[ https://issues.apache.org/jira/browse/TIKA-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong-Thai Nguyen updated TIKA-1448: --- Fix Version/s: 1.7 > CHM parser : defect in file extraction > -- > > Key: TIKA-1448 > URL: https://issues.apache.org/jira/browse/TIKA-1448 > Project: Tika > Issue Type: Bug > Components: parser >Affects Versions: 1.7 >Reporter: Bin Hawking > Fix For: 1.7 > > > in ChmBlockInfo class: > chmBlockInfo > .setIniBlock((chmBlockInfo.startBlock - > chmBlockInfo.startBlock) > % (int) clcd.getResetInterval()); > always sets 0 > according to the lzx algorithm, should be > chmBlockInfo > .setIniBlock( chmBlockInfo.startBlock - > chmBlockInfo.startBlock > % (int) clcd.getResetInterval()); -- This message was sent by Atlassian JIRA (v6.3.4#6332)
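The difference between the two expressions quoted in TIKA-1448 comes down to parentheses and operator precedence. In Java, `%` binds tighter than binary `-`, so without the parentheses the remainder is taken first and then subtracted, which for non-negative values rounds the start block down to a multiple of the reset interval. A small sketch with plain `int` arguments standing in for `chmBlockInfo.startBlock` and `clcd.getResetInterval()` (the class and method names are hypothetical):

```java
// Hypothetical stand-alone illustration; not the Tika ChmBlockInfo class.
class IniBlockSketch {

    // The expression as currently written: (x - x) is always 0.
    public static int current(int startBlock, int resetInterval) {
        return (startBlock - startBlock) % resetInterval;
    }

    // The expression proposed in the issue: startBlock minus its remainder.
    public static int proposed(int startBlock, int resetInterval) {
        return startBlock - startBlock % resetInterval;
    }

    public static void main(String[] args) {
        int startBlock = 5;
        int resetInterval = 2;
        System.out.println("current:  " + current(startBlock, resetInterval));
        System.out.println("proposed: " + proposed(startBlock, resetInterval));
    }
}
```

With a start block of 5 and a reset interval of 2, the current form yields 0 and the proposed form yields 4.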
[jira] [Updated] (TIKA-1446) CHM parser : wrong decompression of aligned blocks
[ https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong-Thai Nguyen updated TIKA-1446: --- Fix Version/s: 1.7 > CHM parser : wrong decompression of aligned blocks > -- > > Key: TIKA-1446 > URL: https://issues.apache.org/jira/browse/TIKA-1446 > Project: Tika > Issue Type: Bug > Affects Versions: 1.7 > Reporter: Bin Hawking > Priority: Critical > Fix For: 1.7 > > Attachments: chm.zip > > > If an embedded file contains aligned blocks, the parser outputs garbled or empty text for that file. > I have fixed it myself by correcting decompressAlignedBlock() and its preparation methods. The bug is mostly due to misuse of the main tree, align tree, and length tree, and some of the trees are built incorrectly.
[jira] [Updated] (TIKA-1430) CHM parser gets faulty text (fix found)
[ https://issues.apache.org/jira/browse/TIKA-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong-Thai Nguyen updated TIKA-1430: --- Fix Version/s: 1.7 > CHM parser gets faulty text (fix found) > --- > > Key: TIKA-1430 > URL: https://issues.apache.org/jira/browse/TIKA-1430 > Project: Tika > Issue Type: Bug > Components: parser > Affects Versions: 1.5, 1.6 > Environment: Windows 7; JDK 7 or 8 > Reporter: Bin Hawking > Priority: Critical > Fix For: 1.7 > > > The text extracted from a CHM file is partially wrong, including for the CHM files in > tika-parsers/src/test/resources/test-documents/testChm*.chm > I tried 1.6 and 1.5; both show the same problem. I wonder why no one has complained before. > I checked the source code, and the cause is clear: > when Tika decompresses the LZX data, the first block is handled correctly, but for the 2nd and later blocks Tika uses the previous block's content as the compressed data. > See org.apache.tika.parser.chm.lzx.ChmLzxBlock: > """ > if (prevBlock != null && prevBlock.getState().getBlockLength() > prevBlock.getState().getBlockRemaining()) > setChmSection(new ChmSection(prevBlock.getContent())); > // NOTE: the dataSegment to be decompressed is not kept > else > setChmSection(new ChmSection(dataSegment)); > """ > My fix: > 1. Add a prevcontent member variable to the ChmSection class, so that both dataSegment and prevBlock.getContent() are kept in it. > 2. In ChmLzxBlock.extractContent(), when invoking decompressBlock(), pass ChmSection.prevcontent if it exists, instead of ChmSection.data. > With this fix, I tried some CHM files and got correct-looking text. > BTW, the unit test should be tougher, since small texts (within the first block) are decompressed correctly even with this bug.
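Step 1 of the proposed fix, keeping both the compressed segment and the previous block's content, can be illustrated with a simplified model. Everything here is hypothetical: `Section` is a stand-in for ChmSection, the block-length/block-remaining condition from the quoted code is reduced to a null check, and no actual LZX decompression is performed.

```java
class PrevContentSketch {

    /** Simplified stand-in for ChmSection (not the real Tika class). */
    static class Section {
        final byte[] data;         // compressed segment that still has to be decompressed
        final byte[] prevContent;  // previous block's decompressed content, or null for the first block

        Section(byte[] data, byte[] prevContent) {
            this.data = data;
            this.prevContent = prevContent;
        }
    }

    /** Models the reported bug: for later blocks the compressed segment is replaced and lost. */
    static Section buggySection(byte[] dataSegment, byte[] prevContent) {
        if (prevContent != null) {
            return new Section(prevContent, null); // dataSegment is not kept
        }
        return new Section(dataSegment, null);
    }

    /** Models the proposed fix: the compressed segment and the previous content are both kept. */
    static Section fixedSection(byte[] dataSegment, byte[] prevContent) {
        return new Section(dataSegment, prevContent);
    }

    public static void main(String[] args) {
        byte[] compressed = {1, 2, 3};
        byte[] previous = {9, 9};

        Section buggy = buggySection(compressed, previous);
        Section fixed = fixedSection(compressed, previous);

        System.out.println(buggy.data == previous);        // prints true: wrong input for decompression
        System.out.println(fixed.data == compressed);      // prints true: compressed data preserved
        System.out.println(fixed.prevContent == previous); // prints true: previous content still available
    }
}
```

In this model the first block (no previous content) behaves identically in both variants, which matches the report that only the 2nd and later blocks come out wrong.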
[jira] [Resolved] (TIKA-1430) CHM parser gets faulty text (fix found)
[ https://issues.apache.org/jira/browse/TIKA-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong-Thai Nguyen resolved TIKA-1430. Resolution: Fixed > CHM parser gets faulty text (fix found) > --- > > Key: TIKA-1430 > URL: https://issues.apache.org/jira/browse/TIKA-1430 > Project: Tika > Issue Type: Bug > Components: parser > Affects Versions: 1.5, 1.6 > Environment: Windows 7; JDK 7 or 8 > Reporter: Bin Hawking > Priority: Critical > Fix For: 1.7 > > > The text extracted from a CHM file is partially wrong, including for the CHM files in > tika-parsers/src/test/resources/test-documents/testChm*.chm > I tried 1.6 and 1.5; both show the same problem. I wonder why no one has complained before. > I checked the source code, and the cause is clear: > when Tika decompresses the LZX data, the first block is handled correctly, but for the 2nd and later blocks Tika uses the previous block's content as the compressed data. > See org.apache.tika.parser.chm.lzx.ChmLzxBlock: > """ > if (prevBlock != null && prevBlock.getState().getBlockLength() > prevBlock.getState().getBlockRemaining()) > setChmSection(new ChmSection(prevBlock.getContent())); > // NOTE: the dataSegment to be decompressed is not kept > else > setChmSection(new ChmSection(dataSegment)); > """ > My fix: > 1. Add a prevcontent member variable to the ChmSection class, so that both dataSegment and prevBlock.getContent() are kept in it. > 2. In ChmLzxBlock.extractContent(), when invoking decompressBlock(), pass ChmSection.prevcontent if it exists, instead of ChmSection.data. > With this fix, I tried some CHM files and got correct-looking text. > BTW, the unit test should be tougher, since small texts (within the first block) are decompressed correctly even with this bug.
[jira] [Updated] (TIKA-1447) CHM parser: wrong directory list
[ https://issues.apache.org/jira/browse/TIKA-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong-Thai Nguyen updated TIKA-1447: --- Fix Version/s: 1.7 > CHM parser: wrong directory list > > > Key: TIKA-1447 > URL: https://issues.apache.org/jira/browse/TIKA-1447 > Project: Tika > Issue Type: Bug > Affects Versions: 1.7 > Reporter: Bin Hawking > Priority: Critical > Fix For: 1.7 > > > The CHM parser gets a wrong directory list for a CHM file (e.g. testChm2.chm in tika-parsers' test resources): > 1. Duplicate entries (mostly from PMGI chunks, which should have been ignored). > 2. Invalid entries (usually with unreadable entry names). > 3. Missing entries (sometimes as in TIKA-1176). > I have fixed it (to some degree) by using the PMGL header to find directory chunks and their respective meaningful parts.
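The idea of using the chunk header to separate listing chunks from index chunks can be sketched as below. This is not the code of the actual fix: `PmglFilterSketch` and its methods are hypothetical, and the sketch assumes the directory is available as one byte array of fixed-size chunks laid out back to back, with each listing chunk starting with the ASCII signature "PMGL" and each index chunk with "PMGI".

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class PmglFilterSketch {

    private static final byte[] PMGL = "PMGL".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the chunk starting at offset begins with the "PMGL" signature. */
    static boolean isListingChunk(byte[] directory, int offset) {
        if (offset < 0 || offset + PMGL.length > directory.length) {
            return false;
        }
        for (int i = 0; i < PMGL.length; i++) {
            if (directory[offset + i] != PMGL[i]) {
                return false;
            }
        }
        return true;
    }

    /** Collects the start offsets of all listing chunks; PMGI (index) chunks are skipped. */
    static List<Integer> listingChunkOffsets(byte[] directory, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<Integer> offsets = new ArrayList<>();
        for (int offset = 0; offset < directory.length; offset += chunkSize) {
            if (isListingChunk(directory, offset)) {
                offsets.add(offset);
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Three toy 8-byte chunks: listing, index, listing.
        byte[] directory = "PMGL....PMGI....PMGL....".getBytes(StandardCharsets.US_ASCII);
        System.out.println(listingChunkOffsets(directory, 8)); // prints [0, 16]
    }
}
```

Skipping the index chunk at offset 8 is what avoids the duplicate entries described in point 1; reading only the meaningful part inside each listing chunk is a separate step that this sketch does not cover.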
[jira] [Resolved] (TIKA-1448) CHM parser : defect in file extraction
[ https://issues.apache.org/jira/browse/TIKA-1448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong-Thai Nguyen resolved TIKA-1448. Resolution: Fixed > CHM parser : defect in file extraction > -- > > Key: TIKA-1448 > URL: https://issues.apache.org/jira/browse/TIKA-1448 > Project: Tika > Issue Type: Bug > Components: parser > Affects Versions: 1.7 > Reporter: Bin Hawking > Fix For: 1.7 > > > In the ChmBlockInfo class, the call > chmBlockInfo.setIniBlock((chmBlockInfo.startBlock - chmBlockInfo.startBlock) % (int) clcd.getResetInterval()); > always sets 0, because the parenthesized difference is always 0. > According to the LZX algorithm, it should be > chmBlockInfo.setIniBlock(chmBlockInfo.startBlock - chmBlockInfo.startBlock % (int) clcd.getResetInterval());
[jira] [Resolved] (TIKA-1447) CHM parser: wrong directory list
[ https://issues.apache.org/jira/browse/TIKA-1447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong-Thai Nguyen resolved TIKA-1447. Resolution: Fixed > CHM parser: wrong directory list > > > Key: TIKA-1447 > URL: https://issues.apache.org/jira/browse/TIKA-1447 > Project: Tika > Issue Type: Bug > Affects Versions: 1.7 > Reporter: Bin Hawking > Priority: Critical > > The CHM parser gets a wrong directory list for a CHM file (e.g. testChm2.chm in tika-parsers' test resources): > 1. Duplicate entries (mostly from PMGI chunks, which should have been ignored). > 2. Invalid entries (usually with unreadable entry names). > 3. Missing entries (sometimes as in TIKA-1176). > I have fixed it (to some degree) by using the PMGL header to find directory chunks and their respective meaningful parts.
[jira] [Resolved] (TIKA-1446) CHM parser : wrong decompression of aligned blocks
[ https://issues.apache.org/jira/browse/TIKA-1446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hong-Thai Nguyen resolved TIKA-1446. Resolution: Fixed > CHM parser : wrong decompression of aligned blocks > -- > > Key: TIKA-1446 > URL: https://issues.apache.org/jira/browse/TIKA-1446 > Project: Tika > Issue Type: Bug > Affects Versions: 1.7 > Reporter: Bin Hawking > Priority: Critical > Attachments: chm.zip > > > If an embedded file contains aligned blocks, the parser outputs garbled or empty text for that file. > I have fixed it myself by correcting decompressAlignedBlock() and its preparation methods. The bug is mostly due to misuse of the main tree, align tree, and length tree, and some of the trees are built incorrectly.