[jira] [Commented] (TIKA-891) Use POST in addition to PUT on method calls in tika-server
[ https://issues.apache.org/jira/browse/TIKA-891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14361268#comment-14361268 ] Tyler Palsulich commented on TIKA-891: -- I took a stab at adding the {{@POST}} annotation to each {{@PUT}} resource. But, it didn't work. Turns out, you cannot support two HTTP request methods for the same resource. See [this blog|http://marxsoftware.blogspot.com/2010/02/playing-with-jerseyjax-rs-method.html#post-body-8986464064706467213] for an experiment. I couldn't seem to find anything in the [official documentation|http://docs.oracle.com/cd/E19798-01/821-1841/giepu/index.html]. But, the blog and my own experience seem like reason enough. So, now the real question: Since we can't support {{@PUT}} and {{@POST}}, do we want to leave everything the way it is, or switch everything to {{@POST}}? I'm leaning toward leaving the resources the way they are. But, {{@POST}}, as discussed above, isn't necessarily wrong... Use POST in addition to PUT on method calls in tika-server -- Key: TIKA-891 URL: https://issues.apache.org/jira/browse/TIKA-891 Project: Tika Issue Type: Improvement Components: general Reporter: Chris A. Mattmann Assignee: Chris A. Mattmann Priority: Trivial Labels: newbie Fix For: 1.9 Per Jukka's email: http://s.apache.org/uR It would be a better use of REST/HTTP verbs to use POST to put content to a resource where we don't intend to store that content (which is the implication of PUT). Max suggested adding: {code} @POST {code} annotations to the methods we are currently exposing using PUT to take care of this. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1106) CLAVIN Integration
[ https://issues.apache.org/jira/browse/TIKA-1106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1106: Component/s: (was: general) parser CLAVIN Integration -- Key: TIKA-1106 URL: https://issues.apache.org/jira/browse/TIKA-1106 Project: Tika Issue Type: Wish Components: parser Affects Versions: 1.3 Environment: All Reporter: Adam Estrada Assignee: Chris A. Mattmann Priority: Minor Labels: entity, geospatial, new-parser Fix For: 1.8 I've been evaluating CLAVIN as a way to extract location information from unstructured text. It seems like meshing it with Tika in some way would make a lot of sense. From CLAVIN website... {quote} CLAVIN (*Cartographic Location And Vicinity INdexer*) is an open source software package for document geotagging and geoparsing that employs context-based geographic entity resolution. It combines a variety of open source tools with natural language processing techniques to extract location names from unstructured text documents and resolve them against gazetteer records. Importantly, CLAVIN does not simply look up location names; rather, it uses intelligent heuristics in an attempt to identify precisely which Springfield (for example) was intended by the author, based on the context of the document. CLAVIN also employs fuzzy search to handle incorrectly-spelled location names, and it recognizes alternative names (e.g., Ivory Coast and Côte d'Ivoire) as referring to the same geographic entity. By enriching text documents with structured geo data, CLAVIN enables hierarchical geospatial search and advanced geospatial analytics on unstructured data. {quote} There was only one other instance of the word clavin mentioned in the ASF jira site so I thought it was definitely worth posting here. https://github.com/Berico-Technologies/CLAVIN -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1106) CLAVIN Integration
[ https://issues.apache.org/jira/browse/TIKA-1106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated TIKA-1106: Issue Type: New Feature (was: Wish) CLAVIN Integration -- Key: TIKA-1106 URL: https://issues.apache.org/jira/browse/TIKA-1106 Project: Tika Issue Type: New Feature Components: parser Affects Versions: 1.3 Environment: All Reporter: Adam Estrada Assignee: Chris A. Mattmann Priority: Minor Labels: entity, geospatial, new-parser Fix For: 1.8
[jira] [Assigned] (TIKA-1106) CLAVIN Integration
[ https://issues.apache.org/jira/browse/TIKA-1106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann reassigned TIKA-1106: --- Assignee: Chris A. Mattmann CLAVIN Integration -- Key: TIKA-1106 URL: https://issues.apache.org/jira/browse/TIKA-1106 Project: Tika Issue Type: Wish Components: general Affects Versions: 1.3 Environment: All Reporter: Adam Estrada Assignee: Chris A. Mattmann Priority: Minor Labels: entity, geospatial, new-parser Fix For: 1.8
[jira] [Commented] (TIKA-1106) CLAVIN Integration
[ https://issues.apache.org/jira/browse/TIKA-1106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14361500#comment-14361500 ] Chris A. Mattmann commented on TIKA-1106: - Have multiple students and folks working on this right now. Going to also take a look. [~aashish24] please also look. CLAVIN Integration -- Key: TIKA-1106 URL: https://issues.apache.org/jira/browse/TIKA-1106 Project: Tika Issue Type: Wish Components: general Affects Versions: 1.3 Environment: All Reporter: Adam Estrada Assignee: Chris A. Mattmann Priority: Minor Labels: entity, geospatial, new-parser Fix For: 1.8
[jira] [Closed] (TIKA-942) HTTP Accept header evaluator
[ https://issues.apache.org/jira/browse/TIKA-942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich closed TIKA-942. Resolution: Won't Fix Closing as Won't Fix, since the JAX-RS implementation automatically handles this. HTTP Accept header evaluator Key: TIKA-942 URL: https://issues.apache.org/jira/browse/TIKA-942 Project: Tika Issue Type: New Feature Components: mime Reporter: Jukka Zitting Labels: HTTP The HTTP Accept header (http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html) provides a flexible mechanism for an HTTP client to express its preferences for different response media types. Unfortunately processing Accept headers on the server side is quite complicated because of the somewhat complicated syntax and the possibility of media type inheritance relationships (can I respond with application/xml if the client requests text/plain?). The media type registry in Tika is perfect for resolving such cases, so I'd like to introduce a new {{String resolveHttpAccept(String accept, String... types)}} method in the Tika facade. The method would take the value of an HTTP accept header and evaluate it against the given media types supported by a server, using the configured media type registry for type inheritance information. The method would then return the best match from among the given media types, or {{application/octet-stream}} if none of the listed types would be accepted by the client. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
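To make the proposal above concrete, here is a minimal sketch of what a {{resolveHttpAccept}} style method could look like. It is not Tika's API and was never part of the facade; the class name {{AcceptResolver}} and the helper {{Range}} are made up for illustration. It only handles q-values and wildcards and deliberately leaves out the media type inheritance lookup that the issue wanted to delegate to the media type registry.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Tika. Ignores media type inheritance.
final class AcceptResolver {

    // One parsed media range from the Accept header, e.g. "text/*;q=0.8".
    private static final class Range {
        final String type;
        final String subtype;
        final double q;

        Range(String type, String subtype, double q) {
            this.type = type;
            this.subtype = subtype;
            this.q = q;
        }

        boolean matches(String mediaType) {
            String[] parts = mediaType.toLowerCase(Locale.ROOT).split("/", 2);
            if (parts.length != 2) {
                return false;
            }
            boolean typeOk = type.equals("*") || type.equals(parts[0]);
            boolean subtypeOk = subtype.equals("*") || subtype.equals(parts[1]);
            return typeOk && subtypeOk;
        }

        // Exact type/subtype scores 2, a subtype wildcard 1, a full wildcard 0.
        int specificity() {
            if (!subtype.equals("*")) {
                return 2;
            }
            return type.equals("*") ? 0 : 1;
        }
    }

    private static List<Range> parse(String accept) {
        List<Range> ranges = new ArrayList<>();
        for (String item : accept.toLowerCase(Locale.ROOT).split(",")) {
            String[] params = item.trim().split(";");
            String[] parts = params[0].trim().split("/", 2);
            if (parts.length != 2) {
                continue; // skip malformed or empty entries
            }
            double q = 1.0; // default quality when no q parameter is given
            for (int i = 1; i < params.length; i++) {
                String p = params[i].trim();
                if (p.startsWith("q=")) {
                    try {
                        q = Double.parseDouble(p.substring(2).trim());
                    } catch (NumberFormatException e) {
                        q = 0.0; // treat an unreadable q as "not acceptable"
                    }
                }
            }
            ranges.add(new Range(parts[0].trim(), parts[1].trim(), q));
        }
        return ranges;
    }

    // Returns the supported type with the highest q, or application/octet-stream.
    static String resolveHttpAccept(String accept, String... types) {
        List<Range> ranges = parse(accept);
        String best = "application/octet-stream";
        double bestQ = 0.0;
        for (String candidate : types) {
            Range chosen = null; // most specific range matching this candidate
            for (Range r : ranges) {
                if (r.matches(candidate)
                        && (chosen == null || r.specificity() > chosen.specificity())) {
                    chosen = r;
                }
            }
            if (chosen != null && chosen.q > bestQ) {
                best = candidate;
                bestQ = chosen.q;
            }
        }
        return best;
    }
}
```

On ties, the sketch keeps whichever supported type was listed first. As the close comment says, a JAX-RS implementation already does this kind of negotiation, so this is only meant to show the shape of the problem.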
[jira] [Resolved] (TIKA-651) Unescaped attribute value generated
[ https://issues.apache.org/jira/browse/TIKA-651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich resolved TIKA-651. -- Resolution: Fixed Marking as fixed per the above comment. And, it's been a week and a half with no objection. Unescaped attribute value generated --- Key: TIKA-651 URL: https://issues.apache.org/jira/browse/TIKA-651 Project: Tika Issue Type: Bug Components: parser Affects Versions: 0.9 Reporter: Raimund Merkert Assignee: Jukka Zitting Attachments: XHTMLSerializer.java I've converted a word document that contains hyperlinks with a complex query component. The character is not escaped and mozilla complains about that when I write out the XHTML via a content handler that I wrote. It's not clear to me whether or not my contenthandler should assume attributes are properly escaped or not. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
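On the question raised in the description: SAX passes attribute values to a content handler as plain strings, so whatever writes the markup out is the place to escape them. A minimal sketch of that step follows, assuming the handler receives raw values. The class and method names are hypothetical and this is not the code from the attached XHTMLSerializer.java.

```java
// Hypothetical helper, not taken from Tika: escapes a raw attribute value
// before it is written between double quotes in XML or XHTML output.
final class AttributeEscaper {

    static String escapeAttribute(String value) {
        StringBuilder out = new StringBuilder(value.length() + 16);
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '&':
                    out.append("&amp;");
                    break;
                case '<':
                    out.append("&lt;");
                    break;
                case '>':
                    out.append("&gt;");
                    break;
                case '"':
                    out.append("&quot;");
                    break;
                default:
                    out.append(c);
            }
        }
        return out.toString();
    }
}
```

A hyperlink with a multi-parameter query string is the typical case where this matters, since the parameter separator has to become an entity inside an href attribute.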
[jira] [Commented] (TIKA-1095) Only gibberish extracted from this PDF
[ https://issues.apache.org/jira/browse/TIKA-1095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14361451#comment-14361451 ] Tyler Palsulich commented on TIKA-1095: --- Just commented on PDFBOX-2451. Still have this issue with PDFBox 1.8.9-SNAPSHOT, so we still have it in Tika. Only gibberish extracted from this PDF -- Key: TIKA-1095 URL: https://issues.apache.org/jira/browse/TIKA-1095 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.3 Environment: Probably any Reporter: Bas van Meurs Labels: pdfbox Attachments: ALG 2010-05-19 03 bijlage 1 - besluitenlijst dagelijks bestuur d d 10 februari 2010.pdf, test.txt java -jar /usr/share/tika/tika-app-1.3.jar -t /home/adrupal/www/sites/stadsregio.nl/files/files/Agendastukken/ALG 2010-05-19 03 bijlage 1 - besluitenlijst dagelijks bestuur d d 10 februari 2010.pdf /tmp/test.txt This produces all gibberish. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1098) not able to parse pdfs/docs/ppts using 1.1 tika parser
[ https://issues.apache.org/jira/browse/TIKA-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14361461#comment-14361461 ]

Tyler Palsulich commented on TIKA-1098:
---

Tika still can't parse this file. I tried with PDFBox 1.8.9 SNAPSHOT, but hit the following exception:

{code}
➜ trunk java -jar ~/Downloads/pdfbox.jar ExtractText ~/Downloads/test.pdf
Mar 13, 2015 9:14:33 PM org.apache.pdfbox.pdfparser.BaseParser parseCOSStream
WARNING: Specified stream length 2390 is wrong. Fall back to reading stream until 'endstream'.
Mar 13, 2015 9:14:33 PM org.apache.pdfbox.pdfparser.BaseParser parseCOSArray
WARNING: Corrupt object reference
ExtractText failed with the following exception:
java.io.IOException: Unknown dir object c='' cInt=62 peek='' peekInt=62 364863
	at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:1362)
	at org.apache.pdfbox.pdfparser.BaseParser.parseCOSArray(BaseParser.java:1066)
	at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:1275)
	at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryValue(BaseParser.java:249)
	at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionary(BaseParser.java:356)
	at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:1264)
	at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:641)
	at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:203)
	at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1239)
	at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1204)
	at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1129)
	at org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:212)
	at org.apache.pdfbox.ExtractText.main(ExtractText.java:85)
	at org.apache.pdfbox.PDFBox.main(PDFBox.java:58)
{code}

Does anyone recognize this error? Or, should I open a new issue with PDFBox?
not able to parse pdfs/docs/ppts using 1.1 tika parser
Key: TIKA-1098
URL: https://issues.apache.org/jira/browse/TIKA-1098
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.1
Environment: linux redhat
Reporter: Qian Diao
Attachments: url_1763_approx-alg-notes.pdf

Hi, I got some parsing problems when using Tika 1.1 for the attached pdf file.

my code (Test.java):

    import java.io.File;
    import java.io.InputStream;
    import java.io.FileInputStream;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.Parser;
    import org.apache.tika.parser.html.BoilerpipeContentHandler;
    import org.apache.tika.sax.BodyContentHandler;
    import org.apache.tika.parser.html.HtmlParser;
    import de.l3s.boilerpipe.extractors.ArticleExtractor;

    public class Test {
        private static final String validBoilerpipeFilenameRegEx = ".*(\\.)(htm|html|shtml|php|asp|aspx)$";

        public String parseFile(File inFile) {
            if (inFile == null || !inFile.isFile() || !inFile.canRead())
                return null;
            InputStream is = null;
            String outputText = "";
            try {
                // Open input stream
                is = new FileInputStream(inFile);
                // Prepare parser
                BodyContentHandler contenthandler = new BodyContentHandler(-1);
                Metadata metadata = new Metadata();
                metadata.set(Metadata.RESOURCE_NAME_KEY, inFile.getName());
                ParseContext pc = new ParseContext();
                // Call parse with boilerpipe if valid boilerpipe extension; otherwise, call regular parse.
                if (!inFile.getName().matches(validBoilerpipeFilenameRegEx)) {
                    Parser parser = new AutoDetectParser();
                    parser.parse(is, contenthandler, metadata, pc);
                } else {
                    Parser parser = new HtmlParser();
                    BoilerpipeContentHandler bh = new BoilerpipeContentHandler(contenthandler, new ArticleExtractor());
                    parser.parse(is, bh, metadata, pc);
                }
                // Prepare text for write
                outputText = contenthandler.toString();
            } catch (Exception e) {
                System.out.println(e);
                return null;
            } finally {
                try {
                    if (is != null)
                        is.close();
                } catch (Exception e) {}
            }
            return outputText;
        }
    }

=output
[jira] [Updated] (TIKA-1106) CLAVIN Integration
[ https://issues.apache.org/jira/browse/TIKA-1106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tyler Palsulich updated TIKA-1106: -- Labels: entity geospatial new-parser (was: entity geospatial) CLAVIN Integration -- Key: TIKA-1106 URL: https://issues.apache.org/jira/browse/TIKA-1106 Project: Tika Issue Type: Wish Components: general Affects Versions: 1.3 Environment: All Reporter: Adam Estrada Priority: Minor Labels: entity, geospatial, new-parser Fix For: 1.8
[jira] [Commented] (TIKA-1069) Tika TXTParser requires (writeLimit + 1) because of paragrah wrapping
[ https://issues.apache.org/jira/browse/TIKA-1069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14361344#comment-14361344 ] Tyler Palsulich commented on TIKA-1069: --- Reproduced this with 1.8-SNAPSHOT. If a text file has {{writeLimit - 1}} characters, it's fine. But, if it has exactly {{writeLimit}} characters, an exception is thrown which says the document contained more than {{writeLimit}} characters, which isn't true. Thoughts? Tika TXTParser requires (writeLimit + 1) because of paragrah wrapping - Key: TIKA-1069 URL: https://issues.apache.org/jira/browse/TIKA-1069 Project: Tika Issue Type: Bug Components: parser Affects Versions: 1.2 Reporter: Horia Chiorean When using a configured {{writeLimit}} with {{BodyContentHandler}}, together with the {{TXTParser}}, the latter always throws a {{WriteLimitReachedException}} from the {{ignorableWhitespace}} method, when parsing a text which has the {{writeLimit}} length. From what I can tell so far, this is caused by the fact that the {{TXTParser}} wraps the text in a p element. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
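The off-by-one described above can be reproduced in miniature. The sketch below is not Tika's code: {{LimitedSink}}, {{LimitReached}} and {{parsePlainText}} are hypothetical stand-ins for {{BodyContentHandler}}, {{WriteLimitReachedException}} and {{TXTParser}}. It assumes, as the reporter suspects, that closing the wrapping paragraph emits one newline through {{ignorableWhitespace}} and that the newline is counted against the same limit as the text.

```java
// Hypothetical model of the reported behaviour, not Tika's implementation.
final class WriteLimitDemo {

    // Stand-in for a "write limit reached" signal.
    static final class LimitReached extends RuntimeException {
        LimitReached(String message) {
            super(message);
        }
    }

    // Counts every write, including ignorable whitespace, against one limit.
    static final class LimitedSink {
        private final int writeLimit;
        private final StringBuilder buffer = new StringBuilder();

        LimitedSink(int writeLimit) {
            this.writeLimit = writeLimit;
        }

        void characters(String text) {
            write(text);
        }

        void ignorableWhitespace(String whitespace) {
            write(whitespace);
        }

        private void write(String s) {
            int room = writeLimit - buffer.length();
            if (s.length() > room) {
                buffer.append(s, 0, room); // keep what still fits
                throw new LimitReached("document contained more than "
                        + writeLimit + " characters");
            }
            buffer.append(s);
        }

        String text() {
            return buffer.toString();
        }
    }

    // Mimics a text parser that wraps the body in a single paragraph.
    static void parsePlainText(String body, LimitedSink sink) {
        sink.characters(body);
        sink.ignorableWhitespace("\n"); // assumed: emitted when the paragraph closes
    }
}
```

With a limit of 5, a four-character body goes through, while a five-character body fills the buffer exactly and then fails on the trailing newline, even though the document itself never exceeded the limit. One possible direction, under the same assumption, would be to not count ignorable whitespace against the limit.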
[jira] [Resolved] (TIKA-1071) Some spaces are omitted when merging two lines of a paragraph
[ https://issues.apache.org/jira/browse/TIKA-1071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tyler Palsulich resolved TIKA-1071.
---
Resolution: Fixed

Closing as fixed:
{code}
tika http://www.amf-france.org/documents/general/8091_1.pdf | grep commeinclu
{code}
has no output.

Some spaces are omitted when merging two lines of a paragraph
-
Key: TIKA-1071
URL: https://issues.apache.org/jira/browse/TIKA-1071
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.3
Reporter: Guillaume Vauvert
Attachments: RG_AMF_Livre_IV_8091_1.pdf

Tika 1.3 sometimes appends two successive lines without inserting a space, while Tika 1.2 does not make this error.

Document : http://www.amf-france.org/documents/general/8091_1.pdf

With Tika 1.3 :
Page 2, Paragraph:4° La référence aux « membres du conseil d'administration ou directoire de la SICAV » doit s'entendre commeincluant, le cas échéant, le président de la société par actions simplifiée ou celui ou ceux de ses dirigeants que lesstatuts désignent pour exercer les attributions du conseil d'administration conformément aux dispositions de l'articleL. 227-1 du code de commerce.

commeincluant should be comme incluant

With Tika 1.2 :
Page 2, Paragraph:4° La référence aux « membres du conseil d'administration ou directoire de la SICAV » doit s'entendre comme incluant, le cas échéant, le président de la société par actions simplifiée ou celui ou ceux de ses dirigeants que les statuts désignent pour exercer les attributions du conseil d'administration conformément aux dispositions de l'article L. 227-1 du code de commerce.
Re: Parser test resources
Good points. Maybe it's a good idea to keep the new files organized, like chm, but leave the old ones where they are? The test-documents directory has 460 entries right now.

Tyler

On Wed, Mar 11, 2015 at 8:43 AM, Nick Burch apa...@gagravarr.org wrote:

> On Tue, 10 Mar 2015, Tyler Palsulich wrote:
>> Or, do enough parsers have overlapping test resource dependencies where it makes sense to have them _all_ under one directory?
>
> I believe that most of the test files get used for both detection and parsing unit tests
>
>> It would be nice to easily know which files are used for which tests.
>
> 5 lines of perl should give you that, or fewer if you don't want to be able to understand the perl... ;-)
>
> Many, but not all of the test files are of the form testfiletype.ext or testfiletype_special type/description.ext, which I find makes it fairly easy to spot what files go with what. Not all though. Would fixing the few files not in that format help, or hinder do you think?
>
> Nick
[jira] [Commented] (TIKA-1088) Unsupported AutoCAD drawing version: AC1009
[ https://issues.apache.org/jira/browse/TIKA-1088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14361432#comment-14361432 ] Tyler Palsulich commented on TIKA-1088: --- Related to TIKA-1045, support for another AutoCAD version. Unsupported AutoCAD drawing version: AC1009 --- Key: TIKA-1088 URL: https://issues.apache.org/jira/browse/TIKA-1088 Project: Tika Issue Type: Improvement Components: parser Reporter: Hardik Upadhyay Labels: new-parser Attachments: 227051.dwg Tika parser version 1.2 and 1.3 fails to parse DWG file version AC1009. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1059) Better Handling of InterruptedException in ExternalParser and ExternalEmbedder
[ https://issues.apache.org/jira/browse/TIKA-1059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14361277#comment-14361277 ] Tyler Palsulich commented on TIKA-1059: --- Is this feature still worth implementing? Better Handling of InterruptedException in ExternalParser and ExternalEmbedder -- Key: TIKA-1059 URL: https://issues.apache.org/jira/browse/TIKA-1059 Project: Tika Issue Type: Improvement Components: parser Affects Versions: 1.3 Reporter: Ray Gauss II Fix For: 1.8 The {{ExternalParser}} and {{ExternalEmbedder}} classes currently catch {{InterruptedException}} and ignore it. The methods should either call {{interrupt()}} on the current thread or re-throw the exception, possibly wrapped in a {{TikaException}}. See TIKA-775 for a previous discussion. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
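For reference, the handling the issue asks for usually looks like the sketch below: restore the thread's interrupt flag and then surface the failure instead of swallowing it. This is not the code in {{ExternalParser}} or {{ExternalEmbedder}}. {{InterruptAware}}, {{BlockingStep}} and {{ExternalToolException}} are hypothetical names, with the last one standing in for {{TikaException}} so the example stays self-contained.

```java
// Hypothetical illustration of the suggested handling, not Tika's code.
final class InterruptAware {

    // Stand-in for wrapping the interruption in a TikaException.
    static final class ExternalToolException extends Exception {
        ExternalToolException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    // A blocking step, such as waiting for an external process to exit.
    interface BlockingStep {
        void run() throws InterruptedException;
    }

    static void runBlocking(BlockingStep step) throws ExternalToolException {
        try {
            step.run();
        } catch (InterruptedException e) {
            // Restore the flag so callers further up can still see the interrupt.
            Thread.currentThread().interrupt();
            throw new ExternalToolException(
                    "Interrupted while waiting for external tool", e);
        }
    }
}
```

For a process wait this could be called as {{runBlocking(process::waitFor)}}. Doing both, resetting the flag and rethrowing, covers callers that check the thread state as well as callers that only look at exceptions.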
[jira] [Commented] (TIKA-1063) OpenDocument basic style support
[ https://issues.apache.org/jira/browse/TIKA-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14361334#comment-14361334 ] Hudson commented on TIKA-1063: -- UNSTABLE: Integrated in tika-trunk-jdk1.7 #547 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/547/]) TIKA-1063. Add basic ODF style support, contributed by Axel Dörfler. (tpalsulich: http://svn.apache.org/viewvc/tika/trunk/?view=revrev=107) * /tika/trunk/CHANGES.txt * /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/odf/OpenDocumentContentParser.java * /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/odf/ODFParserTest.java OpenDocument basic style support Key: TIKA-1063 URL: https://issues.apache.org/jira/browse/TIKA-1063 Project: Tika Issue Type: Bug Components: parser Reporter: Axel Dörfler Priority: Minor Attachments: testStyles.odt, tika-opendocument-styles.patch I've added basic support for list and text styles. Paragraph styles are omitted on purpose -- one could use the style names as class names, though. Only bold, italic, and underlined text is supported. Lists now differentiate between ordered and unordered lists. Test case included. I've also changed the ODFParserTest to make a bit more use of the methods of its super class. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (TIKA-1063) OpenDocument basic style support
[ https://issues.apache.org/jira/browse/TIKA-1063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14361422#comment-14361422 ] Hudson commented on TIKA-1063: -- SUCCESS: Integrated in tika-trunk-jdk1.7 #548 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/548/]) TIKA-1063. Add ODF style test resource file. (tpalsulich: http://svn.apache.org/viewvc/tika/trunk/?view=revrev=110) * /tika/trunk/tika-parsers/src/test/resources/test-documents/testStyles.odt OpenDocument basic style support Key: TIKA-1063 URL: https://issues.apache.org/jira/browse/TIKA-1063 Project: Tika Issue Type: Bug Components: parser Reporter: Axel Dörfler Priority: Minor Attachments: testStyles.odt, tika-opendocument-styles.patch I've added basic support for list and text styles. Paragraph styles are omitted on purpose -- one could use the style names as class names, though. Only bold, italic, and underlined text is supported. Lists now differentiate between ordered and unordered lists. Test case included. I've also changed the ODFParserTest to make a bit more use of the methods of its super class. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (TIKA-1079) Word document hits AIOOBE in SummaryExtractor.parseSummaries
[ https://issues.apache.org/jira/browse/TIKA-1079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tyler Palsulich resolved TIKA-1079.
---
Resolution: Fixed

There is no more exception with Tika 1.8-SNAPSHOT. So, I'm marking this as fixed.

Word document hits AIOOBE in SummaryExtractor.parseSummaries
Key: TIKA-1079
URL: https://issues.apache.org/jira/browse/TIKA-1079
Project: Tika
Issue Type: Bug
Components: parser
Reporter: Michael McCandless
Fix For: 1.8
Attachments: guide_to_daips_(id_3152_ver_1.0.0).doc

I'm not yet sure if this is a corrupted document (though, MS Word opens it just fine) or a bug in POI ... but I hit this exc when running it through TikaCLI:

{noformat}
java.lang.ArrayIndexOutOfBoundsException: -1
	at org.apache.poi.hpsf.CodePageString.init(CodePageString.java:161)
	at org.apache.poi.hpsf.TypedPropertyValue.readValue(TypedPropertyValue.java:158)
	at org.apache.poi.hpsf.VariantSupport.read(VariantSupport.java:163)
	at org.apache.poi.hpsf.Property.init(Property.java:164)
	at org.apache.poi.hpsf.Section.init(Section.java:277)
	at org.apache.poi.hpsf.PropertySet.init(PropertySet.java:451)
	at org.apache.poi.hpsf.PropertySet.init(PropertySet.java:246)
	at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaryEntryIfExists(SummaryExtractor.java:78)
	at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaries(SummaryExtractor.java:69)
	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:170)
	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:161)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
	at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
	at org.apache.tika.cli.TikaCLI$OutputType.process(TikaCLI.java:139)
	at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:415)
	at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:109)
{noformat}
[jira] [Commented] (TIKA-1077) JAXRS Server meta-resource should be able to return metadata in JSON or XML
[ https://issues.apache.org/jira/browse/TIKA-1077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14361423#comment-14361423 ] Tyler Palsulich commented on TIKA-1077: --- Related to TIKA-944. JAXRS Server meta-resource should be able to return metadata in JSON or XML --- Key: TIKA-1077 URL: https://issues.apache.org/jira/browse/TIKA-1077 Project: Tika Issue Type: Improvement Components: server Affects Versions: 1.2, 1.3 Environment: (all) Reporter: Aaron Weber Priority: Minor Labels: JAXRS, json, metadata, service, xml Tika JAXRS /meta resource currently returns plain text output of PUT file's metadata (properties). Based upon the http request's ACCEPT header, this resource should return either the current format (i.e. default), JSON formatted, or XML formatted metadata. This resource should set the return/output header to match. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (TIKA-1575) Upgrade to PDFBox 1.8.9 when available
Tim Allison created TIKA-1575: - Summary: Upgrade to PDFBox 1.8.9 when available Key: TIKA-1575 URL: https://issues.apache.org/jira/browse/TIKA-1575 Project: Tika Issue Type: Improvement Reporter: Tim Allison Priority: Minor The PDFBox community is about to release 1.8.9. Let's use this issue to track discussions before the release and to track Tika's upgrade to PDFBox 1.8.9 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1575) Upgrade to PDFBox 1.8.9 when available
[ https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Allison updated TIKA-1575:
--
Attachment: PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT_reports.zip
            PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT.xlsx

[~tilman], thank you, again, for pinging me on the impending release of PDFBox 1.8.9. And, also thanks to you, I've turned on the AccessChecker, so you shouldn't see any content from files that don't allow extraction.

I ran the most recent eval code against all files that end in a pdf extension in govdocs1. I've included in the xlsx file all files with some kind of an exception or with any difference in attachment counts, metadata value counts, lang id or content. I've also included an example of a static dump of reports from the comparison database. More work remains on that...

I haven't had a chance to join in your earlier comments from our work on the 1.8.8 release. Many apologies!

My quick impression:
1) no differences in attachments
2) no differences in metadata values
3) 1.8.9 fixed 3 null pointer exceptions, no new exceptions
4) Content wise:
   a) with 1.8.9 we're getting less form field info (looks like internal field names? More digging is required...)
   b) might be actual modest regressions with 147/147012.pdf 223/223704.pdf

Upgrade to PDFBox 1.8.9 when available
--
Key: TIKA-1575
URL: https://issues.apache.org/jira/browse/TIKA-1575
Project: Tika
Issue Type: Improvement
Reporter: Tim Allison
Priority: Minor
Attachments: PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT.xlsx, PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT_reports.zip

The PDFBox community is about to release 1.8.9. Let's use this issue to track discussions before the release and to track Tika's upgrade to PDFBox 1.8.9
[jira] [Comment Edited] (TIKA-1575) Upgrade to PDFBox 1.8.9 when available
[ https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14360492#comment-14360492 ] Tim Allison edited comment on TIKA-1575 at 3/13/15 3:47 PM: Form clutter...This was embedded inside 776568. With PDFBox 1.8.8, we extracted the keys for the subform (but there was no meaningful content in this doc): {noformat}Briefings\n\nNo\n\n NWSI 10-814 November 10, 2008\n\n 19\n\n\n\tform1[0]: \n\t#subform[0]: \n\tPrintButton1[0]: \n\tCheckBox1[0]: \n\tCheckBox2[0]: \n\tTextField1[0]: \n\tCheckBox5[0]: \n\tCheckBox6[0]: \n\tTextField2[0]: \n\tTextField3[0]: \n\tCheckBox9[0]: \n\tCheckBox10[0]: \n\tCheckBox11[0]: \n\tCheckBox12[0]: \n\tCheckBox11[1]: \n\tCheckBox12[1]: \n\tCheckBox11[2]: \n\tCheckBox12[2]: \n\tTextField4[0]: \n\tTextField2[1]: \n\tTextField9[0]: \n\n\t#subform[1]: \n\tCheckBox1[1]: \n\tCheckBox2[1]: \n\tTextField1[1]: \n\tCheckBox5[1]: \n\tCheckBox6[1]: \n\tCheckBox9[1]: \n\tCheckBox10[1]: \n\tCheckBox11[3]: \n\tCheckBox12[3]: \n\tCheckBox11[4]: \n\tCheckBox12[4]: \n\tTextField4[1]: \n\tTextField5[0]: \n\tCheckBox5[2]: \n\tCheckBox6[2]: \n\n\t#subform[2]: \n\tCheckBox1[2]: \n\tCheckBox2[2]: \n\tCheckBox9[2]: \n\tCheckBox10[2]: \n\tTextField4[2]: \n\tCheckBox5[3]: \n\tCheckBox6[3]: \n\tCheckBox1[3]: \n\tCheckBox2[3]: \n\tCheckBox5[4]: \n\tCheckBox6[4]: \n\tCheckBox9[3]: \n\tCheckBox10[3]: \n\tTextField4[3]: \n\tCheckBox9[4]: \n\tCheckBox10[4]: \n\tTextField6[0]: \n\tTextField7[0]: \n\tCheckBox9[5]: \n\tCheckBox10[5]: \n\tTextField6[1]: \n\tTextField6[2]: \n\tTextField8[0]: \n\tTextField8[1]: \n\n\t#subform[3]: \n\tCheckBox1[4]: \n\tCheckBox2[4]: \n\tCheckBox5[5]: \n\tCheckBox6[5]: \n\tCheckBox9[6]: \n\tCheckBox10[6]: \n\tTextField4[4]: \n\tCheckBox5[6]: \n\tCheckBox6[6]: \n\tCheckBox1[5]: \n\tCheckBox2[5]: \n\tCheckBox5[7]: \n\tCheckBox6[7]: \n\tCheckBox5[8]: \n\tCheckBox5[9]: \n\tCheckBox6[8]: \n\tCheckBox6[9]: \n\tTextField8[2]: \n\tCheckBox9[7]: \n\tCheckBox10[7]: 
\n\tTextField6[3]: \n\tTextField6[4]: \n\tCheckBox5[10]: \n\tCheckBox5[11]: \n\tCheckBox6[10]: \n\tCheckBox6[11]: \n\n\n\n\n,{noformat} In 1.8.9, there's just this: {noformat} Briefings\n\nNo\n\n NWSI 10-814 November 10, 2008\n\n 19\n\n\n\tform1[0]: \n\n\n\n {noformat} There's no difference with PDFBox app's ExtractText between 1.8.8 and 1.8.9 on this file. Upgrade to PDFBox 1.8.9 when available -- Key: TIKA-1575 URL: https://issues.apache.org/jira/browse/TIKA-1575 Project: Tika Issue Type: Improvement Reporter: Tim Allison Priority: Minor Attachments: 10-814_Appendix B_v3.pdf, PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT.xlsx, PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT_reports.zip The
[jira] [Commented] (TIKA-682) Creative Suite formats support
[ https://issues.apache.org/jira/browse/TIKA-682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14360487#comment-14360487 ] Christopher Dedels commented on TIKA-682: - Yes, that magic matches all the InDesign files I looked at. Thanks for the info! Creative Suite formats support -- Key: TIKA-682 URL: https://issues.apache.org/jira/browse/TIKA-682 Project: Tika Issue Type: New Feature Components: parser Affects Versions: 1.8 Reporter: Vivian Li Labels: new-parser Attachments: Untitled-1.indd, myfile.psd, myfile.xmp Is it possible to support Creative Suite formats, such as PSD, InDesign, etc.? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (TIKA-1575) Upgrade to PDFBox 1.8.9 when available
[ https://issues.apache.org/jira/browse/TIKA-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Allison updated TIKA-1575: -- Attachment: 10-814_Appendix B_v3.pdf Form clutter...This was embedded inside 776568. With PDFBox 1.8.8, we extracted the keys for the subform (but there was no meaningful content in this doc): {noformat}Briefings\n\nNo\n\n NWSI 10-814 November 10, 2008\n\n 19\n\n\n\tform1[0]: \n\t#subform[0]: \n\tPrintButton1[0]: \n\tCheckBox1[0]: \n\tCheckBox2[0]: \n\tTextField1[0]: \n\tCheckBox5[0]: \n\tCheckBox6[0]: \n\tTextField2[0]: \n\tTextField3[0]: \n\tCheckBox9[0]: \n\tCheckBox10[0]: \n\tCheckBox11[0]: \n\tCheckBox12[0]: \n\tCheckBox11[1]: \n\tCheckBox12[1]: \n\tCheckBox11[2]: \n\tCheckBox12[2]: \n\tTextField4[0]: \n\tTextField2[1]: \n\tTextField9[0]: \n\n\t#subform[1]: \n\tCheckBox1[1]: \n\tCheckBox2[1]: \n\tTextField1[1]: \n\tCheckBox5[1]: \n\tCheckBox6[1]: \n\tCheckBox9[1]: \n\tCheckBox10[1]: \n\tCheckBox11[3]: \n\tCheckBox12[3]: \n\tCheckBox11[4]: \n\tCheckBox12[4]: \n\tTextField4[1]: \n\tTextField5[0]: \n\tCheckBox5[2]: \n\tCheckBox6[2]: \n\n\t#subform[2]: \n\tCheckBox1[2]: \n\tCheckBox2[2]: \n\tCheckBox9[2]: \n\tCheckBox10[2]: \n\tTextField4[2]: \n\tCheckBox5[3]: \n\tCheckBox6[3]: \n\tCheckBox1[3]: \n\tCheckBox2[3]: \n\tCheckBox5[4]: \n\tCheckBox6[4]: \n\tCheckBox9[3]: \n\tCheckBox10[3]: \n\tTextField4[3]: \n\tCheckBox9[4]: \n\tCheckBox10[4]: \n\tTextField6[0]: \n\tTextField7[0]: \n\tCheckBox9[5]: \n\tCheckBox10[5]: \n\tTextField6[1]: \n\tTextField6[2]: \n\tTextField8[0]: \n\tTextField8[1]: \n\n\t#subform[3]: \n\tCheckBox1[4]: \n\tCheckBox2[4]: \n\tCheckBox5[5]: \n\tCheckBox6[5]: \n\tCheckBox9[6]: \n\tCheckBox10[6]: \n\tTextField4[4]: \n\tCheckBox5[6]: \n\tCheckBox6[6]: \n\tCheckBox1[5]: \n\tCheckBox2[5]: \n\tCheckBox5[7]: \n\tCheckBox6[7]: \n\tCheckBox5[8]: \n\tCheckBox5[9]: \n\tCheckBox6[8]: \n\tCheckBox6[9]: \n\tTextField8[2]: \n\tCheckBox9[7]: \n\tCheckBox10[7]: \n\tTextField6[3]: \n\tTextField6[4]: 
\n\tCheckBox5[10]: \n\tCheckBox5[11]: \n\tCheckBox6[10]: \n\tCheckBox6[11]: \n\n\n\n\n,{noformat} In 1.8.9, there's just this: {noformat} Briefings\n\nNo\n\n NWSI 10-814 November 10, 2008\n\n 19\n\n\n\tform1[0]: \n\n\n\n {noformat} Upgrade to PDFBox 1.8.9 when available -- Key: TIKA-1575 URL: https://issues.apache.org/jira/browse/TIKA-1575 Project: Tika Issue Type: Improvement Reporter: Tim Allison Priority: Minor Attachments: 10-814_Appendix B_v3.pdf, PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT.xlsx, PDFBox_1_8_8VPDFBox_1_8_9-SNAPSHOT_reports.zip The PDFBox community is about to release 1.8.9. Let's use this issue to track discussions before the release and to track Tika's upgrade to PDFBox 1.8.9 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
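One way to quantify the form-field difference shown above is to pull the field keys out of each extracted text and list the ones that disappeared. The sketch below assumes the `\n` and `\t` sequences in the dump stand for real newline and tab characters. The `field_keys` helper and the shortened sample strings are illustrative only, not part of Tika or PDFBox.

```python
import re
from typing import List

# Matches key lines such as "\tCheckBox1[0]: " or "\t#subform[0]: " in extracted text.
FIELD_KEY = re.compile(r"^\t(#?\w+\[\d+\]):", re.MULTILINE)


def field_keys(text: str) -> List[str]:
    """Return form-field keys in the order they appear in the extracted text."""
    return FIELD_KEY.findall(text)


# Shortened samples modelled on the 1.8.8 and 1.8.9 outputs quoted above.
text_188 = "19\n\n\n\tform1[0]: \n\t#subform[0]: \n\tPrintButton1[0]: \n\tCheckBox1[0]: \n\n"
text_189 = "19\n\n\n\tform1[0]: \n\n\n\n"

kept = set(field_keys(text_189))
missing_in_189 = [k for k in field_keys(text_188) if k not in kept]
```

Here `missing_in_189` holds the subform and widget keys that the newer output no longer emits, which is the "form clutter" the comment describes.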
[jira] [Commented] (TIKA-682) Creative Suite formats support
[ https://issues.apache.org/jira/browse/TIKA-682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14360369#comment-14360369 ] Nick Burch commented on TIKA-682: - [~cdedels] We already have a more specific InDesign magic added - 0x0606edf5d81d46e5bd31efe7fe74b71d - does that cover your file as well? At this point, I think we're hopefully covered for mime magic; the next step is for people to pick a format that interests them and start working on parsers! Creative Suite formats support -- Key: TIKA-682 URL: https://issues.apache.org/jira/browse/TIKA-682 Project: Tika Issue Type: New Feature Components: parser Affects Versions: 1.8 Reporter: Vivian Li Labels: new-parser Attachments: Untitled-1.indd, myfile.psd, myfile.xmp Is it possible to support Creative Suite formats, such as PSD, InDesign, etc.? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
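To check a file against the 16-byte magic quoted above outside of Tika, something like the following would do. It is only a sketch and does not show how Tika's own mime detection is implemented. The function names are made up for illustration, and the magic is assumed to sit at offset 0 of the file.

```python
# The 16-byte InDesign signature quoted in the comment above.
INDESIGN_MAGIC = bytes.fromhex("0606edf5d81d46e5bd31efe7fe74b71d")


def looks_like_indesign(header: bytes) -> bool:
    """True if the given leading bytes start with the InDesign signature."""
    return header.startswith(INDESIGN_MAGIC)


def file_looks_like_indesign(path: str) -> bool:
    """Read only as many bytes as the signature needs and test them."""
    with open(path, "rb") as handle:
        return looks_like_indesign(handle.read(len(INDESIGN_MAGIC)))
```

A header that shares only the first four bytes (0x0606edf5) would not match here, which is the sense in which the longer magic is more specific.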
[jira] [Commented] (TIKA-682) Creative Suite formats support
[ https://issues.apache.org/jira/browse/TIKA-682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14360344#comment-14360344 ] Christopher Dedels commented on TIKA-682: - I looked at my magic file for Adobe InDesign and found that the first four bytes of the file should be 0x0606edf5 (big endian). Hope this helps. Creative Suite formats support -- Key: TIKA-682 URL: https://issues.apache.org/jira/browse/TIKA-682 Project: Tika Issue Type: New Feature Components: parser Affects Versions: 1.8 Reporter: Vivian Li Labels: new-parser Attachments: Untitled-1.indd, myfile.psd, myfile.xmp Is it possible to support Creative Suite formats, such as PSD, InDesign, etc.? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
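For anyone wanting to verify the "big endian" reading of those four bytes, here is a small sketch using Python's standard `struct` module. The helper name `first_u32_big_endian` is made up for illustration.

```python
import struct


def first_u32_big_endian(header: bytes) -> int:
    """Interpret the first four bytes as an unsigned 32-bit big-endian integer.

    Raises struct.error if fewer than four bytes are supplied.
    """
    (value,) = struct.unpack(">I", header[:4])
    return value


sample = bytes([0x06, 0x06, 0xED, 0xF5])
as_big_endian = first_u32_big_endian(sample)
# The same bytes read as little endian give a different number.
(as_little_endian,) = struct.unpack("<I", sample)
```

Comparing `as_big_endian` with 0x0606EDF5 is equivalent to comparing the raw bytes 06 06 ed f5 in file order.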