[jira] [Commented] (TIKA-761) Provide version number by CLI argument -V
[ https://issues.apache.org/jira/browse/TIKA-761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13134976#comment-13134976 ]

Ingo Renner commented on TIKA-761:
----------------------------------

Got the NPE resolved, it was caused by the changes to the pom. Since adding explicit resource directives Maven didn't copy tika-mimetypes.xml into the jar anymore. Fixed patch coming up...

> Provide version number by CLI argument -V
> -----------------------------------------
>
> Key: TIKA-761
> URL: https://issues.apache.org/jira/browse/TIKA-761
> Project: Tika
> Issue Type: New Feature
> Components: cli, general
> Reporter: Ingo Renner
> Priority: Minor
> Attachments: TIKA-761.diff, TIKA-761.diff, TIKA-761.diff, TIKA-761.diff
>
> I'd like to get the Apache Tika version number through CLI argument -V or --version.
> The patch is trivial and basically finished. The only thing missing (because Java is not my native programming language) is the actual version number. Any hints where I can get that from?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (TIKA-761) Provide version number by CLI argument -V
[ https://issues.apache.org/jira/browse/TIKA-761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13135106#comment-13135106 ]

Ingo Renner commented on TIKA-761:
----------------------------------

Hi Nick and Jukka,

some update on the META-INF approach: The path for the properties file would be /META-INF/maven/org.apache.tika/tika-core/pom.properties

I tried

    String pomPropertiesFile = "/META-INF/maven/" + this.getClass().getPackage().getName() + "/tika-core/pom.properties";
    InputStream pomIs = Tika.class.getResourceAsStream(pomPropertiesFile);

Problem is that getResourceAsStream replaces dots in the path with slashes, except for the last one. So the path becomes something like /META-INF/maven/org/apache/tika/tika-core/pom.properties, leading to an NPE when trying to load the properties from pomIs.

... Leaving us (me?) w/o a way to get to this properties file... Any ideas?
[jira] [Commented] (TIKA-761) Provide version number by CLI argument -V
[ https://issues.apache.org/jira/browse/TIKA-761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13135129#comment-13135129 ]

Jukka Zitting commented on TIKA-761:
------------------------------------

I'd simply hardcode the properties file path as {{/META-INF/maven/org.apache.tika/tika-app/pom.properties}}. It's not going to change any time soon.
[jira] [Commented] (TIKA-761) Provide version number by CLI argument -V
[ https://issues.apache.org/jira/browse/TIKA-761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13135162#comment-13135162 ]

Ingo Renner commented on TIKA-761:
----------------------------------

sure, but the dots will still be replaced with slashes by getResourceAsStream(), so it won't matter really (except for saving the getPackage() call)...
[jira] [Commented] (TIKA-761) Provide version number by CLI argument -V
[ https://issues.apache.org/jira/browse/TIKA-761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13135167#comment-13135167 ]

Jukka Zitting commented on TIKA-761:
------------------------------------

bq. the dots will still be replaced with slashes

Only if the path is relative. If the path starts with a slash, like in {{/META-INF/...}}, no dot replacement will occur.
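
The point above can be sketched in a few lines. Per the Class.getResourceAsStream Javadoc, a name that starts with "/" is looked up as given; only a relative name gets the class's package (dots turned into slashes) prepended. The following is a minimal sketch, not the attached patch: the class name PomVersion, the "unknown" fallback, and the split into read/parse are made up for illustration, and it assumes the Maven-generated pom.properties carries a "version" entry.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

/** Hypothetical helper (not part of Tika): reads "version" from a pom.properties resource. */
final class PomVersion {

    /** Fallback when the resource is missing or unreadable (an assumption of this sketch). */
    static final String UNKNOWN = "unknown";

    // Path suggested in the thread. The leading slash makes the name absolute,
    // so no package prefix is prepended to it during lookup.
    static final String TIKA_APP_POM =
            "/META-INF/maven/org.apache.tika/tika-app/pom.properties";

    /** Looks up the resource relative to the class loader of {@code anchor}. */
    static String read(Class<?> anchor, String absolutePath) {
        // try-with-resources tolerates a null resource, which is what
        // getResourceAsStream returns when nothing is found.
        try (InputStream in = anchor.getResourceAsStream(absolutePath)) {
            return parse(in);
        } catch (IOException e) {
            return UNKNOWN;
        }
    }

    /** Parses a properties stream; a null stream yields the fallback instead of an NPE. */
    static String parse(InputStream in) {
        if (in == null) {
            return UNKNOWN;
        }
        Properties props = new Properties();
        try {
            props.load(in);
        } catch (IOException e) {
            return UNKNOWN;
        }
        return props.getProperty("version", UNKNOWN);
    }

    public static void main(String[] args) {
        // Prints the fallback unless the tika-app jar is on the classpath.
        System.out.println(read(PomVersion.class, TIKA_APP_POM));
    }
}
```

Checking the stream for null before calling Properties.load is what avoids the NPE described earlier in the thread.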
[jira] [Commented] (TIKA-761) Provide version number by CLI argument -V
[ https://issues.apache.org/jira/browse/TIKA-761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13135181#comment-13135181 ]

Ingo Renner commented on TIKA-761:
----------------------------------

oh wow, indeed works. Patch coming!
Re: Google's Compact Language Detector
On Tue, Oct 25, 2011 at 12:12 PM, Michael McCandless luc...@mikemccandless.com wrote:
> Tika seems to have a lot of trouble with Spanish (confuses w/ Galician)
> and Danish (confuses with Dutch).

s/Dutch/Norwegian/

--
lucidimagination.com
Re: Google's Compact Language Detector
On Tue, Oct 25, 2011 at 12:32 PM, Robert Muir rcm...@gmail.com wrote:
> On Tue, Oct 25, 2011 at 12:12 PM, Michael McCandless luc...@mikemccandless.com wrote:
>> Tika seems to have a lot of trouble with Spanish (confuses w/ Galician)
>> and Danish (confuses with Dutch).
>
> s/Dutch/Norwegian/

Woops, thanks!

Mike McCandless
http://blog.mikemccandless.com
Re: Google's Compact Language Detector
On Oct 25, 2011, at 6:12pm, Michael McCandless wrote:
> OK I posted the 3rd post about CLD, this time testing perf by comparing to Tika and language-detection (Google Code project):
> http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html
>
> Net/net all three do very well (>= 97% accuracy); I had to remove 4 languages from consideration because we don't support them.
>
> Tika seems to have a lot of trouble with Spanish (confuses w/ Galician) and Danish (confuses with Dutch).
>
> Also, Tika's performance is substantially slower than the other two... not sure what's up.

I'm not surprised that Tika is slower than CLD, given the highly optimized nature of that code. Though 2 orders of magnitude is...painful.

I took a swing at this a while back, but didn't complete the patch. The main issues I tried to solve were:

- Tika processes all of the text in the document, which (for longer documents) slows it down significantly, versus sampling up to some limit.
- The ProfilingWriter is very inefficient. Every character processed does an array copy, and every three characters triggers a new String().

-- Ken

> http://blog.mikemccandless.com
>
> On Mon, Oct 24, 2011 at 4:53 PM, Michael McCandless luc...@mikemccandless.com wrote:
>> On Mon, Oct 24, 2011 at 2:15 PM, Ken Krugler kkrugler_li...@transpac.com wrote:
>>> Sounds like a great idea - see the recent comment thread on https://issues.apache.org/jira/browse/TIKA-431 for some related discussions.
>>> And there's also https://issues.apache.org/jira/browse/TIKA-539
>>
>> Those do look related (if you swap charset in for language)!
>>
>> It's tricky to know just how much to trust what the server (Content-Type HTTP header) and content (http-equiv meta tag) says, though I do like CLD's approach: they never fully trust what was declared but rather use the declaration as a hint to boost language priors. And then to figure out what priors to assign for each hint they have these tables trained from a large content set (10% of Base).
>>
>> If we have access to a biggish crawl we could presumably do something similar, ie record how often the hint is wrong and translate that into appropriate prior boosts, ie make it a hint instead of fully trusting it.
>>
>> Does anyone know how ICU translates the encoding hint into priors for each encoding?
>>
>>> Also, what will you be using to test language detection? Wikipedia pages?
>>
>> I'm using the corpus from here:
>> http://shuyo.wordpress.com/2011/09/29/langdetect-is-updatedadded-profiles-of-estonian-lithuanian-latvian-slovene-and-so-on/
>>
>> It's a random subset of europarl (1000 strings from each of 21 langs).
>>
>> Wikipedia would be great too!
>>
>> Mike McCandless
>> http://blog.mikemccandless.com

--
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr
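
Ken's two points (sample up to a limit; avoid a per-character array copy and a per-trigram String) can be illustrated with a small sketch. This is not Tika's ProfilingWriter and not Ken's unfinished patch: the class name TrigramSampler, the maxChars parameter, and the idea of packing three UTF-16 chars into one long key are assumptions made for illustration. It only counts raw character trigrams; any normalization a real language profile applies is out of scope.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a bounded trigram counter (not Tika's ProfilingWriter). */
final class TrigramSampler {

    private final Map<Long, Integer> counts = new HashMap<>();
    private final int maxChars; // sampling limit, an assumption of this sketch
    private int seen;           // characters consumed so far
    private long window;        // last three chars, 16 bits each

    TrigramSampler(int maxChars) {
        this.maxChars = maxChars;
    }

    /** Feeds text into the counter; returns false once the sampling limit is reached. */
    boolean append(CharSequence text) {
        for (int i = 0; i < text.length(); i++) {
            if (seen >= maxChars) {
                return false;
            }
            // Shift the newest char into a 48-bit rolling window instead of
            // copying a char buffer or building a String per trigram.
            window = ((window << 16) | text.charAt(i)) & 0xFFFFFFFFFFFFL;
            seen++;
            if (seen >= 3) {
                counts.merge(window, 1, Integer::sum);
            }
        }
        return seen < maxChars;
    }

    /** Returns how often the given three-character string was seen. */
    int count(String trigram) {
        long key = ((long) trigram.charAt(0) << 32)
                | ((long) trigram.charAt(1) << 16)
                | trigram.charAt(2);
        return counts.getOrDefault(key, 0);
    }
}
```

A caller would stop feeding text as soon as append returns false. The map still boxes its Long keys, so a primitive-keyed map would be the next step if allocation remained a bottleneck.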
[jira] [Updated] (TIKA-605) Tika GDAL parser
[ https://issues.apache.org/jira/browse/TIKA-605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-605:
-----------------------------------

Fix Version/s: (was: 1.0)
               1.1

- push out to 1.1: prep for 1.0.

> Tika GDAL parser
> ----------------
>
> Key: TIKA-605
> URL: https://issues.apache.org/jira/browse/TIKA-605
> Project: Tika
> Issue Type: New Feature
> Components: parser
> Environment: indep. of env.
> Reporter: Chris A. Mattmann
> Assignee: Chris A. Mattmann
> Labels: gdal, integration, tika
> Fix For: 1.1
> Attachments: 0001-TIKA-605-Tika-GDAL-parser.patch, TIKA-605.Mattmann.092511.patch.txt
>
> Leverage the GDAL toolkit and its Java SWIG bindings to create a Tika parser around GDAL. See here:
> http://trac.osgeo.org/gdal/browser/trunk/gdal/swig/java/apps/gdalinfo.java
[jira] [Updated] (TIKA-754) Automatic line break insertion (BR element) instead of '\n' in XHTMLContentHandler
[ https://issues.apache.org/jira/browse/TIKA-754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-754:
-----------------------------------

Fix Version/s: (was: 1.0)
               1.1

- push out to 1.1: prep for 1.0.

> Automatic line break insertion (BR element) instead of '\n' in XHTMLContentHandler
> ----------------------------------------------------------------------------------
>
> Key: TIKA-754
> URL: https://issues.apache.org/jira/browse/TIKA-754
> Project: Tika
> Issue Type: Improvement
> Affects Versions: 0.10, 1.0
> Reporter: Pablo Queixalos
> Priority: Minor
> Fix For: 1.1
> Attachments: TIKA-754.poc.patch
>
> As seen with some parsers (PDF, PPT), some text blocks still contain carriage returns ('\n') in the output XHTML.
> A global fix for this could be located in XHTMLContentHandler.characters(...): analyze the given char array and, whenever a '\n' char is encountered, insert a BR element instead.
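
The idea in the TIKA-754 description can be sketched against the plain SAX API that ships with the JDK. This is not the attached TIKA-754.poc.patch and not Tika's XHTMLContentHandler: the class NewlineToBr and its trace helper are hypothetical, and the sketch assumes every '\n' should become an empty br element in the XHTML namespace.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch: forwards character data, turning each '\n' into a br element. */
final class NewlineToBr {

    static final String XHTML = "http://www.w3.org/1999/xhtml";

    /** Splits the given range on '\n' and emits text runs and br elements to {@code out}. */
    static void characters(ContentHandler out, char[] ch, int start, int length)
            throws SAXException {
        int end = start + length;
        int runStart = start;
        for (int i = start; i < end; i++) {
            if (ch[i] == '\n') {
                if (i > runStart) {
                    out.characters(ch, runStart, i - runStart); // text before the newline
                }
                out.startElement(XHTML, "br", "br", new AttributesImpl());
                out.endElement(XHTML, "br", "br");
                runStart = i + 1; // skip the newline itself
            }
        }
        if (end > runStart) {
            out.characters(ch, runStart, end - runStart); // trailing text
        }
    }

    /** Demo helper: renders the events produced for {@code text} as a compact string. */
    static String trace(String text) {
        StringBuilder sb = new StringBuilder();
        ContentHandler recorder = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                sb.append(ch, start, length);
            }

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                sb.append('<').append(local);
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                sb.append("/>");
            }
        };
        char[] chars = text.toCharArray();
        try {
            characters(recorder, chars, 0, chars.length);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return sb.toString();
    }
}
```

For example, NewlineToBr.trace("a\nb") yields "a<br/>b". Whether such a rewrite belongs in a shared handler or in the individual parsers is the open question the issue raises.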
[jira] [Updated] (TIKA-757) Address TODOs when we upgrade to next POI release (3.8 beta 5)
[ https://issues.apache.org/jira/browse/TIKA-757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-757:
-----------------------------------

Fix Version/s: (was: 1.0)
               1.1

- push out to 1.1: prep for 1.0.

> Address TODOs when we upgrade to next POI release (3.8 beta 5)
> --------------------------------------------------------------
>
> Key: TIKA-757
> URL: https://issues.apache.org/jira/browse/TIKA-757
> Project: Tika
> Issue Type: Improvement
> Reporter: Michael McCandless
> Fix For: 1.1
>
> I'm opening a blanket issue to remind us all to address the TODOs in the sources for when we upgrade to the next POI.
> I think this (a single blanket issue) is better than keeping separate issues open even though they are technically fixed?
> For example, I've committed TIKA-753 (speedups for embedded office docs), yet it included some TODOs for further speedups possible once we upgrade POI. Rather than keeping TIKA-753 (and others like it) open, I think we should resolve them and let this issue cover all the TODOs.
[jira] [Updated] (TIKA-758) Address TODOs when we upgrade to next PDFBox release
[ https://issues.apache.org/jira/browse/TIKA-758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-758:
-----------------------------------

Fix Version/s: (was: 1.0)
               1.1

- push out to 1.1: prep for 1.0.

> Address TODOs when we upgrade to next PDFBox release
> ----------------------------------------------------
>
> Key: TIKA-758
> URL: https://issues.apache.org/jira/browse/TIKA-758
> Project: Tika
> Issue Type: Improvement
> Reporter: Michael McCandless
> Fix For: 1.1
>
> Like TIKA-757 for POI, I'm opening this blanket issue to address any TODOs in the code when we next upgrade PDFBox.
[jira] [Updated] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events
[ https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-715:
-----------------------------------

Fix Version/s: (was: 1.0)
               1.1

- push out to 1.1: prep for 1.0.

> Some parsers produce non-well-formed XHTML SAX events
> -----------------------------------------------------
>
> Key: TIKA-715
> URL: https://issues.apache.org/jira/browse/TIKA-715
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 0.10
> Reporter: Michael McCandless
> Fix For: 1.1
> Attachments: TIKA-715.patch
>
> With TIKA-683 I committed simple, commented out code to SafeContentHandler, to verify that the SAX events produced by the parser have valid (matched) tags. Ie, each startElement(foo) is matched by the closing endElement(foo).
> I only did basic nesting test, plus checking that <p> is never embedded inside another <p>; we could strengthen this further to check that all tags only appear in valid parents...
> I was able to use this to fix issues with the new RTF parser (TIKA-683), but I was surprised that some other parsers failed the new asserts.
> It could be these are relatively minor offenses (eg closing a table w/o closing the tr) and we need not do anything here... but I think it'd be cleaner if all our parsers produced matched, well-formed XHTML events.
> I haven't looked into any of these... it could be they are easy to fix.
> Failures:
> {noformat}
> testOutlookHTMLVersion(org.apache.tika.parser.microsoft.OutlookParserTest)  Time elapsed: 0.032 sec  ERROR!
> java.lang.AssertionError: end tag=body with no startElement
>     at org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:224)
>     at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
>     at org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:210)
>     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:242)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
>     at org.apache.tika.parser.microsoft.OutlookParserTest.testOutlookHTMLVersion(OutlookParserTest.java:158)
> testParseKeynote(org.apache.tika.parser.iwork.IWorkParserTest)  Time elapsed: 0.116 sec  ERROR!
> java.lang.AssertionError: mismatched elements open=tr close=table
>     at org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:226)
>     at org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
>     at org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:252)
>     at org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:287)
>     at org.apache.tika.parser.iwork.KeynoteContentHandler.endElement(KeynoteContentHandler.java:136)
>     at org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
>     at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:601)
>     at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1782)
>     at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2938)
>     at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
>     at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
>     at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
>     at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
>     at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
>     at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
>     at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
>     at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
>     at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
>     at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
>     at org.apache.tika.parser.iwork.IWorkPackageParser.parse(IWorkPackageParser.java:190)
>     at org.apache.tika.parser.iwork.IWorkParserTest.testParseKeynote(IWorkParserTest.java:49)
> testMultipart(org.apache.tika.parser.mail.RFC822ParserTest)  Time elapsed: 0.025 sec  ERROR!
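
The kind of check described in TIKA-715 (every end tag matches the most recently opened element, and p never nests inside p) can be sketched with a simple stack. This is a hypothetical illustration, not the code in SafeContentHandler: the class TagNestingChecker and its check helper are made up, exceptions stand in for the asserts, and the error messages merely mirror the wording seen in the stack traces above.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch of a matched-tag check (not Tika's SafeContentHandler). */
final class TagNestingChecker {

    private final Deque<String> open = new ArrayDeque<>();

    void startElement(String name) {
        if ("p".equals(name) && open.contains("p")) {
            throw new IllegalStateException("p nested inside p");
        }
        open.push(name);
    }

    void endElement(String name) {
        if (open.isEmpty()) {
            throw new IllegalStateException("end tag=" + name + " with no startElement");
        }
        String top = open.pop();
        if (!top.equals(name)) {
            throw new IllegalStateException("mismatched elements open=" + top + " close=" + name);
        }
    }

    void endDocument() {
        if (!open.isEmpty()) {
            throw new IllegalStateException("unclosed element " + open.peek());
        }
    }

    /**
     * Demo helper: replays events written as "+name" (start) or "-name" (end)
     * and returns "ok" or the first violation message.
     */
    static String check(String... events) {
        TagNestingChecker checker = new TagNestingChecker();
        try {
            for (String event : events) {
                String name = event.substring(1);
                if (event.charAt(0) == '+') {
                    checker.startElement(name);
                } else {
                    checker.endElement(name);
                }
            }
            checker.endDocument();
            return "ok";
        } catch (IllegalStateException e) {
            return e.getMessage();
        }
    }
}
```

Replaying the Keynote case as check("+table", "+tr", "-table") reports the same "open=tr close=table" mismatch; validating which parents a tag may appear in would need a table of allowed parent/child pairs on top of this.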
[jira] [Updated] (TIKA-565) Improved OSGi bundling
[ https://issues.apache.org/jira/browse/TIKA-565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann updated TIKA-565:
-----------------------------------

Fix Version/s: (was: 1.0)
               1.1

- push out to 1.1: prep for 1.0.

> Improved OSGi bundling
> ----------------------
>
> Key: TIKA-565
> URL: https://issues.apache.org/jira/browse/TIKA-565
> Project: Tika
> Issue Type: Improvement
> Components: packaging
> Affects Versions: 0.10
> Reporter: Jukka Zitting
> Assignee: Jukka Zitting
> Fix For: 1.1
> Attachments: core-bundle-fix.diff
>
> I'd like to add proper integration tests for tika-bundle and expose the Tika facade object as a service so other bundles could access it easily like this:
>     @Reference
>     private Tika tika;
> It would also be nice to allow other OSGi bundles to expose their Parser implementations as pluggable services and have the Tika bundle automatically pick up and use them along with all the embedded parsers it contains.
Re: Tika is waiting for ODFToolkit to improve ODF file format processing
On Tue, Oct 25, 2011 at 1:03 PM, Michael McCandless luc...@mikemccandless.com wrote:
> On Mon, Oct 24, 2011 at 9:17 AM, Rob Weir robw...@apache.org wrote:
>> On Mon, Oct 24, 2011 at 4:54 AM, Devin Han devin...@apache.org wrote:
>>> I saw this issue in Tika: OpenOffice parser: master footer text isn't extracted
>>> https://issues.apache.org/jira/browse/TIKA-736
>>>
>>> The current ODF parser of Tika doesn't touch the styles part and the embedded document, only meta and content. They are waiting for the first ODF Toolkit incubating release, then switch to a full featured parser much as they have for the POI powered ones.
>>>
>>> The first release is coming and we will have no code update before it. So, I suggest we start the discussion of how to use ODF Toolkit to realize it based on the snapshot.
>>
>> In that JIRA thread Uwe talks about the desire for a streaming/SAX-like API for scanning the ODF documents. I agree. The DOM approach we use with ODF Toolkit is necessary for when you need random, read/write access to a document. But you pay a performance (mainly heap memory) penalty for that flexibility. But if you can organize your program logic into a single-pass read-only approach, then a streaming approach can -- in theory -- perform much better for that restricted use case. But I still wonder how much the underlying ZipInputStream implementation actually manages to stream the deflate algorithm when it unzips ODF's ZIP package.
>>
>> In any case, this is something I'd be interested in working on after we get our initial ODF Toolkit release out. A memory optimized streaming API for read-only, single pass uses.
>
> I agree a more SAX-like (single pass, don't hold stuff in RAM) approach would mostly fit Tika's needs well.
>
> Note that the DOM approach is also used by other parsers Tika wraps (eg PDFBox, POI I think), so this is not a unique challenge for ODF.
>
> Tika's needs are actually quite simple compared to what ODFToolkit can do. Ie, really we just need read-only single pass (document -> text), with some amount of document structure retained (so we know where to put <p>, <div>, <b>, etc., tags).

Is there a list of the complete set of tags you use, or a schema or something?

> For TIKA-736 in particular, it'd be nice to reconstruct each slide so that any text from the master slide/layout is inlined into each slide that uses it, so that the resulting text looks the way it looks when you view the document in OpenOffice. This is the approach we're working towards in TIKA-712 for PPT/X files.

Text box position is ultimately encoded as x,y coordinates on the slide. So the visual appearance on the slide and the order of the text boxes in the document's XML are generally unrelated. But it should be possible to sort the coordinates to get a top-to-bottom, left-to-right reading order. Maybe even with some sensitivity to BiDi. I've certainly seen that use case mentioned by others.

> I imagine to do this you'd need DOM-like access to the master slide / layout / style, and could then use SAX-like single pass for the normal slides.

Well, you could stream one slide at a time, but we'd need to be able to store the complete text contents of each individual slide to do the coordinate sort. But that is not so bad. Presentations tend to be outrageously large based on large images (high color depth, high dpi) rather than large amounts of text.

> TIKA-735 is another issue with the current ODF parser, whereby the text from embedded documents is always placed at the end of the text from the original document, rather than being inlined at the point where the embedding occurred. Seems like a SAX like API would work fine here, ie, we should simply recurse into the embedded doc when we encounter it.

Right.

> Mike McCandless
> http://blog.mikemccandless.com
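
The coordinate sort mentioned in this thread could look roughly like the sketch below: buffer the text boxes of one slide, then order them by y and, within the same y, by x. Everything here is hypothetical (the ReadingOrder and TextBox names, the use of doubles, y growing downward, a left-to-right script); it is not ODF Toolkit or Tika API, and it ignores BiDi and boxes that overlap vertically without sharing the exact same y.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch: orders the text boxes of one slide top-to-bottom, left-to-right. */
final class ReadingOrder {

    /** Minimal stand-in for a positioned text box; not an ODF Toolkit type. */
    static final class TextBox {
        final double x;
        final double y;
        final String text;

        TextBox(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /** Returns the box texts sorted by y (top first), then by x (left first). */
    static List<String> sort(List<TextBox> boxes) {
        List<TextBox> sorted = new ArrayList<>(boxes); // leave the caller's list untouched
        sorted.sort(Comparator
                .comparingDouble((TextBox b) -> b.y)
                .thenComparingDouble(b -> b.x));
        List<String> texts = new ArrayList<>();
        for (TextBox box : sorted) {
            texts.add(box.text);
        }
        return texts;
    }
}
```

Only one slide's boxes need to be held in memory at a time, which fits the "stream one slide at a time" suggestion; a right-to-left script would reverse the x comparison.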
Tika 1.0 RC?
Hey Guys,

I created a 1.1 version in JIRA and pushed all open (~13) issues for 1.0 to 1.1. We now have 32 issues resolved in the current 1.0.

WDYT? Good enough for a 1.0 release? I'm happy to spin the RC tonight or in the next day (PDT). Any objections?

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++