[jira] [Commented] (TIKA-761) Provide version number by CLI argument -V

2011-10-25 Thread Ingo Renner (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13134976#comment-13134976
 ] 

Ingo Renner commented on TIKA-761:
--

Got the NPE resolved; it was caused by the changes to the pom. After I added 
explicit resource directives, Maven no longer copied tika-mimetypes.xml into the 
jar. Fixed patch coming up...

 Provide version number by CLI argument -V
 -

 Key: TIKA-761
 URL: https://issues.apache.org/jira/browse/TIKA-761
 Project: Tika
  Issue Type: New Feature
  Components: cli, general
Reporter: Ingo Renner
Priority: Minor
 Attachments: TIKA-761.diff, TIKA-761.diff, TIKA-761.diff, 
 TIKA-761.diff


 I'd like to get the Apache Tika version number through CLI argument -V or 
 --version. The patch is trivial and basically finished. The only thing 
 missing (because Java is not my native programming language) is the actual 
 version number. Any hints where I can get that from?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (TIKA-761) Provide version number by CLI argument -V

2011-10-25 Thread Ingo Renner (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13135106#comment-13135106
 ] 

Ingo Renner commented on TIKA-761:
--

Hi Nick and Jukka, some update on the META-INF approach:

The path for the properties file would be
/META-INF/maven/org.apache.tika/tika-core/pom.properties

I tried 
String pomPropertiesFile = "/META-INF/maven/"
+ this.getClass().getPackage().getName()
+ "/tika-core/pom.properties";
InputStream pomIs = Tika.class.getResourceAsStream(pomPropertiesFile);

Problem is that getResourceAsStream replaces dots in the path with slashes 
except for the last one. So the path becomes something like 
/META-INF/maven/org/apache/tika/tika-core/pom.properties leading to an NPE when 
trying to load the properties from pomIs. ... Leaving us (me?) w/o a way to get 
to this properties file...

Any ideas?





[jira] [Commented] (TIKA-761) Provide version number by CLI argument -V

2011-10-25 Thread Jukka Zitting (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13135129#comment-13135129
 ] 

Jukka Zitting commented on TIKA-761:


I'd simply hardcode the properties file path as 
{{/META-INF/maven/org.apache.tika/tika-app/pom.properties}}. It's not going to 
change any time soon.





[jira] [Commented] (TIKA-761) Provide version number by CLI argument -V

2011-10-25 Thread Ingo Renner (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13135162#comment-13135162
 ] 

Ingo Renner commented on TIKA-761:
--

sure, but the dots will still be replaced with slashes by 
getResourceAsStream(), so it won't matter really (except for saving the 
getPackage() call)...





[jira] [Commented] (TIKA-761) Provide version number by CLI argument -V

2011-10-25 Thread Jukka Zitting (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13135167#comment-13135167
 ] 

Jukka Zitting commented on TIKA-761:


bq. the dots will still be replaced with slashes

Only if the path is relative. If the path starts with a slash, like in 
{{/META-INF/...}}, no dot replacement will occur.
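A minimal sketch of this approach follows. The class and method names are hypothetical and not taken from the actual patch. It assumes the jar contains the Maven-generated pom.properties file, which carries a "version" key; if the resource is missing, the helper returns a fallback instead of running into the NPE.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Properties;

// Hypothetical helper class, for illustration only.
final class VersionLookup {

    // Hardcoded absolute resource path. The leading slash makes
    // Class.getResourceAsStream treat the name as absolute, so no
    // package-derived prefix is prepended to it.
    static final String POM_PROPERTIES =
            "/META-INF/maven/org.apache.tika/tika-app/pom.properties";

    // Returns the "version" property from the given classpath resource,
    // or the fallback when the resource is missing or unreadable.
    static String readVersion(String resourcePath, String fallback) {
        try (InputStream in = VersionLookup.class.getResourceAsStream(resourcePath)) {
            if (in == null) {
                return fallback; // resource not on the classpath
            }
            Properties props = new Properties();
            props.load(in);
            return props.getProperty("version", fallback);
        } catch (IOException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(readVersion(POM_PROPERTIES, "unknown"));
    }
}
```

The null check matters because getResourceAsStream signals a missing resource by returning null rather than throwing.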





[jira] [Commented] (TIKA-761) Provide version number by CLI argument -V

2011-10-25 Thread Ingo Renner (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13135181#comment-13135181
 ] 

Ingo Renner commented on TIKA-761:
--

oh wow, indeed works. Patch coming!





Re: Google's Compact Language Detector

2011-10-25 Thread Robert Muir
On Tue, Oct 25, 2011 at 12:12 PM, Michael McCandless
luc...@mikemccandless.com wrote:

 Tika seems to have a lot of trouble with Spanish (confuses w/
 Galician) and Danish (confuses with Dutch).

s/Dutch/Norwegian/



-- 
lucidimagination.com


Re: Google's Compact Language Detector

2011-10-25 Thread Michael McCandless
On Tue, Oct 25, 2011 at 12:32 PM, Robert Muir rcm...@gmail.com wrote:
 On Tue, Oct 25, 2011 at 12:12 PM, Michael McCandless
 luc...@mikemccandless.com wrote:

 Tika seems to have a lot of trouble with Spanish (confuses w/
 Galician) and Danish (confuses with Dutch).

 s/Dutch/Norwegian/

Woops, thanks!

Mike McCandless

http://blog.mikemccandless.com


Re: Google's Compact Language Detector

2011-10-25 Thread Ken Krugler

On Oct 25, 2011, at 6:12pm, Michael McCandless wrote:

 OK I posted the 3rd post about CLD, this time testing perf by
 comparing to Tika and language-detection (Google Code project):
 

 http://blog.mikemccandless.com/2011/10/accuracy-and-performance-of-googles.html
 
 Net/net all three do very well (= 97% accuracy); I had to remove 4
 languages from consideration because we don't support them.
 
 Tika seems to have a lot of trouble with Spanish (confuses w/
 Galician) and Danish (confuses with Dutch).
 
 Also, Tika's performance is substantially slower than the other two... not
 sure what's up.

I'm not surprised that Tika is slower than CLD, given the highly optimized 
nature of that code. Though 2 orders of magnitude is...painful.

I took a swing at this a while back, but didn't complete the patch.

The main issues I tried to solve were:

 - Tika processes all of the text in the document, which (for longer documents) 
slows it down significantly, versus sampling up to some limit.

 - The ProfilingWriter is very inefficient. Every character processed does an 
array copy, and every three characters triggers a new String()
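A rough sketch of how both points could be addressed, assuming a simple trigram counter. This is not the unfinished patch and not Tika's ProfilingWriter; the class name, the packed-long key, and the maxChars limit are all hypothetical. It keeps the previous two characters in fields instead of copying an array, builds no String per trigram, and stops counting once the sample limit is reached.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical trigram counter, for illustration only.
final class TrigramSampler {
    private final int maxChars;                        // sampling limit
    private final Map<Long, int[]> counts = new HashMap<>();
    private char c0, c1;                               // previous two characters
    private int seen;                                  // characters accepted so far

    TrigramSampler(int maxChars) {
        this.maxChars = maxChars;
    }

    // Packs three 16-bit chars into one long so that no String is needed.
    private static long key(char a, char b, char c) {
        return ((long) a << 32) | ((long) b << 16) | c;
    }

    // Feeds one character; returns false once the sample limit is reached.
    boolean add(char c) {
        if (seen >= maxChars) {
            return false;
        }
        seen++;
        if (seen >= 3) {
            counts.computeIfAbsent(key(c0, c1, c), k -> new int[1])[0]++;
        }
        c0 = c1;
        c1 = c;
        return true;
    }

    // Returns how often the given three-character sequence was counted.
    int count(String trigram) {
        int[] n = counts.get(key(trigram.charAt(0), trigram.charAt(1), trigram.charAt(2)));
        return n == null ? 0 : n[0];
    }
}
```

The map keys are still boxed, so this is not allocation-free; a primitive-keyed map would be the next step if profiling showed that to matter.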

-- Ken

 http://blog.mikemccandless.com
 
 On Mon, Oct 24, 2011 at 4:53 PM, Michael McCandless
 luc...@mikemccandless.com wrote:
 On Mon, Oct 24, 2011 at 2:15 PM, Ken Krugler
 kkrugler_li...@transpac.com wrote:
 
 Sounds like a great idea - see the recent comment thread on 
 https://issues.apache.org/jira/browse/TIKA-431 for some related discussions.
 
 And there's also https://issues.apache.org/jira/browse/TIKA-539
 
 Those do look related (if you swap charset in for language)!
 
 It's tricky to know just how much to trust what the server
 (Content-Type HTTP header) and content (http-equiv meta tag) says,
 though I do like CLD's approach: they never fully trust what was
 declared but rather use the declaration as a hint to boost language
 priors.
 
 And then to figure out what priors to assign for each hint they have
 these tables trained from a large content set (10% of Base).
 
 If we have access to a biggish crawl we could presumably do something
 similar, ie record how often the hint is wrong and translate that into
 appropriate prior boosts, ie make it a hint instead of fully trusting
 it.
 
 Does anyone know how ICU translates the encoding hint into priors
 for each encoding?
 
 Also, what will you be using to test language detection? Wikipedia pages?
 
 I'm using the corpus from here:
 

 http://shuyo.wordpress.com/2011/09/29/langdetect-is-updatedadded-profiles-of-estonian-lithuanian-latvian-slovene-and-so-on/
 
 It's a random subset of europarl (1000 strings from each of 21 langs).
 
 Wikipedia would be great too!
 
 Mike McCandless
 
 http://blog.mikemccandless.com
 

--
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr





[jira] [Updated] (TIKA-605) Tika GDAL parser

2011-10-25 Thread Chris A. Mattmann (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-605?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-605:
---

Fix Version/s: (was: 1.0)
   1.1

- push out to 1.1: prep for 1.0.

 Tika GDAL parser
 

 Key: TIKA-605
 URL: https://issues.apache.org/jira/browse/TIKA-605
 Project: Tika
  Issue Type: New Feature
  Components: parser
 Environment: indep. of env.
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
  Labels: gdal, integration, tika
 Fix For: 1.1

 Attachments: 0001-TIKA-605-Tika-GDAL-parser.patch, 
 TIKA-605.Mattmann.092511.patch.txt


 Leverage the GDAL toolkit and its Java SWIG bindings to create a Tika parser 
 around GDAL. See here: 
 http://trac.osgeo.org/gdal/browser/trunk/gdal/swig/java/apps/gdalinfo.java





[jira] [Updated] (TIKA-754) Automatic line break insertion (BR element) instead of '\n' in XHTMLContentHandler

2011-10-25 Thread Chris A. Mattmann (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-754:
---

Fix Version/s: (was: 1.0)
   1.1

- push out to 1.1: prep for 1.0.

 Automatic line break insertion (BR element) instead of '\n' in 
 XHTMLContentHandler
 --

 Key: TIKA-754
 URL: https://issues.apache.org/jira/browse/TIKA-754
 Project: Tika
  Issue Type: Improvement
Affects Versions: 0.10, 1.0
Reporter: Pablo Queixalos
Priority: Minor
 Fix For: 1.1

 Attachments: TIKA-754.poc.patch


 As seen with some parsers (PDF, PPT), some text blocks still contain 
 newline characters ('\n') in the outputted XHTML. 
 A global fix for this could be located in XHTMLContentHandler.characters(...).
 By analyzing the given char array, when a '\n' char is encountered insert a 
 BR element instead.
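 The idea can be sketched with plain SAX classes from the JDK. This is not the attached proof-of-concept patch and not XHTMLContentHandler itself; the LineBreakFilter class is a hypothetical stand-in that wraps any downstream ContentHandler and emits an empty br element for each '\n' it finds in a characters() call.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter, for illustration only.
final class LineBreakFilter extends XMLFilterImpl {
    private static final String XHTML = "http://www.w3.org/1999/xhtml";

    LineBreakFilter(ContentHandler downstream) {
        setContentHandler(downstream);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        int end = start + length;
        int runStart = start; // start of the current run without '\n'
        for (int i = start; i < end; i++) {
            if (ch[i] == '\n') {
                if (i > runStart) {
                    super.characters(ch, runStart, i - runStart);
                }
                // Replace the newline with an empty <br/> element.
                super.startElement(XHTML, "br", "br", new AttributesImpl());
                super.endElement(XHTML, "br", "br");
                runStart = i + 1;
            }
        }
        if (end > runStart) {
            super.characters(ch, runStart, end - runStart);
        }
    }
}
```

 Text between newlines is forwarded in slices of the original array, so no copy is made.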





[jira] [Updated] (TIKA-757) Address TODOs when we upgrade to next POI release (3.8 beta 5)

2011-10-25 Thread Chris A. Mattmann (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-757:
---

Fix Version/s: (was: 1.0)
   1.1

- push out to 1.1: prep for 1.0.

 Address TODOs when we upgrade to next POI release (3.8 beta 5)
 --

 Key: TIKA-757
 URL: https://issues.apache.org/jira/browse/TIKA-757
 Project: Tika
  Issue Type: Improvement
Reporter: Michael McCandless
 Fix For: 1.1


 I'm opening a blanket issue to remind us all to address the TODOs in the 
 sources for when we upgrade to the next POI.
 I think this (a single blanket issue) is better than keeping separate issues 
 open even though they are technically fixed?
 For example, I've committed TIKA-753 (speedups for embedded office docs), yet 
 it included some TODOs for further speedups possible once we upgrade POI.  
 Rather than keeping TIKA-753 (and others like it) open, I think we should 
 resolve them and let this issue cover all the TODOs.





[jira] [Updated] (TIKA-758) Address TODOs when we upgrade to next PDFBox release

2011-10-25 Thread Chris A. Mattmann (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-758:
---

Fix Version/s: (was: 1.0)
   1.1

- push out to 1.1: prep for 1.0.

 Address TODOs when we upgrade to next PDFBox release
 

 Key: TIKA-758
 URL: https://issues.apache.org/jira/browse/TIKA-758
 Project: Tika
  Issue Type: Improvement
Reporter: Michael McCandless
 Fix For: 1.1


 Like TIKA-757 for POI, I'm opening this blanket issue to address any TODOs in 
 the code when we next upgrade PDFBox.





[jira] [Updated] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events

2011-10-25 Thread Chris A. Mattmann (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-715:
---

Fix Version/s: (was: 1.0)
   1.1

- push out to 1.1: prep for 1.0.

 Some parsers produce non-well-formed XHTML SAX events
 -

 Key: TIKA-715
 URL: https://issues.apache.org/jira/browse/TIKA-715
 Project: Tika
  Issue Type: Bug
  Components: parser
Affects Versions: 0.10
Reporter: Michael McCandless
 Fix For: 1.1

 Attachments: TIKA-715.patch


 With TIKA-683 I committed simple, commented out code to
 SafeContentHandler, to verify that the SAX events produced by the
 parser have valid (matched) tags.  Ie, each startElement(foo) is
 matched by the closing endElement(foo).
 I only did basic nesting test, plus checking that p is never
 embedded inside another p; we could strengthen this further to check
 that all tags only appear in valid parents...
 I was able to use this to fix issues with the new RTF parser
 (TIKA-683), but I was surprised that some other parsers failed the new
 asserts.
 It could be these are relatively minor offenses (eg closing a table
 w/o closing the tr) and we need not do anything here... but I think
 it'd be cleaner if all our parsers produced matched, well-formed XHTML
 events.
 I haven't looked into any of these... it could be they are easy to fix.
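 A matched-tag check of the kind described above can be sketched with a stack of open element names. This is not the code committed to SafeContentHandler; the TagBalanceChecker class is hypothetical, and its messages merely imitate the ones in the failures listed in this issue. It assumes namespace-aware SAX events, where localName is populated.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical checker, for illustration only.
final class TagBalanceChecker extends DefaultHandler {
    private final Deque<String> open = new ArrayDeque<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // A p element must never appear inside another open p element.
        if ("p".equals(localName) && open.contains("p")) {
            throw new AssertionError("p nested inside another p");
        }
        open.push(localName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (open.isEmpty()) {
            throw new AssertionError("end tag=" + localName + " with no startElement");
        }
        String top = open.pop();
        if (!top.equals(localName)) {
            throw new AssertionError("mismatched elements open=" + top + " close=" + localName);
        }
    }

    @Override
    public void endDocument() {
        if (!open.isEmpty()) {
            throw new AssertionError("unclosed elements: " + open);
        }
    }
}
```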
 Failures:
 {noformat}
 testOutlookHTMLVersion(org.apache.tika.parser.microsoft.OutlookParserTest)  
 Time elapsed: 0.032 sec   ERROR!
 java.lang.AssertionError: end tag=body with no startElement
   at 
 org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:224)
   at 
 org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
   at 
 org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:210)
   at 
 org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:242)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at 
 org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
   at 
 org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
   at 
 org.apache.tika.parser.microsoft.OutlookParserTest.testOutlookHTMLVersion(OutlookParserTest.java:158)
 testParseKeynote(org.apache.tika.parser.iwork.IWorkParserTest)  Time elapsed: 
 0.116 sec   ERROR!
 java.lang.AssertionError: mismatched elements open=tr close=table
   at 
 org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:226)
   at 
 org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
   at 
 org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:252)
   at 
 org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:287)
   at 
 org.apache.tika.parser.iwork.KeynoteContentHandler.endElement(KeynoteContentHandler.java:136)
   at 
 org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
   at 
 com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:601)
   at 
 com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1782)
   at 
 com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2938)
   at 
 com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
   at 
 com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
   at 
 com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
   at 
 com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
   at 
 com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
   at 
 com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
   at 
 com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
   at 
 com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
   at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
   at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
   at 
 org.apache.tika.parser.iwork.IWorkPackageParser.parse(IWorkPackageParser.java:190)
   at 
 org.apache.tika.parser.iwork.IWorkParserTest.testParseKeynote(IWorkParserTest.java:49)
 testMultipart(org.apache.tika.parser.mail.RFC822ParserTest)  Time elapsed: 
 0.025 sec   ERROR!
 

[jira] [Updated] (TIKA-565) Improved OSGi bundling

2011-10-25 Thread Chris A. Mattmann (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated TIKA-565:
---

Fix Version/s: (was: 1.0)
   1.1

- push out to 1.1: prep for 1.0.

 Improved OSGi bundling
 --

 Key: TIKA-565
 URL: https://issues.apache.org/jira/browse/TIKA-565
 Project: Tika
  Issue Type: Improvement
  Components: packaging
Affects Versions: 0.10
Reporter: Jukka Zitting
Assignee: Jukka Zitting
 Fix For: 1.1

 Attachments: core-bundle-fix.diff


 I'd like to add proper integration tests for tika-bundle and expose the Tika 
 facade object as a service so other bundles could access it easily like this:
 @Reference
 private Tika tika;
 It would also be nice to allow other OSGi bundles to expose their Parser 
 implementations as pluggable services and have the Tika bundle automatically 
 pick up and use them along with all the embedded parsers it contains.





Re: Tika is waiting for ODFToolkit to improve ODF file format processing

2011-10-25 Thread Rob Weir
On Tue, Oct 25, 2011 at 1:03 PM, Michael McCandless
luc...@mikemccandless.com wrote:
 On Mon, Oct 24, 2011 at 9:17 AM, Rob Weir robw...@apache.org wrote:
 On Mon, Oct 24, 2011 at 4:54 AM, Devin Han devin...@apache.org wrote:
 I saw this issue in Tika: OpenOffice parser: master footer text isn't
 extracted https://issues.apache.org/jira/browse/TIKA-736

 The current ODF parser of Tika doesn't touch the styles part and the embedded
 document, only meta and content. They are waiting for the first ODF Toolkit
 incubating release, and will then switch to a full-featured parser, much as
 they have for the POI-powered ones.

 The first release is coming and we will have no code updates before it. So I
 suggest we start discussing how to use ODF Toolkit to implement this, based
 on the snapshot.


 In that JIRA thread Uwe talks about the desire for a
 streaming/SAX-like API for scanning the ODF documents.  I agree.  The
 DOM approach we use with ODF Toolkit is necessary for when you need
 random, read/write access to a document.  But you pay a performance
 (mainly heap memory) penalty for that flexibility.  But if you can
 organize your program logic into a single-pass read-only approach,
 then a streaming approach can -- in theory -- perform much better for
 that restricted use case.  But I still wonder how much the underlying
 ZipInputStream implementation actually manages to stream the deflate
 algorithm when it unzips ODF's ZIP package.

 In any case, this is something I'd be interested in working on after
 we get our initial ODF Toolkit release out.  A memory optimized
 streaming API for read-only, single pass uses.
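 As a minimal sketch of the single-pass, read-only approach, the following uses only JDK classes (ZipInputStream plus a SAX parser). It is not ODF Toolkit API; the OdfTextStreamer class is hypothetical. It relies on the ODF package convention that document content lives in a content.xml entry, and it simply concatenates all character data without retaining any structure.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.XMLConstants;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical single-pass text extractor, for illustration only.
final class OdfTextStreamer {

    static String extractText(InputStream odfPackage)
            throws IOException, SAXException, ParserConfigurationException {
        final StringBuilder text = new StringBuilder();
        DefaultHandler collector = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);

        try (ZipInputStream zip = new ZipInputStream(odfPackage)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("content.xml".equals(entry.getName())) {
                    // The parser reads the entry as it is inflated; the
                    // stream reports end-of-file at the end of the entry.
                    factory.newSAXParser().parse(zip, collector);
                    break; // the parser may close the stream, so stop here
                }
            }
        }
        return text.toString();
    }
}
```

 No DOM is built, so heap use is bounded by the collected text rather than by the size of the XML tree.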

 I agree a more SAX-like (single pass, don't hold stuff in RAM)
 approach would mostly fit Tika's needs well.

 Note that the DOM approach is also used by other parsers Tika wraps
 (eg PDFBox, POI I think), so this is not a unique challenge for
 ODF.

 Tika's needs are actually quite simple compared to what ODFToolkit can
 do.

 Ie, really we just need read-only single pass (document - text), with
 some amount of document structure retained (so we know where to put
 p, div, b, etc., tags).


Is there a list of the complete set of tags you use, or a schema or something?

 For TIKA-736 in particular, it'd be nice to reconstruct each slide
 so that any text from the master slide/layout is inlined into each
 slide that uses it, so that the resulting text looks the way it looks
 when you view the document in OpenOffice.  This is the approach we're
 working towards in TIKA-712 for PPT/X files.


Text box position is ultimately encoded as x,y coordinates on the
slide.  So the visual appearance on the slide and the order of the
text boxes in the document's XML are generally unrelated.  But it
should be possible to sort the coordinates to get a top-to-bottom,
left-to-right reading order.  Maybe even with some sensitivity to
BiDi.

I've certainly seen that use case mentioned by others.
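
The coordinate sort could look roughly like this. The TextBox type and its fields are hypothetical, not ODF Toolkit classes, and the sketch assumes y grows downward and ignores BiDi entirely. Boxes are ordered by y first and by x within the same y.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical reading-order helper, for illustration only.
final class ReadingOrder {

    // Minimal stand-in for a positioned text box on a slide.
    static final class TextBox {
        final double x;
        final double y;
        final String text;

        TextBox(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    // Returns a copy sorted top-to-bottom, then left-to-right.
    static List<TextBox> sort(List<TextBox> boxes) {
        List<TextBox> sorted = new ArrayList<>(boxes);
        sorted.sort(Comparator
                .comparingDouble((TextBox b) -> b.y)
                .thenComparingDouble(b -> b.x));
        return sorted;
    }
}
```

Real slides would probably need some tolerance when comparing y values, since boxes on the same visual row rarely share an exact coordinate.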

 I imagine to do this you'd need DOM-like access to the master slide /
 layout / style, and could then use SAX-like single pass for the
 normal slides.


Well, you could stream one slide at a time, but we'd need to be able
to store the complete text contents of each individual slide to do the
coordinate sort.  But that is not so bad.  Presentations tend to be
outrageously large based on large images (high color depth, high dpi)
rather than large amounts of text.

 TIKA-735 is another issue with the current ODF parser, whereby the text
 from embedded documents is always placed at the end of the text from
 the original document, rather than being inlined at the point where
 the embedding occurred.  Seems like a SAX like API would work fine
 here, ie, we should simply recurse into the embedded doc when we
 encounter it.


Right.

 Mike McCandless

 http://blog.mikemccandless.com



Tika 1.0 RC?

2011-10-25 Thread Mattmann, Chris A (388J)
Hey Guys,

I created a 1.1 version in JIRA and pushed all open (~13) issues for 1.0 to 1.1.

We now have 32 issues resolved in the current 1.0. WDYT? Good enough 
for a 1.0 release? I'm happy to spin the RC tonight or in the next day (PDT).

Any objections?

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++