Re: 1.0 RC in next 2 weeks

2011-09-15 Thread Mattmann, Chris A (388J)
Hey Kevin,

At one point we talked about a 0.9.1 type of point release, 
but decided upon 1.0 instead:

http://s.apache.org/Jvs

That said, I'm happy when the dev community of Tika is ready 
to cut a release, and will gladly RC it. It's one of my favorite 
things to do!

Cheers,
Chris

On Sep 15, 2011, at 3:32 PM, Kevin Clark wrote:

> In light of the recent file handle bug (via parseToString) would it be
> possible to get a point release in the meantime?
> 
> On Thu, Sep 15, 2011 at 3:06 PM, Mattmann, Chris A (388J)
>  wrote:
>> Hi there Jan,
>> 
>> I was hoping to have time to spin it up by now, but haven't yet. Plus 
>> there's a ton of bug fixes
>> and development going on, so I'm going to let things settle a bit.
>> 
>> Jukka and I have been communicating with Sally and press@ and we hope to 
>> have a 1.0 RC
>> and release in time for ApacheCon NA 2011 in November, so I think that's the 
>> target. Would be
>> great to get it before then though!
>> 
>> Cheers,
>> Chris
>> 
>> On Sep 15, 2011, at 2:59 PM, Jan Høydahl wrote:
>> 
>>> Hi,
>>> 
>>> I'm planning on upgrading Solr ExtractingHandler from Tika 0.8 to 0.9, for 
>>> next release. Is 1.0 close enough that I should wait, e.g. within next 3-4 
>>> weeks?
>>> 
>>> --
>>> Jan Høydahl, search solution architect
>>> Cominvent AS - www.cominvent.com
>>> Solr Training - www.solrtraining.com
>>> 
>>> On 3. aug. 2011, at 13:31, Jukka Zitting wrote:
>>> 
>>>> Hi,
>>>>
>>>> On Wed, Aug 3, 2011 at 1:32 AM, Mattmann, Chris A (388J)
>>>>  wrote:
>>>>> If others have items on their 1.0 TODO list, please wrap them up in the
>>>>> next week or so.
>>>>> I'll try and cut the 1.0 RC-1 by next weekend if that works for everyone.
>>>>
>>>> Thanks for pushing this forward!
>>>>
>>>> I'm back from a long vacation now and should be able to help clean up
>>>> things in preparation for the release. I think the main thing on my
>>>> part would be getting rid of the deprecated stuff on the Parser
>>>> interface and elsewhere, and of course come up with a migration
>>>> document that outlines how to update code that still uses the old
>>>> method signatures.
>>>>
>>>> BR,
>>>>
>>>> Jukka Zitting
>>> 
>> 
>> 
>> ++
>> Chris Mattmann, Ph.D.
>> Senior Computer Scientist
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 171-266B, Mailstop: 171-246
>> Email: chris.a.mattm...@nasa.gov
>> WWW:   http://sunset.usc.edu/~mattmann/
>> ++
>> Adjunct Assistant Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> ++
>> 
>> 


++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++



Re: 1.0 RC in next 2 weeks

2011-09-15 Thread Kevin Clark
In light of the recent file handle bug (via parseToString) would it be
possible to get a point release in the meantime?

On Thu, Sep 15, 2011 at 3:06 PM, Mattmann, Chris A (388J)
 wrote:
> Hi there Jan,
>
> I was hoping to have time to spin it up by now, but haven't yet. Plus there's 
> a ton of bug fixes
> and development going on, so I'm going to let things settle a bit.
>
> Jukka and I have been communicating with Sally and press@ and we hope to have 
> a 1.0 RC
> and release in time for ApacheCon NA 2011 in November, so I think that's the 
> target. Would be
> great to get it before then though!
>
> Cheers,
> Chris
>
> On Sep 15, 2011, at 2:59 PM, Jan Høydahl wrote:
>
>> Hi,
>>
>> I'm planning on upgrading Solr ExtractingHandler from Tika 0.8 to 0.9, for 
>> next release. Is 1.0 close enough that I should wait, e.g. within next 3-4 
>> weeks?
>>
>> --
>> Jan Høydahl, search solution architect
>> Cominvent AS - www.cominvent.com
>> Solr Training - www.solrtraining.com
>>
>> On 3. aug. 2011, at 13:31, Jukka Zitting wrote:
>>
>>> Hi,
>>>
>>> On Wed, Aug 3, 2011 at 1:32 AM, Mattmann, Chris A (388J)
>>>  wrote:
>>>> If others have items on their 1.0 TODO list, please wrap them up in the
>>>> next week or so.
>>>> I'll try and cut the 1.0 RC-1 by next weekend if that works for everyone.
>>>
>>> Thanks for pushing this forward!
>>>
>>> I'm back from a long vacation now and should be able to help clean up
>>> things in preparation for the release. I think the main thing on my
>>> part would be getting rid of the deprecated stuff on the Parser
>>> interface and elsewhere, and of course come up with a migration
>>> document that outlines how to update code that still uses the old
>>> method signatures.
>>>
>>> BR,
>>>
>>> Jukka Zitting
>>


Re: 1.0 RC in next 2 weeks

2011-09-15 Thread Mattmann, Chris A (388J)
Hi there Jan,

I was hoping to have time to spin it up by now, but haven't yet. Plus there's a 
ton of bug fixes 
and development going on, so I'm going to let things settle a bit. 

Jukka and I have been communicating with Sally and press@ and we hope to have a 
1.0 RC 
and release in time for ApacheCon NA 2011 in November, so I think that's the 
target. Would be
great to get it before then though!

Cheers,
Chris

On Sep 15, 2011, at 2:59 PM, Jan Høydahl wrote:

> Hi,
> 
> I'm planning on upgrading Solr ExtractingHandler from Tika 0.8 to 0.9, for 
> next release. Is 1.0 close enough that I should wait, e.g. within next 3-4 
> weeks?
> 
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Solr Training - www.solrtraining.com
> 
> On 3. aug. 2011, at 13:31, Jukka Zitting wrote:
> 
>> Hi,
>> 
>> On Wed, Aug 3, 2011 at 1:32 AM, Mattmann, Chris A (388J)
>>  wrote:
>>> If others have items on their 1.0 TODO list, please wrap them up in the 
>>> next week or so.
>>> I'll try and cut the 1.0 RC-1 by next weekend if that works for everyone.
>> 
>> Thanks for pushing this forward!
>> 
>> I'm back from a long vacation now and should be able to help clean up
>> things in preparation for the release. I think the main thing on my
>> part would be getting rid of the deprecated stuff on the Parser
>> interface and elsewhere, and of course come up with a migration
>> document that outlines how to update code that still uses the old
>> method signatures.
>> 
>> BR,
>> 
>> Jukka Zitting
> 





Re: 1.0 RC in next 2 weeks

2011-09-15 Thread Jan Høydahl
Hi,

I'm planning on upgrading Solr ExtractingHandler from Tika 0.8 to 0.9, for next 
release. Is 1.0 close enough that I should wait, e.g. within next 3-4 weeks?

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 3. aug. 2011, at 13:31, Jukka Zitting wrote:

> Hi,
> 
> On Wed, Aug 3, 2011 at 1:32 AM, Mattmann, Chris A (388J)
>  wrote:
>> If others have items on their 1.0 TODO list, please wrap them up in the next 
>> week or so.
>> I'll try and cut the 1.0 RC-1 by next weekend if that works for everyone.
> 
> Thanks for pushing this forward!
> 
> I'm back from a long vacation now and should be able to help clean up
> things in preparation for the release. I think the main thing on my
> part would be getting rid of the deprecated stuff on the Parser
> interface and elsewhere, and of course come up with a migration
> document that outlines how to update code that still uses the old
> method signatures.
> 
> BR,
> 
> Jukka Zitting



[jira] [Commented] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events

2011-09-15 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13105542#comment-13105542
 ] 

Ken Krugler commented on TIKA-715:
--

Hi Mike - excellent work re adding these checks.

I think as part of this, all of the unit tests in the HTML parser should
be modified to validate the resulting XHTML for compliance w/1.0.
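
As a rough illustration of the kind of check a unit test could run on a parser's serialized XHTML output, here is a minimal sketch using only the JDK's SAX parser. It checks XML well-formedness only, which is weaker than full XHTML 1.0 compliance (that would need DTD or schema validation). The class name WellFormedCheck and the method isWellFormed are hypothetical and are not part of Tika.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical test helper; not part of Tika. */
class WellFormedCheck {

    /** Returns true if the given markup parses as well-formed XML. */
    static boolean isWellFormed(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            // DefaultHandler ignores content events and rethrows fatal errors.
            factory.newSAXParser().parse(
                    new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            // Mismatched or unclosed tags surface here as a parse error.
            return false;
        } catch (Exception e) {
            // Parser configuration or I/O problems are not markup errors.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String good = "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
                + "<body><p>hi</p></body></html>";
        String bad = "<table><tr></table>";
        System.out.println(isWellFormed(good));
        System.out.println(isWellFormed(bad));
    }
}
```

A test could assert isWellFormed(...) on the string produced by whatever serializer the test already uses; how that string is obtained from a given parser is left open here.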

I remember Andrzej (???) bringing up this issue a while ago, in the context of 
providing a general helper class that all parsers could use to generate the 
appropriate  elements, and avoid emitting invalid tags/attributes. There 
was some discussion about this on the list, but I don't think anything came of 
it.

> Some parsers produce non-well-formed XHTML SAX events
> -
>
> Key: TIKA-715
> URL: https://issues.apache.org/jira/browse/TIKA-715
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Michael McCandless
> Fix For: 1.0
>
> Attachments: TIKA-715.patch
>
>
> With TIKA-683 I committed simple, commented out code to
> SafeContentHandler, to verify that the SAX events produced by the
> parser have valid (matched) tags.  Ie, each startElement("foo") is
> matched by the closing endElement("foo").
> I only did basic nesting test, plus checking that <p> is never
> embedded inside another <p>; we could strengthen this further to check
> that all tags only appear in valid parents...
> I was able to use this to fix issues with the new RTF parser
> (TIKA-683), but I was surprised that some other parsers failed the new
> asserts.
> It could be these are relatively minor offenses (eg closing a table
> w/o closing the tr) and we need not do anything here... but I think
> it'd be cleaner if all our parsers produced matched, well-formed XHTML
> events.
> I haven't looked into any of these... it could be they are easy to fix.
> Failures:
> {noformat}
> testOutlookHTMLVersion(org.apache.tika.parser.microsoft.OutlookParserTest)  
> Time elapsed: 0.032 sec  <<< ERROR!
> java.lang.AssertionError: end tag=body with no startElement
>   at 
> org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:224)
>   at 
> org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:210)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:242)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
>   at 
> org.apache.tika.parser.microsoft.OutlookParserTest.testOutlookHTMLVersion(OutlookParserTest.java:158)
> testParseKeynote(org.apache.tika.parser.iwork.IWorkParserTest)  Time elapsed: 
> 0.116 sec  <<< ERROR!
> java.lang.AssertionError: mismatched elements open=tr close=table
>   at 
> org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:226)
>   at 
> org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:252)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:287)
>   at 
> org.apache.tika.parser.iwork.KeynoteContentHandler.endElement(KeynoteContentHandler.java:136)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
>   at 
> com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:601)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1782)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2938)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
>   at 
> com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
>   at 
> com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
>   at 
> com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
>   at 
> com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
>   at 
> com.sun.org.apache.xerc

[jira] [Updated] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events

2011-09-15 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated TIKA-715:


Component/s: parser

> Some parsers produce non-well-formed XHTML SAX events
> -
>
> Key: TIKA-715
> URL: https://issues.apache.org/jira/browse/TIKA-715
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Michael McCandless
> Fix For: 1.0
>
> Attachments: TIKA-715.patch
>
>
> With TIKA-683 I committed simple, commented out code to
> SafeContentHandler, to verify that the SAX events produced by the
> parser have valid (matched) tags.  Ie, each startElement("foo") is
> matched by the closing endElement("foo").
> I only did basic nesting test, plus checking that <p> is never
> embedded inside another <p>; we could strengthen this further to check
> that all tags only appear in valid parents...
> I was able to use this to fix issues with the new RTF parser
> (TIKA-683), but I was surprised that some other parsers failed the new
> asserts.
> It could be these are relatively minor offenses (eg closing a table
> w/o closing the tr) and we need not do anything here... but I think
> it'd be cleaner if all our parsers produced matched, well-formed XHTML
> events.
> I haven't looked into any of these... it could be they are easy to fix.
> Failures:
> {noformat}
> testOutlookHTMLVersion(org.apache.tika.parser.microsoft.OutlookParserTest)  
> Time elapsed: 0.032 sec  <<< ERROR!
> java.lang.AssertionError: end tag=body with no startElement
>   at 
> org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:224)
>   at 
> org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:210)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:242)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
>   at 
> org.apache.tika.parser.microsoft.OutlookParserTest.testOutlookHTMLVersion(OutlookParserTest.java:158)
> testParseKeynote(org.apache.tika.parser.iwork.IWorkParserTest)  Time elapsed: 
> 0.116 sec  <<< ERROR!
> java.lang.AssertionError: mismatched elements open=tr close=table
>   at 
> org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:226)
>   at 
> org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:252)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:287)
>   at 
> org.apache.tika.parser.iwork.KeynoteContentHandler.endElement(KeynoteContentHandler.java:136)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
>   at 
> com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:601)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1782)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2938)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
>   at 
> com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
>   at 
> com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
>   at 
> com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
>   at 
> com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
>   at 
> com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
>   at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
>   at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
>   at 
> org.apache.tika.parser.iwork.IWorkPackageParser.parse(IWorkPackageParser.java:190)
>   at 
> org.apache.tika.parser.iwork.IWorkParserTest.testParseKeynote(IWorkParserTest.java:49)
> testMultipart(org.apache.tika.parser.mail.RFC822ParserTest)  Time elapsed: 
> 0.025 sec  <<< ERROR!
> java.lang.AssertionError

[jira] [Commented] (TIKA-683) RTF Parser issues with non european characters

2011-09-15 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13105499#comment-13105499
 ] 

Michael McCandless commented on TIKA-683:
-

I opened TIKA-715 for the mis-matched XHTML events.

> RTF Parser issues with non european characters
> --
>
> Key: TIKA-683
> URL: https://issues.apache.org/jira/browse/TIKA-683
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 0.9
>Reporter: Nick Burch
>Assignee: Michael McCandless
> Fix For: 1.0
>
> Attachments: TIKA-683-unicode-testcase.patch, TIKA-683.patch, 
> TIKA-683.patch, TIKA-683.patch, TIKA-683.patch, testRTFJapanese.rtf, 
> testUnicodeUCNControlWordCharacterDoubling.rtf, 
> testWORD_bold_character_runs.docx, testWORD_bold_character_runs2.docx
>
>
> As reported on user@ in "non-West European languages support":
>   
> http://mail-archives.apache.org/mod_mbox/tika-user/201107.mbox/%3cof0c0a3275.da7810e9-onc22578cc.0051eede-c22578cc.00525...@il.ibm.com%3E
> The RTF Parser seems to be doubling up some non-european characters

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events

2011-09-15 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated TIKA-715:


Attachment: TIKA-715.patch

Patch turning on the asserts in SafeContentHandler.java.

> Some parsers produce non-well-formed XHTML SAX events
> -
>
> Key: TIKA-715
> URL: https://issues.apache.org/jira/browse/TIKA-715
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Reporter: Michael McCandless
> Fix For: 1.0
>
> Attachments: TIKA-715.patch
>
>
> With TIKA-683 I committed simple, commented out code to
> SafeContentHandler, to verify that the SAX events produced by the
> parser have valid (matched) tags.  Ie, each startElement("foo") is
> matched by the closing endElement("foo").
> I only did basic nesting test, plus checking that <p> is never
> embedded inside another <p>; we could strengthen this further to check
> that all tags only appear in valid parents...
> I was able to use this to fix issues with the new RTF parser
> (TIKA-683), but I was surprised that some other parsers failed the new
> asserts.
> It could be these are relatively minor offenses (eg closing a table
> w/o closing the tr) and we need not do anything here... but I think
> it'd be cleaner if all our parsers produced matched, well-formed XHTML
> events.
> I haven't looked into any of these... it could be they are easy to fix.
> Failures:
> {noformat}
> testOutlookHTMLVersion(org.apache.tika.parser.microsoft.OutlookParserTest)  
> Time elapsed: 0.032 sec  <<< ERROR!
> java.lang.AssertionError: end tag=body with no startElement
>   at 
> org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:224)
>   at 
> org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:210)
>   at 
> org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:242)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at 
> org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>   at 
> org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
>   at 
> org.apache.tika.parser.microsoft.OutlookParserTest.testOutlookHTMLVersion(OutlookParserTest.java:158)
> testParseKeynote(org.apache.tika.parser.iwork.IWorkParserTest)  Time elapsed: 
> 0.116 sec  <<< ERROR!
> java.lang.AssertionError: mismatched elements open=tr close=table
>   at 
> org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:226)
>   at 
> org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:252)
>   at 
> org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:287)
>   at 
> org.apache.tika.parser.iwork.KeynoteContentHandler.endElement(KeynoteContentHandler.java:136)
>   at 
> org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
>   at 
> com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:601)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1782)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2938)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
>   at 
> com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
>   at 
> com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
>   at 
> com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
>   at 
> com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
>   at 
> com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
>   at 
> com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
>   at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
>   at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
>   at 
> org.apache.tika.parser.iwork.IWorkPackageParser.parse(IWorkPackageParser.java:190)
>   at 
> org.apache.tika.parser.iwork.IWorkParserTest.testParseKeynote(IWorkParserTest.java:49)
> testMultipart(org.apache.tika.parser.mail.RFC822ParserTest)  

[jira] [Created] (TIKA-715) Some parsers produce non-well-formed XHTML SAX events

2011-09-15 Thread Michael McCandless (JIRA)
Some parsers produce non-well-formed XHTML SAX events
-

 Key: TIKA-715
 URL: https://issues.apache.org/jira/browse/TIKA-715
 Project: Tika
  Issue Type: Bug
Reporter: Michael McCandless
 Fix For: 1.0


With TIKA-683 I committed simple, commented out code to
SafeContentHandler, to verify that the SAX events produced by the
parser have valid (matched) tags.  Ie, each startElement("foo") is
matched by the closing endElement("foo").

I only did basic nesting test, plus checking that <p> is never
embedded inside another <p>; we could strengthen this further to check
that all tags only appear in valid parents...
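
For readers unfamiliar with this kind of check, the idea can be sketched as a SAX handler that keeps a stack of open element names. This is a minimal illustration, not the actual SafeContentHandler code: the class MatchedTagChecker and its check helper are hypothetical, the error messages simply mirror the ones in the failures listed in this issue, and treating "p inside p" as "a p anywhere inside an open p" is an assumption.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical checker; a sketch, not Tika's SafeContentHandler. */
class MatchedTagChecker extends DefaultHandler {
    // Names of the currently open elements, innermost first.
    private final Deque<String> open = new ArrayDeque<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        // Assumption: reject a p nested anywhere inside another open p.
        if (qName.equals("p") && open.contains("p")) {
            throw new SAXException("p inside p");
        }
        open.push(qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (open.isEmpty()) {
            throw new SAXException("end tag=" + qName + " with no startElement");
        }
        String expected = open.pop();
        if (!expected.equals(qName)) {
            throw new SAXException("mismatched elements open=" + expected + " close=" + qName);
        }
    }

    @Override
    public void endDocument() throws SAXException {
        if (!open.isEmpty()) {
            throw new SAXException("unclosed element " + open.peek());
        }
    }

    /** Replays "+name" (start) and "-name" (end) events; returns the first error, or null. */
    static String check(String... events) {
        MatchedTagChecker checker = new MatchedTagChecker();
        try {
            for (String event : events) {
                String name = event.substring(1);
                if (event.charAt(0) == '+') {
                    checker.startElement("", name, name, new AttributesImpl());
                } else {
                    checker.endElement("", name, name);
                }
            }
            checker.endDocument();
            return null;
        } catch (SAXException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(check("+table", "+tr", "-tr", "-table"));
        System.out.println(check("+table", "+tr", "-table"));
        System.out.println(check("+p", "+p", "-p", "-p"));
    }
}
```

Checking that tags only appear in valid parents, as suggested above, would need a table of allowed parent/child pairs on top of this stack; that is not attempted here.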

I was able to use this to fix issues with the new RTF parser
(TIKA-683), but I was surprised that some other parsers failed the new
asserts.

It could be these are relatively minor offenses (eg closing a table
w/o closing the tr) and we need not do anything here... but I think
it'd be cleaner if all our parsers produced matched, well-formed XHTML
events.

I haven't looked into any of these... it could be they are easy to fix.

Failures:

{noformat}
testOutlookHTMLVersion(org.apache.tika.parser.microsoft.OutlookParserTest)  
Time elapsed: 0.032 sec  <<< ERROR!
java.lang.AssertionError: end tag=body with no startElement
at 
org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:224)
at 
org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
at 
org.apache.tika.sax.XHTMLContentHandler.endDocument(XHTMLContentHandler.java:210)
at 
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:242)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at 
org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at 
org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:129)
at 
org.apache.tika.parser.microsoft.OutlookParserTest.testOutlookHTMLVersion(OutlookParserTest.java:158)



testParseKeynote(org.apache.tika.parser.iwork.IWorkParserTest)  Time elapsed: 
0.116 sec  <<< ERROR!
java.lang.AssertionError: mismatched elements open=tr close=table
at 
org.apache.tika.sax.SafeContentHandler.verifyEndElement(SafeContentHandler.java:226)
at 
org.apache.tika.sax.SafeContentHandler.endElement(SafeContentHandler.java:275)
at 
org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:252)
at 
org.apache.tika.sax.XHTMLContentHandler.endElement(XHTMLContentHandler.java:287)
at 
org.apache.tika.parser.iwork.KeynoteContentHandler.endElement(KeynoteContentHandler.java:136)
at 
org.apache.tika.sax.ContentHandlerDecorator.endElement(ContentHandlerDecorator.java:136)
at 
com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:601)
at 
com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1782)
at 
com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2938)
at 
com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:648)
at 
com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:140)
at 
com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
at 
com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:808)
at 
com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:737)
at 
com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:119)
at 
com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1205)
at 
com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:522)
at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
at 
org.apache.tika.parser.iwork.IWorkPackageParser.parse(IWorkPackageParser.java:190)
at 
org.apache.tika.parser.iwork.IWorkParserTest.testParseKeynote(IWorkParserTest.java:49)



testMultipart(org.apache.tika.parser.mail.RFC822ParserTest)  Time elapsed: 
0.025 sec  <<< ERROR!
java.lang.AssertionError: p inside p
at 
org.apache.tika.sax.SafeContentHandler.verifyStartElement(SafeContentHandler.java:216)
at 
org.apache.tika.sax.SafeContentHandler.startElement(SafeContentHandler.java:245)
at 
org.apache.tika.sax.XHTMLContentHandler.startElement(XHTMLContentHandler.java:241)
at 
org.apache.tika.sax.ContentHandlerDecorator.startElement(ContentHandlerDecorator.java:126)

[jira] [Resolved] (TIKA-683) RTF Parser issues with non european characters

2011-09-15 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved TIKA-683.
-

   Resolution: Fixed
Fix Version/s: 1.0

I'll open a follow-on issue for the mis-matched XHTML events from some 
parsers

> RTF Parser issues with non european characters
> --
>
> Key: TIKA-683
> URL: https://issues.apache.org/jira/browse/TIKA-683
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 0.9
>Reporter: Nick Burch
>Assignee: Michael McCandless
> Fix For: 1.0
>
> Attachments: TIKA-683-unicode-testcase.patch, TIKA-683.patch, 
> TIKA-683.patch, TIKA-683.patch, TIKA-683.patch, testRTFJapanese.rtf, 
> testUnicodeUCNControlWordCharacterDoubling.rtf, 
> testWORD_bold_character_runs.docx, testWORD_bold_character_runs2.docx
>
>
> As reported on user@ in "non-West European languages support":
>   
> http://mail-archives.apache.org/mod_mbox/tika-user/201107.mbox/%3cof0c0a3275.da7810e9-onc22578cc.0051eede-c22578cc.00525...@il.ibm.com%3E
> The RTF Parser seems to be doubling up some non-european characters





[jira] [Commented] (TIKA-683) RTF Parser issues with non european characters

2011-09-15 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13105413#comment-13105413
 ] 

Michael McCandless commented on TIKA-683:
-

Thanks Chris, I'll commit today!

> RTF Parser issues with non european characters
> --
>
> Key: TIKA-683
> URL: https://issues.apache.org/jira/browse/TIKA-683
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 0.9
>Reporter: Nick Burch
>Assignee: Michael McCandless
> Attachments: TIKA-683-unicode-testcase.patch, TIKA-683.patch, 
> TIKA-683.patch, TIKA-683.patch, TIKA-683.patch, testRTFJapanese.rtf, 
> testUnicodeUCNControlWordCharacterDoubling.rtf, 
> testWORD_bold_character_runs.docx, testWORD_bold_character_runs2.docx
>
>
> As reported on user@ in "non-West European languages support":
>   
> http://mail-archives.apache.org/mod_mbox/tika-user/201107.mbox/%3cof0c0a3275.da7810e9-onc22578cc.0051eede-c22578cc.00525...@il.ibm.com%3E
> The RTF Parser seems to be doubling up some non-european characters





[jira] [Resolved] (TIKA-666) Unable to extract content from RTF files

2011-09-15 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/TIKA-666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved TIKA-666.


Resolution: Fixed

- closed per Mike's comment, and the fix for TIKA-683

> Unable to extract content from RTF files
> 
>
> Key: TIKA-666
> URL: https://issues.apache.org/jira/browse/TIKA-666
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 0.8, 0.9
> Environment: Windows 32 bit OS, JDK 1.6.19
>Reporter: samraj
>  Labels: RTF
> Attachments: Redline.rtf
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> HI,
>  I have tried with various set of RTF document to extract the text Content. I 
> have tried so many technique to extract the text from rtf.. Its failed. I 
> have attached the RTF document here





[jira] [Commented] (TIKA-683) RTF Parser issues with non european characters

2011-09-15 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13105362#comment-13105362
 ] 

Chris A. Mattmann commented on TIKA-683:


Hey Mike, +1 to commit, go for it!

> RTF Parser issues with non european characters
> --
>
> Key: TIKA-683
> URL: https://issues.apache.org/jira/browse/TIKA-683
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 0.9
>Reporter: Nick Burch
>Assignee: Michael McCandless
> Attachments: TIKA-683-unicode-testcase.patch, TIKA-683.patch, 
> TIKA-683.patch, TIKA-683.patch, TIKA-683.patch, testRTFJapanese.rtf, 
> testUnicodeUCNControlWordCharacterDoubling.rtf, 
> testWORD_bold_character_runs.docx, testWORD_bold_character_runs2.docx
>
>
> As reported on user@ in "non-West European languages support":
>   
> http://mail-archives.apache.org/mod_mbox/tika-user/201107.mbox/%3cof0c0a3275.da7810e9-onc22578cc.0051eede-c22578cc.00525...@il.ibm.com%3E
> The RTF Parser seems to be doubling up some non-european characters





[jira] [Commented] (TIKA-666) Unable to extract content from RTF files

2011-09-15 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/TIKA-666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13105310#comment-13105310
 ] 

Michael McCandless commented on TIKA-666:
-

It looks like TIKA-683 fixes this issue, or at least I'm able to extract text 
for Redline.rtf.

> Unable to extract content from RTF files
> 
>
> Key: TIKA-666
> URL: https://issues.apache.org/jira/browse/TIKA-666
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 0.8, 0.9
> Environment: Windows 32 bit OS, JDK 1.6.19
>Reporter: samraj
>  Labels: RTF
> Attachments: Redline.rtf
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> HI,
>  I have tried with various set of RTF document to extract the text Content. I 
> have tried so many technique to extract the text from rtf.. Its failed. I 
> have attached the RTF document here
