[jira] Commented: (TIKA-433) Tika + Hadoop

2010-05-26 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871544#action_12871544 ] Julien Nioche commented on TIKA-433: You can do that with [Behemoth|http://code.google.co

[jira] Commented: (TIKA-431) Tika currently misuses the HTTP Content-Encoding header, and does not seem to use the charset part of the Content-Type header properly.

2010-05-26 Thread Jukka Zitting (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871567#action_12871567 ] Jukka Zitting commented on TIKA-431: Agreed, we should be using the charset parameter of

[jira] Commented: (TIKA-430) Automatically let all valid XHTML 1.0 attributes through from HTML documents

2010-05-26 Thread Jukka Zitting (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871568#action_12871568 ] Jukka Zitting commented on TIKA-430: Sounds reasonable, especially since unlike extra con

[jira] Commented: (TIKA-429) Error parsing DTD

2010-05-26 Thread Jukka Zitting (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871570#action_12871570 ] Jukka Zitting commented on TIKA-429: Looks like the input document is incorrectly treated

[jira] Commented: (TIKA-430) Automatically let all valid XHTML 1.0 attributes through from HTML documents

2010-05-26 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-430?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871585#action_12871585 ] Julien Nioche commented on TIKA-430: The method mapSafeAttribute(String elementName, Stri

[jira] Resolved: (TIKA-425) Exception parsing mp3

2010-05-26 Thread Jukka Zitting (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jukka Zitting resolved TIKA-425. Assignee: Jukka Zitting Fix Version/s: 0.8 Resolution: Fixed Thanks for the problem r

[jira] Resolved: (TIKA-428) Unexpected RuntimeException when parsing PPTM (?) file

2010-05-26 Thread Jukka Zitting (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jukka Zitting resolved TIKA-428. Assignee: Jukka Zitting Resolution: Duplicate Yes, this is a duplicate of TIKA-418. > Unexpect

[jira] Commented: (TIKA-418) RuntimeException while getting content for ppsx, ppsm, pptm, thmx and xps file types

2010-05-26 Thread Jukka Zitting (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871594#action_12871594 ] Jukka Zitting commented on TIKA-418: See the duplicate issue TIKA-428 for a stack trace o

[jira] Commented: (TIKA-420) [PATCH] Integration of boilerpipe: Boilerplate Removal and Fulltext Extraction from HTML pages

2010-05-26 Thread Jukka Zitting (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871608#action_12871608 ] Jukka Zitting commented on TIKA-420: Agreed with Ken about using XHTML SAX events instead

[jira] Commented: (TIKA-427) Parsing CSS as XML

2010-05-26 Thread Jukka Zitting (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871611#action_12871611 ] Jukka Zitting commented on TIKA-427: The type detection code in Tika gets confused by the

[jira] Resolved: (TIKA-424) Avoid ArrayIndexOutOfBoundsException on some mp3 files

2010-05-26 Thread Jukka Zitting (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jukka Zitting resolved TIKA-424. Assignee: Jukka Zitting Fix Version/s: 0.8 Resolution: Fixed Thanks! Patch committed

[jira] Commented: (TIKA-418) RuntimeException while getting content for ppsx, ppsm, pptm, thmx and xps file types

2010-05-26 Thread Jukka Zitting (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871614#action_12871614 ] Jukka Zitting commented on TIKA-418: Re: mp3 problem, in fact it was already filed separa

[jira] Commented: (TIKA-433) Tika + Hadoop

2010-05-26 Thread Grant Ingersoll (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871616#action_12871616 ] Grant Ingersoll commented on TIKA-433: Does that mean you are going to extract it from

[jira] Commented: (TIKA-433) Tika + Hadoop

2010-05-26 Thread Julien Nioche (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871623#action_12871623 ] Julien Nioche commented on TIKA-433: Could do. I can't see a place in Tika's code for non

[jira] Resolved: (TIKA-413) DWG Parser

2010-05-26 Thread Jukka Zitting (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jukka Zitting resolved TIKA-413. Assignee: Jukka Zitting Fix Version/s: 0.8 Resolution: Fixed Good stuff! I committed

[jira] Commented: (TIKA-433) Tika + Hadoop

2010-05-26 Thread Grant Ingersoll (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871720#action_12871720 ] Grant Ingersoll commented on TIKA-433: I think it makes sense as a Tika contrib, but th

[jira] Commented: (TIKA-433) Tika + Hadoop

2010-05-26 Thread Jukka Zitting (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871726#action_12871726 ] Jukka Zitting commented on TIKA-433: We could easily add a separate tika-hadoop component

[jira] Commented: (TIKA-433) Tika + Hadoop

2010-05-26 Thread Yonik Seeley (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871742#action_12871742 ] Yonik Seeley commented on TIKA-433: From the peanut gallery, Lucene has gone down the con

Re: Improved handling of attributes

2010-05-26 Thread Mattmann, Chris A (388J)
Hey Ken, I wanted to get back to you on this: > > 1. Ability to allow all attributes through from HTML documents > > TIKA-379, building on TIKA-347, allows both more relaxed passing of > attributes, as well as letting all elements through. > > So if somebody wants to get the "lang" attribute f

Re: Improved handling of attributes

2010-05-26 Thread Jukka Zitting
Hi, On Wed, May 26, 2010 at 3:49 PM, Mattmann, Chris A (388J) wrote: > I'm worried that we're mixing concerns here. Some of the information that > you cite above sounds more to me like metadata (and in fact, thinking about > it, you could argue that attributes themselves on the XHTML amount that

Re: Improved handling of attributes

2010-05-26 Thread Mattmann, Chris A (388J)
Hey Jukka, So you're seeing the delineation more as: * metadata = document level stuff * XHTML = textual representation [which can include finer-grained what I would call "metadata" too] ? If so, interesting, I wonder then if there should be some sort of rethinking then of the way tha

[jira] Commented: (TIKA-402) Support for Keynote and Pages documents

2010-05-26 Thread Jukka Zitting (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871767#action_12871767 ] Jukka Zitting commented on TIKA-402: Latest patch committed in revision 948452, thanks!

Re: Improved handling of attributes

2010-05-26 Thread Jukka Zitting
Hi, On Wed, May 26, 2010 at 5:10 PM, Mattmann, Chris A (388J) wrote: > If so, interesting, I wonder then if there should be some sort of rethinking > then > of the way that we capture or represent the XHTML because I would think that > our existing Metadata object could be reused at that level t

[jira] Assigned: (TIKA-431) Tika currently misuses the HTTP Content-Encoding header, and does not seem to use the charset part of the Content-Type header properly.

2010-05-26 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ken Krugler reassigned TIKA-431: Assignee: Ken Krugler > Tika currently misuses the HTTP Content-Encoding header, and does not seem to

[jira] Commented: (TIKA-431) Tika currently misuses the HTTP Content-Encoding header, and does not seem to use the charset part of the Content-Type header properly.

2010-05-26 Thread Ken Krugler (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871824#action_12871824 ] Ken Krugler commented on TIKA-431: I should have some time soon to do a once-over on a bunc

[jira] Commented: (TIKA-402) Support for Keynote and Pages documents

2010-05-26 Thread Martijn van Groningen (JIRA)
[ https://issues.apache.org/jira/browse/TIKA-402?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871848#action_12871848 ] Martijn van Groningen commented on TIKA-402: Oops... next patch will have 4 space