[jira] [Commented] (NUTCH-1097) application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml

2011-09-02 Thread Ferdy (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13095989#comment-13095989
 ] 

Ferdy commented on NUTCH-1097:
--

After digging into it for a while, I believe the best solution for now is to 
allow regexes in plugin.xml for the attribute contentType. This way multiple 
mimetypes mapped from parse-plugins.xml can be supported by the plugin.xml of 
the individual parser extensions (instead of simply using the wildcard '*').

To keep backwards compatibility, I decided to escape '+' in the contentType 
attribute of extensions, because a lot of mimetypes contain this character. 
This will not break existing functionality. So you can use any regular 
expression supported by the standard Java Pattern, except that '+' is treated 
as a literal character. The wildcard '*' is still usable, because it is 
checked first in ParserFactory. (Otherwise an exception would occur, because 
'*' on its own is not a valid regex.)

To summarize, the latest patch (v3) contains 2 changes:
- ParserFactory matches contentType attribute of extensions using standard Java 
regexes with escaped '+' characters.
- parse-html's plugin.xml has contentType text/html|application/xhtml+xml so 
it's consistent with the default provided parse-plugins.xml.
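
The matching rule described above can be sketched roughly as follows. This is 
an illustration only, not the code from the patch: the class name 
ContentTypeMatcher and the method matches are hypothetical, and the sketch 
assumes the whole mimetype must match the expression.

```java
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative and not taken from the patch.
final class ContentTypeMatcher {

    // Returns true if the extension's contentType attribute accepts the mimetype.
    static boolean matches(String extensionContentType, String mimeType) {
        // The wildcard is checked first, because "*" alone is not a valid regex.
        if ("*".equals(extensionContentType)) {
            return true;
        }
        // Escape '+' so that "application/xhtml+xml" keeps its literal meaning.
        String regex = extensionContentType.replace("+", "\\+");
        // Assumption: the complete mimetype has to match, not just a substring.
        return Pattern.matches(regex, mimeType);
    }

    public static void main(String[] args) {
        String htmlTypes = "text/html|application/xhtml+xml";
        System.out.println(matches(htmlTypes, "text/html"));
        System.out.println(matches(htmlTypes, "application/xhtml+xml"));
        System.out.println(matches(htmlTypes, "application/pdf"));
        System.out.println(matches("*", "application/pdf"));
    }
}
```

With the contentType value from the second change, both text/html and 
application/xhtml+xml are accepted, while an unrelated type such as 
application/pdf is only accepted by an extension that declares the wildcard.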

I'm not arguing these changes should be committed as is in the codebase, but I 
do believe the current situation is not flexible enough (especially the fact 
that many-to-one mappings of parse-plugins.xml cannot be supported by parser 
plugin.xml files). So if you have any suggestions or corrections, feel free to 
reply.

> application/xhtml+xml should be enabled in plugin.xml of parse-html; allow 
> multiple mimetypes for plugin.xml
> 
>
> Key: NUTCH-1097
> URL: https://issues.apache.org/jira/browse/NUTCH-1097
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.3
> Reporter: Ferdy
> Priority: Minor
> Attachments: NUTCH-1097-v1.patch, NUTCH-1097-v2.patch
>
>
> The configuration in parse-plugins.xml expects the parse-html plugin to 
> accept application/xhtml+xml, however the plugin.xml of this plugin does not 
> list this type. Either change the entry in parse-plugins.xml or change the 
> parse-html plugin.xml. I suggest the latter. See patch.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1097) application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml

2011-10-10 Thread Lewis John McGibbney (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13124498#comment-13124498
 ] 

Lewis John McGibbney commented on NUTCH-1097:
-

Hi Ferdy, in general I think this looks OK, and I think you are correct that 
parse-html should also accept application/xhtml+xml. My question is whether 
this would already be covered by parse-tika if, for example, parse-html were 
not included in plugin.includes.
From what I have read, I do not see what benefit this provides over calling 
parse-tika to deal with all application/xhtml+xml mimetypes. Please correct me 
where I am wrong. Thanks

> application/xhtml+xml should be enabled in plugin.xml of parse-html; allow 
> multiple mimetypes for plugin.xml
> 
>
> Key: NUTCH-1097
> URL: https://issues.apache.org/jira/browse/NUTCH-1097
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.3
> Reporter: Ferdy
> Priority: Minor
> Fix For: 1.4
>
> Attachments: NUTCH-1097-nutchgora_v1.patch, NUTCH-1097-v1.patch, 
> NUTCH-1097-v2.patch, NUTCH-1097-v3.patch
>
>
> The configuration in parse-plugins.xml expects the parse-html plugin to 
> accept application/xhtml+xml, however the plugin.xml of this plugin does not 
> list this type. Either change the entry in parse-plugins.xml or change the 
> parse-html plugin.xml. I suggest the latter. See patch.





[jira] [Commented] (NUTCH-1097) application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml

2011-10-10 Thread Lewis John McGibbney (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13124502#comment-13124502
 ] 

Lewis John McGibbney commented on NUTCH-1097:
-

Having re-read the list thread and the full issue, I'm tempted towards +1 if 
you can clarify my thoughts above. I'll begin testing this and give feedback 
asap.





[jira] [Commented] (NUTCH-1097) application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml

2011-10-11 Thread Ferdy (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13124761#comment-13124761
 ] 

Ferdy commented on NUTCH-1097:
--

Hi,

As far as I know, parse-tika is currently used as a catch-all parser. So yes, 
if you were to exclude parse-html, tika would just parse all html-related 
types. However, the fact that Nutch still has a separate html parser indicates 
that (for some reason) html should not be handled by tika yet. Since 
application/xhtml+xml is actually also html, just with a less common mimetype, 
I believe it should be handled by the same parser that handles text/html. More 
importantly, without this change I would still not be able to specify that 
only text/html and application/xhtml+xml should be parsed. ALL mimetypes will 
be parsed by the catch-all handler, since you can only specify the wildcard or 
ONE mimetype per parser.

The "text/html and application/xhtml+xml" use case is just a single example of 
why there should be a bit more flexibility for configuring parsers. If somebody 
wants to dispatch more than one mimetype (but not all) to a specific parser, 
that is just not possible. This holds regardless of whether tika (or any other 
parser) is used as a catch-all parser.

Nevertheless, my regex solution feels a bit hackish, so it should be properly 
documented when committed.
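
To illustrate the dispatch scenario, here is a minimal sketch of a specific 
parser sitting next to a catch-all parser. It is a simplified model, not 
Nutch's actual ParserFactory: the class ParserDispatchSketch and its methods 
are hypothetical, the "*" declaration for parse-tika is an assumption based on 
its catch-all role, and the ordering that parse-plugins.xml would impose is 
left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical model of parser selection; not Nutch's actual ParserFactory.
final class ParserDispatchSketch {

    // Wildcard first, then a regex match with '+' treated literally.
    static boolean accepts(String contentTypeAttr, String mimeType) {
        if ("*".equals(contentTypeAttr)) {
            return true;
        }
        return Pattern.matches(contentTypeAttr.replace("+", "\\+"), mimeType);
    }

    // Returns the ids of all parsers whose declared contentType accepts the
    // mimetype, in declaration order.
    static List<String> acceptingParsers(Map<String, String> declared,
                                         String mimeType) {
        List<String> ids = new ArrayList<>();
        for (Map.Entry<String, String> entry : declared.entrySet()) {
            if (accepts(entry.getValue(), mimeType)) {
                ids.add(entry.getKey());
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        Map<String, String> declared = new LinkedHashMap<>();
        declared.put("parse-html", "text/html|application/xhtml+xml");
        declared.put("parse-tika", "*"); // assumed catch-all declaration
        System.out.println(acceptingParsers(declared, "application/xhtml+xml"));
        System.out.println(acceptingParsers(declared, "application/pdf"));
    }
}
```

In this model parse-html is a candidate for exactly the two html mimetypes, 
and every other type falls through to the catch-all parser only.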





[jira] [Commented] (NUTCH-1097) application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml

2011-10-11 Thread Lewis John McGibbney (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125367#comment-13125367
 ] 

Lewis John McGibbney commented on NUTCH-1097:
-

Does anyone else have input on this one? I think it is a valuable contribution 
and makes perfect sense... conversely, it makes no sense for parse-html not to 
parse application/xhtml+xml files.





[jira] [Commented] (NUTCH-1097) application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml

2011-10-11 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125414#comment-13125414
 ] 

Andrzej Bialecki  commented on NUTCH-1097:
--

+1, the idea makes sense. The patch looks good, but it needs a minor fix: mime 
types may also contain "." characters, e.g. "application/vnd.ms-word", and 
these need to be escaped too.
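
A small sketch of why the "." matters, assuming the same full-match semantics 
as java.util.regex.Pattern.matches. The class MimeTypeEscaper and its methods 
are hypothetical and are not the code of any attached patch.

```java
import java.util.regex.Pattern;

// Hypothetical helper showing the effect of escaping '.' as well as '+'.
final class MimeTypeEscaper {

    // Treats '+' and '.' as literal characters; everything else stays regex syntax.
    static String escape(String contentType) {
        return contentType.replace("+", "\\+").replace(".", "\\.");
    }

    static boolean matchesEscaped(String contentType, String mimeType) {
        return Pattern.matches(escape(contentType), mimeType);
    }

    public static void main(String[] args) {
        String declared = "application/vnd.ms-word";
        // Unescaped, '.' matches any character, so a wrong type slips through.
        System.out.println(Pattern.matches(declared, "application/vndXms-word"));
        // Escaped, only the literal mimetype is accepted.
        System.out.println(matchesEscaped(declared, "application/vndXms-word"));
        System.out.println(matchesEscaped(declared, "application/vnd.ms-word"));
    }
}
```

One consequence in this sketch is that "." can no longer serve as a regex 
wildcard inside the contentType attribute; alternation with '|' still works.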





[jira] [Commented] (NUTCH-1097) application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml

2011-10-12 Thread Lewis John McGibbney (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125903#comment-13125903
 ] 

Lewis John McGibbney commented on NUTCH-1097:
-

OK Ferdy, this is fine for me. The trunk patch compiles and passes all tests. 
It would be great to get this committed in 1.4. I am happy to act as assignee 
and commit if there are no further comments/suggestions. The same applies to 
the nutchgora branch.

> application/xhtml+xml should be enabled in plugin.xml of parse-html; allow 
> multiple mimetypes for plugin.xml
> 
>
> Key: NUTCH-1097
> URL: https://issues.apache.org/jira/browse/NUTCH-1097
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.3
> Reporter: Ferdy
> Priority: Minor
> Fix For: 1.4
>
> Attachments: NUTCH-1097-nutchgora_v1.patch, 
> NUTCH-1097-nutchgora_v2.patch, NUTCH-1097-v1.patch, NUTCH-1097-v2.patch, 
> NUTCH-1097-v3.patch, NUTCH-1097-v4.patch
>
>
> The configuration in parse-plugins.xml expects the parse-html plugin to 
> accept application/xhtml+xml, however the plugin.xml of this plugin does not 
> list this type. Either change the entry in parse-plugins.xml or change the 
> parse-html plugin.xml. I suggest the latter. See patch.





[jira] [Commented] (NUTCH-1097) application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml

2011-10-12 Thread Andrzej Bialecki (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125916#comment-13125916
 ] 

Andrzej Bialecki  commented on NUTCH-1097:
--

+1, the latest patch looks good.





[jira] [Commented] (NUTCH-1097) application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml

2011-10-12 Thread Markus Jelsma (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125925#comment-13125925
 ] 

Markus Jelsma commented on NUTCH-1097:
--

+1, very useful  





[jira] [Commented] (NUTCH-1097) application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml

2011-10-14 Thread Hudson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13127376#comment-13127376
 ] 

Hudson commented on NUTCH-1097:
---

Integrated in Nutch-nutchgora #32 (See 
[https://builds.apache.org/job/Nutch-nutchgora/32/])
commit to address NUTCH-1097 and update to changes.txt

lewismc : 
http://svn.apache.org/viewvc/nutch/branches/nutchgora/viewvc/?view=rev&root=&revision=1182504
Files : 
* /nutch/branches/nutchgora/CHANGES.txt
* /nutch/branches/nutchgora/src/java/org/apache/nutch/parse/ParserFactory.java
* /nutch/branches/nutchgora/src/plugin/parse-html/plugin.xml


> application/xhtml+xml should be enabled in plugin.xml of parse-html; allow 
> multiple mimetypes for plugin.xml
> 
>
> Key: NUTCH-1097
> URL: https://issues.apache.org/jira/browse/NUTCH-1097
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.3
> Reporter: Ferdy
> Assignee: Lewis John McGibbney
> Priority: Minor
> Fix For: 1.4, nutchgora
>
> Attachments: NUTCH-1097-nutchgora_v1.patch, 
> NUTCH-1097-nutchgora_v2.patch, NUTCH-1097-v1.patch, NUTCH-1097-v2.patch, 
> NUTCH-1097-v3.patch, NUTCH-1097-v4.patch
>
>
> The configuration in parse-plugins.xml expects the parse-html plugin to 
> accept application/xhtml+xml, however the plugin.xml of this plugin does not 
> list this type. Either change the entry in parse-plugins.xml or change the 
> parse-html plugin.xml. I suggest the latter. See patch.





[jira] [Commented] (NUTCH-1097) application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml

2011-10-14 Thread Hudson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13127430#comment-13127430
 ] 

Hudson commented on NUTCH-1097:
---

Integrated in Nutch-trunk #1631 (See 
[https://builds.apache.org/job/Nutch-trunk/1631/])
commit to address NUTCH-1097 and update to changes.txt

lewismc : 
http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1182506
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/parse/ParserFactory.java
* /nutch/trunk/src/plugin/parse-html/plugin.xml






[jira] [Commented] (NUTCH-1097) application/xhtml+xml should be enabled in plugin.xml of parse-html; allow multiple mimetypes for plugin.xml

2011-10-31 Thread Hudson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13140120#comment-13140120
 ] 

Hudson commented on NUTCH-1097:
---

Integrated in nutch-trunk-maven #3 (See 
[https://builds.apache.org/job/nutch-trunk-maven/3/])
commit to address NUTCH-1097 and update to changes.txt

lewismc : 
http://svn.apache.org/viewvc/nutch/trunk/viewvc/?view=rev&root=&revision=1182506
Files : 
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/parse/ParserFactory.java
* /nutch/trunk/src/plugin/parse-html/plugin.xml

