[jira] [Updated] (NUTCH-490) Extension point with filters for Neko HTML parser (with patch)
[ https://issues.apache.org/jira/browse/NUTCH-490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-490:

    Component/s: (was: fetcher) parser

Extension point with filters for Neko HTML parser (with patch)

    Key: NUTCH-490
    URL: https://issues.apache.org/jira/browse/NUTCH-490
    Project: Nutch
    Issue Type: Improvement
    Components: parser
    Affects Versions: 0.9.0
    Environment: Any
    Reporter: Marcin Okraszewski
    Priority: Minor
    Fix For: 1.9
    Attachments: HtmlParser.java.diff, NekoFilters_for_1.0.patch, nutch-extensionpoins_plugin.xml.diff

In my project I need to set filters for the Neko HTML parser. Instead of hard-coding them, I made an extension point to define filters for Neko. I was following the code for the HtmlParser filters. In fact, I think the method that gets the filters could be generalized to handle both cases, but I didn't want to make too big a mess.

The attached patch is for Nutch 0.9. This part of the code hasn't changed in trunk, so it should apply easily.

BTW, I wonder if it wouldn't be best to have HTML DOM parsing defined by an extension point itself. Right now there are options for Neko and TagSoup. But if someone wanted to use something else, or to give the parser different settings, they would need to modify the HtmlParser class instead of replacing a plugin.

--
This message was sent by Atlassian JIRA (v6.2#6252)
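The idea in the description (filters contributed through an extension point rather than hard-coded in the parser) can be sketched roughly as follows. This is only an illustration, not the attached patch: `ElementFilter`, `FilterRegistry`, and `applyFilters` are hypothetical names, the filter works on plain element names instead of Neko's real filter type, and the registry stands in for whatever plugin lookup Nutch actually performs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FilterExtensionPointSketch {

    /** Hypothetical stand-in for a parser filter; not the Neko or Nutch API. */
    public interface ElementFilter {
        /** Returns true if the element should be kept in the parsed output. */
        boolean keep(String elementName);
    }

    /** Hypothetical registry playing the role of the extension point. */
    public static class FilterRegistry {
        private final Map<String, ElementFilter> filters = new LinkedHashMap<>();

        /** A plugin would call this to contribute a filter under an id. */
        public void register(String id, ElementFilter filter) {
            filters.put(id, filter);
        }

        /** Resolves the configured ids in order, skipping unknown ids. */
        public List<ElementFilter> resolve(List<String> configuredIds) {
            List<ElementFilter> resolved = new ArrayList<>();
            for (String id : configuredIds) {
                ElementFilter filter = filters.get(id);
                if (filter != null) {
                    resolved.add(filter);
                }
            }
            return resolved;
        }
    }

    /** Parser side: an element survives only if every filter keeps it. */
    public static List<String> applyFilters(List<String> elements,
                                            List<ElementFilter> filters) {
        List<String> kept = new ArrayList<>();
        for (String element : elements) {
            boolean keep = true;
            for (ElementFilter filter : filters) {
                if (!filter.keep(element)) {
                    keep = false;
                    break;
                }
            }
            if (keep) {
                kept.add(element);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        FilterRegistry registry = new FilterRegistry();
        registry.register("no-script", name -> !name.equals("script"));
        registry.register("no-style", name -> !name.equals("style"));

        // The list of active filter ids would come from configuration.
        List<ElementFilter> active =
            registry.resolve(Arrays.asList("no-script", "no-style"));

        List<String> kept = applyFilters(
            Arrays.asList("html", "script", "body", "style", "p"), active);
        System.out.println(kept); // prints [html, body, p]
    }
}
```

With this shape, changing which filters run is a configuration change (the list of ids) rather than an edit to the parser class, which is the motivation given in the issue.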
[jira] [Updated] (NUTCH-490) Extension point with filters for Neko HTML parser (with patch)
[ https://issues.apache.org/jira/browse/NUTCH-490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-490:

    Fix Version/s: 1.8

Extension point with filters for Neko HTML parser (with patch)

    Key: NUTCH-490
    URL: https://issues.apache.org/jira/browse/NUTCH-490
    Project: Nutch
    Issue Type: Improvement
    Components: fetcher
    Affects Versions: 0.9.0
    Environment: Any
    Reporter: Marcin Okraszewski
    Priority: Minor
    Fix For: 2.3, 1.8
    Attachments: HtmlParser.java.diff, NekoFilters_for_1.0.patch, nutch-extensionpoins_plugin.xml.diff

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-490) Extension point with filters for Neko HTML parser (with patch)
[ https://issues.apache.org/jira/browse/NUTCH-490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-490:

    Patch Info: Patch Available
    Fix Version/s: 2.2, 1.7

Extension point with filters for Neko HTML parser (with patch)

    Key: NUTCH-490
    URL: https://issues.apache.org/jira/browse/NUTCH-490
    Project: Nutch
    Issue Type: Improvement
    Components: fetcher
    Affects Versions: 0.9.0
    Environment: Any
    Reporter: Marcin Okraszewski
    Priority: Minor
    Fix For: 1.7, 2.2
    Attachments: HtmlParser.java.diff, NekoFilters_for_1.0.patch, nutch-extensionpoins_plugin.xml.diff