[jira] [Updated] (NUTCH-882) Design a Host table in GORA
[ https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-882: Assignee: (was: Julien Nioche) Design a Host table in GORA --- Key: NUTCH-882 URL: https://issues.apache.org/jira/browse/NUTCH-882 Project: Nutch Issue Type: New Feature Affects Versions: nutchgora Reporter: Julien Nioche Fix For: nutchgora Attachments: NUTCH-882-v1.patch, hostdb.patch Having a separate GORA table for storing information about hosts (and domains?) would be very useful for : * customising the behaviour of the fetching on a host basis e.g. number of threads, min time between threads etc... * storing stats * keeping metadata and possibly propagate them to the webpages * keeping a copy of the robots.txt and possibly use that later to filter the webtable * store sitemaps files and update the webtable accordingly I'll try to come up with a GORA schema for such a host table but any comments are of course already welcome -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1331) limit crawler to defined depth
[ https://issues.apache.org/jira/browse/NUTCH-1331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1331: - Attachment: NUTCH-1331-v2.patch Attached is an implementation of what I described earlier. This has been generously donated by www.ant.com. It tracks the depth of a URL and removes its outlinks based on a global setting or a per-seed value. limit crawler to defined depth -- Key: NUTCH-1331 URL: https://issues.apache.org/jira/browse/NUTCH-1331 Project: Nutch Issue Type: New Feature Components: generator, parser, storage Affects Versions: 1.4 Reporter: behnam nikbakht Attachments: NUTCH-1331-v2.patch, NUTCH-1331.patch There is a need to limit the crawler to a defined depth. This option matters because it avoids crawling infinite loops of dynamically generated URLs, which occur on some sites, and lets the crawler concentrate on important URLs. One option is to define an iteration limit on the generate, fetch, parse, updatedb cycle, but this works only if all unfetched URLs are fetched in each cycle (without recrawling them, and with some other considerations). Alternatively, we can define a new parameter in CrawlDatum, named depth, and, like the score-opic algorithm, compute the depth of a link after parsing; generate then selects only URLs with a valid depth.
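The depth rule described in this message can be sketched as follows. This is a minimal illustration, not code from NUTCH-1331-v2.patch: the class name, method names and the way the per-seed value is passed in are all hypothetical, and it assumes an outlink is one level deeper than the page it was found on.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper illustrating a depth limit; not taken from the attached patch. */
final class DepthLimitSketch {

    /** Assumption: an outlink is one level deeper than the page it was found on. */
    static int outlinkDepth(int parentDepth) {
        return parentDepth + 1;
    }

    /**
     * Returns the outlinks to keep for a page found at parentDepth.
     * A per-seed limit, when present (non-null), overrides the global limit.
     */
    static List<String> keepOutlinks(List<String> outlinks, int parentDepth,
                                     int globalMaxDepth, Integer perSeedMaxDepth) {
        int maxDepth = (perSeedMaxDepth != null) ? perSeedMaxDepth : globalMaxDepth;
        if (outlinkDepth(parentDepth) > maxDepth) {
            // The outlinks would exceed the limit, so none of them is kept.
            return new ArrayList<>();
        }
        return new ArrayList<>(outlinks);
    }
}
```

Dropping the outlinks at parse time, rather than filtering at generate time, keeps over-deep URLs out of the crawl database altogether; which of the two the patch actually does is not stated in this message.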
[jira] [Updated] (NUTCH-1334) NPE in FetcherOutputFormat
[ https://issues.apache.org/jira/browse/NUTCH-1334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1334: - Attachment: NUTCH-1334.patch Will commit post 1.5 NPE in FetcherOutputFormat --- Key: NUTCH-1334 URL: https://issues.apache.org/jira/browse/NUTCH-1334 Project: Nutch Issue Type: Bug Reporter: Julien Nioche Attachments: NUTCH-1334.patch If fetcher.parse or fetcher.store.content are set to false AND the write method receives an instance of Parse or Content, an NPE will be thrown. This usually does not happen, as the Fetcher does not output a Parse or Content based on the configuration; however, this class is also used by the ArcSegmentCreator, which is unaware of these parameters and will output a Parse or Content instance regardless of the configuration. One option would be to make the ArcSegmentCreator aware of the fetcher.* parameters and output things accordingly, but it also makes sense to modify the FetcherOutputFormat so that it checks whether a subWriter has been created before trying to use it.
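The second option (check that a sub-writer exists before using it) might look roughly like the sketch below. It uses simplified stand-in classes rather than the real Hadoop and Nutch types, so every name here is hypothetical; only the null check mirrors what the issue describes.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified stand-in for an output format with optional sub-writers. */
final class GuardedWriterSketch {

    /** Stand-in for a per-type sub-writer; not a Hadoop or Nutch class. */
    static final class SubWriter {
        final List<Object> written = new ArrayList<>();
        void append(Object value) { written.add(value); }
    }

    /** Stand-ins for the Content and Parse record types. */
    static final class Content {}
    static final class Parse {}

    private final SubWriter contentOut; // null when content storage is disabled
    private final SubWriter parseOut;   // null when parsing is disabled

    GuardedWriterSketch(boolean storeContent, boolean parse) {
        this.contentOut = storeContent ? new SubWriter() : null;
        this.parseOut = parse ? new SubWriter() : null;
    }

    /** Writes the value if a matching sub-writer exists; returns false when it is skipped. */
    boolean write(Object value) {
        if (value instanceof Content) {
            if (contentOut == null) return false; // guard instead of a NullPointerException
            contentOut.append(value);
            return true;
        }
        if (value instanceof Parse) {
            if (parseOut == null) return false;
            parseOut.append(value);
            return true;
        }
        return false;
    }
}
```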
[jira] [Updated] (NUTCH-809) Parse-metatags plugin
[ https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-809: Attachment: NUTCH-809-trunk.patch Patch for Nutch-809 against trunk. Delegates the indexing to index-metatags Parse-metatags plugin - Key: NUTCH-809 URL: https://issues.apache.org/jira/browse/NUTCH-809 Project: Nutch Issue Type: New Feature Components: parser Affects Versions: 1.4, nutchgora Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 1.5 Attachments: NUTCH-809-trunk.patch, NUTCH-809.patch, NUTCH-809_metatags_1.3.patch, metatags-plugin+tutorial.zip h2. Parse-metatags plugin The parse-metatags plugin consists of a HTMLParserFilter which takes as parameter a list of metatag names with '*' as default value. The values are separated by ';'. In order to extract the values of the metatags description and keywords, you must specify in nutch-site.xml {code:xml} <property> <name>metatags.names</name> <value>description;keywords</value> </property> {code} The MetatagIndexer uses the output of the parsing above to create two fields 'keywords' and 'description'. Note that keywords is multivalued. The query-basic plugin is used to include these fields in the search e.g. in nutch-site.xml {code:xml} <property> <name>query.basic.description.boost</name> <value>2.0</value> </property> <property> <name>query.basic.keywords.boost</name> <value>2.0</value> </property> {code} This code has been developed by DigitalPebble Ltd and offered to the community by ANT.com
[jira] [Updated] (NUTCH-1259) Store detected content type in crawldatum metadata
[ https://issues.apache.org/jira/browse/NUTCH-1259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1259: - Summary: Store detected content type in crawldatum metadata (was: TikaParser should not add Content-Type from HTTP Headers to Nutch Metadata) Store detected content type in crawldatum metadata -- Key: NUTCH-1259 URL: https://issues.apache.org/jira/browse/NUTCH-1259 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 1.4 Reporter: Markus Jelsma Assignee: Markus Jelsma Fix For: 1.5 Attachments: NUTCH-1259-1.5-1.patch The MIME-type detected by Tika's Detect() API is never added to a Parse's ContentMetaData or ParseMetaData. Because of this, bad Content-Types will end up in the documents.
[jira] [Updated] (NUTCH-1258) MoreIndexingFilter should be able to read Content-Type from both parse metadata and content metadata
[ https://issues.apache.org/jira/browse/NUTCH-1258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1258: - Attachment: NUTCH-1258-v2.patch We now have access to the detected content-type from the crawldatum metadata as of NUTCH-1259. This patch tries that first, then falls back to the parse metadata. MoreIndexingFilter should be able to read Content-Type from both parse metadata and content metadata Key: NUTCH-1258 URL: https://issues.apache.org/jira/browse/NUTCH-1258 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: 1.4 Reporter: Markus Jelsma Assignee: Markus Jelsma Priority: Minor Fix For: 1.5 Attachments: NUTCH-1258-1.5-1.patch, NUTCH-1258-v2.patch The MoreIndexingFilter reads the Content-Type from parse metadata. However, this usually contains a lot of crap because web developers can set it to anything they like. The filter must be able to read the Content-Type field from content metadata as well because that contains the type detected by Tika's Detector.
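The lookup order described in the comment (detected type from the crawldatum metadata first, parse metadata second) can be sketched with plain maps. The class name, the use of java.util.Map in place of Nutch's metadata classes, and the "Content-Type" key are assumptions made for illustration only.

```java
import java.util.Map;

/** Hypothetical illustration of a two-step Content-Type lookup. */
final class ContentTypeLookupSketch {

    /** Assumed metadata key; the key used by the actual patch is not stated in this message. */
    static final String KEY = "Content-Type";

    /**
     * Prefers the detected type stored with the crawl datum and falls back
     * to the value from the parse metadata. Returns null when neither is set.
     */
    static String resolve(Map<String, String> crawlDatumMeta, Map<String, String> parseMeta) {
        String detected = crawlDatumMeta.get(KEY);
        if (detected != null && !detected.isEmpty()) {
            return detected;
        }
        return parseMeta.get(KEY);
    }
}
```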
[jira] [Updated] (NUTCH-1264) Configurable indexing plugin (index-extra)
[ https://issues.apache.org/jira/browse/NUTCH-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1264: - Attachment: NUTCH-1264-trunk-v2.patch Configurable indexing plugin (index-extra) --- Key: NUTCH-1264 URL: https://issues.apache.org/jira/browse/NUTCH-1264 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: 1.5 Reporter: Julien Nioche Attachments: NUTCH-1264-trunk-v2.patch, NUTCH-1264-trunk.patch We currently have several plugins already distributed or proposed which do very comparable things : - parse-meta [NUTCH-809] to generate metadata fields in parse-metadata and index them - headings [NUTCH-1005] to generate headings fields in parse-metadata and index them - index-extra [NUTCH-422] to index configurable fields - urlmeta [NUTCH-855] to propagate metadata from the seeds to the outlinks and index them - index-static [NUTCH-940] to generate configurable static fields All these plugins have in common that they allow to extract information from various sources and generate fields from them and are largely redundant. Instead this issue proposes to have a single plugin allowing to generate configurable fields from : - static values - parse metadata - content metadata - crawldb metadata and let the other plugins focus on the parsing and extraction of the values to index. This will make the addition of new fields simpler by relying on a stable common plugin instead of multiplying the code in various plugins. This plugin will replace index-static [NUTCH-940] and index-extra [NUTCH-422] and will serve as a basis for further improvements.
[jira] [Updated] (NUTCH-1264) Configurable indexing plugin (index-metadata)
[ https://issues.apache.org/jira/browse/NUTCH-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1264: - Description: We currently have several plugins already distributed or proposed which do very comparable things : - parse-meta [NUTCH-809] to generate metadata fields in parse-metadata and index them - headings [NUTCH-1005] to generate headings fields in parse-metadata and index them - index-extra [NUTCH-422] to index configurable fields - urlmeta [NUTCH-855] to propagate metadata from the seeds to the outlinks and index them - index-static [NUTCH-940] to generate configurable static fields All these plugins have in common that they allow to extract information from various sources and generate fields from them and are largely redundant. Instead this issue proposes to have a single plugin allowing to generate configurable fields from : - static values - parse metadata - content metadata - crawldb metadata and let the other plugins focus on the parsing and extraction of the values to index. This will make the addition of new fields simpler by relying on a stable common plugin instead of multiplying the code in various plugins. This plugin will replace index-extra [NUTCH-422] and will serve as a basis for further improvements. was: We currently have several plugins already distributed or proposed which do very comparable things : - parse-meta [NUTCH-809] to generate metadata fields in parse-metadata and index them - headings [NUTCH-1005] to generate headings fields in parse-metadata and index them - index-extra [NUTCH-422] to index configurable fields - urlmeta [NUTCH-855] to propagate metadata from the seeds to the outlinks and index them - index-static [NUTCH-940] to generate configurable static fields All these plugins have in common that they allow to extract information from various sources and generate fields from them and are largely redundant. 
Instead this issue proposes to have a single plugin allowing to generate configurable fields from : - static values - parse metadata - content metadata - crawldb metadata and let the other plugins focus on the parsing and extraction of the values to index. This will make the addition of new fields simpler by relying on a stable common plugin instead of multiplying the code in various plugins. This plugin will replace index-static [NUTCH-940] and index-extra [NUTCH-422] and will serve as a basis for further improvements. Summary: Configurable indexing plugin (index-metadata) (was: Configurable indexing plugin (index-extra) ) Configurable indexing plugin (index-metadata) -- Key: NUTCH-1264 URL: https://issues.apache.org/jira/browse/NUTCH-1264 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: 1.5 Reporter: Julien Nioche Attachments: NUTCH-1264-trunk-v2.patch, NUTCH-1264-trunk.patch We currently have several plugins already distributed or proposed which do very comparable things : - parse-meta [NUTCH-809] to generate metadata fields in parse-metadata and index them - headings [NUTCH-1005] to generate headings fields in parse-metadata and index them - index-extra [NUTCH-422] to index configurable fields - urlmeta [NUTCH-855] to propagate metadata from the seeds to the outlinks and index them - index-static [NUTCH-940] to generate configurable static fields All these plugins have in common that they allow to extract information from various sources and generate fields from them and are largely redundant. Instead this issue proposes to have a single plugin allowing to generate configurable fields from : - static values - parse metadata - content metadata - crawldb metadata and let the other plugins focus on the parsing and extraction of the values to index. This will make the addition of new fields simpler by relying on a stable common plugin instead of multiplying the code in various plugins. 
This plugin will replace index-extra [NUTCH-422] and will serve as a basis for further improvements.
[jira] [Updated] (NUTCH-1264) Configurable indexing plugin (index-extra)
[ https://issues.apache.org/jira/browse/NUTCH-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1264: - Attachment: NUTCH-1264-trunk.patch Configurable indexing plugin (index-extra) --- Key: NUTCH-1264 URL: https://issues.apache.org/jira/browse/NUTCH-1264 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: 1.5 Reporter: Julien Nioche Attachments: NUTCH-1264-trunk.patch We currently have several plugins already distributed or proposed which do very comparable things : - parse-meta [NUTCH-809] to generate metadata fields in parse-metadata and index them - headings [NUTCH-1005] to generate headings fields in parse-metadata and index them - index-extra [NUTCH-422] to index configurable fields - urlmeta [NUTCH-855] to propagate metadata from the seeds to the outlinks and index them - index-static [NUTCH-940] to generate configurable static fields All these plugins have in common that they allow to extract information from various sources and generate fields from them and are largely redundant. Instead this issue proposes to have a single plugin allowing to generate configurable fields from : - static values - parse metadata - content metadata - crawldb metadata and let the other plugins focus on the parsing and extraction of the values to index. This will make the addition of new fields simpler by relying on a stable common plugin instead of multiplying the code in various plugins. This plugin will replace index-static [NUTCH-940] and index-extra [NUTCH-422] and will serve as a basis for further improvements.
[jira] [Updated] (NUTCH-1243) Junit jar removed from lib
[ https://issues.apache.org/jira/browse/NUTCH-1243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1243: - Description: When calling 'ant test' the junit jar is added to the lib dir by Ivy but gets removed before the test classes are compiled. -This seems to happen with Ivy 2.1 but not with Ivy 2.2.- -We do have 2.2 in the /ivy directory but the ant script uses whatever is found in ~/.ant/lib - ideally we would like to be able to force the location of the jar file.- Actually the issue also happens with Ivy 2.2. I will commit a quick fix consisting of adding junit in the default ivy configuration, however it will be good to get to the bottom of this. was: When calling 'ant test' the junit jar is added to the lib dir by Ivy but gets removed before the test classes are compiled. This seems to happen with Ivy 2.1 but not with Ivy 2.2. We do have 2.2 in the /ivy directory but the ant script uses whatever is found in ~/.ant/lib - ideally we would like to be able to force the location of the jar file. As seen in [NUTCH-995] a workaround is to call : 'ant -lib ivy test' but having the value coded in the build script would be better Junit jar removed from lib -- Key: NUTCH-1243 URL: https://issues.apache.org/jira/browse/NUTCH-1243 Project: Nutch Issue Type: Bug Components: build Affects Versions: 1.5 Environment: Ivy 2.1.0 - 20090925235825 Reporter: Julien Nioche When calling 'ant test' the junit jar is added to the lib dir by Ivy but gets removed before the test classes are compiled. -This seems to happen with Ivy 2.1 but not with Ivy 2.2.- -We do have 2.2 in the /ivy directory but the ant script uses whatever is found in ~/.ant/lib - ideally we would like to be able to force the location of the jar file.- Actually the issue also happens with Ivy 2.2. I will commit a quick fix consisting of adding junit in the default ivy configuration, however it will be good to get to the bottom of this. 
[jira] [Updated] (NUTCH-1053) Parsing of RSS feeds fails
[ https://issues.apache.org/jira/browse/NUTCH-1053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1053: - Fix Version/s: 1.5 I'd happily give an example or fix it myself if only I could find it :-) Moved to 1.5 and left open for now Parsing of RSS feeds fails --- Key: NUTCH-1053 URL: https://issues.apache.org/jira/browse/NUTCH-1053 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 1.4 Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 1.5 Attachments: seed.txt See discussion on http://lucene.472066.n3.nabble.com/RSS-feed-parsing-on-Nutch-1-3-td3166487.html Have been able to reproduce the problem and will look into it
[jira] [Updated] (NUTCH-1053) Parsing of RSS feeds fails
[ https://issues.apache.org/jira/browse/NUTCH-1053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1053: - Fix Version/s: (was: 1.4) Parsing of RSS feeds fails --- Key: NUTCH-1053 URL: https://issues.apache.org/jira/browse/NUTCH-1053 Project: Nutch Issue Type: Bug Components: parser Affects Versions: 1.4 Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 1.5 Attachments: seed.txt See discussion on http://lucene.472066.n3.nabble.com/RSS-feed-parsing-on-Nutch-1-3-td3166487.html Have been able to reproduce the problem and will look into it
[jira] [Updated] (NUTCH-1046) Add tests for indexing to SOLR
[ https://issues.apache.org/jira/browse/NUTCH-1046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1046: - Affects Version/s: (was: 1.4) (was: 2.0) Fix Version/s: (was: 1.4) (was: 2.0) 1.5 Add tests for indexing to SOLR -- Key: NUTCH-1046 URL: https://issues.apache.org/jira/browse/NUTCH-1046 Project: Nutch Issue Type: New Feature Components: indexer Reporter: Julien Nioche Fix For: 1.5 We currently have no tests for checking that the indexing to SOLR works as expected. Running an embedded SOLR Server within the tests would be good.
[jira] [Updated] (NUTCH-1064) o.a.n.util.MimeUtil uses deprecated Tika methods
[ https://issues.apache.org/jira/browse/NUTCH-1064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1064: - Fix Version/s: (was: 1.4) 1.5 Postpone to 1.5. Should have new Tika version available in the meantime o.a.n.util.MimeUtil uses deprecated Tika methods Key: NUTCH-1064 URL: https://issues.apache.org/jira/browse/NUTCH-1064 Project: Nutch Issue Type: Improvement Affects Versions: 1.4 Reporter: Julien Nioche Assignee: Julien Nioche Fix For: 1.5 This class is in serious need of refactoring as the underlying Tika API has changed a lot. The logic around which strategies to use (e.g. trust the metadata returned by the server? trust Tika's detection? etc.) should be reimplemented using the Detector implementations
[jira] [Updated] (NUTCH-1090) LinkDb (invertlinks) should inform the user when it ignores internal links
[ https://issues.apache.org/jira/browse/NUTCH-1090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1090: - Fix Version/s: (was: 1.3) 1.5 LinkDb (invertlinks) should inform the user when it ignores internal links -- Key: NUTCH-1090 URL: https://issues.apache.org/jira/browse/NUTCH-1090 Project: Nutch Issue Type: Improvement Components: linkdb Affects Versions: 1.3 Reporter: Marek Bachmann Priority: Trivial Labels: configuration, information, log Fix For: 1.5 Attachments: LinkDb.patch I used nutch to crawl sites on a single domain. After the crawl was complete I tried to build a LinkDb. The LinkDb was empty. It turns out that this happens because the invertlinks command ignores internal links to the same domain by default. Unfortunately the LinkDb class does not report this, so it was hard to find out why the LinkDb was empty. I suggest adding an informational message for the user when the invertlinks command is ignoring internal links.
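As a rough illustration of the kind of message the reporter asks for, the sketch below counts links whose target host equals the source host and builds a one-line notice. It is not based on the attached LinkDb.patch; the class, the method names, the message wording and the choice to compare hosts with java.net.URI are all assumptions.

```java
import java.net.URI;
import java.util.List;

/** Hypothetical helper that counts internal links so they can be reported. */
final class InternalLinkReportSketch {

    /**
     * Counts target URLs that share the host of the source URL.
     * URI.create throws IllegalArgumentException on malformed input,
     * so real code would need to handle bad URLs.
     */
    static int countInternal(String fromUrl, List<String> toUrls) {
        String fromHost = URI.create(fromUrl).getHost();
        int internal = 0;
        for (String to : toUrls) {
            String toHost = URI.create(to).getHost();
            if (fromHost != null && fromHost.equalsIgnoreCase(toHost)) {
                internal++;
            }
        }
        return internal;
    }

    /** Builds the notice a tool could log once at the end of the job. */
    static String notice(int ignored) {
        return ignored + " internal link(s) were ignored; check the internal-links setting"
                + " if the LinkDb is unexpectedly empty.";
    }
}
```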
[jira] [Updated] (NUTCH-1040) Backport REST-API from 2.0
[ https://issues.apache.org/jira/browse/NUTCH-1040?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1040: - Affects Version/s: (was: 1.4) Fix Version/s: 1.5 Issue Type: New Feature (was: Task) Backport REST-API from 2.0 -- Key: NUTCH-1040 URL: https://issues.apache.org/jira/browse/NUTCH-1040 Project: Nutch Issue Type: New Feature Components: REST_api Reporter: Julien Nioche Fix For: 1.4, 1.5 See https://issues.apache.org/jira/browse/NUTCH-880
[jira] [Updated] (NUTCH-1129) Any23 Nutch plugin
[ https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1129: - Affects Version/s: (was: 1.4) Fix Version/s: (was: 1.4) 1.5 Any23 Nutch plugin -- Key: NUTCH-1129 URL: https://issues.apache.org/jira/browse/NUTCH-1129 Project: Nutch Issue Type: New Feature Components: parser Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Priority: Minor Fix For: 1.5 This plugin should build on the Any23 library to provide us with a plugin which extracts RDF data from HTTP and file resources. Although as of writing Any23 is not part of the ASF, the project is working towards integration into the Apache Incubator. Once the project proves its value, this would be an excellent addition to the Nutch 1.X codebase.
[jira] [Updated] (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed
[ https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-585: Fix Version/s: (was: 1.4) 1.5 Marking for 1.5. Needs reviewing and won't make it into 1.4 [PARSE-HTML plugin] Block certain parts of HTML code from being indexed --- Key: NUTCH-585 URL: https://issues.apache.org/jira/browse/NUTCH-585 Project: Nutch Issue Type: Improvement Affects Versions: 0.9.0 Environment: All operating systems Reporter: Andrea Spinelli Assignee: Markus Jelsma Priority: Minor Fix For: 1.5 Attachments: blacklist_whitelist_plugin.patch, nutch-585-excludeNodes.patch, nutch-585-jostens-excludeDIVs.patch We are using nutch to index our own web sites; we would like not to index certain parts of our pages, because we know they are not relevant (for instance, there are several links to change the background color) and generate spurious matches. We have modified the plugin so that it ignores HTML code between certain HTML comments, like <!-- START-IGNORE --> ... ignored part ... <!-- STOP-IGNORE --> We feel this might be useful to someone else, maybe factoring out the comment strings as constants in the configuration files (say parser.html.ignore.start and parser.html.ignore.stop in nutch-site.xml). We are almost ready to contribute our code snippet. Looking forward to any expression of interest - or to an explanation of why what we are doing is plain wrong!
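One simple way to drop everything between a start and a stop comment is a non-greedy regular expression applied before the text is indexed. The sketch below is only an illustration of that idea, not the contributed patch: the class and method names are hypothetical, and the markers are passed in as parameters to mirror the proposed parser.html.ignore.start and parser.html.ignore.stop settings.

```java
import java.util.regex.Pattern;

/** Hypothetical illustration of stripping marked sections from HTML text. */
final class IgnoreSectionsSketch {

    /**
     * Removes every span that starts with startMarker and ends with the next
     * stopMarker, markers included. A start marker without a matching stop
     * marker is left untouched.
     */
    static String strip(String html, String startMarker, String stopMarker) {
        Pattern section = Pattern.compile(
                Pattern.quote(startMarker) + ".*?" + Pattern.quote(stopMarker),
                Pattern.DOTALL); // DOTALL lets an ignored section span several lines
        return section.matcher(html).replaceAll("");
    }
}
```

A regular expression works on the raw text; the attached patches (judging only by their file names) appear to exclude DOM nodes instead, which is a different technique.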
[jira] [Updated] (NUTCH-1079) StringBuffer converted to StringBuilder
[ https://issues.apache.org/jira/browse/NUTCH-1079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1079: - Priority: Minor (was: Major) Affects Version/s: (was: 1.3) Fix Version/s: (was: 1.4) 1.5 Issue Type: Improvement (was: Bug) Not a bug but an improvement. Moved from 1.4 to 1.5 StringBuffer converted to StringBuilder --- Key: NUTCH-1079 URL: https://issues.apache.org/jira/browse/NUTCH-1079 Project: Nutch Issue Type: Improvement Components: fetcher, indexer Reporter: Kay Kay Assignee: Markus Jelsma Priority: Minor Fix For: 1.5 Attachments: NUTCH-1079.patch, NUTCH-rel_14-1079.patch The codebase uses StringBuffer throughout, even where thread-safety is probably not intended. This patch replaces StringBuffer with StringBuilder, as applicable.
[jira] [Updated] (NUTCH-1047) Pluggable indexing backends
[ https://issues.apache.org/jira/browse/NUTCH-1047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1047: - Affects Version/s: (was: 1.4) Fix Version/s: (was: 1.4) 1.5 Assignee: Julien Nioche Issue Type: New Feature (was: Improvement) Pluggable indexing backends --- Key: NUTCH-1047 URL: https://issues.apache.org/jira/browse/NUTCH-1047 Project: Nutch Issue Type: New Feature Components: indexer Reporter: Julien Nioche Assignee: Julien Nioche Labels: indexing Fix For: 1.5 One possible feature would be to add a new endpoint for indexing-backends and make the indexing pluggable. At the moment we are hardwired to SOLR - which is OK - but as other resources like ElasticSearch are becoming more popular it would be better to handle this as plugins. Not sure about the name of the endpoint though : we already have indexing-plugins (which are about generating fields sent to the backends) and moreover the backends are not necessarily for indexing / searching but could be just an external storage e.g. CouchDB. The term backend on its own would be confusing in 2.0 as this could be pertaining to the storage in GORA. 'indexing-backend' is the best name that came to my mind so far - please suggest better ones. We should come up with generic map/reduce jobs for indexing, deduplicating and cleaning and maybe add a Nutch extension point there so we can easily hook up indexing, cleaning and deduplicating for various backends.
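To make the proposal concrete, a backend extension point could be as small as the interface sketched below, with one implementation per target (SOLR, ElasticSearch, an external store). This is purely hypothetical: the interface name, its three methods and the in-memory implementation are invented here for illustration and do not describe an API that Nutch defines.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical extension point; each backend plugin would implement it. */
interface IndexingBackendSketch {
    /** Adds or replaces the document with the given id. */
    void write(String id, Map<String, String> fields);

    /** Removes a document, e.g. during cleaning or deduplication. */
    void delete(String id);

    /** Makes pending changes visible in the backend. */
    void commit();
}

/** Toy implementation that keeps documents in memory, useful only for tests. */
final class InMemoryBackendSketch implements IndexingBackendSketch {
    final Map<String, Map<String, String>> docs = new LinkedHashMap<>();

    @Override
    public void write(String id, Map<String, String> fields) {
        docs.put(id, new LinkedHashMap<>(fields)); // defensive copy of the fields
    }

    @Override
    public void delete(String id) {
        docs.remove(id);
    }

    @Override
    public void commit() {
        // Nothing to flush for an in-memory store.
    }
}
```

Keeping the interface limited to write, delete and commit would let generic indexing, cleaning and deduplicating jobs call any backend the same way, which is the kind of reuse the issue asks for.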
[jira] [Updated] (NUTCH-1088) Write Solr XML documents
[ https://issues.apache.org/jira/browse/NUTCH-1088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1088: - Priority: Minor (was: Major) Fix Version/s: (was: 1.4) 1.5 Could do that with the pluggable indexing framework in NUTCH-1047? Write Solr XML documents Key: NUTCH-1088 URL: https://issues.apache.org/jira/browse/NUTCH-1088 Project: Nutch Issue Type: New Feature Components: indexer Reporter: Markus Jelsma Assignee: Markus Jelsma Priority: Minor Fix For: 1.5 Documents need to be reindexed when index-time analysis is modified. Indexing individual segments from Nutch is tedious, especially for small segments. This issue should add a feature that can write XML batches.
[jira] [Updated] (NUTCH-1117) JUnit test for index-anchor
[ https://issues.apache.org/jira/browse/NUTCH-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1117: - Fix Version/s: (was: 1.4) 1.5 JUnit test for index-anchor --- Key: NUTCH-1117 URL: https://issues.apache.org/jira/browse/NUTCH-1117 Project: Nutch Issue Type: Sub-task Components: build Affects Versions: 1.4 Reporter: Lewis John McGibbney Priority: Minor Fix For: 1.5 This issue is part of the larger attempt to provide a JUnit test case for every Nutch plugin.
[jira] [Updated] (NUTCH-1119) JUnit test for index-static
[ https://issues.apache.org/jira/browse/NUTCH-1119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1119: - Fix Version/s: (was: 1.4) 1.5 JUnit test for index-static --- Key: NUTCH-1119 URL: https://issues.apache.org/jira/browse/NUTCH-1119 Project: Nutch Issue Type: Sub-task Components: build Affects Versions: 1.4 Reporter: Lewis John McGibbney Priority: Minor Fix For: 1.5 This issue is part of the larger attempt to provide a JUnit test case for every Nutch plugin.
[jira] [Updated] (NUTCH-1118) JUnit test for index-basic
[ https://issues.apache.org/jira/browse/NUTCH-1118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1118: - Fix Version/s: (was: 1.4) 1.5 JUnit test for index-basic -- Key: NUTCH-1118 URL: https://issues.apache.org/jira/browse/NUTCH-1118 Project: Nutch Issue Type: Sub-task Components: build Affects Versions: 1.4 Reporter: Lewis John McGibbney Priority: Minor Fix For: 1.5 This issue is part of the larger attempt to provide a JUnit test case for every Nutch plugin.
[jira] [Updated] (NUTCH-1124) JUnit test for scoring-opic
[ https://issues.apache.org/jira/browse/NUTCH-1124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1124: - Fix Version/s: (was: 1.4) 1.5 JUnit test for scoring-opic --- Key: NUTCH-1124 URL: https://issues.apache.org/jira/browse/NUTCH-1124 Project: Nutch Issue Type: Sub-task Components: build Affects Versions: 1.4 Reporter: Lewis John McGibbney Priority: Minor Fix For: 1.5 This issue is part of the larger attempt to provide a JUnit test case for every Nutch plugin.
[jira] [Updated] (NUTCH-1123) JUnit test for scoring-link
[ https://issues.apache.org/jira/browse/NUTCH-1123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1123: - Fix Version/s: (was: 1.4) 1.5 JUnit test for scoring-link --- Key: NUTCH-1123 URL: https://issues.apache.org/jira/browse/NUTCH-1123 Project: Nutch Issue Type: Sub-task Components: build Affects Versions: 1.4 Reporter: Lewis John McGibbney Priority: Minor Fix For: 1.5 This issue is part of the larger attempt to provide a JUnit test case for every Nutch plugin.
[jira] [Updated] (NUTCH-1128) JUnit test for urlmeta
[ https://issues.apache.org/jira/browse/NUTCH-1128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1128: - Fix Version/s: (was: 1.4) 1.5 JUnit test for urlmeta -- Key: NUTCH-1128 URL: https://issues.apache.org/jira/browse/NUTCH-1128 Project: Nutch Issue Type: Sub-task Components: build Affects Versions: 1.4 Reporter: Lewis John McGibbney Priority: Minor Fix For: 1.5 This issue is part of the larger attempt to provide a JUnit test case for every Nutch plugin.
[jira] [Updated] (NUTCH-1127) JUnit test for urlfilter-validator
[ https://issues.apache.org/jira/browse/NUTCH-1127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1127: - Fix Version/s: (was: 1.4) 1.5 JUnit test for urlfilter-validator -- Key: NUTCH-1127 URL: https://issues.apache.org/jira/browse/NUTCH-1127 Project: Nutch Issue Type: Sub-task Components: build Affects Versions: 1.4 Reporter: Lewis John McGibbney Priority: Minor Fix For: 1.5 This issue is part of the larger attempt to provide a JUnit test case for every Nutch plugin.
[jira] [Updated] (NUTCH-1130) JUnit test for Any23 RDF plugin
[ https://issues.apache.org/jira/browse/NUTCH-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1130: - Fix Version/s: (was: 1.4) 1.5 JUnit test for Any23 RDF plugin --- Key: NUTCH-1130 URL: https://issues.apache.org/jira/browse/NUTCH-1130 Project: Nutch Issue Type: Sub-task Components: build Affects Versions: 1.4 Reporter: Lewis John McGibbney Assignee: Lewis John McGibbney Priority: Minor Fix For: 1.5 The JUnit test should be written prior to the progression of the Any23 Nutch plugin
[jira] [Updated] (NUTCH-1120) JUnit test for microformats-reltag
[ https://issues.apache.org/jira/browse/NUTCH-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1120: - Fix Version/s: (was: 1.4) 1.5 JUnit test for microformats-reltag -- Key: NUTCH-1120 URL: https://issues.apache.org/jira/browse/NUTCH-1120 Project: Nutch Issue Type: Sub-task Components: build Affects Versions: 1.4 Reporter: Lewis John McGibbney Priority: Minor Fix For: 1.5 This issue is part of the larger attempt to provide a JUnit test case for every Nutch plugin.
[jira] [Updated] (NUTCH-1125) JUnit test for tld
[ https://issues.apache.org/jira/browse/NUTCH-1125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1125: - Fix Version/s: (was: 1.4) 1.5 JUnit test for tld -- Key: NUTCH-1125 URL: https://issues.apache.org/jira/browse/NUTCH-1125 Project: Nutch Issue Type: Sub-task Components: build Affects Versions: 1.4 Reporter: Lewis John McGibbney Priority: Minor Fix For: 1.5 This issue is part of the larger attempt to provide a JUnit test case for every Nutch plugin.
[jira] [Updated] (NUTCH-1122) JUnit test for protocol-ftp
[ https://issues.apache.org/jira/browse/NUTCH-1122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1122: - Fix Version/s: (was: 1.4) 1.5 JUnit test for protocol-ftp --- Key: NUTCH-1122 URL: https://issues.apache.org/jira/browse/NUTCH-1122 Project: Nutch Issue Type: Sub-task Components: build Affects Versions: 1.4 Reporter: Lewis John McGibbney Priority: Minor Fix For: 1.5 This issue is part of the larger attempt to provide a JUnit test case for every Nutch plugin.
[jira] [Updated] (NUTCH-1121) JUnit test for parse-js
[ https://issues.apache.org/jira/browse/NUTCH-1121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Nioche updated NUTCH-1121: - Fix Version/s: (was: 1.4) 1.5 JUnit test for parse-js --- Key: NUTCH-1121 URL: https://issues.apache.org/jira/browse/NUTCH-1121 Project: Nutch Issue Type: Sub-task Components: build Affects Versions: 1.4 Reporter: Lewis John McGibbney Priority: Minor Fix For: 1.5 This issue is part of the larger attempt to provide a JUnit test case for every Nutch plugin.