[jira] [Commented] (NUTCH-1087) Deprecate crawl command and replace with example script
[ https://issues.apache.org/jira/browse/NUTCH-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13089405#comment-13089405 ] Andrzej Bialecki commented on NUTCH-1087: -- IIRC we had this discussion in the past... It's true that we already rely on Bash to do anything useful, no matter whether it's on Windows or on a *nix-like OS. And it's true that the crawl command has been a constant source of confusion over the years. The crawl application also suffered from some subtle bugs, especially when running in local mode (e.g. the PluginRepository leaks). But the argument about maintenance costs is IMHO moot - you have to maintain a shell script, too, so it's no different from maintaining a Java class. Where it differs, I think, is that moving the crawl cycle logic to a shell script now raises the bar for Java developers who are not familiar with Bash scripting - a robust crawl script is not easy to follow, as it needs to handle error conditions and manage input/output resources on HDFS. On the other hand it's easier for system admins to tweak a script rather than tweaking a Java code... so I guess it's also a question of who's the audience for this functionality. I'm +0 for removing Crawl and replacing it with a script, IMHO it doesn't change the picture in any significant way. > Deprecate crawl command and replace with example script > --- > > Key: NUTCH-1087 > URL: https://issues.apache.org/jira/browse/NUTCH-1087 > Project: Nutch > Issue Type: Task >Affects Versions: 1.4 >Reporter: Markus Jelsma >Priority: Minor > Fix For: 1.4 > > > * remove the crawl command > * add basic crawl shell script > See thread: > http://www.mail-archive.com/dev@nutch.apache.org/msg03848.html -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
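The crawl cycle logic being discussed above (whether it lives in the Crawl class or a shell script) is essentially a fixed sequence of jobs repeated per depth. A minimal sketch of that control flow, with illustrative names only (not the actual Nutch 1.x API):

```java
// Hypothetical outline of the classic inject/generate/fetch/parse/update
// crawl cycle that the Crawl command drove. Class and method names here are
// illustrative, not the real Nutch classes.
public class CrawlCycleSketch {
    interface Step { void run(String crawlDir); }

    public static void runCycle(String crawlDir, int depth,
                                Step inject, Step generate, Step fetch,
                                Step parse, Step updateDb) {
        inject.run(crawlDir);                 // seed the CrawlDb once
        for (int i = 0; i < depth; i++) {     // one generate/fetch/update round per depth
            generate.run(crawlDir);
            fetch.run(crawlDir);
            parse.run(crawlDir);
            updateDb.run(crawlDir);
        }
    }
}
```

A shell script replacing this loop has to reproduce the same sequencing plus the error handling and HDFS resource management the comment mentions, which is why it is harder to follow than it first appears.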
[jira] [Commented] (NUTCH-1014) Migrate from Apache ORO to java.util.regex
[ https://issues.apache.org/jira/browse/NUTCH-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067972#comment-13067972 ] Andrzej Bialecki commented on NUTCH-1014: -- java.util.regex has the advantage of being a part of the JRE. However, it is quite slow for more complex regexes. See e.g. this benchmark: http://www.tusker.org/regex/regex_benchmark.html . In my experience with larger crawls this is especially important when using regexes for URL filtering and normalization - an innocent-looking regex can melt the cpu when processing a 64kB long junk URL, and consequently it can stall the crawl... In such cases it's good to have an option to fall back to a subset of regex features and use a DFA-based library like e.g. Brics. ORO is generally faster than j.u.regex (but also it isn't maintained anymore). Brics lacks support for many operators, but it's fast. Perhaps ICU4j would be a good alternative - it's fully JDK-compatible and offers good performance. > Migrate from Apache ORO to java.util.regex > -- > > Key: NUTCH-1014 > URL: https://issues.apache.org/jira/browse/NUTCH-1014 > Project: Nutch > Issue Type: Improvement >Reporter: Markus Jelsma > Fix For: 1.4, 2.0 > > > A separate issue tracking migration of all components from Apache ORO to > java.util.regex. Components involved are: > - RegexURLNormalzier > - OutlinkExtractor > - JSParseFilter > - MoreIndexingFilter > - BasicURLNormalizer -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
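One practical mitigation for the "64kB junk URL melts the CPU" problem described above is to guard the regex with a length check before matching at all. A minimal sketch using java.util.regex (the threshold and pattern are assumptions for illustration, not Nutch's actual filter rules):

```java
import java.util.regex.Pattern;

public class UrlFilterSketch {
    // Assumed guard: skip regex filtering for abnormally long URLs, since a
    // backtracking regex engine can take pathological time on long junk input.
    private static final int MAX_URL_LENGTH = 2048;

    // Illustrative suffix filter, pre-compiled once rather than per call.
    private static final Pattern SKIP =
        Pattern.compile("(?i)\\.(gif|jpg|png|css|js)$");

    public static boolean accept(String url) {
        if (url.length() > MAX_URL_LENGTH) {
            return false; // reject outright rather than risk a regex stall
        }
        return !SKIP.matcher(url).find();
    }
}
```

A DFA-based engine like Brics avoids the backtracking blowup entirely, at the cost of the missing operators the comment notes (backreferences, lookaround).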
[jira] [Commented] (NUTCH-985) MoreIndexingFilter doesn't use properly formatted date fields for Solr
[ https://issues.apache.org/jira/browse/NUTCH-985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034724#comment-13034724 ] Andrzej Bialecki commented on NUTCH-985: - We should use the Solr's DateUtil in all such places, to avoid code duplication and confusion should the date format ever change... The patch does essentially the same what DateUtil does, only the DateUtil reuses SimpleDateFormat instances in a thread-safe way, so it's more efficient. > MoreIndexingFilter doesn't use properly formatted date fields for Solr > -- > > Key: NUTCH-985 > URL: https://issues.apache.org/jira/browse/NUTCH-985 > Project: Nutch > Issue Type: Bug > Components: indexer >Affects Versions: 1.3, 2.0 >Reporter: Dietrich Schmidt >Assignee: Markus Jelsma > Fix For: 1.3, 2.0 > > Attachments: NUTCH-985-trunk-1.patch, NUTCH-985.1.3-1.patch, > indexlastmodifieddate.jar > > > I am using the index-more plugin to parse the lastModified data in web > pages in order to store it in a Solr data field. > In solrindex-mapping.xml I am mapping lastModified to a field "changed" in > Solr: > > However, when posting data to Solr the SolrIndexer posts it as a long, > not as a date: > name="changed">107932680 name="tstamp">20110414144140188 name="date">20040315 > Solr rejects the data because of the improper data type. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
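The thread-safety point about DateUtil reusing SimpleDateFormat instances can be sketched as follows. This is an assumption-level illustration of the technique (a ThreadLocal per-thread formatter producing Solr's ISO-8601 UTC date format), not the actual DateUtil source:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

public class SolrDateFormatSketch {
    // SimpleDateFormat is not thread-safe, so creating one per format() call
    // or sharing one across threads are both bad options. A ThreadLocal gives
    // each thread its own reusable instance, which is roughly the trick
    // Solr's DateUtil uses to stay both safe and efficient.
    private static final ThreadLocal<SimpleDateFormat> FMT =
        ThreadLocal.withInitial(() -> {
            SimpleDateFormat f =
                new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'", Locale.ROOT);
            f.setTimeZone(TimeZone.getTimeZone("UTC"));
            return f;
        });

    public static String format(Date d) {
        return FMT.get().format(d);
    }
}
```

Emitting this format instead of a raw long is exactly what Solr's date field type expects, which is the bug being reported.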
[jira] Commented: (NUTCH-955) Ivy configuration
[ https://issues.apache.org/jira/browse/NUTCH-955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13004509#comment-13004509 ] Andrzej Bialecki commented on NUTCH-955: - Committed with a tweak in rev. 1079770. Thanks! > Ivy configuration > - > > Key: NUTCH-955 > URL: https://issues.apache.org/jira/browse/NUTCH-955 > Project: Nutch > Issue Type: Improvement > Components: build >Affects Versions: 2.0 >Reporter: Alexis >Assignee: Andrzej Bialecki > Fix For: 2.0 > > Attachments: ivy.patch > > > As mentioned in NUTCH-950, we can slightly improve the Ivy configuration to > help setup the Gora backend more easily. > If the user does not want to stick with default HSQL database, other > alternatives exist, such as MySQL and HBase. > org.restlet and xercesImpl versions should be changed as well. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Resolved: (NUTCH-955) Ivy configuration
[ https://issues.apache.org/jira/browse/NUTCH-955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-955. - Resolution: Fixed Fix Version/s: 2.0 Assignee: Andrzej Bialecki > Ivy configuration > - > > Key: NUTCH-955 > URL: https://issues.apache.org/jira/browse/NUTCH-955 > Project: Nutch > Issue Type: Improvement > Components: build >Affects Versions: 2.0 >Reporter: Alexis >Assignee: Andrzej Bialecki > Fix For: 2.0 > > Attachments: ivy.patch > > > As mentioned in NUTCH-950, we can slightly improve the Ivy configuration to > help setup the Gora backend more easily. > If the user does not want to stick with default HSQL database, other > alternatives exist, such as MySQL and HBase. > org.restlet and xercesImpl versions should be changed as well. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (NUTCH-962) max. redirects not handled correctly: fetcher stops at max-1 redirects
[ https://issues.apache.org/jira/browse/NUTCH-962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13004501#comment-13004501 ] Andrzej Bialecki commented on NUTCH-962: - Committed in 1079764 (trunk) and 1079765 (1.3). Thank you! > max. redirects not handled correctly: fetcher stops at max-1 redirects > -- > > Key: NUTCH-962 > URL: https://issues.apache.org/jira/browse/NUTCH-962 > Project: Nutch > Issue Type: Bug > Components: fetcher >Affects Versions: 1.2, 1.3, 2.0 >Reporter: Sebastian Nagel >Assignee: Andrzej Bialecki > Fix For: 1.3, 2.0 > > Attachments: Fetcher_redir.patch > > > The fetcher stops following redirects one redirect before the max. redirects > is reached. > The description of http.redirect.max > > The maximum number of redirects the fetcher will follow when > > trying to fetch a page. If set to negative or 0, fetcher won't immediately > > follow redirected URLs, instead it will record them for later fetching. > suggests that if set to 1 that one redirect will be followed. > I tried to crawl two documents the first redirecting by > > to the second with http.redirect.max = 1 > The second document is not fetched and the URL has state GONE in CrawlDb. > fetching file:/test/redirects/meta_refresh.html > redirectCount=0 > -finishing thread FetcherThread, activeThreads=1 > - content redirect to file:/test/redirects/to/meta_refresh_target.html > (fetching now) > - redirect count exceeded file:/test/redirects/to/meta_refresh_target.html > The attached patch would fix this: if http.redirect.max is 1 : one redirect > is followed. > Of course, this would mean there is no possibility to skip redirects at all > since 0 > (as well as negative values) means "treat redirects as ordinary links". -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
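The off-by-one being fixed here comes down to the comparison between the running redirect count and http.redirect.max. A sketch of the intended semantics (names are illustrative, not the actual Fetcher code):

```java
public class RedirectPolicySketch {
    // Intended semantics of http.redirect.max per its description:
    //   maxRedirects = 1  -> exactly one redirect is followed;
    //   maxRedirects <= 0 -> redirects are recorded for later fetching,
    //                        not followed immediately.
    // The reported bug was the fetcher stopping at max-1, i.e. effectively
    // using (redirectCount < maxRedirects - 1) style logic.
    public static boolean shouldFollow(int redirectCount, int maxRedirects) {
        if (maxRedirects <= 0) {
            return false; // record the redirect target instead of fetching now
        }
        return redirectCount < maxRedirects;
    }
}
```

As the reporter notes, with these semantics there is no setting that skips redirects entirely, since 0 and negative values mean "treat redirects as ordinary links".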
[jira] Resolved: (NUTCH-962) max. redirects not handled correctly: fetcher stops at max-1 redirects
[ https://issues.apache.org/jira/browse/NUTCH-962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-962. - Resolution: Fixed Fix Version/s: 2.0 1.3 Assignee: Andrzej Bialecki > max. redirects not handled correctly: fetcher stops at max-1 redirects > -- > > Key: NUTCH-962 > URL: https://issues.apache.org/jira/browse/NUTCH-962 > Project: Nutch > Issue Type: Bug > Components: fetcher >Affects Versions: 1.2, 1.3, 2.0 >Reporter: Sebastian Nagel >Assignee: Andrzej Bialecki > Fix For: 1.3, 2.0 > > Attachments: Fetcher_redir.patch > > > The fetcher stops following redirects one redirect before the max. redirects > is reached. > The description of http.redirect.max > > The maximum number of redirects the fetcher will follow when > > trying to fetch a page. If set to negative or 0, fetcher won't immediately > > follow redirected URLs, instead it will record them for later fetching. > suggests that if set to 1 that one redirect will be followed. > I tried to crawl two documents the first redirecting by > > to the second with http.redirect.max = 1 > The second document is not fetched and the URL has state GONE in CrawlDb. > fetching file:/test/redirects/meta_refresh.html > redirectCount=0 > -finishing thread FetcherThread, activeThreads=1 > - content redirect to file:/test/redirects/to/meta_refresh_target.html > (fetching now) > - redirect count exceeded file:/test/redirects/to/meta_refresh_target.html > The attached patch would fix this: if http.redirect.max is 1 : one redirect > is followed. > Of course, this would mean there is no possibility to skip redirects at all > since 0 > (as well as negative values) means "treat redirects as ordinary links". -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (NUTCH-951) Backport changes from 2.0 into 1.3
[ https://issues.apache.org/jira/browse/NUTCH-951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13004490#comment-13004490 ] Andrzej Bialecki commented on NUTCH-951: - All changes have been ported. Thanks everyone! > Backport changes from 2.0 into 1.3 > -- > > Key: NUTCH-951 > URL: https://issues.apache.org/jira/browse/NUTCH-951 > Project: Nutch > Issue Type: Task >Affects Versions: 1.3 >Reporter: Julien Nioche >Assignee: Andrzej Bialecki >Priority: Blocker > Fix For: 1.3 > > > I've compared the changes from 2.0 with 1.3 and found the following > differences (excluding anything specific to 2.0/GORA) > * NUTCH-564 External parser supports encoding attribute (Antony > Bowesman, mattmann) > * NUTCH-714 Need a SFTP and SCP Protocol Handler (Sanjoy Ghosh, mattmann) > * NUTCH-825 Publish nutch artifacts to central maven repository > (mattmann) > * NUTCH-851 Port logging to slf4j (jnioche) > * NUTCH-861 Renamed HTMLParseFilter into ParseFilter > * NUTCH-872 Change the default fetcher.parse to FALSE (ab). > * NUTCH-876 Remove remaining robots/IP blocking code in lib-http (ab) > * NUTCH-880 REST API for Nutch (ab) > * NUTCH-883 Remove unused parameters from nutch-default.xml (jnioche) > * NUTCH-884 FetcherJob should run more reduce tasks than default (ab) > * NUTCH-886 A .gitignore file for Nutch (dogacan) > * NUTCH-894 Move statistical language identification from indexing to > parsing step > * NUTCH-921 Reduce dependency of Nutch on config files (ab) > * NUTCH-930 Remove remaining dependencies on Lucene API (ab) > * NUTCH-931 Simple admin API to fetch status and stop the service (ab) > * NUTCH-932 Bulk REST API to retrieve crawl results as JSON (ab) > Let's go through this and decide what to port to 1.3 -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Resolved: (NUTCH-951) Backport changes from 2.0 into 1.3
[ https://issues.apache.org/jira/browse/NUTCH-951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-951. - Resolution: Fixed > Backport changes from 2.0 into 1.3 > -- > > Key: NUTCH-951 > URL: https://issues.apache.org/jira/browse/NUTCH-951 > Project: Nutch > Issue Type: Task >Affects Versions: 1.3 >Reporter: Julien Nioche >Assignee: Andrzej Bialecki >Priority: Blocker > Fix For: 1.3 > > > I've compared the changes from 2.0 with 1.3 and found the following > differences (excluding anything specific to 2.0/GORA) > * NUTCH-564 External parser supports encoding attribute (Antony > Bowesman, mattmann) > * NUTCH-714 Need a SFTP and SCP Protocol Handler (Sanjoy Ghosh, mattmann) > * NUTCH-825 Publish nutch artifacts to central maven repository > (mattmann) > * NUTCH-851 Port logging to slf4j (jnioche) > * NUTCH-861 Renamed HTMLParseFilter into ParseFilter > * NUTCH-872 Change the default fetcher.parse to FALSE (ab). > * NUTCH-876 Remove remaining robots/IP blocking code in lib-http (ab) > * NUTCH-880 REST API for Nutch (ab) > * NUTCH-883 Remove unused parameters from nutch-default.xml (jnioche) > * NUTCH-884 FetcherJob should run more reduce tasks than default (ab) > * NUTCH-886 A .gitignore file for Nutch (dogacan) > * NUTCH-894 Move statistical language identification from indexing to > parsing step > * NUTCH-921 Reduce dependency of Nutch on config files (ab) > * NUTCH-930 Remove remaining dependencies on Lucene API (ab) > * NUTCH-931 Simple admin API to fetch status and stop the service (ab) > * NUTCH-932 Bulk REST API to retrieve crawl results as JSON (ab) > Let's go through this and decide what to port to 1.3 -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (NUTCH-951) Backport changes from 2.0 into 1.3
[ https://issues.apache.org/jira/browse/NUTCH-951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13004488#comment-13004488 ] Andrzej Bialecki commented on NUTCH-951: - * Ported NUTCH-872 in rev. 1079746. * Ported NUTCH-876 in rev. 1079753. * Ported NUTCH-921 in rev. 1079760. * NUTCH-884 is not applicable to 1.3 because here fetching executes in map tasks, so there's a correct number of them already. > Backport changes from 2.0 into 1.3 > -- > > Key: NUTCH-951 > URL: https://issues.apache.org/jira/browse/NUTCH-951 > Project: Nutch > Issue Type: Task >Affects Versions: 1.3 >Reporter: Julien Nioche >Assignee: Andrzej Bialecki >Priority: Blocker > Fix For: 1.3 > > > I've compared the changes from 2.0 with 1.3 and found the following > differences (excluding anything specific to 2.0/GORA) > * NUTCH-564 External parser supports encoding attribute (Antony > Bowesman, mattmann) > * NUTCH-714 Need a SFTP and SCP Protocol Handler (Sanjoy Ghosh, mattmann) > * NUTCH-825 Publish nutch artifacts to central maven repository > (mattmann) > * NUTCH-851 Port logging to slf4j (jnioche) > * NUTCH-861 Renamed HTMLParseFilter into ParseFilter > * NUTCH-872 Change the default fetcher.parse to FALSE (ab). 
> * NUTCH-876 Remove remaining robots/IP blocking code in lib-http (ab) > * NUTCH-880 REST API for Nutch (ab) > * NUTCH-883 Remove unused parameters from nutch-default.xml (jnioche) > * NUTCH-884 FetcherJob should run more reduce tasks than default (ab) > * NUTCH-886 A .gitignore file for Nutch (dogacan) > * NUTCH-894 Move statistical language identification from indexing to > parsing step > * NUTCH-921 Reduce dependency of Nutch on config files (ab) > * NUTCH-930 Remove remaining dependencies on Lucene API (ab) > * NUTCH-931 Simple admin API to fetch status and stop the service (ab) > * NUTCH-932 Bulk REST API to retrieve crawl results as JSON (ab) > Let's go through this and decide what to port to 1.3 -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] Commented: (NUTCH-964) ERROR conf.Configuration - Failed to set setXIncludeAware(true)
[ https://issues.apache.org/jira/browse/NUTCH-964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12987547#action_12987547 ] Andrzej Bialecki commented on NUTCH-964: - This error has been bothering me for a while, too - it's great that an upgrade fixes it and doesn't break other stuff ;) One area that was sensitive to Xerces versions in the past was the Neko parser (in parse-html) but if its tests pass then +1 to commit the patch. We should upgrade trunk too. > ERROR conf.Configuration - Failed to set setXIncludeAware(true) > --- > > Key: NUTCH-964 > URL: https://issues.apache.org/jira/browse/NUTCH-964 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.3 >Reporter: Markus Jelsma > Fix For: 1.3 > > Attachments: NUTCH-964.patch > > > Each executed job results in a number of occurences of the exception below: > 2011-01-27 13:40:34,457 ERROR conf.Configuration - Failed to set > setXIncludeAware(true) for parser > org.apache.xerces.jaxp.DocumentBuilderFactoryImpl@3801318b:java.lang.UnsupportedOperationException: > This parser does not support specification "null" version "null" > java.lang.UnsupportedOperationException: This parser does not support > specification "null" version "null" > at > javax.xml.parsers.DocumentBuilderFactory.setXIncludeAware(DocumentBuilderFactory.java:590) > at > org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:1054) > at > org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:1040) > at > org.apache.hadoop.conf.Configuration.getProps(Configuration.java:980) > at org.apache.hadoop.conf.Configuration.get(Configuration.java:436) > at org.apache.hadoop.fs.FileSystem.getDefaultUri(FileSystem.java:103) > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95) > at org.apache.nutch.crawl.Injector.inject(Injector.java:230) > at org.apache.nutch.crawl.Injector.run(Injector.java:248) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) > at 
org.apache.nutch.crawl.Injector.main(Injector.java:238) > This can be fixed by upgrading xercesImpl from 2.6.2 to 2.9.1. If modified > ivy and lib-xml's ivy configuration and can commit it. The question is, is > upgrading the correct method? I've tested Nutch with 2.9.1 and except the > lack of the annoying exception everything works as expected. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
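The stack trace above shows an older Xerces (2.6.x) on the classpath throwing UnsupportedOperationException from setXIncludeAware. Upgrading Xerces is the real fix, as the patch does; for context, a defensive way to probe this capability (a sketch, not code from Hadoop or Nutch) looks like:

```java
import javax.xml.parsers.DocumentBuilderFactory;

public class XIncludeProbeSketch {
    // Hypothetical helper: enable XInclude processing only if the underlying
    // DocumentBuilderFactory implementation supports it. Xerces 2.6.2 throws
    // UnsupportedOperationException here (the error in the log above), while
    // 2.9.1 and the JDK's built-in factory accept the call.
    public static boolean tryEnableXInclude(DocumentBuilderFactory f) {
        try {
            f.setXIncludeAware(true);
            return true;
        } catch (UnsupportedOperationException e) {
            return false; // parser does not implement XInclude awareness
        }
    }
}
```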
[jira] Commented: (NUTCH-939) Added -dir command line option to Indexer and SolrIndexer, allowing to specify directory containing segments
[ https://issues.apache.org/jira/browse/NUTCH-939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12973915#action_12973915 ] Andrzej Bialecki commented on NUTCH-939: - 1.2 release is out, and branch-1.2 is unlikely to result in a subsequent release - most users seem to be interested either in 1.3 or trunk. > Added -dir command line option to Indexer and SolrIndexer, allowing to > specify directory containing segments > - > > Key: NUTCH-939 > URL: https://issues.apache.org/jira/browse/NUTCH-939 > Project: Nutch > Issue Type: Improvement > Components: indexer >Affects Versions: 1.3 >Reporter: Claudio Martella >Assignee: Andrzej Bialecki >Priority: Minor > Fix For: 1.3 > > Attachments: Indexer.patch, SolrIndexer.patch > > > The patches add -dir option, so the user can specify the directory in which > the segments are to be found. The actual mode is to specify the list of > segments, which is not very easy with hdfs. Also, the -dir option is already > implemented in LinkDB and SegmentMerger, for example. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-948) Remove Lucene dependencies
[ https://issues.apache.org/jira/browse/NUTCH-948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-948. - Resolution: Fixed Committed in rev. 1051509. > Remove Lucene dependencies > -- > > Key: NUTCH-948 > URL: https://issues.apache.org/jira/browse/NUTCH-948 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.3 >Reporter: Andrzej Bialecki >Assignee: Andrzej Bialecki > Fix For: 1.3 > > > Branch-1.3 still has Lucene libs, but uses Lucene only in one place, namely > it uses DateTools in index-basic. DateTools should be replaced with Solr's > DateUtil, as we did in trunk, and then we can remove Lucene libs as a > dependency. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-948) Remove Lucene dependencies
Remove Lucene dependencies -- Key: NUTCH-948 URL: https://issues.apache.org/jira/browse/NUTCH-948 Project: Nutch Issue Type: Improvement Affects Versions: 1.3 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Fix For: 1.3 Branch-1.3 still has Lucene libs, but uses Lucene only in one place, namely it uses DateTools in index-basic. DateTools should be replaced with Solr's DateUtil, as we did in trunk, and then we can remove Lucene libs as a dependency. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-939) Added -dir command line option to Indexer and SolrIndexer, allowing to specify directory containing segments
[ https://issues.apache.org/jira/browse/NUTCH-939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-939. - Resolution: Fixed Assignee: Andrzej Bialecki I modified the patch slightly to allow more flexibility (you can mix individual segment names and the -dir options) as well as allowing segments placed on different filesystems. Committed in rev. 1051505. Thank you! > Added -dir command line option to Indexer and SolrIndexer, allowing to > specify directory containing segments > - > > Key: NUTCH-939 > URL: https://issues.apache.org/jira/browse/NUTCH-939 > Project: Nutch > Issue Type: Improvement > Components: indexer >Affects Versions: 1.3 >Reporter: Claudio Martella >Assignee: Andrzej Bialecki >Priority: Minor > Fix For: 1.3 > > Attachments: Indexer.patch, SolrIndexer.patch > > > The patches add -dir option, so the user can specify the directory in which > the segments are to be found. The actual mode is to specify the list of > segments, which is not very easy with hdfs. Also, the -dir option is already > implemented in LinkDB and SegmentMerger, for example. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-939) Added -dir command line option to Indexer and SolrIndexer, allowing to specify directory containing segments
[ https://issues.apache.org/jira/browse/NUTCH-939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12936047#action_12936047 ] Andrzej Bialecki commented on NUTCH-939: - Please note that trunk uses a very different method of working with segments (called batches there), and -dir is not applicable there. > Added -dir command line option to Indexer and SolrIndexer, allowing to > specify directory containing segments > - > > Key: NUTCH-939 > URL: https://issues.apache.org/jira/browse/NUTCH-939 > Project: Nutch > Issue Type: Improvement > Components: indexer >Affects Versions: 1.2 >Reporter: Claudio Martella >Priority: Minor > Fix For: 1.2 > > Attachments: Indexer.patch, SolrIndexer.patch > > > The patches add -dir option, so the user can specify the directory in which > the segments are to be found. The actual mode is to specify the list of > segments, which is not very easy with hdfs. Also, the -dir option is already > implemented in LinkDB and SegmentMerger, for example. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-938) Imposible to fetch sites with robots.txt
[ https://issues.apache.org/jira/browse/NUTCH-938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935745#action_12935745 ] Andrzej Bialecki commented on NUTCH-938: - These two properties are documented in nutch-default.xml, but they are mostly for internal use by Nutch. Other implementations of Fetcher (the OldFetcher) used to delegate the robot and politeness controls to protocol plugins. The current implementation of Fetcher performs these tasks itself, although in 1.2 protocol plugins still retain the code to implement these controls per protocol. In 1.3 (unreleased) and trunk this support has been removed from protocol plugins, so these lines will have no effect. > Imposible to fetch sites with robots.txt > - > > Key: NUTCH-938 > URL: https://issues.apache.org/jira/browse/NUTCH-938 > Project: Nutch > Issue Type: Bug > Components: fetcher >Affects Versions: 1.2 > Environment: red hat, nutch 1.2, jaca 1.6 >Reporter: Enrique Berlanga > Attachments: NUTCH-938.patch > > > Crawling a site with a robots.txt file like this: (e.g: > http://www.melilla.es) > --- > User-agent: * > Disallow: / > --- > No links are followed. > It doesn't matters the value set at "protocol.plugin.check.blocking" or > "protocol.plugin.check.robots" properties, because they are overloaded in > class org.apache.nutch.fetcher.Fetcher: > // set non-blocking & no-robots mode for HTTP protocol plugins. > getConf().setBoolean(Protocol.CHECK_BLOCKING, false); > getConf().setBoolean(Protocol.CHECK_ROBOTS, false); > False is the desired value, but in FetcherThread inner class, robot rules are > checket ignoring the configuration: > > RobotRules rules = protocol.getRobotRules(fit.url, fit.datum); > if (!rules.isAllowed(fit.u)) { > ... > LOG.debug("Denied by robots.txt: " + fit.url); > ... > continue; > } > --- > I suposse there is no problem in disabling that part of the code directly for > HTTP protocol. 
If so, I could submit a patch as soon as possible to get over this. > Thanks in advance -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON
[ https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-932. - Resolution: Fixed Fix Version/s: 2.0 Committed in rev. 1039014. > Bulk REST API to retrieve crawl results as JSON > --- > > Key: NUTCH-932 > URL: https://issues.apache.org/jira/browse/NUTCH-932 > Project: Nutch > Issue Type: New Feature > Components: REST_api >Affects Versions: 2.0 >Reporter: Andrzej Bialecki >Assignee: Andrzej Bialecki > Fix For: 2.0 > > Attachments: db.formatted.gz, NUTCH-932-2.patch, NUTCH-932-3.patch, > NUTCH-932-4.patch, NUTCH-932.patch, NUTCH-932.patch, NUTCH-932.patch > > > It would be useful to be able to retrieve results of a crawl as JSON. There > are a few things that need to be discussed: > * how to return bulk results using Restlet (WritableRepresentation subclass?) > * what should be the format of results? > I think it would make sense to provide a single record retrieval (by primary > key), all records, and records within a range. This incidentally matches well > the capabilities of the Gora Query class :) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON
[ https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-932: Attachment: NUTCH-932-4.patch Final version of the patch. > Bulk REST API to retrieve crawl results as JSON > --- > > Key: NUTCH-932 > URL: https://issues.apache.org/jira/browse/NUTCH-932 > Project: Nutch > Issue Type: New Feature > Components: REST_api >Affects Versions: 2.0 >Reporter: Andrzej Bialecki >Assignee: Andrzej Bialecki > Attachments: db.formatted.gz, NUTCH-932-2.patch, NUTCH-932-3.patch, > NUTCH-932-4.patch, NUTCH-932.patch, NUTCH-932.patch, NUTCH-932.patch > > > It would be useful to be able to retrieve results of a crawl as JSON. There > are a few things that need to be discussed: > * how to return bulk results using Restlet (WritableRepresentation subclass?) > * what should be the format of results? > I think it would make sense to provide a single record retrieval (by primary > key), all records, and records within a range. This incidentally matches well > the capabilities of the Gora Query class :) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-938) Imposible to fetch sites with robots.txt
[ https://issues.apache.org/jira/browse/NUTCH-938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1293#action_1293 ] Andrzej Bialecki commented on NUTCH-938: - Nutch behavior in this case is correct. The goal of Nutch is to implement a well-behaved crawler that obeys robot rules and netiquette. Your patch simply disables these control mechanisms. If it works for you and you can risk the wrath of webmasters, that's fine, you are free to use this patch - but Nutch as a project cannot encourage such practice. Consequently I'm going to mark this issue as Won't Fix. > Imposible to fetch sites with robots.txt > - > > Key: NUTCH-938 > URL: https://issues.apache.org/jira/browse/NUTCH-938 > Project: Nutch > Issue Type: Bug > Components: fetcher >Affects Versions: 1.2 > Environment: red hat, nutch 1.2, jaca 1.6 >Reporter: Enrique Berlanga > Attachments: NUTCH-938.patch > > > Crawling a site with a robots.txt file like this: (e.g: > http://www.melilla.es) > --- > User-agent: * > Disallow: / > --- > No links are followed. > It doesn't matters the value set at "protocol.plugin.check.blocking" or > "protocol.plugin.check.robots" properties, because they are overloaded in > class org.apache.nutch.fetcher.Fetcher: > // set non-blocking & no-robots mode for HTTP protocol plugins. > getConf().setBoolean(Protocol.CHECK_BLOCKING, false); > getConf().setBoolean(Protocol.CHECK_ROBOTS, false); > False is the desired value, but in FetcherThread inner class, robot rules are > checket ignoring the configuration: > > RobotRules rules = protocol.getRobotRules(fit.url, fit.datum); > if (!rules.isAllowed(fit.u)) { > ... > LOG.debug("Denied by robots.txt: " + fit.url); > ... > continue; > } > --- > I suposse there is no problem in disabling that part of the code directly for > HTTP protocol. If so, I could submit a patch as soon as posible to get over > this. > Thanks in advance -- This message is automatically generated by JIRA. 
[jira] Updated: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON
[ https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-932: Attachment: NUTCH-932-3.patch NutchTool is an abstract class in this patch. This actually minimizes the amount of code throughout, though paradoxically the patch file is larger than before... > Bulk REST API to retrieve crawl results as JSON > --- > > Key: NUTCH-932 > URL: https://issues.apache.org/jira/browse/NUTCH-932 > Project: Nutch > Issue Type: New Feature > Components: REST_api >Affects Versions: 2.0 >Reporter: Andrzej Bialecki >Assignee: Andrzej Bialecki > Attachments: db.formatted.gz, NUTCH-932-2.patch, NUTCH-932-3.patch, > NUTCH-932.patch, NUTCH-932.patch, NUTCH-932.patch > > > It would be useful to be able to retrieve results of a crawl as JSON. There > are a few things that need to be discussed: > * how to return bulk results using Restlet (WritableRepresentation subclass?) > * what should be the format of results? > I think it would make sense to provide a single record retrieval (by primary > key), all records, and records within a range. This incidentally matches well > the capabilities of the Gora Query class :) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON
[ https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-932: Attachment: NUTCH-932-2.patch This patch simplifies the NutchTool API and reduces changes to implementations of NutchTool. I'd like to commit this patch soon. > Bulk REST API to retrieve crawl results as JSON > --- > > Key: NUTCH-932 > URL: https://issues.apache.org/jira/browse/NUTCH-932 > Project: Nutch > Issue Type: New Feature > Components: REST_api >Affects Versions: 2.0 >Reporter: Andrzej Bialecki >Assignee: Andrzej Bialecki > Attachments: db.formatted.gz, NUTCH-932-2.patch, NUTCH-932.patch, > NUTCH-932.patch, NUTCH-932.patch > > > It would be useful to be able to retrieve results of a crawl as JSON. There > are a few things that need to be discussed: > * how to return bulk results using Restlet (WritableRepresentation subclass?) > * what should be the format of results? > I think it would make sense to provide a single record retrieval (by primary > key), all records, and records within a range. This incidentally matches well > the capabilities of the Gora Query class :) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON
[ https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-932: Attachment: NUTCH-932.patch Updated patch. This changes the NutchTool API to allow for execution steps that are not mapreduce jobs, and to pass arguments in arbitrary order, which was a side-effect of the Restlet API. As a proof of concept I reimplemented the Crawler class (a one-shot crawler). If there are no objections I'll commit this shortly. > Bulk REST API to retrieve crawl results as JSON > --- > > Key: NUTCH-932 > URL: https://issues.apache.org/jira/browse/NUTCH-932 > Project: Nutch > Issue Type: New Feature > Components: REST_api >Affects Versions: 2.0 >Reporter: Andrzej Bialecki >Assignee: Andrzej Bialecki > Attachments: db.formatted.gz, NUTCH-932.patch, NUTCH-932.patch, > NUTCH-932.patch > > > It would be useful to be able to retrieve results of a crawl as JSON. There > are a few things that need to be discussed: > * how to return bulk results using Restlet (WritableRepresentation subclass?) > * what should be the format of results? > I think it would make sense to provide a single record retrieval (by primary > key), all records, and records within a range. This incidentally matches well > the capabilities of the Gora Query class :) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-880) REST API for Nutch
[ https://issues.apache.org/jira/browse/NUTCH-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928909#action_12928909 ] Andrzej Bialecki commented on NUTCH-880: - Thanks - this issue is already fixed in NUTCH-932, to be committed soon. > REST API for Nutch > -- > > Key: NUTCH-880 > URL: https://issues.apache.org/jira/browse/NUTCH-880 > Project: Nutch > Issue Type: New Feature >Affects Versions: 2.0 >Reporter: Andrzej Bialecki >Assignee: Andrzej Bialecki > Fix For: 2.0 > > Attachments: API-2.patch, API.patch > > > This issue is for discussing a REST-style API for accessing Nutch. > Here's an initial idea: > * I propose to use org.restlet for handling requests and returning > JSON/XML/whatever responses. > * hook up all regular tools so that they can be driven via this API. This > would have to be an async API, since all Nutch operations take long time to > execute. It follows then that we need to be able also to list running > operations, retrieve their current status, and possibly > abort/cancel/stop/suspend/resume/...? This also means that we would have to > potentially create & manage many threads in a servlet - AFAIK this is frowned > upon by J2EE purists... > * package this in a webapp (that includes all deps, essentially nutch.job > content), with the restlet servlet as an entry point. > Open issues: > * how to implement the reading of crawl results via this API > * should we manage only crawls that use a single configuration per webapp, or > should we have a notion of crawl contexts (sets of crawl configs) with CRUD > ops on them? this would be nice, because it would allow managing of several > different crawls, with different configs, in a single webapp - but it > complicates the implementation a lot. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
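The async pattern the NUTCH-880 proposal describes - submit a long-running tool, get back an identifier, then list and poll running operations - can be sketched with standard java.util.concurrent primitives. JobRunner and the status strings below are illustrative names under assumed semantics, not the actual NUTCH-880/NUTCH-932 classes.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of async job management for a REST API whose
// operations take a long time to execute.
public class JobRunner {
    private final ExecutorService pool = Executors.newCachedThreadPool();
    private final Map<String, Future<?>> jobs = new ConcurrentHashMap<>();
    private int nextId = 0;

    // Submit a long-running tool and return a job id the client can poll.
    public synchronized String submit(Runnable tool) {
        String id = "job-" + (nextId++);
        jobs.put(id, pool.submit(tool));
        return id;
    }

    // Report coarse status; a real API would also expose progress and
    // support cancellation via Future.cancel().
    public String status(String id) {
        Future<?> f = jobs.get(id);
        if (f == null) return "UNKNOWN";
        return f.isDone() ? "DONE" : "RUNNING";
    }

    public void shutdown() { pool.shutdownNow(); }
}
```

This thread-pool-in-a-servlet approach is exactly what the proposal flags as frowned upon by J2EE purists, but it keeps the HTTP request/response cycle short.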
[jira] Updated: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON
[ https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-932: Attachment: NUTCH-932.patch Updated patch - this recognizes now URL parameters such as fields, start/end keys, batch and crawl id. > Bulk REST API to retrieve crawl results as JSON > --- > > Key: NUTCH-932 > URL: https://issues.apache.org/jira/browse/NUTCH-932 > Project: Nutch > Issue Type: New Feature > Components: REST_api >Affects Versions: 2.0 >Reporter: Andrzej Bialecki >Assignee: Andrzej Bialecki > Attachments: db.formatted.gz, NUTCH-932.patch, NUTCH-932.patch > > > It would be useful to be able to retrieve results of a crawl as JSON. There > are a few things that need to be discussed: > * how to return bulk results using Restlet (WritableRepresentation subclass?) > * what should be the format of results? > I think it would make sense to provide a single record retrieval (by primary > key), all records, and records within a range. This incidentally matches well > the capabilities of the Gora Query class :) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON
[ https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928355#action_12928355 ] Andrzej Bialecki commented on NUTCH-932: - Examples (with the db equivalent to the one in db.formatted.gz): {code} $ curl -s 'http://localhost:8192/nutch/db?fields=url&end=http://www.freebsd.org/&start=http://www.egothor.org/'| ./json_pp [ { "url": "http://www.egothor.org/" }, { "url": "http://www.freebsd.org/" } ] {code} {code} $ curl -s 'http://localhost:8192/nutch/db?fields=url,outlinks,markers,protocolStatus,parseStatus,contentType&start=http://www.getopt.org/&end=http://www.getopt.org/'| ./json_pp [ { "contentType": "text/html", "url": "http://www.getopt.org/", "markers": { "_updmrk_": "1288890451-1134865895" }, "parseStatus": "success/ok (1/0), args=[]", "protocolStatus": "SUCCESS, args=[]", "outlinks": { "http://www.getopt.org/luke/": "Luke", "http://www.getopt.org/ecimf/contrib/ONTO/REA": "REA Ontology page", "http://www.getopt.org/CV.pdf": "CV here", "http://www.getopt.org/utils/build/api": "API", "http://svn.apache.org/viewvc/hadoop/hbase/trunk/src/java/org/apache/hadoop/hbase/util/JenkinsHash.java": "available here", "http://www.getopt.org/murmur/MurmurHash.java": "MurmurHash.java", "http://www.ebxml.org/": "ebXML / ebTWG", "http://www.freebsd.org/": "FreeBSD", "http://www.getopt.org/luke/webstart.html": "Launch with Java WebStart", "http://www.freebsd.org/%7Epicobsd": "PicoBSD", "http://home.comcast.net/~bretm/hash/6.html": "this discussion", "http://protege.stanford.edu/": "Protege", "http://jakarta.apache.org/lucene": "Lucene", "http://www.getopt.org/ecimf/contrib/ONTO/ebxml": "ebXML Ontology", "http://www.getopt.org/ecimf/": "here", "http://www.isthe.com/chongo/tech/comp/fnv/": "his website", "http://www.getopt.org/stempel/index.html": "Stempel", "http://www.sigram.com/": "SIGRAM", "http://www.egothor.org/": "Egothor", "http://thinlet.sourceforge.net/": "Thinlet", 
"http://www.getopt.org/utils/dist/utils-1.0.jar": "binary", "http://www.ecimf.org/": "ECIMF" } } ] {code} > Bulk REST API to retrieve crawl results as JSON > --- > > Key: NUTCH-932 > URL: https://issues.apache.org/jira/browse/NUTCH-932 > Project: Nutch > Issue Type: New Feature > Components: REST_api >Affects Versions: 2.0 >Reporter: Andrzej Bialecki >Assignee: Andrzej Bialecki > Attachments: db.formatted.gz, NUTCH-932.patch, NUTCH-932.patch > > > It would be useful to be able to retrieve results of a crawl as JSON. There > are a few things that need to be discussed: > * how to return bulk results using Restlet (WritableRepresentation subclass?) > * what should be the format of results? > I think it would make sense to provide a single record retrieval (by primary > key), all records, and records within a range. This incidentally matches well > the capabilities of the Gora Query class :) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON
[ https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-932: Attachment: db.formatted.gz Example DB content (this was passed through a JSON pretty-printer, otherwise it's just one giant line...). > Bulk REST API to retrieve crawl results as JSON > --- > > Key: NUTCH-932 > URL: https://issues.apache.org/jira/browse/NUTCH-932 > Project: Nutch > Issue Type: New Feature > Components: REST_api >Affects Versions: 2.0 >Reporter: Andrzej Bialecki >Assignee: Andrzej Bialecki > Attachments: db.formatted.gz, NUTCH-932.patch > > > It would be useful to be able to retrieve results of a crawl as JSON. There > are a few things that need to be discussed: > * how to return bulk results using Restlet (WritableRepresentation subclass?) > * what should be the format of results? > I think it would make sense to provide a single record retrieval (by primary > key), all records, and records within a range. This incidentally matches well > the capabilities of the Gora Query class :) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON
[ https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-932: Attachment: NUTCH-932.patch This patch adds bulk retrieval of crawl results. This is still very rough, e.g. there's no way to select crawlId or limit the fields... but it returns proper JSON. This patch also includes other enhancements and bugfixes - with this patch I was able to perform a complete crawl cycle via REST. > Bulk REST API to retrieve crawl results as JSON > --- > > Key: NUTCH-932 > URL: https://issues.apache.org/jira/browse/NUTCH-932 > Project: Nutch > Issue Type: New Feature > Components: REST_api >Affects Versions: 2.0 >Reporter: Andrzej Bialecki >Assignee: Andrzej Bialecki > Attachments: NUTCH-932.patch > > > It would be useful to be able to retrieve results of a crawl as JSON. There > are a few things that need to be discussed: > * how to return bulk results using Restlet (WritableRepresentation subclass?) > * what should be the format of results? > I think it would make sense to provide a single record retrieval (by primary > key), all records, and records within a range. This incidentally matches well > the capabilities of the Gora Query class :) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON
Bulk REST API to retrieve crawl results as JSON --- Key: NUTCH-932 URL: https://issues.apache.org/jira/browse/NUTCH-932 Project: Nutch Issue Type: New Feature Components: REST_api Affects Versions: 2.0 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki It would be useful to be able to retrieve results of a crawl as JSON. There are a few things that need to be discussed: * how to return bulk results using Restlet (WritableRepresentation subclass?) * what should be the format of results? I think it would make sense to provide a single record retrieval (by primary key), all records, and records within a range. This incidentally matches well the capabilities of the Gora Query class :) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-931) Simple admin API to fetch status and stop the service
[ https://issues.apache.org/jira/browse/NUTCH-931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-931. - Resolution: Fixed Committed in rev. 1028736 with some changes. > Simple admin API to fetch status and stop the service > - > > Key: NUTCH-931 > URL: https://issues.apache.org/jira/browse/NUTCH-931 > Project: Nutch > Issue Type: Improvement > Components: REST_api >Affects Versions: 2.0 >Reporter: Andrzej Bialecki >Assignee: Andrzej Bialecki > Fix For: 2.0 > > Attachments: NUTCH-931.patch > > > REST API needs a simple info / stats service and the ability to shutdown the > server. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-931) Simple admin API to fetch status and stop the service
[ https://issues.apache.org/jira/browse/NUTCH-931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-931: Attachment: NUTCH-931.patch AdminResource, mostly skeleton for now that implements only the "stop" command. > Simple admin API to fetch status and stop the service > - > > Key: NUTCH-931 > URL: https://issues.apache.org/jira/browse/NUTCH-931 > Project: Nutch > Issue Type: Improvement > Components: REST_api >Affects Versions: 2.0 >Reporter: Andrzej Bialecki >Assignee: Andrzej Bialecki > Fix For: 2.0 > > Attachments: NUTCH-931.patch > > > REST API needs a simple info / stats service and the ability to shutdown the > server. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-931) Simple admin API to fetch status and stop the service
Simple admin API to fetch status and stop the service - Key: NUTCH-931 URL: https://issues.apache.org/jira/browse/NUTCH-931 Project: Nutch Issue Type: Improvement Components: REST_api Affects Versions: 2.0 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Fix For: 2.0 REST API needs a simple info / stats service and the ability to shutdown the server. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-930) Remove remaining dependencies on Lucene API
[ https://issues.apache.org/jira/browse/NUTCH-930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-930. - Resolution: Fixed Fix Version/s: 2.0 Committed in rev. 1028474. > Remove remaining dependencies on Lucene API > --- > > Key: NUTCH-930 > URL: https://issues.apache.org/jira/browse/NUTCH-930 > Project: Nutch > Issue Type: Improvement >Affects Versions: 2.0 >Reporter: Andrzej Bialecki >Assignee: Andrzej Bialecki > Fix For: 2.0 > > Attachments: NUTCH-930.patch > > > Nutch doesn't use Lucene API anymore, all indexing happens via > Lucene-agnostic SolrJ API. The only place where we still use a minor part of > Lucene is in index-basic, and that use (DateTools) can be easily replaced. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-930) Remove remaining dependencies on Lucene API
[ https://issues.apache.org/jira/browse/NUTCH-930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-930: Attachment: NUTCH-930.patch Patch to fix the issue. I'll commit this shortly. > Remove remaining dependencies on Lucene API > --- > > Key: NUTCH-930 > URL: https://issues.apache.org/jira/browse/NUTCH-930 > Project: Nutch > Issue Type: Improvement >Affects Versions: 2.0 >Reporter: Andrzej Bialecki >Assignee: Andrzej Bialecki > Attachments: NUTCH-930.patch > > > Nutch doesn't use Lucene API anymore, all indexing happens via > Lucene-agnostic SolrJ API. The only place where we still use a minor part of > Lucene is in index-basic, and that use (DateTools) can be easily replaced. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-930) Remove remaining dependencies on Lucene API
Remove remaining dependencies on Lucene API --- Key: NUTCH-930 URL: https://issues.apache.org/jira/browse/NUTCH-930 Project: Nutch Issue Type: Improvement Affects Versions: 2.0 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Nutch doesn't use Lucene API anymore, all indexing happens via Lucene-agnostic SolrJ API. The only place where we still use a minor part of Lucene is in index-basic, and that use (DateTools) can be easily replaced. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-880) REST API for Nutch
[ https://issues.apache.org/jira/browse/NUTCH-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-880. - Resolution: Fixed Fix Version/s: 2.0 Committed in rev. 1028235. The webapp part of this issue is tracked now in NUTCH-929. > REST API for Nutch > -- > > Key: NUTCH-880 > URL: https://issues.apache.org/jira/browse/NUTCH-880 > Project: Nutch > Issue Type: New Feature >Affects Versions: 2.0 >Reporter: Andrzej Bialecki >Assignee: Andrzej Bialecki > Fix For: 2.0 > > Attachments: API-2.patch, API.patch > > > This issue is for discussing a REST-style API for accessing Nutch. > Here's an initial idea: > * I propose to use org.restlet for handling requests and returning > JSON/XML/whatever responses. > * hook up all regular tools so that they can be driven via this API. This > would have to be an async API, since all Nutch operations take long time to > execute. It follows then that we need to be able also to list running > operations, retrieve their current status, and possibly > abort/cancel/stop/suspend/resume/...? This also means that we would have to > potentially create & manage many threads in a servlet - AFAIK this is frowned > upon by J2EE purists... > * package this in a webapp (that includes all deps, essentially nutch.job > content), with the restlet servlet as an entry point. > Open issues: > * how to implement the reading of crawl results via this API > * should we manage only crawls that use a single configuration per webapp, or > should we have a notion of crawl contexts (sets of crawl configs) with CRUD > ops on them? this would be nice, because it would allow managing of several > different crawls, with different configs, in a single webapp - but it > complicates the implementation a lot. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-880) REST API for Nutch
[ https://issues.apache.org/jira/browse/NUTCH-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-880: Summary: REST API for Nutch (was: REST API (and webapp) for Nutch) The webapp part is tracked now in NUTCH-929. > REST API for Nutch > -- > > Key: NUTCH-880 > URL: https://issues.apache.org/jira/browse/NUTCH-880 > Project: Nutch > Issue Type: New Feature >Affects Versions: 2.0 >Reporter: Andrzej Bialecki >Assignee: Andrzej Bialecki > Attachments: API-2.patch, API.patch > > > This issue is for discussing a REST-style API for accessing Nutch. > Here's an initial idea: > * I propose to use org.restlet for handling requests and returning > JSON/XML/whatever responses. > * hook up all regular tools so that they can be driven via this API. This > would have to be an async API, since all Nutch operations take long time to > execute. It follows then that we need to be able also to list running > operations, retrieve their current status, and possibly > abort/cancel/stop/suspend/resume/...? This also means that we would have to > potentially create & manage many threads in a servlet - AFAIK this is frowned > upon by J2EE purists... > * package this in a webapp (that includes all deps, essentially nutch.job > content), with the restlet servlet as an entry point. > Open issues: > * how to implement the reading of crawl results via this API > * should we manage only crawls that use a single configuration per webapp, or > should we have a notion of crawl contexts (sets of crawl configs) with CRUD > ops on them? this would be nice, because it would allow managing of several > different crawls, with different configs, in a single webapp - but it > complicates the implementation a lot. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-929) Create a REST-based admin UI for Nutch
Create a REST-based admin UI for Nutch -- Key: NUTCH-929 URL: https://issues.apache.org/jira/browse/NUTCH-929 Project: Nutch Issue Type: New Feature Components: administration gui Affects Versions: 2.0 Reporter: Andrzej Bialecki This is a follow up to NUTCH-880 - we need to expose the functionality of REST API in a user-friendly admin UI. Thanks to the nature of the API the UI can be implemented in any UI framework that speaks REST/JSON, so it could be a simple webapp (we already have jetty) or a Swing / Pivot / etc standalone application. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-926) Nutch follows wrong url in
[ https://issues.apache.org/jira/browse/NUTCH-926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925543#action_12925543 ] Andrzej Bialecki commented on NUTCH-926: - bq. Nutch continues to crawl the WRONG subdomains! But it should not do this!! No need to shout, we hear you :) Indeed, Nutch behavior when following redirects doesn't play well with the rule of ignoring external outlinks. Strictly speaking, redirects are not outlinks, but the silent assumption behind ignoreExternalOutlinks is that we crawl content only from that hostname. And your patch would solve this particular issue. However, this is not as simple as it seems... My favorite example is www.ibm.com -> www8.ibm.com/index.html . If we apply your fix you won't be able to crawl www.ibm.com unless you inject all wwwNNN load-balanced hosts... so a simple equality of hostnames may not be sufficient. We have utilities to extract domain names, so we could compare domains but then we may mistreat money.cnn.com vs. weather.cnn.com ... > Nutch follows wrong url in - > > Key: NUTCH-926 > URL: https://issues.apache.org/jira/browse/NUTCH-926 > Project: Nutch > Issue Type: Bug > Components: parser >Affects Versions: 1.2 > Environment: GNU/Linux CentOS >Reporter: Marco Novo >Priority: Critical > Fix For: 1.3 > > Attachments: ParseOutputFormat.java.patch > > > We have nutch set to crawl a domain URL list and we want to fetch only the passed > domains (hosts), not subdomains. > So > WWW.DOMAIN1.COM > .. > .. > .. > WWW.RIGHTDOMAIN.COM > .. > .. > .. > .. > WWW.DOMAIN.COM > We set nutch to: > NOT FOLLOW EXTERNAL LINKS > During crawling of WWW.RIGHTDOMAIN.COM > if a page contains > > > > > http://WRONG.RIGHTDOMAIN.COM";> > > > > > Nutch continues to crawl the WRONG subdomains! But it should not do this!! > During crawling of WWW.RIGHTDOMAIN.COM > if a page contains > > > > > http://WWW.WRONGDOMAIN.COM";> > > > > > Nutch continues to crawl the WRONG domain! But it should not do this! 
If that happens > we will spider the whole web. > We think the problem is in org.apache.nutch.parse ParseOutputFormat. We have > done a patch, so we will attach it. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
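The host-vs-domain trade-off in the comment above can be made concrete: strict host equality rejects the www.ibm.com -> www8.ibm.com redirect, while a domain-level comparison lumps money.cnn.com and weather.cnn.com together. The naive "last two labels" helper below is for illustration only - real code would use the project's domain utilities or a public-suffix list.

```java
// Sketch of the two scoping policies discussed for redirects/outlinks.
public class HostScope {
    // Strict policy: hosts must match exactly.
    public static boolean sameHost(String a, String b) {
        return a.equalsIgnoreCase(b);
    }

    // Naive registered-domain guess: the last two dot-separated labels.
    // Wrong for suffixes like .co.uk - a public-suffix list is needed there.
    public static String naiveDomain(String host) {
        String[] labels = host.toLowerCase().split("\\.");
        int n = labels.length;
        return n < 2 ? host.toLowerCase() : labels[n - 2] + "." + labels[n - 1];
    }

    // Looser policy: hosts only need to share a registered domain.
    public static boolean sameDomain(String a, String b) {
        return naiveDomain(a).equals(naiveDomain(b));
    }

    public static void main(String[] args) {
        System.out.println(sameHost("www.ibm.com", "www8.ibm.com"));        // false
        System.out.println(sameDomain("www.ibm.com", "www8.ibm.com"));      // true
        System.out.println(sameDomain("money.cnn.com", "weather.cnn.com")); // true, possibly unwanted
    }
}
```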
[jira] Updated: (NUTCH-880) REST API (and webapp) for Nutch
[ https://issues.apache.org/jira/browse/NUTCH-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-880: Attachment: API-2.patch An improved version, which actually works :) The configuration and job management is implemented, there is also a unit test that exercises this API. If there are no objections I'd like to commit this first version of the API, and continue improving it in other issues. > REST API (and webapp) for Nutch > --- > > Key: NUTCH-880 > URL: https://issues.apache.org/jira/browse/NUTCH-880 > Project: Nutch > Issue Type: New Feature >Affects Versions: 2.0 >Reporter: Andrzej Bialecki >Assignee: Andrzej Bialecki > Attachments: API-2.patch, API.patch > > > This issue is for discussing a REST-style API for accessing Nutch. > Here's an initial idea: > * I propose to use org.restlet for handling requests and returning > JSON/XML/whatever responses. > * hook up all regular tools so that they can be driven via this API. This > would have to be an async API, since all Nutch operations take long time to > execute. It follows then that we need to be able also to list running > operations, retrieve their current status, and possibly > abort/cancel/stop/suspend/resume/...? This also means that we would have to > potentially create & manage many threads in a servlet - AFAIK this is frowned > upon by J2EE purists... > * package this in a webapp (that includes all deps, essentially nutch.job > content), with the restlet servlet as an entry point. > Open issues: > * how to implement the reading of crawl results via this API > * should we manage only crawls that use a single configuration per webapp, or > should we have a notion of crawl contexts (sets of crawl configs) with CRUD > ops on them? this would be nice, because it would allow managing of several > different crawls, with different configs, in a single webapp - but it > complicates the implementation a lot. -- This message is automatically generated by JIRA. 
- You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-913) Nutch should use new namespace for Gora
[ https://issues.apache.org/jira/browse/NUTCH-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12924659#action_12924659 ] Andrzej Bialecki commented on NUTCH-913: - +1, let's commit it - I want to start playing with GORA-9, and that patch is in the org.apache namespace... > Nutch should use new namespace for Gora > --- > > Key: NUTCH-913 > URL: https://issues.apache.org/jira/browse/NUTCH-913 > Project: Nutch > Issue Type: Bug > Components: storage >Reporter: Doğacan Güney >Assignee: Doğacan Güney > Fix For: 2.0 > > Attachments: NUTCH-913_v1.patch, NUTCH-913_v2.patch > > > Gora is in Apache Incubator now (Yey!). We recently changed Gora's namespace > from org.gora to org.apache.gora. This means nutch should use the new > namespace otherwise it won't compile with newer builds of Gora. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-923) Multilingual support for Solr-index-mapping
[ https://issues.apache.org/jira/browse/NUTCH-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12924154#action_12924154 ] Andrzej Bialecki commented on NUTCH-923: - This doesn't solve the problem of a potentially unbounded number of fields. Compliance is one thing, and you can clean up field names from invalid characters, but sanity is another thing - if you have {{title_*}} in your Solr schema then theoretically you are allowed to create an unlimited number of fields with this prefix - Solr won't complain. > Multilingual support for Solr-index-mapping > --- > > Key: NUTCH-923 > URL: https://issues.apache.org/jira/browse/NUTCH-923 > Project: Nutch > Issue Type: Improvement > Components: indexer >Affects Versions: 1.2 >Reporter: Matthias Agethle >Assignee: Markus Jelsma >Priority: Minor > > It would be useful to extend the mapping possibilities when indexing to Solr. > One useful feature would be to use the detected language of the HTML page > (for example via the language-identifier plugin) and send the content to > corresponding language-aware Solr fields. > The mapping file could be as follows: > > > so that the title field gets mapped to title_en for English pages and > title_fr for French pages. > What do you think? Could this be useful also to others? > Or are there already other solutions out there? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-923) Multilingual support for Solr-index-mapping
[ https://issues.apache.org/jira/browse/NUTCH-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923947#action_12923947 ] Andrzej Bialecki commented on NUTCH-923: - My point was simply that if you want to build your data schema dynamically, based on the actual input data, then you need to be aware that this process is inherently risky - now we could perhaps deal with "lang" and LanguageIdentifier, but tomorrow we may be dealing with dc.author or cc.license or something else, and then we will face the same issue, i.e. a potentially unlimited number of fields created based on data. I don't have a good answer to this problem. On one hand this functionality is useful, on the other hand it's inherently risky in the presence of less-than-ideal data, which is always a possibility... Perhaps introducing some sort of validation mechanism would make this safer to use. > Multilingual support for Solr-index-mapping > --- > > Key: NUTCH-923 > URL: https://issues.apache.org/jira/browse/NUTCH-923 > Project: Nutch > Issue Type: Improvement > Components: indexer >Affects Versions: 1.2 >Reporter: Matthias Agethle >Assignee: Markus Jelsma >Priority: Minor > > It would be useful to extend the mapping possibilities when indexing to Solr. > One useful feature would be to use the detected language of the HTML page > (for example via the language-identifier plugin) and send the content to > corresponding language-aware Solr fields. > The mapping file could be as follows: > > > so that the title field gets mapped to title_en for English pages and > title_fr for French pages. > What do you think? Could this be useful also to others? > Or are there already other solutions out there? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-923) Multilingual support for Solr-index-mapping
[ https://issues.apache.org/jira/browse/NUTCH-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923896#action_12923896 ] Andrzej Bialecki commented on NUTCH-923: - This sounds useful, though the implementation needs to keep the following in mind: * you _assume_ that the lang field will have a nice predictable value, but unless you sanitize the values you can't assume anything... example: one page I saw had its language metadata set to a random string 8 kB long with various control chars and '\0'-s. * again, if you don't sanitize and control the total number of unique values in the source field, you could end up with a number of fields approaching infinity, and Solr would melt down... > Multilingual support for Solr-index-mapping > --- > > Key: NUTCH-923 > URL: https://issues.apache.org/jira/browse/NUTCH-923 > Project: Nutch > Issue Type: Improvement > Components: indexer >Affects Versions: 1.2 >Reporter: Matthias Agethle >Assignee: Markus Jelsma >Priority: Minor > > It would be useful to extend the mapping-possibilities when indexing to Solr. > One useful feature would be to use the detected language of the html page > (for example via the language-identifier plugin) and send the content to > corresponding language-aware solr-fields. > The mapping file could be as follows: > > > so that the title-field gets mapped to title_en for English pages and > title_fr for French pages. > What do you think? Could this be useful also to others? > Or are there already other solutions out there? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
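The sanitization the comments above call for could look like this minimal sketch. It is illustrative only (class and method names are hypothetical, not actual Nutch code): a closed whitelist bounds the number of generated `title_*` fields, and arbitrary metadata values are normalized before they reach the field name.

```java
import java.util.Set;

/**
 * Illustrative sketch: sanitize a detected language value before
 * building a per-language Solr field name such as "title_en".
 * All names here are hypothetical, not the Nutch/Solr API.
 */
public class LangFieldMapper {

    // A closed whitelist keeps the number of generated fields bounded.
    private static final Set<String> KNOWN_LANGS =
            Set.of("en", "fr", "de", "es", "it");
    private static final String FALLBACK = "general";

    /** Normalize arbitrary metadata ("en-US", 8 kB of garbage, null) to a safe code. */
    public static String sanitize(String raw) {
        if (raw == null) return FALLBACK;
        // Keep only leading ASCII letters, lower-cased, max 2 chars ("en-US" -> "en").
        StringBuilder sb = new StringBuilder();
        for (char c : raw.toCharArray()) {
            if (!Character.isLetter(c) || c > 127 || sb.length() == 2) break;
            sb.append(Character.toLowerCase(c));
        }
        String code = sb.toString();
        return KNOWN_LANGS.contains(code) ? code : FALLBACK;
    }

    /** Build the target field name, e.g. ("title", "en-US") -> "title_en". */
    public static String mapField(String field, String rawLang) {
        return field + "_" + sanitize(rawLang);
    }

    public static void main(String[] args) {
        System.out.println(mapField("title", "en-US"));    // title_en
        System.out.println(mapField("title", "\0garbage")); // title_general
    }
}
```

Anything outside the whitelist collapses into a single fallback field, which is one way to implement the "validation mechanism" suggested above.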
[jira] Commented: (NUTCH-924) Static field in solr mapping
[ https://issues.apache.org/jira/browse/NUTCH-924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923845#action_12923845 ] Andrzej Bialecki commented on NUTCH-924: - The functionality is useful, +1. But the patch has formatting errors. Please fix them before committing. The same functionality should be added to trunk, too. > Static field in solr mapping > > > Key: NUTCH-924 > URL: https://issues.apache.org/jira/browse/NUTCH-924 > Project: Nutch > Issue Type: Improvement > Components: indexer >Affects Versions: 1.3 >Reporter: David Stuart >Assignee: Markus Jelsma > Fix For: 1.3 > > Attachments: nutch_1.3_static_field.patch > > Original Estimate: 0h > Remaining Estimate: 0h > > Provide the facility to pass static data defined in solrindex-mapping.xml to > solr during the mapping process. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-907) DataStore API doesn't support multiple storage areas for multiple disjoint crawls
[ https://issues.apache.org/jira/browse/NUTCH-907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-907. - Resolution: Fixed Committed in rev. 1025963. Thank you Sertan for a high-quality patch and unit tests! > DataStore API doesn't support multiple storage areas for multiple disjoint > crawls > - > > Key: NUTCH-907 > URL: https://issues.apache.org/jira/browse/NUTCH-907 > Project: Nutch > Issue Type: Bug >Reporter: Andrzej Bialecki >Assignee: Andrzej Bialecki > Fix For: 2.0 > > Attachments: NUTCH-907.patch, NUTCH-907.v2.patch > > > In Nutch 1.x it was possible to easily select a set of crawl data (crawldb, > page data, linkdb, etc) by specifying a path where the data was stored. This > enabled users to run several disjoint crawls with different configs, but > still using the same storage medium, just under different paths. > This is not possible now because there is a 1:1 mapping between a specific > DataStore instance and a set of crawl data. > In order to support this functionality the Gora API should be extended so > that it can create stores (and data tables in the underlying storage) that > use arbitrary prefixes to identify the particular crawl dataset. Then the > Nutch API should be extended to allow passing this "crawlId" value to select > one of possibly many existing crawl datasets. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (NUTCH-907) DataStore API doesn't support multiple storage areas for multiple disjoint crawls
[ https://issues.apache.org/jira/browse/NUTCH-907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki reassigned NUTCH-907: --- Assignee: Andrzej Bialecki > DataStore API doesn't support multiple storage areas for multiple disjoint > crawls > - > > Key: NUTCH-907 > URL: https://issues.apache.org/jira/browse/NUTCH-907 > Project: Nutch > Issue Type: Bug >Reporter: Andrzej Bialecki >Assignee: Andrzej Bialecki > Fix For: 2.0 > > Attachments: NUTCH-907.patch, NUTCH-907.v2.patch > > > In Nutch 1.x it was possible to easily select a set of crawl data (crawldb, > page data, linkdb, etc) by specifying a path where the data was stored. This > enabled users to run several disjoint crawls with different configs, but > still using the same storage medium, just under different paths. > This is not possible now because there is a 1:1 mapping between a specific > DataStore instance and a set of crawl data. > In order to support this functionality the Gora API should be extended so > that it can create stores (and data tables in the underlying storage) that > use arbitrary prefixes to identify the particular crawl dataset. Then the > Nutch API should be extended to allow passing this "crawlId" value to select > one of possibly many existing crawl datasets. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-921) Reduce dependency of Nutch on config files
[ https://issues.apache.org/jira/browse/NUTCH-921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-921. - Resolution: Fixed Patch committed in rev. 1025960. Further improvements to be covered in other issues. > Reduce dependency of Nutch on config files > -- > > Key: NUTCH-921 > URL: https://issues.apache.org/jira/browse/NUTCH-921 > Project: Nutch > Issue Type: Improvement >Affects Versions: 2.0 >Reporter: Andrzej Bialecki >Assignee: Andrzej Bialecki > Fix For: 2.0 > > Attachments: NUTCH-921.patch > > > Currently many components in Nutch rely on reading their configuration from > files. These files need to be on the classpath (or packed into a job jar). > This is inconvenient if you want to manage configuration via API, e.g. when > embedding Nutch, or running many jobs with slightly different configurations. > This issue tracks the improvement to make various components read their > config directly from Configuration properties. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-921) Reduce dependency of Nutch on config files
[ https://issues.apache.org/jira/browse/NUTCH-921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-921: Attachment: NUTCH-921.patch Patch that implements reading config parameters from Configuration, and falls back to config files if Configuration properties are unspecified. > Reduce dependency of Nutch on config files > -- > > Key: NUTCH-921 > URL: https://issues.apache.org/jira/browse/NUTCH-921 > Project: Nutch > Issue Type: Improvement >Affects Versions: 2.0 >Reporter: Andrzej Bialecki >Assignee: Andrzej Bialecki > Fix For: 2.0 > > Attachments: NUTCH-921.patch > > > Currently many components in Nutch rely on reading their configuration from > files. These files need to be on the classpath (or packed into a job jar). > This is inconvenient if you want to manage configuration via API, e.g. when > embedding Nutch, or running many jobs with slightly different configurations. > This issue tracks the improvement to make various components read their > config directly from Configuration properties. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
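The fallback pattern described in the NUTCH-921 patch comment — read from a Configuration property first, fall back to a config file only when the property is unset — can be sketched as follows, with `java.util.Properties` standing in for Hadoop's `Configuration` (names are illustrative, not the actual Nutch API):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

/**
 * Sketch of "Configuration first, config file second".
 * java.util.Properties stands in for Hadoop's Configuration.
 */
public class ConfigWithFallback {

    /**
     * Return the inline value of {@code key} if set; otherwise load the
     * classpath resource named by the {@code fileKey} property.
     */
    public static String getRules(Properties conf, String key, String fileKey) {
        String inline = conf.getProperty(key);
        if (inline != null) {
            return inline; // inline property wins; no file needed
        }
        String fileName = conf.getProperty(fileKey);
        try (InputStream in = ConfigWithFallback.class.getClassLoader()
                .getResourceAsStream(fileName)) {
            if (in == null) {
                throw new UncheckedIOException(
                        new IOException("resource not found: " + fileName));
            }
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("urlfilter.regex.rules", "+^https?://example\\.com/");
        // The inline property is set, so the file is never opened.
        System.out.println(getRules(conf, "urlfilter.regex.rules",
                "urlfilter.regex.file"));
    }
}
```

This keeps embedded use (all config via API) working while remaining backward-compatible with file-based setups.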
[jira] Created: (NUTCH-921) Reduce dependency of Nutch on config files
Reduce dependency of Nutch on config files -- Key: NUTCH-921 URL: https://issues.apache.org/jira/browse/NUTCH-921 Project: Nutch Issue Type: Improvement Affects Versions: 2.0 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Fix For: 2.0 Currently many components in Nutch rely on reading their configuration from files. These files need to be on the classpath (or packed into a job jar). This is inconvenient if you want to manage configuration via API, e.g. when embedding Nutch, or running many jobs with slightly different configurations. This issue tracks the improvement to make various components read their config directly from Configuration properties. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-913) Nutch should use new namespace for Gora
[ https://issues.apache.org/jira/browse/NUTCH-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920610#action_12920610 ] Andrzej Bialecki commented on NUTCH-913: - There are formatting issues in DomainStatistics.java - the file uses literal tabs, which we frown upon, but the patch introduces double-space indent in the changed lines. As ugly as it sounds I think this should be changed into tabs, and then reformatted in another commit. Other than that, +1, go for it. > Nutch should use new namespace for Gora > --- > > Key: NUTCH-913 > URL: https://issues.apache.org/jira/browse/NUTCH-913 > Project: Nutch > Issue Type: Bug > Components: storage >Reporter: Doğacan Güney >Assignee: Doğacan Güney > Fix For: 2.0 > > Attachments: NUTCH-913_v1.patch > > > Gora is in Apache Incubator now (Yey!). We recently changed Gora's namespace > from org.gora to org.apache.gora. This means nutch should use the new > namespace otherwise it won't compile with newer builds of Gora. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-864) Fetcher generates entries with status 0
[ https://issues.apache.org/jira/browse/NUTCH-864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916912#action_12916912 ] Andrzej Bialecki commented on NUTCH-864: - I think the difficulty comes from the simplification in 2.x as compared to 1.x, in that we keep a single status per page. In 1.x a side-effect of having two locations with two statuses (one "db status" in crawldb and one "fetch status" in segments) was that we had more information in updatedb to act upon. Now we should probably keep up to two statuses - one that reflects a temporary fetch status, as determined by fetcher, and a final (reconciled) status as determined by updatedb, based on the knowledge of not only plain fetch status and old status but also possible redirects. If I'm not mistaken, currently the status is immediately overwritten by fetcher, even before we come to updatedb, hence the problem... > Fetcher generates entries with status 0 > --- > > Key: NUTCH-864 > URL: https://issues.apache.org/jira/browse/NUTCH-864 > Project: Nutch > Issue Type: Bug > Components: fetcher > Environment: Gora with SQLBackend > URL: https://svn.apache.org/repos/asf/nutch/branches/nutchbase > Last Changed Rev: 980748 > Last Changed Date: 2010-07-30 14:19:52 +0200 (Fri, 30 Jul 2010) >Reporter: Julien Nioche >Assignee: Doğacan Güney > Fix For: 2.0 > > > After a round of fetching which got the following protocol status : > 10/07/30 15:11:39 INFO mapred.JobClient: ACCESS_DENIED=2 > 10/07/30 15:11:39 INFO mapred.JobClient: SUCCESS=1177 > 10/07/30 15:11:39 INFO mapred.JobClient: GONE=3 > 10/07/30 15:11:39 INFO mapred.JobClient: TEMP_MOVED=138 > 10/07/30 15:11:39 INFO mapred.JobClient: EXCEPTION=93 > 10/07/30 15:11:39 INFO mapred.JobClient: MOVED=521 > 10/07/30 15:11:39 INFO mapred.JobClient: NOTFOUND=62 > I ran : ./nutch org.apache.nutch.crawl.WebTableReader -stats > 10/07/30 15:12:37 INFO crawl.WebTableReader: Statistics for WebTable: > 10/07/30 15:12:37 INFO 
crawl.WebTableReader: TOTAL urls: 2690 > 10/07/30 15:12:37 INFO crawl.WebTableReader: retry 0: 2690 > 10/07/30 15:12:37 INFO crawl.WebTableReader: min score: 0.0 > 10/07/30 15:12:37 INFO crawl.WebTableReader: avg score: 0.7587361 > 10/07/30 15:12:37 INFO crawl.WebTableReader: max score: 1.0 > 10/07/30 15:12:37 INFO crawl.WebTableReader: status 0 (null): 649 > 10/07/30 15:12:37 INFO crawl.WebTableReader: status 2 (status_fetched): > 1177 (SUCCESS=1177) > 10/07/30 15:12:37 INFO crawl.WebTableReader: status 3 (status_gone): 112 > 10/07/30 15:12:37 INFO crawl.WebTableReader: status 34 (status_retry): > 93 (EXCEPTION=93) > 10/07/30 15:12:37 INFO crawl.WebTableReader: status 4 (status_redir_temp): > 138 (TEMP_MOVED=138) > 10/07/30 15:12:37 INFO crawl.WebTableReader: status 5 (status_redir_perm): > 521 (MOVED=521) > 10/07/30 15:12:37 INFO crawl.WebTableReader: WebTable statistics: done > There should not be any entries with status 0 (null) > I will investigate a bit more... -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-894) Move statistical language identification from indexing to parsing step
[ https://issues.apache.org/jira/browse/NUTCH-894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916907#action_12916907 ] Andrzej Bialecki commented on NUTCH-894: - +1, a nice clean up of our code base :) > Move statistical language identification from indexing to parsing step > -- > > Key: NUTCH-894 > URL: https://issues.apache.org/jira/browse/NUTCH-894 > Project: Nutch > Issue Type: Improvement > Components: parser >Affects Versions: 2.0 >Reporter: Julien Nioche >Assignee: Julien Nioche > Fix For: 2.0 > > Attachments: NUTCH-894.patch > > > The statistical identification of language is currently done part in the > indexing step, whereas the detection based on HTTP header and HTML code is > done during the parsing. > We could keep the same logic i.e. do the statistical detection only if > nothing has been found with the previous methods but as part of the parsing. > This would be useful for ParseFilters which need the language information or > to use with ScoringFilters e.g. to focus the crawl on a set of languages. > Since the statistical models have been ported to Tika we should probably rely > on them instead of maintaining our own. > Any thoughts on this? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-882) Design a Host table in GORA
[ https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916874#action_12916874 ] Andrzej Bialecki commented on NUTCH-882: - Doğacan, I missed your previous comment... the issue with partial bloom filters is usually solved by having each task store its own filter - this worked well for MapFile-s because they consisted of multiple parts, so then a Reader would open a part and a corresponding bloom filter. Here it's more complicated, I agree... though this reminds me of the situation that is handled by DynamicBloomFilter: it's basically a set of Bloom filters with a facade that hides this fact from the user. Here we could construct something similar, i.e. don't merge partial filters after closing the output, but instead when opening a Reader read all partial filters and pretend they are one. > Design a Host table in GORA > --- > > Key: NUTCH-882 > URL: https://issues.apache.org/jira/browse/NUTCH-882 > Project: Nutch > Issue Type: New Feature >Affects Versions: 2.0 >Reporter: Julien Nioche >Assignee: Julien Nioche > Fix For: 2.0 > > Attachments: hostdb.patch, NUTCH-882-v1.patch > > > Having a separate GORA table for storing information about hosts (and > domains?) would be very useful for: > * customising the behaviour of the fetching on a host basis e.g. number of > threads, min time between threads etc... > * storing stats > * keeping metadata and possibly propagating them to the webpages > * keeping a copy of the robots.txt and possibly using that later to filter the > webtable > * storing sitemap files and updating the webtable accordingly > I'll try to come up with a GORA schema for such a host table but any comments > are of course already welcome -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
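The "facade over partial filters" idea in the comment above can be sketched in miniature. This is a toy illustration only — a single-hash filter on `java.util.BitSet`, not Hadoop's actual `DynamicBloomFilter` API: the Reader loads each task's partial filter as-is and the facade ORs membership tests across them, so callers see one logical filter and no merge step is needed.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/**
 * Toy sketch of hiding a set of partial Bloom filters behind a
 * single-filter interface. Simplified (one hash function) and
 * illustrative; not the Hadoop DynamicBloomFilter implementation.
 */
public class PartialBloomFacade {

    /** Minimal single-hash Bloom filter, one per task output. */
    static class PartFilter {
        private static final int SIZE = 1 << 16;
        private final BitSet bits = new BitSet(SIZE);

        void add(String key) { bits.set(index(key)); }
        boolean mightContain(String key) { return bits.get(index(key)); }

        private static int index(String key) {
            return (key.hashCode() & 0x7fffffff) % SIZE;
        }
    }

    private final List<PartFilter> parts = new ArrayList<>();

    /** A Reader would call this once per partial filter it loads. */
    public void addPart(PartFilter part) { parts.add(part); }

    /** Membership test over all parts - callers see a single filter. */
    public boolean mightContain(String key) {
        for (PartFilter p : parts) {
            if (p.mightContain(key)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        PartFilter a = new PartFilter(); a.add("http://example.com/1");
        PartFilter b = new PartFilter(); b.add("http://example.com/2");
        PartialBloomFacade facade = new PartialBloomFacade();
        facade.addPart(a); facade.addPart(b);
        System.out.println(facade.mightContain("http://example.com/2")); // true
    }
}
```

The trade-off is the usual Bloom one: the false-positive rate grows with the number of parts, but writers never pay the cost of merging filters.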
[jira] Commented: (NUTCH-864) Fetcher generates entries with status 0
[ https://issues.apache.org/jira/browse/NUTCH-864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916871#action_12916871 ] Andrzej Bialecki commented on NUTCH-864: - +1 to using a specific value != 0 as a redirect status. Value of 0 is helpful as a guard value, i.e. to detect things that were not properly initialized. I would even argue that the default value of 0 should be explicitly named INVALID. > Fetcher generates entries with status 0 > --- > > Key: NUTCH-864 > URL: https://issues.apache.org/jira/browse/NUTCH-864 > Project: Nutch > Issue Type: Bug > Components: fetcher > Environment: Gora with SQLBackend > URL: https://svn.apache.org/repos/asf/nutch/branches/nutchbase > Last Changed Rev: 980748 > Last Changed Date: 2010-07-30 14:19:52 +0200 (Fri, 30 Jul 2010) >Reporter: Julien Nioche >Assignee: Doğacan Güney > Fix For: 2.0 > > > After a round of fetching which got the following protocol status : > 10/07/30 15:11:39 INFO mapred.JobClient: ACCESS_DENIED=2 > 10/07/30 15:11:39 INFO mapred.JobClient: SUCCESS=1177 > 10/07/30 15:11:39 INFO mapred.JobClient: GONE=3 > 10/07/30 15:11:39 INFO mapred.JobClient: TEMP_MOVED=138 > 10/07/30 15:11:39 INFO mapred.JobClient: EXCEPTION=93 > 10/07/30 15:11:39 INFO mapred.JobClient: MOVED=521 > 10/07/30 15:11:39 INFO mapred.JobClient: NOTFOUND=62 > I ran : ./nutch org.apache.nutch.crawl.WebTableReader -stats > 10/07/30 15:12:37 INFO crawl.WebTableReader: Statistics for WebTable: > 10/07/30 15:12:37 INFO crawl.WebTableReader: TOTAL urls: 2690 > 10/07/30 15:12:37 INFO crawl.WebTableReader: retry 0: 2690 > 10/07/30 15:12:37 INFO crawl.WebTableReader: min score: 0.0 > 10/07/30 15:12:37 INFO crawl.WebTableReader: avg score: 0.7587361 > 10/07/30 15:12:37 INFO crawl.WebTableReader: max score: 1.0 > 10/07/30 15:12:37 INFO crawl.WebTableReader: status 0 (null): 649 > 10/07/30 15:12:37 INFO crawl.WebTableReader: status 2 (status_fetched): > 1177 (SUCCESS=1177) > 10/07/30 15:12:37 INFO 
crawl.WebTableReader: status 3 (status_gone): 112 > 10/07/30 15:12:37 INFO crawl.WebTableReader: status 34 (status_retry): > 93 (EXCEPTION=93) > 10/07/30 15:12:37 INFO crawl.WebTableReader: status 4 (status_redir_temp): > 138 (TEMP_MOVED=138) > 10/07/30 15:12:37 INFO crawl.WebTableReader: status 5 (status_redir_perm): > 521 (MOVED=521) > 10/07/30 15:12:37 INFO crawl.WebTableReader: WebTable statistics: done > There should not be any entries with status 0 (null) > I will investigate a bit more... -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
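The "default value of 0 should be explicitly named INVALID" suggestion above amounts to reserving zero as a guard value, so rows whose status was never set are detectable rather than silently ambiguous. A minimal sketch (the non-zero codes mirror the WebTableReader output quoted above, but the class and method names are hypothetical):

```java
/**
 * Sketch of the convention suggested in the comment: reserve 0 as an
 * explicit INVALID guard so uninitialized rows are detectable.
 * Illustrative names; not Nutch's actual status class.
 */
public class PageStatus {
    public static final byte INVALID    = 0; // default byte value = "never set"
    public static final byte UNFETCHED  = 1;
    public static final byte FETCHED    = 2;
    public static final byte GONE       = 3;
    public static final byte REDIR_TEMP = 4;
    public static final byte REDIR_PERM = 5;

    /** True only if some component has explicitly assigned a status. */
    public static boolean isInitialized(byte status) {
        return status != INVALID;
    }

    public static void main(String[] args) {
        byte[] row = new byte[1];                  // fresh storage defaults to 0
        System.out.println(isInitialized(row[0])); // false: bug caught early
        row[0] = FETCHED;
        System.out.println(isInitialized(row[0])); // true
    }
}
```

With this convention, the 649 "status 0 (null)" rows in the report above would show up as INVALID, pointing directly at the code path that forgot to assign a real status.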
[jira] Commented: (NUTCH-907) DataStore API doesn't support multiple storage areas for multiple disjoint crawls
[ https://issues.apache.org/jira/browse/NUTCH-907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916870#action_12916870 ] Andrzej Bialecki commented on NUTCH-907: - Hi Sertan, Thanks for the patch, this looks very good! A few comments: * I'm not good at naming things either... schemaId is a little bit cryptic though. If we didn't already use crawlId I would vote for that (and then rename crawlId to batchId or fetchId), as it is now... I don't know, maybe datasetId... * since we now create multiple datasets, we somehow need to manage them - i.e. list and delete at least (create is implicit). There is no such functionality in this patch, but this can also be addressed as a separate issue. * IndexerMapReduce.createIndexJob: I think it would be useful to pass the "datasetId" as a Job property - this way indexing filter plugins can use this property to populate NutchDocument fields if needed. FWIW, this may be a good idea to do in other jobs as well... > DataStore API doesn't support multiple storage areas for multiple disjoint > crawls > - > > Key: NUTCH-907 > URL: https://issues.apache.org/jira/browse/NUTCH-907 > Project: Nutch > Issue Type: Bug >Reporter: Andrzej Bialecki > Fix For: 2.0 > > Attachments: NUTCH-907.patch > > > In Nutch 1.x it was possible to easily select a set of crawl data (crawldb, > page data, linkdb, etc) by specifying a path where the data was stored. This > enabled users to run several disjoint crawls with different configs, but > still using the same storage medium, just under different paths. > This is not possible now because there is a 1:1 mapping between a specific > DataStore instance and a set of crawl data. > In order to support this functionality the Gora API should be extended so > that it can create stores (and data tables in the underlying storage) that > use arbitrary prefixes to identify the particular crawl dataset. 
Then the > Nutch API should be extended to allow passing this "crawlId" value to select > one of possibly many existing crawl datasets. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-880) REST API (and webapp) for Nutch
[ https://issues.apache.org/jira/browse/NUTCH-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913118#action_12913118 ] Andrzej Bialecki commented on NUTCH-880: - bq. I think we can combine the approach you outlined in NUTCH-907 with this one. I'm not sure... they are really not the same things - you can execute many crawls with different seed lists, but still using the same Configuration. bq. What is "CLASS" ? It's the same as bin/nutch fully.qualified.class.name, only here I require that it implements NutchTool. bq. Btw, Andrzej, I will be happy to help out with the implementation if you want. By all means - I didn't have time so far to progress beyond this patch... > REST API (and webapp) for Nutch > --- > > Key: NUTCH-880 > URL: https://issues.apache.org/jira/browse/NUTCH-880 > Project: Nutch > Issue Type: New Feature >Affects Versions: 2.0 >Reporter: Andrzej Bialecki >Assignee: Andrzej Bialecki > Attachments: API.patch > > > This issue is for discussing a REST-style API for accessing Nutch. > Here's an initial idea: > * I propose to use org.restlet for handling requests and returning > JSON/XML/whatever responses. > * hook up all regular tools so that they can be driven via this API. This > would have to be an async API, since all Nutch operations take long time to > execute. It follows then that we need to be able also to list running > operations, retrieve their current status, and possibly > abort/cancel/stop/suspend/resume/...? This also means that we would have to > potentially create & manage many threads in a servlet - AFAIK this is frowned > upon by J2EE purists... > * package this in a webapp (that includes all deps, essentially nutch.job > content), with the restlet servlet as an entry point. 
> Open issues: > * how to implement the reading of crawl results via this API > * should we manage only crawls that use a single configuration per webapp, or > should we have a notion of crawl contexts (sets of crawl configs) with CRUD > ops on them? this would be nice, because it would allow managing of several > different crawls, with different configs, in a single webapp - but it > complicates the implementation a lot. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-909) Add alternative search-provider to Nutch site
[ https://issues.apache.org/jira/browse/NUTCH-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12912474#action_12912474 ] Andrzej Bialecki commented on NUTCH-909: - bq. It might be better to see the message "Search with Apache Solr" (as on the TIKA's site). Yes, let's make this uniform. > Add alternative search-provider to Nutch site > - > > Key: NUTCH-909 > URL: https://issues.apache.org/jira/browse/NUTCH-909 > Project: Nutch > Issue Type: Improvement > Components: documentation >Reporter: Alex Baranau >Priority: Minor > Attachments: NUTCH-909.patch > > > Add additional search provider (to existed Lucid Find) search-lucene.com. > Initiated in discussion: http://search-lucene.com/m/2suCr1UnDfF1 > According to Andrzej's suggestion, "when preparing the patch let's follow the > same rationales as those in TIKA-488, since they are applicable here too", so > please refer to that issue for more insight on implementation details. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-906) Nutch OpenSearch sometimes raises DOMExceptions due to Lucene column names not being valid XML tag names
[ https://issues.apache.org/jira/browse/NUTCH-906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-906. - Fix Version/s: 1.2 Resolution: Fixed Fixed in rev. 998261. Thanks! > Nutch OpenSearch sometimes raises DOMExceptions due to Lucene column names > not being valid XML tag names > > > Key: NUTCH-906 > URL: https://issues.apache.org/jira/browse/NUTCH-906 > Project: Nutch > Issue Type: Bug > Components: web gui >Affects Versions: 1.1 > Environment: Debian GNU/Linux 64-bit >Reporter: Asheesh Laroia >Assignee: Andrzej Bialecki > Fix For: 1.2 > > Attachments: > 0001-OpenSearch-If-a-Lucene-column-name-begins-with-a-num.patch > > Original Estimate: 0.33h > Remaining Estimate: 0.33h > > The Nutch FAQ explains that OpenSearch includes "all fields that are > available at search result time." However, some Lucene column names can start > with numbers. Valid XML tags cannot. If Nutch is generating OpenSearch > results for a document with a Lucene document column whose name starts with > numbers, the underlying Xerces library throws this exception: > org.w3c.dom.DOMException: INVALID_CHARACTER_ERR: An invalid or illegal XML > character is specified. > So I have written a patch that tests strings before they are used to generate > tags within OpenSearch. > I hope you merge this, or a better version of the patch! -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Assigned: (NUTCH-906) Nutch OpenSearch sometimes raises DOMExceptions due to Lucene column names not being valid XML tag names
[ https://issues.apache.org/jira/browse/NUTCH-906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki reassigned NUTCH-906: --- Assignee: Andrzej Bialecki > Nutch OpenSearch sometimes raises DOMExceptions due to Lucene column names > not being valid XML tag names > > > Key: NUTCH-906 > URL: https://issues.apache.org/jira/browse/NUTCH-906 > Project: Nutch > Issue Type: Bug > Components: web gui >Affects Versions: 1.1 > Environment: Debian GNU/Linux 64-bit >Reporter: Asheesh Laroia >Assignee: Andrzej Bialecki > Attachments: > 0001-OpenSearch-If-a-Lucene-column-name-begins-with-a-num.patch > > Original Estimate: 0.33h > Remaining Estimate: 0.33h > > The Nutch FAQ explains that OpenSearch includes "all fields that are > available at search result time." However, some Lucene column names can start > with numbers. Valid XML tags cannot. If Nutch is generating OpenSearch > results for a document with a Lucene document column whose name starts with > numbers, the underlying Xerces library throws this exception: > org.w3c.dom.DOMException: INVALID_CHARACTER_ERR: An invalid or illegal XML > character is specified. > So I have written a patch that tests strings before they are used to generate > tags within OpenSearch. > I hope you merge this, or a better version of the patch! -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-862) HttpClient null pointer exception
[ https://issues.apache.org/jira/browse/NUTCH-862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-862. - Fix Version/s: 1.2 2.0 Resolution: Fixed Fix applied to branch-1.2 (rev. 998156), branch-1.3 (rev. 998158) and trunk (998160). Thank you! > HttpClient null pointer exception > - > > Key: NUTCH-862 > URL: https://issues.apache.org/jira/browse/NUTCH-862 > Project: Nutch > Issue Type: Bug > Components: fetcher >Affects Versions: 1.0.0 > Environment: linux, java 6 >Reporter: Sebastian Nagel >Assignee: Andrzej Bialecki >Priority: Minor > Fix For: 1.2, 2.0 > > Attachments: NUTCH-862.patch > > > When re-fetching a document (a continued crawl) HttpClient throws a null > pointer exception causing the document to be emptied: > 2010-07-27 12:45:09,199 INFO fetcher.Fetcher - fetching > http://localhost/doc/selfhtml/html/index.htm > 2010-07-27 12:45:09,203 ERROR httpclient.Http - java.lang.NullPointerException > 2010-07-27 12:45:09,204 ERROR httpclient.Http - at > org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:138) > 2010-07-27 12:45:09,204 ERROR httpclient.Http - at > org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:154) > 2010-07-27 12:45:09,204 ERROR httpclient.Http - at > org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:220) > 2010-07-27 12:45:09,204 ERROR httpclient.Http - at > org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:537) > 2010-07-27 12:45:09,204 INFO fetcher.Fetcher - fetch of > http://localhost/doc/selfhtml/html/index.htm failed with: > java.lang.NullPointerException > Because the document is re-fetched the server answers "304" (not modified): > 127.0.0.1 - - [27/Jul/2010:12:45:09 +0200] "GET /doc/selfhtml/html/index.htm > HTTP/1.0" 304 174 "-" "Nutch-1.0" > No content is sent in this case (empty http body). 
> Index: > trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java > === > --- > trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java > (revision 979647) > +++ > trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java > (working copy) > @@ -134,7 +134,8 @@ > if (code == 200) throw new IOException(e.toString()); > // for codes other than 200 OK, we are fine with empty content >} finally { > -in.close(); > +if (in != null) > + in.close(); > get.abort(); >} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
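The one-line fix in the patch above is the classic close-in-finally null guard: on a 304 the response body stream is legitimately null, so the `finally` block must test before closing. Shown standalone for clarity (illustrative names, not the actual `HttpResponse` code):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/**
 * Standalone sketch of the NUTCH-862 idiom: the stream may be null
 * for an empty response body, so guard the close in finally.
 */
public class SafeClose {

    /** Returns body length; {@code in} may be null for empty responses. */
    public static int readBody(InputStream in) {
        try {
            if (in == null) return 0; // 304 Not Modified: nothing to read
            return in.readAllBytes().length;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            if (in != null) {         // the fix: guard before closing
                try { in.close(); } catch (IOException ignored) { }
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(readBody(null));                                   // 0
        System.out.println(readBody(new ByteArrayInputStream(new byte[5])));  // 5
    }
}
```

In Java 7+ code a try-with-resources block handles the null case automatically (`try (InputStream s = in) { ... }` tolerates a null resource), but the explicit guard matches the project's Java 6-era style.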
[jira] Assigned: (NUTCH-862) HttpClient null pointer exception
[ https://issues.apache.org/jira/browse/NUTCH-862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki reassigned NUTCH-862: --- Assignee: Andrzej Bialecki > HttpClient null pointer exception > - > > Key: NUTCH-862 > URL: https://issues.apache.org/jira/browse/NUTCH-862 > Project: Nutch > Issue Type: Bug > Components: fetcher >Affects Versions: 1.0.0 > Environment: linux, java 6 >Reporter: Sebastian Nagel >Assignee: Andrzej Bialecki >Priority: Minor > Attachments: NUTCH-862.patch > > > When re-fetching a document (a continued crawl) HttpClient throws a null > pointer exception causing the document to be emptied: > 2010-07-27 12:45:09,199 INFO fetcher.Fetcher - fetching > http://localhost/doc/selfhtml/html/index.htm > 2010-07-27 12:45:09,203 ERROR httpclient.Http - java.lang.NullPointerException > 2010-07-27 12:45:09,204 ERROR httpclient.Http - at > org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:138) > 2010-07-27 12:45:09,204 ERROR httpclient.Http - at > org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:154) > 2010-07-27 12:45:09,204 ERROR httpclient.Http - at > org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:220) > 2010-07-27 12:45:09,204 ERROR httpclient.Http - at > org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:537) > 2010-07-27 12:45:09,204 INFO fetcher.Fetcher - fetch of > http://localhost/doc/selfhtml/html/index.htm failed with: > java.lang.NullPointerException > Because the document is re-fetched the server answers "304" (not modified): > 127.0.0.1 - - [27/Jul/2010:12:45:09 +0200] "GET /doc/selfhtml/html/index.htm > HTTP/1.0" 304 174 "-" "Nutch-1.0" > No content is sent in this case (empty http body). 
[jira] Updated: (NUTCH-880) REST API (and webapp) for Nutch
[ https://issues.apache.org/jira/browse/NUTCH-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-880: Attachment: API.patch Initial patch for discussion. This is a work in progress, so only some functionality is implemented, and even less than that is actually working ;) I would appreciate a review and comments. > REST API (and webapp) for Nutch > --- > > Key: NUTCH-880 > URL: https://issues.apache.org/jira/browse/NUTCH-880 > Project: Nutch > Issue Type: New Feature >Affects Versions: 2.0 >Reporter: Andrzej Bialecki >Assignee: Andrzej Bialecki > Attachments: API.patch > > > This issue is for discussing a REST-style API for accessing Nutch. > Here's an initial idea: > * I propose to use org.restlet for handling requests and returning > JSON/XML/whatever responses. > * hook up all regular tools so that they can be driven via this API. This > would have to be an async API, since all Nutch operations take long time to > execute. It follows then that we need to be able also to list running > operations, retrieve their current status, and possibly > abort/cancel/stop/suspend/resume/...? This also means that we would have to > potentially create & manage many threads in a servlet - AFAIK this is frowned > upon by J2EE purists... > * package this in a webapp (that includes all deps, essentially nutch.job > content), with the restlet servlet as an entry point. > Open issues: > * how to implement the reading of crawl results via this API > * should we manage only crawls that use a single configuration per webapp, or > should we have a notion of crawl contexts (sets of crawl configs) with CRUD > ops on them? this would be nice, because it would allow managing of several > different crawls, with different configs, in a single webapp - but it > complicates the implementation a lot. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
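The async-API requirement sketched above (submit a long-running operation, list running operations, poll status, cancel) can be illustrated with plain java.util.concurrent. This is only a sketch of the job-tracking pattern, not the Restlet wiring or any actual Nutch API; all names here are hypothetical:

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical registry of long-running operations: submit, poll, cancel.
public class JobRegistry {
  enum Status { RUNNING, DONE, CANCELLED }

  private final ExecutorService pool = Executors.newCachedThreadPool();
  private final Map<String, Future<?>> jobs = new ConcurrentHashMap<>();

  // Submit an operation and return an id the client can poll later.
  public String submit(Runnable op) {
    String id = UUID.randomUUID().toString();
    jobs.put(id, pool.submit(op));
    return id;
  }

  // Current status of a job, or null for an unknown id.
  public Status status(String id) {
    Future<?> f = jobs.get(id);
    if (f == null) return null;
    if (f.isCancelled()) return Status.CANCELLED;
    return f.isDone() ? Status.DONE : Status.RUNNING;
  }

  // Best-effort cancellation, as discussed (abort/cancel/stop/...).
  public boolean cancel(String id) {
    Future<?> f = jobs.get(id);
    return f != null && f.cancel(true);
  }

  public void shutdown() { pool.shutdownNow(); }
}
```

A REST front-end (whether restlet or anything else) would then just map POST/GET/DELETE onto submit/status/cancel; the thread management stays in one place, which also makes the "many threads in a servlet" concern easier to audit.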
[jira] Assigned: (NUTCH-880) REST API (and webapp) for Nutch
[ https://issues.apache.org/jira/browse/NUTCH-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki reassigned NUTCH-880: --- Assignee: Andrzej Bialecki > REST API (and webapp) for Nutch > --- > > Key: NUTCH-880 > URL: https://issues.apache.org/jira/browse/NUTCH-880 > Project: Nutch > Issue Type: New Feature >Affects Versions: 2.0 >Reporter: Andrzej Bialecki >Assignee: Andrzej Bialecki > > This issue is for discussing a REST-style API for accessing Nutch. > Here's an initial idea: > * I propose to use org.restlet for handling requests and returning > JSON/XML/whatever responses. > * hook up all regular tools so that they can be driven via this API. This > would have to be an async API, since all Nutch operations take long time to > execute. It follows then that we need to be able also to list running > operations, retrieve their current status, and possibly > abort/cancel/stop/suspend/resume/...? This also means that we would have to > potentially create & manage many threads in a servlet - AFAIK this is frowned > upon by J2EE purists... > * package this in a webapp (that includes all deps, essentially nutch.job > content), with the restlet servlet as an entry point. > Open issues: > * how to implement the reading of crawl results via this API > * should we manage only crawls that use a single configuration per webapp, or > should we have a notion of crawl contexts (sets of crawl configs) with CRUD > ops on them? this would be nice, because it would allow managing of several > different crawls, with different configs, in a single webapp - but it > complicates the implementation a lot. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-907) DataStore API doesn't support multiple storage areas for multiple disjoint crawls
[ https://issues.apache.org/jira/browse/NUTCH-907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910109#action_12910109 ] Andrzej Bialecki commented on NUTCH-907: - That's very good news - in that case I'm fine with the Gora API as it is now, we should change Nutch to make use of this functionality. > DataStore API doesn't support multiple storage areas for multiple disjoint > crawls > - > > Key: NUTCH-907 > URL: https://issues.apache.org/jira/browse/NUTCH-907 > Project: Nutch > Issue Type: Bug >Reporter: Andrzej Bialecki > Fix For: 2.0 > > > In Nutch 1.x it was possible to easily select a set of crawl data (crawldb, > page data, linkdb, etc) by specifying a path where the data was stored. This > enabled users to run several disjoint crawls with different configs, but > still using the same storage medium, just under different paths. > This is not possible now because there is a 1:1 mapping between a specific > DataStore instance and a set of crawl data. > In order to support this functionality the Gora API should be extended so > that it can create stores (and data tables in the underlying storage) that > use arbitrary prefixes to identify the particular crawl dataset. Then the > Nutch API should be extended to allow passing this "crawlId" value to select > one of possibly many existing crawl datasets. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-882) Design a Host table in GORA
[ https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12909757#action_12909757 ] Andrzej Bialecki commented on NUTCH-882: - +1 to NutchContext. See also NUTCH-907 because the changes required in Gora API will likely make this task easier (once implemented ;) ). > Design a Host table in GORA > --- > > Key: NUTCH-882 > URL: https://issues.apache.org/jira/browse/NUTCH-882 > Project: Nutch > Issue Type: New Feature >Affects Versions: 2.0 >Reporter: Julien Nioche >Assignee: Julien Nioche > Fix For: 2.0 > > Attachments: NUTCH-882-v1.patch > > > Having a separate GORA table for storing information about hosts (and > domains?) would be very useful for : > * customising the behaviour of the fetching on a host basis e.g. number of > threads, min time between threads etc... > * storing stats > * keeping metadata and possibly propagate them to the webpages > * keeping a copy of the robots.txt and possibly use that later to filter the > webtable > * store sitemaps files and update the webtable accordingly > I'll try to come up with a GORA schema for such a host table but any comments > are of course already welcome -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-907) DataStore API doesn't support multiple storage areas for multiple disjoint crawls
DataStore API doesn't support multiple storage areas for multiple disjoint crawls - Key: NUTCH-907 URL: https://issues.apache.org/jira/browse/NUTCH-907 Project: Nutch Issue Type: Bug Reporter: Andrzej Bialecki Fix For: 2.0 In Nutch 1.x it was possible to easily select a set of crawl data (crawldb, page data, linkdb, etc) by specifying a path where the data was stored. This enabled users to run several disjoint crawls with different configs, but still using the same storage medium, just under different paths. This is not possible now because there is a 1:1 mapping between a specific DataStore instance and a set of crawl data. In order to support this functionality the Gora API should be extended so that it can create stores (and data tables in the underlying storage) that use arbitrary prefixes to identify the particular crawl dataset. Then the Nutch API should be extended to allow passing this "crawlId" value to select one of possibly many existing crawl datasets. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
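The prefixing scheme proposed here (one physical backend, many logical crawl datasets) boils down to deriving the physical table name from a "crawlId". A trivial sketch of such a naming convention; the separator and method name are assumptions for illustration, not Gora API:

```java
// Hypothetical: map a logical crawlId onto a physical table name so that
// several disjoint crawls can share one storage backend under different
// prefixes, as the issue proposes.
public class CrawlTables {
  static String tableName(String baseTable, String crawlId) {
    // a null or empty crawlId selects the default, un-prefixed dataset
    if (crawlId == null || crawlId.isEmpty()) return baseTable;
    return crawlId + "_" + baseTable;
  }
}
```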
[jira] Commented: (NUTCH-893) DataStore.put() silently loses records when executed from multiple processes
[ https://issues.apache.org/jira/browse/NUTCH-893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12908791#action_12908791 ] Andrzej Bialecki commented on NUTCH-893: - +1 and +1. > DataStore.put() silently loses records when executed from multiple processes > > > Key: NUTCH-893 > URL: https://issues.apache.org/jira/browse/NUTCH-893 > Project: Nutch > Issue Type: Bug >Affects Versions: 2.0 > Environment: Gora HEAD, SqlStore, MySQL 5.1, Ubuntu 10.4 x64, Sun JDK > 1.6 >Reporter: Andrzej Bialecki >Priority: Blocker > Fix For: 2.0 > > Attachments: NUTCH-893.patch, NUTCH-893_v2.patch > > > In order to debug the issue described in NUTCH-879 I created a test to > simulate multiple clients appending to webtable (please see the patch), which > is the situation that we have in distributed map-reduce jobs. > There are two tests there: one that uses multiple threads within the same > JVM, and another that uses single thread in multiple JVMs. Each test first > clears webtable (be careful!), and then puts a bunch of pages, and finally > counts that all are present and their values correspond to keys. To make > things more interesting each execution context (thread or process) closes and > reopens its instance of DataStore a few times. > The multithreaded test passes just fine. However, the multi-process test > fails with missing keys, as many as 30%. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-893) DataStore.put() silently loses records when executed from multiple processes
[ https://issues.apache.org/jira/browse/NUTCH-893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907297#action_12907297 ] Andrzej Bialecki commented on NUTCH-893: - Very good catch - yes, the test now passes for me too. This is actually good news for Gora :) I'll continue digging regarding NUTCH-879 ... don't hesitate if you have any ideas how to solve that. I suspect we may be losing keys in Generator or Fetcher, due to partitioning collisions but this hypothesis needs to be tested. > DataStore.put() silently loses records when executed from multiple processes > > > Key: NUTCH-893 > URL: https://issues.apache.org/jira/browse/NUTCH-893 > Project: Nutch > Issue Type: Bug >Affects Versions: 2.0 > Environment: Gora HEAD, SqlStore, MySQL 5.1, Ubuntu 10.4 x64, Sun JDK > 1.6 >Reporter: Andrzej Bialecki >Priority: Blocker > Fix For: 2.0 > > Attachments: NUTCH-893.patch, NUTCH-893_v2.patch > > > In order to debug the issue described in NUTCH-879 I created a test to > simulate multiple clients appending to webtable (please see the patch), which > is the situation that we have in distributed map-reduce jobs. > There are two tests there: one that uses multiple threads within the same > JVM, and another that uses single thread in multiple JVMs. Each test first > clears webtable (be careful!), and then puts a bunch of pages, and finally > counts that all are present and their values correspond to keys. To make > things more interesting each execution context (thread or process) closes and > reopens its instance of DataStore a few times. > The multithreaded test passes just fine. However, the multi-process test > fails with missing keys, as many as 30%. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-893) DataStore.put() silently loses records when executed from multiple processes
[ https://issues.apache.org/jira/browse/NUTCH-893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904226#action_12904226 ] Andrzej Bialecki commented on NUTCH-893: - Dogacan, flush() doesn't help - there are still missing keys. What's interesting is that the missing keys form sequential ranges. Could this be perhaps an issue with connection management, or some synchronization issue? > DataStore.put() silently loses records when executed from multiple processes > > > Key: NUTCH-893 > URL: https://issues.apache.org/jira/browse/NUTCH-893 > Project: Nutch > Issue Type: Bug >Affects Versions: 2.0 > Environment: Gora HEAD, SqlStore, MySQL 5.1, Ubuntu 10.4 x64, Sun JDK > 1.6 >Reporter: Andrzej Bialecki >Priority: Blocker > Fix For: 2.0 > > Attachments: NUTCH-893.patch > > > In order to debug the issue described in NUTCH-879 I created a test to > simulate multiple clients appending to webtable (please see the patch), which > is the situation that we have in distributed map-reduce jobs. > There are two tests there: one that uses multiple threads within the same > JVM, and another that uses single thread in multiple JVMs. Each test first > clears webtable (be careful!), and then puts a bunch of pages, and finally > counts that all are present and their values correspond to keys. To make > things more interesting each execution context (thread or process) closes and > reopens its instance of DataStore a few times. > The multithreaded test passes just fine. However, the multi-process test > fails with missing keys, as many as 30%. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-893) DataStore.put() silently loses records when executed from multiple processes
[ https://issues.apache.org/jira/browse/NUTCH-893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-893: Attachment: NUTCH-893.patch Unit test to illustrate the issue. > DataStore.put() silently loses records when executed from multiple processes > > > Key: NUTCH-893 > URL: https://issues.apache.org/jira/browse/NUTCH-893 > Project: Nutch > Issue Type: Bug >Affects Versions: 2.0 > Environment: Gora HEAD, SqlStore, MySQL 5.1, Ubuntu 10.4 x64, Sun JDK > 1.6 >Reporter: Andrzej Bialecki > Attachments: NUTCH-893.patch > > > In order to debug the issue described in NUTCH-879 I created a test to > simulate multiple clients appending to webtable (please see the patch), which > is the situation that we have in distributed map-reduce jobs. > There are two tests there: one that uses multiple threads within the same > JVM, and another that uses single thread in multiple JVMs. Each test first > clears webtable (be careful!), and then puts a bunch of pages, and finally > counts that all are present and their values correspond to keys. To make > things more interesting each execution context (thread or process) closes and > reopens its instance of DataStore a few times. > The multithreaded test passes just fine. However, the multi-process test > fails with missing keys, as many as 30%. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-893) DataStore.put() silently loses records when executed from multiple processes
DataStore.put() silently loses records when executed from multiple processes Key: NUTCH-893 URL: https://issues.apache.org/jira/browse/NUTCH-893 Project: Nutch Issue Type: Bug Environment: Gora HEAD, SqlStore, MySQL 5.1, Ubuntu 10.4 x64, Sun JDK 1.6 Reporter: Andrzej Bialecki In order to debug the issue described in NUTCH-879 I created a test to simulate multiple clients appending to webtable (please see the patch), which is the situation that we have in distributed map-reduce jobs. There are two tests there: one that uses multiple threads within the same JVM, and another that uses single thread in multiple JVMs. Each test first clears webtable (be careful!), and then puts a bunch of pages, and finally counts that all are present and their values correspond to keys. To make things more interesting each execution context (thread or process) closes and reopens its instance of DataStore a few times. The multithreaded test passes just fine. However, the multi-process test fails with missing keys, as many as 30%. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-893) DataStore.put() silently loses records when executed from multiple processes
[ https://issues.apache.org/jira/browse/NUTCH-893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-893: Affects Version/s: 2.0 > DataStore.put() silently loses records when executed from multiple processes > > > Key: NUTCH-893 > URL: https://issues.apache.org/jira/browse/NUTCH-893 > Project: Nutch > Issue Type: Bug >Affects Versions: 2.0 > Environment: Gora HEAD, SqlStore, MySQL 5.1, Ubuntu 10.4 x64, Sun JDK > 1.6 >Reporter: Andrzej Bialecki > > In order to debug the issue described in NUTCH-879 I created a test to > simulate multiple clients appending to webtable (please see the patch), which > is the situation that we have in distributed map-reduce jobs. > There are two tests there: one that uses multiple threads within the same > JVM, and another that uses single thread in multiple JVMs. Each test first > clears webtable (be careful!), and then puts a bunch of pages, and finally > counts that all are present and their values correspond to keys. To make > things more interesting each execution context (thread or process) closes and > reopens its instance of DataStore a few times. > The multithreaded test passes just fine. However, the multi-process test > fails with missing keys, as many as 30%. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-891) Nutch build should not depend on unversioned local deps
[ https://issues.apache.org/jira/browse/NUTCH-891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900455#action_12900455 ] Andrzej Bialecki commented on NUTCH-891: - Yes, this would help. > Nutch build should not depend on unversioned local deps > --- > > Key: NUTCH-891 > URL: https://issues.apache.org/jira/browse/NUTCH-891 > Project: Nutch > Issue Type: Bug >Reporter: Andrzej Bialecki > > The fix in NUTCH-873 introduces an unknown variable to the build process. > Since local ivy artifacts are unversioned, different people that install Gora > jars at different points in time will use the same artifact id but in fact > the artifacts (jars) will differ because they will come from different > revisions of Gora sources. Therefore Nutch builds based on the same svn rev. > won't be repeatable across different environments. > As much as it pains the ivy purists ;) until Gora publishes versioned > artifacts I'd like to revert the fix in NUTCH-873 and add again Gora jars > built from a known external rev. We can add a README that contains commit id > from Gora. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-891) Nutch build should not depend on unversioned local deps
[ https://issues.apache.org/jira/browse/NUTCH-891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900285#action_12900285 ] Andrzej Bialecki commented on NUTCH-891: - bq. So, your point is [..] Yes, that's exactly my point. bq. I'd say, why not make the Gora Ant build publish a gora-0.1-.jar? Sure, that would solve the problem for now - I'll bother the Gora devs, and you can create the patch, ok? :) Ultimately we should go with the other solution (publish to Maven), but it requires more involvement from Gora devs. bq. I'm not trying to be difficult about NUTCH-873 ... Neither am I, no egos here - I just find the current situation after the fix to be intractable, especially when doing bugfixing and testing - because even if APIs stay the same, hidden bugs may not be the same across revisions... > Nutch build should not depend on unversioned local deps > --- > > Key: NUTCH-891 > URL: https://issues.apache.org/jira/browse/NUTCH-891 > Project: Nutch > Issue Type: Bug >Reporter: Andrzej Bialecki > > The fix in NUTCH-873 introduces an unknown variable to the build process. > Since local ivy artifacts are unversioned, different people that install Gora > jars at different points in time will use the same artifact id but in fact > the artifacts (jars) will differ because they will come from different > revisions of Gora sources. Therefore Nutch builds based on the same svn rev. > won't be repeatable across different environments. > As much as it pains the ivy purists ;) until Gora publishes versioned > artifacts I'd like to revert the fix in NUTCH-873 and add again Gora jars > built from a known external rev. We can add a README that contains commit id > from Gora. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-884) FetcherJob should run more reduce tasks than default
[ https://issues.apache.org/jira/browse/NUTCH-884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-884. - Resolution: Fixed Committed in rev. 986647. > FetcherJob should run more reduce tasks than default > > > Key: NUTCH-884 > URL: https://issues.apache.org/jira/browse/NUTCH-884 > Project: Nutch > Issue Type: Improvement > Components: fetcher >Affects Versions: 2.0 >Reporter: Andrzej Bialecki >Assignee: Andrzej Bialecki > Fix For: 2.0 > > Attachments: NUTCH-884.patch, NUTCH-884.patch > > > FetcherJob now performs fetching in the reduce phase. This means that in a > typical Hadoop setup there will be many fewer reduce tasks than map tasks, > and consequently the max. total throughput of Fetcher will be proportionally > reduced. I propose that FetcherJob should set the number of reduce tasks to > the number of map tasks. This way the fetching will be more granular. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-891) Nutch build should not depend on unversioned local deps
Nutch build should not depend on unversioned local deps --- Key: NUTCH-891 URL: https://issues.apache.org/jira/browse/NUTCH-891 Project: Nutch Issue Type: Bug Reporter: Andrzej Bialecki The fix in NUTCH-873 introduces an unknown variable to the build process. Since local ivy artifacts are unversioned, different people that install Gora jars at different points in time will use the same artifact id but in fact the artifacts (jars) will differ because they will come from different revisions of Gora sources. Therefore Nutch builds based on the same svn rev. won't be repeatable across different environments. As much as it pains the ivy purists ;) until Gora publishes versioned artifacts I'd like to revert the fix in NUTCH-873 and add again Gora jars built from a known external rev. We can add a README that contains commit id from Gora. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
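For comparison, pinning the build to a versioned artifact would be a one-line ivy.xml entry along these lines (hypothetical coordinates; at the time Gora published no such versioned artifact, which is exactly the problem described above):

```xml
<!-- hypothetical ivy.xml dependency: a versioned, hence repeatable, Gora jar -->
<dependency org="org.apache.gora" name="gora-core" rev="0.1" conf="*->default"/>
```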
[jira] Created: (NUTCH-890) SqlStore doesn't work with nested types in Avro schema
SqlStore doesn't work with nested types in Avro schema -- Key: NUTCH-890 URL: https://issues.apache.org/jira/browse/NUTCH-890 Project: Nutch Issue Type: Bug Affects Versions: 2.0 Environment: Nutch trunk, Gora trunk, Ubuntu 10.4 x64, Sun JDK 1.6, hsqldb-2.0.0, MySQL 5.1.41, HBase 0.20.6 / Hadoop 0.20.2 Reporter: Andrzej Bialecki ParseStatus and ProtocolStatus are not properly serialized and stored when using SqlStore. This may indicate a broader issue in Gora with processing of nested types in Avro schemas. HBaseStore works properly, i.e. both types can be correctly stored and retrieved. SqlStore produces either NULL or '\0\0' value. This happens both when using HSQLDB and MySQL. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-884) FetcherJob should run more reduce tasks than default
[ https://issues.apache.org/jira/browse/NUTCH-884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-884: Attachment: NUTCH-884.patch Corrected mistake in arg handling. > FetcherJob should run more reduce tasks than default > > > Key: NUTCH-884 > URL: https://issues.apache.org/jira/browse/NUTCH-884 > Project: Nutch > Issue Type: Improvement > Components: fetcher >Affects Versions: 2.0 >Reporter: Andrzej Bialecki >Assignee: Andrzej Bialecki > Fix For: 2.0 > > Attachments: NUTCH-884.patch, NUTCH-884.patch > > > FetcherJob now performs fetching in the reduce phase. This means that in a > typical Hadoop setup there will be many fewer reduce tasks than map tasks, > and consequently the max. total throughput of Fetcher will be proportionally > reduced. I propose that FetcherJob should set the number of reduce tasks to > the number of map tasks. This way the fetching will be more granular. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-881) Good quality documentation for Nutch
[ https://issues.apache.org/jira/browse/NUTCH-881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899815#action_12899815 ] Andrzej Bialecki commented on NUTCH-881: - bq. So what is new in Nutch 2.0 which doesn't appear in Nutch 1.x ? Gora is the main thing which comes to mind. Yes. We also removed all search-related code from Nutch and rely exclusively on Solr to perform searching. This means that some APIs have been removed (e.g. query filters, text analysis, lucene indexing backend). bq. How do the config files differ? We still use the same nutch-default/nutch-site.xml, plus per-plugin config files. Some properties have changed, e.g. the ones to limit max. number of urls per host in generator. We added some Gora-related files, gora.properties and gora-*-mapping.xml, that define what driver to use and how to map webtable columns onto storage-specific columns/fields. bq. How does Nutch's use of Hadoop differ? All jobs now use GoraInputFormat / GoraOutputFormat, which hides the details about the actual data storage backend. bq. How do the command lines differ? (Presumably you need different command lines to say where to store the crawldb, right?) Yes. Actually, this could be a separate issue to be solved - currently we assume there is one Nutch webtable per storage backend, so we don't specify the "db identifier" anywhere... but this prevents us from defining multiple crawl configs that use the same backend, so it should be addressed. > Good quality documentation for Nutch > > > Key: NUTCH-881 > URL: https://issues.apache.org/jira/browse/NUTCH-881 > Project: Nutch > Issue Type: Improvement > Components: documentation >Affects Versions: 2.0 >Reporter: Andrzej Bialecki > > This is, and has been, a long-standing request from Nutch users. This becomes > an acute need as we redesign Nutch 2.0, because the collective knowledge and > the Wiki will no longer be useful without a massive amount of editing. 
> IMHO the reference documentation should be in SVN, and not on the Wiki - the > Wiki is good for casual information and recipes but I think it's too messy > and not reliable enough as a reference. > I propose to start with the following: > 1. let's decide on the format of the docs. Each format has its own pros and > cons: > * HTML: easy to work with, but formatting may be messy unless we edit it by > hand, at which point it's no longer so easy... Good toolchains to convert to > other formats, but limited expressiveness of larger structures (e.g. book, > chapters, TOC, multi-column layouts, etc). > * Docbook: learning curve is higher, but not insurmountable... Naturally > yields very good structure. Figures/diagrams may be problematic - different > renderers (html, pdf) like to treat the scaling and placing somewhat > differently. > * Wiki-style (Confluence or TWiki): easy to use, but limited control over > larger structures. Maven Doxia can format cwiki, twiki, and a host of other > formats to e.g. html and pdf. > * other? > 2. start documenting the main tools and the main APIs (e.g. the plugins and > all the extension points). We can of course reuse material from the Wiki and > from various presentations (e.g. the ApacheCon slides). -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-882) Design a Host table in GORA
[ https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899810#action_12899810 ] Andrzej Bialecki commented on NUTCH-882: - This functionality is very useful for larger crawls. Some comments about the design: * the table can be populated by injection, as in the patch, or from webtable. Since keys are from different spaces (url-s vs. hosts) I think it would be very tricky to try to do this on the fly in one of the existing jobs... so this means an additional step in the workflow. * I'm worried about the scalability of the approach taken by HostMDApplierJob - per-host data will be multiplied by the number of urls from a host and put into webtable, which will in turn balloon the size of webtable... A little background: what we see here is a design issue typical for mapreduce, where you have to merge data keyed by keys from different spaces (with different granularity). Possible solutions involve: * first converting the data to a common key space and then submit both data as mapreduce inputs, or * submitting only the finer-grained input to mapreduce and dynamically converting the keys on the fly (and reading data directly from the coarser-grained source, accessing it randomly). A similar situation is described in HADOOP-3063 together with a solution, namely, to use random access and use Bloom filters to quickly discover missing keys. So I propose that instead of statically merging the data (HostMDApplierJob) we could merge it dynamically on the fly, by implementing a high-performance reader of host table, and then use this reader directly in the context of map()/reduce() tasks as needed. This reader should use a Bloom filter to quickly determine nonexistent keys, and it may use a limited amount of in-memory cache for existing records. The bloom filter data should be re-computed on updates and stored/retrieved, to avoid lengthy initialization. 
The cost of using this approach is IMHO much smaller than the cost of statically joining this data. The static join costs both space and time to execute an additional job. Let's consider the dynamic join cost, e.g. in Fetcher - HostDBReader would be used only when initializing host queues, so the number of IO-s would be at most the number of unique hosts on the fetchlist (at most, because some of host data may be missing - here's Bloom filter to the rescue to quickly discover this without doing any IO). During updatedb we would likely want to access this data in DbUpdateReducer. Keys are URLs here, and they are ordered in ascending order - but they are in host-reversed format, which means that URLs from similar hosts and domains are close together. This is beneficial, because when we read data from HostDBReader we will read records that are close together, thus avoiding seeks. We can also cache the retrieved per-host data in DbUpdateReducer. > Design a Host table in GORA > --- > > Key: NUTCH-882 > URL: https://issues.apache.org/jira/browse/NUTCH-882 > Project: Nutch > Issue Type: New Feature >Affects Versions: 2.0 >Reporter: Julien Nioche >Assignee: Julien Nioche > Fix For: 2.0 > > Attachments: NUTCH-882-v1.patch > > > Having a separate GORA table for storing information about hosts (and > domains?) would be very useful for : > * customising the behaviour of the fetching on a host basis e.g. number of > threads, min time between threads etc... > * storing stats > * keeping metadata and possibly propagate them to the webpages > * keeping a copy of the robots.txt and possibly use that later to filter the > webtable > * store sitemaps files and update the webtable accordingly > I'll try to come up with a GORA schema for such a host table but any comments > are of course already welcome -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
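The dynamic-join idea above (a Bloom filter to reject absent hosts without any IO, plus a small in-memory cache for hot records) can be sketched in plain Java. The Bloom filter here is a toy (two correlated hash functions over a BitSet) and the backing function is a stand-in for the real host-table reader; everything is illustrative, not an actual HostDBReader implementation:

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Sketch of a Bloom-filter-guarded reader: definitely-absent keys are
// rejected without touching the backing store (Bloom filters have no
// false negatives); keys that pass the filter go through a cache.
public class GuardedReader<V> {
  private static final int BITS = 1 << 16;

  private final BitSet bloom = new BitSet(BITS);
  private final Map<String, V> cache = new HashMap<>();
  private final Function<String, V> backing; // stand-in for host-table IO

  public GuardedReader(Iterable<String> knownKeys, Function<String, V> backing) {
    this.backing = backing;
    // Precompute the filter over existing keys, as proposed above;
    // in practice this bit set would be stored and reloaded on updates.
    for (String k : knownKeys) {
      bloom.set(h1(k));
      bloom.set(h2(k));
    }
  }

  public V get(String key) {
    // Either bit clear -> the key is certainly absent: skip the IO entirely.
    if (!bloom.get(h1(key)) || !bloom.get(h2(key))) return null;
    // Possible hit (or a false positive): consult cache, then backing store.
    return cache.computeIfAbsent(key, backing);
  }

  private int h1(String k) { return (k.hashCode() & 0x7fffffff) % BITS; }
  private int h2(String k) { return ((k.hashCode() * 31 + 7) & 0x7fffffff) % BITS; }
}
```

On a false positive the backing lookup simply returns null, so correctness does not depend on the filter; the filter only bounds the number of wasted lookups, which is the point of the proposal.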
[jira] Commented: (NUTCH-887) Delegate parsing of feeds to Tika
[ https://issues.apache.org/jira/browse/NUTCH-887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898667#action_12898667 ] Andrzej Bialecki commented on NUTCH-887: -
bq. Huh, what do you mean? Nick just added a bunch of code to handle Compound document detection, and parsing
Ah, good - I missed that, I need to take a closer look at this...
bq. I'm starting to feel the creep of parsing plugins make their way back into Nutch instead of just jumping over into Tika
The "creep" so far is just parse-html, which we were forced to add back because Tika HTML parsing was totally inadequate for our needs. I know there has been some progress on this front, but I suspect it's still not sufficient. The ultimate goal is still to use Tika for all formats that it can handle, preferably "all formats" without further qualifiers ;) > Delegate parsing of feeds to Tika > - > > Key: NUTCH-887 > URL: https://issues.apache.org/jira/browse/NUTCH-887 > Project: Nutch > Issue Type: Wish > Components: parser >Affects Versions: 2.0 >Reporter: Julien Nioche > Fix For: 2.0 > > > [Starting a new thread from https://issues.apache.org/jira/browse/NUTCH-874] > One of the plugins which hasn't been ported yet is the feed parser. We could > rely on the one we recently added to Tika, knowing that there is a > substantial difference in the sense that the Tika feed parser generates a > simple XHTML representation of the document where the feeds are simply > represented as anchors whereas the Nutch version created new documents for > each feed. > There is also the parse-rss plugin in Nutch which is quite similar - what's > the difference with the feed one again? Since the Tika parser would handle > all sorts of feed formats why not simply rely on it? > Any thoughts on this? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-887) Delegate parsing of feeds to Tika
[ https://issues.apache.org/jira/browse/NUTCH-887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898647#action_12898647 ] Andrzej Bialecki commented on NUTCH-887: - bq. If there's something missing that Nutch needs, we'll add it to Tika and roll it into 0.8. There is something missing in Tika, and it's the support for compound documents, but it's not likely to be added in 0.8... not that we have such support in Nutch at the moment - it fell victim to the trunk/nutchbase switch, but it should be added back soon. I'd keep the "feed" plugin around for a while still, as an interim solution until Tika supports compound documents. +1 to getting rid of parse-rss. > Delegate parsing of feeds to Tika > - > > Key: NUTCH-887 > URL: https://issues.apache.org/jira/browse/NUTCH-887 > Project: Nutch > Issue Type: Wish > Components: parser >Affects Versions: 2.0 >Reporter: Julien Nioche > Fix For: 2.0 > > > [Starting a new thread from https://issues.apache.org/jira/browse/NUTCH-874] > One of the plugins which hasn't been ported yet is the feed parser. We could > rely on the one we recently added to Tika, knowing that there is a > substantial difference in the sense that the Tika feed parser generates a > simple XHTML representation of the document where the feeds are simply > represented as anchors whereas the Nutch version created new documents for > each feed. > There is also the parse-rss plugin in Nutch which is quite similar - what's > the difference with the feed one again? Since the Tika parser would handle > all sorts of feed formats why not simply rely on it? > Any thoughts on this? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (NUTCH-842) AutoGenerate WebPage code
[ https://issues.apache.org/jira/browse/NUTCH-842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898196#action_12898196 ] Andrzej Bialecki commented on NUTCH-842: - I think we can put it as depends in the compile target, but then we need to first check if there's any need to generate, i.e. check the timestamp of WebPage.avsc and WebPage.java and generate only when .avsc is more recent. Perhaps we should check in generated sources anyway, because if we don't do it then it will be difficult to compile in IDE-s without running Ant first... though to tell the truth it got difficult already when we switched to ivy. Perhaps we need a separate "eclipse" target to resolve dependencies, run gora compiler and generate Eclipse project files? > AutoGenerate WebPage code > - > > Key: NUTCH-842 > URL: https://issues.apache.org/jira/browse/NUTCH-842 > Project: Nutch > Issue Type: Improvement >Reporter: Doğacan Güney >Assignee: Doğacan Güney > Fix For: 2.0 > > Attachments: NUTCH-842.patch > > > This issue will track the addition of an ant task that will automatically > generate o.a.n.storage.WebPage (and ProtocolStatus and ParseStatus) from > src/gora/webpage.avsc. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
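The up-to-date check proposed above - regenerate only when WebPage.avsc is newer than the generated WebPage.java - is a plain timestamp comparison (Ant's `<uptodate>` task expresses the same thing declaratively). A minimal sketch, with the core logic factored out so it is testable; file names and class name are illustrative, not part of the actual build:

```java
import java.io.File;

/** Sketch of a "generate only when needed" check: run the GORA compiler
 *  only if the schema is newer than its generated output. */
public class GenerateIfStale {
  /** Pure logic: File.lastModified() returns 0 for missing files, so a
   *  missing output (0) always counts as stale. */
  static boolean stale(long schemaModified, long generatedModified) {
    return generatedModified == 0L || schemaModified > generatedModified;
  }

  /** Regenerate when the output is missing or older than the schema. */
  static boolean needsGeneration(File schema, File generated) {
    return stale(schema.lastModified(), generated.lastModified());
  }
}
```

The same condition wired into the build as a dependency of the compile target would make `ant compile` cheap when the schema has not changed.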
[jira] Commented: (NUTCH-886) A .gitignore file for Nutch
[ https://issues.apache.org/jira/browse/NUTCH-886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897638#action_12897638 ] Andrzej Bialecki commented on NUTCH-886: - +1. > A .gitignore file for Nutch > --- > > Key: NUTCH-886 > URL: https://issues.apache.org/jira/browse/NUTCH-886 > Project: Nutch > Issue Type: Improvement > Components: build >Affects Versions: 2.0 >Reporter: Doğacan Güney >Assignee: Doğacan Güney >Priority: Trivial > > We need a .gitignore file under nutch/ so git does not try to track many > unnecessary files. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
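For illustration, a .gitignore along these lines would cover the usual Ant/Ivy build outputs in a Nutch checkout. The entries below are suggestions based on the build described elsewhere in this thread, not the file that was actually committed:

```
build/
logs/
*.class
nutch.job
.classpath
.project
.settings/
```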
[jira] Commented: (NUTCH-884) FetcherJob should run more reduce tasks than default
[ https://issues.apache.org/jira/browse/NUTCH-884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897331#action_12897331 ] Andrzej Bialecki commented on NUTCH-884: - Ok, I'll clarify the help message. bq. If I understood the code correctly, I think this part should be -all and not -threads: Heh, yes of course. Thanks! > FetcherJob should run more reduce tasks than default > > > Key: NUTCH-884 > URL: https://issues.apache.org/jira/browse/NUTCH-884 > Project: Nutch > Issue Type: Improvement > Components: fetcher >Affects Versions: 2.0 >Reporter: Andrzej Bialecki >Assignee: Andrzej Bialecki > Fix For: 2.0 > > Attachments: NUTCH-884.patch > > > FetcherJob now performs fetching in the reduce phase. This means that in a > typical Hadoop setup there will be many fewer reduce tasks than map tasks, > and consequently the max. total throughput of Fetcher will be proportionally > reduced. I propose that FetcherJob should set the number of reduce tasks to > the number of map tasks. This way the fetching will be more granular. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-884) FetcherJob should run more reduce tasks than default
[ https://issues.apache.org/jira/browse/NUTCH-884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-884: Attachment: NUTCH-884.patch Patch with the change. I also rearranged the arguments to FetcherJob.fetch(..) to make more sense (IMHO). > FetcherJob should run more reduce tasks than default > > > Key: NUTCH-884 > URL: https://issues.apache.org/jira/browse/NUTCH-884 > Project: Nutch > Issue Type: Improvement > Components: fetcher >Affects Versions: 2.0 >Reporter: Andrzej Bialecki >Assignee: Andrzej Bialecki > Fix For: 2.0 > > Attachments: NUTCH-884.patch > > > FetcherJob now performs fetching in the reduce phase. This means that in a > typical Hadoop setup there will be many fewer reduce tasks than map tasks, > and consequently the max. total throughput of Fetcher will be proportionally > reduced. I propose that FetcherJob should set the number of reduce tasks to > the number of map tasks. This way the fetching will be more granular. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-872) Change the default fetcher.parse to FALSE
[ https://issues.apache.org/jira/browse/NUTCH-872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-872. - Fix Version/s: 2.0 Resolution: Fixed I changed the name of the option to "-parse" to be consistent with the nutch-default.xml naming. I also updated the API to use this name, it's less confusing this way. Committed in rev. 984401. Thanks for the feedback. > Change the default fetcher.parse to FALSE > - > > Key: NUTCH-872 > URL: https://issues.apache.org/jira/browse/NUTCH-872 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.2, 2.0 >Reporter: Andrzej Bialecki >Assignee: Andrzej Bialecki > Fix For: 2.0 > > > I propose to change this property to false. The reason is that it's a safer > default - parsing issues don't lead to a loss of the downloaded content. For > larger crawls this is the recommended way to run Fetcher. Users that run > smaller crawls can still override it. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
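With the default flipped to false, users running smaller crawls can re-enable parse-during-fetch by overriding the property in conf/nutch-site.xml. The property name comes from the issue title; the description wording here is paraphrased, not quoted from nutch-default.xml:

```xml
<property>
  <name>fetcher.parse</name>
  <value>true</value>
  <description>If true, fetcher will parse content during fetching
  instead of in a separate parse step.</description>
</property>
```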
[jira] Created: (NUTCH-884) FetcherJob should run more reduce tasks than default
FetcherJob should run more reduce tasks than default Key: NUTCH-884 URL: https://issues.apache.org/jira/browse/NUTCH-884 Project: Nutch Issue Type: Improvement Components: fetcher Affects Versions: 2.0 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki Fix For: 2.0 FetcherJob now performs fetching in the reduce phase. This means that in a typical Hadoop setup there will be many fewer reduce tasks than map tasks, and consequently the max. total throughput of Fetcher will be proportionally reduced. I propose that FetcherJob should set the number of reduce tasks to the number of map tasks. This way the fetching will be more granular. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-881) Good quality documentation for Nutch
Good quality documentation for Nutch

Key: NUTCH-881
URL: https://issues.apache.org/jira/browse/NUTCH-881
Project: Nutch
Issue Type: Improvement
Components: documentation
Affects Versions: 2.0
Reporter: Andrzej Bialecki

This is, and has been, a long-standing request from Nutch users. It becomes an acute need as we redesign Nutch 2.0, because the collective knowledge on the Wiki will no longer be useful without a massive amount of editing. IMHO the reference documentation should be in SVN, and not on the Wiki - the Wiki is good for casual information and recipes but I think it's too messy and not reliable enough as a reference. I propose to start with the following:
1. Let's decide on the format of the docs. Each format has its own pros and cons:
* HTML: easy to work with, but formatting may be messy unless we edit it by hand, at which point it's no longer so easy... Good toolchains to convert to other formats, but limited expressiveness for larger structures (e.g. book, chapters, TOC, multi-column layouts, etc.).
* Docbook: the learning curve is higher, but not insurmountable... Naturally yields very good structure. Figures/diagrams may be problematic - different renderers (html, pdf) like to treat the scaling and placing somewhat differently.
* Wiki-style (Confluence or TWiki): easy to use, but limited control over larger structures. Maven Doxia can format cwiki, twiki, and a host of other formats to e.g. html and pdf.
* other?
2. Start documenting the main tools and the main APIs (e.g. the plugins and all the extension points). We can of course reuse material from the Wiki and from various presentations (e.g. the ApacheCon slides).
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-880) REST API (and webapp) for Nutch
[ https://issues.apache.org/jira/browse/NUTCH-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-880: Description: This issue is for discussing a REST-style API for accessing Nutch. Here's an initial idea: * I propose to use org.restlet for handling requests and returning JSON/XML/whatever responses. * hook up all regular tools so that they can be driven via this API. This would have to be an async API, since all Nutch operations take long time to execute. It follows then that we need to be able also to list running operations, retrieve their current status, and possibly abort/cancel/stop/suspend/resume/...? This also means that we would have to potentially create & manage many threads in a servlet - AFAIK this is frowned upon by J2EE purists... * package this in a webapp (that includes all deps, essentially nutch.job content), with the restlet servlet as an entry point. Open issues: * how to implement the reading of crawl results via this API * should we manage only crawls that use a single configuration per webapp, or should we have a notion of crawl contexts (sets of crawl configs) with CRUD ops on them? this would be nice, because it would allow managing of several different crawls, with different configs, in a single webapp - but it complicates the implementation a lot. was: This issue is for discussing a REST-style API for accessing Nutch. Here's an initial idea: * I propose to use org.restlet for handling JSON requests * hook up all regular tools so that they can be driven via this API. This would have to be an async API, since all Nutch operations take long time to execute. It follows then that we need to be able also to list running operations, retrieve their current status, and possibly abort/cancel/stop/suspend/resume/...? This also means that we would have to potentially create & manage many threads in a servlet - AFAIK this is frowned upon by J2EE purists... 
* package this in a webapp (that includes all deps, essentially nutch.job content), with the restlet servlet as an entry point. Open issues: * how to implement the reading of crawl results via this API * should we manage only crawls that use a single configuration per webapp, or should we have a notion of crawl contexts (sets of crawl configs) with CRUD ops on them? this would be nice, because it would allow managing of several different crawls, with different configs, in a single webapp - but it complicates the implementation a lot. > REST API (and webapp) for Nutch > --- > > Key: NUTCH-880 > URL: https://issues.apache.org/jira/browse/NUTCH-880 > Project: Nutch > Issue Type: New Feature >Affects Versions: 2.0 >Reporter: Andrzej Bialecki > > This issue is for discussing a REST-style API for accessing Nutch. > Here's an initial idea: > * I propose to use org.restlet for handling requests and returning > JSON/XML/whatever responses. > * hook up all regular tools so that they can be driven via this API. This > would have to be an async API, since all Nutch operations take long time to > execute. It follows then that we need to be able also to list running > operations, retrieve their current status, and possibly > abort/cancel/stop/suspend/resume/...? This also means that we would have to > potentially create & manage many threads in a servlet - AFAIK this is frowned > upon by J2EE purists... > * package this in a webapp (that includes all deps, essentially nutch.job > content), with the restlet servlet as an entry point. > Open issues: > * how to implement the reading of crawl results via this API > * should we manage only crawls that use a single configuration per webapp, or > should we have a notion of crawl contexts (sets of crawl configs) with CRUD > ops on them? this would be nice, because it would allow managing of several > different crawls, with different configs, in a single webapp - but it > complicates the implementation a lot. 
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-880) REST API (and webapp) for Nutch
REST API (and webapp) for Nutch

Key: NUTCH-880
URL: https://issues.apache.org/jira/browse/NUTCH-880
Project: Nutch
Issue Type: New Feature
Affects Versions: 2.0
Reporter: Andrzej Bialecki

This issue is for discussing a REST-style API for accessing Nutch. Here's an initial idea:
* I propose to use org.restlet for handling JSON requests.
* Hook up all regular tools so that they can be driven via this API. This would have to be an async API, since all Nutch operations take a long time to execute. It follows that we also need to be able to list running operations, retrieve their current status, and possibly abort/cancel/stop/suspend/resume/...? This also means that we would potentially have to create & manage many threads in a servlet - AFAIK this is frowned upon by J2EE purists...
* Package this in a webapp (that includes all deps, essentially the nutch.job content), with the restlet servlet as an entry point.
Open issues:
* how to implement the reading of crawl results via this API
* should we manage only crawls that use a single configuration per webapp, or should we have a notion of crawl contexts (sets of crawl configs) with CRUD ops on them? This would be nice, because it would allow managing several different crawls, with different configs, in a single webapp - but it complicates the implementation a lot.
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
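The async pattern sketched in the proposal - submit returns immediately with an id, callers poll for status and may cancel - can be illustrated without any REST framework. The class and method names below are invented for this sketch and are not the actual Nutch API; in the real webapp the registry would sit behind restlet handlers.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Sketch of the async operation registry a REST layer would need:
 *  long-running jobs are submitted, tracked by id, and can be
 *  polled or cancelled. Purely illustrative. */
public class JobRegistry {
  private final ExecutorService pool = Executors.newCachedThreadPool();
  private final Map<String, Future<?>> jobs = new ConcurrentHashMap<>();

  /** Submit a long-running operation; returns an id immediately. */
  public <T> String submit(Callable<T> op) {
    String id = UUID.randomUUID().toString();
    jobs.put(id, pool.submit(op));
    return id;
  }

  /** Status for polling; would map onto an HTTP response body. */
  public String status(String id) {
    Future<?> f = jobs.get(id);
    if (f == null) return "UNKNOWN";
    if (f.isCancelled()) return "CANCELLED";
    return f.isDone() ? "DONE" : "RUNNING";
  }

  /** Best-effort cancel, interrupting the worker thread if running. */
  public boolean cancel(String id) {
    Future<?> f = jobs.get(id);
    return f != null && f.cancel(true);
  }

  public void shutdown() {
    pool.shutdownNow();
  }
}
```

Keeping the thread pool in one registry object (rather than spawning threads ad hoc per request) is also the usual answer to the "threads in a servlet" concern raised above: the container sees a single managed component with a defined shutdown.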
[jira] Commented: (NUTCH-879) URL-s getting lost
[ https://issues.apache.org/jira/browse/NUTCH-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897182#action_12897182 ] Andrzej Bialecki commented on NUTCH-879: - I haven't tried hbase yet, but I'm going to - will update this issue soon. > URL-s getting lost > -- > > Key: NUTCH-879 > URL: https://issues.apache.org/jira/browse/NUTCH-879 > Project: Nutch > Issue Type: Bug >Affects Versions: 2.0 > Environment: * Ubuntu 10.4 x64, Sun JDK 1.6 > * using 1-node Hadoop + HDFS > * trunk r983472, using MySQL store > * branch-1.3 >Reporter: Andrzej Bialecki > Attachments: branch-1.3-bench.txt, trunk-bench.txt > > > I ran the Benchmark using branch-1.3 and trunk (formerly nutchbase). With the > same Benchmark parameters and the same plugins branch-1.3 collects ~1.5mln > urls, while trunk collects ~20,000 urls. Clearly something is wrong. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-876) Remove remaining robots/IP blocking code in lib-http
[ https://issues.apache.org/jira/browse/NUTCH-876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-876. - Fix Version/s: 2.0 Resolution: Fixed Committed in rev. 984337. > Remove remaining robots/IP blocking code in lib-http > > > Key: NUTCH-876 > URL: https://issues.apache.org/jira/browse/NUTCH-876 > Project: Nutch > Issue Type: Bug > Components: fetcher >Affects Versions: 2.0 >Reporter: Andrzej Bialecki >Assignee: Andrzej Bialecki > Fix For: 2.0 > > Attachments: NUTCH-876.patch > > > There are remains of the (very old) blocking code in > lib-http/.../HttpBase.java. This code was used with the OldFetcher to manage > politeness limits. New trunk doesn't have OldFetcher anymore, so this code is > useless. Furthermore, there is an actual bug here - FetcherJob forgets to set > Protocol.CHECK_BLOCKING and Protocol.CHECK_ROBOTS to false, and the defaults > in lib-http are set to true. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-879) URL-s getting lost
[ https://issues.apache.org/jira/browse/NUTCH-879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-879: Attachment: branch-1.3-bench.txt trunk-bench.txt DB stats and benchmark results. > URL-s getting lost > -- > > Key: NUTCH-879 > URL: https://issues.apache.org/jira/browse/NUTCH-879 > Project: Nutch > Issue Type: Bug >Affects Versions: 2.0 > Environment: * Ubuntu 10.4 x64, Sun JDK 1.6 > * using 1-node Hadoop + HDFS > * trunk r983472, using MySQL store > * branch-1.3 >Reporter: Andrzej Bialecki > Attachments: branch-1.3-bench.txt, trunk-bench.txt > > > I ran the Benchmark using branch-1.3 and trunk (formerly nutchbase). With the > same Benchmark parameters and the same plugins branch-1.3 collects ~1.5mln > urls, while trunk collects ~20,000 urls. Clearly something is wrong. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-879) URL-s getting lost
URL-s getting lost -- Key: NUTCH-879 URL: https://issues.apache.org/jira/browse/NUTCH-879 Project: Nutch Issue Type: Bug Affects Versions: 2.0 Environment: * Ubuntu 10.4 x64, Sun JDK 1.6 * using 1-node Hadoop + HDFS * trunk r983472, using MySQL store * branch-1.3 Reporter: Andrzej Bialecki I ran the Benchmark using branch-1.3 and trunk (formerly nutchbase). With the same Benchmark parameters and the same plugins branch-1.3 collects ~1.5mln urls, while trunk collects ~20,000 urls. Clearly something is wrong. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-876) Remove remaining robots/IP blocking code in lib-http
[ https://issues.apache.org/jira/browse/NUTCH-876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-876: Attachment: NUTCH-876.patch Patch to fix the issue. If there are no objections I'll commit this shortly. > Remove remaining robots/IP blocking code in lib-http > > > Key: NUTCH-876 > URL: https://issues.apache.org/jira/browse/NUTCH-876 > Project: Nutch > Issue Type: Bug > Components: fetcher >Affects Versions: 2.0 >Reporter: Andrzej Bialecki >Assignee: Andrzej Bialecki > Attachments: NUTCH-876.patch > > > There are remains of the (very old) blocking code in > lib-http/.../HttpBase.java. This code was used with the OldFetcher to manage > politeness limits. New trunk doesn't have OldFetcher anymore, so this code is > useless. Furthermore, there is an actual bug here - FetcherJob forgets to set > Protocol.CHECK_BLOCKING and Protocol.CHECK_ROBOTS to false, and the defaults > in lib-http are set to true. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (NUTCH-876) Remove remaining robots/IP blocking code in lib-http
Remove remaining robots/IP blocking code in lib-http Key: NUTCH-876 URL: https://issues.apache.org/jira/browse/NUTCH-876 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 2.0 Reporter: Andrzej Bialecki Assignee: Andrzej Bialecki There are remains of the (very old) blocking code in lib-http/.../HttpBase.java. This code was used with the OldFetcher to manage politeness limits. New trunk doesn't have OldFetcher anymore, so this code is useless. Furthermore, there is an actual bug here - FetcherJob forgets to set Protocol.CHECK_BLOCKING and Protocol.CHECK_ROBOTS to false, and the defaults in lib-http are set to true. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (NUTCH-858) No longer able to set per-field boosts on lucene documents
[ https://issues.apache.org/jira/browse/NUTCH-858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki resolved NUTCH-858. - Resolution: Fixed Committed to branch-1.2, revision 982970. > No longer able to set per-field boosts on lucene documents > -- > > Key: NUTCH-858 > URL: https://issues.apache.org/jira/browse/NUTCH-858 > Project: Nutch > Issue Type: Bug > Components: indexer >Affects Versions: 1.1 > Environment: n/a >Reporter: Edward Drapkin >Assignee: Andrzej Bialecki > Fix For: 1.2 > > Attachments: nutchdoc.patch > > > I'm working on upgrading from Nutch 0.9 to Nutch 1.1 and I've noticed that it > no longer seems possible to set boosts on specific fields in lucene > documents. This is, in my opinion, a major feature regression and removes a > huge component to fine tuning search. Can this be added? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-858) No longer able to set per-field boosts on lucene documents
[ https://issues.apache.org/jira/browse/NUTCH-858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrzej Bialecki updated NUTCH-858: Attachment: nutchdoc.patch Here's the ported functionality. Please note that the standard plugins still don't use per-field boosts. If no objections I'll commit this shortly. > No longer able to set per-field boosts on lucene documents > -- > > Key: NUTCH-858 > URL: https://issues.apache.org/jira/browse/NUTCH-858 > Project: Nutch > Issue Type: Bug > Components: indexer >Affects Versions: 1.1 > Environment: n/a >Reporter: Edward Drapkin >Assignee: Andrzej Bialecki > Fix For: 1.2 > > Attachments: nutchdoc.patch > > > I'm working on upgrading from Nutch 0.9 to Nutch 1.1 and I've noticed that it > no longer seems possible to set boosts on specific fields in lucene > documents. This is, in my opinion, a major feature regression and removes a > huge component to fine tuning search. Can this be added? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.