[jira] [Commented] (NUTCH-1087) Deprecate crawl command and replace with example script

2011-08-23 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13089405#comment-13089405
 ] 

Andrzej Bialecki  commented on NUTCH-1087:
--


IIRC we had this discussion in the past... It's true that we already rely on 
Bash to do anything useful, no matter whether it's on Windows or on a *nix-like 
OS. And it's true that the crawl command has been a constant source of 
confusion over the years. The crawl application also suffered from some subtle 
bugs, especially when running in local mode (e.g. the PluginRepository leaks).

But the argument about maintenance costs is IMHO moot - you have to maintain a 
shell script, too, so it's no different from maintaining a Java class. Where it 
differs, I think, is that moving the crawl cycle logic to a shell script now 
raises the bar for Java developers who are not familiar with Bash scripting - a 
robust crawl script is not easy to follow, as it needs to handle error 
conditions and manage input/output resources on HDFS. On the other hand, it's 
easier for system admins to tweak a script than to tweak Java code... so I 
guess it's also a question of who the audience for this functionality is.

I'm +0 on removing Crawl and replacing it with a script; IMHO it doesn't 
change the picture in any significant way.
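For readers who haven't looked at the Crawl class: the logic being debated above is essentially a loop over generate/fetch/parse/update jobs. A minimal Java sketch of that loop follows; the step methods are hypothetical stand-ins for the real Nutch jobs (Injector, Generator, Fetcher, ParseSegment, CrawlDb update), not actual Nutch API calls.

```java
// Illustrative sketch of the crawl cycle that the Crawl class (or a
// replacement shell script) drives. The methods are made-up stand-ins
// for the real Nutch jobs, not the actual Nutch API.
public class CrawlCycleSketch {
    static void inject(String crawlDb, String seeds) { System.out.println("inject " + seeds); }
    static String generate(String crawlDb) { return "segment-1"; }
    static void fetch(String segment) { System.out.println("fetch " + segment); }
    static void parse(String segment) { System.out.println("parse " + segment); }
    static void updateDb(String crawlDb, String segment) { System.out.println("updatedb " + segment); }

    public static void main(String[] args) {
        String crawlDb = "crawl/crawldb";
        inject(crawlDb, "urls/seed.txt");
        int rounds = 3; // the "depth" of the old crawl command
        for (int i = 0; i < rounds; i++) {
            String segment = generate(crawlDb);
            fetch(segment);
            parse(segment);            // only needed when fetcher.parse is false
            updateDb(crawlDb, segment);
        }
    }
}
```

A robust shell version of this loop must additionally check each job's exit code and clean up half-written segments on HDFS, which is the complexity referred to above.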


> Deprecate crawl command and replace with example script
> ---
>
> Key: NUTCH-1087
> URL: https://issues.apache.org/jira/browse/NUTCH-1087
> Project: Nutch
>  Issue Type: Task
>Affects Versions: 1.4
>Reporter: Markus Jelsma
>Priority: Minor
> Fix For: 1.4
>
>
> * remove the crawl command
> * add basic crawl shell script
> See thread:
> http://www.mail-archive.com/dev@nutch.apache.org/msg03848.html

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (NUTCH-1014) Migrate from Apache ORO to java.util.regex

2011-07-19 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13067972#comment-13067972
 ] 

Andrzej Bialecki  commented on NUTCH-1014:
--

java.util.regex has the advantage of being a part of the JRE. However, it is 
quite slow for more complex regexes. See e.g. this benchmark: 
http://www.tusker.org/regex/regex_benchmark.html . In my experience with larger 
crawls this is especially important when using regexes for URL filtering and 
normalization - an innocent-looking regex can melt the CPU when processing a 
64kB junk URL, and consequently stall the crawl... In such cases it's good to 
have the option to fall back to a subset of regex features and use a DFA-based 
library such as Brics. ORO is generally faster than j.u.regex (though it is no 
longer maintained). Brics lacks support for many operators, 
but it's fast. Perhaps ICU4j would be a good alternative - it's fully 
JDK-compatible and offers good performance.
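The URL-filtering scenario above can be sketched with java.util.regex as follows; the filter rules here are made-up examples, not Nutch's shipped regex-urlfilter.txt.

```java
import java.util.regex.Pattern;

// Sketch of regex-based URL filtering with java.util.regex. The rules
// below are illustrative examples, not Nutch's actual filter rules.
public class UrlFilterSketch {
    // Anchored, mostly-literal patterns like this one are cheap to evaluate.
    private static final Pattern SKIP_EXT =
        Pattern.compile("(?i)\\.(gif|jpg|png|css|js)$");
    // By contrast, patterns with nested unbounded quantifiers (e.g. "(a+)+b")
    // can backtrack exponentially on long non-matching junk URLs - the
    // "melt the CPU" failure mode described above. A DFA-based engine such
    // as Brics evaluates them in linear time instead.

    public static boolean accept(String url) {
        if (url.length() > 2048) return false; // cheap guard against junk URLs
        return !SKIP_EXT.matcher(url).find();
    }

    public static void main(String[] args) {
        System.out.println(accept("http://example.com/index.html")); // true
        System.out.println(accept("http://example.com/logo.png"));   // false
    }
}
```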

> Migrate from Apache ORO to java.util.regex
> --
>
> Key: NUTCH-1014
> URL: https://issues.apache.org/jira/browse/NUTCH-1014
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Markus Jelsma
> Fix For: 1.4, 2.0
>
>
> A separate issue tracking migration of all components from Apache ORO to 
> java.util.regex. Components involved are:
> - RegexURLNormalzier
> - OutlinkExtractor
> - JSParseFilter
> - MoreIndexingFilter
> - BasicURLNormalizer





[jira] [Commented] (NUTCH-985) MoreIndexingFilter doesn't use properly formatted date fields for Solr

2011-05-17 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13034724#comment-13034724
 ] 

Andrzej Bialecki  commented on NUTCH-985:
-

We should use Solr's DateUtil in all such places, to avoid code duplication 
and confusion should the date format ever change... The patch does essentially 
the same thing as DateUtil, except that DateUtil reuses SimpleDateFormat 
instances in a thread-safe way, so it's more efficient.
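The thread-safe reuse mentioned above can be sketched as follows; this is a simplified illustration of the per-thread-instance idea behind Solr's DateUtil, not its actual source.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Sketch of the idea behind Solr's DateUtil: SimpleDateFormat is not
// thread-safe, so rather than creating a fresh instance per call (or
// sharing one across threads), keep one instance per thread and reuse it.
public class SolrDateSketch {
    private static final ThreadLocal<SimpleDateFormat> FORMAT =
        ThreadLocal.withInitial(() -> {
            SimpleDateFormat f =
                new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'Z'", Locale.ROOT);
            f.setTimeZone(TimeZone.getTimeZone("UTC"));
            return f;
        });

    public static String format(long epochMillis) {
        return FORMAT.get().format(new Date(epochMillis));
    }

    public static void main(String[] args) {
        System.out.println(format(0L)); // 1970-01-01T00:00:00.000Z
    }
}
```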

> MoreIndexingFilter doesn't use properly formatted date fields for Solr
> --
>
> Key: NUTCH-985
> URL: https://issues.apache.org/jira/browse/NUTCH-985
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.3, 2.0
>Reporter: Dietrich Schmidt
>Assignee: Markus Jelsma
> Fix For: 1.3, 2.0
>
> Attachments: NUTCH-985-trunk-1.patch, NUTCH-985.1.3-1.patch, 
> indexlastmodifieddate.jar
>
>
> I am using the index-more plugin to parse the lastModified data in web
> pages in order to store it in a Solr data field.
> In solrindex-mapping.xml I am mapping lastModified to a field "changed" in 
> Solr:
> <field dest="changed" source="lastModified"/>
> However, when posting data to Solr the SolrIndexer posts it as a long,
> not as a date:
> <long name="changed">107932680</long>
> <long name="tstamp">20110414144140188</long>
> <long name="date">20040315</long>
> Solr rejects the data because of the improper data type.



[jira] Commented: (NUTCH-955) Ivy configuration

2011-03-09 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13004509#comment-13004509
 ] 

Andrzej Bialecki  commented on NUTCH-955:
-

Committed with a tweak in rev. 1079770. Thanks!

> Ivy configuration
> -
>
> Key: NUTCH-955
> URL: https://issues.apache.org/jira/browse/NUTCH-955
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 2.0
>Reporter: Alexis
>Assignee: Andrzej Bialecki 
> Fix For: 2.0
>
> Attachments: ivy.patch
>
>
> As mentioned in NUTCH-950, we can slightly improve the Ivy configuration to 
> help setup the Gora backend more easily.
> If the user does not want to stick with default HSQL database, other 
> alternatives exist, such as MySQL and HBase.
> org.restlet and xercesImpl versions should be changed as well.



[jira] Resolved: (NUTCH-955) Ivy configuration

2011-03-09 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  resolved NUTCH-955.
-

   Resolution: Fixed
Fix Version/s: 2.0
 Assignee: Andrzej Bialecki 




[jira] Commented: (NUTCH-962) max. redirects not handled correctly: fetcher stops at max-1 redirects

2011-03-09 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13004501#comment-13004501
 ] 

Andrzej Bialecki  commented on NUTCH-962:
-

Committed in 1079764 (trunk) and 1079765 (1.3). Thank you!

> max. redirects not handled correctly: fetcher stops at max-1 redirects
> --
>
> Key: NUTCH-962
> URL: https://issues.apache.org/jira/browse/NUTCH-962
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.2, 1.3, 2.0
>Reporter: Sebastian Nagel
>Assignee: Andrzej Bialecki 
> Fix For: 1.3, 2.0
>
> Attachments: Fetcher_redir.patch
>
>
> The fetcher stops following redirects one redirect before the max. redirects 
> is reached.
> The description of http.redirect.max
> > The maximum number of redirects the fetcher will follow when
> > trying to fetch a page. If set to negative or 0, fetcher won't immediately
> > follow redirected URLs, instead it will record them for later fetching.
> suggests that if it is set to 1, one redirect will be followed.
> I tried to crawl two documents, the first redirecting via a meta refresh tag
> to the second, with http.redirect.max = 1.
> The second document is not fetched and the URL has state GONE in CrawlDb.
> fetching file:/test/redirects/meta_refresh.html
> redirectCount=0
> -finishing thread FetcherThread, activeThreads=1
>  - content redirect to file:/test/redirects/to/meta_refresh_target.html 
> (fetching now)
>  - redirect count exceeded file:/test/redirects/to/meta_refresh_target.html
> The attached patch would fix this: if http.redirect.max is 1, one redirect 
> is followed.
> Of course, this would mean there is no possibility to skip redirects at all 
> since 0
> (as well as negative values) means "treat redirects as ordinary links".
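The off-by-one can be illustrated independent of the Fetcher internals; the loop below is a hypothetical reduction of the redirect accounting, not the actual Fetcher code.

```java
// Minimal sketch of the boundary condition described in this issue.
// With the buggy comparison, http.redirect.max=1 follows zero redirects;
// the fix lets it follow exactly one.
public class RedirectSketch {
    static int follow(int chainLength, int maxRedirects, boolean fixed) {
        int redirectCount = 0;
        while (redirectCount < chainLength) {
            boolean allowed = fixed
                ? redirectCount < maxRedirects      // fixed: allow up to max
                : redirectCount + 1 < maxRedirects; // buggy: stops at max-1
            if (!allowed) break;
            redirectCount++;
        }
        return redirectCount;
    }

    public static void main(String[] args) {
        System.out.println(follow(5, 1, false)); // 0 - stops one redirect early
        System.out.println(follow(5, 1, true));  // 1 - follows exactly one
    }
}
```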



[jira] Resolved: (NUTCH-962) max. redirects not handled correctly: fetcher stops at max-1 redirects

2011-03-09 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  resolved NUTCH-962.
-

   Resolution: Fixed
Fix Version/s: 2.0
   1.3
 Assignee: Andrzej Bialecki 




[jira] Commented: (NUTCH-951) Backport changes from 2.0 into 1.3

2011-03-09 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13004490#comment-13004490
 ] 

Andrzej Bialecki  commented on NUTCH-951:
-

All changes have been ported. Thanks everyone!

> Backport changes from 2.0 into 1.3
> --
>
> Key: NUTCH-951
> URL: https://issues.apache.org/jira/browse/NUTCH-951
> Project: Nutch
>  Issue Type: Task
>Affects Versions: 1.3
>Reporter: Julien Nioche
>Assignee: Andrzej Bialecki 
>Priority: Blocker
> Fix For: 1.3
>
>
> I've compared the changes from 2.0 with 1.3 and found the following 
> differences (excluding anything specific to 2.0/GORA)
> *  NUTCH-564 External parser supports encoding attribute (Antony 
> Bowesman, mattmann)
> *  NUTCH-714 Need a SFTP and SCP Protocol Handler (Sanjoy Ghosh, mattmann)
> *  NUTCH-825 Publish nutch artifacts to central maven repository 
> (mattmann)
> *  NUTCH-851 Port logging to slf4j (jnioche)
> *  NUTCH-861 Renamed HTMLParseFilter into ParseFilter
> *  NUTCH-872 Change the default fetcher.parse to FALSE (ab).
> *  NUTCH-876 Remove remaining robots/IP blocking code in lib-http (ab)
> *  NUTCH-880 REST API for Nutch (ab)
> *  NUTCH-883 Remove unused parameters from nutch-default.xml (jnioche)
> *  NUTCH-884 FetcherJob should run more reduce tasks than default (ab)
> *  NUTCH-886 A .gitignore file for Nutch (dogacan)
> *  NUTCH-894 Move statistical language identification from indexing to 
> parsing step
> *  NUTCH-921 Reduce dependency of Nutch on config files (ab)
> *  NUTCH-930 Remove remaining dependencies on Lucene API (ab)
> *  NUTCH-931 Simple admin API to fetch status and stop the service (ab)
> *  NUTCH-932 Bulk REST API to retrieve crawl results as JSON (ab)
> Let's go through this and decide what to port to 1.3



[jira] Resolved: (NUTCH-951) Backport changes from 2.0 into 1.3

2011-03-09 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  resolved NUTCH-951.
-

Resolution: Fixed




[jira] Commented: (NUTCH-951) Backport changes from 2.0 into 1.3

2011-03-09 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13004488#comment-13004488
 ] 

Andrzej Bialecki  commented on NUTCH-951:
-

* Ported NUTCH-872 in rev. 1079746.
* Ported NUTCH-876 in rev. 1079753.
* Ported NUTCH-921 in rev. 1079760.
* NUTCH-884 is not applicable to 1.3 because here fetching executes in map 
tasks, so there's a correct number of them already.




[jira] Commented: (NUTCH-964) ERROR conf.Configuration - Failed to set setXIncludeAware(true)

2011-01-27 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12987547#action_12987547
 ] 

Andrzej Bialecki  commented on NUTCH-964:
-

This error has been bothering me for a while, too - it's great that an upgrade 
fixes it and doesn't break other stuff ;) One area that was sensitive to Xerces 
versions in the past was the Neko parser (in parse-html) but if its tests pass 
then +1 to commit the patch. We should upgrade trunk too.

> ERROR conf.Configuration - Failed to set setXIncludeAware(true)
> ---
>
> Key: NUTCH-964
> URL: https://issues.apache.org/jira/browse/NUTCH-964
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.3
>Reporter: Markus Jelsma
> Fix For: 1.3
>
> Attachments: NUTCH-964.patch
>
>
> Each executed job results in a number of occurrences of the exception below:
> 2011-01-27 13:40:34,457 ERROR conf.Configuration - Failed to set 
> setXIncludeAware(true) for parser 
> org.apache.xerces.jaxp.DocumentBuilderFactoryImpl@3801318b:java.lang.UnsupportedOperationException:
>  This parser does not support specification "null" version "null"
> java.lang.UnsupportedOperationException: This parser does not support 
> specification "null" version "null"
> at 
> javax.xml.parsers.DocumentBuilderFactory.setXIncludeAware(DocumentBuilderFactory.java:590)
> at 
> org.apache.hadoop.conf.Configuration.loadResource(Configuration.java:1054)
> at 
> org.apache.hadoop.conf.Configuration.loadResources(Configuration.java:1040)
> at 
> org.apache.hadoop.conf.Configuration.getProps(Configuration.java:980)
> at org.apache.hadoop.conf.Configuration.get(Configuration.java:436)
> at org.apache.hadoop.fs.FileSystem.getDefaultUri(FileSystem.java:103)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95)
> at org.apache.nutch.crawl.Injector.inject(Injector.java:230)
> at org.apache.nutch.crawl.Injector.run(Injector.java:248)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at org.apache.nutch.crawl.Injector.main(Injector.java:238)
> This can be fixed by upgrading xercesImpl from 2.6.2 to 2.9.1. I've modified 
> ivy and lib-xml's ivy configuration and can commit it. The question is, is 
> upgrading the correct approach? I've tested Nutch with 2.9.1 and, apart from 
> the absence of the annoying exception, everything works as expected.
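The failure mode can be reproduced in miniature: Hadoop's Configuration.loadResource calls setXIncludeAware(true), and a pre-2.9 Xerces DocumentBuilderFactoryImpl throws UnsupportedOperationException from that call. The sketch below mirrors that try/catch pattern; it is an illustration, not Hadoop's actual code.

```java
import javax.xml.parsers.DocumentBuilderFactory;

// Sketch of what Hadoop's Configuration does around setXIncludeAware,
// and why an old Xerces on the classpath produces the logged error:
// pre-2.9 DocumentBuilderFactoryImpl throws UnsupportedOperationException
// from setXIncludeAware(true).
public class XIncludeSketch {
    public static String tryEnableXInclude(DocumentBuilderFactory dbf) {
        try {
            dbf.setXIncludeAware(true);
            return "ok";
        } catch (UnsupportedOperationException e) {
            // Hadoop logs "Failed to set setXIncludeAware(true)" here
            // and carries on without XInclude support.
            return "unsupported";
        }
    }

    public static void main(String[] args) {
        // With the JRE's default factory (or Xerces >= 2.9) this succeeds.
        System.out.println(tryEnableXInclude(DocumentBuilderFactory.newInstance()));
    }
}
```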

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-939) Added -dir command line option to Indexer and SolrIndexer, allowing to specify directory containing segments

2010-12-21 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12973915#action_12973915
 ] 

Andrzej Bialecki  commented on NUTCH-939:
-

1.2 release is out, and branch-1.2 is unlikely to result in a subsequent 
release - most users seem to be interested either in 1.3 or trunk.

> Added -dir command line option to Indexer and SolrIndexer,  allowing to 
> specify directory containing segments
> -
>
> Key: NUTCH-939
> URL: https://issues.apache.org/jira/browse/NUTCH-939
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.3
>Reporter: Claudio Martella
>Assignee: Andrzej Bialecki 
>Priority: Minor
> Fix For: 1.3
>
> Attachments: Indexer.patch, SolrIndexer.patch
>
>
> The patches add a -dir option, so the user can specify the directory in 
> which the segments are to be found. The current approach requires listing 
> the segments explicitly, which is not very easy with HDFS. Also, the -dir 
> option is already implemented in LinkDb and SegmentMerger, for example.
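What -dir buys you can be sketched as follows: expand a parent directory into its per-segment subdirectories instead of naming each segment. The real implementation works against Hadoop's FileSystem API (so it also works on HDFS); this sketch uses java.nio on the local filesystem only, and the segment names are made up.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Sketch of expanding a -dir argument into segment paths. Nutch does the
// equivalent with Hadoop's FileSystem.listStatus; this local-filesystem
// version only illustrates the idea.
public class SegmentDirSketch {
    public static List<Path> listSegments(Path segmentsDir) throws IOException {
        List<Path> segments = new ArrayList<>();
        try (DirectoryStream<Path> ds = Files.newDirectoryStream(segmentsDir)) {
            for (Path p : ds) {
                if (Files.isDirectory(p)) segments.add(p); // each subdir is one segment
            }
        }
        return segments;
    }

    // Create two fake timestamp-named segments in a temp dir and count them.
    public static int demo() {
        try {
            Path dir = Files.createTempDirectory("segments");
            Files.createDirectory(dir.resolve("20101221103000"));
            Files.createDirectory(dir.resolve("20101221110000"));
            return listSegments(dir).size();
        } catch (IOException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // 2
    }
}
```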




[jira] Resolved: (NUTCH-948) Remove Lucene dependencies

2010-12-21 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  resolved NUTCH-948.
-

Resolution: Fixed

Committed in rev. 1051509.

> Remove Lucene dependencies
> --
>
> Key: NUTCH-948
> URL: https://issues.apache.org/jira/browse/NUTCH-948
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.3
>Reporter: Andrzej Bialecki 
>Assignee: Andrzej Bialecki 
> Fix For: 1.3
>
>
> Branch-1.3 still has Lucene libs, but uses Lucene only in one place, namely 
> it uses DateTools in index-basic. DateTools should be replaced with Solr's 
> DateUtil, as we did in trunk, and then we can remove Lucene libs as a 
> dependency.




[jira] Created: (NUTCH-948) Remove Lucene dependencies

2010-12-21 Thread Andrzej Bialecki (JIRA)
Remove Lucene dependencies
--

 Key: NUTCH-948
 URL: https://issues.apache.org/jira/browse/NUTCH-948
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.3
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 1.3


Branch-1.3 still has Lucene libs, but uses Lucene only in one place, namely it 
uses DateTools in index-basic. DateTools should be replaced with Solr's 
DateUtil, as we did in trunk, and then we can remove Lucene libs as a 
dependency.




[jira] Resolved: (NUTCH-939) Added -dir command line option to Indexer and SolrIndexer, allowing to specify directory containing segments

2010-12-21 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  resolved NUTCH-939.
-

Resolution: Fixed
  Assignee: Andrzej Bialecki 

I modified the patch slightly to allow more flexibility (you can mix individual 
segment names and the -dir options) as well as allowing segments placed on 
different filesystems. Committed in rev. 1051505. Thank you!





[jira] Commented: (NUTCH-939) Added -dir command line option to Indexer and SolrIndexer, allowing to specify directory containing segments

2010-11-26 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12936047#action_12936047
 ] 

Andrzej Bialecki  commented on NUTCH-939:
-

Please note that trunk uses a very different method of working with segments 
(called batches there), and -dir is not applicable there.

> Added -dir command line option to Indexer and SolrIndexer,  allowing to 
> specify directory containing segments
> -
>
> Key: NUTCH-939
> URL: https://issues.apache.org/jira/browse/NUTCH-939
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.2
>Reporter: Claudio Martella
>Priority: Minor
> Fix For: 1.2
>
> Attachments: Indexer.patch, SolrIndexer.patch
>
>
> The patches add a -dir option, so the user can specify the directory in 
> which the segments are to be found. The current approach requires listing 
> the segments explicitly, which is not very easy with HDFS. Also, the -dir 
> option is already implemented in LinkDb and SegmentMerger, for example.




[jira] Commented: (NUTCH-938) Impossible to fetch sites with robots.txt

2010-11-25 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12935745#action_12935745
 ] 

Andrzej Bialecki  commented on NUTCH-938:
-

These two properties are documented in nutch-default.xml, but they are mostly 
for internal use by Nutch. Other implementations of Fetcher (the OldFetcher) 
used to delegate the robot and politeness controls to protocol plugins. The 
current implementation of Fetcher performs these tasks itself, although in 1.2 
protocol plugins still retain the code to implement these controls per 
protocol. In 1.3 (unreleased) and trunk this support has been removed from 
protocol plugins, so these lines will have no effect.
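The robots check the Fetcher now performs itself can be sketched as below; this is a deliberately simplified prefix-matching model of Disallow rules for "User-agent: *" (the real parser handles agent groups, Allow rules, and more), and the class name is made up.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of a robots.txt check like the one the Fetcher
// applies: prefix-match Disallow paths. Not the real Nutch RobotRules.
public class RobotsSketch {
    private final List<String> disallowed = new ArrayList<>();

    public static RobotsSketch parse(String robotsTxt) {
        RobotsSketch r = new RobotsSketch();
        for (String line : robotsTxt.split("\n")) {
            line = line.trim();
            if (line.toLowerCase().startsWith("disallow:")) {
                String path = line.substring("disallow:".length()).trim();
                if (!path.isEmpty()) r.disallowed.add(path);
            }
        }
        return r;
    }

    public boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // "Disallow: /" (as in the issue below) blocks every path.
        RobotsSketch rules = parse("User-agent: *\nDisallow: /");
        System.out.println(rules.isAllowed("/index.html")); // false - site opted out
    }
}
```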

> Impossible to fetch sites with robots.txt 
> -
>
> Key: NUTCH-938
> URL: https://issues.apache.org/jira/browse/NUTCH-938
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.2
> Environment: Red Hat, Nutch 1.2, Java 1.6
>Reporter: Enrique Berlanga
> Attachments: NUTCH-938.patch
>
>
> Crawling a site with a robots.txt file like this:  (e.g: 
> http://www.melilla.es)
> ---
> User-agent: *
> Disallow: /
> ---
> No links are followed. 
> It doesn't matter what value is set for the "protocol.plugin.check.blocking" 
> or "protocol.plugin.check.robots" properties, because they are overridden in 
> class org.apache.nutch.fetcher.Fetcher:
> // set non-blocking & no-robots mode for HTTP protocol plugins.
> getConf().setBoolean(Protocol.CHECK_BLOCKING, false);
> getConf().setBoolean(Protocol.CHECK_ROBOTS, false);
> False is the desired value, but in the FetcherThread inner class, robot rules 
> are checked regardless of the configuration:
> 
> RobotRules rules = protocol.getRobotRules(fit.url, fit.datum);
> if (!rules.isAllowed(fit.u)) {
>  ...
> LOG.debug("Denied by robots.txt: " + fit.url);
> ...
> continue;
> }
> ---
> I suppose there is no problem in disabling that part of the code directly 
> for the HTTP protocol. If so, I could submit a patch as soon as possible to 
> get over this.
> Thanks in advance




[jira] Resolved: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON

2010-11-25 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  resolved NUTCH-932.
-

   Resolution: Fixed
Fix Version/s: 2.0

Committed in rev. 1039014.

> Bulk REST API to retrieve crawl results as JSON
> ---
>
> Key: NUTCH-932
> URL: https://issues.apache.org/jira/browse/NUTCH-932
> Project: Nutch
>  Issue Type: New Feature
>  Components: REST_api
>Affects Versions: 2.0
>Reporter: Andrzej Bialecki 
>Assignee: Andrzej Bialecki 
> Fix For: 2.0
>
> Attachments: db.formatted.gz, NUTCH-932-2.patch, NUTCH-932-3.patch, 
> NUTCH-932-4.patch, NUTCH-932.patch, NUTCH-932.patch, NUTCH-932.patch
>
>
> It would be useful to be able to retrieve results of a crawl as JSON. There 
> are a few things that need to be discussed:
> * how to return bulk results using Restlet (WritableRepresentation subclass?)
> * what should be the format of results?
> I think it would make sense to provide a single record retrieval (by primary 
> key), all records, and records within a range. This incidentally matches well 
> the capabilities of the Gora Query class :)
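The three retrieval modes proposed above (single record by key, all records, key range) can be sketched with a plain TreeMap standing in for the Gora-backed store; the reversed-host keys below are made-up examples.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Sketch of the three retrieval modes discussed for the bulk REST API,
// using TreeMap as a stand-in for a Gora Query over a sorted key space.
public class RangeQuerySketch {
    static String demo() {
        TreeMap<String, String> db = new TreeMap<>();
        db.put("com.example/a", "pageA");
        db.put("com.example/b", "pageB");
        db.put("org.apache/n", "pageN");

        String one = db.get("com.example/a");               // single record by key
        int all = db.size();                                // all records
        SortedMap<String, String> range =                   // records within a key range
            db.subMap("com.example/", "com.example/\uffff");
        return one + " " + all + " " + range.size();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // pageA 3 2
    }
}
```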




[jira] Updated: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON

2010-11-25 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-932:


Attachment: NUTCH-932-4.patch

Final version of the patch.





[jira] Commented: (NUTCH-938) Impossible to fetch sites with robots.txt

2010-11-24 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1293#action_1293
 ] 

Andrzej Bialecki  commented on NUTCH-938:
-

Nutch behavior in this case is correct. The goal of Nutch is to implement a 
well-behaved crawler that obeys robot rules and netiquette. Your patch simply 
disables these control mechanisms. If it works for you and you can risk the 
wrath of webmasters, that's fine, you are free to use this patch - but Nutch 
as a project cannot encourage such practice.

Consequently I'm going to mark this issue as Won't Fix.

> Imposible to fetch sites with robots.txt 
> -
>
> Key: NUTCH-938
> URL: https://issues.apache.org/jira/browse/NUTCH-938
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.2
> Environment: red hat, nutch 1.2, java 1.6
>Reporter: Enrique Berlanga
> Attachments: NUTCH-938.patch
>
>
> Crawling a site with a robots.txt file like this:  (e.g: 
> http://www.melilla.es)
> ---
> User-agent: *
> Disallow: /
> ---
> No links are followed. 
> It doesn't matter what value is set for the "protocol.plugin.check.blocking" or 
> "protocol.plugin.check.robots" properties, because they are overridden in 
> class org.apache.nutch.fetcher.Fetcher:
> // set non-blocking & no-robots mode for HTTP protocol plugins.
> getConf().setBoolean(Protocol.CHECK_BLOCKING, false);
> getConf().setBoolean(Protocol.CHECK_ROBOTS, false);
> False is the desired value, but in the FetcherThread inner class, robot rules are 
> checked regardless, ignoring the configuration:
> 
> RobotRules rules = protocol.getRobotRules(fit.url, fit.datum);
> if (!rules.isAllowed(fit.u)) {
>  ...
> LOG.debug("Denied by robots.txt: " + fit.url);
> ...
> continue;
> }
> ---
> I suppose there is no problem in disabling that part of the code directly for 
> the HTTP protocol. If so, I could submit a patch as soon as possible to get over 
> this.
> Thanks in advance




[jira] Updated: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON

2010-11-12 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-932:


Attachment: NUTCH-932-3.patch

NutchTool is an abstract class in this patch. This actually minimizes the 
amount of code throughout, though paradoxically the patch file is larger than 
before...

> Bulk REST API to retrieve crawl results as JSON
> ---
>
> Key: NUTCH-932
> URL: https://issues.apache.org/jira/browse/NUTCH-932
> Project: Nutch
>  Issue Type: New Feature
>  Components: REST_api
>Affects Versions: 2.0
>Reporter: Andrzej Bialecki 
>Assignee: Andrzej Bialecki 
> Attachments: db.formatted.gz, NUTCH-932-2.patch, NUTCH-932-3.patch, 
> NUTCH-932.patch, NUTCH-932.patch, NUTCH-932.patch
>
>
> It would be useful to be able to retrieve results of a crawl as JSON. There 
> are a few things that need to be discussed:
> * how to return bulk results using Restlet (WritableRepresentation subclass?)
> * what should be the format of results?
> I think it would make sense to provide a single record retrieval (by primary 
> key), all records, and records within a range. This incidentally matches well 
> the capabilities of the Gora Query class :)




[jira] Updated: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON

2010-11-12 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-932:


Attachment: NUTCH-932-2.patch

This patch simplifies the NutchTool API and reduces changes to implementations 
of NutchTool. I'd like to commit this patch soon.

> Bulk REST API to retrieve crawl results as JSON
> ---
>
> Key: NUTCH-932
> URL: https://issues.apache.org/jira/browse/NUTCH-932
> Project: Nutch
>  Issue Type: New Feature
>  Components: REST_api
>Affects Versions: 2.0
>Reporter: Andrzej Bialecki 
>Assignee: Andrzej Bialecki 
> Attachments: db.formatted.gz, NUTCH-932-2.patch, NUTCH-932.patch, 
> NUTCH-932.patch, NUTCH-932.patch
>
>
> It would be useful to be able to retrieve results of a crawl as JSON. There 
> are a few things that need to be discussed:
> * how to return bulk results using Restlet (WritableRepresentation subclass?)
> * what should be the format of results?
> I think it would make sense to provide a single record retrieval (by primary 
> key), all records, and records within a range. This incidentally matches well 
> the capabilities of the Gora Query class :)




[jira] Updated: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON

2010-11-05 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-932:


Attachment: NUTCH-932.patch

Updated patch. This changes the NutchTool API to allow for execution steps that 
are not mapreduce jobs, and to pass arguments in arbitrary order, which was a 
side-effect of the Restlet API.

As a proof of concept I reimplemented the Crawler class (a one-shot crawler). 
If there are no objections I'll commit this shortly.

> Bulk REST API to retrieve crawl results as JSON
> ---
>
> Key: NUTCH-932
> URL: https://issues.apache.org/jira/browse/NUTCH-932
> Project: Nutch
>  Issue Type: New Feature
>  Components: REST_api
>Affects Versions: 2.0
>Reporter: Andrzej Bialecki 
>Assignee: Andrzej Bialecki 
> Attachments: db.formatted.gz, NUTCH-932.patch, NUTCH-932.patch, 
> NUTCH-932.patch
>
>
> It would be useful to be able to retrieve results of a crawl as JSON. There 
> are a few things that need to be discussed:
> * how to return bulk results using Restlet (WritableRepresentation subclass?)
> * what should be the format of results?
> I think it would make sense to provide a single record retrieval (by primary 
> key), all records, and records within a range. This incidentally matches well 
> the capabilities of the Gora Query class :)




[jira] Commented: (NUTCH-880) REST API for Nutch

2010-11-05 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928909#action_12928909
 ] 

Andrzej Bialecki  commented on NUTCH-880:
-

Thanks - this issue is already fixed in NUTCH-932, to be committed soon.

> REST API for Nutch
> --
>
> Key: NUTCH-880
> URL: https://issues.apache.org/jira/browse/NUTCH-880
> Project: Nutch
>  Issue Type: New Feature
>Affects Versions: 2.0
>Reporter: Andrzej Bialecki 
>Assignee: Andrzej Bialecki 
> Fix For: 2.0
>
> Attachments: API-2.patch, API.patch
>
>
> This issue is for discussing a REST-style API for accessing Nutch.
> Here's an initial idea:
> * I propose to use org.restlet for handling requests and returning 
> JSON/XML/whatever responses.
> * hook up all regular tools so that they can be driven via this API. This 
> would have to be an async API, since all Nutch operations take a long time to 
> execute. It follows then that we need to be able also to list running 
> operations, retrieve their current status, and possibly 
> abort/cancel/stop/suspend/resume/...? This also means that we would have to 
> potentially create & manage many threads in a servlet - AFAIK this is frowned 
> upon by J2EE purists...
> * package this in a webapp (that includes all deps, essentially nutch.job 
> content), with the restlet servlet as an entry point.
> Open issues:
> * how to implement the reading of crawl results via this API
> * should we manage only crawls that use a single configuration per webapp, or 
> should we have a notion of crawl contexts (sets of crawl configs) with CRUD 
> ops on them? this would be nice, because it would allow managing of several 
> different crawls, with different configs, in a single webapp - but it 
> complicates the implementation a lot.




[jira] Updated: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON

2010-11-04 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-932:


Attachment: NUTCH-932.patch

Updated patch - this now recognizes URL parameters such as fields, start/end 
keys, batch and crawl id.

> Bulk REST API to retrieve crawl results as JSON
> ---
>
> Key: NUTCH-932
> URL: https://issues.apache.org/jira/browse/NUTCH-932
> Project: Nutch
>  Issue Type: New Feature
>  Components: REST_api
>Affects Versions: 2.0
>Reporter: Andrzej Bialecki 
>Assignee: Andrzej Bialecki 
> Attachments: db.formatted.gz, NUTCH-932.patch, NUTCH-932.patch
>
>
> It would be useful to be able to retrieve results of a crawl as JSON. There 
> are a few things that need to be discussed:
> * how to return bulk results using Restlet (WritableRepresentation subclass?)
> * what should be the format of results?
> I think it would make sense to provide a single record retrieval (by primary 
> key), all records, and records within a range. This incidentally matches well 
> the capabilities of the Gora Query class :)




[jira] Commented: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON

2010-11-04 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12928355#action_12928355
 ] 

Andrzej Bialecki  commented on NUTCH-932:
-

Examples (with the db equivalent to the one in db.formatted.gz):

{code}
$ curl -s 
'http://localhost:8192/nutch/db?fields=url&end=http://www.freebsd.org/&start=http://www.egothor.org/'|
 ./json_pp
[
  {
"url": "http://www.egothor.org/";
  }, 
  {
"url": "http://www.freebsd.org/";
  }
]
{code}

{code}
$ curl -s 
'http://localhost:8192/nutch/db?fields=url,outlinks,markers,protocolStatus,parseStatus,contentType&start=http://www.getopt.org/&end=http://www.getopt.org/'|
 ./json_pp
[
  {
"contentType": "text/html", 
"url": "http://www.getopt.org/";, 
"markers": {
  "_updmrk_": "1288890451-1134865895"
}, 
"parseStatus": "success/ok (1/0), args=[]", 
"protocolStatus": "SUCCESS, args=[]", 
"outlinks": {
  "http://www.getopt.org/luke/": "Luke", 
  "http://www.getopt.org/ecimf/contrib/ONTO/REA": "REA Ontology page", 
  "http://www.getopt.org/CV.pdf": "CV here", 
  "http://www.getopt.org/utils/build/api": "API", 
  
"http://svn.apache.org/viewvc/hadoop/hbase/trunk/src/java/org/apache/hadoop/hbase/util/JenkinsHash.java":
 "available here", 
  "http://www.getopt.org/murmur/MurmurHash.java": "MurmurHash.java", 
  "http://www.ebxml.org/": "ebXML / ebTWG", 
  "http://www.freebsd.org/": "FreeBSD", 
  "http://www.getopt.org/luke/webstart.html": "Launch with Java WebStart", 
  "http://www.freebsd.org/%7Epicobsd": "PicoBSD", 
  "http://home.comcast.net/~bretm/hash/6.html": "this discussion", 
  "http://protege.stanford.edu/": "Protege", 
  "http://jakarta.apache.org/lucene": "Lucene", 
  "http://www.getopt.org/ecimf/contrib/ONTO/ebxml": "ebXML Ontology", 
  "http://www.getopt.org/ecimf/": "here", 
  "http://www.isthe.com/chongo/tech/comp/fnv/": "his website", 
  "http://www.getopt.org/stempel/index.html": "Stempel", 
  "http://www.sigram.com/": "SIGRAM", 
  "http://www.egothor.org/": "Egothor", 
  "http://thinlet.sourceforge.net/": "Thinlet", 
  "http://www.getopt.org/utils/dist/utils-1.0.jar": "binary", 
  "http://www.ecimf.org/": "ECIMF"
}
  }
]
{code}


> Bulk REST API to retrieve crawl results as JSON
> ---
>
> Key: NUTCH-932
> URL: https://issues.apache.org/jira/browse/NUTCH-932
> Project: Nutch
>  Issue Type: New Feature
>  Components: REST_api
>Affects Versions: 2.0
>Reporter: Andrzej Bialecki 
>Assignee: Andrzej Bialecki 
> Attachments: db.formatted.gz, NUTCH-932.patch, NUTCH-932.patch
>
>
> It would be useful to be able to retrieve results of a crawl as JSON. There 
> are a few things that need to be discussed:
> * how to return bulk results using Restlet (WritableRepresentation subclass?)
> * what should be the format of results?
> I think it would make sense to provide a single record retrieval (by primary 
> key), all records, and records within a range. This incidentally matches well 
> the capabilities of the Gora Query class :)




[jira] Updated: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON

2010-11-04 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-932:


Attachment: db.formatted.gz

Example DB content (this was passed through a JSON pretty-printer, otherwise 
it's just one giant line...).

> Bulk REST API to retrieve crawl results as JSON
> ---
>
> Key: NUTCH-932
> URL: https://issues.apache.org/jira/browse/NUTCH-932
> Project: Nutch
>  Issue Type: New Feature
>  Components: REST_api
>Affects Versions: 2.0
>Reporter: Andrzej Bialecki 
>Assignee: Andrzej Bialecki 
> Attachments: db.formatted.gz, NUTCH-932.patch
>
>
> It would be useful to be able to retrieve results of a crawl as JSON. There 
> are a few things that need to be discussed:
> * how to return bulk results using Restlet (WritableRepresentation subclass?)
> * what should be the format of results?
> I think it would make sense to provide a single record retrieval (by primary 
> key), all records, and records within a range. This incidentally matches well 
> the capabilities of the Gora Query class :)




[jira] Updated: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON

2010-11-04 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-932:


Attachment: NUTCH-932.patch

This patch adds bulk retrieval of crawl results. This is still very rough, e.g. 
there's no way to select crawlId or limit the fields... but it returns proper 
JSON.

This patch also includes other enhancements and bugfixes - with this patch I 
was able to perform a complete crawl cycle via REST.

> Bulk REST API to retrieve crawl results as JSON
> ---
>
> Key: NUTCH-932
> URL: https://issues.apache.org/jira/browse/NUTCH-932
> Project: Nutch
>  Issue Type: New Feature
>  Components: REST_api
>Affects Versions: 2.0
>Reporter: Andrzej Bialecki 
>Assignee: Andrzej Bialecki 
> Attachments: NUTCH-932.patch
>
>
> It would be useful to be able to retrieve results of a crawl as JSON. There 
> are a few things that need to be discussed:
> * how to return bulk results using Restlet (WritableRepresentation subclass?)
> * what should be the format of results?
> I think it would make sense to provide a single record retrieval (by primary 
> key), all records, and records within a range. This incidentally matches well 
> the capabilities of the Gora Query class :)




[jira] Created: (NUTCH-932) Bulk REST API to retrieve crawl results as JSON

2010-10-29 Thread Andrzej Bialecki (JIRA)
Bulk REST API to retrieve crawl results as JSON
---

 Key: NUTCH-932
 URL: https://issues.apache.org/jira/browse/NUTCH-932
 Project: Nutch
  Issue Type: New Feature
  Components: REST_api
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 


It would be useful to be able to retrieve results of a crawl as JSON. There are 
a few things that need to be discussed:

* how to return bulk results using Restlet (WritableRepresentation subclass?)

* what should be the format of results?

I think it would make sense to provide a single record retrieval (by primary 
key), all records, and records within a range. This incidentally matches well 
the capabilities of the Gora Query class :)




[jira] Resolved: (NUTCH-931) Simple admin API to fetch status and stop the service

2010-10-29 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  resolved NUTCH-931.
-

Resolution: Fixed

Committed in rev. 1028736 with some changes.

> Simple admin API to fetch status and stop the service
> -
>
> Key: NUTCH-931
> URL: https://issues.apache.org/jira/browse/NUTCH-931
> Project: Nutch
>  Issue Type: Improvement
>  Components: REST_api
>Affects Versions: 2.0
>Reporter: Andrzej Bialecki 
>Assignee: Andrzej Bialecki 
> Fix For: 2.0
>
> Attachments: NUTCH-931.patch
>
>
> REST API needs a simple info / stats service and the ability to shutdown the 
> server.




[jira] Updated: (NUTCH-931) Simple admin API to fetch status and stop the service

2010-10-28 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-931:


Attachment: NUTCH-931.patch

AdminResource, mostly skeleton for now that implements only the "stop" command.

> Simple admin API to fetch status and stop the service
> -
>
> Key: NUTCH-931
> URL: https://issues.apache.org/jira/browse/NUTCH-931
> Project: Nutch
>  Issue Type: Improvement
>  Components: REST_api
>Affects Versions: 2.0
>Reporter: Andrzej Bialecki 
>Assignee: Andrzej Bialecki 
> Fix For: 2.0
>
> Attachments: NUTCH-931.patch
>
>
> REST API needs a simple info / stats service and the ability to shutdown the 
> server.




[jira] Created: (NUTCH-931) Simple admin API to fetch status and stop the service

2010-10-28 Thread Andrzej Bialecki (JIRA)
Simple admin API to fetch status and stop the service
-

 Key: NUTCH-931
 URL: https://issues.apache.org/jira/browse/NUTCH-931
 Project: Nutch
  Issue Type: Improvement
  Components: REST_api
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 2.0


REST API needs a simple info / stats service and the ability to shutdown the 
server.




[jira] Resolved: (NUTCH-930) Remove remaining dependencies on Lucene API

2010-10-28 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  resolved NUTCH-930.
-

Resolution: Fixed
Fix Version/s: 2.0

Committed in rev. 1028474.

> Remove remaining dependencies on Lucene API
> ---
>
> Key: NUTCH-930
> URL: https://issues.apache.org/jira/browse/NUTCH-930
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 2.0
>Reporter: Andrzej Bialecki 
>Assignee: Andrzej Bialecki 
> Fix For: 2.0
>
> Attachments: NUTCH-930.patch
>
>
> Nutch doesn't use Lucene API anymore, all indexing happens via 
> Lucene-agnostic SolrJ API. The only place where we still use a minor part of 
> Lucene is in index-basic, and that use (DateTools) can be easily replaced.




[jira] Updated: (NUTCH-930) Remove remaining dependencies on Lucene API

2010-10-28 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-930:


Attachment: NUTCH-930.patch

Patch to fix the issue. I'll commit this shortly.

> Remove remaining dependencies on Lucene API
> ---
>
> Key: NUTCH-930
> URL: https://issues.apache.org/jira/browse/NUTCH-930
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 2.0
>Reporter: Andrzej Bialecki 
>Assignee: Andrzej Bialecki 
> Attachments: NUTCH-930.patch
>
>
> Nutch doesn't use Lucene API anymore, all indexing happens via 
> Lucene-agnostic SolrJ API. The only place where we still use a minor part of 
> Lucene is in index-basic, and that use (DateTools) can be easily replaced.




[jira] Created: (NUTCH-930) Remove remaining dependencies on Lucene API

2010-10-28 Thread Andrzej Bialecki (JIRA)
Remove remaining dependencies on Lucene API
---

 Key: NUTCH-930
 URL: https://issues.apache.org/jira/browse/NUTCH-930
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 


Nutch doesn't use Lucene API anymore, all indexing happens via Lucene-agnostic 
SolrJ API. The only place where we still use a minor part of Lucene is in 
index-basic, and that use (DateTools) can be easily replaced.




[jira] Resolved: (NUTCH-880) REST API for Nutch

2010-10-28 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  resolved NUTCH-880.
-

Resolution: Fixed
Fix Version/s: 2.0

Committed in rev. 1028235. The webapp part of this issue is tracked now in 
NUTCH-929.

> REST API for Nutch
> --
>
> Key: NUTCH-880
> URL: https://issues.apache.org/jira/browse/NUTCH-880
> Project: Nutch
>  Issue Type: New Feature
>Affects Versions: 2.0
>Reporter: Andrzej Bialecki 
>Assignee: Andrzej Bialecki 
> Fix For: 2.0
>
> Attachments: API-2.patch, API.patch
>
>
> This issue is for discussing a REST-style API for accessing Nutch.
> Here's an initial idea:
> * I propose to use org.restlet for handling requests and returning 
> JSON/XML/whatever responses.
> * hook up all regular tools so that they can be driven via this API. This 
> would have to be an async API, since all Nutch operations take a long time to 
> execute. It follows then that we need to be able also to list running 
> operations, retrieve their current status, and possibly 
> abort/cancel/stop/suspend/resume/...? This also means that we would have to 
> potentially create & manage many threads in a servlet - AFAIK this is frowned 
> upon by J2EE purists...
> * package this in a webapp (that includes all deps, essentially nutch.job 
> content), with the restlet servlet as an entry point.
> Open issues:
> * how to implement the reading of crawl results via this API
> * should we manage only crawls that use a single configuration per webapp, or 
> should we have a notion of crawl contexts (sets of crawl configs) with CRUD 
> ops on them? this would be nice, because it would allow managing of several 
> different crawls, with different configs, in a single webapp - but it 
> complicates the implementation a lot.




[jira] Updated: (NUTCH-880) REST API for Nutch

2010-10-28 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-880:


Summary: REST API for Nutch  (was: REST API (and webapp) for Nutch)

The webapp part is tracked now in NUTCH-929.

> REST API for Nutch
> --
>
> Key: NUTCH-880
> URL: https://issues.apache.org/jira/browse/NUTCH-880
> Project: Nutch
>  Issue Type: New Feature
>Affects Versions: 2.0
>Reporter: Andrzej Bialecki 
>Assignee: Andrzej Bialecki 
> Attachments: API-2.patch, API.patch
>
>
> This issue is for discussing a REST-style API for accessing Nutch.
> Here's an initial idea:
> * I propose to use org.restlet for handling requests and returning 
> JSON/XML/whatever responses.
> * hook up all regular tools so that they can be driven via this API. This 
> would have to be an async API, since all Nutch operations take a long time to 
> execute. It follows then that we need to be able also to list running 
> operations, retrieve their current status, and possibly 
> abort/cancel/stop/suspend/resume/...? This also means that we would have to 
> potentially create & manage many threads in a servlet - AFAIK this is frowned 
> upon by J2EE purists...
> * package this in a webapp (that includes all deps, essentially nutch.job 
> content), with the restlet servlet as an entry point.
> Open issues:
> * how to implement the reading of crawl results via this API
> * should we manage only crawls that use a single configuration per webapp, or 
> should we have a notion of crawl contexts (sets of crawl configs) with CRUD 
> ops on them? this would be nice, because it would allow managing of several 
> different crawls, with different configs, in a single webapp - but it 
> complicates the implementation a lot.




[jira] Created: (NUTCH-929) Create a REST-based admin UI for Nutch

2010-10-28 Thread Andrzej Bialecki (JIRA)
Create a REST-based admin UI for Nutch
--

 Key: NUTCH-929
 URL: https://issues.apache.org/jira/browse/NUTCH-929
 Project: Nutch
  Issue Type: New Feature
  Components: administration gui
Affects Versions: 2.0
Reporter: Andrzej Bialecki 


This is a follow up to NUTCH-880 - we need to expose the functionality of REST 
API in a user-friendly admin UI. Thanks to the nature of the API the UI can be 
implemented in any UI framework that speaks REST/JSON, so it could be a simple 
webapp (we already have jetty) or a Swing / Pivot / etc standalone application.




[jira] Commented: (NUTCH-926) Nutch follows wrong url in

2010-10-27 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12925543#action_12925543
 ] 

Andrzej Bialecki  commented on NUTCH-926:
-

bq. Nutch continues to crawl the WRONG subdomains! But it should not do this!!
No need to shout, we hear you :)

Indeed, Nutch behavior when following redirects doesn't play well with the rule 
of ignoring external outlinks. Strictly speaking, redirects are not outlinks, 
but the silent assumption behind ignoreExternalOutlinks is that we crawl 
content only from that hostname.

And your patch would solve this particular issue. However, this is not as 
simple as it seems... My favorite example is www.ibm.com -> 
www8.ibm.com/index.html . If we apply your fix you won't be able to crawl 
www.ibm.com unless you inject all wwwNNN load-balanced hosts... so a simple 
equality of hostnames may not be sufficient. We have utilities to extract 
domain names, so we could compare domains but then we may mistreat 
money.cnn.com vs. weather.cnn.com ...
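
The host-vs-domain trade-off can be sketched concretely. This is a hypothetical helper, not Nutch's actual code, and the naive "registered domain" extraction (last two hostname labels) is an assumption that breaks on suffixes like co.uk; real code would use a public-suffix list:

```java
import java.net.URI;

public class RedirectPolicy {
    // Naive "registered domain": the last two labels of the hostname.
    // Illustrative only - a public-suffix list is needed in practice.
    static String registeredDomain(String host) {
        String[] labels = host.split("\\.");
        int n = labels.length;
        return n <= 2 ? host : labels[n - 2] + "." + labels[n - 1];
    }

    static boolean sameHost(String a, String b) {
        return URI.create(a).getHost().equalsIgnoreCase(URI.create(b).getHost());
    }

    static boolean sameDomain(String a, String b) {
        return registeredDomain(URI.create(a).getHost())
                .equalsIgnoreCase(registeredDomain(URI.create(b).getHost()));
    }

    public static void main(String[] args) {
        // Host equality is too strict for load-balanced mirrors:
        System.out.println(sameHost("http://www.ibm.com/", "http://www8.ibm.com/index.html"));
        // Domain equality accepts the mirror...
        System.out.println(sameDomain("http://www.ibm.com/", "http://www8.ibm.com/index.html"));
        // ...but also lumps together unrelated subdomains:
        System.out.println(sameDomain("http://money.cnn.com/", "http://weather.cnn.com/"));
    }
}
```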

> Nutch follows wrong url in  -
>
> Key: NUTCH-926
> URL: https://issues.apache.org/jira/browse/NUTCH-926
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 1.2
> Environment: gnu/linux centOs
>Reporter: Marco Novo
>Priority: Critical
> Fix For: 1.3
>
> Attachments: ParseOutputFormat.java.patch
>
>
> We have Nutch set to crawl a domain URL list and we want to fetch only the 
> passed domains (hosts), not subdomains.
> So
> WWW.DOMAIN1.COM
> ..
> ..
> ..
> WWW.RIGHTDOMAIN.COM
> ..
> ..
> ..
> ..
> WWW.DOMAIN.COM
> We set Nutch to:
> NOT FOLLOW EXTERNAL LINKS
> During crawling of WWW.RIGHTDOMAIN.COM
> if a page contains
> 
> 
> 
> 
> http://WRONG.RIGHTDOMAIN.COM";>
> 
> 
> 
> 
> Nutch continues to crawl the WRONG subdomains! But it should not do this!!
> During crawling of WWW.RIGHTDOMAIN.COM
> if a page contains
> 
> 
> 
> 
> http://WWW.WRONGDOMAIN.COM";>
> 
> 
> 
> 
> Nutch continues to crawl the WRONG domain! But it should not do this! If it 
> does, we will spider the whole web.
> We think the problem is in org.apache.nutch.parse.ParseOutputFormat. We have 
> made a patch, so we will attach it.




[jira] Updated: (NUTCH-880) REST API (and webapp) for Nutch

2010-10-26 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-880:


Attachment: API-2.patch

An improved version, which actually works :) The configuration and job 
management is implemented, there is also a unit test that exercises this API.

If there are no objections I'd like to commit this first version of the API, 
and continue improving it in other issues.

> REST API (and webapp) for Nutch
> ---
>
> Key: NUTCH-880
> URL: https://issues.apache.org/jira/browse/NUTCH-880
> Project: Nutch
>  Issue Type: New Feature
>Affects Versions: 2.0
>Reporter: Andrzej Bialecki 
>Assignee: Andrzej Bialecki 
> Attachments: API-2.patch, API.patch
>
>
> This issue is for discussing a REST-style API for accessing Nutch.
> Here's an initial idea:
> * I propose to use org.restlet for handling requests and returning 
> JSON/XML/whatever responses.
> * hook up all regular tools so that they can be driven via this API. This 
> would have to be an async API, since all Nutch operations take a long time to 
> execute. It follows then that we need to be able also to list running 
> operations, retrieve their current status, and possibly 
> abort/cancel/stop/suspend/resume/...? This also means that we would have to 
> potentially create & manage many threads in a servlet - AFAIK this is frowned 
> upon by J2EE purists...
> * package this in a webapp (that includes all deps, essentially nutch.job 
> content), with the restlet servlet as an entry point.
> Open issues:
> * how to implement the reading of crawl results via this API
> * should we manage only crawls that use a single configuration per webapp, or 
> should we have a notion of crawl contexts (sets of crawl configs) with CRUD 
> ops on them? this would be nice, because it would allow managing of several 
> different crawls, with different configs, in a single webapp - but it 
> complicates the implementation a lot.




[jira] Commented: (NUTCH-913) Nutch should use new namespace for Gora

2010-10-25 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12924659#action_12924659
 ] 

Andrzej Bialecki  commented on NUTCH-913:
-

+1, let's commit it -  I want to start playing with GORA-9, and that patch is 
in the org.apache namespace...

> Nutch should use new namespace for Gora
> ---
>
> Key: NUTCH-913
> URL: https://issues.apache.org/jira/browse/NUTCH-913
> Project: Nutch
>  Issue Type: Bug
>  Components: storage
>Reporter: Doğacan Güney
>Assignee: Doğacan Güney
> Fix For: 2.0
>
> Attachments: NUTCH-913_v1.patch, NUTCH-913_v2.patch
>
>
> Gora is in Apache Incubator now (Yey!). We recently changed Gora's namespace 
> from org.gora to org.apache.gora. This means nutch should use the new 
> namespace otherwise it won't compile with newer builds of Gora.




[jira] Commented: (NUTCH-923) Multilingual support for Solr-index-mapping

2010-10-23 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12924154#action_12924154
 ] 

Andrzej Bialecki  commented on NUTCH-923:
-

This doesn't solve the problem of a potentially unbounded number of fields. 
Compliance is one thing - you can clean invalid characters out of field names - 
but sanity is another: if you have {{title_*}} in your Solr schema, then 
theoretically you are allowed to create an unlimited number of fields with this 
prefix, and Solr won't complain.
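To make the concern concrete, here is a minimal sketch of a guard that a mapping layer could apply before creating dynamic fields. The class name, the per-prefix cap, and the suffix pattern are all hypothetical, not part of Nutch or Solr:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical sketch: cap how many distinct dynamic fields (e.g. "title_*")
// a mapping layer will ever create, so a runaway source field cannot flood
// the index with new field names. Names and limits are illustrative.
public class DynamicFieldGuard {
    private static final Pattern VALID_SUFFIX = Pattern.compile("[a-z]{2,3}");
    private final int maxPerPrefix;
    private final Map<String, Set<String>> seen = new HashMap<>();

    public DynamicFieldGuard(int maxPerPrefix) {
        this.maxPerPrefix = maxPerPrefix;
    }

    /** Returns the mapped field name, or null if the suffix is rejected. */
    public String map(String prefix, String suffix) {
        if (suffix == null || !VALID_SUFFIX.matcher(suffix).matches()) {
            return null; // fails the basic sanity check
        }
        Set<String> suffixes = seen.computeIfAbsent(prefix, k -> new HashSet<>());
        if (!suffixes.contains(suffix) && suffixes.size() >= maxPerPrefix) {
            return null; // would exceed the cap for this prefix
        }
        suffixes.add(suffix);
        return prefix + "_" + suffix;
    }
}
```

With a cap of 2, `map("title", "en")` and `map("title", "fr")` succeed, while a third distinct suffix (or a garbage value) is rejected instead of becoming a new Solr field.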

> Multilingual support for Solr-index-mapping
> ---
>
> Key: NUTCH-923
> URL: https://issues.apache.org/jira/browse/NUTCH-923
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.2
>Reporter: Matthias Agethle
>Assignee: Markus Jelsma
>Priority: Minor
>
> It would be useful to extend the mapping possibilities when indexing to Solr.
> One useful feature would be to use the detected language of the html page 
> (for example via the language-identifier plugin) and send the content to 
> corresponding language-aware solr-fields.
> The mapping file could be as follows:
> 
> 
> so that the title field gets mapped to title_en for English pages and 
> title_fr for French pages.
> What do you think? Could this be useful also to others?
> Or are there already other solutions out there?




[jira] Commented: (NUTCH-923) Multilingual support for Solr-index-mapping

2010-10-22 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923947#action_12923947
 ] 

Andrzej Bialecki  commented on NUTCH-923:
-

My point was simply that if you want to build your data schema dynamically, 
based on the actual input data, then you need to be aware that this process is 
inherently risky - now we could perhaps deal with "lang" and 
LanguageIdentifier, but tomorrow we may be dealing with dc.author or cc.license 
or something else, and then we will face the same issue, i.e. a potentially 
unlimited number of fields created based on data.

I don't have a good answer to this problem. On one hand this functionality is 
useful; on the other hand it's inherently risky in the presence of less-than-ideal 
data, which is always a possibility... Perhaps introducing some sort of 
validation mechanism would make this safer to use.
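One shape such a validation mechanism could take is a fixed whitelist: only pre-approved values may ever become part of a field name, and everything else collapses into a single fallback bucket, so the field count is bounded up front. This is a sketch under assumed names, not an existing Nutch API:

```java
import java.util.Set;

// Hedged sketch of the "validation mechanism" idea: an arbitrary source value
// (e.g. a detected language) only contributes to a field name if it is on a
// fixed whitelist; anything else maps to one "unknown" bucket. The class and
// method names are hypothetical.
public class FieldValueValidator {
    private final Set<String> allowed;

    public FieldValueValidator(Set<String> allowed) {
        this.allowed = allowed;
    }

    /** Maps an arbitrary source value to a safe suffix, or a fallback bucket. */
    public String validate(String value) {
        if (value != null) {
            String v = value.trim().toLowerCase();
            if (allowed.contains(v)) {
                return v;
            }
        }
        return "unknown"; // everything unexpected collapses into one bucket
    }
}
```

The trade-off is that unexpected but legitimate values are lumped together, which is exactly the price paid for a bounded schema.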

> Multilingual support for Solr-index-mapping
> ---
>
> Key: NUTCH-923
> URL: https://issues.apache.org/jira/browse/NUTCH-923
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.2
>Reporter: Matthias Agethle
>Assignee: Markus Jelsma
>Priority: Minor
>
> It would be useful to extend the mapping possibilities when indexing to Solr.
> One useful feature would be to use the detected language of the html page 
> (for example via the language-identifier plugin) and send the content to 
> corresponding language-aware solr-fields.
> The mapping file could be as follows:
> 
> 
> so that the title field gets mapped to title_en for English pages and 
> title_fr for French pages.
> What do you think? Could this be useful also to others?
> Or are there already other solutions out there?




[jira] Commented: (NUTCH-923) Multilingual support for Solr-index-mapping

2010-10-22 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923896#action_12923896
 ] 

Andrzej Bialecki  commented on NUTCH-923:
-

This sounds useful, though the implementation needs to keep the following in 
mind:
* you _assume_ that the lang field will have a nice, predictable value, but 
unless you sanitize the values you can't assume anything... for example, one 
page I saw had its language metadata set to a random 8 kB string full of 
control chars and '\0'-s.

* again, if you don't sanitize and control the total number of unique values in 
the source field, you could end up with a number of fields approaching 
infinity, and Solr would melt down...
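A minimal sanitizer for raw language metadata of the kind described above might clamp the length, stop at the first non-letter, and accept only short ISO-639-like codes. This is an illustrative sketch, not the language-identifier plugin's actual logic:

```java
// Illustrative sanitizer for raw "lang" metadata: clamp the length first
// (so an 8 kB junk value costs almost nothing to scan), keep only leading
// letters, and accept only 2-3 character codes. A sketch, not Nutch code.
public class LangSanitizer {
    public static String sanitize(String raw) {
        if (raw == null) {
            return null;
        }
        // Clamp before scanning so pathological values stay cheap.
        String s = raw.substring(0, Math.min(raw.length(), 32));
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c >= 'a' && c <= 'z') {
                sb.append(c);
            } else if (c >= 'A' && c <= 'Z') {
                sb.append(Character.toLowerCase(c));
            } else {
                break; // stop at the first separator or control char
            }
        }
        String code = sb.toString();
        return (code.length() >= 2 && code.length() <= 3) ? code : null;
    }
}
```

Under these rules "en-US" yields "en", "FR" yields "fr", and a control-character-laden blob yields null rather than a new field name.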

> Multilingual support for Solr-index-mapping
> ---
>
> Key: NUTCH-923
> URL: https://issues.apache.org/jira/browse/NUTCH-923
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.2
>Reporter: Matthias Agethle
>Assignee: Markus Jelsma
>Priority: Minor
>
> It would be useful to extend the mapping possibilities when indexing to Solr.
> One useful feature would be to use the detected language of the html page 
> (for example via the language-identifier plugin) and send the content to 
> corresponding language-aware solr-fields.
> The mapping file could be as follows:
> 
> 
> so that the title field gets mapped to title_en for English pages and 
> title_fr for French pages.
> What do you think? Could this be useful also to others?
> Or are there already other solutions out there?




[jira] Commented: (NUTCH-924) Static field in solr mapping

2010-10-22 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923845#action_12923845
 ] 

Andrzej Bialecki  commented on NUTCH-924:
-

The functionality is useful, +1. But the patch has formatting errors. Please 
fix them before committing.

The same functionality should be added to trunk, too.

> Static field in solr mapping
> 
>
> Key: NUTCH-924
> URL: https://issues.apache.org/jira/browse/NUTCH-924
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.3
>Reporter: David Stuart
>Assignee: Markus Jelsma
> Fix For: 1.3
>
> Attachments: nutch_1.3_static_field.patch
>
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> Provide the facility to pass static data defined in solrindex-mapping.xml to 
> solr during the mapping process.




[jira] Resolved: (NUTCH-907) DataStore API doesn't support multiple storage areas for multiple disjoint crawls

2010-10-21 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  resolved NUTCH-907.
-

Resolution: Fixed

Committed in rev. 1025963. Thank you Sertan for a high-quality patch and unit 
tests!

> DataStore API doesn't support multiple storage areas for multiple disjoint 
> crawls
> -
>
> Key: NUTCH-907
> URL: https://issues.apache.org/jira/browse/NUTCH-907
> Project: Nutch
>  Issue Type: Bug
>Reporter: Andrzej Bialecki 
>Assignee: Andrzej Bialecki 
> Fix For: 2.0
>
> Attachments: NUTCH-907.patch, NUTCH-907.v2.patch
>
>
> In Nutch 1.x it was possible to easily select a set of crawl data (crawldb, 
> page data, linkdb, etc) by specifying a path where the data was stored. This 
> enabled users to run several disjoint crawls with different configs, but 
> still using the same storage medium, just under different paths.
> This is not possible now because there is a 1:1 mapping between a specific 
> DataStore instance and a set of crawl data.
> In order to support this functionality the Gora API should be extended so 
> that it can create stores (and data tables in the underlying storage) that 
> use arbitrary prefixes to identify the particular crawl dataset. Then the 
> Nutch API should be extended to allow passing this "crawlId" value to select 
> one of possibly many existing crawl datasets.




[jira] Assigned: (NUTCH-907) DataStore API doesn't support multiple storage areas for multiple disjoint crawls

2010-10-21 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-907?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  reassigned NUTCH-907:
---

Assignee: Andrzej Bialecki 

> DataStore API doesn't support multiple storage areas for multiple disjoint 
> crawls
> -
>
> Key: NUTCH-907
> URL: https://issues.apache.org/jira/browse/NUTCH-907
> Project: Nutch
>  Issue Type: Bug
>Reporter: Andrzej Bialecki 
>Assignee: Andrzej Bialecki 
> Fix For: 2.0
>
> Attachments: NUTCH-907.patch, NUTCH-907.v2.patch
>
>
> In Nutch 1.x it was possible to easily select a set of crawl data (crawldb, 
> page data, linkdb, etc) by specifying a path where the data was stored. This 
> enabled users to run several disjoint crawls with different configs, but 
> still using the same storage medium, just under different paths.
> This is not possible now because there is a 1:1 mapping between a specific 
> DataStore instance and a set of crawl data.
> In order to support this functionality the Gora API should be extended so 
> that it can create stores (and data tables in the underlying storage) that 
> use arbitrary prefixes to identify the particular crawl dataset. Then the 
> Nutch API should be extended to allow passing this "crawlId" value to select 
> one of possibly many existing crawl datasets.




[jira] Resolved: (NUTCH-921) Reduce dependency of Nutch on config files

2010-10-21 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  resolved NUTCH-921.
-

Resolution: Fixed

Patch committed in rev. 1025960. Further improvements to be covered in other 
issues.

> Reduce dependency of Nutch on config files
> --
>
> Key: NUTCH-921
> URL: https://issues.apache.org/jira/browse/NUTCH-921
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 2.0
>Reporter: Andrzej Bialecki 
>Assignee: Andrzej Bialecki 
> Fix For: 2.0
>
> Attachments: NUTCH-921.patch
>
>
> Currently many components in Nutch rely on reading their configuration from 
> files. These files need to be on the classpath (or packed into a job jar). 
> This is inconvenient if you want to manage configuration via API, e.g. when 
> embedding Nutch, or running many jobs with slightly different configurations.
> This issue tracks the improvement to make various components read their 
> config directly from Configuration properties.




[jira] Updated: (NUTCH-921) Reduce dependency of Nutch on config files

2010-10-19 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-921:


Attachment: NUTCH-921.patch

Patch that implements reading config parameters from Configuration, falling 
back to the config files when the Configuration properties are unspecified.
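The fallback order the patch describes can be sketched as follows. The `Map` here stands in for Hadoop's `Configuration`, and the method names are illustrative, not the actual patch:

```java
import java.util.Map;
import java.util.function.Supplier;

// Minimal sketch of the resolution order: read a value from in-memory
// Configuration properties first, and only fall back to loading the config
// file when the property is unset. The Map stands in for Hadoop's
// Configuration; names are illustrative.
public class ConfigFallback {
    public static String resolve(Map<String, String> conf, String key,
                                 Supplier<String> fileLoader) {
        String v = conf.get(key);
        // Property set via API wins; otherwise defer to the file on disk.
        return (v != null) ? v : fileLoader.get();
    }
}
```

This is what makes embedding practical: a caller can override any component's configuration through the API without shipping a modified file on the classpath.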

> Reduce dependency of Nutch on config files
> --
>
> Key: NUTCH-921
> URL: https://issues.apache.org/jira/browse/NUTCH-921
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 2.0
>Reporter: Andrzej Bialecki 
>Assignee: Andrzej Bialecki 
> Fix For: 2.0
>
> Attachments: NUTCH-921.patch
>
>
> Currently many components in Nutch rely on reading their configuration from 
> files. These files need to be on the classpath (or packed into a job jar). 
> This is inconvenient if you want to manage configuration via API, e.g. when 
> embedding Nutch, or running many jobs with slightly different configurations.
> This issue tracks the improvement to make various components read their 
> config directly from Configuration properties.




[jira] Created: (NUTCH-921) Reduce dependency of Nutch on config files

2010-10-19 Thread Andrzej Bialecki (JIRA)
Reduce dependency of Nutch on config files
--

 Key: NUTCH-921
 URL: https://issues.apache.org/jira/browse/NUTCH-921
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 2.0


Currently many components in Nutch rely on reading their configuration from 
files. These files need to be on the classpath (or packed into a job jar). This 
is inconvenient if you want to manage configuration via API, e.g. when 
embedding Nutch, or running many jobs with slightly different configurations.

This issue tracks the improvement to make various components read their config 
directly from Configuration properties.




[jira] Commented: (NUTCH-913) Nutch should use new namespace for Gora

2010-10-13 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920610#action_12920610
 ] 

Andrzej Bialecki  commented on NUTCH-913:
-

There are formatting issues in DomainStatistics.java - the file uses literal 
tabs, which we frown upon, but the patch introduces double-space indentation in 
the changed lines. As ugly as it sounds, I think the patch should be changed to 
use tabs (to match the file), and the file then reformatted in a separate commit.

Other than that, +1, go for it.

> Nutch should use new namespace for Gora
> ---
>
> Key: NUTCH-913
> URL: https://issues.apache.org/jira/browse/NUTCH-913
> Project: Nutch
>  Issue Type: Bug
>  Components: storage
>Reporter: Doğacan Güney
>Assignee: Doğacan Güney
> Fix For: 2.0
>
> Attachments: NUTCH-913_v1.patch
>
>
> Gora is in Apache Incubator now (Yey!). We recently changed Gora's namespace 
> from org.gora to org.apache.gora. This means Nutch should use the new 
> namespace, otherwise it won't compile with newer builds of Gora.




[jira] Commented: (NUTCH-864) Fetcher generates entries with status 0

2010-10-01 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916912#action_12916912
 ] 

Andrzej Bialecki  commented on NUTCH-864:
-

I think the difficulty comes from the simplification in 2.x as compared to 1.x, 
in that we keep a single status per page. In 1.x a side-effect of having two 
locations with two statuses (one "db status" in crawldb and one "fetch status" 
in segments) was that we had more information in updatedb to act upon.

Now we should probably keep up to two statuses - one that reflects a temporary 
fetch status, as determined by the fetcher, and a final (reconciled) status as 
determined by updatedb, based on knowledge of not only the plain fetch status 
and the old status but also possible redirects. If I'm not mistaken, currently 
the status is immediately overwritten by the fetcher, even before we come to 
updatedb, hence the problem.
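A hedged sketch of that two-status model: the fetcher records only a transient fetch status, and updatedb derives the final db status from the old status, the fetch status, and any redirect, rather than the fetcher overwriting the single status in place. The status codes and field names below are invented for illustration:

```java
// Sketch of the proposed reconciliation step in updatedb. Codes are
// illustrative stand-ins, not Nutch's actual status values:
// 1 = unfetched, 2 = fetched, 3 = gone, 5 = redirect.
public class StatusReconciler {
    public static int reconcile(int oldDbStatus, int fetchStatus,
                                boolean redirected) {
        if (redirected) {
            return 5; // redirect wins over the raw fetch outcome
        }
        if (fetchStatus == 2) {
            return 2; // successful fetch -> db_fetched
        }
        if (fetchStatus == 3) {
            return 3; // permanent failure -> db_gone
        }
        return oldDbStatus; // transient failure: keep the previous db status
    }
}
```

The key property is the last branch: a transient fetch failure no longer destroys the prior db status, which is exactly the information lost when the fetcher overwrites the single status field directly.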

> Fetcher generates entries with status 0
> ---
>
> Key: NUTCH-864
> URL: https://issues.apache.org/jira/browse/NUTCH-864
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
> Environment: Gora with SQLBackend
> URL: https://svn.apache.org/repos/asf/nutch/branches/nutchbase
> Last Changed Rev: 980748
> Last Changed Date: 2010-07-30 14:19:52 +0200 (Fri, 30 Jul 2010)
>Reporter: Julien Nioche
>Assignee: Doğacan Güney
> Fix For: 2.0
>
>
> After a round of fetching which got the following protocol status :
> 10/07/30 15:11:39 INFO mapred.JobClient: ACCESS_DENIED=2
> 10/07/30 15:11:39 INFO mapred.JobClient: SUCCESS=1177
> 10/07/30 15:11:39 INFO mapred.JobClient: GONE=3
> 10/07/30 15:11:39 INFO mapred.JobClient: TEMP_MOVED=138
> 10/07/30 15:11:39 INFO mapred.JobClient: EXCEPTION=93
> 10/07/30 15:11:39 INFO mapred.JobClient: MOVED=521
> 10/07/30 15:11:39 INFO mapred.JobClient: NOTFOUND=62
> I ran : ./nutch org.apache.nutch.crawl.WebTableReader -stats
> 10/07/30 15:12:37 INFO crawl.WebTableReader: Statistics for WebTable: 
> 10/07/30 15:12:37 INFO crawl.WebTableReader: TOTAL urls:  2690
> 10/07/30 15:12:37 INFO crawl.WebTableReader: retry 0: 2690
> 10/07/30 15:12:37 INFO crawl.WebTableReader: min score:   0.0
> 10/07/30 15:12:37 INFO crawl.WebTableReader: avg score:   0.7587361
> 10/07/30 15:12:37 INFO crawl.WebTableReader: max score:   1.0
> 10/07/30 15:12:37 INFO crawl.WebTableReader: status 0 (null): 649
> 10/07/30 15:12:37 INFO crawl.WebTableReader: status 2 (status_fetched):   
> 1177 (SUCCESS=1177)
> 10/07/30 15:12:37 INFO crawl.WebTableReader: status 3 (status_gone):  112 
> 10/07/30 15:12:37 INFO crawl.WebTableReader: status 34 (status_retry):
> 93 (EXCEPTION=93)
> 10/07/30 15:12:37 INFO crawl.WebTableReader: status 4 (status_redir_temp):
> 138  (TEMP_MOVED=138)
> 10/07/30 15:12:37 INFO crawl.WebTableReader: status 5 (status_redir_perm):
> 521 (MOVED=521)
> 10/07/30 15:12:37 INFO crawl.WebTableReader: WebTable statistics: done
> There should not be any entries with status 0 (null)
> I will investigate a bit more...




[jira] Commented: (NUTCH-894) Move statistical language identification from indexing to parsing step

2010-10-01 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916907#action_12916907
 ] 

Andrzej Bialecki  commented on NUTCH-894:
-

+1, a nice clean up of our code base :)

> Move statistical language identification from indexing to parsing step
> --
>
> Key: NUTCH-894
> URL: https://issues.apache.org/jira/browse/NUTCH-894
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 2.0
>Reporter: Julien Nioche
>Assignee: Julien Nioche
> Fix For: 2.0
>
> Attachments: NUTCH-894.patch
>
>
> The statistical identification of language is currently done part in the 
> indexing step, whereas the detection based on HTTP header and HTML code is 
> done during the parsing.
> We could keep the same logic i.e. do the statistical detection only if 
> nothing has been found with the previous methods but as part of the parsing. 
> This would be useful for ParseFilters which need the language information or 
> to use with ScoringFilters e.g. to focus the crawl on a set of languages.
> Since the statistical models have been ported to Tika we should probably rely 
> on them instead of maintaining our own.
> Any thoughts on this?




[jira] Commented: (NUTCH-882) Design a Host table in GORA

2010-10-01 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916874#action_12916874
 ] 

Andrzej Bialecki  commented on NUTCH-882:
-

Doğacan, I missed your previous comment... the issue with partial Bloom filters 
is usually solved by having each task store its own filter - this worked well 
for MapFiles because they consisted of multiple parts, so a Reader would open a 
part together with its corresponding Bloom filter.

Here it's more complicated, I agree... though this reminds me of the situation 
that is handled by DynamicBloomFilter: it's basically a set of Bloom filters 
with a facade that hides this fact from the user. Here we could construct 
something similar, i.e. don't merge partial filters after closing the output, 
but instead when opening a Reader read all partial filters and pretend they are 
one.
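The reader-side facade could look something like this: keep the per-task partial filters as written, and answer membership by OR-ing the lookup across all of them, the way DynamicBloomFilter hides its internal parts. The single-hash filter below is deliberately simplistic and all names are hypothetical:

```java
import java.util.BitSet;
import java.util.List;

// Sketch of a reader facade over partial Bloom filters: don't merge them
// after the output is closed; instead, open all parts and treat a key as
// possibly present if ANY part might contain it. Deliberately simplistic
// (one hash function, fixed size) - an illustration, not Hadoop's API.
public class CompositeBloomReader {
    static class PartialFilter {
        final BitSet bits = new BitSet(1024);

        void add(String key) {
            bits.set(Math.floorMod(key.hashCode(), 1024));
        }

        boolean mightContain(String key) {
            return bits.get(Math.floorMod(key.hashCode(), 1024));
        }
    }

    private final List<PartialFilter> parts;

    CompositeBloomReader(List<PartialFilter> parts) {
        this.parts = parts;
    }

    /** True if ANY partial filter might contain the key. */
    boolean mightContain(String key) {
        return parts.stream().anyMatch(p -> p.mightContain(key));
    }
}
```

Because Bloom filters only ever produce false positives, OR-ing the partial lookups preserves correctness: a key present in any part is always found, at the cost of a slightly higher combined false-positive rate.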

> Design a Host table in GORA
> ---
>
> Key: NUTCH-882
> URL: https://issues.apache.org/jira/browse/NUTCH-882
> Project: Nutch
>  Issue Type: New Feature
>Affects Versions: 2.0
>Reporter: Julien Nioche
>Assignee: Julien Nioche
> Fix For: 2.0
>
> Attachments: hostdb.patch, NUTCH-882-v1.patch
>
>
> Having a separate GORA table for storing information about hosts (and 
> domains?) would be very useful for : 
> * customising the behaviour of the fetching on a host basis e.g. number of 
> threads, min time between threads etc...
> * storing stats
> * keeping metadata and possibly propagate them to the webpages 
> * keeping a copy of the robots.txt and possibly use that later to filter the 
> webtable
> * store sitemaps files and update the webtable accordingly
> I'll try to come up with a GORA schema for such a host table but any comments 
> are of course already welcome 




[jira] Commented: (NUTCH-864) Fetcher generates entries with status 0

2010-10-01 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-864?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916871#action_12916871
 ] 

Andrzej Bialecki  commented on NUTCH-864:
-

+1 to using a specific value != 0 as a redirect status. A value of 0 is helpful 
as a guard value, i.e. to detect things that were not properly initialized. I 
would even argue that the default value of 0 should be explicitly named 
INVALID.
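The naming suggestion could be sketched as an enum like the one below: zero gets an explicit INVALID name so uninitialized rows are detectable, and redirects get their own non-zero codes. The values here are illustrative, not Nutch's actual status constants:

```java
// Sketch of explicitly naming the zero default. Codes are illustrative,
// not Nutch's real CrawlStatus values.
public enum PageStatus {
    INVALID(0),     // guard value: never written intentionally
    UNFETCHED(1),
    FETCHED(2),
    GONE(3),
    REDIR_TEMP(4),
    REDIR_PERM(5);

    private final int code;

    PageStatus(int code) {
        this.code = code;
    }

    public int code() {
        return code;
    }

    public static PageStatus fromCode(int code) {
        for (PageStatus s : values()) {
            if (s.code == code) {
                return s;
            }
        }
        return INVALID; // unknown codes also surface as INVALID
    }
}
```

With this, a stats report like the one above would print "status 0 (INVALID): 649" instead of "(null)", making the bug self-describing.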

> Fetcher generates entries with status 0
> ---
>
> Key: NUTCH-864
> URL: https://issues.apache.org/jira/browse/NUTCH-864
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
> Environment: Gora with SQLBackend
> URL: https://svn.apache.org/repos/asf/nutch/branches/nutchbase
> Last Changed Rev: 980748
> Last Changed Date: 2010-07-30 14:19:52 +0200 (Fri, 30 Jul 2010)
>Reporter: Julien Nioche
>Assignee: Doğacan Güney
> Fix For: 2.0
>
>
> After a round of fetching which got the following protocol status :
> 10/07/30 15:11:39 INFO mapred.JobClient: ACCESS_DENIED=2
> 10/07/30 15:11:39 INFO mapred.JobClient: SUCCESS=1177
> 10/07/30 15:11:39 INFO mapred.JobClient: GONE=3
> 10/07/30 15:11:39 INFO mapred.JobClient: TEMP_MOVED=138
> 10/07/30 15:11:39 INFO mapred.JobClient: EXCEPTION=93
> 10/07/30 15:11:39 INFO mapred.JobClient: MOVED=521
> 10/07/30 15:11:39 INFO mapred.JobClient: NOTFOUND=62
> I ran : ./nutch org.apache.nutch.crawl.WebTableReader -stats
> 10/07/30 15:12:37 INFO crawl.WebTableReader: Statistics for WebTable: 
> 10/07/30 15:12:37 INFO crawl.WebTableReader: TOTAL urls:  2690
> 10/07/30 15:12:37 INFO crawl.WebTableReader: retry 0: 2690
> 10/07/30 15:12:37 INFO crawl.WebTableReader: min score:   0.0
> 10/07/30 15:12:37 INFO crawl.WebTableReader: avg score:   0.7587361
> 10/07/30 15:12:37 INFO crawl.WebTableReader: max score:   1.0
> 10/07/30 15:12:37 INFO crawl.WebTableReader: status 0 (null): 649
> 10/07/30 15:12:37 INFO crawl.WebTableReader: status 2 (status_fetched):   
> 1177 (SUCCESS=1177)
> 10/07/30 15:12:37 INFO crawl.WebTableReader: status 3 (status_gone):  112 
> 10/07/30 15:12:37 INFO crawl.WebTableReader: status 34 (status_retry):
> 93 (EXCEPTION=93)
> 10/07/30 15:12:37 INFO crawl.WebTableReader: status 4 (status_redir_temp):
> 138  (TEMP_MOVED=138)
> 10/07/30 15:12:37 INFO crawl.WebTableReader: status 5 (status_redir_perm):
> 521 (MOVED=521)
> 10/07/30 15:12:37 INFO crawl.WebTableReader: WebTable statistics: done
> There should not be any entries with status 0 (null)
> I will investigate a bit more...




[jira] Commented: (NUTCH-907) DataStore API doesn't support multiple storage areas for multiple disjoint crawls

2010-10-01 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12916870#action_12916870
 ] 

Andrzej Bialecki  commented on NUTCH-907:
-

Hi Sertan,

Thanks for the patch, this looks very good! A few comments:

* I'm not good at naming things either... schemaId is a little bit cryptic 
though. If we didn't already use crawlId I would vote for that (and then rename 
the current crawlId to batchId or fetchId)... as it is, I don't know - maybe 
datasetId?

* since we now create multiple datasets, we need somehow to manage them - i.e. 
list and delete at least (create is implicit). There is no such functionality 
in this patch, but this can be addressed also as a separate issue.

* IndexerMapReduce.createIndexJob: I think it would be useful to pass the 
"datasetId" as a Job property - this way indexing filter plugins can use this 
property to populate NutchDocument fields if needed. FWIW, this may be a good 
idea to do in other jobs as well...
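The third point amounts to a one-line stamp at job setup time plus a read on the plugin side. The property key and the Map standing in for the job's Configuration are both assumptions for illustration:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of passing the dataset identifier through the job configuration so
// indexing filter plugins can read it and populate NutchDocument fields.
// The property key and the Map stand-in for Hadoop's job Configuration are
// hypothetical.
public class IndexJobSetup {
    static final String DATASET_ID_KEY = "nutch.dataset.id"; // assumed name

    /** Job-setup side: stamp the id into the job's configuration. */
    public static Map<String, String> createIndexJobConf(String datasetId) {
        Map<String, String> jobConf = new HashMap<>();
        jobConf.put(DATASET_ID_KEY, datasetId);
        return jobConf;
    }

    /** Plugin side: what an indexing filter would read back. */
    public static String datasetField(Map<String, String> jobConf) {
        return jobConf.getOrDefault(DATASET_ID_KEY, "default");
    }
}
```

Doing the same in the other jobs, as suggested, would make the dataset id uniformly available wherever a plugin runs.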

> DataStore API doesn't support multiple storage areas for multiple disjoint 
> crawls
> -
>
> Key: NUTCH-907
> URL: https://issues.apache.org/jira/browse/NUTCH-907
> Project: Nutch
>  Issue Type: Bug
>Reporter: Andrzej Bialecki 
> Fix For: 2.0
>
> Attachments: NUTCH-907.patch
>
>
> In Nutch 1.x it was possible to easily select a set of crawl data (crawldb, 
> page data, linkdb, etc) by specifying a path where the data was stored. This 
> enabled users to run several disjoint crawls with different configs, but 
> still using the same storage medium, just under different paths.
> This is not possible now because there is a 1:1 mapping between a specific 
> DataStore instance and a set of crawl data.
> In order to support this functionality the Gora API should be extended so 
> that it can create stores (and data tables in the underlying storage) that 
> use arbitrary prefixes to identify the particular crawl dataset. Then the 
> Nutch API should be extended to allow passing this "crawlId" value to select 
> one of possibly many existing crawl datasets.




[jira] Commented: (NUTCH-880) REST API (and webapp) for Nutch

2010-09-21 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12913118#action_12913118
 ] 

Andrzej Bialecki  commented on NUTCH-880:
-

bq. I think we can combine the approach you outlined in NUTCH-907 with this one.

I'm not sure... they are really not the same things - you can execute many 
crawls with different seed lists, but still using the same Configuration.

bq. What is "CLASS" ?

It's the same as bin/nutch fully.qualified.class.name, only here I require that 
it implements NutchTool.

bq. Btw, Andrzej, I will be happy to help out with the implementation if you 
want.

By all means - I didn't have time so far to progress beyond this patch...

> REST API (and webapp) for Nutch
> ---
>
> Key: NUTCH-880
> URL: https://issues.apache.org/jira/browse/NUTCH-880
> Project: Nutch
>  Issue Type: New Feature
>Affects Versions: 2.0
>Reporter: Andrzej Bialecki 
>Assignee: Andrzej Bialecki 
> Attachments: API.patch
>
>
> This issue is for discussing a REST-style API for accessing Nutch.
> Here's an initial idea:
> * I propose to use org.restlet for handling requests and returning 
> JSON/XML/whatever responses.
> * hook up all regular tools so that they can be driven via this API. This 
> would have to be an async API, since all Nutch operations take long time to 
> execute. It follows then that we need to be able also to list running 
> operations, retrieve their current status, and possibly 
> abort/cancel/stop/suspend/resume/...? This also means that we would have to 
> potentially create & manage many threads in a servlet - AFAIK this is frowned 
> upon by J2EE purists...
> * package this in a webapp (that includes all deps, essentially nutch.job 
> content), with the restlet servlet as an entry point.
> Open issues:
> * how to implement the reading of crawl results via this API
> * should we manage only crawls that use a single configuration per webapp, or 
> should we have a notion of crawl contexts (sets of crawl configs) with CRUD 
> ops on them? this would be nice, because it would allow managing of several 
> different crawls, with different configs, in a single webapp - but it 
> complicates the implementation a lot.




[jira] Commented: (NUTCH-909) Add alternative search-provider to Nutch site

2010-09-20 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-909?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12912474#action_12912474
 ] 

Andrzej Bialecki  commented on NUTCH-909:
-

bq. It might be better to see the message "Search with Apache Solr" (as on the 
TIKA's site).

Yes, let's make this uniform.

> Add alternative search-provider to Nutch site
> -
>
> Key: NUTCH-909
> URL: https://issues.apache.org/jira/browse/NUTCH-909
> Project: Nutch
>  Issue Type: Improvement
>  Components: documentation
>Reporter: Alex Baranau
>Priority: Minor
> Attachments: NUTCH-909.patch
>
>
> Add an additional search provider (next to the existing Lucid Find): search-lucene.com. 
> Initiated in discussion: http://search-lucene.com/m/2suCr1UnDfF1
> According to Andrzej's suggestion, "when preparing the patch let's follow the 
> same rationales as those in TIKA-488, since they are applicable here too", so 
> please refer to that issue for more insight on implementation details.




[jira] Resolved: (NUTCH-906) Nutch OpenSearch sometimes raises DOMExceptions due to Lucene column names not being valid XML tag names

2010-09-17 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  resolved NUTCH-906.
-

Fix Version/s: 1.2
   Resolution: Fixed

Fixed in rev. 998261. Thanks!

> Nutch OpenSearch sometimes raises DOMExceptions due to Lucene column names 
> not being valid XML tag names
> 
>
> Key: NUTCH-906
> URL: https://issues.apache.org/jira/browse/NUTCH-906
> Project: Nutch
>  Issue Type: Bug
>  Components: web gui
>Affects Versions: 1.1
> Environment: Debian GNU/Linux 64-bit
>Reporter: Asheesh Laroia
>Assignee: Andrzej Bialecki 
> Fix For: 1.2
>
> Attachments: 
> 0001-OpenSearch-If-a-Lucene-column-name-begins-with-a-num.patch
>
>   Original Estimate: 0.33h
>  Remaining Estimate: 0.33h
>
> The Nutch FAQ explains that OpenSearch includes "all fields that are 
> available at search result time." However, some Lucene column names can start 
> with numbers. Valid XML tags cannot. If Nutch is generating OpenSearch 
> results for a document with a Lucene document column whose name starts with 
> numbers, the underlying Xerces library throws this exception: 
> org.w3c.dom.DOMException: INVALID_CHARACTER_ERR: An invalid or illegal XML 
> character is specified. 
> So I have written a patch that tests strings before they are used to generate 
> tags within OpenSearch.
> I hope you merge this, or a better version of the patch!




[jira] Assigned: (NUTCH-906) Nutch OpenSearch sometimes raises DOMExceptions due to Lucene column names not being valid XML tag names

2010-09-17 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  reassigned NUTCH-906:
---

Assignee: Andrzej Bialecki 

> Nutch OpenSearch sometimes raises DOMExceptions due to Lucene column names 
> not being valid XML tag names
> 
>
> Key: NUTCH-906
> URL: https://issues.apache.org/jira/browse/NUTCH-906
> Project: Nutch
>  Issue Type: Bug
>  Components: web gui
>Affects Versions: 1.1
> Environment: Debian GNU/Linux 64-bit
>Reporter: Asheesh Laroia
>Assignee: Andrzej Bialecki 
> Attachments: 
> 0001-OpenSearch-If-a-Lucene-column-name-begins-with-a-num.patch
>
>   Original Estimate: 0.33h
>  Remaining Estimate: 0.33h
>
> The Nutch FAQ explains that OpenSearch includes "all fields that are 
> available at search result time." However, some Lucene column names can start 
> with numbers. Valid XML tags cannot. If Nutch is generating OpenSearch 
> results for a document with a Lucene document column whose name starts with 
> numbers, the underlying Xerces library throws this exception: 
> org.w3c.dom.DOMException: INVALID_CHARACTER_ERR: An invalid or illegal XML 
> character is specified. 
> So I have written a patch that tests strings before they are used to generate 
> tags within OpenSearch.
> I hope you merge this, or a better version of the patch!




[jira] Resolved: (NUTCH-862) HttpClient null pointer exception

2010-09-17 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  resolved NUTCH-862.
-

Fix Version/s: 1.2
   2.0
   Resolution: Fixed

Fix applied to branch-1.2 (rev. 998156), branch-1.3 (rev. 998158) and trunk 
(998160). Thank you!

> HttpClient null pointer exception
> -
>
> Key: NUTCH-862
> URL: https://issues.apache.org/jira/browse/NUTCH-862
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.0.0
> Environment: linux, java 6
>Reporter: Sebastian Nagel
>Assignee: Andrzej Bialecki 
>Priority: Minor
> Fix For: 1.2, 2.0
>
> Attachments: NUTCH-862.patch
>
>
> When re-fetching a document (a continued crawl) HttpClient throws a null 
> pointer exception causing the document to be emptied:
> 2010-07-27 12:45:09,199 INFO  fetcher.Fetcher - fetching 
> http://localhost/doc/selfhtml/html/index.htm
> 2010-07-27 12:45:09,203 ERROR httpclient.Http - java.lang.NullPointerException
> 2010-07-27 12:45:09,204 ERROR httpclient.Http - at 
> org.apache.nutch.protocol.httpclient.HttpResponse.(HttpResponse.java:138)
> 2010-07-27 12:45:09,204 ERROR httpclient.Http - at 
> org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:154)
> 2010-07-27 12:45:09,204 ERROR httpclient.Http - at 
> org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:220)
> 2010-07-27 12:45:09,204 ERROR httpclient.Http - at 
> org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:537)
> 2010-07-27 12:45:09,204 INFO  fetcher.Fetcher - fetch of 
> http://localhost/doc/selfhtml/html/index.htm failed with: 
> java.lang.NullPointerException
> Because the document is re-fetched the server answers "304" (not modified):
> 127.0.0.1 - - [27/Jul/2010:12:45:09 +0200] "GET /doc/selfhtml/html/index.htm 
> HTTP/1.0" 304 174 "-" "Nutch-1.0"
> No content is sent in this case (empty http body).
> Index: 
> trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java
> ===
> --- 
> trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java
> (revision 979647)
> +++ 
> trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java
> (working copy)
> @@ -134,7 +134,8 @@
>  if (code == 200) throw new IOException(e.toString());
>  // for codes other than 200 OK, we are fine with empty content
>} finally {
> -in.close();
> +if (in != null)
> +  in.close();
>  get.abort();
>}
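The null guard in the diff above can be shown standalone; the helper name `closeQuietly` is illustrative, not from the patch (on Java 7+ a try-with-resources would achieve the same effect without an explicit check):

```java
import java.io.Closeable;
import java.io.IOException;

public class SafeClose {
    // Close a resource that may legitimately be null, as in the HTTP 304
    // case where no response body stream was ever opened.
    static void closeQuietly(Closeable in) {
        if (in != null) {
            try {
                in.close();
            } catch (IOException e) {
                // best effort: a failure while closing is not fatal here
            }
        }
    }

    public static void main(String[] args) {
        closeQuietly(null); // no NullPointerException
        System.out.println("ok");
    }
}
```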




[jira] Assigned: (NUTCH-862) HttpClient null pointer exception

2010-09-17 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  reassigned NUTCH-862:
---

Assignee: Andrzej Bialecki 

> HttpClient null pointer exception
> -
>
> Key: NUTCH-862
> URL: https://issues.apache.org/jira/browse/NUTCH-862
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.0.0
> Environment: linux, java 6
>Reporter: Sebastian Nagel
>Assignee: Andrzej Bialecki 
>Priority: Minor
> Attachments: NUTCH-862.patch
>
>
> When re-fetching a document (a continued crawl) HttpClient throws a null 
> pointer exception causing the document to be emptied:
> 2010-07-27 12:45:09,199 INFO  fetcher.Fetcher - fetching 
> http://localhost/doc/selfhtml/html/index.htm
> 2010-07-27 12:45:09,203 ERROR httpclient.Http - java.lang.NullPointerException
> 2010-07-27 12:45:09,204 ERROR httpclient.Http - at 
> org.apache.nutch.protocol.httpclient.HttpResponse.(HttpResponse.java:138)
> 2010-07-27 12:45:09,204 ERROR httpclient.Http - at 
> org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:154)
> 2010-07-27 12:45:09,204 ERROR httpclient.Http - at 
> org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:220)
> 2010-07-27 12:45:09,204 ERROR httpclient.Http - at 
> org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:537)
> 2010-07-27 12:45:09,204 INFO  fetcher.Fetcher - fetch of 
> http://localhost/doc/selfhtml/html/index.htm failed with: 
> java.lang.NullPointerException
> Because the document is re-fetched the server answers "304" (not modified):
> 127.0.0.1 - - [27/Jul/2010:12:45:09 +0200] "GET /doc/selfhtml/html/index.htm 
> HTTP/1.0" 304 174 "-" "Nutch-1.0"
> No content is sent in this case (empty http body).
> Index: 
> trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java
> ===
> --- 
> trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java
> (revision 979647)
> +++ 
> trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java
> (working copy)
> @@ -134,7 +134,8 @@
>  if (code == 200) throw new IOException(e.toString());
>  // for codes other than 200 OK, we are fine with empty content
>} finally {
> -in.close();
> +if (in != null)
> +  in.close();
>  get.abort();
>}




[jira] Updated: (NUTCH-880) REST API (and webapp) for Nutch

2010-09-16 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-880:


Attachment: API.patch

Initial patch for discussion. This is a work in progress, so only some 
functionality is implemented, and even less than that is actually working ;)

I would appreciate a review and comments.

> REST API (and webapp) for Nutch
> ---
>
> Key: NUTCH-880
> URL: https://issues.apache.org/jira/browse/NUTCH-880
> Project: Nutch
>  Issue Type: New Feature
>Affects Versions: 2.0
>Reporter: Andrzej Bialecki 
>Assignee: Andrzej Bialecki 
> Attachments: API.patch
>
>
> This issue is for discussing a REST-style API for accessing Nutch.
> Here's an initial idea:
> * I propose to use org.restlet for handling requests and returning 
> JSON/XML/whatever responses.
> * hook up all regular tools so that they can be driven via this API. This 
> would have to be an async API, since all Nutch operations take a long time to 
> execute. It follows then that we need to be able also to list running 
> operations, retrieve their current status, and possibly 
> abort/cancel/stop/suspend/resume/...? This also means that we would have to 
> potentially create & manage many threads in a servlet - AFAIK this is frowned 
> upon by J2EE purists...
> * package this in a webapp (that includes all deps, essentially nutch.job 
> content), with the restlet servlet as an entry point.
> Open issues:
> * how to implement the reading of crawl results via this API
> * should we manage only crawls that use a single configuration per webapp, or 
> should we have a notion of crawl contexts (sets of crawl configs) with CRUD 
> ops on them? this would be nice, because it would allow managing of several 
> different crawls, with different configs, in a single webapp - but it 
> complicates the implementation a lot.
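The async pattern described above (submit a long-running operation, get an id back, then list and poll statuses) can be modeled with plain JDK concurrency primitives. The class and method names below are illustrative sketches for discussion, not part of the attached API.patch:

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.*;

// Minimal model of an async job registry: submit() returns an id
// immediately; callers poll status(id), as a REST front end would.
public class JobManager {
    enum Status { RUNNING, DONE, FAILED }

    private final ExecutorService pool = Executors.newFixedThreadPool(4);
    private final Map<String, Future<?>> jobs = new ConcurrentHashMap<>();

    public String submit(Runnable task) {
        String id = UUID.randomUUID().toString();
        jobs.put(id, pool.submit(task));
        return id;
    }

    public Status status(String id) {
        Future<?> f = jobs.get(id);
        if (f == null) return null;
        if (!f.isDone()) return Status.RUNNING;
        try {
            f.get();                  // completed: surface success/failure
            return Status.DONE;
        } catch (Exception e) {
            return Status.FAILED;
        }
    }

    public void shutdown() { pool.shutdown(); }

    public static void main(String[] args) throws Exception {
        JobManager jm = new JobManager();
        String id = jm.submit(() -> { });   // stands in for e.g. a fetch cycle
        while (jm.status(id) == Status.RUNNING) Thread.sleep(10);
        System.out.println(jm.status(id));  // DONE
        jm.shutdown();
    }
}
```

Cancellation and suspend/resume would layer on top of the stored Future handles, which is where the "threads in a servlet" concern above comes in.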




[jira] Assigned: (NUTCH-880) REST API (and webapp) for Nutch

2010-09-16 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  reassigned NUTCH-880:
---

Assignee: Andrzej Bialecki 

> REST API (and webapp) for Nutch
> ---
>
> Key: NUTCH-880
> URL: https://issues.apache.org/jira/browse/NUTCH-880
> Project: Nutch
>  Issue Type: New Feature
>Affects Versions: 2.0
>Reporter: Andrzej Bialecki 
>Assignee: Andrzej Bialecki 
>
> This issue is for discussing a REST-style API for accessing Nutch.
> Here's an initial idea:
> * I propose to use org.restlet for handling requests and returning 
> JSON/XML/whatever responses.
> * hook up all regular tools so that they can be driven via this API. This 
> would have to be an async API, since all Nutch operations take a long time to 
> execute. It follows then that we need to be able also to list running 
> operations, retrieve their current status, and possibly 
> abort/cancel/stop/suspend/resume/...? This also means that we would have to 
> potentially create & manage many threads in a servlet - AFAIK this is frowned 
> upon by J2EE purists...
> * package this in a webapp (that includes all deps, essentially nutch.job 
> content), with the restlet servlet as an entry point.
> Open issues:
> * how to implement the reading of crawl results via this API
> * should we manage only crawls that use a single configuration per webapp, or 
> should we have a notion of crawl contexts (sets of crawl configs) with CRUD 
> ops on them? this would be nice, because it would allow managing of several 
> different crawls, with different configs, in a single webapp - but it 
> complicates the implementation a lot.




[jira] Commented: (NUTCH-907) DataStore API doesn't support multiple storage areas for multiple disjoint crawls

2010-09-16 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12910109#action_12910109
 ] 

Andrzej Bialecki  commented on NUTCH-907:
-

That's very good news - in that case I'm fine with the Gora API as it is now, 
we should change Nutch to make use of this functionality.

> DataStore API doesn't support multiple storage areas for multiple disjoint 
> crawls
> -
>
> Key: NUTCH-907
> URL: https://issues.apache.org/jira/browse/NUTCH-907
> Project: Nutch
>  Issue Type: Bug
>Reporter: Andrzej Bialecki 
> Fix For: 2.0
>
>
> In Nutch 1.x it was possible to easily select a set of crawl data (crawldb, 
> page data, linkdb, etc) by specifying a path where the data was stored. This 
> enabled users to run several disjoint crawls with different configs, but 
> still using the same storage medium, just under different paths.
> This is not possible now because there is a 1:1 mapping between a specific 
> DataStore instance and a set of crawl data.
> In order to support this functionality the Gora API should be extended so 
> that it can create stores (and data tables in the underlying storage) that 
> use arbitrary prefixes to identify the particular crawl dataset. Then the 
> Nutch API should be extended to allow passing this "crawlId" value to select 
> one of possibly many existing crawl datasets.




[jira] Commented: (NUTCH-882) Design a Host table in GORA

2010-09-15 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12909757#action_12909757
 ] 

Andrzej Bialecki  commented on NUTCH-882:
-

+1 to NutchContext. See also NUTCH-907 because the changes required in the Gora API 
will likely make this task easier (once implemented ;) ).

> Design a Host table in GORA
> ---
>
> Key: NUTCH-882
> URL: https://issues.apache.org/jira/browse/NUTCH-882
> Project: Nutch
>  Issue Type: New Feature
>Affects Versions: 2.0
>Reporter: Julien Nioche
>Assignee: Julien Nioche
> Fix For: 2.0
>
> Attachments: NUTCH-882-v1.patch
>
>
> Having a separate GORA table for storing information about hosts (and 
> domains?) would be very useful for : 
> * customising the behaviour of the fetching on a host basis e.g. number of 
> threads, min time between threads etc...
> * storing stats
> * keeping metadata and possibly propagate them to the webpages 
> * keeping a copy of the robots.txt and possibly use that later to filter the 
> webtable
> * store sitemaps files and update the webtable accordingly
> I'll try to come up with a GORA schema for such a host table but any comments 
> are of course already welcome 
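For discussion, a first cut at such a host-table schema might look like the following Avro record. All field names here are guesses to make the bullet points above concrete, not a committed design:

```json
{
  "name": "Host",
  "type": "record",
  "namespace": "org.apache.nutch.storage",
  "fields": [
    {"name": "metadata",      "type": {"type": "map", "values": "bytes"}},
    {"name": "fetchThreads",  "type": ["null", "int"]},
    {"name": "minCrawlDelay", "type": ["null", "long"]},
    {"name": "robotsTxt",     "type": ["null", "string"]},
    {"name": "sitemaps",      "type": {"type": "array", "items": "string"}}
  ]
}
```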




[jira] Created: (NUTCH-907) DataStore API doesn't support multiple storage areas for multiple disjoint crawls

2010-09-15 Thread Andrzej Bialecki (JIRA)
DataStore API doesn't support multiple storage areas for multiple disjoint 
crawls
-

 Key: NUTCH-907
 URL: https://issues.apache.org/jira/browse/NUTCH-907
 Project: Nutch
  Issue Type: Bug
Reporter: Andrzej Bialecki 
 Fix For: 2.0


In Nutch 1.x it was possible to easily select a set of crawl data (crawldb, 
page data, linkdb, etc) by specifying a path where the data was stored. This 
enabled users to run several disjoint crawls with different configs, but still 
using the same storage medium, just under different paths.

This is not possible now because there is a 1:1 mapping between a specific 
DataStore instance and a set of crawl data.

In order to support this functionality the Gora API should be extended so that 
it can create stores (and data tables in the underlying storage) that use 
arbitrary prefixes to identify the particular crawl dataset. Then the Nutch API 
should be extended to allow passing this "crawlId" value to select one of 
possibly many existing crawl datasets.
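The proposed prefixing could be as simple as composing the physical table (or schema) name from the crawl id. The helper below is a sketch of the idea, not the Gora API:

```java
public class CrawlDatasetName {
    // Compose a per-crawl table name so disjoint crawls can share one
    // storage backend under different prefixes, mirroring the per-path
    // separation Nutch 1.x offered on HDFS.
    static String tableFor(String crawlId, String baseTable) {
        if (crawlId == null || crawlId.isEmpty()) return baseTable;
        return crawlId + "_" + baseTable;
    }

    public static void main(String[] args) {
        System.out.println(tableFor("crawl1", "webpage")); // crawl1_webpage
        System.out.println(tableFor(null, "webpage"));     // webpage
    }
}
```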




[jira] Commented: (NUTCH-893) DataStore.put() silently loses records when executed from multiple processes

2010-09-13 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12908791#action_12908791
 ] 

Andrzej Bialecki  commented on NUTCH-893:
-

+1 and +1.

> DataStore.put() silently loses records when executed from multiple processes
> 
>
> Key: NUTCH-893
> URL: https://issues.apache.org/jira/browse/NUTCH-893
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 2.0
> Environment: Gora HEAD, SqlStore, MySQL 5.1, Ubuntu 10.4 x64, Sun JDK 
> 1.6
>Reporter: Andrzej Bialecki 
>Priority: Blocker
> Fix For: 2.0
>
> Attachments: NUTCH-893.patch, NUTCH-893_v2.patch
>
>
> In order to debug the issue described in NUTCH-879 I created a test to 
> simulate multiple clients appending to webtable (please see the patch), which 
> is the situation that we have in distributed map-reduce jobs.
> There are two tests there: one that uses multiple threads within the same 
> JVM, and another that uses a single thread in multiple JVMs. Each test first 
> clears webtable (be careful!), and then puts a bunch of pages, and finally 
> counts that all are present and their values correspond to keys. To make 
> things more interesting each execution context (thread or process) closes and 
> reopens its instance of DataStore a few times.
> The multithreaded test passes just fine. However, the multi-process test 
> fails with missing keys, as many as 30%.




[jira] Commented: (NUTCH-893) DataStore.put() silently loses records when executed from multiple processes

2010-09-08 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907297#action_12907297
 ] 

Andrzej Bialecki  commented on NUTCH-893:
-

Very good catch - yes, the test now passes for me too. This is actually good 
news for Gora :) I'll continue digging regarding NUTCH-879 ... don't hesitate 
if you have any ideas on how to solve that. I suspect we may be losing keys in 
Generator or Fetcher due to partitioning collisions, but this hypothesis needs 
to be tested.

> DataStore.put() silently loses records when executed from multiple processes
> 
>
> Key: NUTCH-893
> URL: https://issues.apache.org/jira/browse/NUTCH-893
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 2.0
> Environment: Gora HEAD, SqlStore, MySQL 5.1, Ubuntu 10.4 x64, Sun JDK 
> 1.6
>Reporter: Andrzej Bialecki 
>Priority: Blocker
> Fix For: 2.0
>
> Attachments: NUTCH-893.patch, NUTCH-893_v2.patch
>
>
> In order to debug the issue described in NUTCH-879 I created a test to 
> simulate multiple clients appending to webtable (please see the patch), which 
> is the situation that we have in distributed map-reduce jobs.
> There are two tests there: one that uses multiple threads within the same 
> JVM, and another that uses a single thread in multiple JVMs. Each test first 
> clears webtable (be careful!), and then puts a bunch of pages, and finally 
> counts that all are present and their values correspond to keys. To make 
> things more interesting each execution context (thread or process) closes and 
> reopens its instance of DataStore a few times.
> The multithreaded test passes just fine. However, the multi-process test 
> fails with missing keys, as many as 30%.




[jira] Commented: (NUTCH-893) DataStore.put() silently loses records when executed from multiple processes

2010-08-30 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12904226#action_12904226
 ] 

Andrzej Bialecki  commented on NUTCH-893:
-

Dogacan, flush() doesn't help - there are still missing keys. What's 
interesting is that the missing keys form sequential ranges. Could this 
perhaps be an issue with connection management, or some synchronization issue?

> DataStore.put() silently loses records when executed from multiple processes
> 
>
> Key: NUTCH-893
> URL: https://issues.apache.org/jira/browse/NUTCH-893
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 2.0
> Environment: Gora HEAD, SqlStore, MySQL 5.1, Ubuntu 10.4 x64, Sun JDK 
> 1.6
>Reporter: Andrzej Bialecki 
>Priority: Blocker
> Fix For: 2.0
>
> Attachments: NUTCH-893.patch
>
>
> In order to debug the issue described in NUTCH-879 I created a test to 
> simulate multiple clients appending to webtable (please see the patch), which 
> is the situation that we have in distributed map-reduce jobs.
> There are two tests there: one that uses multiple threads within the same 
> JVM, and another that uses single thread in multiple JVMs. Each test first 
> clears webtable (be careful!), and then puts a bunch of pages, and finally 
> counts that all are present and their values correspond to keys. To make 
> things more interesting each execution context (thread or process) closes and 
> reopens its instance of DataStore a few times.
> The multithreaded test passes just fine. However, the multi-process test 
> fails with missing keys, as many as 30%.




[jira] Updated: (NUTCH-893) DataStore.put() silently loses records when executed from multiple processes

2010-08-25 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-893:


Attachment: NUTCH-893.patch

Unit test to illustrate the issue.

> DataStore.put() silently loses records when executed from multiple processes
> 
>
> Key: NUTCH-893
> URL: https://issues.apache.org/jira/browse/NUTCH-893
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 2.0
> Environment: Gora HEAD, SqlStore, MySQL 5.1, Ubuntu 10.4 x64, Sun JDK 
> 1.6
>Reporter: Andrzej Bialecki 
> Attachments: NUTCH-893.patch
>
>
> In order to debug the issue described in NUTCH-879 I created a test to 
> simulate multiple clients appending to webtable (please see the patch), which 
> is the situation that we have in distributed map-reduce jobs.
> There are two tests there: one that uses multiple threads within the same 
> JVM, and another that uses single thread in multiple JVMs. Each test first 
> clears webtable (be careful!), and then puts a bunch of pages, and finally 
> counts that all are present and their values correspond to keys. To make 
> things more interesting each execution context (thread or process) closes and 
> reopens its instance of DataStore a few times.
> The multithreaded test passes just fine. However, the multi-process test 
> fails with missing keys, as many as 30%.




[jira] Created: (NUTCH-893) DataStore.put() silently loses records when executed from multiple processes

2010-08-25 Thread Andrzej Bialecki (JIRA)
DataStore.put() silently loses records when executed from multiple processes


 Key: NUTCH-893
 URL: https://issues.apache.org/jira/browse/NUTCH-893
 Project: Nutch
  Issue Type: Bug
 Environment: Gora HEAD, SqlStore, MySQL 5.1, Ubuntu 10.4 x64, Sun JDK 
1.6
Reporter: Andrzej Bialecki 


In order to debug the issue described in NUTCH-879 I created a test to simulate 
multiple clients appending to webtable (please see the patch), which is the 
situation that we have in distributed map-reduce jobs.

There are two tests there: one that uses multiple threads within the same JVM, 
and another that uses a single thread in multiple JVMs. Each test first clears 
webtable (be careful!), and then puts a bunch of pages, and finally counts that 
all are present and their values correspond to keys. To make things more 
interesting each execution context (thread or process) closes and reopens its 
instance of DataStore a few times.

The multithreaded test passes just fine. However, the multi-process test fails 
with missing keys, as many as 30%.
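The shape of the check the test performs can be sketched against any key-value store. Here a plain HashMap stands in for the DataStore (so the in-process run trivially keeps all keys); the real test runs the same put-then-count against a SQL-backed store shared between threads or JVMs:

```java
import java.util.HashMap;
import java.util.Map;

public class PutCountCheck {
    // Put n records keyed by URL, then count how many survive with a
    // value that matches its key. A lossy store returns less than n.
    static int putThenCount(int n) {
        Map<String, String> store = new HashMap<>(); // stands in for webtable
        for (int i = 0; i < n; i++) {
            store.put("http://example.com/page" + i, "content-" + i);
        }
        int present = 0;
        for (int i = 0; i < n; i++) {
            String v = store.get("http://example.com/page" + i);
            if (("content-" + i).equals(v)) present++;
        }
        return present;
    }

    public static void main(String[] args) {
        // In-process nothing is lost; the report above saw up to 30%
        // of keys missing when the writers were separate JVMs.
        System.out.println(putThenCount(1000) + "/1000"); // 1000/1000
    }
}
```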




[jira] Updated: (NUTCH-893) DataStore.put() silently loses records when executed from multiple processes

2010-08-25 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-893:


Affects Version/s: 2.0

> DataStore.put() silently loses records when executed from multiple processes
> 
>
> Key: NUTCH-893
> URL: https://issues.apache.org/jira/browse/NUTCH-893
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 2.0
> Environment: Gora HEAD, SqlStore, MySQL 5.1, Ubuntu 10.4 x64, Sun JDK 
> 1.6
>Reporter: Andrzej Bialecki 
>
> In order to debug the issue described in NUTCH-879 I created a test to 
> simulate multiple clients appending to webtable (please see the patch), which 
> is the situation that we have in distributed map-reduce jobs.
> There are two tests there: one that uses multiple threads within the same 
> JVM, and another that uses single thread in multiple JVMs. Each test first 
> clears webtable (be careful!), and then puts a bunch of pages, and finally 
> counts that all are present and their values correspond to keys. To make 
> things more interesting each execution context (thread or process) closes and 
> reopens its instance of DataStore a few times.
> The multithreaded test passes just fine. However, the multi-process test 
> fails with missing keys, as many as 30%.




[jira] Commented: (NUTCH-891) Nutch build should not depend on unversioned local deps

2010-08-19 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900455#action_12900455
 ] 

Andrzej Bialecki  commented on NUTCH-891:
-

Yes, this would help.

> Nutch build should not depend on unversioned local deps
> ---
>
> Key: NUTCH-891
> URL: https://issues.apache.org/jira/browse/NUTCH-891
> Project: Nutch
>  Issue Type: Bug
>Reporter: Andrzej Bialecki 
>
> The fix in NUTCH-873 introduces an unknown variable to the build process. 
> Since local ivy artifacts are unversioned, different people that install Gora 
> jars at different points in time will use the same artifact id but in fact 
> the artifacts (jars) will differ because they will come from different 
> revisions of Gora sources. Therefore Nutch builds based on the same svn rev. 
> won't be repeatable across different environments.
> As much as it pains the ivy purists ;) until Gora publishes versioned 
> artifacts I'd like to revert the fix in NUTCH-873 and add again Gora jars 
> built from a known external rev. We can add a README that contains commit id 
> from Gora.




[jira] Commented: (NUTCH-891) Nutch build should not depend on unversioned local deps

2010-08-19 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900285#action_12900285
 ] 

Andrzej Bialecki  commented on NUTCH-891:
-

bq. So, your point is [..]

Yes, that's exactly my point.

bq. I'd say, why not make the Gora Ant build publish a gora-0.1-.jar?

Sure, that would solve the problem for now - I'll bother the Gora devs, and you 
can create the patch, ok? :) Ultimately we should go with the other solution 
(publish to Maven), but it requires more involvement from Gora devs.

bq. I'm not trying to be difficult about NUTCH-873 ...

Neither am I, no egos here - I just find the current situation after the fix to 
be intractable, especially when doing bugfixing and testing - because even if 
APIs stay the same, hidden bugs may not be the same across revisions...

> Nutch build should not depend on unversioned local deps
> ---
>
> Key: NUTCH-891
> URL: https://issues.apache.org/jira/browse/NUTCH-891
> Project: Nutch
>  Issue Type: Bug
>Reporter: Andrzej Bialecki 
>
> The fix in NUTCH-873 introduces an unknown variable to the build process. 
> Since local ivy artifacts are unversioned, different people that install Gora 
> jars at different points in time will use the same artifact id but in fact 
> the artifacts (jars) will differ because they will come from different 
> revisions of Gora sources. Therefore Nutch builds based on the same svn rev. 
> won't be repeatable across different environments.
> As much as it pains the ivy purists ;) until Gora publishes versioned 
> artifacts I'd like to revert the fix in NUTCH-873 and add again Gora jars 
> built from a known external rev. We can add a README that contains commit id 
> from Gora.




[jira] Resolved: (NUTCH-884) FetcherJob should run more reduce tasks than default

2010-08-19 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  resolved NUTCH-884.
-

Resolution: Fixed

Committed in rev. 986647.

> FetcherJob should run more reduce tasks than default
> 
>
> Key: NUTCH-884
> URL: https://issues.apache.org/jira/browse/NUTCH-884
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 2.0
>Reporter: Andrzej Bialecki 
>Assignee: Andrzej Bialecki 
> Fix For: 2.0
>
> Attachments: NUTCH-884.patch, NUTCH-884.patch
>
>
> FetcherJob now performs fetching in the reduce phase. This means that in a 
> typical Hadoop setup there will be many fewer reduce tasks than map tasks, 
> and consequently the max. total throughput of Fetcher will be proportionally 
> reduced. I propose that FetcherJob should set the number of reduce tasks to 
> the number of map tasks. This way the fetching will be more granular.
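The throughput argument is simple arithmetic: total fetch parallelism is threads-per-task times the number of tasks doing the fetching, so with fetching in the reduce phase the reducer count caps it. The numbers below are purely illustrative:

```java
public class FetchParallelism {
    // Upper bound on concurrent fetches when fetching runs in reducers.
    static int maxConcurrentFetches(int threadsPerTask, int numReduceTasks) {
        return threadsPerTask * numReduceTasks;
    }

    public static void main(String[] args) {
        int threads = 10; // e.g. fetcher threads configured per task
        System.out.println(maxConcurrentFetches(threads, 2));  // 20: few reducers
        System.out.println(maxConcurrentFetches(threads, 40)); // 400: reducers = map tasks
    }
}
```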




[jira] Created: (NUTCH-891) Nutch build should not depend on unversioned local deps

2010-08-19 Thread Andrzej Bialecki (JIRA)
Nutch build should not depend on unversioned local deps
---

 Key: NUTCH-891
 URL: https://issues.apache.org/jira/browse/NUTCH-891
 Project: Nutch
  Issue Type: Bug
Reporter: Andrzej Bialecki 


The fix in NUTCH-873 introduces an unknown variable to the build process. Since 
local ivy artifacts are unversioned, different people that install Gora jars at 
different points in time will use the same artifact id but in fact the 
artifacts (jars) will differ because they will come from different revisions of 
Gora sources. Therefore Nutch builds based on the same svn rev. won't be 
repeatable across different environments.

As much as it pains the ivy purists ;) until Gora publishes versioned artifacts 
I'd like to revert the fix in NUTCH-873 and add again Gora jars built from a 
known external rev. We can add a README that contains commit id from Gora.




[jira] Created: (NUTCH-890) SqlStore doesn't work with nested types in Avro schema

2010-08-19 Thread Andrzej Bialecki (JIRA)
SqlStore doesn't work with nested types in Avro schema
--

 Key: NUTCH-890
 URL: https://issues.apache.org/jira/browse/NUTCH-890
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.0
 Environment: Nutch trunk, Gora trunk, Ubuntu 10.4 x64, Sun JDK 1.6, 
hsqldb-2.0.0, MySQL 5.1.41, HBase 0.20.6 / Hadoop 0.20.2
Reporter: Andrzej Bialecki 


ParseStatus and ProtocolStatus are not properly serialized and stored when 
using SqlStore. This may indicate a broader issue in Gora with processing of 
nested types in Avro schemas.

HBaseStore works properly, i.e. both types can be correctly stored and 
retrieved. SqlStore produces either a NULL or a '\0\0' value. This happens both 
when using HSQLDB and MySQL.




[jira] Updated: (NUTCH-884) FetcherJob should run more reduce tasks than default

2010-08-18 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-884:


Attachment: NUTCH-884.patch

Corrected mistake in arg handling.

> FetcherJob should run more reduce tasks than default
> 
>
> Key: NUTCH-884
> URL: https://issues.apache.org/jira/browse/NUTCH-884
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 2.0
>Reporter: Andrzej Bialecki 
>Assignee: Andrzej Bialecki 
> Fix For: 2.0
>
> Attachments: NUTCH-884.patch, NUTCH-884.patch
>
>
> FetcherJob now performs fetching in the reduce phase. This means that in a 
> typical Hadoop setup there will be many fewer reduce tasks than map tasks, 
> and consequently the max. total throughput of Fetcher will be proportionally 
> reduced. I propose that FetcherJob should set the number of reduce tasks to 
> the number of map tasks. This way the fetching will be more granular.




[jira] Commented: (NUTCH-881) Good quality documentation for Nutch

2010-08-18 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899815#action_12899815
 ] 

Andrzej Bialecki  commented on NUTCH-881:
-

bq. So what is new in Nutch 2.0 which doesn't appear in Nutch 1.x ? Gora is the 
main thing which comes to mind.

Yes. We also removed all search-related code from Nutch and rely exclusively on 
Solr to perform searching. This means that some APIs have been removed (e.g. 
query filters, text analysis, lucene indexing backend).

bq. How do the config files differ?

We still use the same nutch-default/nutch-site.xml, plus per-plugin config 
files. Some properties have changed, e.g. the ones that limit the max. number 
of URLs per host in the generator. We added some Gora-related files, gora.properties and 
gora-*-mapping.xml, that define what driver to use and how to map webtable 
columns onto storage-specific columns/fields.

bq. How does Nutch's use of Hadoop differ?

All jobs now use GoraInputFormat / GoraOutputFormat, which hide the details 
of the actual data storage backend.

bq. How do the command lines differ? (Presumably you need different command 
lines to say where to store the crawldb, right?)

Yes. Actually, this could be a separate issue to be solved - currently we 
assume there is one Nutch webtable per storage backend, so we don't specify the 
"db identifier" anywhere... but this prevents us from defining multiple crawl 
configs that use the same backend, so it should be addressed.

> Good quality documentation for Nutch
> 
>
> Key: NUTCH-881
> URL: https://issues.apache.org/jira/browse/NUTCH-881
> Project: Nutch
>  Issue Type: Improvement
>  Components: documentation
>Affects Versions: 2.0
>Reporter: Andrzej Bialecki 
>
> This is, and has been, a long-standing request from Nutch users. It becomes 
> an acute need as we redesign Nutch 2.0, because the collective knowledge and 
> the Wiki will no longer be useful without a massive amount of editing.
> IMHO the reference documentation should be in SVN, and not on the Wiki - the 
> Wiki is good for casual information and recipes but I think it's too messy 
> and not reliable enough as a reference.
> I propose to start with the following:
>  1. let's decide on the format of the docs. Each format has its own pros and 
> cons:
>   * HTML: easy to work with, but formatting may be messy unless we edit it by 
> hand, at which point it's no longer so easy... Good toolchains to convert to 
> other formats, but limited expressiveness of larger structures (e.g. book, 
> chapters, TOC, multi-column layouts, etc).
>   * Docbook: learning curve is higher, but not insurmountable... Naturally 
> yields very good structure. Figures/diagrams may be problematic - different 
> renderers (html, pdf) like to treat the scaling and placing somewhat 
> differently.
>   * Wiki-style (Confluence or TWiki): easy to use, but limited control over 
> larger structures. Maven Doxia can format cwiki, twiki, and a host of other 
> formats to e.g. html and pdf.
>   * other?
>  2. start documenting the main tools and the main APIs (e.g. the plugins and 
> all the extension points). We can of course reuse material from the Wiki and 
> from various presentations (e.g. the ApacheCon slides).




[jira] Commented: (NUTCH-882) Design a Host table in GORA

2010-08-18 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899810#action_12899810
 ] 

Andrzej Bialecki  commented on NUTCH-882:
-

This functionality is very useful for larger crawls. Some comments about the 
design:

* the table can be populated by injection, as in the patch, or from webtable. 
Since keys are from different spaces (url-s vs. hosts) I think it would be very 
tricky to try to do this on the fly in one of the existing jobs... so this 
means an additional step in the workflow.

* I'm worried about the scalability of the approach taken by HostMDApplierJob - 
per-host data will be multiplied by the number of urls from a host and put into 
webtable, which will in turn balloon the size of webtable...

A little background: what we see here is a design issue typical for mapreduce, 
where you have to merge data keyed in different key spaces (with 
different granularity). Possible solutions include:
* first converting the data to a common key space and then submitting both 
datasets as mapreduce inputs, or
* submitting only the finer-grained input to mapreduce and dynamically 
converting the keys on the fly (and reading data directly from the 
coarser-grained source, accessing it randomly).

A similar situation is described in HADOOP-3063 together with a solution, 
namely, to use random access and use Bloom filters to quickly discover missing 
keys.

So I propose that instead of statically merging the data (HostMDApplierJob) we 
could merge it dynamically on the fly, by implementing a high-performance 
reader of host table, and then use this reader directly in the context of 
map()/reduce() tasks as needed. This reader should use a Bloom filter to 
quickly determine nonexistent keys, and it may use a limited amount of 
in-memory cache for existing records. The bloom filter data should be 
re-computed on updates and stored/retrieved, to avoid lengthy initialization.

The cost of using this approach is IMHO much smaller than the cost of 
statically joining this data. The static join costs both space and time to 
execute an additional job. Let's consider the dynamic join cost, e.g. in 
Fetcher - HostDBReader would be used only when initializing host queues, so the 
number of IO-s would be at most the number of unique hosts on the fetchlist (at 
most, because some of host data may be missing - here's Bloom filter to the 
rescue to quickly discover this without doing any IO). During updatedb we would 
likely want to access this data in DbUpdateReducer. Keys are URLs here, and 
they are ordered in ascending order - but they are in host-reversed format, 
which means that URLs from similar hosts and domains are close together. This 
is beneficial, because when we read data from HostDBReader we will read records 
that are close together, thus avoiding seeks. We can also cache the retrieved 
per-host data in DbUpdateReducer.
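The Bloom-filter-guarded lookup proposed above can be sketched in a few lines. This is an illustrative toy, not the actual HostDBReader: the BloomFilter hash scheme and the HashMap standing in for the host table are hypothetical stand-ins, and a real reader would do random IO against the store plus an LRU cache.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Toy Bloom filter: k hash probes into a bit array. It never reports a
// false negative, and false positives can be made rare by sizing the array.
class BloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    BloomFilter(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    private int probe(String key, int seed) {
        int h = seed * 0x9e3779b9;   // cheap seeded string hash
        for (int i = 0; i < key.length(); i++) h = h * 31 + key.charAt(i);
        return Math.floorMod(h, size);
    }

    void add(String key) {
        for (int i = 1; i <= hashes; i++) bits.set(probe(key, i));
    }

    boolean mightContain(String key) {
        for (int i = 1; i <= hashes; i++) {
            if (!bits.get(probe(key, i))) return false;  // definitely absent
        }
        return true;  // present, or a (rare) false positive
    }
}

// Hypothetical host-table reader: consult the filter first, and touch the
// (expensive, random-access) store only when the filter says "maybe".
class HostTableReader {
    private final Map<String, String> store = new HashMap<>();
    private final BloomFilter filter = new BloomFilter(1 << 16, 3);
    int storeReads = 0;  // how many lookups actually reached the store

    void put(String host, String metadata) {
        store.put(host, metadata);
        filter.add(host);   // in the proposal this is recomputed on updates
    }

    String get(String host) {
        if (!filter.mightContain(host)) return null;  // no IO at all
        storeReads++;
        return store.get(host);
    }
}
```

With keys in host-reversed order, as during updatedb, consecutive lookups hit nearby records, so putting a small in-memory cache in front of get() would further cut seeks, as suggested above.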

> Design a Host table in GORA
> ---
>
> Key: NUTCH-882
> URL: https://issues.apache.org/jira/browse/NUTCH-882
> Project: Nutch
>  Issue Type: New Feature
>Affects Versions: 2.0
>Reporter: Julien Nioche
>Assignee: Julien Nioche
> Fix For: 2.0
>
> Attachments: NUTCH-882-v1.patch
>
>
> Having a separate GORA table for storing information about hosts (and 
> domains?) would be very useful for : 
> * customising the behaviour of the fetching on a host basis e.g. number of 
> threads, min time between threads etc...
> * storing stats
> * keeping metadata and possibly propagate them to the webpages 
> * keeping a copy of the robots.txt and possibly use that later to filter the 
> webtable
> * store sitemaps files and update the webtable accordingly
> I'll try to come up with a GORA schema for such a host table but any comments 
> are of course already welcome 




[jira] Commented: (NUTCH-887) Delegate parsing of feeds to Tika

2010-08-15 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898667#action_12898667
 ] 

Andrzej Bialecki  commented on NUTCH-887:
-

bq. Huh, what do you mean? Nick just added a bunch of code to handle Compound 
document detection, and parsing

Ah, good - I missed that, I need to take a closer look at this...

bq. I'm starting to feel the creep of parsing plugins make their way back into 
Nutch instead of just jumping over into Tika

The "creep" so far is just parse-html, which we were forced to add back because 
Tika HTML parsing was totally inadequate for our needs. I know there has been 
some progress on this front, but I suspect it's still not sufficient. The 
ultimate goal is still to use Tika for all formats that it can handle, 
preferably "all formats" without further qualifiers ;)

> Delegate parsing of feeds to Tika
> -
>
> Key: NUTCH-887
> URL: https://issues.apache.org/jira/browse/NUTCH-887
> Project: Nutch
>  Issue Type: Wish
>  Components: parser
>Affects Versions: 2.0
>Reporter: Julien Nioche
> Fix For: 2.0
>
>
> [Starting a new thread from https://issues.apache.org/jira/browse/NUTCH-874]
> One of the plugins which hasn't been ported yet is the feed parser. We could 
> rely on the one we recently added to Tika, knowing that there is a 
> substantial difference in the sense that the Tika feed parser generates a 
> simple XHTML representation of the document where the feeds are simply 
> represented as anchors whereas the Nutch version created new documents for 
> each feed.
> There is also the parse-rss plugin in Nutch which is quite similar - what's 
> the difference with the feed one again? Since the Tika parser would handle 
> all sorts of feed formats why not simply rely on it? 
> Any thoughts on this?




[jira] Commented: (NUTCH-887) Delegate parsing of feeds to Tika

2010-08-14 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898647#action_12898647
 ] 

Andrzej Bialecki  commented on NUTCH-887:
-

bq. If there's something missing that Nutch needs, we'll add it to Tika and 
roll it into 0.8.

There is something missing in Tika, and it's the support for compound 
documents, but it's not likely to be added in 0.8... not that we have such 
support in Nutch at the moment - it fell victim to the trunk/nutchbase switch, 
but it should be added back soon. I'd keep the "feed" plugin around for a while 
still, as an interim solution until Tika supports compound documents. +1 to 
getting rid of parse-rss.

> Delegate parsing of feeds to Tika
> -
>
> Key: NUTCH-887
> URL: https://issues.apache.org/jira/browse/NUTCH-887
> Project: Nutch
>  Issue Type: Wish
>  Components: parser
>Affects Versions: 2.0
>Reporter: Julien Nioche
> Fix For: 2.0
>
>
> [Starting a new thread from https://issues.apache.org/jira/browse/NUTCH-874]
> One of the plugins which hasn't been ported yet is the feed parser. We could 
> rely on the one we recently added to Tika, knowing that there is a 
> substantial difference in the sense that the Tika feed parser generates a 
> simple XHTML representation of the document where the feeds are simply 
> represented as anchors whereas the Nutch version created new documents for 
> each feed.
> There is also the parse-rss plugin in Nutch which is quite similar - what's 
> the difference with the feed one again? Since the Tika parser would handle 
> all sorts of feed formats why not simply rely on it? 
> Any thoughts on this?




[jira] Commented: (NUTCH-842) AutoGenerate WebPage code

2010-08-13 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12898196#action_12898196
 ] 

Andrzej Bialecki  commented on NUTCH-842:
-

I think we can add it as a dependency of the compile target, but then we need to 
first check whether there's any need to generate, i.e. compare the timestamps of 
WebPage.avsc and WebPage.java and generate only when the .avsc is more recent.

Perhaps we should check in generated sources anyway, because if we don't do it 
then it will be difficult to compile in IDE-s without running Ant first... 
though to tell the truth it got difficult already when we switched to ivy. 
Perhaps we need a separate "eclipse" target to resolve dependencies, run gora 
compiler and generate Eclipse project files?
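The timestamp check could be wired into the build along these lines, using Ant's built-in uptodate task. The target and property names here are illustrative, not taken from the actual build.xml; the schema and class paths follow the issue description:

```xml
<!-- Regenerate WebPage.java only when the Avro schema is newer.
     Target and property names are hypothetical. -->
<target name="check-gora">
  <uptodate property="webpage.uptodate"
            srcfile="src/gora/webpage.avsc"
            targetfile="src/java/org/apache/nutch/storage/WebPage.java"/>
</target>

<target name="generate-gora" depends="check-gora" unless="webpage.uptodate">
  <!-- run the Gora compiler here, e.g. via a <java> task -->
</target>

<target name="compile" depends="generate-gora">
  <!-- existing javac steps -->
</target>
```

With this shape, "ant compile" regenerates only on schema changes, while checked-in generated sources would still keep IDE builds working without Ant.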

> AutoGenerate WebPage code
> -
>
> Key: NUTCH-842
> URL: https://issues.apache.org/jira/browse/NUTCH-842
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Doğacan Güney
>Assignee: Doğacan Güney
> Fix For: 2.0
>
> Attachments: NUTCH-842.patch
>
>
> This issue will track the addition of an ant task that will automatically 
> generate o.a.n.storage.WebPage (and ProtocolStatus and ParseStatus) from 
> src/gora/webpage.avsc.




[jira] Commented: (NUTCH-886) A .gitignore file for Nutch

2010-08-12 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897638#action_12897638
 ] 

Andrzej Bialecki  commented on NUTCH-886:
-

+1.

> A .gitignore file for Nutch
> ---
>
> Key: NUTCH-886
> URL: https://issues.apache.org/jira/browse/NUTCH-886
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 2.0
>Reporter: Doğacan Güney
>Assignee: Doğacan Güney
>Priority: Trivial
>
> We need a .gitignore file under nutch/ so git does not try to track many 
> unnecessary files.
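Such a file might look like the sketch below; the entries are assumptions based on a typical Ant/Ivy build layout, not the actual list committed:

```
# build outputs and Ivy-resolved jars (directory names assumed)
build/
logs/

# per-developer IDE and editor files
.classpath
.project
*~
```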




[jira] Commented: (NUTCH-884) FetcherJob should run more reduce tasks than default

2010-08-11 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897331#action_12897331
 ] 

Andrzej Bialecki  commented on NUTCH-884:
-

Ok, I'll clarify the help message.

bq. If I understood the code correctly, I think this part should be -all and 
not -threads:

Heh, yes of course. Thanks!

> FetcherJob should run more reduce tasks than default
> 
>
> Key: NUTCH-884
> URL: https://issues.apache.org/jira/browse/NUTCH-884
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 2.0
>Reporter: Andrzej Bialecki 
>Assignee: Andrzej Bialecki 
> Fix For: 2.0
>
> Attachments: NUTCH-884.patch
>
>
> FetcherJob now performs fetching in the reduce phase. This means that in a 
> typical Hadoop setup there will be many fewer reduce tasks than map tasks, 
> and consequently the max. total throughput of Fetcher will be proportionally 
> reduced. I propose that FetcherJob should set the number of reduce tasks to 
> the number of map tasks. This way the fetching will be more granular.




[jira] Updated: (NUTCH-884) FetcherJob should run more reduce tasks than default

2010-08-11 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-884:


Attachment: NUTCH-884.patch

Patch with the change. I also rearranged the arguments to FetcherJob.fetch(..) 
to make more sense (IMHO).

> FetcherJob should run more reduce tasks than default
> 
>
> Key: NUTCH-884
> URL: https://issues.apache.org/jira/browse/NUTCH-884
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 2.0
>Reporter: Andrzej Bialecki 
>Assignee: Andrzej Bialecki 
> Fix For: 2.0
>
> Attachments: NUTCH-884.patch
>
>
> FetcherJob now performs fetching in the reduce phase. This means that in a 
> typical Hadoop setup there will be many fewer reduce tasks than map tasks, 
> and consequently the max. total throughput of Fetcher will be proportionally 
> reduced. I propose that FetcherJob should set the number of reduce tasks to 
> the number of map tasks. This way the fetching will be more granular.




[jira] Resolved: (NUTCH-872) Change the default fetcher.parse to FALSE

2010-08-11 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  resolved NUTCH-872.
-

Fix Version/s: 2.0
   Resolution: Fixed

I changed the name of the option to "-parse" to be consistent with the 
nutch-default.xml naming. I also updated the API to use this name; it's less 
confusing this way.

Committed in rev. 984401. Thanks for the feedback.

> Change the default fetcher.parse to FALSE
> -
>
> Key: NUTCH-872
> URL: https://issues.apache.org/jira/browse/NUTCH-872
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.2, 2.0
>Reporter: Andrzej Bialecki 
>Assignee: Andrzej Bialecki 
> Fix For: 2.0
>
>
> I propose to change this property to false. The reason is that it's a safer 
> default - parsing issues don't lead to a loss of the downloaded content. For 
> larger crawls this is the recommended way to run Fetcher. Users that run 
> smaller crawls can still override it.
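With the new default, users running smaller crawls can still turn parsing during fetch back on in nutch-site.xml (the property name comes from the issue; the snippet is a standard Hadoop-style configuration override):

```xml
<!-- nutch-site.xml: re-enable parsing inside the fetcher for small crawls -->
<property>
  <name>fetcher.parse</name>
  <value>true</value>
</property>
```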




[jira] Created: (NUTCH-884) FetcherJob should run more reduce tasks than default

2010-08-11 Thread Andrzej Bialecki (JIRA)
FetcherJob should run more reduce tasks than default


 Key: NUTCH-884
 URL: https://issues.apache.org/jira/browse/NUTCH-884
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 
 Fix For: 2.0


FetcherJob now performs fetching in the reduce phase. This means that in a 
typical Hadoop setup there will be many fewer reduce tasks than map tasks, and 
consequently the max. total throughput of Fetcher will be proportionally 
reduced. I propose that FetcherJob should set the number of reduce tasks to the 
number of map tasks. This way the fetching will be more granular.




[jira] Created: (NUTCH-881) Good quality documentation for Nutch

2010-08-11 Thread Andrzej Bialecki (JIRA)
Good quality documentation for Nutch


 Key: NUTCH-881
 URL: https://issues.apache.org/jira/browse/NUTCH-881
 Project: Nutch
  Issue Type: Improvement
  Components: documentation
Affects Versions: 2.0
Reporter: Andrzej Bialecki 


This is, and has been, a long-standing request from Nutch users. It becomes 
an acute need as we redesign Nutch 2.0, because the collective knowledge and 
the Wiki will no longer be useful without a massive amount of editing.

IMHO the reference documentation should be in SVN, and not on the Wiki - the 
Wiki is good for casual information and recipes but I think it's too messy and 
not reliable enough as a reference.

I propose to start with the following:

 1. let's decide on the format of the docs. Each format has its own pros and 
cons:
  * HTML: easy to work with, but formatting may be messy unless we edit it by 
hand, at which point it's no longer so easy... Good toolchains to convert to 
other formats, but limited expressiveness of larger structures (e.g. book, 
chapters, TOC, multi-column layouts, etc).
  * Docbook: learning curve is higher, but not insurmountable... Naturally 
yields very good structure. Figures/diagrams may be problematic - different 
renderers (html, pdf) like to treat the scaling and placing somewhat 
differently.
  * Wiki-style (Confluence or TWiki): easy to use, but limited control over 
larger structures. Maven Doxia can format cwiki, twiki, and a host of other 
formats to e.g. html and pdf.
  * other?

 2. start documenting the main tools and the main APIs (e.g. the plugins and 
all the extension points). We can of course reuse material from the Wiki and 
from various presentations (e.g. the ApacheCon slides).




[jira] Updated: (NUTCH-880) REST API (and webapp) for Nutch

2010-08-11 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-880:


Description: 
This issue is for discussing a REST-style API for accessing Nutch.

Here's an initial idea:

* I propose to use org.restlet for handling requests and returning 
JSON/XML/whatever responses.
* hook up all regular tools so that they can be driven via this API. This would 
have to be an async API, since all Nutch operations take a long time to execute. 
It follows that we also need to be able to list running operations, 
retrieve their current status, and possibly 
abort/cancel/stop/suspend/resume/...? This also means that we would have to 
potentially create & manage many threads in a servlet - AFAIK this is frowned 
upon by J2EE purists...
* package this in a webapp (that includes all deps, essentially nutch.job 
content), with the restlet servlet as an entry point.

Open issues:

* how to implement the reading of crawl results via this API
* should we manage only crawls that use a single configuration per webapp, or 
should we have a notion of crawl contexts (sets of crawl configs) with CRUD ops 
on them? this would be nice, because it would allow managing of several 
different crawls, with different configs, in a single webapp - but it 
complicates the implementation a lot.

  was:
This issue is for discussing a REST-style API for accessing Nutch.

Here's an initial idea:

* I propose to use org.restlet for handling JSON requests
* hook up all regular tools so that they can be driven via this API. This would 
have to be an async API, since all Nutch operations take a long time to execute. 
It follows that we also need to be able to list running operations, 
retrieve their current status, and possibly 
abort/cancel/stop/suspend/resume/...? This also means that we would have to 
potentially create & manage many threads in a servlet - AFAIK this is frowned 
upon by J2EE purists...
* package this in a webapp (that includes all deps, essentially nutch.job 
content), with the restlet servlet as an entry point.

Open issues:

* how to implement the reading of crawl results via this API
* should we manage only crawls that use a single configuration per webapp, or 
should we have a notion of crawl contexts (sets of crawl configs) with CRUD ops 
on them? this would be nice, because it would allow managing of several 
different crawls, with different configs, in a single webapp - but it 
complicates the implementation a lot.


> REST API (and webapp) for Nutch
> ---
>
> Key: NUTCH-880
> URL: https://issues.apache.org/jira/browse/NUTCH-880
> Project: Nutch
>  Issue Type: New Feature
>Affects Versions: 2.0
>Reporter: Andrzej Bialecki 
>
> This issue is for discussing a REST-style API for accessing Nutch.
> Here's an initial idea:
> * I propose to use org.restlet for handling requests and returning 
> JSON/XML/whatever responses.
> hook up all regular tools so that they can be driven via this API. This 
> would have to be an async API, since all Nutch operations take a long time to 
> execute. It follows that we also need to be able to list running 
> operations, retrieve their current status, and possibly 
> abort/cancel/stop/suspend/resume/...? This also means that we would have to 
> potentially create & manage many threads in a servlet - AFAIK this is frowned 
> upon by J2EE purists...
> * package this in a webapp (that includes all deps, essentially nutch.job 
> content), with the restlet servlet as an entry point.
> Open issues:
> * how to implement the reading of crawl results via this API
> * should we manage only crawls that use a single configuration per webapp, or 
> should we have a notion of crawl contexts (sets of crawl configs) with CRUD 
> ops on them? this would be nice, because it would allow managing of several 
> different crawls, with different configs, in a single webapp - but it 
> complicates the implementation a lot.




[jira] Created: (NUTCH-880) REST API (and webapp) for Nutch

2010-08-11 Thread Andrzej Bialecki (JIRA)
REST API (and webapp) for Nutch
---

 Key: NUTCH-880
 URL: https://issues.apache.org/jira/browse/NUTCH-880
 Project: Nutch
  Issue Type: New Feature
Affects Versions: 2.0
Reporter: Andrzej Bialecki 


This issue is for discussing a REST-style API for accessing Nutch.

Here's an initial idea:

* I propose to use org.restlet for handling JSON requests
* hook up all regular tools so that they can be driven via this API. This would 
have to be an async API, since all Nutch operations take a long time to execute. 
It follows that we also need to be able to list running operations, 
retrieve their current status, and possibly 
abort/cancel/stop/suspend/resume/...? This also means that we would have to 
potentially create & manage many threads in a servlet - AFAIK this is frowned 
upon by J2EE purists...
* package this in a webapp (that includes all deps, essentially nutch.job 
content), with the restlet servlet as an entry point.

Open issues:

* how to implement the reading of crawl results via this API
* should we manage only crawls that use a single configuration per webapp, or 
should we have a notion of crawl contexts (sets of crawl configs) with CRUD ops 
on them? this would be nice, because it would allow managing of several 
different crawls, with different configs, in a single webapp - but it 
complicates the implementation a lot.
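The async-API requirement above — submit a long-running operation, list jobs, poll status, cancel — can be sketched with a plain ExecutorService. JobManager and its method names are hypothetical illustrations, not part of any Nutch or restlet API:

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Minimal async job registry: each long-running operation is submitted
// here, and the REST layer polls it by id.
class JobManager {
    private final ExecutorService pool = Executors.newCachedThreadPool();
    private final Map<String, Future<?>> jobs = new ConcurrentHashMap<>();

    /** Start a job and return an id the client can poll. */
    String submit(Runnable op) {
        String id = UUID.randomUUID().toString();
        jobs.put(id, pool.submit(op));
        return id;
    }

    /** Status for the REST layer: UNKNOWN, CANCELLED, DONE or RUNNING. */
    String status(String id) {
        Future<?> f = jobs.get(id);
        if (f == null) return "UNKNOWN";
        if (f.isCancelled()) return "CANCELLED";
        return f.isDone() ? "DONE" : "RUNNING";
    }

    /** Best-effort cancel, as the abort/cancel/stop bullet suggests. */
    boolean cancel(String id) {
        Future<?> f = jobs.get(id);
        return f != null && f.cancel(true);
    }

    /** Block until the job finishes (convenience helper). */
    void waitFor(String id) {
        Future<?> f = jobs.get(id);
        if (f == null) return;
        try { f.get(); } catch (Exception ignored) { }
    }

    /** Stop accepting jobs and let idle worker threads exit. */
    void shutdown() {
        pool.shutdown();
    }
}
```

Listing running operations would just iterate the jobs map; the restlet servlet mentioned above would map each of these methods onto a resource. Managing worker threads inside a servlet container is exactly the part J2EE purists frown upon.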




[jira] Commented: (NUTCH-879) URL-s getting lost

2010-08-11 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897182#action_12897182
 ] 

Andrzej Bialecki  commented on NUTCH-879:
-

I haven't tried hbase yet, but I'm going to - will update this issue soon.

> URL-s getting lost
> --
>
> Key: NUTCH-879
> URL: https://issues.apache.org/jira/browse/NUTCH-879
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 2.0
> Environment: * Ubuntu 10.4 x64, Sun JDK 1.6
> * using 1-node Hadoop + HDFS
> * trunk r983472, using MySQL store
> * branch-1.3
>Reporter: Andrzej Bialecki 
> Attachments: branch-1.3-bench.txt, trunk-bench.txt
>
>
> I ran the Benchmark using branch-1.3 and trunk (formerly nutchbase). With the 
> same Benchmark parameters and the same plugins branch-1.3 collects ~1.5mln 
> urls, while trunk collects ~20,000 urls. Clearly something is wrong.




[jira] Resolved: (NUTCH-876) Remove remaining robots/IP blocking code in lib-http

2010-08-11 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  resolved NUTCH-876.
-

Fix Version/s: 2.0
   Resolution: Fixed

Committed in rev. 984337.

> Remove remaining robots/IP blocking code in lib-http
> 
>
> Key: NUTCH-876
> URL: https://issues.apache.org/jira/browse/NUTCH-876
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 2.0
>Reporter: Andrzej Bialecki 
>Assignee: Andrzej Bialecki 
> Fix For: 2.0
>
> Attachments: NUTCH-876.patch
>
>
> There are remains of the (very old) blocking code in 
> lib-http/.../HttpBase.java. This code was used with the OldFetcher to manage 
> politeness limits. New trunk doesn't have OldFetcher anymore, so this code is 
> useless. Furthermore, there is an actual bug here - FetcherJob forgets to set 
> Protocol.CHECK_BLOCKING and Protocol.CHECK_ROBOTS to false, and the defaults 
> in lib-http are set to true.




[jira] Updated: (NUTCH-879) URL-s getting lost

2010-08-10 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-879:


Attachment: branch-1.3-bench.txt
trunk-bench.txt

DB stats and benchmark results.

> URL-s getting lost
> --
>
> Key: NUTCH-879
> URL: https://issues.apache.org/jira/browse/NUTCH-879
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 2.0
> Environment: * Ubuntu 10.4 x64, Sun JDK 1.6
> * using 1-node Hadoop + HDFS
> * trunk r983472, using MySQL store
> * branch-1.3
>Reporter: Andrzej Bialecki 
> Attachments: branch-1.3-bench.txt, trunk-bench.txt
>
>
> I ran the Benchmark using branch-1.3 and trunk (formerly nutchbase). With the 
> same Benchmark parameters and the same plugins branch-1.3 collects ~1.5mln 
> urls, while trunk collects ~20,000 urls. Clearly something is wrong.




[jira] Created: (NUTCH-879) URL-s getting lost

2010-08-10 Thread Andrzej Bialecki (JIRA)
URL-s getting lost
--

 Key: NUTCH-879
 URL: https://issues.apache.org/jira/browse/NUTCH-879
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.0
 Environment: * Ubuntu 10.4 x64, Sun JDK 1.6
* using 1-node Hadoop + HDFS
* trunk r983472, using MySQL store
* branch-1.3
Reporter: Andrzej Bialecki 


I ran the Benchmark using branch-1.3 and trunk (formerly nutchbase). With the 
same Benchmark parameters and the same plugins branch-1.3 collects ~1.5mln 
urls, while trunk collects ~20,000 urls. Clearly something is wrong.




[jira] Updated: (NUTCH-876) Remove remaining robots/IP blocking code in lib-http

2010-08-09 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-876:


Attachment: NUTCH-876.patch

Patch to fix the issue. If there are no objections, I'll commit this shortly.

> Remove remaining robots/IP blocking code in lib-http
> 
>
> Key: NUTCH-876
> URL: https://issues.apache.org/jira/browse/NUTCH-876
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 2.0
>Reporter: Andrzej Bialecki 
>Assignee: Andrzej Bialecki 
> Attachments: NUTCH-876.patch
>
>
> There are remnants of the (very old) blocking code in 
> lib-http/.../HttpBase.java. This code was used by the OldFetcher to manage 
> politeness limits. The new trunk no longer has OldFetcher, so this code is 
> dead. Furthermore, there is an actual bug here: FetcherJob forgets to set 
> Protocol.CHECK_BLOCKING and Protocol.CHECK_ROBOTS to false, while the 
> defaults in lib-http are set to true.




[jira] Created: (NUTCH-876) Remove remaining robots/IP blocking code in lib-http

2010-08-09 Thread Andrzej Bialecki (JIRA)
Remove remaining robots/IP blocking code in lib-http


 Key: NUTCH-876
 URL: https://issues.apache.org/jira/browse/NUTCH-876
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 2.0
Reporter: Andrzej Bialecki 
Assignee: Andrzej Bialecki 


There are remnants of the (very old) blocking code in 
lib-http/.../HttpBase.java. This code was used by the OldFetcher to manage 
politeness limits. The new trunk no longer has OldFetcher, so this code is 
dead. Furthermore, there is an actual bug here: FetcherJob forgets to set 
Protocol.CHECK_BLOCKING and Protocol.CHECK_ROBOTS to false, while the defaults 
in lib-http are set to true.
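The fix the report implies can be sketched as follows. This is a hedged illustration, not the actual FetcherJob code: `Protocol.CHECK_BLOCKING` and `Protocol.CHECK_ROBOTS` are the Nutch constants named above, while the surrounding class and method are hypothetical scaffolding showing where the job setup would flip the lib-http defaults.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.protocol.Protocol;

public class FetcherJobSketch {
  /**
   * Sketch of the missing job setup: lib-http defaults both flags to
   * true, so with OldFetcher gone the job must explicitly disable the
   * now-dead blocking and robots checks before running the fetch.
   */
  public static void configure(Configuration conf) {
    conf.setBoolean(Protocol.CHECK_BLOCKING, false);
    conf.setBoolean(Protocol.CHECK_ROBOTS, false);
  }
}
```

Without these two lines the obsolete checks in HttpBase stay active, which is the behavior the bug report describes.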




[jira] Resolved: (NUTCH-858) No longer able to set per-field boosts on lucene documents

2010-08-06 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  resolved NUTCH-858.
-

Resolution: Fixed

Committed to branch-1.2, revision 982970.

> No longer able to set per-field boosts on lucene documents
> --
>
> Key: NUTCH-858
> URL: https://issues.apache.org/jira/browse/NUTCH-858
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.1
> Environment: n/a
>Reporter: Edward Drapkin
>Assignee: Andrzej Bialecki 
> Fix For: 1.2
>
> Attachments: nutchdoc.patch
>
>
> I'm working on upgrading from Nutch 0.9 to Nutch 1.1, and I've noticed that 
> it no longer seems possible to set boosts on specific fields in Lucene 
> documents. This is, in my opinion, a major feature regression and removes a 
> huge component of fine-tuning search. Can this be added?
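The feature being asked for can be sketched with the Lucene API of that era (Lucene 2.x/3.x, where `Field.setBoost` existed). This is an illustration of per-field boosting in general, not the contents of nutchdoc.patch; the field names and boost value are made up for the example.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class FieldBoostSketch {
  /** Builds a document whose title matches score higher than body matches. */
  public static Document boostedDoc(String title, String body) {
    Document doc = new Document();
    Field titleField =
        new Field("title", title, Field.Store.YES, Field.Index.ANALYZED);
    titleField.setBoost(4.0f); // per-field boost: weight title matches higher
    doc.add(titleField);
    doc.add(new Field("content", body, Field.Store.NO, Field.Index.ANALYZED));
    return doc;
  }
}
```

The regression was that Nutch's indexing layer no longer exposed a way to attach such a boost to an individual field before the document reached Lucene.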




[jira] Updated: (NUTCH-858) No longer able to set per-field boosts on lucene documents

2010-08-05 Thread Andrzej Bialecki (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrzej Bialecki  updated NUTCH-858:


Attachment: nutchdoc.patch

Here's the ported functionality. Please note that the standard plugins still 
don't use per-field boosts.

If there are no objections, I'll commit this shortly.

> No longer able to set per-field boosts on lucene documents
> --
>
> Key: NUTCH-858
> URL: https://issues.apache.org/jira/browse/NUTCH-858
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.1
> Environment: n/a
>Reporter: Edward Drapkin
>Assignee: Andrzej Bialecki 
> Fix For: 1.2
>
> Attachments: nutchdoc.patch
>
>
> I'm working on upgrading from Nutch 0.9 to Nutch 1.1, and I've noticed that 
> it no longer seems possible to set boosts on specific fields in Lucene 
> documents. This is, in my opinion, a major feature regression and removes a 
> huge component of fine-tuning search. Can this be added?



