[jira] Commented: (NUTCH-92) DistributedSearch incorrectly scores results

2005-09-15 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-92?page=comments#action_12329474 ] 

Doug Cutting commented on NUTCH-92:
---

A minor detail:

In Searcher, instead of

  int[] getDocFreqs(Term[]);

the new method will probably have to be something like

  public int[] getDocFreqs(TermSet);

And TermSet can implement Writable, as Nutch can't serialize Lucene Terms.
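For illustration, a minimal sketch of what such a TermSet might look like (the
package and field layout here are assumptions, not the committed class); each
Lucene Term is flattened to its (field, text) string pair so that it can cross
Nutch's RPC boundary:

  import java.io.DataInput;
  import java.io.DataOutput;
  import java.io.IOException;
  import org.apache.nutch.io.Writable;

  public class TermSet implements Writable {
    private String[] fields;   // parallel arrays instead of Lucene Term
    private String[] texts;    // objects, which Nutch cannot serialize

    public TermSet() {}        // no-arg constructor for deserialization

    public TermSet(String[] fields, String[] texts) {
      this.fields = fields;
      this.texts = texts;
    }

    public void write(DataOutput out) throws IOException {
      out.writeInt(fields.length);
      for (int i = 0; i < fields.length; i++) {
        out.writeUTF(fields[i]);   // each term serialized as its
        out.writeUTF(texts[i]);    // (field, text) string pair
      }
    }

    public void readFields(DataInput in) throws IOException {
      int n = in.readInt();
      fields = new String[n];
      texts = new String[n];
      for (int i = 0; i < n; i++) {
        fields[i] = in.readUTF();
        texts[i] = in.readUTF();
      }
    }
  }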



> DistributedSearch incorrectly scores results
> 
>
>  Key: NUTCH-92
>  URL: http://issues.apache.org/jira/browse/NUTCH-92
>  Project: Nutch
> Type: Bug
>   Components: searcher
> Versions: 0.8-dev, 0.7
> Reporter: Andrzej Bialecki 
> Assignee: Andrzej Bialecki 

>
> When running search servers in a distributed setup, using 
> DistributedSearch$Server and Client, total scores are incorrectly calculated. 
> The symptoms are that scores differ depending on how segments are deployed to 
> Servers, i.e. if there is uneven distribution of terms in segment indexes 
> (due to segment size or content differences) then scores will differ 
> depending on how many and which segments are deployed on a particular Server. 
> This may lead to prioritizing of non-relevant results over more relevant ones.
> The underlying reason for this is that each IndexSearcher (which uses local 
> index on each Server) calculates scores based on the local IDFs of query 
> terms, and not the global IDFs from all indexes together. This means that 
> scores arriving from different Servers to the Client cannot be meaningfully 
> compared, unless all indexes have similar distribution of Terms and similar 
> numbers of documents in them. However, currently the Client mixes all scores 
> together, sorts them by absolute values and picks top hits. These absolute 
> values will change if segments are unevenly deployed to Servers.
> Currently the workaround is to deploy the same number of documents in 
> segments per Server, and to ensure that segments contain well-randomized 
> content so that term frequencies for common terms are very similar.
> The solution proposed here (as a result of discussion between ab and cutting, 
> patches are coming) is to calculate global IDFs prior to running the query, 
> and pre-boost query Terms with these global IDFs. This will require one more 
> RPC call per query (this can be optimized later, e.g. through caching). 
> Then the scores will become normalized according to the global IDFs, and 
> Client will be able to meaningfully compare them. Scores will also become 
> independent of the segment content or local number of documents per Server. 
> This will involve at least the following changes:
> * change NutchSimilarity.idf(Term, Searcher) to always return 1.0f. This 
> enables us to manipulate scores independently of local IDFs.
> * add a new method to Searcher interface, int[] getDocFreqs(Term[]), which 
> will return document frequencies for query terms.
> * modify getSegmentNames() so that it also returns the total number of 
> documents in each segment, or implement this as a separate method (this will 
> be called once during segment init)
> * in DistributedSearch$Client.search() first make a call to servers to return 
> local IDFs for the current query, and calculate global IDFs for each relevant 
> Term in that query.
> * multiply the TermQuery boosts by idf(totalDocFreq, totalIndexedDocs), and 
> PhraseQuery boosts by the sum of the idf(totalDocFreqs, totalIndexedDocs) for 
> all of its terms
> This solution should be applicable with only minor changes to all branches, 
> but initially the patches will be relative to trunk/ .
> Comments, suggestions and review are welcome!
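To make the proposed boost arithmetic concrete, here is a rough sketch of the
pre-boosting step (helper and class names are hypothetical; the idf formula is
Lucene's default, ln(numDocs / (docFreq + 1)) + 1):

  import org.apache.lucene.index.Term;
  import org.apache.lucene.search.TermQuery;

  public class GlobalIdf {
    static float idf(int docFreq, int numDocs) {
      return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // sum the per-server counts, then carry the global idf in the boost
    static TermQuery preBoost(Term term, int[] docFreqs, int[] numDocs) {
      int totalDocFreq = 0, totalIndexedDocs = 0;
      for (int i = 0; i < docFreqs.length; i++) {
        totalDocFreq += docFreqs[i];       // global df for this term
        totalIndexedDocs += numDocs[i];    // global number of indexed docs
      }
      TermQuery q = new TermQuery(term);
      // with NutchSimilarity.idf() forced to 1.0f, this boost is what
      // makes scores comparable across Servers
      q.setBoost(idf(totalDocFreq, totalIndexedDocs));
      return q;
    }
  }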

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-92) DistributedSearch incorrectly scores results

2005-09-15 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-92?page=comments#action_12329475 ] 

Doug Cutting commented on NUTCH-92:
---

Otis, I think this was actually committed to Lucene, but the solution isn't 
quite appropriate for Nutch, which does not use Lucene's RMI-based 
RemoteSearchable, but instead has its own, leaner RPC mechanism.




[jira] Commented: (NUTCH-95) DeleteDuplicates depends on the order of input segments

2005-09-20 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-95?page=comments#action_12330057 ] 

Doug Cutting commented on NUTCH-95:
---

A simpler fix would be to simply sort the list of segment names, since segment 
names are dates.  This is imperfect when, e.g., one merges segments, but it 
is very simple!

> DeleteDuplicates depends on the order of input segments
> ---
>
>  Key: NUTCH-95
>  URL: http://issues.apache.org/jira/browse/NUTCH-95
>  Project: Nutch
> Type: Bug
>   Components: indexer
> Versions: 0.7, 0.8-dev, 0.6
> Reporter: Andrzej Bialecki 
> Assignee: Andrzej Bialecki 

>
> DeleteDuplicates depends on the order in which the input segments are 
> processed, which in turn depends on the order of segment dirs returned from 
> NutchFileSystem.listFiles(File). In most cases this is undesired and may lead 
> to deleting the wrong records from indexes. The silent assumption that 
> segments at the end of the listing are more recent is not always true.
> Here's the explanation:
> * Dedup first deletes the URL duplicates by computing MD5 hashes for each 
> URL, and then sorting all records by (hash, segmentIdx, docIdx). SegmentIdx 
> is just an int index to the array of open IndexReaders - and if segment dirs 
> are moved/copied/renamed then entries in that array may change their  order. 
> And then for all equal triples Dedup keeps just the first entry. Naturally, 
> if segmentIdx is changed due to dir renaming, a different record will be kept 
> and different ones will be deleted...
> * then Dedup deletes content duplicates, again by computing hashes for each 
> content, and then sorting records by (hash, segmentIdx, docIdx). However, by 
> now we already have a different set of undeleted docs depending on the order 
> of input segments. On top of that, the same factor acts here, i.e. segmentIdx 
> changes when you re-shuffle the input segment dirs - so again, when identical 
> entries are compared the one with the lowest (segmentIdx, docIdx) is picked.
> Solution: use the fetched date from the first record in each segment to 
> determine the order of segments. Alternatively, modify DeleteDuplicates to 
> use the newer algorithm from SegmentMergeTool. This algorithm works by 
> sorting records using tuples of (urlHash, contentHash, fetchDate, score, 
> urlLength). Then:
> 1. If urlHash is the same, keep the doc with the highest fetchDate  (the 
> latest version, as recorded by Fetcher).
> 2. If contentHash is the same, keep the doc with the highest score, and then 
> if the scores are the same, keep the doc with the shortest url.
> Initial fix will be prepared for the trunk/ and then backported to the 
> release branch.
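For illustration, the tuple ordering described above might look like this (the
Rec class and its fields are hypothetical stand-ins for the real records):

  import java.util.Comparator;

  class Rec {
    String urlHash, contentHash;
    long fetchDate;
    float score;
    int urlLength;
  }

  class DedupOrder implements Comparator {
    public int compare(Object a, Object b) {
      Rec x = (Rec) a, y = (Rec) b;
      int c = x.urlHash.compareTo(y.urlHash);
      if (c != 0) return c;
      c = x.contentHash.compareTo(y.contentHash);
      if (c != 0) return c;
      // newest fetch first, so the keeper sorts to the front of its group
      if (x.fetchDate != y.fetchDate) return x.fetchDate > y.fetchDate ? -1 : 1;
      // then highest score first, then shortest URL first
      if (x.score != y.score) return x.score > y.score ? -1 : 1;
      return x.urlLength - y.urlLength;
    }
  }

After sorting with this comparator, the first record in each urlHash (and then
contentHash) group is kept and the rest are deleted.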




[jira] Resolved: (NUTCH-93) DF error on long filesystem name

2005-09-20 Thread Doug Cutting (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-93?page=all ]
 
Doug Cutting resolved NUTCH-93:
---

Resolution: Fixed

Fixed in the mapred branch.

> DF error on long filesystem name
> 
>
>  Key: NUTCH-93
>  URL: http://issues.apache.org/jira/browse/NUTCH-93
>  Project: Nutch
> Type: Bug
> Versions: 0.7
>  Environment: CentOS4.1 (like RedhatEnterprise4)
> Reporter: Shuji Umino
> Priority: Minor

>
> A java.util.NoSuchElementException happened on datanode startup.
> My system uses LVM; the default installed filesystem name is 'VolGroup00-LogVol00',
> which is split across two lines in the output of the 'df' command.
> ---
> #df -k
> /dev/mapper/VolGroup00-LogVol00
>  152559732   1279408 143530692   1% /  
> <---  return next line
> /dev/sda2                 101105      9098     86786  10% /boot
> none                      257352         0    257352   0% /dev/shm
> ---
> [org.apache.nutch.ndfs.DF] fixed source 
> StringTokenizer tokens =
>   new StringTokenizer(lines.readLine(), " \t\n\r\f%");
> 
> this.filesystem = tokens.nextToken();   
> if (!tokens.hasMoreTokens()) {
>   //for long filesystem name
>   tokens = new StringTokenizer(lines.readLine(), " \t\n\r\f%");
> }




[jira] Closed: (NUTCH-93) DF error on long filesystem name

2005-09-20 Thread Doug Cutting (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-93?page=all ]
 
Doug Cutting closed NUTCH-93:
-





[jira] Commented: (NUTCH-98) RobotRulesParser interprets robots.txt incorrectly

2005-09-29 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-98?page=comments#action_12330858 ] 

Doug Cutting commented on NUTCH-98:
---

Where is there a specification of robots.txt that defines how 'allow' and 
'disallow' lines interact?  I can't even find anything that specifies the 
semantics of 'allow' lines at all!

> RobotRulesParser interprets robots.txt incorrectly
> --
>
>  Key: NUTCH-98
>  URL: http://issues.apache.org/jira/browse/NUTCH-98
>  Project: Nutch
> Type: Bug
>   Components: fetcher
> Versions: 0.7
> Reporter: Jeff Bowden
> Priority: Minor
>  Attachments: RobotRulesParser.java.diff
>
> Here's a simple example that the current RobotRulesParser gets wrong:
> User-agent: *
> Disallow: /
> Allow: /rss
> The problem is that the isAllowed function takes the first rule that matches 
> and incorrectly decides that URLs starting with "/rss" are Disallowed.  The 
> correct algorithm is to take the *longest* rule that matches.  I will attach 
> a patch that fixes this.
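For reference, a toy sketch of longest-match selection against exactly the
rules above (the rule storage is invented; robots.txt parsing is omitted):

  public class RobotRules {
    String[] prefixes = { "/", "/rss" };       // Disallow: /   Allow: /rss
    boolean[] allowed  = { false, true };

    public boolean isAllowed(String path) {
      int bestLen = -1;
      boolean best = true;                     // no matching rule: allowed
      for (int i = 0; i < prefixes.length; i++) {
        if (path.startsWith(prefixes[i]) && prefixes[i].length() > bestLen) {
          bestLen = prefixes[i].length();      // the longest match wins,
          best = allowed[i];                   // not the first match
        }
      }
      return best;
    }
  }

Here isAllowed("/rss/feed") returns true while isAllowed("/private") returns
false, which is the intended reading of the example.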




[jira] Commented: (NUTCH-99) ports are hardcoded or random

2005-10-03 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-99?page=comments#action_12331220 ] 

Doug Cutting commented on NUTCH-99:
---

I like the cleanup of the port numbers.  And removing the use of random port 
numbers may make some network administrators happy.  But switching from random 
to fixed ports in the TaskTracker means that only a single task tracker can be 
run at a time.  Currently I frequently find it useful to debug things by 
running multiple task trackers on a single box.

So we need to either loop, trying a range of port numbers, or switch back to 
random allocation, or both (since random allocations may collide).

According to the IANA, we should be able to randomly allocate ports in 
49152-65535.  But that could still upset folks who wish to set up 
restrictive firewalls.




> ports are hardcoded or random
> -
>
>  Key: NUTCH-99
>  URL: http://issues.apache.org/jira/browse/NUTCH-99
>  Project: Nutch
> Type: Bug
> Versions: 0.8-dev
> Reporter: Stefan Groschupf
> Priority: Critical
>  Fix For: 0.8-dev
>  Attachments: port_patch.txt, port_patch_02.txt
>
> Ports of the tasktracker are random and the port of the datanode is hardcoded 
> to 7000 as the starting port.




[jira] Commented: (NUTCH-99) ports are hardcoded or random

2005-10-03 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-99?page=comments#action_12331225 ] 

Doug Cutting commented on NUTCH-99:
---

What command line would you add this to?  I think this should simply start at 
the default port (e.g., 7030) and loop trying port+1 until BindException is not 
thrown.  A message should be logged for each failure.
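A minimal sketch of that loop (socket type and logging simplified; this is
not the actual TaskTracker code):

  import java.io.IOException;
  import java.net.BindException;
  import java.net.ServerSocket;

  public class PortProbe {
    public static ServerSocket bindFrom(int startPort) throws IOException {
      int port = startPort;                    // e.g. 7030, the default
      while (true) {
        try {
          return new ServerSocket(port);
        } catch (BindException e) {
          System.err.println("Port " + port + " busy, trying " + (port + 1));
          port++;                              // loop on port+1 until free
        }
      }
    }
  }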




[jira] Commented: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

2005-10-11 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-109?page=comments#action_12331847 ] 

Doug Cutting commented on NUTCH-109:


Is your HTTP client polite?  Does it only have a single connection open to the 
server at a time, and does it pause fetcher.server.delay between each request?  
It looks as though you are permitting three simultaneous requests, and I can 
see no delays.

How did you configure protocol-http and protocol-httpclient?  One can configure 
these to use multiple connections per server by increasing 
fetcher.threads.per.host.  By default they will only make a single request at a 
time.  One can also configure these to not delay between requests by setting 
fetcher.server.delay to zero.  Such settings are not considered polite, but 
they will substantially improve fetcher performance.


> Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation
> ---
>
>  Key: NUTCH-109
>  URL: http://issues.apache.org/jira/browse/NUTCH-109
>  Project: Nutch
> Type: Improvement
>   Components: fetcher
> Versions: 0.7, 0.8-dev, 0.6, 0.7.1
>  Environment: Nutch: Windows XP, J2SE 1.4.2_09
> Web Server: Suse Linux, Apache HTTPD, apache2-worker,  v. 2.0.53
> Reporter: Fuad Efendi
>  Attachments: protocol-httpclient-innovation-0.1.0.zip
>
> 1. TCP connection costs a lot, not only for Nutch and end-point web servers, 
> but also for intermediary network equipment 
> 2. The Web Server creates a client thread and hopes that Nutch really uses 
> HTTP/1.1, or at least that Nutch sends "Connection: close" before closing the 
> socket in the JVM ("Socket.close()") ...
> I need to perform very objective tests, probably 2-3 days; the new plugin 
> crawled/parsed 23,000 pages in 1,321 seconds; it seems that the existing 
> http-plugin needs a few days...
> I am using a separate network segment with Windows XP (Nutch), and Suse Linux 
> (Apache HTTPD + 120,000 pages)
> Please find attached new plugin based on 
> http://www.innovation.ch/java/HTTPClient/
> Please note: 
> Class HttpFactory contains a cache of HTTPConnection objects; each object may 
> be used from multiple threads; each object is absolutely thread-safe, so we 
> can send multiple GET requests using a single instance:
>private static int CLIENTS_PER_HOST = 
> NutchConf.get().getInt("http.clients.per.host", 3);
> I'll add more comments after finishing tests...




[jira] Commented: (NUTCH-109) Nutch - Fetcher - Performance Test - new Protocol-HTTPClient-Innovation

2005-10-11 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-109?page=comments#action_12331857 ] 

Doug Cutting commented on NUTCH-109:


Comparing protocol-http and protocol-httpclient with default settings, which 
permit only a single request at a time with five second delays between each 
request, to something that permits three simultaneous connections with no 
delays is not a fair comparison.  There is probably some advantage to using 
"Keep-Alive", but these benchmarks do not measure it.  To make a fair 
comparison you must configure Nutch with fetcher.server.delay=0 and 
fetcher.threads.per.host=3.
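For reference, such a run would be configured with overrides along these
lines (the property names are the ones cited above; the surrounding file
layout is from memory of the 0.7-era nutch-site.xml and may differ between
releases):

  <?xml version="1.0"?>
  <nutch-conf>
    <property>
      <name>fetcher.server.delay</name>
      <value>0</value>           <!-- no pause between requests -->
    </property>
    <property>
      <name>fetcher.threads.per.host</name>
      <value>3</value>           <!-- three simultaneous connections -->
    </property>
  </nutch-conf>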




[jira] Commented: (NUTCH-108) tasktracker crashes when reconnecting to a new jobtracker.

2005-10-17 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-108?page=comments#action_12332265 ] 

Doug Cutting commented on NUTCH-108:


I think the patch is to replace the loop at the start of TaskTracker.close() 
with something like:

  while (tasks.size() != 0) {
    // jobHasFinished() is expected to remove the task from 'tasks', so the
    // collection is drained without holding a live iterator over it (holding
    // one is what triggers the ConcurrentModificationException in the trace)
    TaskInProgress tip = (TaskInProgress) tasks.first();
    tip.jobHasFinished();
  }

I have not yet had time to test this.


> tasktracker crashes when reconnecting to a new jobtracker.
> -
>
>  Key: NUTCH-108
>  URL: http://issues.apache.org/jira/browse/NUTCH-108
>  Project: Nutch
> Type: Bug
> Versions: 0.8-dev
> Reporter: Stefan Groschupf
> Priority: Critical

>
> 051008 213532 Lost connection to JobTracker [/192.168.200.100:7020].  
> Retrying...
> 051008 213537 Client connection to 192.168.200.100:7020: starting
> 051008 213537 Client connection to 192.168.200.105:7030: closing
> 051008 213537 Server connection on port 7030 from 192.168.200.105: exiting
> 051008 213537 Server connection on port 7030 from 192.168.200.102: exiting
> 051008 213537 Client connection to 192.168.200.102:7030: closing
> 051008 213537 task_m_1iswra done; removing files.
> 051008 213537 Server connection on port 7030 from 192.168.200.101: exiting
> 051008 213537 Client connection to 192.168.200.101:7030: closing
> Exception in thread "main" java.util.ConcurrentModificationException
> at java.util.TreeMap$EntryIterator.nextEntry(TreeMap.java:1026)
> at java.util.TreeMap$ValueIterator.next(TreeMap.java:1057)
> at org.apache.nutch.mapred.TaskTracker.close(TaskTracker.java:134)
> at org.apache.nutch.mapred.TaskTracker.run(TaskTracker.java:285)
> at org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:629)




[jira] Commented: (NUTCH-114) getting number of urls and links from crawldb

2005-10-17 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-114?page=comments#action_12332267 ] 

Doug Cutting commented on NUTCH-114:


You could use UTF8 as the output key type, map to keys like "links" and 
"entries", and use TextOutputFormat.  Then the output would be a text file with 
the link and entry counts.
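A rough sketch of such a map/reduce pair, written against the 2005-era
org.apache.nutch.mapred interfaces as I recall them (the exact method
signatures, and where a per-record link count would come from, are
assumptions):

  import java.io.IOException;
  import java.util.Iterator;
  import org.apache.nutch.io.LongWritable;
  import org.apache.nutch.io.UTF8;
  import org.apache.nutch.io.Writable;
  import org.apache.nutch.io.WritableComparable;
  import org.apache.nutch.mapred.*;

  public class CrawlDbStatMapper implements Mapper {
    public void configure(JobConf job) {}
    public void close() {}

    public void map(WritableComparable key, Writable value,
                    OutputCollector output, Reporter reporter)
        throws IOException {
      output.collect(new UTF8("entries"), new LongWritable(1));
      // a "links" count would be collected here the same way, taken from
      // whatever link field the crawldb value object actually carries
    }
  }

  class CrawlDbStatReducer implements Reducer {
    public void configure(JobConf job) {}
    public void close() {}

    public void reduce(WritableComparable key, Iterator values,
                       OutputCollector output, Reporter reporter)
        throws IOException {
      long sum = 0;
      while (values.hasNext()) {
        sum += ((LongWritable) values.next()).get();
      }
      // with TextOutputFormat this becomes a line like "entries<TAB>12345"
      output.collect(key, new LongWritable(sum));
    }
  }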

> getting number of urls and links from crawldb
> -
>
>  Key: NUTCH-114
>  URL: http://issues.apache.org/jira/browse/NUTCH-114
>  Project: Nutch
> Type: New Feature
> Versions: 0.8-dev
> Reporter: Stefan Groschupf
> Priority: Minor
>  Fix For: 0.8-dev
>  Attachments: CrawlDbStat.java, CrawlDbStatMapper.java
>
> We need a tool that provides basic statistics about the crawldb.




[jira] Commented: (NUTCH-116) TestNDFS a JUnit test specifically for NDFS

2005-10-19 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-116?page=comments#action_12332493 ] 

Doug Cutting commented on NUTCH-116:


Paul,

This looks like good stuff.

I could commit it more easily if changes were restricted to those required by 
TestNDFS.  Changes to comments, documentation, logging, etc. are better 
contributed as separate patches.  It's also okay to submit a unit test that 
fails and then to submit fixes as separate patches.  That makes my job easier: 
I can first see that the unit test looks reasonable, then see that it fails, 
then see how the patch fixes it.  As it stands it will take me some time to 
fully evaluate this patch.

A few quick comments: 

If BLOCKREPORT_INTERVAL and DATANODE_STARTUP_PERIOD may be overridden (as is 
reasonable, and perhaps required by TestNDFS) then perhaps they should be 
removed from FSConstants entirely.  Does that make sense?

In Server.java, why is notifyAll() safer than notify()?  The intent is to wake 
one and only one waiting Handler thread.  notifyAll() would cause all of the 
Handler threads to become runnable even when only a single call has arrived.  
Is this required by TestNDFS?

Thanks,

Doug
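On the notify() point, the intent can be shown with a toy call queue (purely
illustrative, not Server.java itself): each add() wakes exactly one waiting
Handler, whereas notifyAll() would wake every Handler to contend for a single
call.

  import java.util.LinkedList;

  class CallQueue {
    private final LinkedList calls = new LinkedList();

    public synchronized void add(Object call) {
      calls.addLast(call);
      notify();        // wake ONE waiting handler; notifyAll() would make
    }                  // every handler runnable for this single call

    public synchronized Object take() throws InterruptedException {
      while (calls.isEmpty()) {
        wait();
      }
      return calls.removeFirst();
    }
  }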

> TestNDFS a JUnit test specifically for NDFS
> ---
>
>  Key: NUTCH-116
>  URL: http://issues.apache.org/jira/browse/NUTCH-116
>  Project: Nutch
> Type: Test
>   Components: fetcher, indexer, searcher
> Versions: 0.8-dev
> Reporter: Paul Baclace
>  Attachments: TestNDFS.java, required_by_TestNDFS.patch
>
> TestNDFS is a JUnit test for NDFS using "pseudo multiprocessing" (or more 
> strictly, pseudo distributed) meaning all daemons run in one process and 
> sockets are used to communicate between daemons.  
> The test permutes various block sizes, number of files, file sizes, and 
> number of datanodes.  After creating 1 or more files and filling them with 
> random data, one datanode is shut down, and then the files are verified. 
> Next, all the random test files are deleted and we test for leakage 
> (non-deletion) by directly checking the real directories corresponding to the 
> datanodes still running.




[jira] Commented: (NUTCH-88) Enhance ParserFactory plugin selection policy

2005-10-19 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-88?page=comments#action_12332514 ] 

Doug Cutting commented on NUTCH-88:
---

I am seeing some problems using this.

First, the ParserFactory sometimes uses LOG.severe() which causes the Fetcher 
to exit.  Is there a reason this cannot be LOG.warning()?  LOG.severe() should 
only be used if you intend the application to exit.  This configuration problem 
does not seem to warrant that.  And I'm getting it with the default settings 
when an application/pdf is encountered.

The second problem I'm seeing is that most html pages are parsed by the 
ParseText parser.  I think this is because their HTTP content-type header is 
"text/html; charset=ISO-8859-1", which does not match "text/html".  Where 
should the content-type parameters be removed?
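Wherever it ends up, the normalization itself is small; a sketch of a
hypothetical helper (not the committed fix):

  public class MimeTypeUtil {
    /** "text/html; charset=ISO-8859-1" becomes "text/html" */
    public static String baseType(String contentType) {
      if (contentType == null) return null;
      int semi = contentType.indexOf(';');
      if (semi >= 0) {
        contentType = contentType.substring(0, semi);  // drop parameters
      }
      return contentType.trim().toLowerCase();
    }
  }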


> Enhance ParserFactory plugin selection policy
> -
>
>  Key: NUTCH-88
>  URL: http://issues.apache.org/jira/browse/NUTCH-88
>  Project: Nutch
> Type: Improvement
>   Components: indexer
> Versions: 0.7, 0.8-dev
> Reporter: Jerome Charron
> Assignee: Jerome Charron
>  Fix For: 0.8-dev

>
> The ParserFactory choose the Parser plugin to use based on the content-types 
> and path-suffix defined in the parsers plugin.xml file.
> The selection policy is as follows:
> Content type has priority: the first plugin found whose "contentType" 
> attribute matches the beginning of the content's type is used. 
> If none match, then the first whose "pathSuffix" attribute matches the end of 
> the url's path is used.
> If neither of these match, then the first plugin whose "pathSuffix" is the 
> empty string is used.
> This policy has a lot of problems when no match is found, because a random 
> parser is used (and chances are this parser can't handle the 
> content).
> On the other hand, the content-type associated with a parser plugin is 
> specified in the plugin.xml of each plugin (this is the value used by the 
> ParserFactory), AND each parser also checks the content-type itself in its 
> code (it uses a hard-coded content-type value instead of the value specified 
> in the plugin.xml => possibility of mismatches between the hard-coded 
> content-type and the content-type declared in plugin.xml).
> A complete list of problems and discussion about this point is available in:
>   * http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg00744.html
>   * http://www.mail-archive.com/nutch-dev%40lucene.apache.org/msg00789.html




[jira] Commented: (NUTCH-88) Enhance ParserFactory plugin selection policy

2005-10-19 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-88?page=comments#action_12332518 ] 

Doug Cutting commented on NUTCH-88:
---

These both sound like good changes.  +1




[jira] Commented: (NUTCH-88) Enhance ParserFactory plugin selection policy

2005-10-19 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-88?page=comments#action_12332541 ] 

Doug Cutting commented on NUTCH-88:
---

If it's to happen at parse time then it should happen in the Content 
constructor, so that it's only done in one place, and we don't rely on each 
protocol to normalize the mime type.

That said, I'm not sure that we should do it at parse time.  The 
"charset=ISO-8859-1" is a valid part of the mime type, so I'm not sure we 
should remove it.  But it is not required for parser selection.  So I think it 
makes sense for the parser selector to remove the charset specification.

In any case, we should patch these problems ASAP, as they make the trunk and 
mapred branches behave very poorly.




[jira] Commented: (NUTCH-82) Nutch Commands should run on Windows without external tools

2005-10-19 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-82?page=comments#action_12332543 ] 

Doug Cutting commented on NUTCH-82:
---

I do not think we should have multiple versions of the command line tools, 
since that complicates maintenance.  A Windows batch file is not portable, and 
is thus not a good candidate to replace the bash versions.  I also don't see 
that requiring Perl is any better than requiring Cygwin on Windows, and I 
suspect even with Perl we'd probably require Cygwin.  So, unless someone 
objects, I will close this issue.

> Nutch Commands should run on Windows without external tools
> ---
>
>  Key: NUTCH-82
>  URL: http://issues.apache.org/jira/browse/NUTCH-82
>  Project: Nutch
> Type: New Feature
>  Environment: Windows 2000
> Reporter: AJ Banck
>  Attachments: nutch.bat, nutch.bat, nutch.pl
>
> Currently there is only a shell script to run the Nutch commands. This should 
> be platform independent.
> Best would be Ant tools, or scripts generated by a template tool to avoid 
> replication.




[jira] Commented: (NUTCH-82) Nutch Commands should run on Windows without external tools

2005-10-19 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-82?page=comments#action_12332549 ] 

Doug Cutting commented on NUTCH-82:
---

I do in fact sometimes develop Nutch on Windows.

I would be happy if someone supplied a Java replacement for the command line 
tools.  That would indeed remove a dependency.  But I still don't see how 
requiring Perl simplifies things.  I'm also not much of a Perl programmer.  Are 
you willing to maintain translations of all of the scripts in nutch's bin 
directory as Perl?

Note that the mapred branch has more scripts.  Also note that the mapred branch 
relies on the 'df' program to portably access the amount of free space on a 
volume.  Is there a portable Perl alternative?

http://svn.apache.org/viewcvs.cgi/lucene/nutch/branches/mapred/bin/?rev=326780





[jira] Commented: (NUTCH-88) Enhance ParserFactory plugin selection policy

2005-10-20 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-88?page=comments#action_12332609 ] 

Doug Cutting commented on NUTCH-88:
---

Jerome,  

This works well now.  I've merged your changes to the mapred branch.

Thanks!

Doug





[jira] Commented: (NUTCH-82) Nutch Commands should run on Windows without external tools

2005-10-20 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-82?page=comments#action_12332610 ] 

Doug Cutting commented on NUTCH-82:
---

Ant and Tomcat supply both Unix shell scripts and Windows batch files.  Neither 
uses Perl.  I am hesitant to go this two-implementation route, as Nutch's 
scripting requirements (especially with MapReduce) are greater than Ant or 
Tomcat.  Nutch's new scripts manage daemons on remote servers with ssh and 
rsync, supplied via Cygwin on Windows.





[jira] Created: (NUTCH-124) protocol-httpclient does not follow redirects when fetching robots.txt

2005-11-04 Thread Doug Cutting (JIRA)
protocol-httpclient does not follow redirects when fetching robots.txt
--

 Key: NUTCH-124
 URL: http://issues.apache.org/jira/browse/NUTCH-124
 Project: Nutch
Type: Bug
  Components: fetcher  
Versions: 0.8-dev, 0.7.2-dev
Reporter: Doug Cutting


If a site's robots.txt redirects, protocol-httpclient does not correctly fetch 
the robots.txt and effectively ignores it for the site.  See 
http://www.webmasterworld.com/forum11/3008.htm.




[jira] Resolved: (NUTCH-124) protocol-httpclient does not follow redirects when fetching robots.txt

2005-11-07 Thread Doug Cutting (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-124?page=all ]
 
Doug Cutting resolved NUTCH-124:


Fix Version: 0.8-dev
 Resolution: Fixed

I have fixed this in the mapred branch.




[jira] Commented: (NUTCH-99) ports are hardcoded or random

2005-11-10 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-99?page=comments#action_12357291 ] 

Doug Cutting commented on NUTCH-99:
---

I cannot get patch on Linux to accept this.  The absolute DOS paths seem to 
cause problems.  Can you please regenerate this with relative paths?  
Generating it on Linux would also be preferable, as patch also has problems 
with EOL differences.

Also, ndfs.datanode.port would be a better name for that property.

And catching Exception is overkill.  This should be java.net.BindException, no?





[jira] Commented: (NUTCH-99) ports are hardcoded or random

2005-11-14 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-99?page=comments#action_12357617 ] 

Doug Cutting commented on NUTCH-99:
---

Sounds good.  We should also probably note in the config property descriptions 
that these port numbers are the first in a range that will be tried.





[jira] Resolved: (NUTCH-130) Be explicit about target JVM when building (1.4.x?)

2005-12-01 Thread Doug Cutting (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-130?page=all ]
 
Doug Cutting resolved NUTCH-130:


Fix Version: 0.8-dev
 Resolution: Fixed
  Assign To: Doug Cutting

I just committed this.  I moved the version to the default.properties file, and 
found a few other places where javac is called.

> Be explicit about target JVM when building (1.4.x?)
> ---
>
>  Key: NUTCH-130
>  URL: http://issues.apache.org/jira/browse/NUTCH-130
>  Project: Nutch
> Type: Improvement
> Reporter: [EMAIL PROTECTED]
> Assignee: Doug Cutting
> Priority: Minor
>  Fix For: 0.8-dev

>
> Below is a patch for the Nutch build.xml.  It stipulates that the target JVM 
> is 1.4.x.  Without an explicit target, a Nutch built with 1.5.x Java defaults 
> to a 1.5.x Java target and won't run in a 1.4.x JVM.  This can be annoying 
> (from the Ant javac doc, regarding the target attribute: "We highly recommend 
> to always specify this attribute.").
> [debord 282] nutch > svn diff -u build.xml
> Subcommand 'diff' doesn't accept option '-u [--show-updates]'
> Type 'svn help diff' for usage.
> [debord 283] nutch > svn diff build.xml
> Index: build.xml
> ===================================================================
> --- build.xml   (revision 349779)
> +++ build.xml   (working copy)
> @@ -72,6 +72,8 @@
>   destdir="${build.classes}"
>   debug="${debug}"
>   optimize="${optimize}"
> + target="1.4"
> + source="1.4"
>   deprecation="${deprecation}">
>
>  




[jira] Resolved: (NUTCH-116) TestNDFS a JUnit test specifically for NDFS

2005-12-01 Thread Doug Cutting (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-116?page=all ]
 
Doug Cutting resolved NUTCH-116:


Fix Version: 0.8-dev
 Resolution: Fixed

I just committed this.  Thanks, Paul, this is great to have!




[jira] Commented: (NUTCH-133) ParserFactory does not work as expected

2005-12-07 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-133?page=comments#action_12359624 ] 

Doug Cutting commented on NUTCH-133:


It would be great to have some junit tests which illustrate these problems.  If 
we can first all agree on the desired behaviour, then we can work on the 
appropriate fixes.  For example, we should have some tests which call 
ParseUtil.parse(Content) with various Content instances and check that these 
are parsed as we feel they should be.  Can you look at the failure cases from 
your test set and convert these to unit tests?  That way in the future we can 
be more certain that changes to the parser selection algorithm don't hurt the 
percentage of content that we can parse.

> ParserFactory does not work as expected
> ---
>
>  Key: NUTCH-133
>  URL: http://issues.apache.org/jira/browse/NUTCH-133
>  Project: Nutch
> Type: Bug
> Versions: 0.8-dev, 0.7.1, 0.7.2-dev
> Reporter: Stefan Groschupf
> Priority: Blocker
>  Attachments: ParserFactoryPatch_nutch.0.7_patch.txt, 
> Parserutil_test_patch.txt
>
> Marcel Schnippe detected a set of problems while working with different 
> content and parser types; we worked together to identify the problem source.
> From our point of view the problems described here could be the source of 
> many other problems reported daily on the mailing lists.
> A summary of the problems follows.
> Problem:
> Some servers return mixed-case but correct header keys like 'Content-type' 
> or 'content-Length' in the http response header.
> That's why, for example, a get("Content-Type") fails and a page is detected as 
> zip by the magic content type detection mechanism. 
> We also note that this is a common reason why PDF parsing fails, since 
> get("Content-Length") does not return the correct value. 
> Sample:
> returns "text/HTML" or "application/PDF" or Content-length
> or this url:
> http://www.lanka.info/dictionary/EnglishToSinhala.jsp
> Solution:
> First just write only lower case keys into the properties and later convert 
> all keys that are used to query the metadata to lower case as well.
> e.g.:
> HttpResponse.java, line 353:
> use lower case here and for all keys used to query header properties (also 
> content-length); change:  String key = line.substring(0, colonIndex);  to 
> String key = line.substring(0, colonIndex).toLowerCase();
> Problem:
> MimeTypes-based discovery (magic and url based) is only done in case the 
> content type was not delivered by the web server; this does not happen that 
> often, and mostly this was a problem with mixed-case keys in the header.
> see:
>  public Content toContent() {
>    String contentType = getHeader("Content-Type");
>    if (contentType == null) {
>      MimeType type = null;
>      if (MAGIC) {
>        type = MIME.getMimeType(orig, content);
>      } else {
>        type = MIME.getMimeType(orig);
>      }
>      if (type != null) {
>        contentType = type.getName();
>      } else {
>        contentType = "";
>      }
>    }
>    return new Content(orig, base, content, contentType, headers);
>  }
> Solution:
> Use the content-type information as it is from the webserver and move the 
> content type discovery from the Protocol plugins to the component where the 
> parsing is done, i.e. the ParseFactory.
> Then just create a list of parsers for the content type returned by the 
> server and the custom-detected content type. In the end we can iterate over 
> all parsers until we get a successful parse status.
> Problem:
> Content will also be parsed if the protocol reports an exception and has a 
> non-successful status; in such a case the content is always new byte[0].
> Solution:
> Fetcher.java, line 243.
> Change:  if (!Fetcher.this.parsing) { ...  to 
>   if (!Fetcher.this.parsing || !protocolStatus.isSuccess()) {
>     // TODO: we maybe should not write out empty parse text and parse
>     // data here; I suggest giving outputPage a parameter parsed true/false
>     outputPage(new FetcherOutput(fle, hash, protocolStatus),
>                content, new ParseText(""),
>                new ParseData(new ParseStatus(ParseStatus.NOTPARSED), "",
>                              new Outlink[0], new Properties()));
>     return null;
>   }
> Problem:
> Actually the configuration of parsers is done based on plugin ids, but one 
> plugin can have several extensions, so normally a plugin can provide several 
> parsers; this is not a limitation of the mechanism, just the wrong values are 
> used in the configuration process. 
> Solution:
> Change plugin id to extension id in the parser configuration file and also 
> change the code in the parser factory to use extension ids everywhere.
> Problem:
> There is not a clear differentiation between content type and mime type. 
> I notice that some plugins call metaData.get("Content-Type") or 
> content.getContentType();
> Actually in theory this can return different values, since the content

[jira] Commented: (NUTCH-134) Summarizer doesn't select the best snippets

2005-12-07 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-134?page=comments#action_12359626 ] 

Doug Cutting commented on NUTCH-134:


Can we yet replace Nutch's summarizer with the summarizer in Lucene's contrib 
directory?  Are there features that Nutch requires that that does not yet 
implement?  It's a shame to maintain two summarizers.  When I first wrote 
Nutch's summarizer there was no Lucene contrib summarizer...

> Summarizer doesn't select the best snippets
> ---
>
>  Key: NUTCH-134
>  URL: http://issues.apache.org/jira/browse/NUTCH-134
>  Project: Nutch
> Type: Bug
>   Components: searcher
> Versions: 0.7, 0.8-dev, 0.7.1, 0.7.2-dev
> Reporter: Andrzej Bialecki 

>
> Summarizer.java tries to select the best fragments from the input text, where 
> the frequency of query terms is the highest. However, the logic in line 223 
> is flawed in that the excerptSet.add() operation will add new excerpts only 
> if they are not already present - the test is performed using the Comparator 
> that compares only the numUniqueTokens. This means that if there are two or 
> more excerpts that score equally high, only the first of them will be 
> retained, and the rest of the equally-scoring excerpts will be discarded in 
> favor of other (possibly lower-scoring) excerpts.
> To fix this, the Set should be replaced with a List plus a sort operation. To 
> keep the relative position of excerpts in the original order, the Excerpt 
> class should be extended with an "int order" field, and the collected 
> excerpts should be sorted in that order prior to adding them to the summary.
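A minimal sketch of the proposed fix; the Excerpt fields used here
(numUniqueTokens, order) are inferred from the description above, not taken
from the actual Summarizer code:

  import java.util.ArrayList;
  import java.util.Collections;
  import java.util.Comparator;
  import java.util.List;

  // Illustrative only: keep equally-scoring excerpts by sorting a List
  // instead of deduplicating through a score-based Set comparator.
  public class ExcerptOrderingSketch {

    static class Excerpt {
      int numUniqueTokens;  // the excerpt's score
      int order;            // position in the original text
      Excerpt(int numUniqueTokens, int order) {
        this.numUniqueTokens = numUniqueTokens;
        this.order = order;
      }
    }

    static List selectExcerpts(List excerpts, int max) {
      // Sort by score, descending; ties are all retained, unlike a Set.
      Collections.sort(excerpts, new Comparator() {
        public int compare(Object a, Object b) {
          return ((Excerpt) b).numUniqueTokens - ((Excerpt) a).numUniqueTokens;
        }
      });
      List best =
          new ArrayList(excerpts.subList(0, Math.min(max, excerpts.size())));
      // Restore original document order before building the summary.
      Collections.sort(best, new Comparator() {
        public int compare(Object a, Object b) {
          return ((Excerpt) a).order - ((Excerpt) b).order;
        }
      });
      return best;
    }
  }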

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-133) ParserFactory does not work as expected

2005-12-07 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-133?page=comments#action_12359634 ] 

Doug Cutting commented on NUTCH-133:


Stefan, sorry I missed the test case.

If others agree that these cases should pass, then we should commit the test 
case alone as a start.  Then we can separately decide how to fix things.  From 
your description, with six "solutions", perhaps this should really be six 
separate patches to six separate bugs.

For example, dealing with case in content-type is a separable issue.  Should 
all metadata keys be case-insensitive?  If so, then we should probably 
implement this with a case-insensitive Properties instead: a TreeMap using 
String.CASE_INSENSITIVE_ORDER, as sketched below.
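A minimal sketch of such a map, using only java.util (a real patch would hide
this behind the existing metadata API):

  import java.util.Map;
  import java.util.TreeMap;

  // Illustrative only: header keys compare case-insensitively, so
  // "Content-type", "content-TYPE" and "CONTENT-TYPE" are one entry.
  public class CaseInsensitiveHeaders {
    public static void main(String[] args) {
      Map headers = new TreeMap(String.CASE_INSENSITIVE_ORDER);
      headers.put("Content-type", "text/html");
      // Overwrites the entry above despite the different key case.
      headers.put("CONTENT-TYPE", "application/pdf");
      System.out.println(headers.get("Content-Type")); // application/pdf
      System.out.println(headers.size());              // 1
    }
  }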


> ParserFactory does not work as expected
> ---
>
>  Key: NUTCH-133
>  URL: http://issues.apache.org/jira/browse/NUTCH-133
>  Project: Nutch
> Type: Bug
> Versions: 0.8-dev, 0.7.1, 0.7.2-dev
> Reporter: Stefan Groschupf
> Priority: Blocker
>  Attachments: ParserFactoryPatch_nutch.0.7_patch.txt, 
> Parserutil_test_patch.txt
>
> Marcel Schnippe detected a set of problems while working with different 
> content and parser types; we worked together to identify the problem source.
> From our point of view, the problems described here could be the source of 
> many other problems reported daily on the mailing lists.
> Find a summary of the problems below.
> Problem:
> Some servers return mixed-case but otherwise correct header keys like 
> 'Content-type' or 'content-Length' in the http response header.
> That's why, for example, a get("Content-Type") fails and a page is detected 
> as zip by the magic content-type detection mechanism.
> We also note that this is a common reason why pdf parsing fails, since 
> get("Content-Length") does not return the correct value.
> Sample:
> a server returns "text/HTML" or "application/PDF" or Content-length,
> e.g. this url:
> http://www.lanka.info/dictionary/EnglishToSinhala.jsp
> Solution:
> First, write only lower-case keys into the properties, and later convert 
> all keys that are used to query the metadata to lower case as well.
> e.g.:
> HttpResponse.java, line 353:
> use lower case here and for all keys used to query header properties (also 
> content-length); change:  String key = line.substring(0, colonIndex);  to
> String key = line.substring(0, colonIndex).toLowerCase();
> Problem:
> MimeTypes-based discovery (magic- and url-based) is only done when the 
> content type was not delivered by the web server. That does not happen very 
> often; mostly the problem was mixed-case keys in the header.
> see:
>   public Content toContent() {
>     String contentType = getHeader("Content-Type");
>     if (contentType == null) {
>       MimeType type = null;
>       if (MAGIC) {
>         type = MIME.getMimeType(orig, content);
>       } else {
>         type = MIME.getMimeType(orig);
>       }
>       if (type != null) {
>         contentType = type.getName();
>       } else {
>         contentType = "";
>       }
>     }
>     return new Content(orig, base, content, contentType, headers);
>   }
> Solution:
> Use the content-type information as-is from the web server and move the 
> content-type detection from the Protocol plugins to the component where the 
> parsing is done - the ParserFactory.
> Then just create a list of parsers for the content type returned by the 
> server and the custom-detected content type, and iterate over all parsers 
> until we get a successfully parsed status.
> Problem:
> Content is parsed even if the protocol reports an exception and a 
> non-successful status; in such a case the content is always new byte[0].
> Solution:
> Fetcher.java, line 243.
> Change:   if (!Fetcher.this.parsing) { .. to
>   if (!Fetcher.this.parsing || !protocolStatus.isSuccess()) {
>     // TODO: we probably should not write out empty parse text and parse
>     // data here; I suggest giving outputPage a parameter parsed=true/false
>     outputPage(new FetcherOutput(fle, hash, protocolStatus),
>         content, new ParseText(""),
>         new ParseData(new ParseStatus(ParseStatus.NOTPARSED), "",
>             new Outlink[0], new Properties()));
>     return null;
>   }
> Problem:
> Currently the configuration of parsers is done based on plugin ids, but one 
> plugin can have several extensions, so a plugin can normally provide several 
> parsers. This is not a real limit; simply the wrong values are used in the 
> configuration process.
> Solution:
> Change plugin id to extension id in the parser configuration file, and also 
> change the code in the parser factory to use extension ids everywhere.
> Problem:
> There is no clear differentiation between content type and mime type.
> I notice that some plugins call metaData.get("Content-Type") and others 
> content.getContentType();
> Actually in theory this can return different values, since the content type 
> coul

[jira] Commented: (NUTCH-133) ParserFactory does not work as expected

2005-12-07 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-133?page=comments#action_12359653 ] 

Doug Cutting commented on NUTCH-133:


I think we should distinguish between the value of content.getContentType() and 
metaData.get("Content-Type").  The former should be trusted and the latter 
should be what was declared by the server.

So, for HTTP we could do something like:

private String getContentType(HashMap headers, byte[] data) {
  String typeName = (String) headers.get("Content-Type");
  MimeType type = typeName == null ? null : MimeTypes.getMimeType(typeName);
  if (typeName == null ||
      type == null ||
      (type.hasMagic() && !type.matches(data))) {
    type = MimeTypes.getMimeType(data);  // fall back to scanning all magic
    typeName = type.toString();
  }
  return typeName;
}

This always double-checks that types match their magic (if any is defined) and 
only tries all magic strings when the declared type's magic is mismatched.  
Perhaps this checking could even be moved to the Content constructor.  Thoughts?

> ParserFactory does not work as expected
> ---
>
>  Key: NUTCH-133
>  URL: http://issues.apache.org/jira/browse/NUTCH-133
>  Project: Nutch
> Type: Bug
> Versions: 0.8-dev, 0.7.1, 0.7.2-dev
> Reporter: Stefan Groschupf
> Priority: Blocker
>  Attachments: ParserFactoryPatch_nutch.0.7_patch.txt, 
> Parserutil_test_patch.txt
>

[jira] Commented: (NUTCH-133) ParserFactory does not work as expected

2005-12-08 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-133?page=comments#action_12359753 ] 

Doug Cutting commented on NUTCH-133:


Stefan,

The primary reason to keep classes and method names the same is to simplify the 
evaluation of your patch.  A good patch should solve only one problem, and 
should change nothing unrelated to that problem.  Changes in indentation, etc. 
just make it harder for others to see what's really changed.  Cosmetic changes 
should be separate patches.

Jerome,

The extension and the declared content-type should both be used as hints to 
direct checking of magic.  If we have a known extension or content-type then we 
do not have to scan the entire list of mime types, but can rather first check 
the type(s) named by the extension and the content-type.  If these match the 
content then we're done.  This is an important optimization.  Only if those 
matches fail should we ever try matching all magic.  Does that make sense?
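A minimal sketch of that lookup order; the MimeTypes helper methods named
here are illustrative stand-ins, not the real API:

  // Illustrative only: use extension and declared content-type as hints,
  // and scan all magic patterns only when the hinted types don't match.
  public class MimeHintSketch {

    interface MimeType {
      boolean hasMagic();
      boolean matches(byte[] data);
    }

    interface MimeTypes {
      MimeType getTypeForName(String declaredType); // from Content-Type header
      MimeType getTypeForExtension(String url);     // from the URL's extension
      MimeType getTypeByMagic(byte[] data);         // full scan, expensive
    }

    static MimeType resolve(MimeTypes repo, String url,
                            String declaredType, byte[] data) {
      // Cheap checks first: only the types named by the hints.
      MimeType[] hints = {
          declaredType == null ? null : repo.getTypeForName(declaredType),
          repo.getTypeForExtension(url)
      };
      for (int i = 0; i < hints.length; i++) {
        MimeType hint = hints[i];
        if (hint != null && hint.hasMagic() && hint.matches(data)) {
          return hint;  // hint confirmed by its magic: done, no full scan
        }
      }
      // Only now pay for matching every known magic pattern.
      return repo.getTypeByMagic(data);
    }
  }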

> ParserFactory does not work as expected
> ---
>
>  Key: NUTCH-133
>  URL: http://issues.apache.org/jira/browse/NUTCH-133
>  Project: Nutch
> Type: Bug
> Versions: 0.8-dev, 0.7.1, 0.7.2-dev
> Reporter: Stefan Groschupf
> Priority: Blocker
>  Attachments: ParserFactoryPatch_nutch.0.7_patch.txt, 
> Parserutil_test_patch.txt
>

[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

2005-12-16 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12360645 ] 

Doug Cutting commented on NUTCH-139:


I'm confused as to why all of the constant names have "X_nutch" in them.  I'd 
expect to see something like that in their string values, but their names are 
already qualified by org.apache.nutch.ParseData, no?  Also, it would be easier 
if these were all defined in an interface, something like MetadataNames.  That 
way a class can "implement" that interface and then simply use the short names 
in code, e.g. CONTENT_TYPE, AUTHOR, etc.
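A minimal sketch of that pattern; MetadataNames and its constants here are
hypothetical stand-ins for whatever the patch would define:

  // Illustrative only: constants live in an interface, so implementing
  // classes can refer to them by their short names.
  interface MetadataNames {
    String CONTENT_TYPE = "content-type";
    String AUTHOR = "author";
    String LANGUAGE = "language";
  }

  class ParseDataSketch implements MetadataNames {
    private final java.util.Properties metadata = new java.util.Properties();

    void record() {
      // The short name CONTENT_TYPE resolves via the implemented interface.
      metadata.setProperty(CONTENT_TYPE, "text/html");
    }
  }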

> Standard metadata property names in the ParseData metadata
> --
>
>  Key: NUTCH-139
>  URL: http://issues.apache.org/jira/browse/NUTCH-139
>  Project: Nutch
> Type: Improvement
>   Components: fetcher
> Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, 
> although bug is independent of environment
> Reporter: Chris A. Mattmann
> Assignee: Chris A. Mattmann
> Priority: Minor
>  Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.Mattmann.patch.txt
>
> Currently, people are free to name their string-based properties anything 
> that they want, such as having the names "Content-type", "content-TyPe", and 
> "CONTENT_TYPE" all carry the same meaning. I believe Stefan G. proposed a 
> solution in which all property names are converted to lower case, but in 
> essence this only fixes half the problem (the case of identifying that 
> "CONTENT_TYPE" and "conTeNT_TyPE" and all their permutations are really the 
> same). What about if I named it "Content Type", or "ContentType"?
>  I propose that a way to correct this would be to create a standard set of 
> named Strings in the ParseData class that the protocol framework and the 
> parsing framework could use to identify common properties such as 
> "Content-type", "Creator", "Language", etc.
>  The properties would be defined at the top of the ParseData class, something 
> like:
>  public class ParseData {
>    ...
>    public static final String CONTENT_TYPE = "content-type";
>    public static final String CREATOR = "creator";
>    ...
>  }
> In this fashion, users could at least know the names of the standard 
> properties that they can obtain from the ParseData, for example by making 
> a call to ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to get the 
> content type, or a call to ParseData.getMetadata().set(ParseData.CONTENT_TYPE, 
> "text/xml"). Of course, this wouldn't preclude users from doing what they are 
> currently doing; it would just provide a standard method of obtaining some of 
> the more common, critical metadata without poring over the code base to 
> figure out what they are named.
> I'll contribute a patch near the end of this week, or the beginning of next 
> week, that addresses this issue.




[jira] Commented: (NUTCH-3) multi values of header discarded

2005-12-17 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-3?page=comments#action_12360665 ] 

Doug Cutting commented on NUTCH-3:
--

I find the naming confusing, where setProperty adds a value.  I wonder whether 
we should provide a 'setProperty' that replaces all values, and have 
'addProperty' just add a single value?  Would such a 'setProperty' ever be 
useful?

> multi values of header discarded
> 
>
>  Key: NUTCH-3
>  URL: http://issues.apache.org/jira/browse/NUTCH-3
>  Project: Nutch
> Type: Bug
> Reporter: Stefan Groschupf
> Assignee: Stefan Groschupf
>  Fix For: 0.8-dev
>  Attachments: multiValuesPropertyPatch.txt
>
> original by: phoebe
> http://sourceforge.net/tracker/index.php?func=detail&aid=185&group_id=59548&atid=491356
> multi values of header discarded
> Each successive setting of a header value deletes the previous one.
> This patch allows multiple values to be retained, such as cookies, using 
> CR LF as a delimiter between values.
> --- /tmp/HttpResponse.java 2005-01-27 19:57:55.0 -0500
> +++ HttpResponse.java 2005-01-27 20:45:01.0 -0500
> @@ -324,7 +324,19 @@
>      }
>      String value = line.substring(valueStart);
> -    headers.put(key, value);
> +    // Spec allows multiple values, such as Set-Cookie - CR LF as delimiter
> +    if (headers.containsKey(key)) {
> +      try {
> +        Object obj = headers.get(key);
> +        if (obj != null) {
> +          String oldvalue = headers.get(key).toString();
> +          value = oldvalue + "\r\n" + value;
> +        }
> +      } catch (Exception e) {
> +        e.printStackTrace();
> +      }
> +    }
> +    headers.put(key, value);
>    }
>    private Map parseHeaders(PushbackInputStream in, StringBuffer line)
> @@ -399,5 +411,3 @@
>  }




[jira] Commented: (NUTCH-3) multi values of header discarded

2005-12-17 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-3?page=comments#action_12360702 ] 

Doug Cutting commented on NUTCH-3:
--

Yes, I prefer this.  +1

> multi values of header discarded
> 
>
>  Key: NUTCH-3
>  URL: http://issues.apache.org/jira/browse/NUTCH-3
>  Project: Nutch
> Type: Bug
> Reporter: Stefan Groschupf
> Assignee: Stefan Groschupf
>  Fix For: 0.8-dev
>  Attachments: contentPropertiesAddpatch.txt, multiValuesPropertyPatch.txt
>




[jira] Commented: (NUTCH-159) Specify temp/working directory for crawl

2006-01-02 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-159?page=comments#action_12361541 ] 

Doug Cutting commented on NUTCH-159:


mapred.local.dir is the thing to set.  If that fails, then there is a bug.  
What did you have it set to?
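For reference, a minimal sketch of overriding that property in
nutch-site.xml; the directory path is a placeholder, and any directory on a
volume with enough free space would do:

  <!-- Illustrative only: point map/reduce temp space away from /tmp. -->
  <property>
    <name>mapred.local.dir</name>
    <value>/data/nutch/mapred-local</value>
  </property>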

> Specify temp/working directory for crawl
> 
>
>  Key: NUTCH-159
>  URL: http://issues.apache.org/jira/browse/NUTCH-159
>  Project: Nutch
> Type: Bug
>   Components: fetcher, indexer
> Versions: 0.8-dev
>  Environment: Linux/Debian
> Reporter: byron miller

>
> I ran a crawl of 100k web pages and got:
> org.apache.nutch.fs.FSError: java.io.IOException: No space left on device
> at 
> org.apache.nutch.fs.LocalFileSystem$LocalNFSFileOutputStream.write(LocalFileSystem.java:149)
> at org.apache.nutch.fs.FileUtil.copyContents(FileUtil.java:65)
> at 
> org.apache.nutch.fs.LocalFileSystem.renameRaw(LocalFileSystem.java:178)
> at 
> org.apache.nutch.fs.NutchFileSystem.rename(NutchFileSystem.java:224)
> at 
> org.apache.nutch.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:80)
> Caused by: java.io.IOException: No space left on device
> at java.io.FileOutputStream.writeBytes(Native Method)
> at java.io.FileOutputStream.write(FileOutputStream.java:260)
> at 
> org.apache.nutch.fs.LocalFileSystem$LocalNFSFileOutputStream.write(LocalFileSystem.java:147)
> ... 4 more
> Exception in thread "main" java.io.IOException: Job failed!
> at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:308)
> at org.apache.nutch.crawl.Fetcher.fetch(Fetcher.java:335)
> at org.apache.nutch.crawl.Crawl.main(Crawl.java:107)
> [EMAIL PROTECTED]:/data/nutch$ df -k
> It appears the crawl created a /tmp/nutch directory that filled up even 
> though I specified a db directory.
> Need to add a parameter to the command line, or make a globally configurable 
> /tmp (work area) for the nutch instance, so that crawls won't fail.




[jira] Resolved: (NUTCH-108) tasktracker crashes when reconnecting to a new jobtracker.

2006-01-05 Thread Doug Cutting (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-108?page=all ]
 
Doug Cutting resolved NUTCH-108:


Fix Version: 0.8-dev
 Resolution: Fixed

I just committed this patch.  Thanks, Paul!

> tasktracker crashes when reconnecting to a new jobtracker.
> -
>
>  Key: NUTCH-108
>  URL: http://issues.apache.org/jira/browse/NUTCH-108
>  Project: Nutch
> Type: Bug
> Versions: 0.8-dev
> Reporter: Stefan Groschupf
> Priority: Critical
>  Fix For: 0.8-dev
>  Attachments: TaskTracker.java.patch
>
> 051008 213532 Lost connection to JobTracker [/192.168.200.100:7020].  
> Retrying...
> 051008 213537 Client connection to 192.168.200.100:7020: starting
> 051008 213537 Client connection to 192.168.200.105:7030: closing
> 051008 213537 Server connection on port 7030 from 192.168.200.105: exiting
> 051008 213537 Server connection on port 7030 from 192.168.200.102: exiting
> 051008 213537 Client connection to 192.168.200.102:7030: closing
> 051008 213537 task_m_1iswra done; removing files.
> 051008 213537 Server connection on port 7030 from 192.168.200.101: exiting
> 051008 213537 Client connection to 192.168.200.101:7030: closing
> Exception in thread "main" java.util.ConcurrentModificationException
> at java.util.TreeMap$EntryIterator.nextEntry(TreeMap.java:1026)
> at java.util.TreeMap$ValueIterator.next(TreeMap.java:1057)
> at org.apache.nutch.mapred.TaskTracker.close(TaskTracker.java:134)
> at org.apache.nutch.mapred.TaskTracker.run(TaskTracker.java:285)
> at org.apache.nutch.mapred.TaskTracker.main(TaskTracker.java:629)




[jira] Resolved: (NUTCH-131) Non-documented variable: mapred.child.heap.size

2006-01-05 Thread Doug Cutting (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-131?page=all ]
 
Doug Cutting resolved NUTCH-131:


Fix Version: 0.8-dev
 Resolution: Fixed

I just committed this.  Thanks!

> Non-documented variable: mapred.child.heap.size
> ---
>
>  Key: NUTCH-131
>  URL: http://issues.apache.org/jira/browse/NUTCH-131
>  Project: Nutch
> Type: Bug
> Versions: 0.8-dev
> Reporter: Rod Taylor
>  Fix For: 0.8-dev
>  Attachments: nutch-131.patch
>
> Got complaints about lack of heap space.  Seems it was the child processes 
> running out of room during the reduce phase of an updatedb.




[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-05 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12361891 ] 

Doug Cutting commented on NUTCH-139:


If we store protocol headers as metadata then we should store them as-is.  If 
they're incorrect, then we should store the correct value separately as an 
x-nutch metadata value.

We should never need to store content type or title in metadata, since these 
are fields of Content and Parse respectively.  The "Content-Type" in the 
metadata for an http request should thus be the raw http header, the 
Content.getContentType() should be the content type we actually think this is, 
and there should be no x-nutch-content-type value.  Similarly, x-nutch-title 
should never be set, as parsers should set the Parse title field instead.

Does this sound right?


> Standard metadata property names in the ParseData metadata
> --
>
>  Key: NUTCH-139
>  URL: http://issues.apache.org/jira/browse/NUTCH-139
>  Project: Nutch
> Type: Improvement
>   Components: fetcher
> Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, 
> although bug is independent of environment
> Reporter: Chris A. Mattmann
> Assignee: Chris A. Mattmann
> Priority: Minor
>  Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt, 
> NUTCH-139.jc.review.patch.txt
>




[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-05 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12361922 ] 

Doug Cutting commented on NUTCH-139:


One more thing.  Content length should also not need to be stored in the 
metadata as an x-nutch value.  The content length is simply the length of the 
Content's data.  The protocol may have truncated the content, in which case 
perhaps we need an x-nutch-truncated-content metadata property or something, 
but we should not be overwriting the HTTP "Content-Length" header, nor should 
we trust that it reflects the length of the data actually fetched.


> Standard metadata property names in the ParseData metadata
> --
>
>  Key: NUTCH-139
>  URL: http://issues.apache.org/jira/browse/NUTCH-139
>  Project: Nutch
> Type: Improvement
>   Components: fetcher
> Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, 
> although bug is independent of environment
> Reporter: Chris A. Mattmann
> Assignee: Chris A. Mattmann
> Priority: Minor
>  Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt, 
> NUTCH-139.jc.review.patch.txt
>




[jira] Updated: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-06 Thread Doug Cutting (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-139?page=all ]

Doug Cutting updated NUTCH-139:
---

Comment: was deleted

> Standard metadata property names in the ParseData metadata
> --
>
>  Key: NUTCH-139
>  URL: http://issues.apache.org/jira/browse/NUTCH-139
>  Project: Nutch
> Type: Improvement
>   Components: fetcher
> Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, 
> although bug is independent of environment
> Reporter: Chris A. Mattmann
> Assignee: Chris A. Mattmann
> Priority: Minor
>  Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt, 
> NUTCH-139.jc.review.patch.txt
>




[jira] Updated: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-06 Thread Doug Cutting (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-139?page=all ]

Doug Cutting updated NUTCH-139:
---

Comment: was deleted

> Standard metadata property names in the ParseData metadata
> --
>
>  Key: NUTCH-139
>  URL: http://issues.apache.org/jira/browse/NUTCH-139
>  Project: Nutch
> Type: Improvement
>   Components: fetcher
> Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, 
> although bug is independent of environment
> Reporter: Chris A. Mattmann
> Assignee: Chris A. Mattmann
> Priority: Minor
>  Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt, 
> NUTCH-139.jc.review.patch.txt
>




[jira] Updated: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-06 Thread Doug Cutting (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-139?page=all ]

Doug Cutting updated NUTCH-139:
---

Comment: was deleted

> Standard metadata property names in the ParseData metadata
> --
>
>  Key: NUTCH-139
>  URL: http://issues.apache.org/jira/browse/NUTCH-139
>  Project: Nutch
> Type: Improvement
>   Components: fetcher
> Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, 
> although bug is independent of environment
> Reporter: Chris A. Mattmann
> Assignee: Chris A. Mattmann
> Priority: Minor
>  Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt, 
> NUTCH-139.jc.review.patch.txt
>




[jira] Updated: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-06 Thread Doug Cutting (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-139?page=all ]

Doug Cutting updated NUTCH-139:
---

Comment: was deleted

> Standard metadata property names in the ParseData metadata
> --
>
>  Key: NUTCH-139
>  URL: http://issues.apache.org/jira/browse/NUTCH-139
>  Project: Nutch
> Type: Improvement
>   Components: fetcher
> Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, 
> although bug is independent of environment
> Reporter: Chris A. Mattmann
> Assignee: Chris A. Mattmann
> Priority: Minor
>  Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt, 
> NUTCH-139.jc.review.patch.txt
>




[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-06 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12361994 ] 

Doug Cutting commented on NUTCH-139:


Jerome,

Some HTTP headers have multiple values.  Correctly reflecting that was, I 
thought, the primary motivation for adding multiple values, not recording 
historical values.

I still don't see a reason why the derived content type needs to be stored 
anywhere but in the contentType field of the Content.  And if a derived value 
ever needs to go into the metadata, it should always use an x-nutch key, so 
that it can be clearly distinguished from original values.

Chris,

The content length is not expensive to compute, it's simply the length of the 
content byte array.  Are there uses of content length where this is 
impractical?  If so, then perhaps we could, for performance, cache a 
protocol-independent, derived content length in an x-nutch header. 

Alternately, we could prefix all protocol headers with the protocol name, so 
that the HTTP "Content-Language" header could be stored as something like 
"http:Content-Language".  Then Nutch could avoid using the x-nutch prefix, and 
instead store the derived, protocol-independent value as simply "language".

Yes, these are issues of policy, but this patch violates my ideas about the 
correct policy.  We should not confuse protocol-specific HTTP headers with 
protocol-independent derived values.  And multiple values should be the 
exception, used in cases where they are really sensible (like email 
"Received" headers), not to store historic values.

> Standard metadata property names in the ParseData metadata
> --
>
>  Key: NUTCH-139
>  URL: http://issues.apache.org/jira/browse/NUTCH-139
>  Project: Nutch
> Type: Improvement
>   Components: fetcher
> Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, 
> although bug is independent of environment
> Reporter: Chris A. Mattmann
> Assignee: Chris A. Mattmann
> Priority: Minor
>  Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt, 
> NUTCH-139.jc.review.patch.txt
>




[jira] Commented: (NUTCH-160) Use standard Java Regex library rather than org.apache.oro.text.regex

2006-01-06 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-160?page=comments#action_12361999 ] 

Doug Cutting commented on NUTCH-160:


+1

I like this patch.  I don't see a need for us to use oro anywhere, since Java 
now has good builtin regex support.  And Java's regexes are faster in many 
cases, not just this:

http://tbray.org/ongoing/When/200x/2004/08/22/PJre

There are a few places in which Java's regexes are incompatible with Perl 5 
regexes, documented in the "Comparison to Perl 5" section of:

http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html

So this change is not completely back-compatible.

Any objections?

> Use standard Java Regex library rather than org.apache.oro.text.regex
> -
>
>  Key: NUTCH-160
>  URL: http://issues.apache.org/jira/browse/NUTCH-160
>  Project: Nutch
> Type: Improvement
> Versions: 0.8-dev
> Reporter: Rod Taylor
>  Attachments: regex.patch
>
> org.apache.oro.text.regex is based on perl 5.003, which has some corner cases 
> that perform poorly. The standard regular expression libraries for Java (1.4 
> and later) do not seem to have these issues.




[jira] Commented: (NUTCH-153) TextParser is only supposed to parse plain text, but if given postscript, it can take hours and then fail

2006-01-06 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-153?page=comments#action_12362002 ] 

Doug Cutting commented on NUTCH-153:


Paul,

Does http://issues.apache.org/jira/browse/NUTCH-160 address this issue too?  
I.e., is at least part of the problem that oro has some slow cases that Java's 
built-in regexes do not?


> TextParser is only supposed to parse plain text, but if given postscript, it 
> can take hours and then fail
> -
>
>  Key: NUTCH-153
>  URL: http://issues.apache.org/jira/browse/NUTCH-153
>  Project: Nutch
> Type: Bug
>   Components: fetcher
> Versions: 0.8-dev
>  Environment: all
> Reporter: Paul Baclace
>  Attachments: TextParser.java.patch
>
> If TextParser is given postscript, it can take hours and then fail.  This can 
> be avoided with careful configuration, but if the server MIME type is wrong 
> and the basename of the URL has no "file extension", then this parser will 
> take a long time and fail every time.
> Analysis: The real problem is OutlinkExtractor.java, as reported in bug 
> NUTCH-150, but the problem cannot be entirely addressed with that patch, 
> since the first call to reg expr match() can take a long time despite 
> quantifier limits.
> Suggested fix: Reject files with "%!PS-Adobe" in the first 40 characters of 
> the file.
> Actual experience has shown that, for safety and fail-safe reasons, it is 
> worth protecting against GIGO directly in TextParser for this case, even 
> though the suggested fix is not a general solution.  (A general solution 
> would be a timeout on match().)
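A minimal standalone sketch of that guard (not the attached
TextParser.java.patch itself):

  // Illustrative only: reject content that announces itself as PostScript
  // before handing it to the plain-text parser.
  public class PostScriptGuard {

    // Look for the PostScript signature in the first 40 characters.
    static boolean looksLikePostScript(byte[] content) {
      int limit = Math.min(40, content.length);
      String head = new String(content, 0, limit); // header bytes are ASCII
      return head.indexOf("%!PS-Adobe") >= 0;
    }

    public static void main(String[] args) {
      byte[] ps = "%!PS-Adobe-3.0\n%%Creator: test".getBytes();
      byte[] txt = "just some plain text".getBytes();
      System.out.println(looksLikePostScript(ps));  // true
      System.out.println(looksLikePostScript(txt)); // false
    }
  }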




[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-06 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12362003 ] 

Doug Cutting commented on NUTCH-139:


Also, since the primary use of multiple metadata values should be for protocols 
where multiple-values are required, the method to add a value should be 
different from the method to set a value.  I commented on this before when 
multiple values were added: there should be separate add(String,String) and 
set(String,String) methods.  The former should be used, e.g., by HTTP when 
storing headers, and the latter should be used, e.g., when setting x-nutch 
values.
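A minimal sketch of those semantics; ContentMetadata is a hypothetical class,
not the actual Nutch metadata implementation:

  import java.util.ArrayList;
  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  // Illustrative only: add() appends a value (multi-valued headers),
  // set() replaces all existing values (derived x-nutch entries).
  public class ContentMetadata {
    private final Map values = new HashMap();

    /** Append one more value under this name, e.g. repeated HTTP headers. */
    public void add(String name, String value) {
      List list = (List) values.get(name);
      if (list == null) {
        list = new ArrayList();
        values.put(name, list);
      }
      list.add(value);
    }

    /** Replace every existing value under this name with a single value. */
    public void set(String name, String value) {
      List list = new ArrayList();
      list.add(value);
      values.put(name, list);
    }

    public String[] getValues(String name) {
      List list = (List) values.get(name);
      return list == null ? new String[0]
                          : (String[]) list.toArray(new String[list.size()]);
    }
  }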


> Standard metadata property names in the ParseData metadata
> --
>
>  Key: NUTCH-139
>  URL: http://issues.apache.org/jira/browse/NUTCH-139
>  Project: Nutch
> Type: Improvement
>   Components: fetcher
> Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, 
> although bug is independent of environment
> Reporter: Chris A. Mattmann
> Assignee: Chris A. Mattmann
> Priority: Minor
>  Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt, 
> NUTCH-139.jc.review.patch.txt
>




[jira] Commented: (NUTCH-152) TaskRunner io pipes are not setDaemon(true), cleanup and exception errors are incomplete, max heap too small

2006-01-06 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-152?page=comments#action_12362004 ] 

Doug Cutting commented on NUTCH-152:


re 1,2,5: sounds good.
re 3: Why is a separate thread needed for stdout?  Can you please elaborate on 
how this causes problems?
re 4: I'd expect the io pipes to get EOF when the process is killed.  Is that 
not the case?
re 6: this is now in nutch-default.xml; tasks can override it, or it can be set 
in nutch-site.xml, so the value in this file has little importance.


> TaskRunner io pipes are not setDaemon(true), cleanup and exception errors are 
> incomplete, max heap too small
> 
>
>  Key: NUTCH-152
>  URL: http://issues.apache.org/jira/browse/NUTCH-152
>  Project: Nutch
> Type: Bug
>   Components: fetcher
> Versions: 0.8-dev
>  Environment: all
> Reporter: Paul Baclace
>  Attachments: TaskRunner.java.patch
>
> 1. io pipes should be setDaemon(true) so that the process cannot hang (see 
> the sketch after this list).
> 2. error messages for Exceptions are incomplete since e.getMessage() is used 
> and it can be empty (NullPointerException has an empty message).   Change 
> this to e.toString() which always has more meaning.
> 3. a separate thread is not used for the subprocess stdout pipe, but it must 
> be a separate thread if setDaemon(true).
> 4. TaskRunner.kill()  does not stop the io pipe threads, but it should.
> 5. If InterruptedException occurs, it was assumed to be for the current 
> (main) thread, but it should check this with Thread.interrupted() otherwise 
> spurious thread interrupts will be rethrown as IOException.
> 6. A recent run had some Tasktracker child processes that ran out of heap.  
> The default max heap size should be larger.
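A minimal sketch of what items 1 and 3 propose (illustrative only, not the attached TaskRunner.java.patch):

    import java.io.*;

    /** Pumps one subprocess stream on its own daemon thread. */
    public class PipePumper implements Runnable {
      private final BufferedReader in;

      public PipePumper(InputStream in) {
        this.in = new BufferedReader(new InputStreamReader(in));
      }

      public void run() {
        try {
          String line;
          while ((line = in.readLine()) != null) {
            System.out.println(line);            // or hand off to a logger
          }
        } catch (IOException e) {
          // pipe closed; let the thread exit
        }
      }

      /** Item 3: each stream (stdout and stderr) gets its own thread. */
      public static Thread pump(InputStream in) {
        Thread t = new Thread(new PipePumper(in));
        t.setDaemon(true);   // item 1: a blocked read() can't hang the JVM
        t.start();
        return t;
      }
    }

With this, a TaskRunner-like caller would run PipePumper.pump(process.getInputStream()) and PipePumper.pump(process.getErrorStream()) right after exec().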

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Resolved: (NUTCH-151) CommandRunner can hang after the main thread exec is finished and has inefficient busy loop

2006-01-06 Thread Doug Cutting (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-151?page=all ]
 
Doug Cutting resolved NUTCH-151:


Fix Version: 0.8-dev
 Resolution: Fixed

I just committed this.  Thanks, Paul!

> CommandRunner can hang after the main thread exec is finished and has 
> inefficient busy loop
> ---
>
>  Key: NUTCH-151
>  URL: http://issues.apache.org/jira/browse/NUTCH-151
>  Project: Nutch
> Type: Bug
>   Components: indexer
> Versions: 0.8-dev
>  Environment: all
> Reporter: Paul Baclace
>  Fix For: 0.8-dev
>  Attachments: CommandRunner.java, CommandRunner.java.patch
>
> I encountered a case where the JVM of a Tasktracker child did not exit after 
> the main thread returned; a thread dump showed only the threads named STDOUT 
> and STDERR from CommandRunner as non-daemon threads, and both were doing a 
> read().
> CommandRunner usually works correctly when the subprocess is expected to be 
> finished before the timeout or when no timeout is used. By _usually_, I mean 
> in the absence of external thread interrupts.  The busy loop that waits for 
> the process to finish has a sleep that is skipped over by an exception; this 
> causes the waiting main thread to compete with the subprocess in a tight loop 
> and effectively reduces the available cpu by 50%.
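A minimal sketch of a fixed wait loop (illustrative, not the attached patch): the key is that an InterruptedException must not silently skip the sleep and leave the loop spinning.

    /** Polls for subprocess exit without busy-looping; sketch only. */
    static void waitFor(Process process, long timeoutMillis) {
      long deadline = System.currentTimeMillis() + timeoutMillis;
      while (true) {
        try {
          process.exitValue();                   // throws while still running
          return;                                // subprocess finished
        } catch (IllegalThreadStateException stillRunning) {
          if (System.currentTimeMillis() >= deadline) {
            process.destroy();                   // timed out
            return;
          }
          try {
            Thread.sleep(100);                   // the sleep the bug skipped
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();  // preserve interrupt status
            return;                              // don't spin in a tight loop
          }
        }
      }
    }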

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Resolved: (NUTCH-150) OutlinkExtractor extremely slow on some non-plain text

2006-01-06 Thread Doug Cutting (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-150?page=all ]
 
Doug Cutting resolved NUTCH-150:


Fix Version: 0.7.2-dev
 Resolution: Fixed

I just committed this.  Thanks, Paul!

> OutlinkExtractor extremely slow on some non-plain text
> --
>
>  Key: NUTCH-150
>  URL: http://issues.apache.org/jira/browse/NUTCH-150
>  Project: Nutch
> Type: Bug
> Versions: 0.8-dev
>  Environment: All
> Reporter: Paul Baclace
> Priority: Minor
>  Fix For: 0.7.2-dev
>  Attachments: OutlinkExtractor.java.patch
>
> While using mime settings which aggressively parsed everything by default, 
> rather than having conf/parse-plugins.xml  associate parse-default with *, 
> some parse tasks took an incredibly long time to finish.  For instance, a 
> single postscript file took 9 hours to parse.  Stacktraces indicated this to 
> be a problem with OutlinkExtractor.getOutlinks(...) during the call to the 
> regular expression match().  
> Analysis:  The regular expression matching in 
> OutlinkExtractor.getOutlinks(...) encounters pathological cases which have 
> extremely long runtimes when non-plain text is processed.
> Workaround 1:  Avoid treating non-plain text, especially postscript files, as 
> text or html.
> Workaround 2:  kill -SIGQUIT the child TaskRunner process; this will 
> interrupt the match() and the process will continue.  This might need to be 
> done multiple times.  (In theory, SIGQUIT is not supposed to do this, but in 
> practice it does.)

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-09 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12362242 ] 

Doug Cutting commented on NUTCH-139:


We can just use different names, rather than two metaData objects: X-nutch 
names for derived or other values that are usually protocol independent; and 
(possibly prefixed) names for protocol- or format-specific values.  The latter 
are sometimes multivalued, but the former are probably not.

The relevance to this patch is that this patch currently uses un-prefixed 
protocol-specific names to store derived, protocol-independent data, which is 
confusing.  This patch is meant to standardize property names.  Let's just 
standardize them once.  Protocol- and format-specific names should be defined 
in protocol- and format-specific files.  For example, if we want to define 
constants for http headers, they should probably go in the (new) lib-http 
plugin.

We also need to change ContentProperties to distinguish add(String,String) from 
set(String,String), and we may need to change some protocols to call 
add(String,String) instead of set(String,String).  I think that it makes sense 
to bundle that change in this patch too.

> Standard metadata property names in the ParseData metadata
> --
>
>  Key: NUTCH-139
>  URL: http://issues.apache.org/jira/browse/NUTCH-139
>  Project: Nutch
> Type: Improvement
>   Components: fetcher
> Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, 
> although bug is independent of environment
> Reporter: Chris A. Mattmann
> Assignee: Chris A. Mattmann
> Priority: Minor
>  Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt, 
> NUTCH-139.jc.review.patch.txt
>
> Currently, people are free to name their string-based properties anything 
> that they want, such as having names of "Content-type", "content-TyPe", 
> "CONTENT_TYPE" all having the same meaning. Stefan G. I believe proposed a 
> solution in which all property names be converted to lower case, but in 
> essence this really only fixes half the problem right (the case of 
> identifying that "CONTENT_TYPE"
> and "conTeNT_TyPE" and all the permutations are really the same). What about
> if I named it "Content Type", or "ContentType"?
>  I propose that a way to correct this would be to create a standard set of 
> named Strings in the ParseData class that the protocol framework and the 
> parsing framework could use to identify common properties such as 
> "Content-type", "Creator", "Language", etc.
>  The properties would be defined at the top of the ParseData class, something 
> like:
>  public class ParseData{
>.
> public static final String CONTENT_TYPE = "content-type";
> public static final String CREATOR = "creator";
>
> }
> In this fashion, users could at least know the names of the standard 
> properties that they can obtain from the ParseData, for example by making 
> a call to ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to get the 
> content type, or a call to ParseData.getMetadata().set(ParseData.CONTENT_TYPE, 
> "text/xml"). Of course, this wouldn't preclude users from doing what they are 
> currently doing; it would just provide a standard method of obtaining some of 
> the more common, critical metadata without poring over the code base to 
> figure out what they are named.
> I'll contribute a patch near the end of this week, or the beginning of next 
> week, that addresses this issue.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-09 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12362249 ] 

Doug Cutting commented on NUTCH-139:


Let me try to be more concrete.  I'd prefer that the X-nutch properties be 
removed from MetadataNames before this is committed, and moved to protocol- and 
parse-specific files.  Response.java would be a good place for things that are 
part of HTTP that are mimicked by other protocols.

If you prefer, this could be done subsequently.  So my vote for this patch is 
currently 0.

MetadataNames.java should probably not be in util, but rather in protocol, near 
ContentProperties.  In general, we should avoid putting things in util unless 
they're really generic.  Perhaps we need a new package for metadata?  And 
ContentProperties could be renamed MetadataProperties.  That change is out of 
the scope of this patch (there I go again!) but it's best to place new stuff 
like MetadataNames.java and the protocol-specific property names in the right 
place to start, rather than have to move them in a subsequent patch.  Then we 
don't have to change all the import statements again, etc.

> Standard metadata property names in the ParseData metadata
> --
>
>  Key: NUTCH-139
>  URL: http://issues.apache.org/jira/browse/NUTCH-139
>  Project: Nutch
> Type: Improvement
>   Components: fetcher
> Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, 
> although bug is independent of environment
> Reporter: Chris A. Mattmann
> Assignee: Chris A. Mattmann
> Priority: Minor
>  Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt, 
> NUTCH-139.jc.review.patch.txt
>
> Currently, people are free to name their string-based properties anything 
> that they want, such as having names of "Content-type", "content-TyPe", 
> "CONTENT_TYPE" all having the same meaning. Stefan G. I believe proposed a 
> solution in which all property names be converted to lower case, but in 
> essence this really only fixes half the problem right (the case of 
> identifying that "CONTENT_TYPE"
> and "conTeNT_TyPE" and all the permutations are really the same). What about
> if I named it "Content Type", or "ContentType"?
>  I propose that a way to correct this would be to create a standard set of 
> named Strings in the ParseData class that the protocol framework and the 
> parsing framework could use to identify common properties such as 
> "Content-type", "Creator", "Language", etc.
>  The properties would be defined at the top of the ParseData class, something 
> like:
>  public class ParseData{
>.
> public static final String CONTENT_TYPE = "content-type";
> public static final String CREATOR = "creator";
>
> }
> In this fashion, users could at least know the names of the standard 
> properties that they can obtain from the ParseData, for example by making 
> a call to ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to get the 
> content type, or a call to ParseData.getMetadata().set(ParseData.CONTENT_TYPE, 
> "text/xml"). Of course, this wouldn't preclude users from doing what they are 
> currently doing; it would just provide a standard method of obtaining some of 
> the more common, critical metadata without poring over the code base to 
> figure out what they are named.
> I'll contribute a patch near the end of this week, or the beginning of next 
> week, that addresses this issue.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Resolved: (NUTCH-160) Use standard Java Regex library rather than org.apache.oro.text.regex

2006-01-09 Thread Doug Cutting (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-160?page=all ]
 
Doug Cutting resolved NUTCH-160:


Fix Version: 0.8-dev
 Resolution: Fixed

I just committed this patch.  Thanks!

> Use standard Java Regex library rather than org.apache.oro.text.regex
> -
>
>  Key: NUTCH-160
>  URL: http://issues.apache.org/jira/browse/NUTCH-160
>  Project: Nutch
> Type: Improvement
> Versions: 0.8-dev
> Reporter: Rod Taylor
>  Fix For: 0.8-dev
>  Attachments: regex.patch
>
> org.apache.oro.text.regex is based on perl 5.003 which has some corner cases 
> which perform poorly. The standard regular expression libraries for Java (1.4 
> and later) do not seem to contain these issues.
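For reference, a minimal java.util.regex loop of the kind that replaces an ORO Perl5Matcher loop (a sketch; the pattern is illustrative, not the one in regex.patch):

    import java.util.regex.*;

    public class UrlMatcher {
      /** Prints every URL-like substring of content. */
      public static void printUrls(String content) {
        Pattern pattern = Pattern.compile("https?://[^\\s\"']+");
        Matcher matcher = pattern.matcher(content);
        while (matcher.find()) {
          System.out.println(matcher.group());   // one outlink candidate
        }
      }
    }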

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-170) Crash with multiple temp directories

2006-01-11 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-170?page=comments#action_12362482 ] 

Doug Cutting commented on NUTCH-170:


I have successfully used mapred.local.dir with multiple values on many occasions.

Can you please try to distill this to an easy-to-reproduce test case?  Thanks.

> Crash with multiple temp directories
> 
>
>  Key: NUTCH-170
>  URL: http://issues.apache.org/jira/browse/NUTCH-170
>  Project: Nutch
> Type: Bug
> Reporter: Rod Taylor
> Priority: Critical

>
> A brief read of the code indicated it may be possible to use multiple local 
> directories using something like the below:
>   <property>
>     <name>mapred.local.dir</name>
>     <value>/local,/local1,/local2</value>
>     <description>The local directory where MapReduce stores intermediate
>     data files.</description>
>   </property>
> This failed with the below exception during either the generate or update 
> phase (not entirely sure which).
> java.lang.ArrayIndexOutOfBoundsException
> at java.util.zip.CRC32.update(CRC32.java:51)
> at 
> org.apache.nutch.fs.NFSDataInputStream$Checker.read(NFSDataInputStream.java:92)
> at 
> org.apache.nutch.fs.NFSDataInputStream$PositionCache.read(NFSDataInputStream.java:156)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> at java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:313)
> at java.io.DataInputStream.readFully(DataInputStream.java:176)
> at 
> org.apache.nutch.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:55)
> at 
> org.apache.nutch.io.DataOutputBuffer.write(DataOutputBuffer.java:89)
> at org.apache.nutch.io.SequenceFile$Reader.next(SequenceFile.java:378)
> at org.apache.nutch.io.SequenceFile$Reader.next(SequenceFile.java:301)
> at org.apache.nutch.io.SequenceFile$Reader.next(SequenceFile.java:323)
> at 
> org.apache.nutch.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:60)
> at 
> org.apache.nutch.segment.SegmentReader$InputFormat$1.next(SegmentReader.java:80)
> at org.apache.nutch.mapred.MapTask$2.next(MapTask.java:106)
> at org.apache.nutch.mapred.MapRunner.run(MapRunner.java:48)
> at org.apache.nutch.mapred.MapTask.run(MapTask.java:116)
> at 
> org.apache.nutch.mapred.TaskTracker$Child.main(TaskTracker.java:604)

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-171) Bring back multiple segment support for Generate / Update

2006-01-11 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-171?page=comments#action_12362507 ] 

Doug Cutting commented on NUTCH-171:


I'd like to hear more about why you want multiple segments, what's motivating 
this patch.  The 0.7 -numFetchers parameter was designed to permit distributed 
fetching.  With MapReduce the fetcher runs as a distributed map task, so the 
number of fetchers is now set to the number of map tasks.  The crawl db is 
updated with the output of all fetcher tasks in a single step, as you desire.

There may be something in the current implementation that is causing you 
problems; I'm just not yet sure what it is or why this is the solution.


> Bring back multiple segment support for Generate / Update
> -
>
>  Key: NUTCH-171
>  URL: http://issues.apache.org/jira/browse/NUTCH-171
>  Project: Nutch
> Type: Improvement
> Versions: 0.8-dev
> Reporter: Rod Taylor
> Priority: Minor
>  Attachments: multi_segment.patch
>
> We find it convenient to be able to run generate once for -topN 300M and have 
> multiple independent segments to work with (lower overhead) -- then run 
> update on all segments which succeeded simultaneously.
> This reactivates -numFetchers and fixes updatedb to handle multiple provided 
> segments again.
> Radu Mateescu wrote the attached patch for us with the below description 
> (lightly edited):
> The implementation of -numFetchers in 0.8 improperly plays with the number of 
> reduce tasks in order to generate a given number of fetch lists. Basically, 
> what it does is this: before the second reduce (map-reduce is applied twice 
> for generate), it sets the number of reduce tasks to numFetchers and ideally, 
> because each reduce will create a file like part-0, part-1, etc. in 
> the ndfs, we'll end up with the desired number of fetch lists. But this 
> behaviour is incorrect for the following reasons:
> 1. the number of reduce tasks is orthogonal to the number of segments 
> somebody wants to create. The number of reduce tasks should be chosen based 
> on the physical topology rather than the number of segments someone might 
> want in ndfs
> 2. if in nutch-site.xml you specify a value for the mapred.reduce.tasks 
> property, numFetchers seems to be ignored
>  
> Therefore, I changed this behaviour to work like this: 
>  - generate will create numFetchers segments
>  - each reduce task will write to all segments (assuming there are enough 
> values to be written) in a round-robin fashion (see the sketch after the 
> listings below)
> The end results for 3 reduce tasks and 2 segments will look like this:
>  
> /opt/nutch/bin>./nutch ndfs -ls segments
> 060111 17 parsing file:/opt/nutch/conf/nutch-default.xml
> 060111 18 parsing file:/opt/nutch/conf/nutch-site.xml
> 060111 18 Client connection to 192.168.0.1:5466: starting
> 060111 18 No FS indicated, using default:master:5466
> Found 2 items
> /user/root/segments/2006022144-0
> /user/root/segments/2006022144-1
>  
> /opt/nutch/bin>./nutch ndfs -ls segments/2006022144-0/crawl_generate
> 060111 122317 parsing file:/opt/nutch/conf/nutch-default.xml
> 060111 122317 parsing file:/opt/nutch/conf/nutch-site.xml
> 060111 122318 No FS indicated, using default:master:5466
> 060111 122318 Client connection to 192.168.0.1:5466: starting
> Found 3 items
> /user/root/segments/2006022144-0/crawl_generate/part-0  1276
> /user/root/segments/2006022144-0/crawl_generate/part-1  1289
> /user/root/segments/2006022144-0/crawl_generate/part-2  1858
>  
> /opt/nutch/bin>./nutch ndfs -ls segments/2006022144-1/crawl_generate
> 060111 122333 parsing file:/opt/nutch/conf/nutch-default.xml
> 060111 122334 parsing file:/opt/nutch/conf/nutch-site.xml
> 060111 122334 Client connection to 192.168.0.1:5466: starting
> 060111 122334 No FS indicated, using default:master:5466
> Found 3 items
> /user/root/segments/2006022144-1/crawl_generate/part-0  1207
> /user/root/segments/2006022144-1/crawl_generate/part-1  1236
> /user/root/segments/2006022144-1/crawl_generate/part-2  1841
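A minimal sketch of the round-robin write described above (illustrative, not Radu's patch):

    import java.io.*;
    import java.util.*;

    public class RoundRobinWriter {
      /** Cycles one reduce task's output across the numFetchers segments. */
      static void write(List entries, Writer[] segmentWriters)
          throws IOException {
        int next = 0;
        for (Iterator it = entries.iterator(); it.hasNext(); ) {
          segmentWriters[next].write(it.next() + "\n");
          next = (next + 1) % segmentWriters.length;   // on to the next segment
        }
      }
    }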

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Resolved: (NUTCH-102) jobtracker does not start when webapps is in src

2006-01-18 Thread Doug Cutting (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-102?page=all ]
 
Doug Cutting resolved NUTCH-102:


Resolution: Fixed

I just applied this patch.  Thanks, Owen.

> jobtracker does not start when webapps is in src
> 
>
>  Key: NUTCH-102
>  URL: http://issues.apache.org/jira/browse/NUTCH-102
>  Project: Nutch
> Type: Bug
> Versions: 0.8-dev
> Reporter: Stefan Groschupf
> Priority: Minor
>  Fix For: 0.8-dev
>  Attachments: webapps.patch
>
> When starting the jobtracker from NUTCH_HOME by 
> bin/nutch-daemon.sh start jobtracker
> The jobtracker searches for the webapps folder in NUTCH_HOME, but it is under 
> src/.
> When the webapps folder is manually copied into NUTCH_HOME, the jobtracker 
> starts without any problems. 
> Exception in thread "main" java.lang.NullPointerException
> at 
> org.apache.nutch.mapred.JobTrackerInfoServer.<init>(JobTrackerInfoServer.java:67)
> at org.apache.nutch.mapred.JobTracker.<init>(JobTracker.java:232)
> at org.apache.nutch.mapred.JobTracker.startTracker(JobTracker.java:43)
> at org.apache.nutch.mapred.JobTracker.main(JobTracker.java:1043)

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Closed: (NUTCH-179) Proposition: Enable Nutch to use a parser plugin not just based on content type

2006-01-19 Thread Doug Cutting (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-179?page=all ]
 
Doug Cutting closed NUTCH-179:
--

Resolution: Invalid

Closed at submitter's request.

> Proposition: Enable Nutch to use a parser plugin not just based on content 
> type
> ---
>
>  Key: NUTCH-179
>  URL: http://issues.apache.org/jira/browse/NUTCH-179
>  Project: Nutch
> Type: Improvement
>   Components: fetcher
> Versions: 0.8-dev
> Reporter: Gal Nitzan

>
> Sorry, please close this issue.
> I figured that if I set my parse plugin first, it will always be called first 
> and can then decide whether to parse or not.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Resolved: (NUTCH-177) Default installation seems to produce working entity of nutch

2006-01-19 Thread Doug Cutting (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-177?page=all ]
 
Doug Cutting resolved NUTCH-177:


Fix Version: 0.8-dev
 Resolution: Fixed

The problem is that your seed url does not end in a slash, yet your url filter 
requires a slash.  In 0.8-dev (aka trunk) this is fixed, since urls are 
normalized before filtering, which adds a slash after the hostname.
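A sketch of the normalization in question (illustrative, not the actual normalizer code):

    /** Adds the missing slash after a bare hostname, so that
        "http://apache.org" can match filters requiring "http://apache.org/". */
    static String addTrailingSlash(String url) {
      if (url.matches("https?://[^/]+")) {   // scheme and host, no path yet
        return url + "/";
      }
      return url;
    }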

> Default installation seems to produce working entity of nutch
> -
>
>  Key: NUTCH-177
>  URL: http://issues.apache.org/jira/browse/NUTCH-177
>  Project: Nutch
> Type: Bug
> Versions: 0.7.1
>  Environment: Linux SUSE 9.3
> Reporter: Matthias Günter
> Priority: Minor
>  Fix For: 0.8-dev
>  Attachments: crawl-urlfilter.txt, urllist.txt
>
> I downloaded 0.7.1 and installed it.
> Then changed crawl-urlfilter.txt for apache.org
> Then I added an urllist.txt  and tried scanning.
> Apparently the URL has been ignored, even though it matched the rule in 
> crawl-urlfilter.txt.
> [EMAIL PROTECTED]:~/workspace/lucene/nutch-0.7.1/bin> sh ./nutch crawl 
> ../../urllist.txt
> 060115 141534 parsing 
> file:/home/guenter/workspace/lucene/nutch-0.7.1/conf/nutch-default.xml
> 060115 141534 parsing 
> file:/home/guenter/workspace/lucene/nutch-0.7.1/conf/crawl-tool.xml
> 060115 141534 parsing 
> file:/home/guenter/workspace/lucene/nutch-0.7.1/conf/nutch-site.xml
> 060115 141534 No FS indicated, using default:local
> 060115 141534 crawl started in: crawl-20060115141534
> 060115 141534 rootUrlFile = ../../urllist.txt
> 060115 141534 threads = 10
> 060115 141534 depth = 5
> 060115 141535 Created webdb at 
> LocalFS,/home/guenter/workspace/lucene/nutch-0.7.1/bin/crawl-20060115141534/db
> 060115 141535 Starting URL processing
> 060115 141535 Plugins: looking in: 
> /home/guenter/workspace/lucene/nutch-0.7.1/plugins
> 060115 141535 not including: 
> /home/guenter/workspace/lucene/nutch-0.7.1/plugins/query-more
> 060115 141535 parsing: 
> /home/guenter/workspace/lucene/nutch-0.7.1/plugins/query-site/plugin.xml
> 060115 141535 impl: point=org.apache.nutch.searcher.QueryFilter 
> class=org.apache.nutch.searcher.site.SiteQueryFilter
> 060115 141535 parsing: 
> /home/guenter/workspace/lucene/nutch-0.7.1/plugins/parse-html/plugin.xml
> 060115 141535 impl: point=org.apache.nutch.parse.Parser 
> class=org.apache.nutch.parse.html.HtmlParser
> 060115 141535 parsing: 
> /home/guenter/workspace/lucene/nutch-0.7.1/plugins/parse-text/plugin.xml
> 060115 141535 impl: point=org.apache.nutch.parse.Parser 
> class=org.apache.nutch.parse.text.TextParser
> 060115 141535 not including: 
> /home/guenter/workspace/lucene/nutch-0.7.1/plugins/parse-ext
> 060115 141535 not including: 
> /home/guenter/workspace/lucene/nutch-0.7.1/plugins/parse-pdf
> 060115 141535 not including: 
> /home/guenter/workspace/lucene/nutch-0.7.1/plugins/parse-rss
> 060115 141535 parsing: 
> /home/guenter/workspace/lucene/nutch-0.7.1/plugins/query-basic/plugin.xml
> 060115 141535 impl: point=org.apache.nutch.searcher.QueryFilter 
> class=org.apache.nutch.searcher.basic.BasicQueryFilter
> 060115 141535 not including: 
> /home/guenter/workspace/lucene/nutch-0.7.1/plugins/index-more
> 060115 141535 not including: 
> /home/guenter/workspace/lucene/nutch-0.7.1/plugins/parse-js
> 060115 141535 parsing: 
> /home/guenter/workspace/lucene/nutch-0.7.1/plugins/urlfilter-regex/plugin.xml
> 060115 141535 impl: point=org.apache.nutch.net.URLFilter 
> class=org.apache.nutch.net.RegexURLFilter
> 060115 141535 not including: 
> /home/guenter/workspace/lucene/nutch-0.7.1/plugins/protocol-ftp
> 060115 141535 not including: 
> /home/guenter/workspace/lucene/nutch-0.7.1/plugins/parse-msword
> 060115 141535 not including: 
> /home/guenter/workspace/lucene/nutch-0.7.1/plugins/creativecommons
> 060115 141535 not including: 
> /home/guenter/workspace/lucene/nutch-0.7.1/plugins/ontology
> 060115 141535 parsing: 
> /home/guenter/workspace/lucene/nutch-0.7.1/plugins/nutch-extensionpoints/plugin.xml
> 060115 141535 not including: 
> /home/guenter/workspace/lucene/nutch-0.7.1/plugins/protocol-file
> 060115 141535 parsing: 
> /home/guenter/workspace/lucene/nutch-0.7.1/plugins/protocol-http/plugin.xml
> 060115 141535 impl: point=org.apache.nutch.protocol.Protocol 
> class=org.apache.nutch.protocol.http.Http
> 060115 141535 not including: 
> /home/guenter/workspace/lucene/nutch-0.7.1/plugins/clustering-carrot2
> 060115 141535 not including: 
> /home/guenter/workspace/lucene/nutch-0.7.1/plugins/language-identifier
> 060115 141535 not including: 
> /home/guenter/workspace/lucene/nutch-0.7.1/plugins/urlfilter-prefix
> 060115 141535 parsing: 
> /home/guenter/workspace/lucene/nutch-0.7.1/plugins/query-url/plugin.xml
> 060115 141535 impl: point=org.apache.nutch.searcher.QueryFilter 
> class=org.apache.nutch.searcher.url.URLQueryFilter
> 060115 141535 parsing: 
> /home/guenter/workspace/lucene/nutch-0.7.1/plugins/

[jira] Resolved: (NUTCH-176) Using -dir: creates an error, when the directory already exists

2006-01-19 Thread Doug Cutting (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-176?page=all ]
 
Doug Cutting resolved NUTCH-176:


Resolution: Won't Fix

This check is intentionally made to prevent folks from accidentally overwriting 
crawls.

> Using -dir: creates an error, when the directory already exists
> ---
>
>  Key: NUTCH-176
>  URL: http://issues.apache.org/jira/browse/NUTCH-176
>  Project: Nutch
> Type: Bug
> Versions: 0.7.1
>  Environment: SUSE Linux 9.3
> Reporter: Matthias Günter
> Priority: Minor

>
> In my opinion -dir should work even when the directory already exists.
> The error message is: 
> [EMAIL PROTECTED]:~/workspace/lucene/nutch-0.7.1/bin> sh ./nutch crawl 
> ../../urllist.txt  -dir tmpdir
> 060115 140500 parsing 
> file:/home/guenter/workspace/lucene/nutch-0.7.1/conf/nutch-default.xml
> 060115 140500 parsing 
> file:/home/guenter/workspace/lucene/nutch-0.7.1/conf/crawl-tool.xml
> 060115 140500 parsing 
> file:/home/guenter/workspace/lucene/nutch-0.7.1/conf/nutch-site.xml
> 060115 140500 No FS indicated, using default:local
> Exception in thread "main" java.lang.RuntimeException: tmpdir already exists.
> at org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:121)

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-136) mapreduce segment generator generates 50 % less than expected urls

2006-01-19 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-136?page=comments#action_12363308 ] 

Doug Cutting commented on NUTCH-136:


The mapred-default.xml file is actually the best place to set these.

> mapreduce segment generator generates 50 % less than expected urls
> 
>
>  Key: NUTCH-136
>  URL: http://issues.apache.org/jira/browse/NUTCH-136
>  Project: Nutch
> Type: Bug
> Versions: 0.8-dev
> Reporter: Stefan Groschupf
> Priority: Critical

>
> We noticed that segments generated with the map reduce segment generator 
> contain only 50 % of the expected urls. We had a crawldb with 40 000 urls 
> and the generate command only created a 20 000 page segment. This also 
> happened with the topN parameter; we got around 50 % of the 
> expected urls every time.
> I tested the PartitionUrlByHost and it looks like it does its work. However 
> we fixed the problem by changing two things:
> First we set the partition to a normal hashPartitioner.
> Second we changed Generator.java line 48:
> limit = job.getLong("crawl.topN",Long.MAX_VALUE)/job.getNumReduceTasks();
> to:
> limit = job.getLong("crawl.topN",Long.MAX_VALUE);
> Now it works as expected. 
> Does anyone have an idea what the real source of this problem might be?
> In general this bug has the effect that all map reduce users fetch only 50 
> % of their urls per iteration.  

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Updated: (NUTCH-136) mapreduce segment generator generates 50 % less than expected urls

2006-01-19 Thread Doug Cutting (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-136?page=all ]

Doug Cutting updated NUTCH-136:
---

Comment: was deleted

> mapreduce segment generator generates 50 % less than expected urls
> 
>
>  Key: NUTCH-136
>  URL: http://issues.apache.org/jira/browse/NUTCH-136
>  Project: Nutch
> Type: Bug
> Versions: 0.8-dev
> Reporter: Stefan Groschupf
> Priority: Critical

>
> We noticed that segments generated with the map reduce segment generator 
> contain only 50 % of the expected urls. We had a crawldb with 40 000 urls 
> and the generate command only created a 20 000 page segment. This also 
> happened with the topN parameter; we got around 50 % of the 
> expected urls every time.
> I tested the PartitionUrlByHost and it looks like it does its work. However 
> we fixed the problem by changing two things:
> First we set the partition to a normal hashPartitioner.
> Second we changed Generator.java line 48:
> limit = job.getLong("crawl.topN",Long.MAX_VALUE)/job.getNumReduceTasks();
> to:
> limit = job.getLong("crawl.topN",Long.MAX_VALUE);
> Now it works as expected. 
> Does anyone have an idea what the real source of this problem might be?
> In general this bug has the effect that all map reduce users fetch only 50 
> % of their urls per iteration.  

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-173) PerHost Crawling Policy ( crawl.ignore.external.links )

2006-01-19 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-173?page=comments#action_12363309 ] 

Doug Cutting commented on NUTCH-173:


Couldn't you instead use a prefix-urlfilter generated from your crawl seed?
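For example, the prefixes could be derived from the seed list with something like this sketch (illustrative; the exact output format depends on what urlfilter-prefix expects):

    import java.io.*;
    import java.net.*;

    public class SeedPrefixes {
      /** Emits one allowed URL prefix per seed, one per line. */
      public static void main(String[] args) throws Exception {
        BufferedReader seeds = new BufferedReader(new FileReader(args[0]));
        String line;
        while ((line = seeds.readLine()) != null) {
          URL url = new URL(line.trim());
          // one allowed prefix per seed host
          System.out.println(url.getProtocol() + "://" + url.getHost() + "/");
        }
      }
    }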

> PerHost Crawling Policy ( crawl.ignore.external.links )
> ---
>
>  Key: NUTCH-173
>  URL: http://issues.apache.org/jira/browse/NUTCH-173
>  Project: Nutch
> Type: New Feature
>   Components: fetcher
> Versions: 0.7.1, 0.7, 0.8-dev
> Reporter: Philippe EUGENE
> Priority: Minor
>  Attachments: patch.txt, patch08.txt
>
> There are two major ways of crawling in Nutch.
> Intranet crawl: forbid all, allow some few hosts
> Whole-web crawl: allow all, forbid a few things
> I propose a third type of crawl.
> Directory crawl: the purpose of this crawl is to manage a few thousand 
> hosts without managing rule patterns in UrlFilterRegexp.
> I made two patches, for 0.7/0.7.1 and 0.8-dev.
> I propose a new boolean property in nutch-site.xml: 
> crawl.ignore.external.links, with a default value of false.
> By default this new feature doesn't modify the behavior of the nutch crawler.
> When you set this property to true, the crawler doesn't fetch external links 
> of the host.
> So the crawl is limited to the hosts that you inject at the beginning of the 
> crawl.
> I know there are some proposals for a new crawl policy using the CrawlDatum 
> in the 0.8-dev branch. 
> This feature could be an easy way to quickly add a new crawl feature to 
> nutch, while waiting for a better way to improve crawl policy.
> I post two patches.
> Sorry for my very poor english 
> --
> Philippe

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-183) MapReduce has a series of problems concerning task-allocation to worker nodes

2006-01-21 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-183?page=comments#action_12363554 ] 

Doug Cutting commented on NUTCH-183:


Byron, that's exactly what Mike means by "speculative execution".

> MapReduce has a series of problems concerning task-allocation to worker nodes
> -
>
>  Key: NUTCH-183
>  URL: http://issues.apache.org/jira/browse/NUTCH-183
>  Project: Nutch
> Type: Improvement
>  Environment: All
> Reporter: Mike Cafarella
>  Attachments: jobtracker.patch
>
> The MapReduce JobTracker is not great at allocating tasks to TaskTracker 
> worker nodes.
> Here are the problems:
> 1) There is no speculative execution of tasks
> 2) Reduce tasks must wait until all map tasks are completed before doing any 
> work
> 3) TaskTrackers don't distinguish between Map and Reduce jobs.  Also, the 
> number of tasks at a single node is limited to some constant.  That means 
> you can get weird deadlock problems upon machine failure.  The reduces take 
> up all the available execution slots, but they don't do productive work, 
> because they're waiting for a map task to complete.  Of course, that map 
> task won't even be started until the reduce tasks finish, so you can see 
> the problem...
> 4) The JobTracker is so complicated that it's hard to fix any of these.
> The right solution is a rewrite of the JobTracker to be a lot more flexible 
> in task handling.  It has to be a lot simpler.  One way to make it simpler 
> is to add an abstraction I'll call "TaskInProgress".  Jobs are broken into 
> chunks called TasksInProgress.  All the TaskInProgress objects must be 
> complete, somehow, before the Job is complete.
> A single TaskInProgress can be executed by one or more Tasks.  TaskTrackers 
> are assigned Tasks.  If a Task fails, we report it back to the JobTracker, 
> where the TaskInProgress lives.  The TIP can then decide whether to launch 
> additional Tasks or not.
> Speculative execution is handled within the TIP.  It simply launches 
> multiple Tasks in parallel.  The TaskTrackers have no idea that these Tasks 
> are actually doing the same chunk of work.  The TIP is complete when any 
> one of its Tasks is complete.
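A minimal sketch of the TaskInProgress idea (illustrative only; the id scheme and method names are hypothetical):

    import java.util.*;

    /** One chunk of work; possibly several parallel attempts. */
    public class TaskInProgress {
      private final List taskIds = new ArrayList();   // launched attempts
      private boolean complete = false;

      /** Launches another attempt, e.g. speculatively on a second node. */
      public synchronized String launchTask() {
        String taskId = "task_" + taskIds.size();     // hypothetical id scheme
        taskIds.add(taskId);
        return taskId;                                // handed to a TaskTracker
      }

      /** Any one successful attempt completes the whole TIP. */
      public synchronized void taskSucceeded(String taskId) {
        complete = true;
      }

      public synchronized boolean isComplete() {
        return complete;
      }
    }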

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-187) Run Nutch on Windows without Cygwin

2006-01-25 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-187?page=comments#action_12363991 ] 

Doug Cutting commented on NUTCH-187:


Running outside of cygwin is not currently a priority.  To truly escape cygwin, 
all the scripts in bin/ would need to be replaced, ideally with portable Java, 
rather than maintaining two versions.  I would rather have a dependency on 
cygwin than have to maintain two versions of lots of stuff.  But mostly the 
scripts do things which cannot be done from Java.  So we're left with cygwin.

> Run Nutch on Windows without Cygwin
> ---
>
>  Key: NUTCH-187
>  URL: http://issues.apache.org/jira/browse/NUTCH-187
>  Project: Nutch
> Type: Improvement
>   Components: ndfs
> Versions: 0.8-dev
>  Environment: Windows
> Reporter: Dominik Friedrich
> Priority: Minor
>  Attachments: DF.diff
>
> Currently you cannot start Nutch datanodes on Windows outside of a cygwin 
> environment because it relies on the df command to read the free disk space.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-25 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12363996 ] 

Doug Cutting commented on NUTCH-139:


I think this is all easily handled by naming, and that we don't need another 
map.

We keep using "title" and "content-type" as examples, when these are actually 
not problematic, since nutch already has dedicated fields for them.  Can 
someone please provide some examples of where multiple values are actually 
needed, besides the need to accurately represent multi-valued http and smtp 
headers?  For title and content-type, the header in the metadata is what the 
parser and protocol found, respectively, and the field in the Parse and Content 
are what will be used.  What are some cases where a value needs to be 
"overridden" where using an X-nutch value as the authoritative value will not 
suffice?


> Standard metadata property names in the ParseData metadata
> --
>
>  Key: NUTCH-139
>  URL: http://issues.apache.org/jira/browse/NUTCH-139
>  Project: Nutch
> Type: Improvement
>   Components: fetcher
> Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, 
> although bug is independent of environment
> Reporter: Chris A. Mattmann
> Assignee: Chris A. Mattmann
> Priority: Minor
>  Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt, 
> NUTCH-139.jc.review.patch.txt
>
> Currently, people are free to name their string-based properties anything 
> that they want, such as having names of "Content-type", "content-TyPe", 
> "CONTENT_TYPE" all having the same meaning. Stefan G. I believe proposed a 
> solution in which all property names be converted to lower case, but in 
> essence this really only fixes half the problem right (the case of 
> identifying that "CONTENT_TYPE"
> and "conTeNT_TyPE" and all the permutations are really the same). What about
> if I named it "Content Type", or "ContentType"?
>  I propose that a way to correct this would be to create a standard set of 
> named Strings in the ParseData class that the protocol framework and the 
> parsing framework could use to identify common properties such as 
> "Content-type", "Creator", "Language", etc.
>  The properties would be defined at the top of the ParseData class, something 
> like:
>  public class ParseData{
>.
> public static final String CONTENT_TYPE = "content-type";
> public static final String CREATOR = "creator";
>
> }
> In this fashion, users could at least know the names of the standard 
> properties that they can obtain from the ParseData, for example by making 
> a call to ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to get the 
> content type, or a call to ParseData.getMetadata().set(ParseData.CONTENT_TYPE, 
> "text/xml"). Of course, this wouldn't preclude users from doing what they are 
> currently doing; it would just provide a standard method of obtaining some of 
> the more common, critical metadata without poring over the code base to 
> figure out what they are named.
> I'll contribute a patch near the end of this week, or the beginning of next 
> week, that addresses this issue.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-186) mapred-default.xml is over ridden by nutch-site.xml

2006-01-25 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-186?page=comments#action_12363998 ] 

Doug Cutting commented on NUTCH-186:


The config rules at present are:

1. All user-settable values should be in nutch-default.xml, as documentation 
that they exist.  Any other config will override this.  This file should not be 
altered by users.

2. nutch-site.xml is always loaded last, overriding all other options.  This is 
empty by default.

mapred-default.xml was added specifically to permit the specification of things 
that a job can override.

I think the fix that's needed here is documentation.  The documentation for 
these parameters should perhaps caution against putting them in nutch-site.xml, 
and point folks towards mapred-default.xml.

We might eventually move to a more complex configuration, where we break things 
into modules, each with three parts: base, default, final.  So there could be a 
mapred-base.xml that listed all of the settable mapred parameters.  Then the 
overridable default values could be set in mapred-default.xml.  And 
non-overridable values (e.g., the jobtracker host) could be specified in 
mapred-final.xml.
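For example (hypothetical values, following the rules above), an overridable reduce-task default belongs in mapred-default.xml, not nutch-site.xml:

    <!-- mapred-default.xml: jobs may still override this value. -->
    <property>
      <name>mapred.reduce.tasks</name>
      <value>7</value>
    </property>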

> mapred-default.xml is over ridden by nutch-site.xml
> ---
>
>  Key: NUTCH-186
>  URL: http://issues.apache.org/jira/browse/NUTCH-186
>  Project: Nutch
> Type: Bug
> Versions: 0.8-dev
>  Environment: All
> Reporter: Gal Nitzan
> Priority: Minor
>  Attachments: myBeautifulPatch.patch
>
> If mapred.map.tasks and mapred.reduce.tasks are defined in nutch-site.xml and 
> also in mapred-default.xml, the definitions from nutch-site.xml are those that 
> will take effect.
> So if a user mistakenly copies those entries into nutch-site.xml from 
> nutch-default.xml she will not understand what happens.
> I would like to propose removing these settings completely from 
> nutch-default.xml and putting them only in mapred-default.xml where they 
> belong.
> I will be happy to supply a patch for that if the proposition is accepted.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-26 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12364125 ] 

Doug Cutting commented on NUTCH-139:


I think we're near agreement here.

Here are the changes I think this patch still needs:

MetadataNames belongs in the protocol package, not util.

We should rename ContentProperties to Metadata.

We should add an add() method to Metadata, and change set() to replace all 
values rather than add a new value.  Protocol code which creates properties 
from headers should then use add().

We could commit after simply moving MetadataNames to protocol, and leave the 
changes to ContentProperties for another commit, but I'd prefer it all be done 
together.

Any objections to these changes?



> Standard metadata property names in the ParseData metadata
> --
>
>  Key: NUTCH-139
>  URL: http://issues.apache.org/jira/browse/NUTCH-139
>  Project: Nutch
> Type: Improvement
>   Components: fetcher
> Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, 
> although bug is independent of environment
> Reporter: Chris A. Mattmann
> Assignee: Chris A. Mattmann
> Priority: Minor
>  Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.060105.patch, NUTCH-139.Mattmann.patch.txt, 
> NUTCH-139.jc.review.patch.txt
>
> Currently, people are free to name their string-based properties anything 
> that they want, such as having names of "Content-type", "content-TyPe", 
> "CONTENT_TYPE" all having the same meaning. Stefan G. I believe proposed a 
> solution in which all property names be converted to lower case, but in 
> essence this really only fixes half the problem right (the case of 
> identifying that "CONTENT_TYPE"
> and "conTeNT_TyPE" and all the permutations are really the same). What about
> if I named it "Content Type", or "ContentType"?
>  I propose that a way to correct this would be to create a standard set of 
> named Strings in the ParseData class that the protocol framework and the 
> parsing framework could use to identify common properties such as 
> "Content-type", "Creator", "Language", etc.
>  The properties would be defined at the top of the ParseData class, something 
> like:
>  public class ParseData{
>.
> public static final String CONTENT_TYPE = "content-type";
> public static final String CREATOR = "creator";
>
> }
> In this fashion, users could at least know the names of the standard 
> properties that they can obtain from the ParseData, for example by making 
> a call to ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to get the 
> content type, or a call to ParseData.getMetadata().set(ParseData.CONTENT_TYPE, 
> "text/xml"). Of course, this wouldn't preclude users from doing what they are 
> currently doing; it would just provide a standard method of obtaining some of 
> the more common, critical metadata without poring over the code base to 
> figure out what they are named.
> I'll contribute a patch near the end of this week, or the beginning of next 
> week, that addresses this issue.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-59) meta data support in webdb

2006-01-26 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-59?page=comments#action_12364127 ] 

Doug Cutting commented on NUTCH-59:
---

This patch is to the 0.7 release and will not work in the current trunk.

Please see:

http://www.mail-archive.com/nutch-dev@lucene.apache.org/msg02140.html

and 

http://issues.apache.org/jira/browse/NUTCH-61

So extensible metadata should be added to CrawlDatum when a fix for NUTCH-61 is 
committed to trunk.


> meta data support in webdb
> --
>
>  Key: NUTCH-59
>  URL: http://issues.apache.org/jira/browse/NUTCH-59
>  Project: Nutch
> Type: New Feature
> Reporter: Stefan Groschupf
> Priority: Minor
>  Attachments: webDBMetaDataPatch.txt
>
> Meta data support in the web db would be very useful for a new set of nutch 
> features that need long-lived meta data. 
> Currently, page meta data needs to be regenerated or looked up every 30 days 
> when a page is re-fetched; long-lived web db meta data would bring a dramatic 
> performance improvement for such tasks.
> Furthermore, storage of meta data in the webdb would make a new generation of 
> linklist generation filters possible.  

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Updated: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-27 Thread Doug Cutting (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-139?page=all ]

Doug Cutting updated NUTCH-139:
---

Attachment: (was: NUTCH-139.jc.review.patch.txt)

> Standard metadata property names in the ParseData metadata
> --
>
>  Key: NUTCH-139
>  URL: http://issues.apache.org/jira/browse/NUTCH-139
>  Project: Nutch
> Type: Improvement
>   Components: fetcher
> Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, 
> although bug is independent of environment
> Reporter: Chris A. Mattmann
> Assignee: Chris A. Mattmann
> Priority: Minor
>  Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.060105.patch
>
> Currently, people are free to name their string-based properties anything 
> that they want, such as having names of "Content-type", "content-TyPe", 
> "CONTENT_TYPE" all having the same meaning. Stefan G. I believe proposed a 
> solution in which all property names be converted to lower case, but in 
> essence this really only fixes half the problem right (the case of 
> identifying that "CONTENT_TYPE"
> and "conTeNT_TyPE" and all the permutations are really the same). What about
> if I named it "Content Type", or "ContentType"?
>  I propose that a way to correct this would be to create a standard set of 
> named Strings in the ParseData class that the protocol framework and the 
> parsing framework could use to identify common properties such as 
> "Content-type", "Creator", "Language", etc.
>  The properties would be defined at the top of the ParseData class, something 
> like:
>  public class ParseData{
>.
> public static final String CONTENT_TYPE = "content-type";
> public static final String CREATOR = "creator";
>
> }
> In this fashion, users could at least know the names of the standard 
> properties that they can obtain from the ParseData, for example by making 
> a call to ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to get the 
> content type, or a call to ParseData.getMetadata().set(ParseData.CONTENT_TYPE, 
> "text/xml"). Of course, this wouldn't preclude users from doing what they are 
> currently doing; it would just provide a standard method of obtaining some of 
> the more common, critical metadata without poring over the code base to 
> figure out what they are named.
> I'll contribute a patch near the end of this week, or the beginning of next 
> week, that addresses this issue.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Updated: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-27 Thread Doug Cutting (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-139?page=all ]

Doug Cutting updated NUTCH-139:
---

Attachment: (was: NUTCH-139.Mattmann.patch.txt)

> Standard metadata property names in the ParseData metadata
> --
>
>  Key: NUTCH-139
>  URL: http://issues.apache.org/jira/browse/NUTCH-139
>  Project: Nutch
> Type: Improvement
>   Components: fetcher
> Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, 
> although bug is independent of environment
> Reporter: Chris A. Mattmann
> Assignee: Chris A. Mattmann
> Priority: Minor
>  Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.060105.patch
>
> Currently, people are free to name their string-based properties anything 
> that they want, such as having names of "Content-type", "content-TyPe", 
> "CONTENT_TYPE" all having the same meaning. Stefan G. I believe proposed a 
> solution in which all property names be converted to lower case, but in 
> essence this really only fixes half the problem right (the case of 
> identifying that "CONTENT_TYPE"
> and "conTeNT_TyPE" and all the permutations are really the same). What about
> if I named it "Content Type", or "ContentType"?
>  I propose that a way to correct this would be to create a standard set of 
> named Strings in the ParseData class that the protocol framework and the 
> parsing framework could use to identify common properties such as 
> "Content-type", "Creator", "Language", etc.
>  The properties would be defined at the top of the ParseData class, something 
> like:
>  public class ParseData{
>.
> public static final String CONTENT_TYPE = "content-type";
> public static final String CREATOR = "creator";
>
> }
> In this fashion, users could at least know the names of the standard 
> properties that they can obtain from the ParseData, for example by making 
> a call to ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to get the 
> content type, or a call to ParseData.getMetadata().set(ParseData.CONTENT_TYPE, 
> "text/xml"). Of course, this wouldn't preclude users from doing what they are 
> currently doing; it would just provide a standard method of obtaining some of 
> the more common, critical metadata without poring over the code base to 
> figure out what they are named.
> I'll contribute a patch near the end of this week, or the beginning of next 
> week, that addresses this issue.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-01-27 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12364242 ] 

Doug Cutting commented on NUTCH-139:


I was confused about which was the latest version.  (I deleted the older 
versions.  Is there a way to simply mark them obsolete?)

So, if Metadata and MetadataNames are moved from util into a metadata package 
(as suggested by Andrzej) then I am +1.

I don't see why we need separate subclasses of Metadata for content and parses. 
 Separate instances, yes, and we already have these, no?

Sorry for my confusion.

> Standard metadata property names in the ParseData metadata
> --
>
>  Key: NUTCH-139
>  URL: http://issues.apache.org/jira/browse/NUTCH-139
>  Project: Nutch
> Type: Improvement
>   Components: fetcher
> Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, 
> although bug is independent of environment
> Reporter: Chris A. Mattmann
> Assignee: Chris A. Mattmann
> Priority: Minor
>  Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.060105.patch
>
> Currently, people are free to name their string-based properties anything 
> they want, so names like "Content-type", "content-TyPe", and "CONTENT_TYPE" 
> all carry the same meaning. Stefan G., I believe, proposed a solution in 
> which all property names would be converted to lower case, but in essence 
> that only fixes half the problem (identifying that "CONTENT_TYPE", 
> "conTeNT_TyPE", and all the other case permutations are really the same). 
> What if I named it "Content Type", or "ContentType"?
> I propose that a way to correct this would be to create a standard set of 
> named Strings in the ParseData class that the protocol framework and the 
> parsing framework could use to identify common properties such as 
> "Content-type", "Creator", "Language", etc.
> The properties would be defined at the top of the ParseData class, something 
> like:
> public class ParseData {
>   ...
>   public static final String CONTENT_TYPE = "content-type";
>   public static final String CREATOR = "creator";
>   ...
> }
> In this fashion, users could at least know the names of the standard 
> properties they can obtain from the ParseData, for example by making a call 
> to ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to get the content 
> type, or to ParseData.getMetadata().set(ParseData.CONTENT_TYPE, "text/xml") 
> to set it. Of course, this wouldn't preclude users from doing what they are 
> currently doing; it would just provide a standard way of obtaining some of 
> the more common, critical metadata without poring over the code base to 
> figure out what it is named.
> I'll contribute a patch near the end of this week, or the beginning of next 
> week, that addresses this issue.




[jira] Created: (NUTCH-193) move NDFS and MapReduce to a separate project

2006-01-31 Thread Doug Cutting (JIRA)
move NDFS and MapReduce to a separate project
-

 Key: NUTCH-193
 URL: http://issues.apache.org/jira/browse/NUTCH-193
 Project: Nutch
Type: Task
  Components: ndfs  
Versions: 0.8-dev
Reporter: Doug Cutting
 Assigned to: Doug Cutting 
 Fix For: 0.8-dev


The NDFS and MapReduce code should move from Nutch to a new Lucene sub-project 
named Hadoop.

My plan is to do this as follows:

1. Move all code in the following packages from Nutch to Hadoop:

org.apache.nutch.fs
org.apache.nutch.io
org.apache.nutch.ipc
org.apache.nutch.mapred
org.apache.nutch.ndfs

These packages will all be renamed to org.apache.hadoop, and Nutch code will be 
updated to reflect this.

2. Move selected classes from Nutch to Hadoop, as follows:

org.apache.nutch.util.NutchConf -> org.apache.hadoop.conf.Configuration
org.apache.nutch.util.NutchConfigurable -> org.apache.hadoop.Configurable 
org.apache.nutch.util.NutchConfigured -> org.apache.hadoop.Configured

org.apache.nutch.util.Progress -> org.apache.hadoop.util.Progress
org.apache.nutch.util.LogFormatter -> org.apache.hadoop.util.LogFormatter
org.apache.nutch.util.Daemon -> org.apache.hadoop.util.Daemon

3. Add a jar containing all of the above to Nutch's lib directory.

Does this plan sound reasonable?
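
For illustration, a typical Nutch source file's imports would change like this 
under the plan (a sketch based only on the mappings listed above; SequenceFile 
is just an example class from the io package):

// Before the split:
import org.apache.nutch.io.SequenceFile;
import org.apache.nutch.util.NutchConf;

// After the split:
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.conf.Configuration;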





[jira] Commented: (NUTCH-193) move NDFS and MapReduce to a separate project

2006-01-31 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-193?page=comments#action_12364657 ] 

Doug Cutting commented on NUTCH-193:


NDFS, the Nutch Distributed Filesystem, will be renamed HDFS, the Hadoop 
Distributed Filesystem.  Its code will live in the package 
org.apache.nutch.dfs, and its fs implementation class will be named 
DistributedFileSystem.

> move NDFS and MapReduce to a separate project
> -
>
>  Key: NUTCH-193
>  URL: http://issues.apache.org/jira/browse/NUTCH-193
>  Project: Nutch
> Type: Task
>   Components: ndfs
> Versions: 0.8-dev
> Reporter: Doug Cutting
> Assignee: Doug Cutting
>  Fix For: 0.8-dev

>
> The NDFS and MapReduce code should move from Nutch to a new Lucene 
> sub-project named Hadoop.
> My plan is to do this as follows:
> 1. Move all code in the following packages from Nutch to Hadoop:
> org.apache.nutch.fs
> org.apache.nutch.io
> org.apache.nutch.ipc
> org.apache.nutch.mapred
> org.apache.nutch.ndfs
> These packages will all be renamed to org.apache.hadoop, and Nutch code will 
> be updated to reflect this.
> 2. Move selected classes from Nutch to Hadoop, as follows:
> org.apache.nutch.util.NutchConf -> org.apache.hadoop.conf.Configuration
> org.apache.nutch.util.NutchConfigurable -> org.apache.hadoop.Configurable 
> org.apache.nutch.util.NutchConfigured -> org.apache.hadoop.Configured
> org.apache.nutch.util.Progress -> org.apache.hadoop.util.Progress
> org.apache.nutch.util.LogFormatter -> org.apache.hadoop.util.LogFormatter
> org.apache.nutch.util.Daemon -> org.apache.hadoop.util.Daemon
> 3. Add a jar containing all of the above to Nutch's lib directory.
> Does this plan sound reasonable?




[jira] Commented: (NUTCH-193) move NDFS and MapReduce to a separate project

2006-01-31 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-193?page=comments#action_12364665 ] 

Doug Cutting commented on NUTCH-193:


Andrzej: I'd like to do this soon, this week or next.  No matter how long I 
wait, there will probably always be a few patches queued that will need to be 
updated.  But hopefully we can avoid large patches like NUTCH-169.  What other 
patches are you concerned about in particular?

Sami: yes, the fuse stuff would then make a great hadoop contrib package.






> move NDFS and MapReduce to a separate project
> -
>
>  Key: NUTCH-193
>  URL: http://issues.apache.org/jira/browse/NUTCH-193
>  Project: Nutch
> Type: Task
>   Components: ndfs
> Versions: 0.8-dev
> Reporter: Doug Cutting
> Assignee: Doug Cutting
>  Fix For: 0.8-dev

>
> The NDFS and MapReduce code should move from Nutch to a new Lucene 
> sub-project named Hadoop.
> My plan is to do this as follows:
> 1. Move all code in the following packages from Nutch to Hadoop:
> org.apache.nutch.fs
> org.apache.nutch.io
> org.apache.nutch.ipc
> org.apache.nutch.mapred
> org.apache.nutch.ndfs
> These packages will all be renamed to org.apache.hadoop, and Nutch code will 
> be updated to reflect this.
> 2. Move selected classes from Nutch to Hadoop, as follows:
> org.apache.nutch.util.NutchConf -> org.apache.hadoop.conf.Configuration
> org.apache.nutch.util.NutchConfigurable -> org.apache.hadoop.Configurable 
> org.apache.nutch.util.NutchConfigured -> org.apache.hadoop.Configured
> org.apache.nutch.util.Progress -> org.apache.hadoop.util.Progress
> org.apache.nutch.util.LogFormatter -> org.apache.hadoop.util.LogFormatter
> org.apache.nutch.util.Daemon -> org.apache.hadoop.util.Daemon
> 3. Add a jar containing all of the above to Nutch's lib directory.
> Does this plan sound reasonable?




[jira] Commented: (NUTCH-192) meta data support for CrawlDatum

2006-01-31 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12364674 ] 

Doug Cutting commented on NUTCH-192:


I agree that Writable is probably overkill and that strings should be sufficient.

A mapping dictionary would save a lot of space, even with strings.  This could 
be a useful optimization, but should be left until after the initial (less 
optimized) addition of metadata to CrawlDatum.
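
A minimal sketch of what such a mapping dictionary could look like. The wire 
format (a -1 marker followed by the key string on first use, a small index 
afterwards) and the class name are inventions for illustration, not Nutch code:

import java.io.DataOutput;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

class KeyDictionary {
  private final Map<String, Integer> ids = new HashMap<String, Integer>();

  // Writes each distinct key string once per stream; later occurrences
  // are written as a compact integer index instead.
  void writeKey(String key, DataOutput out) throws IOException {
    Integer id = ids.get(key);
    if (id == null) {
      ids.put(key, Integer.valueOf(ids.size()));
      out.writeInt(-1);             // marker: a new key string follows
      out.writeUTF(key);
    } else {
      out.writeInt(id.intValue());  // known key: index only
    }
  }
}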

> meta data support for CrawlDatum
> 
>
>  Key: NUTCH-192
>  URL: http://issues.apache.org/jira/browse/NUTCH-192
>  Project: Nutch
> Type: Improvement
> Versions: 0.8-dev
> Reporter: Stefan Groschupf
>  Fix For: 0.8-dev
>  Attachments: metadata300106.patch
>
> Supporting meta data in CrawlDatum would help realize a set of new Nutch 
> features, and would make a lot possible for smaller, specially focused 
> search engines.




[jira] Commented: (NUTCH-191) InputFormat used in job must be in JobTracker classpath (not loaded from job JAR)

2006-01-31 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-191?page=comments#action_12364678 ] 

Doug Cutting commented on NUTCH-191:


We've thus far avoided loading job-specific code in the JobTracker and 
TaskTracker, in order to keep them more reliable.  File splitting is performed 
by the job tracker, so if you're overriding InputFormat.getSplits(), then 
fixing this is harder.  But if you're simply overriding getRecordReader(), 
this should be easier to fix: one could move getSplits() to a separate 
interface, so that InputFormat itself is needed only by the TaskTracker and 
can be loaded from the job jar.  If this is important to you, please submit a 
patch to this effect.
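
A rough sketch of that interface split, under stated assumptions: the 
interface names are invented here, and the signatures are abbreviated 
versions of the mapred ones, not the real API:

import java.io.IOException;
import org.apache.nutch.mapred.*;   // FileSplit, RecordReader, JobConf

// Computed on the JobTracker, so implementations must be on its classpath.
interface SplitProvider {
  FileSplit[] getSplits(JobConf job, int numSplits) throws IOException;
}

// Used only on the TaskTracker, so implementations could be loaded from
// the submitted job jar, as Mappers and Reducers already are.
interface TaskInputFormat {
  RecordReader getRecordReader(FileSplit split, JobConf job) throws IOException;
}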

> InputFormat used in job must be in JobTracker classpath (not loaded from job 
> JAR)
> -
>
>  Key: NUTCH-191
>  URL: http://issues.apache.org/jira/browse/NUTCH-191
>  Project: Nutch
> Type: Bug
> Versions: 0.8-dev
>  Environment: ~20 node nutch mapreduce environment, running SVN trunk, on 
> Linux
> Reporter: Bryan Pendleton
> Priority: Minor

>
> During development, I've been creating/tweaking custom InputFormat 
> implementations. However, when you try to run a job against a running 
> cluster, you get:
>   Exception in thread "main" java.io.IOException: java.lang.RuntimeException: 
> java.lang.RuntimeException: java.lang.ClassNotFoundException: 
> my.custom.InputFormat
>   at org.apache.nutch.ipc.Client.call(Client.java:294)
>   at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127)
>   at $Proxy0.submitJob(Unknown Source)
>   at org.apache.nutch.mapred.JobClient.submitJob(JobClient.java:259)
>   at org.apache.nutch.mapred.JobClient.runJob(JobClient.java:288)
>   at com.parc.uir.wikipedia.WikipediaJob.main(WikipediaJob.java:85)
> This error goes away if I restart the TaskTrackers/JobTracker with a 
> classpath which includes the needed code. Other classes (Mapper, Reducer) 
> appear to be available out of the jar file specified in the JobConf, but not 
> the InputFormat. Obviously, it's less than ideal to have to restart the 
> JobTracker whenever there's a change to a job-specific class.




[jira] Commented: (NUTCH-193) move NDFS and MapReduce to a separate project

2006-01-31 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-193?page=comments#action_12364690 ] 

Doug Cutting commented on NUTCH-193:


Otis: yes, thanks, I meant org.apache.hadoop.dfs.

Andrzej: I'm awaiting Mike's commit of NUTCH-183, which should happen today.  
I'll then try to make the split tomorrow.

> move NDFS and MapReduce to a separate project
> -
>
>  Key: NUTCH-193
>  URL: http://issues.apache.org/jira/browse/NUTCH-193
>  Project: Nutch
> Type: Task
>   Components: ndfs
> Versions: 0.8-dev
> Reporter: Doug Cutting
> Assignee: Doug Cutting
>  Fix For: 0.8-dev

>
> The NDFS and MapReduce code should move from Nutch to a new Lucene 
> sub-project named Hadoop.
> My plan is to do this as follows:
> 1. Move all code in the following packages from Nutch to Hadoop:
> org.apache.nutch.fs
> org.apache.nutch.io
> org.apache.nutch.ipc
> org.apache.nutch.mapred
> org.apache.nutch.ndfs
> These packages will all be renamed to org.apache.hadoop, and Nutch code will 
> be updated to reflect this.
> 2. Move selected classes from Nutch to Hadoop, as follows:
> org.apache.nutch.util.NutchConf -> org.apache.hadoop.conf.Configuration
> org.apache.nutch.util.NutchConfigurable -> org.apache.hadoop.Configurable 
> org.apache.nutch.util.NutchConfigured -> org.apache.hadoop.Configured
> org.apache.nutch.util.Progress -> org.apache.hadoop.util.Progress
> org.apache.nutch.util.LogFormatter -> org.apache.hadoop.util.LogFormatter
> org.apache.nutch.util.Daemon -> org.apache.hadoop.util.Daemon
> 3. Add a jar containing all of the above to Nutch's lib directory.
> Does this plan sound reasonable?




[jira] Commented: (NUTCH-196) lib-xml and lib-log4j plugins

2006-02-01 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-196?page=comments#action_12364840 ] 

Doug Cutting commented on NUTCH-196:


I think we should try to limit Nutch's core code to a single XML parser and 
logging API.  I chose those built into the JDK when starting out.  Do you 
propose that we move core code to different APIs?  Apache's commons-logging 
might be a better choice for a standard logging API.
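
For comparison, a call site under commons-logging would look something like 
this (the commons-logging API shown is standard; the Fetcher class name is 
only an example):

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class Fetcher {
  // One logger per class, obtained from the commons-logging factory.
  private static final Log LOG = LogFactory.getLog(Fetcher.class);

  public void run() {
    LOG.info("fetch starting");
  }
}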

> lib-xml and lib-log4j plugins
> -
>
>  Key: NUTCH-196
>  URL: http://issues.apache.org/jira/browse/NUTCH-196
>  Project: Nutch
> Type: Improvement
> Versions: 0.8-dev
> Reporter: Andrzej Bialecki 
> Assignee: Andrzej Bialecki 

>
> Many places in Nutch use XML. Parsing XML using the JDK API is painful. I 
> propose to add one (or more) library plugins with JDOM, DOM4J, Jaxen, etc. 
> This should simplify the current deployment, and help plugin writers to use 
> the existing API.
> Similarly, many plugins use log4j. Either we add it to the /lib, or we could 
> create a lib-log4j plugin.




[jira] Resolved: (NUTCH-197) NullPointerException in TaskRunner if application jar does not have "lib" directory

2006-02-01 Thread Doug Cutting (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-197?page=all ]
 
Doug Cutting resolved NUTCH-197:


Fix Version: 0.8-dev
 Resolution: Fixed

I just committed this.  Thanks, Owen!

> NullPointerException in TaskRunner if application jar does not have "lib" 
> directory
> ---
>
>  Key: NUTCH-197
>  URL: http://issues.apache.org/jira/browse/NUTCH-197
>  Project: Nutch
> Type: Bug
> Versions: 0.8-dev
>  Environment: linux 2.6, java 1.4.2
> Reporter: Owen O'Malley
> Priority: Minor
>  Fix For: 0.8-dev
>  Attachments: mapred_jar.patch
>
> When running a map/reduce application from a jar file, if the jar file does 
> not have a "lib" directory the job dies with a NullPointerException.
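
The trap here is likely java.io.File.listFiles(), which returns null when the 
directory is absent. A minimal sketch of the kind of guard the attached patch 
implies (names and structure are illustrative, not the actual TaskRunner code):

import java.io.File;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

class LibDirGuard {
  // Collect lib/* URLs from an unpacked job jar, tolerating the case
  // where the jar contains no "lib" directory at all.
  static List<URL> libUrls(File unpackedJobDir) throws Exception {
    List<URL> urls = new ArrayList<URL>();
    File[] libs = new File(unpackedJobDir, "lib").listFiles();
    if (libs != null) {               // listFiles() is null here: the NPE
      for (int i = 0; i < libs.length; i++) {
        urls.add(libs[i].toURI().toURL());
      }
    }
    return urls;
  }
}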




[jira] Commented: (NUTCH-192) meta data support for CrawlDatum

2006-02-01 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12364923 ] 

Doug Cutting commented on NUTCH-192:


I'm worried that this will substantially slow things.

I'd like to see some effort made to ensure that:

1. If no metadata is used, then no MapWritable's should be allocated.

2. If readFields() is called repeatedly on a single CrawlDatum instance, as few 
new objects as possible should be allocated.  If MapWritable were to extend 
HashMap rather than wrap it, and MapWritable.readFields() first called clear(), 
then the HashMap's entry table could be reused.  Better yet would be to try to 
reuse the entries in the table.  If an entry exists with the same classes, then 
it and its key and value instances could be reused.  This optimization would 
require the use of a more extensible HashMap, perhaps like that in Jakarta 
Commons Collections.  Alternately, one could use a linked list instead of a 
HashMap, which should be plenty fast for things this size.

If an entry were defined as:

class Entry {
  Writable key;
  Writable value;
  Entry next;
}

Then MapWritable could have fields:
  Entry first;
  Entry last;
  Entry old;

clear() would set old=first and first=last=null.
allocateEntry(Class keyClass, Class valueClass) would scan old, splicing out 
and returning the first entry whose classes match these.  If none is found then 
a new entry would be allocated.
readFields() would first identify each key and value class, call 
allocateEntry(), then call entry.key.readFields() and entry.value.readFields() 
and finally set last.next=entry and last=entry.

Also, why does MapWritable.write() create a DataOutputBuffer?  It should just 
write to out.
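
To make the recycling scheme concrete, here is a self-contained sketch. The 
wire format (a count, then a class name and fields per pair) and the helpers 
are assumptions for illustration; the real MapWritable would resolve classes 
through its own naming scheme:

import java.io.DataInput;
import java.io.IOException;

// Local stand-in for the real Writable interface, so the sketch compiles alone.
interface Writable {
  void readFields(DataInput in) throws IOException;
}

class MapWritableSketch {
  static class Entry { Writable key, value; Entry next; }

  Entry first, last, old;

  // Move the live list onto 'old' so its entries can be reclaimed.
  void clear() { old = first; first = last = null; }

  // Splice out and return the first old entry whose classes match;
  // allocate a fresh entry only when nothing is reusable.
  Entry allocateEntry(Class<?> keyClass, Class<?> valueClass) throws IOException {
    for (Entry e = old, prev = null; e != null; prev = e, e = e.next) {
      if (e.key.getClass() == keyClass && e.value.getClass() == valueClass) {
        if (prev == null) old = e.next; else prev.next = e.next;
        e.next = null;
        return e;
      }
    }
    try {
      Entry e = new Entry();
      e.key = (Writable) keyClass.getDeclaredConstructor().newInstance();
      e.value = (Writable) valueClass.getDeclaredConstructor().newInstance();
      return e;
    } catch (ReflectiveOperationException ex) {
      throw new IOException("cannot instantiate entry", ex);
    }
  }

  public void readFields(DataInput in) throws IOException {
    clear();
    int size = in.readInt();
    for (int i = 0; i < size; i++) {
      try {
        Class<?> keyClass = Class.forName(in.readUTF());
        Class<?> valueClass = Class.forName(in.readUTF());
        Entry entry = allocateEntry(keyClass, valueClass);
        entry.key.readFields(in);    // reused instances are filled in place
        entry.value.readFields(in);
        if (last == null) { first = last = entry; }
        else { last.next = entry; last = entry; }
      } catch (ClassNotFoundException ex) {
        throw new IOException("unknown class", ex);
      }
    }
  }
}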




> meta data support for CrawlDatum
> 
>
>  Key: NUTCH-192
>  URL: http://issues.apache.org/jira/browse/NUTCH-192
>  Project: Nutch
> Type: Improvement
> Versions: 0.8-dev
> Reporter: Stefan Groschupf
>  Fix For: 0.8-dev
>  Attachments: metadata010206.patch, metadata300106.patch, metadata310106.patch
>
> Supporting meta data in CrawlDatum would help realize a set of new Nutch 
> features, and would make a lot possible for smaller, specially focused 
> search engines.




[jira] Commented: (NUTCH-198) SWF parser

2006-02-02 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-198?page=comments#action_12364983 ] 

Doug Cutting commented on NUTCH-198:


+1

> SWF parser
> --
>
>  Key: NUTCH-198
>  URL: http://issues.apache.org/jira/browse/NUTCH-198
>  Project: Nutch
> Type: New Feature
>   Components: fetcher
> Versions: 0.8-dev
> Reporter: Andrzej Bialecki 
> Assignee: Andrzej Bialecki 
>  Attachments: parse-swf.zip
>
> This is a parser for Flash SWF files. It uses the JavaSWF2 library (BSD 
> license), and uses some heuristics to extract as much text as possible 
> (including potential links) from ActionScript sections.
> If there are no objections, I'd like to add it soon.




[jira] Commented: (NUTCH-193) move NDFS and MapReduce to a separate project

2006-02-03 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-193?page=comments#action_12365087 ] 

Doug Cutting commented on NUTCH-193:


The name my kid gave a stuffed yellow elephant.  Short, relatively easy to 
spell and pronounce, meaningless, and not used elsewhere: those are my naming 
criteria.  Kids are good at generating such.  Googol is a kid's term.

> move NDFS and MapReduce to a separate project
> -
>
>  Key: NUTCH-193
>  URL: http://issues.apache.org/jira/browse/NUTCH-193
>  Project: Nutch
> Type: Task
>   Components: ndfs
> Versions: 0.8-dev
> Reporter: Doug Cutting
> Assignee: Doug Cutting
>  Fix For: 0.8-dev

>
> The NDFS and MapReduce code should move from Nutch to a new Lucene 
> sub-project named Hadoop.
> My plan is to do this as follows:
> 1. Move all code in the following packages from Nutch to Hadoop:
> org.apache.nutch.fs
> org.apache.nutch.io
> org.apache.nutch.ipc
> org.apache.nutch.mapred
> org.apache.nutch.ndfs
> These packages will all be renamed to org.apache.hadoop, and Nutch code will 
> be updated to reflect this.
> 2. Move selected classes from Nutch to Hadoop, as follows:
> org.apache.nutch.util.NutchConf -> org.apache.hadoop.conf.Configuration
> org.apache.nutch.util.NutchConfigurable -> org.apache.hadoop.Configurable 
> org.apache.nutch.util.NutchConfigured -> org.apache.hadoop.Configured
> org.apache.nutch.util.Progress -> org.apache.hadoop.util.Progress
> org.apache.nutch.util.LogFormatter -> org.apache.hadoop.util.LogFormatter
> org.apache.nutch.util.Daemon -> org.apache.hadoop.util.Daemon
> 3. Add a jar containing all of the above to Nutch's lib directory.
> Does this plan sound reasonable?




[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-02-03 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12365089 ] 

Doug Cutting commented on NUTCH-139:


Jerome: yes, it makes sense, but there's also metadata that's not tightly 
related to the protocol or the parser, e.g., the nutch segment that the page 
was fetched into and the score that's been assigned to the url.  I think we'd 
go crazy trying to divide the metadata up into categories, and that there's not 
much harm in stuffing it all in one bag.


> Standard metadata property names in the ParseData metadata
> --
>
>  Key: NUTCH-139
>  URL: http://issues.apache.org/jira/browse/NUTCH-139
>  Project: Nutch
> Type: Improvement
>   Components: fetcher
> Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, 
> although bug is independent of environment
> Reporter: Chris A. Mattmann
> Assignee: Chris A. Mattmann
> Priority: Minor
>  Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.060105.patch
>
> Currently, people are free to name their string-based properties anything 
> they want, so names like "Content-type", "content-TyPe", and "CONTENT_TYPE" 
> all carry the same meaning. Stefan G., I believe, proposed a solution in 
> which all property names would be converted to lower case, but in essence 
> that only fixes half the problem (identifying that "CONTENT_TYPE", 
> "conTeNT_TyPE", and all the other case permutations are really the same). 
> What if I named it "Content Type", or "ContentType"?
> I propose that a way to correct this would be to create a standard set of 
> named Strings in the ParseData class that the protocol framework and the 
> parsing framework could use to identify common properties such as 
> "Content-type", "Creator", "Language", etc.
> The properties would be defined at the top of the ParseData class, something 
> like:
> public class ParseData {
>   ...
>   public static final String CONTENT_TYPE = "content-type";
>   public static final String CREATOR = "creator";
>   ...
> }
> In this fashion, users could at least know the names of the standard 
> properties they can obtain from the ParseData, for example by making a call 
> to ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to get the content 
> type, or to ParseData.getMetadata().set(ParseData.CONTENT_TYPE, "text/xml") 
> to set it. Of course, this wouldn't preclude users from doing what they are 
> currently doing; it would just provide a standard way of obtaining some of 
> the more common, critical metadata without poring over the code base to 
> figure out what it is named.
> I'll contribute a patch near the end of this week, or the beginning of next 
> week, that addresses this issue.




[jira] Commented: (NUTCH-193) move NDFS and MapReduce to a separate project

2006-02-03 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-193?page=comments#action_12365130 ] 

Doug Cutting commented on NUTCH-193:


Okay, I've moved the code from Nutch to Hadoop.  Now I need to repair Nutch so 
that it still works!

One remaining problem is the need to separate nutch config files from hadoop 
config files.  There's now a hadoop-default.xml and hadoop-site.xml, which are 
separate from the similarly-named nutch files.  For now, I'll fix this by 
adding the following methods to Hadoop's Configuration class:

void addDefaultResource(String name);
void addFinalResource(String name);

Then add a Nutch utility class like:

public class NutchConfiguration {
  public static Configuration create() {
    Configuration conf = new Configuration();
    return addNutchResources(conf);
  }
  public static Configuration addNutchResources(Configuration conf) {
    conf.addDefaultResource("nutch-default.xml");
    conf.addFinalResource("nutch-site.xml");
    return conf;
  }
}

Then all of the places which currently call 'new NutchConf()' can be replaced 
with 'NutchConfiguration.create()'.

Longer-term we might consider a more radical re-design of the configuration 
API.  But first we need to get Hadoop and Nutch split.
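
At call sites the change would then be mechanical; a before/after sketch:

// Before: configuration loaded by the old Nutch class.
NutchConf conf = new NutchConf();

// After: a Hadoop Configuration with the Nutch resources layered on top.
Configuration conf = NutchConfiguration.create();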





> move NDFS and MapReduce to a separate project
> -
>
>  Key: NUTCH-193
>  URL: http://issues.apache.org/jira/browse/NUTCH-193
>  Project: Nutch
> Type: Task
>   Components: ndfs
> Versions: 0.8-dev
> Reporter: Doug Cutting
> Assignee: Doug Cutting
>  Fix For: 0.8-dev

>
> The NDFS and MapReduce code should move from Nutch to a new Lucene 
> sub-project named Hadoop.
> My plan is to do this as follows:
> 1. Move all code in the following packages from Nutch to Hadoop:
> org.apache.nutch.fs
> org.apache.nutch.io
> org.apache.nutch.ipc
> org.apache.nutch.mapred
> org.apache.nutch.ndfs
> These packages will all be renamed to org.apache.hadoop, and Nutch code will 
> be updated to reflect this.
> 2. Move selected classes from Nutch to Hadoop, as follows:
> org.apache.nutch.util.NutchConf -> org.apache.hadoop.conf.Configuration
> org.apache.nutch.util.NutchConfigurable -> org.apache.hadoop.Configurable 
> org.apache.nutch.util.NutchConfigured -> org.apache.hadoop.Configured
> org.apache.nutch.util.Progress -> org.apache.hadoop.util.Progress
> org.apache.nutch.util.LogFormatter -> org.apache.hadoop.util.LogFormatter
> org.apache.nutch.util.Daemon -> org.apache.hadoop.util.Daemon
> 3. Add a jar containing all of the above to Nutch's lib directory.
> Does this plan sound reasonable?




[jira] Resolved: (NUTCH-193) move NDFS and MapReduce to a separate project

2006-02-03 Thread Doug Cutting (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-193?page=all ]
 
Doug Cutting resolved NUTCH-193:


Resolution: Fixed

I just committed this.  Phew!

> move NDFS and MapReduce to a separate project
> -
>
>  Key: NUTCH-193
>  URL: http://issues.apache.org/jira/browse/NUTCH-193
>  Project: Nutch
> Type: Task
>   Components: ndfs
> Versions: 0.8-dev
> Reporter: Doug Cutting
> Assignee: Doug Cutting
>  Fix For: 0.8-dev

>
> The NDFS and MapReduce code should move from Nutch to a new Lucene 
> sub-project named Hadoop.
> My plan is to do this as follows:
> 1. Move all code in the following packages from Nutch to Hadoop:
> org.apache.nutch.fs
> org.apache.nutch.io
> org.apache.nutch.ipc
> org.apache.nutch.mapred
> org.apache.nutch.ndfs
> These packages will all be renamed to org.apache.hadoop, and Nutch code will 
> be updated to reflect this.
> 2. Move selected classes from Nutch to Hadoop, as follows:
> org.apache.nutch.util.NutchConf -> org.apache.hadoop.conf.Configuration
> org.apache.nutch.util.NutchConfigurable -> org.apache.hadoop.Configurable 
> org.apache.nutch.util.NutchConfigured -> org.apache.hadoop.Configured
> org.apache.nutch.util.Progress -> org.apache.hadoop.util.Progress
> org.apache.nutch.util.LogFormatter -> org.apache.hadoop.util.LogFormatter
> org.apache.nutch.util.Daemon -> org.apache.hadoop.util.Daemon
> 3. Add a jar containing all of the above to Nutch's lib directory.
> Does this plan sound reasonable?




[jira] Commented: (NUTCH-192) meta data support for CrawlDatum

2006-02-06 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12365366 ] 

Doug Cutting commented on NUTCH-192:


No, the stuff you're doing with MapWritable has nothing to do with Hadoop, but 
is all to support features you're adding to Nutch.  So, if anywhere, it belongs 
somewhere in Nutch.

> meta data support for CrawlDatum
> 
>
>  Key: NUTCH-192
>  URL: http://issues.apache.org/jira/browse/NUTCH-192
>  Project: Nutch
> Type: Improvement
> Versions: 0.8-dev
> Reporter: Stefan Groschupf
>  Fix For: 0.8-dev
>  Attachments: metadata010206.patch, metadata060206.patch, 
> metadata300106.patch, metadata310106.patch
>
> Supporting meta data in CrawlDatum would help realize a set of new Nutch 
> features, and would make a lot possible for smaller, specially focused 
> search engines.




[jira] Commented: (NUTCH-192) meta data support for CrawlDatum

2006-02-07 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12365450 ] 

Doug Cutting commented on NUTCH-192:


Sorry, I misspoke and overstated things too.  There are problems, but not with 
MapWritable, rather with WritableName: this refers to some Nutch classes that 
are not in Hadoop.  Aside from that, I agree that MapWritable could be 
generally useful.  Sorry I wasn't thinking clearly when I made my previous 
comment.


> meta data support for CrawlDatum
> 
>
>  Key: NUTCH-192
>  URL: http://issues.apache.org/jira/browse/NUTCH-192
>  Project: Nutch
> Type: Improvement
> Versions: 0.8-dev
> Reporter: Stefan Groschupf
>  Fix For: 0.8-dev
>  Attachments: metadata010206.patch, metadata060206.patch, 
> metadata300106.patch, metadata310106.patch
>
> Supporting meta data in CrawlDatum would help realize a set of new Nutch 
> features, and would make a lot possible for smaller, specially focused 
> search engines.




[jira] Commented: (NUTCH-192) meta data support for CrawlDatum

2006-02-08 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12365618 ] 

Doug Cutting commented on NUTCH-192:


Since these mappings are not something that users should alter, I'm not sure 
they should be in the config file.  I added related mappings to static code in 
NutchConfigurable.  Every Nutch invocation should reference that class, so 
adding registrations there ensures they'll always be executed.  So, in any 
case, if they're loaded from a resource (config file or otherwise), the 
loading should probably happen in NutchConfigurable.  Putting it in a static 
block there means it isn't reloaded for each configuration; but if, e.g., 
plugins need to register new mappings, then perhaps we'll need to reload these 
resources each time a configuration is constructed.
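
For illustration, such a static registration might look like the following; 
this is patterned on the description above, with WritableName.setName assumed 
as the registration hook and CrawlDatum chosen only as an example class:

public class NutchConfigurable {
  static {
    // Runs once, the first time any Nutch code touches this class, so the
    // mapping is in place before any Writable is resolved by name.
    WritableName.setName(CrawlDatum.class, "org.apache.nutch.crawl.CrawlDatum");
  }
}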

> meta data support for CrawlDatum
> 
>
>  Key: NUTCH-192
>  URL: http://issues.apache.org/jira/browse/NUTCH-192
>  Project: Nutch
> Type: Improvement
> Versions: 0.8-dev
> Reporter: Stefan Groschupf
>  Fix For: 0.8-dev
>  Attachments: metadata010206.patch, metadata060206.patch, 
> metadata300106.patch, metadata310106.patch
>
> Supporting meta data in CrawlDatum would help realize a set of new Nutch 
> features, and would make a lot possible for smaller, specially focused 
> search engines.




[jira] Commented: (NUTCH-139) Standard metadata property names in the ParseData metadata

2006-02-08 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-139?page=comments#action_12365619 ] 

Doug Cutting commented on NUTCH-139:


+1  This looks great.  Thanks for all the hard work on this one!

> Standard metadata property names in the ParseData metadata
> --
>
>  Key: NUTCH-139
>  URL: http://issues.apache.org/jira/browse/NUTCH-139
>  Project: Nutch
> Type: Improvement
>   Components: fetcher
> Versions: 0.7.1, 0.7, 0.6, 0.7.2-dev, 0.8-dev
>  Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB  RAM, 
> although bug is independent of environment
> Reporter: Chris A. Mattmann
> Assignee: Chris A. Mattmann
> Priority: Minor
>  Fix For: 0.7.2-dev, 0.8-dev, 0.7.1, 0.7, 0.6
>  Attachments: NUTCH-139.060105.patch, NUTCH-139.060208.patch
>
> Currently, people are free to name their string-based properties anything 
> they want, so names like "Content-type", "content-TyPe", and "CONTENT_TYPE" 
> all carry the same meaning. Stefan G., I believe, proposed a solution in 
> which all property names would be converted to lower case, but in essence 
> that only fixes half the problem (identifying that "CONTENT_TYPE", 
> "conTeNT_TyPE", and all the other case permutations are really the same). 
> What if I named it "Content Type", or "ContentType"?
> I propose that a way to correct this would be to create a standard set of 
> named Strings in the ParseData class that the protocol framework and the 
> parsing framework could use to identify common properties such as 
> "Content-type", "Creator", "Language", etc.
> The properties would be defined at the top of the ParseData class, something 
> like:
> public class ParseData {
>   ...
>   public static final String CONTENT_TYPE = "content-type";
>   public static final String CREATOR = "creator";
>   ...
> }
> In this fashion, users could at least know the names of the standard 
> properties they can obtain from the ParseData, for example by making a call 
> to ParseData.getMetadata().get(ParseData.CONTENT_TYPE) to get the content 
> type, or to ParseData.getMetadata().set(ParseData.CONTENT_TYPE, "text/xml") 
> to set it. Of course, this wouldn't preclude users from doing what they are 
> currently doing; it would just provide a standard way of obtaining some of 
> the more common, critical metadata without poring over the code base to 
> figure out what it is named.
> I'll contribute a patch near the end of this week, or the beginning of next 
> week, that addresses this issue.




[jira] Updated: (NUTCH-192) meta data support for CrawlDatum

2006-02-08 Thread Doug Cutting (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-192?page=all ]

Doug Cutting updated NUTCH-192:
---

Attachment: (was: metadata08_02_06.patch)

> meta data support for CrawlDatum
> 
>
>  Key: NUTCH-192
>  URL: http://issues.apache.org/jira/browse/NUTCH-192
>  Project: Nutch
> Type: Improvement
> Versions: 0.8-dev
> Reporter: Stefan Groschupf
>  Fix For: 0.8-dev
>  Attachments: metadata010206.patch, metadata060206.patch, 
> metadata08_02_06_FULL.patch, metadata300106.patch, metadata310106.patch
>
> Supporting meta data in CrawlDatum would help realize a set of new Nutch 
> features, and would make a lot possible for smaller, specially focused 
> search engines.




[jira] Commented: (NUTCH-192) meta data support for CrawlDatum

2006-02-08 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-192?page=comments#action_12365643 ] 

Doug Cutting commented on NUTCH-192:


+1 This looks good to me. Thanks for your persistence.

> meta data support for CrawlDatum
> 
>
>  Key: NUTCH-192
>  URL: http://issues.apache.org/jira/browse/NUTCH-192
>  Project: Nutch
> Type: Improvement
> Versions: 0.8-dev
> Reporter: Stefan Groschupf
>  Fix For: 0.8-dev
>  Attachments: metadata010206.patch, metadata060206.patch, 
> metadata08_02_06_FULL.patch, metadata300106.patch, metadata310106.patch
>
> Supporting meta data in CrawlDatum would help realize a set of new Nutch 
> features, and would make a lot possible for smaller, specially focused 
> search engines.




[jira] Created: (NUTCH-209) include nutch jar in mapred jobs

2006-02-09 Thread Doug Cutting (JIRA)
include nutch jar in mapred jobs


 Key: NUTCH-209
 URL: http://issues.apache.org/jira/browse/NUTCH-209
 Project: Nutch
Type: Improvement
Versions: 0.8-dev
Reporter: Doug Cutting
Priority: Minor
 Fix For: 0.8-dev


I just added a simple way in Hadoop to specify the job jar file.  When 
constructing a JobConf one can specify a class whose containing jar is set to 
be the job's jar.  To take advantage of this in Nutch, we could add a util 
class:

public class NutchJob extends JobConf {
  public NutchJob(Configuration conf) {
    super(conf, NutchJob.class);
  }
}

Then change all of the places where we construct a JobConf to instead construct 
a NutchJob.

Finally, we should add an ant target called 'job' that constructs a job jar, 
containing all of the classes and the plugins, and make this the default 
target.  This way all Nutch code can be distributed with each job as it is 
submitted, and daemons would only need to be restarted when Hadoop code is 
updated.

Does this sound reasonable?
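
Job setup would then look something like this (a sketch; the 
JobConf(Configuration, Class) constructor is the one described above, and the 
job name is arbitrary):

Configuration conf = NutchConfiguration.create();
JobConf job = new NutchJob(conf);   // the jar containing NutchJob becomes the job jar
job.setJobName("crawldb update");   // example name only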




[jira] Resolved: (NUTCH-209) include nutch jar in mapred jobs

2006-02-09 Thread Doug Cutting (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-209?page=all ]
 
Doug Cutting resolved NUTCH-209:


Resolution: Fixed

I just committed this.

Michael, the 'bin/hadoop jar' command is not (yet) used by Nutch.  Please file 
a Hadoop bug to add the feature you're asking for.

> include nutch jar in mapred jobs
> 
>
>  Key: NUTCH-209
>  URL: http://issues.apache.org/jira/browse/NUTCH-209
>  Project: Nutch
> Type: Improvement
> Versions: 0.8-dev
> Reporter: Doug Cutting
> Priority: Minor
>  Fix For: 0.8-dev

>
> I just added a simple way in Hadoop to specify the job jar file.  When 
> constructing a JobConf one can specify a class whose containing jar is set to 
> be the job's jar.  To take advantage of this in Nutch, we could add a util 
> class:
> public class NutchJob extends JobConf {
>   public NutchJob(Configuration conf) {
>     super(conf, NutchJob.class);
>   }
> }
> Then change all of the places where we construct a JobConf to instead 
> construct a NutchJob.
> Finally, we should add an ant target called 'job' that constructs a job jar, 
> containing all of the classes and the plugins, and make this the default 
> target.  This way all Nutch code can be distributed with each job as it is 
> submitted, and daemons would only need to be restarted when Hadoop code is 
> updated.
> Does this sound reasonable?




[jira] Commented: (NUTCH-209) include nutch jar in mapred jobs

2006-02-09 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-209?page=comments#action_12365798 ] 

Doug Cutting commented on NUTCH-209:


Andrzej, sorry, I didn't see your remark before I committed this!

A DFSClassLoader would have problems with plugins, since our plugin mechanism 
requires that we list a directory to find all defined plugins, and the 
ClassLoader API doesn't let one list directories.  That could be fixed, but 
it's not trivial.

Another way to address this concern is to permit one to specify different 
levels of DFS replication for different files.  So, while the default might be 
3, a job jar file might be replicated much more, so that individual nodes are 
not hit too hard by requests.  This is a feature that I believe Google 
implements, and one that folks at Yahoo! (who're now contributing to Hadoop) 
would like to add to Hadoop.

We could also try to make the job jar smaller, e.g., by only including enabled 
plugins.

> include nutch jar in mapred jobs
> 
>
>  Key: NUTCH-209
>  URL: http://issues.apache.org/jira/browse/NUTCH-209
>  Project: Nutch
> Type: Improvement
> Versions: 0.8-dev
> Reporter: Doug Cutting
> Priority: Minor
>  Fix For: 0.8-dev

>
> I just added a simple way in Hadoop to specify the job jar file.  When 
> constructing a JobConf one can specify a class whose containing jar is set to 
> be the job's jar.  To take advantage of this in Nutch, we could add a util 
> class:
> public class NutchJob extends JobConf {
>   public NutchJob(Configuration conf) {
>     super(conf, NutchJob.class);
>   }
> }
> Then change all of the places where we construct a JobConf to instead 
> construct a NutchJob.
> Finally, we should add an ant target called 'job' that constructs a job jar, 
> containing all of the classes and the plugins, and make this the default 
> target.  This way all Nutch code can be distributed with each job as it is 
> submitted, and daemons would only need to be restarted when Hadoop code is 
> updated.
> Does this sound reasonable?




[jira] Commented: (NUTCH-211) FetchedSegments leave readers open

2006-02-15 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-211?page=comments#action_12366505 ] 

Doug Cutting commented on NUTCH-211:


The interfaces that FetchedSegments implements should have a close method.  
Moreover, these interfaces should extend a Closeable interface.  JDK 1.5 added 
such an interface; in the meantime, I can add one to Hadoop 
(org.apache.hadoop.io.Closeable) that Nutch can use until we upgrade to Java 
1.5.

So, once I've added Closeable to Hadoop, please submit a patch that makes 
HitContent and HitSummarizer extend Closeable, and FetchedSegments implement it.

Does that sound reasonable?
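
Sketched out, the arrangement would look something like this (one file for 
compactness, with the elided accessors and the close body as placeholders 
rather than the real searcher API):

import java.io.IOException;

interface Closeable { void close() throws IOException; }

interface HitContent extends Closeable { /* content accessors elided */ }
interface HitSummarizer extends Closeable { /* summary accessors elided */ }

class FetchedSegments implements HitContent, HitSummarizer {
  public void close() throws IOException {
    // would close each open segment reader here
  }
}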

> FetchedSegments leave readers open
> --
>
>  Key: NUTCH-211
>  URL: http://issues.apache.org/jira/browse/NUTCH-211
>  Project: Nutch
> Type: Bug
> Versions: 0.8-dev
> Reporter: Stefan Groschupf
> Priority: Critical
>  Fix For: 0.8-dev

>
> I have a case here where the NutchBean is instantiated more than once; I do 
> cache the NutchBean, but in some situations the bean needs to be re-created. 
> The problem is that FetchedSegments leaves open all readers it uses, so an 
> NIO exception is thrown as soon as I try to create the NutchBean again.
> I would suggest adding a close method to FetchedSegments and all involved 
> objects, to be able to cleanly shut down the NutchBean.
> Any comments? Would a patch be welcome?
> Caused by: java.nio.channels.ClosedChannelException
> at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:89)
> at sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:272)
> at 
> org.apache.nutch.fs.LocalFileSystem$LocalNFSFileInputStream.seek(LocalFileSystem.java:83)
> at 
> org.apache.nutch.fs.NFSDataInputStream$Checker.seek(NFSDataInputStream.java:66)
> at 
> org.apache.nutch.fs.NFSDataInputStream$PositionCache.seek(NFSDataInputStream.java:162)
> at 
> org.apache.nutch.fs.NFSDataInputStream$Buffer.seek(NFSDataInputStream.java:191)
> at org.apache.nutch.fs.NFSDataInputStream.seek(NFSDataInputStream.java:241)
> at org.apache.nutch.io.SequenceFile$Reader.seek(SequenceFile.java:403)
> at org.apache.nutch.io.MapFile$Reader.seek(MapFile.java:329)
> at org.apache.nutch.io.MapFile$Reader.get(MapFile.java:374)
> at 
> org.apache.nutch.mapred.MapFileOutputFormat.getEntry(MapFileOutputFormat.java:76)
> at 
> org.apache.nutch.searcher.FetchedSegments$Segment.getEntry(FetchedSegments.java:93)
> at 
> org.apache.nutch.searcher.FetchedSegments$Segment.getParseText(FetchedSegments.java:84)
> at 
> org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:147)
> at org.apache.nutch.searcher.NutchBean.getSummary(NutchBean.java:321)




[jira] Resolved: (NUTCH-211) FetchedSegments leave readers open

2006-02-16 Thread Doug Cutting (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-211?page=all ]
 
Doug Cutting resolved NUTCH-211:


Resolution: Fixed

I committed this, with a bunch of whitespace fixes.

> FetchedSegments leave readers open
> --
>
>  Key: NUTCH-211
>  URL: http://issues.apache.org/jira/browse/NUTCH-211
>  Project: Nutch
> Type: Bug
> Versions: 0.8-dev
> Reporter: Stefan Groschupf
> Assignee: Stefan Groschupf
> Priority: Critical
>  Fix For: 0.8-dev
>  Attachments: closeFetchSegments.patch, closeable160206.patch
>
> I have a case here where the NutchBean is instantiated more than once; I do 
> cache the NutchBean, but in some situations the bean needs to be re-created. 
> The problem is that FetchedSegments leaves open all readers it uses, so an 
> NIO exception is thrown as soon as I try to create the NutchBean again.
> I would suggest adding a close method to FetchedSegments and all involved 
> objects, to be able to cleanly shut down the NutchBean.
> Any comments? Would a patch be welcome?
> Caused by: java.nio.channels.ClosedChannelException
> at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:89)
> at sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:272)
> at 
> org.apache.nutch.fs.LocalFileSystem$LocalNFSFileInputStream.seek(LocalFileSystem.java:83)
> at 
> org.apache.nutch.fs.NFSDataInputStream$Checker.seek(NFSDataInputStream.java:66)
> at 
> org.apache.nutch.fs.NFSDataInputStream$PositionCache.seek(NFSDataInputStream.java:162)
> at 
> org.apache.nutch.fs.NFSDataInputStream$Buffer.seek(NFSDataInputStream.java:191)
> at org.apache.nutch.fs.NFSDataInputStream.seek(NFSDataInputStream.java:241)
> at org.apache.nutch.io.SequenceFile$Reader.seek(SequenceFile.java:403)
> at org.apache.nutch.io.MapFile$Reader.seek(MapFile.java:329)
> at org.apache.nutch.io.MapFile$Reader.get(MapFile.java:374)
> at 
> org.apache.nutch.mapred.MapFileOutputFormat.getEntry(MapFileOutputFormat.java:76)
> at 
> org.apache.nutch.searcher.FetchedSegments$Segment.getEntry(FetchedSegments.java:93)
> at 
> org.apache.nutch.searcher.FetchedSegments$Segment.getParseText(FetchedSegments.java:84)
> at 
> org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:147)
> at org.apache.nutch.searcher.NutchBean.getSummary(NutchBean.java:321)




[jira] Resolved: (NUTCH-216) cannot build in windows

2006-02-24 Thread Doug Cutting (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-216?page=all ]
 
Doug Cutting resolved NUTCH-216:


Fix Version: 0.8-dev
 Resolution: Fixed

The reason 'exec' was used was that it also restores file permissions, which 
'untar' does not.  So I switched to 'unjar' and added a 'chmod' to make the 
scripts executable.

> cannot build in windows
> ---
>
>  Key: NUTCH-216
>  URL: http://issues.apache.org/jira/browse/NUTCH-216
>  Project: Nutch
> Type: Bug
>  Environment: XP sp2
> jdk1.5
> ant 1.6.5
> Reporter: bin zhu
>  Fix For: 0.8-dev
>  Attachments: untar-build.patch
>
> Buildfile: build.xml
> init:
> [mkdir] Created dir: C:\data\asf\nutch-trunk\build
> [mkdir] Created dir: C:\data\asf\nutch-trunk\build\classes
> [mkdir] Created dir: C:\data\asf\nutch-trunk\build\test
> [mkdir] Created dir: C:\data\asf\nutch-trunk\build\test\classes
> [mkdir] Created dir: C:\data\asf\nutch-trunk\build\hadoop
> [unjar] Expanding: C:\data\asf\nutch-trunk\lib\hadoop-0.1-dev.jar into 
> C:\data\asf\nutch-trunk\build\hadoop
> BUILD FAILED
> C:\data\asf\nutch-trunk\build.xml:60: Execute failed: java.io.IOException: 
> CreateProcess: tar xzf .././build/hadoop/bin.tgz error=2




[jira] Created: (NUTCH-218) need DOAP file for Nutch

2006-02-28 Thread Doug Cutting (JIRA)
need DOAP file for Nutch


 Key: NUTCH-218
 URL: http://issues.apache.org/jira/browse/NUTCH-218
 Project: Nutch
Type: Task
Reporter: Doug Cutting


Can someone please draft a DOAP file for Nutch, so that we're listed at 
http://projects.apache.org/?

A DOAP generator is at:

http://projects.apache.org/create.html

Please attach it to this bug report.  Thanks.





[jira] Commented: (NUTCH-221) prepare nutch for upcoming lucene 2.0

2006-03-03 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-221?page=comments#action_12368779 ] 

Doug Cutting commented on NUTCH-221:


+1  Thanks!

> prepare nutch for upcoming lucene 2.0
> -
>
>  Key: NUTCH-221
>  URL: http://issues.apache.org/jira/browse/NUTCH-221
>  Project: Nutch
> Type: Task
>  Environment: all
> Reporter: Sami Siren
> Assignee: Sami Siren
> Priority: Minor
>  Fix For: 0.8-dev
>  Attachments: nutch-lucene-deprecation.txt
>
> Remove all deprecated uses of lucene as they will vanish in 2.0




[jira] Commented: (NUTCH-230) OPIC score for outlinks should be based on # of valid links, not total # of links.

2006-03-14 Thread Doug Cutting (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-230?page=comments#action_12370381 ] 

Doug Cutting commented on NUTCH-230:


Andrzej, that's true if we think links that are filtered are bad links, but if 
we instead think of them as non-links then this fix is correct.

I don't have a strong intuition about which is best.  Perhaps we should make it 
configurable, and let folks experiment?

Ken, do you see a marked improvement in scores when you make this change?  Can 
you provide some examples of cases where it makes a difference?


> OPIC score for outlinks should be based on # of valid links, not total # of 
> links.
> --
>
>  Key: NUTCH-230
>  URL: http://issues.apache.org/jira/browse/NUTCH-230
>  Project: Nutch
> Type: Improvement
> Versions: 0.8-dev
> Reporter: Ken Krugler
> Priority: Minor

>
> In ParseOutputFormat.java, the write() method currently divides the page 
> score by the # of outlinks:
>   score /= links.length;
> It then loops over the links, and any that pass the normalize/filter gauntlet 
> get added to the crawl output.
> But this means that any filtered links result in some amount of the page's 
> OPIC score being "lost".
> For Nutch 0.7, I built a list of valid (post-filter) links, and then used 
> that to determine the per-link OPIC score, after which I iterated over the 
> list, adding entries to the crawl output.
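
A sketch of that accounting, with Outlink, the filter step, and the output 
call reduced to placeholders (none of this is the actual ParseOutputFormat 
code):

import java.util.ArrayList;
import java.util.List;

class OpicSketch {
  static class Outlink { /* url and anchor elided */ }

  // Placeholder for the normalize/filter gauntlet; null means rejected.
  static Outlink filterAndNormalize(Outlink link) { return link; }

  // Placeholder for adding an entry to the crawl output.
  static void emit(Outlink link, float score) { }

  static void distributeScore(Outlink[] links, float score) {
    List<Outlink> valid = new ArrayList<Outlink>();
    for (int i = 0; i < links.length; i++) {
      Outlink kept = filterAndNormalize(links[i]);
      if (kept != null) valid.add(kept);
    }
    if (valid.isEmpty()) return;            // nothing survives to carry score
    float perLink = score / valid.size();   // divide by valid links only
    for (int i = 0; i < valid.size(); i++) {
      emit(valid.get(i), perLink);
    }
  }
}

Whether to divide by all outlinks or only the surviving ones (the choice 
discussed in the comment above) could hang off a boolean configuration 
property; the property name would be an open question.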



