[jira] Updated: (NUTCH-273) When a page is redirected, the original url is NOT updated.

2006-11-25 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-273?page=all ]

Stefan Groschupf updated NUTCH-273:
---

Priority: Blocker  (was: Major)

I agree this is a serious problem for any production use of Nutch - a blocker, 
since you end up refetching the same pages again and again. 

> When a page is redirected, the original url is NOT updated.
> ---
>
> Key: NUTCH-273
> URL: http://issues.apache.org/jira/browse/NUTCH-273
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 0.8
> Environment: n/a
>Reporter: Lukas Vlcek
>Priority: Blocker
>
> [Excerpt from maillist, sender: Andrzej Bialecki]
> When a page is redirected, the original url is NOT updated - so, CrawlDB will 
> never know that a redirect occurred; it won't even know that a fetch 
> occurred... This looks like a bug.
> In 0.7 this was recorded in the segment, and then it would affect the Page 
> status during updatedb. It should do so in 0.8, too...

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] Updated: (NUTCH-357) crawling simulation

2006-08-21 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-357?page=all ]

Stefan Groschupf updated NUTCH-357:
---

Attachment: protocol-simulation-pluginV1.patch

A very first preview of a plugin that helps to simulate crawls. This protocol 
plugin can be used to replace the http protocol plugin and return defined 
content during a fetch. To simulate custom scenarios, an interface named 
Simulator can be implemented with just one method. 
The plugin comes with a very simple basic Simulator implementation; however, 
this already allows simulating the Nutch scoring problems known today, like 
pages pointing to themselves or link chains. 
For more details see the javadoc; I plan to improve the javadoc with the help 
of a native speaker. 

Feedback is welcome. 
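
For illustration, a plausible shape of that one-method interface (name and 
signature are an assumption here, not necessarily what 
protocol-simulation-pluginV1.patch actually defines):

// hypothetical sketch of the Simulator extension point described above
public interface Simulator {
  /** Return the content the simulated "web server" serves for this url. */
  byte[] getContent(String url);
}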

> crawling simulation
> ---
>
> Key: NUTCH-357
> URL: http://issues.apache.org/jira/browse/NUTCH-357
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 0.8.1, 0.9.0
>Reporter: Stefan Groschupf
> Fix For: 0.9.0
>
> Attachments: protocol-simulation-pluginV1.patch
>
>
> We recently discovered some serious issues related to crawling and scoring. 
> Reproducing these problems is difficult: first of all it is not polite to 
> re-crawl a set of pages again and again, and secondly it is difficult to 
> catch the page that causes a problem. 
> Therefore it would be very useful to have a testbed to simulate crawls where 
> we can control the responses of the "web servers". 
> For the very beginning, simulating basic situations like a page pointing to 
> itself, link chains, or internal links would already be very useful. 
> Later on, simulating crawls against existing data collections like TREC or a 
> webgraph would be much more interesting, for instance to calculate the 
> quality of the nutch OPIC implementation against PageRank scores of the 
> webgraph, or to evaluate crawling strategies.





[jira] Created: (NUTCH-357) crawling simulation

2006-08-21 Thread Stefan Groschupf (JIRA)
crawling simulation
---

 Key: NUTCH-357
 URL: http://issues.apache.org/jira/browse/NUTCH-357
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.8.1, 0.9.0
Reporter: Stefan Groschupf
 Fix For: 0.9.0


We recently discovered some serious issues related to crawling and scoring. 
Reproducing these problems is difficult: first of all it is not polite to 
re-crawl a set of pages again and again, and secondly it is difficult to catch 
the page that causes a problem. 
Therefore it would be very useful to have a testbed to simulate crawls where 
we can control the responses of the "web servers". 
For the very beginning, simulating basic situations like a page pointing to 
itself, link chains, or internal links would already be very useful. 

Later on, simulating crawls against existing data collections like TREC or a 
webgraph would be much more interesting, for instance to calculate the quality 
of the nutch OPIC implementation against PageRank scores of the webgraph, or 
to evaluate crawling strategies.





[jira] Commented: (NUTCH-356) Plugin repository cache can lead to memory leak

2006-08-21 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-356?page=comments#action_12429534 ] 

Stefan Groschupf commented on NUTCH-356:


Hi Enrico, 
there will be as many PluginRepositories as Configuration objects. 
So in case you create many Configuration objects you will have a problem with 
memory. 
There is no way around having a singleton PluginRepository. However, you can 
reset the PluginRepository by removing the cached object from the 
Configuration object. 
In any case, not caching the PluginRepository is a bad idea; think about 
writing your own plugin instead - that should be a cleaner solution for your 
problem. 
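
A minimal sketch of that reset, assuming the repository is cached inside the 
Configuration via the old getObject/setObject mechanism (the exact cache key 
is an assumption):

// assumed: PluginRepository.get(conf) caches its instance inside the
// Configuration under the class name; clearing the slot forces a rebuild
conf.setObject(PluginRepository.class.getName(), null);
PluginRepository fresh = PluginRepository.get(conf); // rebuilt, not cached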

Would you agree to close this issue, since we will not be able to commit your 
changes? 
Stefan  

> Plugin repository cache can lead to memory leak
> ---
>
> Key: NUTCH-356
> URL: http://issues.apache.org/jira/browse/NUTCH-356
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 0.8
>Reporter: Enrico Triolo
> Attachments: NutchTest.java, patch.txt
>
>
> While I was trying to solve a problem I reported a while ago (see Nutch-314), 
> I found out that actually the problem was related to the plugin cache used in 
> class PluginRepository.java.
> As I said in Nutch-314, I think I somehow 'force' the way nutch is meant to 
> work, since I need to frequently submit new urls and append their contents 
> to the index; I don't (and can't) have a urls.txt file with all the urls I'm 
> going to fetch, but recreate it each time a new url is submitted.
> Thus, I think in the majority of cases you won't have problems using nutch 
> as-is, since the problem I found occurs only if nutch is used in a way 
> similar to mine.
> To simplify your test I'm attaching a class that performs something similar 
> to what I need. It fetches and indexes some sample urls; to avoid 
> webmasters' complaints I left the sample urls list empty, so you should 
> modify the source code and add some urls.
> Then you only have to run it and watch your memory consumption with top. In 
> my experience I get an OutOfMemoryException after a couple of minutes, but it 
> clearly depends on your heap settings and on the plugins you are using (I'm 
> using 
> 'protocol-file|protocol-http|parse-(rss|html|msword|pdf|text)|language-identifier|index-(basic|more)|query-(basic|more|site|url)|urlfilter-regex|summary-basic|scoring-opic').
> The problem is bound to the PluginRepository 'singleton' instance, since it 
> never gets released. It seems that some class maintains a reference to it, 
> and this class is never released since it is cached somewhere in the 
> configuration.
> So I modified the PluginRepository's 'get' method so that it never uses the 
> cache and always returns a new instance (you can find the patch in the 
> attachment). This way the memory consumption is always stable and I get no 
> OOM anymore.
> Clearly this is not the solution, since I guess there are many performance 
> issues involved, but for the moment it works.





[jira] Commented: (NUTCH-354) MapWritable, nextEntry is not reset when Entries are recycled

2006-08-21 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-354?page=comments#action_12429496 ] 

Stefan Groschupf commented on NUTCH-354:


Since this issue is already closed I cannot attach the patch file, so I attach 
it as text within this comment.
If you need the file, let me know and I will send you an offlist mail. 

Index: src/test/org/apache/nutch/crawl/TestMapWritable.java
===================================================================
--- src/test/org/apache/nutch/crawl/TestMapWritable.java   (revision 432325)
+++ src/test/org/apache/nutch/crawl/TestMapWritable.java   (working copy)
@@ -180,6 +180,31 @@
     assertEquals(before, after);
   }
 
+  public void testRecycling() throws Exception {
+    UTF8 value = new UTF8("value");
+    UTF8 key1 = new UTF8("a");
+    UTF8 key2 = new UTF8("b");
+
+    MapWritable writable = new MapWritable();
+    writable.put(key1, value);
+    assertEquals(writable.get(key1), value);
+    assertNull(writable.get(key2));
+
+    DataOutputBuffer dob = new DataOutputBuffer();
+    writable.write(dob);
+    writable.clear();
+    writable.put(key1, value);
+    writable.put(key2, value);
+    assertEquals(writable.get(key1), value);
+    assertEquals(writable.get(key2), value);
+
+    DataInputBuffer dib = new DataInputBuffer();
+    dib.reset(dob.getData(), dob.getLength());
+    writable.readFields(dib);
+    assertEquals(writable.get(key1), value);
+    assertNull(writable.get(key2));
+  }
+
   public static void main(String[] args) throws Exception {
     TestMapWritable writable = new TestMapWritable();
     writable.testPerformance();

> MapWritable,  nextEntry is not reset when Entries are recycled
> --
>
> Key: NUTCH-354
> URL: http://issues.apache.org/jira/browse/NUTCH-354
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 0.8
>Reporter: Stefan Groschupf
>Priority: Blocker
> Fix For: 0.9.0, 0.8.1
>
> Attachments: resetNextEntryInMapWritableV1.patch
>
>
> MapWritable recycles entries from its internal linked list for performance 
> reasons. The nextEntry of an entry is not reset when a recyclable entry is 
> found. This can cause wrong data in a MapWritable. 





[jira] Updated: (NUTCH-354) MapWritable, nextEntry is not reset when Entries are recycled

2006-08-19 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-354?page=all ]

Stefan Groschupf updated NUTCH-354:
---

Attachment: resetNextEntryInMapWritableV1.patch

Resets the nextEntry of a recycled entry.

> MapWritable,  nextEntry is not reset when Entries are recycled
> --
>
> Key: NUTCH-354
> URL: http://issues.apache.org/jira/browse/NUTCH-354
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 0.8
>Reporter: Stefan Groschupf
>Priority: Blocker
> Fix For: 0.9.0, 0.8.1
>
> Attachments: resetNextEntryInMapWritableV1.patch
>
>
> MapWritable recycles entries from its internal linked list for performance 
> reasons. The nextEntry of an entry is not reset when a recyclable entry is 
> found. This can cause wrong data in a MapWritable. 





[jira] Created: (NUTCH-354) MapWritable, nextEntry is not reset when Entries are recycled

2006-08-19 Thread Stefan Groschupf (JIRA)
MapWritable,  nextEntry is not reset when Entries are recycled 
---

 Key: NUTCH-354
 URL: http://issues.apache.org/jira/browse/NUTCH-354
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8
Reporter: Stefan Groschupf
Priority: Blocker
 Fix For: 0.8.1, 0.9.0


MapWritable recycles entries from its internal linked list for performance 
reasons. The nextEntry of an entry is not reset when a recyclable entry is 
found. This can cause wrong data in a MapWritable. 






[jira] Updated: (NUTCH-336) Harvested links shouldn't get db.score.injected in addition to inbound contributions

2006-08-17 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-336?page=all ]

Stefan Groschupf updated NUTCH-336:
---

Priority: Critical  (was: Minor)

I think this is a fundamental problem, since I observe there are many pages, 
e.g. presentation slides, that have exactly one link to the next page. In a 
very long presentation the last page gets a very high score. We should commit 
this patch soon; wrong scoring is a serious issue at the moment. 
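
To make the cascade in the report below concrete: with db.score.injected = 
1.0 and db.score.link.external = 1.0, the chain A->B->C->D gives

  A = 1.0 (injected)
  B = 1.0 + 1.0 (from A) = 2.0
  C = 1.0 + 2.0 (from B) = 3.0
  D = 1.0 + 3.0 (from C) = 4.0

which is exactly the 4 * db.score.injected of the example.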

> Harvested links shouldn't get db.score.injected in addition to inbound 
> contributions
> 
>
> Key: NUTCH-336
> URL: http://issues.apache.org/jira/browse/NUTCH-336
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 0.8
>Reporter: Chris Schneider
>Priority: Critical
> Attachments: NUTCH-336.patch.txt
>
>
> Currently (even with Stefan's fix for NUTCH-324), harvested links have their 
> initial scores set to db.score.injected + (sum of inbound contributions * 
> db.score.link.[internal | external]), but this will place (at least external) 
> harvested links even higher than injected URLs on the fetch list. Perhaps 
> more importantly, this effect cascades.
> As a simple example, if each page in A->B->C->D has exactly one external link 
> and only A is injected, then D will receive an initial score of at least 
> (4*db.score.injected) with the default db.score.link.external of 1.0. Higher 
> values of db.score.injected and db.score.link.external obviously exacerbate 
> this problem.





[jira] Updated: (NUTCH-337) Fetcher ignores the fetcher.parse value configured in config file

2006-08-17 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-337?page=all ]

Stefan Groschupf updated NUTCH-337:
---

Priority: Major  (was: Trivial)

> Fetcher ignores the fetcher.parse value configured in config file
> -
>
> Key: NUTCH-337
> URL: http://issues.apache.org/jira/browse/NUTCH-337
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 0.8, 0.9.0
>Reporter: Jeremy Huylebroeck
> Attachments: respectFetcherParsePropertyV1.patch
>
>
> Using the command line call to Fetcher, if the noParsing parameter is 
> given, everything is fine.
> If noParsing is not given, the value in nutch-site.xml (or 
> nutch-default.xml) should be taken, but "true" is always passed to the call 
> to fetch.
> It should be the value from the conf.





[jira] Updated: (NUTCH-337) Fetcher ignores the fetcher.parse value configured in config file

2006-08-17 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-337?page=all ]

Stefan Groschupf updated NUTCH-337:
---

Attachment: respectFetcherParsePropertyV1.patch

Hi Jeremy, thanks for catching this. Attached is a fix. It should be easy for 
a committer to commit this to trunk.
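
The one-line idea, sketched with Hadoop's standard Configuration accessor 
(property name as in this issue; the actual patch may differ in detail):

// when noParsing is absent on the command line, read the configured
// value instead of hard-coding true
boolean parsing = conf.getBoolean("fetcher.parse", true);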

> Fetcher ignores the fetcher.parse value configured in config file
> -
>
> Key: NUTCH-337
> URL: http://issues.apache.org/jira/browse/NUTCH-337
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 0.8, 0.9.0
>Reporter: Jeremy Huylebroeck
>Priority: Trivial
> Attachments: respectFetcherParsePropertyV1.patch
>
>
> Using the command line call to Fetcher, if the noParsing parameter is 
> given, everything is fine.
> If noParsing is not given, the value in nutch-site.xml (or 
> nutch-default.xml) should be taken, but "true" is always passed to the call 
> to fetch.
> It should be the value from the conf.





[jira] Updated: (NUTCH-341) IndexMerger now deletes entire <workingdir> after completing

2006-08-17 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-341?page=all ]

Stefan Groschupf updated NUTCH-341:
---

Attachment: doNotDeleteTmpIndexMergeDirV1.patch

+1. 
I agree it makes no sense at all to require the user to create a tmp folder 
manually, which nutch then deletes afterwards together with all its content. 
Very dangerous if a user provides / as the tmp folder. The attached patch 
rolls back the missing line, and I would love it if a developer with write 
access could roll this in asap!
THANKS!


> IndexMerger now deletes entire <workingdir> after completing
> 
>
> Key: NUTCH-341
> URL: http://issues.apache.org/jira/browse/NUTCH-341
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 0.8
>Reporter: Chris Schneider
>Priority: Critical
> Attachments: doNotDeleteTmpIndexMergeDirV1.patch
>
>
> Change 383304 deleted the following line near Line 117 (see  for details):
> workDir = new File(workDir, "indexmerger-workingdir");
> Previously, if no -workingdir <workingdir> parameter was specified, 
> IndexMerger.main() would place an "indexmerger-workingdir" directory into 
> the default directory and then delete the former after completing. Now, 
> IndexMerger.main() defaults the value of its workDir to "indexmerger" within 
> the default directory, and deletes this workDir afterward.
> However, if -workingdir <workingdir> _is_ specified, IndexMerger.main() will 
> now set workDir to _this_ path and delete the _entire_ <workingdir> 
> afterward. Previously, IndexMerger.main() would only delete 
> <workingdir>/"indexmerger-workingdir", without deleting <workingdir> itself. 
> This is because the line mentioned above always appended 
> "indexmerger-workingdir" to workDir.
> Our hardware configuration on the jobtracker/namenode box attempts to keep 
> all large datasets on a separate, large hard drive. Accordingly, we were 
> keeping dfs.name.dir, dfs.data.dir, mapred.system.dir, and mapred.local.dir 
> on this drive. Unfortunately, we were passing the folder containing these 
> folders in the <workingdir> parameter to the IndexMerger. As a result, the 
> first time we ran the IndexMerger, we ended up trashing our entire DFS!
> Perhaps the way that the IndexMerger handles its <workingdir> parameter now 
> is an acceptable design. However, given the way it handled this parameter in 
> the past, I feel that the current implementation is unacceptably dangerous.
> More importantly, perhaps there's some way that we could make hadoop more 
> robust in handling its critical data files. I plan to place a directory 
> owned by root with "dr" permissions into each of these critical directories 
> in order to prevent any of them from suffering the fate of our DFS. This 
> could become part of a standard hadoop installation.





[jira] Commented: (NUTCH-342) Nutch commands log to nutch/logs/hadoop.logs by default

2006-08-17 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-342?page=comments#action_12428922 ] 

Stefan Groschupf commented on NUTCH-342:


We should clean up logging in nutch in general asap! 
The way things are configured today is anything but elegant or clean. :-(  

> Nutch commands log to nutch/logs/hadoop.logs by default
> ---
>
> Key: NUTCH-342
> URL: http://issues.apache.org/jira/browse/NUTCH-342
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 0.8
>Reporter: Chris Schneider
>Priority: Minor
> Attachments: NUTCH-342.patch
>
>
> If (by default) Nutch commands are going to send their output to a file named 
> "hadoop.log", then it seems like the default location for this file should be 
> the same location where Hadoop is putting its hadoop.log file (i.e., 
> $HADOOP_LOG_DIR). Currently, if I set HADOOP_LOG_DIR to a special location 
> (via hadoop-env.sh), this has no effect on where Nutch commands send their 
> output.
> Some would probably suggest that I could just set NUTCH_LOG_DIR to 
> $HADOOP_LOG_DIR myself. I still think that it should be defaulted this way in 
> the nutch script. However, I'm unaware of an elegant way to modify such Nutch 
> environment variables anyway. The hadoop-env.sh file provides a convenient 
> place to modify Hadoop environment variables, but doing the same for Nutch 
> environment variables presumably requires you to modify .bash_profile or a 
> similar user script file (which is the way I used to accomplish this kind of 
> thing with Nutch 0.7).





[jira] Commented: (NUTCH-343) Index MP3 SHA1 hashes

2006-08-17 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-343?page=comments#action_12428920 ] 

Stefan Groschupf commented on NUTCH-343:


Thanks for the contribution, and also for including a test with your patch. :-)
Just a small comment from taking a first look at the patch file: 
my personal experience is that some nutch developers have strong opinions 
about code formatting, so you may want to check your code formatting. :-)

> Index MP3 SHA1 hashes
> -
>
> Key: NUTCH-343
> URL: http://issues.apache.org/jira/browse/NUTCH-343
> Project: Nutch
>  Issue Type: New Feature
>Affects Versions: 0.8, 0.9.0, 0.8.1
>Reporter: Hasan Diwan
> Attachments: parsemp3.pat
>
>
> Add indexing of the mp3s sha1 hash.





[jira] Commented: (NUTCH-345) Add support for Content-Encoding: deflated

2006-08-17 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-345?page=comments#action_12428918 ] 

Stefan Groschupf commented on NUTCH-345:


Shouldn't the DeflateUtils also be part of the protocol-http plugin? 
Also, since this is a larger contribution and not just a small bug fix, it 
would be great to have a junit test within the patch. 
Thanks for the contribution.



> Add support for Content-Encoding: deflated
> --
>
> Key: NUTCH-345
> URL: http://issues.apache.org/jira/browse/NUTCH-345
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 0.9.0
>Reporter: Pascal Beis
>Priority: Minor
> Attachments: nutch-deflate.patch
>
>
> Add support for the "deflated" content-encoding, next to the already
> implemented GZIP content-encoding. Patch attached. See also the
> "Patch: deflate encoding" thread on nutch-dev on August 7/8 2006.





[jira] Commented: (NUTCH-346) Improve readability of logs/hadoop.log

2006-08-17 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-346?page=comments#action_12428917 ] 

Stefan Groschupf commented on NUTCH-346:


+1
I agree. Can you please create a patch file and attach it to this bug? 
Thanks

> Improve readability of logs/hadoop.log
> --
>
> Key: NUTCH-346
> URL: http://issues.apache.org/jira/browse/NUTCH-346
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 0.9.0
> Environment: ubuntu dapper
>Reporter: Renaud Richardet
>Priority: Minor
>
> adding
> log4j.logger.org.apache.nutch.plugin.PluginRepository=WARN
> to conf/log4j.properties
> dramatically improves the readability of the logs in logs/hadoop.log (removes 
> all INFO)





[jira] Commented: (NUTCH-347) Build: plugins' Jars not found

2006-08-17 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-347?page=comments#action_12428915 ] 

Stefan Groschupf commented on NUTCH-347:


Please submit this patch! 
Thanks!

> Build: plugins' Jars not found
> --
>
> Key: NUTCH-347
> URL: http://issues.apache.org/jira/browse/NUTCH-347
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 0.8
>Reporter: Otis Gospodnetic
> Attachments: nutch_build_plugins_patch.txt
>
>
> While building Nutch, I noticed several places where various Jars from 
> plugins' lib directories could not be found, for example:
> $ ant package
> ...
> deploy:
>  [copy] Warning: Could not find file 
> /home/otis/dev/repos/lucene/nutch/trunk/build/lib-log4j/lib-log4j.jar to copy.
> init:
> init-plugin:
> compile:
> jar:
> deps-test:
> deploy:
>  [copy] Warning: Could not find file 
> /home/otis/dev/repos/lucene/nutch/trunk/build/lib-nekohtml/lib-nekohtml.jar 
> to copy.
> ...
> The problem is, these "lib-<name>.jar" files do not exist. Instead, those 
> Jars are typically named with a version in the name, like log4j-1.2.11.jar. 
> I could not find where this "lib-" prefix comes from, nor where the version 
> is dropped from the name. Does anyone know?
> In order to avoid these errors I had to make symbolic links and fake things:
> e.g.
>   ln -s log4j-1.2.11.jar lib-log4j.jar
> But this should really be fixed somewhere, I just can't see where... :(
> Note that this doesn't completely break the build, but missing Jars can't be 
> a good thing.





[jira] Resolved: (NUTCH-322) Fetcher discards ProtocolStatus, doesn't store redirected pages

2006-08-17 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-322?page=all ]

Stefan Groschupf resolved NUTCH-322.


Resolution: Duplicate

duplicate of NUTCH-353

> Fetcher discards ProtocolStatus, doesn't store redirected pages
> ---
>
> Key: NUTCH-322
> URL: http://issues.apache.org/jira/browse/NUTCH-322
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 0.8
>Reporter: Andrzej Bialecki 
> Fix For: 0.9.0
>
>
> Fetcher doesn't store ProtocolStatus in output segments. ProtocolStatus 
> contains important information, such as protocol-level response code, 
> lastModified time, and possibly other messages.
> I propose that ProtocolStatus should be stored inside CrawlDatum.metaData, 
> which is then stored into crawl_fetch (in Fetcher.FetcherThread.output()). In 
> addition, if ProtocolStatus contains a valid lastModified time, that 
> CrawlDatum's modified time should also be set to this value.
> Additionally, Fetcher doesn't store redirected pages. Content of such pages 
> is silently discarded. When Fetcher translates from protocol-level status to 
> crawldb-level status it should probably store such pages with the following 
> translation of status codes:
> * ProtocolStatus.TEMP_MOVED -> CrawlDatum.STATUS_DB_RETRY. This code 
> indicates a transient change, so we probably shouldn't mark the initial URL 
> as bad.
> * ProtocolStatus.MOVED -> CrawlDatum.STATUS_DB_GONE. This code indicates a 
> permanent change, so the initial URL is no longer valid, i.e. it will always 
> result in redirects.





[jira] Updated: (NUTCH-353) pages that serverside forwards will be refetched every time

2006-08-17 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-353?page=all ]

Stefan Groschupf updated NUTCH-353:
---

Attachment: doNotRefecthForwarderPagesV1.patch

Since we discussed that nutch needs to be more polite, we should fix this asap. 

> pages that serverside forwards will be refetched every time
> ---
>
> Key: NUTCH-353
> URL: http://issues.apache.org/jira/browse/NUTCH-353
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 0.8.1, 0.9.0
>Reporter: Stefan Groschupf
>Priority: Blocker
> Fix For: 0.8.1
>
> Attachments: doNotRefecthForwarderPagesV1.patch
>
>
> Pages that do a serverside forward are not written back into the crawlDb 
> with a status change. The nextFetchTime is not changed either. 
> This causes a refetch of the same page again and again. The result is that 
> nutch is not polite, refetching the forwarding and the target page in each 
> segment iteration. It also affects the scoring, since the forwarding page 
> contributes its score to all outlinks.





[jira] Created: (NUTCH-353) pages that serverside forwards will be refetched every time

2006-08-17 Thread Stefan Groschupf (JIRA)
pages that serverside forwards will be refetched every time
---

 Key: NUTCH-353
 URL: http://issues.apache.org/jira/browse/NUTCH-353
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8.1, 0.9.0
Reporter: Stefan Groschupf
Priority: Blocker
 Fix For: 0.8.1
 Attachments: doNotRefecthForwarderPagesV1.patch

Pages that do a serverside forward are not written back into the crawlDb with 
a status change. The nextFetchTime is not changed either. 
This causes a refetch of the same page again and again. The result is that 
nutch is not polite, refetching the forwarding and the target page in each 
segment iteration. It also affects the scoring, since the forwarding page 
contributes its score to all outlinks.







[jira] Commented: (NUTCH-322) Fetcher discards ProtocolStatus, doesn't store redirected pages

2006-08-17 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-322?page=comments#action_12428858 ] 

Stefan Groschupf commented on NUTCH-322:


I think this is a serious problem. Page A does a server side redirect to Page 
B. Page A is never written to the output. As a result, Page A never changes 
its state or its next fetch time, which means that Page A is fetched again, 
again, again ... ∞

I suggest that we write out Page A with a status change to STATUS_DB_GONE.
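
A self-contained illustration of this translation (the constants are 
placeholders standing in for ProtocolStatus.* and CrawlDatum.*, not Nutch's 
real byte codes):

// sketch only: map a protocol-level redirect onto a crawldb-level status
public class RedirectStatusSketch {
  static final int TEMP_MOVED = 1, MOVED = 2;                // protocol codes
  static final byte STATUS_DB_RETRY = 1, STATUS_DB_GONE = 2; // db statuses

  static byte translate(int protocolCode, byte current) {
    switch (protocolCode) {
      case TEMP_MOVED: return STATUS_DB_RETRY; // transient, keep the URL
      case MOVED:      return STATUS_DB_GONE;  // permanent, URL is dead
      default:         return current;         // no redirect, leave as-is
    }
  }
}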


> Fetcher discards ProtocolStatus, doesn't store redirected pages
> ---
>
> Key: NUTCH-322
> URL: http://issues.apache.org/jira/browse/NUTCH-322
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 0.8
>Reporter: Andrzej Bialecki 
> Fix For: 0.9.0
>
>
> Fetcher doesn't store ProtocolStatus in output segments. ProtocolStatus 
> contains important information, such as protocol-level response code, 
> lastModified time, and possibly other messages.
> I propose that ProtocolStatus should be stored inside CrawlDatum.metaData, 
> which is then stored into crawl_fetch (in Fetcher.FetcherThread.output()). In 
> addition, if ProtocolStatus contains a valid lastModified time, that 
> CrawlDatum's modified time should also be set to this value.
> Additionally, Fetcher doesn't store redirected pages. Content of such pages 
> is silently discarded. When Fetcher translates from protocol-level status to 
> crawldb-level status it should probably store such pages with the following 
> translation of status codes:
> * ProtocolStatus.TEMP_MOVED -> CrawlDatum.STATUS_DB_RETRY. This code 
> indicates a transient change, so we probably shouldn't mark the initial URL 
> as bad.
> * ProtocolStatus.MOVED -> CrawlDatum.STATUS_DB_GONE. This code indicates a 
> permanent change, so the initial URL is no longer valid, i.e. it will always 
> result in redirects.





[jira] Updated: (NUTCH-350) urls blocked db.fetch.retry.max * http.max.delays times during fetching are marked as STATUS_DB_GONE

2006-08-17 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-350?page=all ]

Stefan Groschupf updated NUTCH-350:
---

Attachment: protocolRetryV5.patch

This patch will dramatically increase the number of successfully fetched pages 
of an intranet crawl over time. 
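
The core idea, reduced to a self-contained sketch (the real change lives in 
the fetcher's exception handling; the names here are illustrative):

// per this issue: an attempt that failed only because another thread held
// the per-host politeness lock should not consume a db.fetch.retry.max slot
public class RetryPolicySketch {
  static int nextRetriesSinceFetch(int retries, boolean blockedByOtherThread) {
    return blockedByOtherThread ? retries : retries + 1;
  }
}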

> urls blocked db.fetch.retry.max * http.max.delays times during fetching are 
> marked as STATUS_DB_GONE
> 
>
> Key: NUTCH-350
> URL: http://issues.apache.org/jira/browse/NUTCH-350
> Project: Nutch
>  Issue Type: Bug
>Reporter: Stefan Groschupf
>Priority: Critical
> Attachments: protocolRetryV5.patch
>
>
> Intranet crawls or focused crawls will fetch many pages from the same host. 
> This causes a thread to be blocked because another thread is already 
> fetching from the same host. It is very likely that threads are blocked 
> more often than http.max.delays. In such a case the HttpBase.blockAddr 
> method throws an HttpException. This is handled in the fetcher by 
> incrementing the crawlDatum retries and setting the status to 
> STATUS_FETCH_RETRY. That means you have only db.fetch.retry.max * 
> http.max.delays chances to fetch a url. But in intranet or focused crawls 
> it is very likely that this is not enough, and increasing one of the 
> involved properties dramatically slows down the fetch. 
> I suggest not increasing the CrawlDatum retriesSinceFetch in case the 
> problem was caused by a blocked thread.





[jira] Created: (NUTCH-350) urls blocked db.fetch.retry.max * http.max.delays times during fetching are marked as STATUS_DB_GONE

2006-08-17 Thread Stefan Groschupf (JIRA)
urls blocked db.fetch.retry.max * http.max.delays times during fetching are 
marked as STATUS_DB_GONE  
--

 Key: NUTCH-350
 URL: http://issues.apache.org/jira/browse/NUTCH-350
 Project: Nutch
  Issue Type: Bug
Reporter: Stefan Groschupf
Priority: Critical


Intranet crawls or focused crawls will fetch many pages from the same host. 
This causes a thread to be blocked because another thread is already fetching 
from the same host. It is very likely that threads are blocked more often than 
http.max.delays. In such a case the HttpBase.blockAddr method throws an 
HttpException. This is handled in the fetcher by incrementing the crawlDatum 
retries and setting the status to STATUS_FETCH_RETRY. That means you have only 
db.fetch.retry.max * http.max.delays chances to fetch a url. But in intranet 
or focused crawls it is very likely that this is not enough, and increasing 
one of the involved properties dramatically slows down the fetch. 
I suggest not increasing the CrawlDatum retriesSinceFetch in case the problem 
was caused by a blocked thread.





[jira] Updated: (NUTCH-348) Generator is building fetch list using *lowest* scoring URLs

2006-08-16 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-348?page=all ]

Stefan Groschupf updated NUTCH-348:
---

Attachment: sortPatchV1.patch

What do people think about this kind of solution?

> Generator is building fetch list using *lowest* scoring URLs
> 
>
> Key: NUTCH-348
> URL: http://issues.apache.org/jira/browse/NUTCH-348
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Reporter: Chris Schneider
> Attachments: sortPatchV1.patch
>
>
> Ever since revision 391271, when the CrawlDatum key was replaced by a 
> FloatWritable key, the Generator.Selector.reduce method has been outputting 
> the *lowest* scoring URLs! The CrawlDatum class has a Comparator that 
> essentially treats higher scoring CrawlDatum objects as "less than" lower 
> scoring CrawlDatum objects, so the higher scoring ones would appear first in 
> a sequence file sorted using this as the key.
> When a FloatWritable based on the score itself (as returned from 
> scfilters.generatorSortValue) became the sort key, it should have been 
> negated in Generator.Selector.map to have the same result. Curiously, there 
> is a comment to this effect immediately before the FloatWritable is set:
>   // sort by decreasing score
>   sortValue.set(sort);
> It seems like the simplest way to fix this is to just negate the score, and 
> this seems to work for me:
>   // sort by decreasing score
>   // 2006-08-15 CSc REALLY sort by decreasing score
>   sortValue.set(-sort);
> Unfortunately, this means that any crawls that have been done using 
> Generator.java after revision 391271 should be discarded, as they were 
> focused on fetching the lowest scoring unfetched URLs in the crawldb, 
> essentially pointing the crawler 180 degrees from its intended direction.





[jira] Commented: (NUTCH-233) wrong regular expression hang reduce process for ever

2006-08-16 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-233?page=comments#action_12428542 ] 

Stefan Groschupf commented on NUTCH-233:


Hi Otis, 
yes, for a serious whole web crawl I need to change this regex first.
It only hangs on some random urls, for example ones that come from link farms 
the crawler runs into. 
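
The difference between the two expressions discussed in this issue is easy to 
verify with plain java.util.regex; this standalone check reproduces the 
matches described below:

import java.util.regex.Pattern;

public class RegexCheck {
  public static void main(String[] args) {
    String oldRe = ".*(/.+?)/.*?\\1/.*?\\1/";
    String newRe = ".*(/[^/]+)/[^/]+\\1/[^/]+\\1/";
    String a = "abcd/foo/bar/foo/bar/foo/";
    String b = "abcd/foo/bar/xyz/foo/bar/foo/";
    System.out.println(Pattern.matches(oldRe, a)); // true
    System.out.println(Pattern.matches(newRe, a)); // true
    System.out.println(Pattern.matches(oldRe, b)); // true
    System.out.println(Pattern.matches(newRe, b)); // false
  }
}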

> wrong regular expression hang reduce process for ever
> -
>
> Key: NUTCH-233
> URL: http://issues.apache.org/jira/browse/NUTCH-233
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 0.8
>Reporter: Stefan Groschupf
>Priority: Blocker
> Fix For: 0.9.0
>
>
> Looks like the expression ".*(/.+?)/.*?\1/.*?\1/" in regex-urlfilter.txt 
> wasn't compatible with java.util.regex, which is actually used in the regex 
> url filter. 
> Maybe it was missed when the regular expression package was changed.
> The problem was that while reducing a fetch map output the reducer hung 
> forever, since the outputformat was applying the urlfilter to a url that 
> caused the hang.
> 060315 230823 task_r_3n4zga at 
> java.lang.Character.codePointAt(Character.java:2335)
> 060315 230823 task_r_3n4zga at 
> java.util.regex.Pattern$Dot.match(Pattern.java:4092)
> 060315 230823 task_r_3n4zga at 
> java.util.regex.Pattern$Curly.match1(Pattern.java:
> I changed the regular expression to ".*(/[^/]+)/[^/]+\1/[^/]+\1/" and now 
> the fetch job works. (thanks to Grant and Chris B. for helping to find the 
> new regex)
> However, people may review it and suggest improvements. The old regex would 
> match "abcd/foo/bar/foo/bar/foo/", and so will the new one. But the old 
> regex would also match "abcd/foo/bar/xyz/foo/bar/foo/", which the new regex 
> will not match.





[jira] Commented: (NUTCH-349) Port Nutch to use Hadoop Text instead of UTF8

2006-08-16 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-349?page=comments#action_12428537 ] 

Stefan Groschupf commented on NUTCH-349:


my vote goes to #2.
Having a tool that needs to be started manually would be better than 
complicating the already fragile code, from my point of view. 
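
For illustration, a rough sketch of what the converter's map step could look 
like under the old Hadoop mapred API (class and method shape are assumptions, 
not an existing tool):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// hypothetical map step of a bulk UTF8 -> Text conversion job:
// rewrite the key type, pass the value through unchanged
public class KeyToTextMapper extends MapReduceBase implements Mapper {
  public void map(WritableComparable key, Writable value,
      OutputCollector output, Reporter reporter) throws IOException {
    output.collect(new Text(key.toString()), value);
  }
}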

> Port Nutch to use Hadoop Text instead of UTF8
> -
>
> Key: NUTCH-349
> URL: http://issues.apache.org/jira/browse/NUTCH-349
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 0.9.0
>Reporter: Andrzej Bialecki 
>
> Currently Nutch uses org.apache.hadoop.io.UTF8 class to store/read Strings. 
> This class has been deprecated in Hadoop 0.5.0, and Text class should be used 
> instead. Sooner or later we will need to move Nutch to use this class instead 
> of UTF8.
> This raises numerous issues regarding the compatibility of existing data in 
> CrawlDB, LinkDB and segments. I can see two ways to solve this:
> * add code in readers of respective formats to convert UTF8->Text on the fly. 
> New writers would only use Text. This is less than ideal, because it 
> complicates the code, and also at some point in time the UTF8 class will be 
> removed.
> * create a converter (to be maintaines as long as UTF8 exists), which 
> converts existing data in bulk from UTF8 to Text. This requires an additional 
> processing step when upgrading to convert all existing data to the new format.





[jira] Updated: (NUTCH-332) doubling score causes by page internal anchors.

2006-07-27 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-332?page=all ]

Stefan Groschupf updated NUTCH-332:
---

Attachment: scoreDoubling.patch

A patch to solve this problem. 

This is an example page:
http://bid.berkeley.edu/bidclass/readings/benjamin.html
This page has several anchors that cause the problem in this case.

What happens is: 
foo.com/a.html points to foo.com/a.html#chapter1, and we normalize 
foo.com/a.html#chapter1 to foo.com/a.html. 

So foo.com/a.html contributes all its score back to foo.com/a.html. 
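
In pure-JDK terms the normalization step amounts to dropping the fragment, 
which is what turns an in-page anchor into a self-link (a sketch, not the 
actual Nutch normalizer code):

String url = "http://bid.berkeley.edu/bidclass/readings/benjamin.html#chapter1";
int hash = url.indexOf('#');
String normalized = hash < 0 ? url : url.substring(0, hash);
// normalized == "http://bid.berkeley.edu/bidclass/readings/benjamin.html"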


> doubling score causes by page internal anchors.
> ---
>
> Key: NUTCH-332
> URL: http://issues.apache.org/jira/browse/NUTCH-332
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 0.8-dev
>Reporter: Stefan Groschupf
>Priority: Blocker
> Fix For: 0.8-dev
>
> Attachments: scoreDoubling.patch
>
>
> When a page has no outlinks but several links to itself, e.g. it has a set 
> of anchors, the scores of the page are distributed to its outlinks. But all 
> these outlinks point back to the page. This causes the page score to be 
> doubled. 
> I'm not sure, but maybe this also causes a never ending fetching loop for 
> this page, since outlinks with the status CrawlDatum.STATUS_LINKED are set 
> to CrawlDatum.STATUS_DB_UNFETCHED in CrawlDBReducer line 107. 
> So the status fetched may be overwritten with unfetched. 
> In such a case we fetch the page again every time and also double the score 
> of the page every time, which causes very high scores without any reason.





[jira] Created: (NUTCH-332) doubling score causes by page internal anchors.

2006-07-27 Thread Stefan Groschupf (JIRA)
doubling score causes by page internal anchors.
---

 Key: NUTCH-332
 URL: http://issues.apache.org/jira/browse/NUTCH-332
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Blocker
 Fix For: 0.8-dev


When a page has no outlinks but several links to itself, e.g. it has a set of 
anchors, the scores of the page are distributed to its outlinks. But all these 
outlinks point back to the page. This causes the page score to be doubled. 
I'm not sure, but maybe this also causes a never ending fetching loop for this 
page, since outlinks with the status CrawlDatum.STATUS_LINKED are set to 
CrawlDatum.STATUS_DB_UNFETCHED in CrawlDBReducer line 107. 
So the status fetched may be overwritten with unfetched. 
In such a case we fetch the page again every time and also double the score of 
the page every time, which causes very high scores without any reason.





[jira] Commented: (NUTCH-318) log4j not proper configured, readdb doesnt give any information

2006-07-25 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-318?page=comments#action_12423539 ] 

Stefan Groschupf commented on NUTCH-318:


Yes, this happens only in a distributed environment. Please also see my last 
mail on the hadoop dev list. I think there are more general logging problems 
that only occur in a distributed environment, so you will not track them down 
using the local runner.

> log4j not proper configured, readdb doesnt give any information
> ---
>
> Key: NUTCH-318
> URL: http://issues.apache.org/jira/browse/NUTCH-318
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 0.8-dev
>Reporter: Stefan Groschupf
>Priority: Critical
> Fix For: 0.9-dev
>
>
> In the latest .8 sources the readdb command doesn't dump any information 
> anymore. 
> This is related to the misconfigured log4j.properties file. 
> Changing:
> log4j.rootLogger=INFO,DRFA
> to:
> log4j.rootLogger=INFO,DRFA,stdout
> dumps the information to the console, but not in a nice way. 
> What makes me wonder is that this information should also be in the log 
> file, but it isn't, so there may be problems here as well.
> Also, what is the difference between hadoop-XXX-jobtracker-XXX.out and 
> hadoop-XXX-jobtracker-XXX.log? Shouldn't there be just one of them?





[jira] Commented: (NUTCH-233) wrong regular expression hang reduce process for ever

2006-07-25 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-233?page=comments#action_12423438 ] 

Stefan Groschupf commented on NUTCH-233:


I think this should be fixed in .8 too, since everybody who does a real whole 
web crawl with over 100 million pages will run into this problem. The 
problematic urls come, for example, from spam bot generated urls. 



> wrong regular expression hang reduce process for ever
> -
>
> Key: NUTCH-233
> URL: http://issues.apache.org/jira/browse/NUTCH-233
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 0.8-dev
>Reporter: Stefan Groschupf
>Priority: Blocker
> Fix For: 0.9-dev
>
>
> Looks like the expression ".*(/.+?)/.*?\1/.*?\1/" in regex-urlfilter.txt 
> wasn't compatible with java.util.regex, which is actually used in the regex 
> url filter. 
> Maybe it was missed when the regular expression package was changed.
> The problem was that while reducing a fetch map output the reducer hung 
> forever, since the outputformat was applying the urlfilter to a url that 
> caused the hang.
> 060315 230823 task_r_3n4zga at 
> java.lang.Character.codePointAt(Character.java:2335)
> 060315 230823 task_r_3n4zga at 
> java.util.regex.Pattern$Dot.match(Pattern.java:4092)
> 060315 230823 task_r_3n4zga at 
> java.util.regex.Pattern$Curly.match1(Pattern.java:
> I changed the regular expression to ".*(/[^/]+)/[^/]+\1/[^/]+\1/" and now 
> the fetch job works. (thanks to Grant and Chris B. for helping to find the 
> new regex)
> However, people may review it and suggest improvements. The old regex would 
> match "abcd/foo/bar/foo/bar/foo/", and so will the new one. But the old 
> regex would also match "abcd/foo/bar/xyz/foo/bar/foo/", which the new regex 
> will not match.





[jira] Commented: (NUTCH-318) log4j not proper configured, readdb doesnt give any information

2006-07-25 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-318?page=comments#action_12423433 ] 

Stefan Groschupf commented on NUTCH-318:


Shouldn't that be fixed in .8, since as of today this tool just produces no 
output?!


> log4j not proper configured, readdb doesnt give any information
> ---
>
> Key: NUTCH-318
> URL: http://issues.apache.org/jira/browse/NUTCH-318
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 0.8-dev
>Reporter: Stefan Groschupf
>Priority: Critical
> Fix For: 0.9-dev
>
>
> In the latest .8 sources the readdb command doesn't dump any information 
> anymore. 
> This is related to the misconfigured log4j.properties file. 
> Changing:
> log4j.rootLogger=INFO,DRFA
> to:
> log4j.rootLogger=INFO,DRFA,stdout
> dumps the information to the console, but not in a nice way. 
> What makes me wonder is that this information should also be in the log 
> file, but it isn't, so there may be problems here as well.
> Also, what is the difference between hadoop-XXX-jobtracker-XXX.out and 
> hadoop-XXX-jobtracker-XXX.log? Shouldn't there be just one of them?





[jira] Created: (NUTCH-329) CrawlDbReader processTopNJob does not set jobNames

2006-07-23 Thread Stefan Groschupf (JIRA)
CrawlDbReader processTopNJob does not set jobNames
--

 Key: NUTCH-329
 URL: http://issues.apache.org/jira/browse/NUTCH-329
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Minor
 Fix For: 0.8-dev


processTopNJob runs two jobs and neither has a job name set. 






[jira] Updated: (NUTCH-325) UrlFilters.java throws NPE in case urlfilter.order contains Filters that are not in plugin.includes

2006-07-20 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-325?page=all ]

Stefan Groschupf updated NUTCH-325:
---

Attachment: UrlFiltersNPE.patch

A patch that uses an ArrayList instead of an array and puts an entry into the 
list only when the entry is not null. This means that only url filters which 
were actually loaded end up in the filters array that is cached in the 
Configuration object. 
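
A simplified illustration of that approach (types reduced to Object for 
brevity; the real code deals with URLFilter extension instances):

import java.util.ArrayList;
import java.util.List;

public class FilterOrderingSketch {
  /** Keep only the filters that actually loaded, so the cached
      array contains no null slots to trip over later. */
  static Object[] loadedOnly(Object[] ordered) {
    List loaded = new ArrayList();
    for (int i = 0; i < ordered.length; i++) {
      // a filter named in urlfilter.order but missing from
      // plugin.includes shows up as null here: skip it
      if (ordered[i] != null) {
        loaded.add(ordered[i]);
      }
    }
    return loaded.toArray();
  }
}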

> UrlFilters.java throws NPE in case urlfilter.order contains Filters that are 
> not in plugin.includes
> ---
>
> Key: NUTCH-325
> URL: http://issues.apache.org/jira/browse/NUTCH-325
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 0.8-dev
>Reporter: Stefan Groschupf
>Priority: Minor
> Fix For: 0.8-dev
>
> Attachments: UrlFiltersNPE.patch
>
>
> In the URLFilters constructor we use an array as long as the number of 
> filters defined in the urlfilter.order property. 
> In case those filters are not included in the plugin.includes property, we 
> end up putting null entries into the array.
> This causes a NPE in URLFilters line 82.





[jira] Created: (NUTCH-325) UrlFilters.java throws NPE in case urlfilter.order contains Filters that are not in plugin.includes

2006-07-20 Thread Stefan Groschupf (JIRA)
UrlFilters.java throws NPE in case urlfilter.order contains Filters that are 
not in plugin.includes
---

 Key: NUTCH-325
 URL: http://issues.apache.org/jira/browse/NUTCH-325
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Minor
 Fix For: 0.8-dev


In the URLFilters constructor we use an array as long as the number of filters 
defined in the urlfilter.order property. 
In case those filters are not included in the plugin.includes property, we end 
up putting null entries into the array.

This causes a NPE in URLFilters line 82.







[jira] Resolved: (NUTCH-319) OPICScoringFilter should use logging API instead of printStackTrace

2006-07-19 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-319?page=all ]

Stefan Groschupf resolved NUTCH-319.


Resolution: Won't Fix

Sorry, that is bogus, since it is written to the logging stream.

> OPICScoringFilter should use logging API instead of printStackTrace
> ---
>
> Key: NUTCH-319
> URL: http://issues.apache.org/jira/browse/NUTCH-319
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 0.8-dev
>Reporter: Stefan Groschupf
> Assigned To: Andrzej Bialecki 
>Priority: Trivial
> Fix For: 0.8-dev
>
>
> OPICScoringFilter line 107 should be a log call, not 
> e.printStackTrace(LogUtil.getWarnStream(LOG)), shouldn't it?





[jira] Updated: (NUTCH-324) db.score.link.internal and db.score.link.external are ignored

2006-07-19 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-324?page=all ]

Stefan Groschupf updated NUTCH-324:
---

Attachment: InternalAndExternalLinkScoreFactor.patch

Multiplies the score of a page during distributeScoreToOutlink by 
db.score.link.internal or db.score.link.external.
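
Sketched in two lines (the surrounding names are assumed; the real change 
sits inside distributeScoreToOutlink):

// weight the contribution by whether the outlink stays on the same host
float factor = sameHost ? internalScoreFactor : externalScoreFactor;
outlinkDatum.setScore(outlinkDatum.getScore() * factor);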

> db.score.link.internal and db.score.link.external are ignored
> -
>
> Key: NUTCH-324
> URL: http://issues.apache.org/jira/browse/NUTCH-324
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Reporter: Stefan Groschupf
>Priority: Critical
> Attachments: InternalAndExternalLinkScoreFactor.patch
>
>
> Configuration properties db.score.link.external and db.score.link.internal 
> are ignored.
> In the case of e.g. message board webpages, or pages that have large 
> navigation menus on each page, giving internal links a lower impact makes a 
> lot of sense for scoring.
> Also for web spam this is a serious problem, since spammers can now set up 
> just one domain with dynamically generated pages and thereby highly 
> manipulate the nutch scores. 
> So I also suggest that we give db.score.link.internal by default a value of 
> something like 0.25. 





[jira] Created: (NUTCH-324) db.score.link.internal and db.score.link.external are ignored

2006-07-19 Thread Stefan Groschupf (JIRA)
db.score.link.internal and db.score.link.external are ignored
-

 Key: NUTCH-324
 URL: http://issues.apache.org/jira/browse/NUTCH-324
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Reporter: Stefan Groschupf
Priority: Critical


Configuration properties db.score.link.external and db.score.link.internal are 
ignored.
For e.g. message board webpages, or pages that have large navigation menus on 
each page, giving internal links a lower impact makes a lot of sense for 
scoring.
This is also a serious problem for web spam, since spammers can set up just one 
domain with dynamically generated pages and thereby highly manipulate the nutch 
scores. 
So I also suggest that we give db.score.link.internal a default value of 
something like 0.25. 






[jira] Updated: (NUTCH-323) CrawlDatum.set just reference a mapWritable of a other object but not copy it.

2006-07-19 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-323?page=all ]

Stefan Groschupf updated NUTCH-323:
---

Attachment: MapWritableCopyConstructor.patch

The attached patch adds a copy constructor to the MapWritable and uses it in 
the CrawlDatum.set method. There are more places in the code where metadata is 
passed from one CrawlDatum to another, but I don't see any risk of concurrent 
usage of the MapWritable there. 
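
A minimal sketch of the copy-constructor approach; the iteration over entries 
is illustrative, not the actual patch:

    public MapWritable(MapWritable map) {
      // copy the key-value tuples so the new datum owns its own map
      if (map != null) {
        Iterator keys = map.keySet().iterator();
        while (keys.hasNext()) {
          Writable key = (Writable) keys.next();
          put(key, map.get(key));
        }
      }
    }

CrawlDatum.set would then do this.metaData = new MapWritable(that.metaData); 
instead of assigning the reference.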


> CrawlDatum.set just reference a mapWritable of a other object but not copy it.
> --
>
> Key: NUTCH-323
> URL: http://issues.apache.org/jira/browse/NUTCH-323
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 0.8-dev
>Reporter: Stefan Groschupf
>Priority: Critical
> Fix For: 0.8-dev
>
> Attachments: MapWritableCopyConstructor.patch
>
>
> Using CrawlDatum.set(aOtherCrawlDatum) copies the data from one CrawlDatum to 
> another. 
> It also passes a reference to the MapWritable, meaning both objects share the 
> same MapWritable and its content. 
> This causes problems when the MapWritable and its key-value tuples are 
> manipulated concurrently. 





[jira] Created: (NUTCH-323) CrawlDatum.set just reference a mapWritable of a other object but not copy it.

2006-07-19 Thread Stefan Groschupf (JIRA)
CrawlDatum.set just reference a mapWritable of a other object but not copy it.
--

 Key: NUTCH-323
 URL: http://issues.apache.org/jira/browse/NUTCH-323
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Critical
 Fix For: 0.8-dev


Using CrawlDatum.set(aOtherCrawlDatum) copies the data from one CrawlDatum to 
another. 
It also passes a reference to the MapWritable, meaning both objects share the 
same MapWritable and its content. 
This causes problems when the MapWritable and its key-value tuples are 
manipulated concurrently. 







[jira] Created: (NUTCH-319) OPICScoringFilter should use logging API instead of printStackTrace

2006-07-15 Thread Stefan Groschupf (JIRA)
OPICScoringFilter should use logging API instead of printStackTrace
---

 Key: NUTCH-319
 URL: http://issues.apache.org/jira/browse/NUTCH-319
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8-dev
Reporter: Stefan Groschupf
 Assigned To: Andrzej Bialecki 
Priority: Trivial
 Fix For: 0.8-dev


OPICScoringFilter line 107 should be a logging call rather than 
e.printStackTrace(LogUtil.getWarnStream(LOG)), shouldn't it?





[jira] Created: (NUTCH-318) log4j not proper configured, readdb doesnt give any information

2006-07-10 Thread Stefan Groschupf (JIRA)
log4j not proper configured, readdb doesnt give any information
---

 Key: NUTCH-318
 URL: http://issues.apache.org/jira/browse/NUTCH-318
 Project: Nutch
Type: Bug

Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Critical
 Fix For: 0.8-dev


In the latest 0.8 sources the readdb command doesn't dump any information 
anymore. 
This is related to the misconfigured log4j.properties file. 
Changing:
log4j.rootLogger=INFO,DRFA
to:
log4j.rootLogger=INFO,DRFA,stdout
dumps the information to the console, but not in a nice way. 

What makes me wonder is that this information should also be in the log file, 
but it isn't, so there may be problems here as well.
Also, what is the difference between hadoop-XXX-jobtracker-XXX.out and 
hadoop-XXX-jobtracker-XXX.log? Shouldn't there be just one of them?
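
A minimal log4j 1.x sketch of the stdout appender implied by the change above; 
the layout pattern is an illustrative assumption, not the project's actual 
configuration:

    log4j.rootLogger=INFO,DRFA,stdout
    log4j.appender.stdout=org.apache.log4j.ConsoleAppender
    log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
    log4j.appender.stdout.layout.ConversionPattern=%d{ISO8601} %-5p %c{2} - %m%n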





[jira] Created: (NUTCH-307) wrong configured log4j.properties

2006-06-19 Thread Stefan Groschupf (JIRA)
wrong configured log4j.properties
-

 Key: NUTCH-307
 URL: http://issues.apache.org/jira/browse/NUTCH-307
 Project: Nutch
Type: Bug

Reporter: Stefan Groschupf
Priority: Blocker
 Fix For: 0.8-dev


In nutch/conf there is only one log4j.properties and it defines:
log4j.appender.DRFA.File=${nutch.log.dir}/${nutch.log.file}
nutch.log.dir and nutch.log.file are only defined in the bin/nutch script. 
When starting a distributed nutch instance with bin/start-all, the remote 
tasktrackers crash with:

 java.io.FileNotFoundException: / (Is a directory)
cr06:   at java.io.FileOutputStream.openAppend(Native Method)
cr06:   at java.io.FileOutputStream.<init>(FileOutputStream.java:177)
cr06:   at java.io.FileOutputStream.<init>(FileOutputStream.java:102)
cr06:   at org.apache.log4j.FileAppender.setFile(FileAppender.java:289)
cr06:   at org.apache.log4j.FileAppender.activateOptions(FileAppender.java:163)
cr06:   at 
org.apache.log4j.DailyRollingFileAppender.activateOptions(DailyRollingFileAppender.java:215)
cr06:   at 
org.apache.log4j.config.PropertySetter.activate(PropertySetter.java:256)

since the hadoop scripts used to start the tasktrackers and datanodes never 
define the nutch log properties, but log4j.properties requires such a 
definition.
I suggest we leave log4j.properties as it is in hadoop but define the hadoop 
property names in the bin/nutch script instead of introducing new variable 
names. 
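
A minimal sketch of that suggestion for bin/nutch, assuming hadoop's stock 
log4j.properties reads hadoop.log.dir and hadoop.log.file; the shell variable 
names are illustrative:

    # reuse hadoop's property names so the stock log4j.properties keeps working
    NUTCH_LOG_DIR="${NUTCH_LOG_DIR:-$NUTCH_HOME/logs}"
    NUTCH_LOGFILE="${NUTCH_LOGFILE:-nutch.log}"
    NUTCH_OPTS="$NUTCH_OPTS -Dhadoop.log.dir=$NUTCH_LOG_DIR"
    NUTCH_OPTS="$NUTCH_OPTS -Dhadoop.log.file=$NUTCH_LOGFILE"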





[jira] Updated: (NUTCH-289) CrawlDatum should store IP address

2006-06-12 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-289?page=all ]

Stefan Groschupf updated NUTCH-289:
---

Attachment: ipInCrawlDatumDraftV5.patch

Release candidate 1 of this patch.

This patch contains:
+ adds the IP address to CrawlDatum version 5 (as byte[4]) 
+ an IpAddressResolver (MapRunnable) tool to look up the IPs multithreaded
+ a property to define whether the IpAddressResolver should be started as part 
of the crawlDb update tool, to update the parseoutput folder (contains 
CrawlDatum Status Linked) of a segment before updating the crawlDb
+ uses the cached IP during generation

Please review this patch and give me any improvement suggestions. I think this 
is a very important issue, since it helps to do _real_ whole web crawls and not 
end up in a honey pot after some fetch iterations.
Also, if you like, please vote for this issue. :-) Thanks.

> CrawlDatum should store IP address
> --
>
>  Key: NUTCH-289
>  URL: http://issues.apache.org/jira/browse/NUTCH-289
>  Project: Nutch
> Type: Bug

>   Components: fetcher
> Versions: 0.8-dev
> Reporter: Doug Cutting
>  Attachments: ipInCrawlDatumDraftV1.patch, ipInCrawlDatumDraftV4.patch, 
> ipInCrawlDatumDraftV5.patch
>
> If the CrawlDatum stored the IP address of the host of its URL, then one 
> could:
> - partition fetch lists on the basis of IP address, for better politeness;
> - truncate pages to fetch per IP address, rather than just hostname.  This 
> would be a good way to limit the impact of domain spammers.
> The IP addresses could be resolved when a CrawlDatum is first created for a 
> new outlink, or perhaps during CrawlDB update.




[jira] Commented: (NUTCH-293) support for Crawl-delay in Robots.txt

2006-06-07 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-293?page=comments#action_12415236 ] 

Stefan Groschupf commented on NUTCH-293:


Hi Andrzej, 
I agree, but writing a queue-based fetcher is a big step. I already have some 
basic code (nio based).
Also, I don't think a new fetcher would be stable enough to put into a .8 
release. Since we plan to have a .8 release, I think it is a good idea to add 
this functionality for now. Maybe we make it configurable and switch it off by 
default?

In any case I suggest that we solve NUTCH-289 first and then get the fetcher 
done.


> support for Crawl-delay in Robots.txt
> -
>
>  Key: NUTCH-293
>  URL: http://issues.apache.org/jira/browse/NUTCH-293
>  Project: Nutch
> Type: Improvement

>   Components: fetcher
> Versions: 0.8-dev
> Reporter: Stefan Groschupf
> Priority: Critical
>  Attachments: crawlDelayv1.patch
>
> Nutch needs support for Crawl-delay as defined in robots.txt; it is not an 
> official standard but a de-facto one.
> See:
> http://help.yahoo.com/help/us/ysearch/slurp/slurp-03.html
> Webmasters have started blocking nutch since we do not support it.
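
A minimal robots.txt sketch of the de-facto directive; the agent token and the 
delay of 5 seconds are illustrative values:

    User-agent: NutchCVS
    Crawl-delay: 5
    Disallow: /cgi-bin/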




[jira] Commented: (NUTCH-293) support for Crawl-delay in Robots.txt

2006-06-07 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-293?page=comments#action_12415171 ] 

Stefan Groschupf commented on NUTCH-293:


Any comments? There was already a posting on the nutch-agent mailing list 
where someone had banned nutch since nutch does not support crawl-delay.
Because nutch tries to be polite, from my point of view this is a small but 
important change.
If there are no improvement suggestions, can one of the committers take care 
of this _please_? :-) 

> support for Crawl-delay in Robots.txt
> -
>
>  Key: NUTCH-293
>  URL: http://issues.apache.org/jira/browse/NUTCH-293
>  Project: Nutch
> Type: Improvement

>   Components: fetcher
> Versions: 0.8-dev
> Reporter: Stefan Groschupf
> Priority: Critical
>  Attachments: crawlDelayv1.patch
>
> Nutch needs support for Crawl-delay as defined in robots.txt; it is not an 
> official standard but a de-facto one.
> See:
> http://help.yahoo.com/help/us/ysearch/slurp/slurp-03.html
> Webmasters have started blocking nutch since we do not support it.




[jira] Updated: (NUTCH-301) CommonGrams loads analysis.common.terms.file for each query

2006-06-07 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-301?page=all ]

Stefan Groschupf updated NUTCH-301:
---

Attachment: CommonGramsCacheV1.patch

Caches the COMMON_TERMS HashMap in the Configuration instance.
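
A minimal sketch of the caching idea, assuming the setObject/getObject object 
cache available on the Hadoop Configuration of that era; the cache key and the 
parseCommonTermsFile helper are illustrative:

    private static HashMap getCommonTerms(Configuration conf) {
      // reuse the terms already parsed and cached on this Configuration
      HashMap terms = (HashMap) conf.getObject("commongrams.common.terms");
      if (terms == null) {
        // expensive part: reads and parses analysis.common.terms.file
        terms = parseCommonTermsFile(conf);
        conf.setObject("commongrams.common.terms", terms);
      }
      return terms;
    }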

> CommonGrams loads analysis.common.terms.file for each query
> ---
>
>  Key: NUTCH-301
>  URL: http://issues.apache.org/jira/browse/NUTCH-301
>  Project: Nutch
> Type: Improvement

>   Components: searcher
> Versions: 0.8-dev
> Reporter: Chris Schneider
>  Attachments: CommonGramsCacheV1.patch
>
> The move away from static objects toward instance variables has resulted in 
> CommonGrams constructor parsing its analysis.common.terms.file for each 
> query. I'm not certain how large a performance impact this really is, but it 
> seems like something you'd want to avoid doing for each query. Perhaps the 
> solution is to keep around an instance of the CommonGrams object itself?




[jira] Created: (NUTCH-302) java doc of CrawlDb is wrong

2006-06-07 Thread Stefan Groschupf (JIRA)
java doc of CrawlDb is wrong


 Key: NUTCH-302
 URL: http://issues.apache.org/jira/browse/NUTCH-302
 Project: Nutch
Type: Bug

Reporter: Stefan Groschupf
Priority: Trivial
 Fix For: 0.8-dev


CrawlDb has the same java doc as Injector. 




[jira] Updated: (NUTCH-289) CrawlDatum should store IP address

2006-06-07 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-289?page=all ]

Stefan Groschupf updated NUTCH-289:
---

Attachment: ipInCrawlDatumDraftV4.patch

Attached a patch that always uses 4 bytes for the IP, i.e. we ignore IPv6. 
This saves us 4 bytes in each crawlDatum for now.
I tested the resolver tool with a 200++ million crawlDb; on average a 
performance of 500 IP lookups/sec per box is possible using 1000 threads.

I really would love to get this into the sources as the basic version of having 
the IP address in the crawlDatum, since I'm working on a tool set of spam 
detectors that all need IP addresses somehow.
Maybe let's exclude the tool but start with the crawlDatum?
Any improvement suggestions?
Thanks.


> CrawlDatum should store IP address
> --
>
>  Key: NUTCH-289
>  URL: http://issues.apache.org/jira/browse/NUTCH-289
>  Project: Nutch
> Type: Bug

>   Components: fetcher
> Versions: 0.8-dev
> Reporter: Doug Cutting
>  Attachments: ipInCrawlDatumDraftV1.patch, ipInCrawlDatumDraftV4.patch
>
> If the CrawlDatum stored the IP address of the host of its URL, then one 
> could:
> - partition fetch lists on the basis of IP address, for better politeness;
> - truncate pages to fetch per IP address, rather than just hostname.  This 
> would be a good way to limit the impact of domain spammers.
> The IP addresses could be resolved when a CrawlDatum is first created for a 
> new outlink, or perhaps during CrawlDB update.




[jira] Updated: (NUTCH-289) CrawlDatum should store IP address

2006-06-05 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-289?page=all ]

Stefan Groschupf updated NUTCH-289:
---

Attachment: ipInCrawlDatumDraftV1.patch

To keep the discussion alive, attached is a _first draft_ for storing the IP 
in the crawlDatum, for public discussion.

Some notes: 
The IP is stored as byte[] in the crawlDatum itself, not in the metadata.
There is an IpAddressResolver MapRunnable tool to update a crawlDb using 
multithreaded IP lookups.
If an IP is available in the crawlDatum, the Generator uses the "cached" IP. 

To discuss:
I don't like the idea of post-processing the complete crawlDb every time after 
an update. 
Processing the crawlDb is expensive in storage usage and time. 
We can have a property "ipLookups" with possible values 
.
Then we can also add some code to look up the IP in the ParseOutputFormat as 
discussed, or we start the IpAddressResolver as a job in the updateDb tool code.

At the moment I write the IP address bytes like this:
out.writeInt(ipAddress.length);
out.write(ipAddress); 
I think for now we can define that byte[] ipAddress is always 4 bytes long, or 
should we be IPv6 compatible today?

Please give me some comments; I have a strong interest in getting this issue 
fixed asap and I'm willing to improve things as required. :-)
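
A minimal sketch of the fixed-length (IPv4-only) alternative discussed above; 
the field name is illustrative:

    private byte[] ipAddress = new byte[4];   // always 4 bytes, IPv4 only

    public void write(DataOutput out) throws IOException {
      out.write(ipAddress, 0, 4);             // fixed length, no writeInt prefix
    }

    public void readFields(DataInput in) throws IOException {
      in.readFully(ipAddress, 0, 4);          // read exactly 4 bytes back
    }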

> CrawlDatum should store IP address
> --
>
>  Key: NUTCH-289
>  URL: http://issues.apache.org/jira/browse/NUTCH-289
>  Project: Nutch
> Type: Bug

>   Components: fetcher
> Versions: 0.8-dev
> Reporter: Doug Cutting
>  Attachments: ipInCrawlDatumDraftV1.patch
>
> If the CrawlDatum stored the IP address of the host of its URL, then one 
> could:
> - partition fetch lists on the basis of IP address, for better politeness;
> - truncate pages to fetch per IP address, rather than just hostname.  This 
> would be a good way to limit the impact of domain spammers.
> The IP addresses could be resolved when a CrawlDatum is first created for a 
> new outlink, or perhaps during CrawlDB update.




[jira] Commented: (NUTCH-258) Once Nutch logs a SEVERE log item, Nutch fails forevermore

2006-06-05 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-258?page=comments#action_12414763 ] 

Stefan Groschupf commented on NUTCH-258:


Scott, 
I agree with you. However, we need a clean patch to solve the problem; we 
cannot just comment things out of the code.
So I vote for the issue and I vote to reopen it.

> Once Nutch logs a SEVERE log item, Nutch fails forevermore
> --
>
>  Key: NUTCH-258
>  URL: http://issues.apache.org/jira/browse/NUTCH-258
>  Project: Nutch
> Type: Bug

>   Components: fetcher
> Versions: 0.8-dev
>  Environment: All
> Reporter: Scott Ganyo
> Priority: Critical
>  Attachments: dumbfix.patch
>
> Once a SEVERE log item is written, Nutch shuts down any fetching forevermore. 
>  This is from the run() method in Fetcher.java:
> public void run() {
>   synchronized (Fetcher.this) {activeThreads++;} // count threads
>   
>   try {
> UTF8 key = new UTF8();
> CrawlDatum datum = new CrawlDatum();
> 
> while (true) {
>   if (LogFormatter.hasLoggedSevere()) // something bad happened
> break;// exit
>   
> Notice the last 2 lines.  This will prevent Nutch from ever Fetching again 
> once this is hit as LogFormatter is storing this data as a static.
> (Also note that "LogFormatter.hasLoggedSevere()" is also checked in 
> org.apache.nutch.net.URLFilterChecker and will disable this class as well.)
> This must be fixed or Nutch cannot be run as any kind of long-running 
> service.  Furthermore, I believe it is a poor decision to rely on a logging 
> event to determine the state of the application - this could have any number 
> of side-effects that would be extremely difficult to track down.  (As it has 
> already for me.)




[jira] Updated: (NUTCH-298) if a 404 for a robots.txt is returned a NPE is thrown

2006-06-04 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-298?page=all ]

Stefan Groschupf updated NUTCH-298:
---

Summary: if a 404 for a robots.txt is returned a NPE is thrown  (was: if a 
404 for a robots.txt is returned no page is fetched at all from the host)

Sorry, wrong description.

> if a 404 for a robots.txt is returned a NPE is thrown
> -
>
>  Key: NUTCH-298
>  URL: http://issues.apache.org/jira/browse/NUTCH-298
>  Project: Nutch
> Type: Bug

> Reporter: Stefan Groschupf
>  Fix For: 0.8-dev
>  Attachments: fixNpeRobotRuleSet.patch
>
> What happens:
> If no RobotRuleSet is in the cache for a host, we try to fetch the 
> robots.txt.
> If the http response code is not 200 or 403 but for example 404, we do 
> "robotRules = EMPTY_RULES;" (line 402).
> EMPTY_RULES is a RobotRuleSet created with the default constructor.
> tmpEntries and entries are null and are never changed.
> If we now try to fetch a page from that host, EMPTY_RULES is used and we 
> call isAllowed on the RobotRuleSet.
> In this case an NPE is thrown at this line:
>  if (entries == null) {
> entries= new RobotsEntry[tmpEntries.size()];
> Possible solution:
> We can initialize tmpEntries by default and also remove the other null 
> checks and initializations.




[jira] Updated: (NUTCH-298) if a 404 for a robots.txt is returned no page is fetched at all from the host

2006-06-03 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-298?page=all ]

Stefan Groschupf updated NUTCH-298:
---

Attachment: fixNpeRobotRuleSet.patch

Fixes the NPE in RobotRuleSet that happens when we use an empty RuleSet.

> if a 404 for a robots.txt is returned no page is fetched at all from the host
> -
>
>  Key: NUTCH-298
>  URL: http://issues.apache.org/jira/browse/NUTCH-298
>  Project: Nutch
> Type: Bug

> Reporter: Stefan Groschupf
>  Fix For: 0.8-dev
>  Attachments: fixNpeRobotRuleSet.patch
>
> What happens:
> If no RobotRuleSet is in the cache for a host, we try to fetch the 
> robots.txt.
> If the http response code is not 200 or 403 but for example 404, we do 
> "robotRules = EMPTY_RULES;" (line 402).
> EMPTY_RULES is a RobotRuleSet created with the default constructor.
> tmpEntries and entries are null and are never changed.
> If we now try to fetch a page from that host, EMPTY_RULES is used and we 
> call isAllowed on the RobotRuleSet.
> In this case an NPE is thrown at this line:
>  if (entries == null) {
> entries= new RobotsEntry[tmpEntries.size()];
> Possible solution:
> We can initialize tmpEntries by default and also remove the other null 
> checks and initializations.




[jira] Created: (NUTCH-298) if a 404 for a robots.txt is returned no page is fetched at all from the host

2006-06-03 Thread Stefan Groschupf (JIRA)
if a 404 for a robots.txt is returned no page is fetched at all from the host
-

 Key: NUTCH-298
 URL: http://issues.apache.org/jira/browse/NUTCH-298
 Project: Nutch
Type: Bug

Reporter: Stefan Groschupf
 Fix For: 0.8-dev


What happens:

If no RobotRuleSet is in the cache for a host, we try to fetch the 
robots.txt.
If the http response code is not 200 or 403 but for example 404, we do 
"robotRules = EMPTY_RULES;" (line 402).
EMPTY_RULES is a RobotRuleSet created with the default constructor.
tmpEntries and entries are null and are never changed.
If we now try to fetch a page from that host, EMPTY_RULES is used and we call 
isAllowed on the RobotRuleSet.
In this case an NPE is thrown at this line:
 if (entries == null) {
entries= new RobotsEntry[tmpEntries.size()];

Possible solution:
We can initialize tmpEntries by default and also remove the other null checks 
and initializations.
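
A minimal sketch of that solution; the field types follow the snippet above, 
the array construction is illustrative:

    private ArrayList tmpEntries = new ArrayList();  // initialized up front, never null
    private RobotsEntry[] entries = null;

    public boolean isAllowed(String path) {
      if (entries == null) {
        // safe for EMPTY_RULES too: an empty list yields a zero-length array
        entries = (RobotsEntry[]) tmpEntries.toArray(new RobotsEntry[tmpEntries.size()]);
      }
      // ... match path against entries as before ...
    }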





[jira] Created: (NUTCH-297) sandbox svn folder

2006-06-03 Thread Stefan Groschupf (JIRA)
sandbox svn folder
--

 Key: NUTCH-297
 URL: http://issues.apache.org/jira/browse/NUTCH-297
 Project: Nutch
Type: Sub-task

Reporter: Stefan Groschupf
 Assigned to: Doug Cutting 
Priority: Trivial


Having an svn sandbox repository would allow people to work on an image search.
It should be outside of nutch/trunk, maybe nutch/sandbox/imageSearch/trunk?





[jira] Closed: (NUTCH-286) Handling common error-pages as 404

2006-06-02 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-286?page=all ]
 
Stefan Groschupf closed NUTCH-286:
--

Resolution: Won't Fix

I hope everybody agrees with the statement: we cannot detect http response 
codes based on the returned html content.
Pruning the index is a good way to solve the problem.

> Handling common error-pages as 404
> --
>
>  Key: NUTCH-286
>  URL: http://issues.apache.org/jira/browse/NUTCH-286
>  Project: Nutch
> Type: Improvement

> Reporter: Stefan Neufeind

>
> Idea: Some pages from some software-packages/scripts report an "http 200 ok" 
> even though a specific page could not be found. An example I just found is:
> http://www.deteimmobilien.de/unternehmen/nbjmup;Uipnbt/IfsctuAefufjnnpcjmjfo/ef
> That's a typo3 page explaining in its standard layout and wording: "The 
> requested page did not exist or was inaccessible."
> So I had the idea that somebody might create a plugin that could find 
> commonly used formulations for "page does not exist" etc. and turn the page 
> into a 404 before feeding it into the nutch-index, although the server 
> responded with status 200 ok.




[jira] Resolved: (NUTCH-282) Showing too few results on a page (Paging not correct)

2006-06-02 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-282?page=all ]
 
Stefan Groschupf resolved NUTCH-282:


Resolution: Duplicate

Duplicate of NUTCH-288

> Showing too few results on a page (Paging not correct)
> --
>
>  Key: NUTCH-282
>  URL: http://issues.apache.org/jira/browse/NUTCH-282
>  Project: Nutch
> Type: Bug

>   Components: web gui
> Versions: 0.8-dev
> Reporter: Stefan Neufeind

>
> I did a search and got back the value "itemsPerPage" from opensearch. But 
> the output shows "results 1-8" while I have a total of 46 search results.
> The same happens for the web interface.
> Why aren't "enough" results fetched?
> The problem might be somewhere in the area where Nutch should only display 
> a certain number of results per site.




[jira] Commented: (NUTCH-290) parse-pdf: Garbage indexed when text-extraction not allowed

2006-06-02 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12414469 ] 

Stefan Groschupf commented on NUTCH-290:


As far as I understand the code, the next parser is only used if the previous 
parser returns an unsuccessful parsing status. If the parser throws an 
exception, that exception is not caught in the ParseUtil at all.
So to solve this problem the pdf parser should throw an exception rather than 
report an unsuccessful status, shouldn't it?


> parse-pdf: Garbage indexed when text-extraction not allowed
> ---
>
>  Key: NUTCH-290
>  URL: http://issues.apache.org/jira/browse/NUTCH-290
>  Project: Nutch
> Type: Bug

>   Components: indexer
> Versions: 0.8-dev
> Reporter: Stefan Neufeind
>  Attachments: NUTCH-290-canExtractContent.patch
>
> It seems that garbage (or undecoded text?) is indexed when text-extraction 
> for a PDF is not allowed.
> Example-PDF:
> http://www.task-switch.nl/Dutch/articles/Management_en_Architectuur_v3.pdf




[jira] Updated: (NUTCH-274) Empty row in/at end of URL-list results in error

2006-06-02 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-274?page=all ]

Stefan Groschupf updated NUTCH-274:
---

Attachment: ignoreEmpthyLineDuringInjectV1.patch

Ignores empty lines during injecting.
Thanks for spotting this, Stefan!
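
A minimal sketch of the idea inside the Injector's map method (old mapred 
API); the exact signature is an assumption for illustration:

    public void map(WritableComparable key, Writable val,
                    OutputCollector output, Reporter reporter) throws IOException {
      String url = val.toString().trim();
      if (url.length() == 0) {
        return;                // skip empty lines instead of failing later
      }
      // ... normalize, filter and collect the url as before ...
    }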

> Empty row in/at end of URL-list results in error
> 
>
>  Key: NUTCH-274
>  URL: http://issues.apache.org/jira/browse/NUTCH-274
>  Project: Nutch
> Type: Bug

> Versions: 0.8-dev
>  Environment: nightly-2006-05-20
> Reporter: Stefan Neufeind
> Priority: Minor
>  Attachments: ignoreEmpthyLineDuringInjectV1.patch
>
> This is minor - but it's a little unclean :-)
> Reproduce: Have a URL-file with one URL followed by a newline, thus producing 
> an empty line.
> Outcome: Fetcher-threads try to fetch two URLs at the same time. The first 
> one is fine - but the second is empty and therefore fails proper 
> protocol-detection.
> 60521 022639   Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
> 060521 022639   Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
> 060521 022639 found resource parse-plugins.xml at 
> file:/home/mm/nutch-nightly/conf/parse-plugins.xml
> 060521 022639 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
> 060521 022639 fetching http://www.bild.de/
> 060521 022639 fetching 
> 060521 022639 fetch of  failed with: 
> org.apache.nutch.protocol.ProtocolNotFound: java.net.MalformedURLException: 
> no protocol: 
> 060521 022639 http.proxy.host = null
> 060521 022639 http.proxy.port = 8080
> 060521 022639 http.timeout = 1
> 060521 022639 http.content.limit = 65536
> 060521 022639 http.agent = NutchCVS/0.8-dev (Nutch; 
> http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
> 060521 022639 fetcher.server.delay = 1000
> 060521 022639 http.max.delays = 1000
> 060521 022640 ParserFactory:Plugin: org.apache.nutch.parse.text.TextParser 
> mapped to contentType text/xml via parse-plugins.xml, but
>  its plugin.xml file does not claim to support contentType: text/xml
> 060521 022640 ParserFactory:Plugin: org.apache.nutch.parse.html.HtmlParser 
> mapped to contentType text/xml via parse-plugins.xml, but
>  its plugin.xml file does not claim to support contentType: text/xml
> 060521 022640 ParserFactory: Plugin: org.apache.nutch.parse.rss.RSSParser 
> mapped to contentType text/xml via parse-plugins.xml, but 
> not enabled via plugin.includes in nutch-default.xml
> 060521 022640 Using Signature impl: org.apache.nutch.crawl.MD5Signature
> 060521 022640  map 0%  reduce 0%
> 060521 022640 1 pages, 1 errors, 1.0 pages/s, 40 kb/s, 
> 060521 022640 1 pages, 1 errors, 1.0 pages/s, 40 kb/s, 




[jira] Commented: (NUTCH-274) Empty row in/at end of URL-list results in error

2006-06-02 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-274?page=comments#action_12414457 ] 

Stefan Groschupf commented on NUTCH-274:


Should we fix this in Hadoop's TextInputFormat, to ignore empty lines, or in 
the Injector?

> Empty row in/at end of URL-list results in error
> 
>
>  Key: NUTCH-274
>  URL: http://issues.apache.org/jira/browse/NUTCH-274
>  Project: Nutch
> Type: Bug

> Versions: 0.8-dev
>  Environment: nightly-2006-05-20
> Reporter: Stefan Neufeind
> Priority: Minor

>
> This is minor - but it's a little unclean :-)
> Reproduce: Have a URL-file with one URL followed by a newline, thus producing 
> an empty line.
> Outcome: Fetcher-threads try to fetch two URLs at the same time. The first 
> one is fine - but the second is empty and therefore fails proper 
> protocol-detection.
> 60521 022639   Nutch Analysis (org.apache.nutch.analysis.NutchAnalyzer)
> 060521 022639   Nutch Query Filter (org.apache.nutch.searcher.QueryFilter)
> 060521 022639 found resource parse-plugins.xml at 
> file:/home/mm/nutch-nightly/conf/parse-plugins.xml
> 060521 022639 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
> 060521 022639 fetching http://www.bild.de/
> 060521 022639 fetching 
> 060521 022639 fetch of  failed with: 
> org.apache.nutch.protocol.ProtocolNotFound: java.net.MalformedURLException: 
> no protocol: 
> 060521 022639 http.proxy.host = null
> 060521 022639 http.proxy.port = 8080
> 060521 022639 http.timeout = 1
> 060521 022639 http.content.limit = 65536
> 060521 022639 http.agent = NutchCVS/0.8-dev (Nutch; 
> http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
> 060521 022639 fetcher.server.delay = 1000
> 060521 022639 http.max.delays = 1000
> 060521 022640 ParserFactory:Plugin: org.apache.nutch.parse.text.TextParser 
> mapped to contentType text/xml via parse-plugins.xml, but
>  its plugin.xml file does not claim to support contentType: text/xml
> 060521 022640 ParserFactory:Plugin: org.apache.nutch.parse.html.HtmlParser 
> mapped to contentType text/xml via parse-plugins.xml, but
>  its plugin.xml file does not claim to support contentType: text/xml
> 060521 022640 ParserFactory: Plugin: org.apache.nutch.parse.rss.RSSParser 
> mapped to contentType text/xml via parse-plugins.xml, but 
> not enabled via plugin.includes in nutch-default.xml
> 060521 022640 Using Signature impl: org.apache.nutch.crawl.MD5Signature
> 060521 022640  map 0%  reduce 0%
> 060521 022640 1 pages, 1 errors, 1.0 pages/s, 40 kb/s, 
> 060521 022640 1 pages, 1 errors, 1.0 pages/s, 40 kb/s, 




[jira] Commented: (NUTCH-275) Fetcher not parsing XHTML-pages at all

2006-06-02 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-275?page=comments#action_12414456 ] 

Stefan Groschupf commented on NUTCH-275:


Should we switch off mime.type.magic by default? 
Some people were reporting the same problems.
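
A minimal sketch of switching it off per installation in conf/nutch-site.xml; 
the description text is illustrative:

    <property>
      <name>mime.type.magic</name>
      <value>false</value>
      <description>Disable content-based (magic) mime type detection.</description>
    </property>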

> Fetcher not parsing XHTML-pages at all
> --
>
>  Key: NUTCH-275
>  URL: http://issues.apache.org/jira/browse/NUTCH-275
>  Project: Nutch
> Type: Bug

> Versions: 0.8-dev
>  Environment: problem with nightly-2006-05-20; worked fine with same website 
> on 0.7.2
> Reporter: Stefan Neufeind

>
> The server reports the page as "text/html" - so I thought it would be 
> processed as html.
> But something, I guess, evaluated the headers of the document and re-labeled 
> it as "text/xml" (why not text/xhtml?).
> For some reason no plugin is found for indexing text/xml (why does TextParser 
> not feel responsible?).
> Links inside this document are NOT indexed at all, so digging into this 
> website actually stops here.
> Funny thing: for some magical reason the dtd-files referenced in the header 
> seem to be valid links for the fetcher and as such are indexed in the next 
> round (if the urlfilter allows).
> 060521 025018 fetching http://www.secreturl.something/
> 060521 025018 http.proxy.host = null
> 060521 025018 http.proxy.port = 8080
> 060521 025018 http.timeout = 1
> 060521 025018 http.content.limit = 65536
> 060521 025018 http.agent = NutchCVS/0.8-dev (Nutch; 
> http://lucene.apache.org/nutch/bot.html; nutch-agent@lucene.apache.org)
> 060521 025018 fetcher.server.delay = 1000
> 060521 025018 http.max.delays = 1000
> 060521 025018 ParserFactory:Plugin: org.apache.nutch.parse.text.TextParser 
> mapped to contentType text/xml via parse-plugins.xml, but
>  its plugin.xml file does not claim to support contentType: text/xml
> 060521 025018 ParserFactory: Plugin: org.apache.nutch.parse.rss.RSSParser 
> mapped to contentType text/xml via parse-plugins.xml, but 
> not enabled via plugin.includes in nutch-default.xml
> 060521 025019 Using Signature impl: org.apache.nutch.crawl.MD5Signature
> 060521 025019  map 0%  reduce 0%
> 060521 025019 1 pages, 0 errors, 1.0 pages/s, 40 kb/s, 
> 060521 025019 1 pages, 0 errors, 1.0 pages/s, 40 kb/s, 




[jira] Commented: (NUTCH-281) cached.jsp: base-href needs to be outside comments

2006-06-02 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-281?page=comments#action_12414454 ] 

Stefan Groschupf commented on NUTCH-281:


Can you submit a patch file?

> cached.jsp: base-href needs to be outside comments
> --
>
>  Key: NUTCH-281
>  URL: http://issues.apache.org/jira/browse/NUTCH-281
>  Project: Nutch
> Type: Bug

>   Components: web gui
> Reporter: Stefan Neufeind
> Priority: Trivial

>
> see cached.jsp
> the <base href=...> tag
> does not take effect when showing a cached page because of the comments 
> around it




[jira] Commented: (NUTCH-284) NullPointerException during index

2006-06-02 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-284?page=comments#action_12414453 ] 

Stefan Groschupf commented on NUTCH-284:


Please try to discuss such things on the user mailing list first before 
opening an issue. 
Maintaining the issue tracker is very time consuming. But if there is a bug, 
please do continue to open bug reports. :)
Thanks.


> NullPointerException during index
> -
>
>  Key: NUTCH-284
>  URL: http://issues.apache.org/jira/browse/NUTCH-284
>  Project: Nutch
> Type: Bug

>   Components: indexer
> Versions: 0.8-dev
> Reporter: Stefan Neufeind

>
> For quite a while this "reduce > sort" has been going on. Then it fails. 
> What could be wrong with this?
> 060524 212613 reduce > sort
> 060524 212614 reduce > sort
> 060524 212615 reduce > sort
> 060524 212615 found resource common-terms.utf8 at 
> file:/home/mm/nutch-nightly-prod/conf/common-terms.utf8
> 060524 212615 found resource common-terms.utf8 at 
> file:/home/mm/nutch-nightly-prod/conf/common-terms.utf8
> 060524 212619 Optimizing index.
> 060524 212619 job_jlbhhm
> java.lang.NullPointerException
> at 
> org.apache.nutch.indexer.Indexer$OutputFormat$1.write(Indexer.java:111)
> at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:269)
> at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:253)
> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:282)
> at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:114)
> Exception in thread "main" java.io.IOException: Job failed!
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:341)
> at org.apache.nutch.indexer.Indexer.index(Indexer.java:287)
> at org.apache.nutch.indexer.Indexer.main(Indexer.java:304)




[jira] Closed: (NUTCH-284) NullPointerException during index

2006-06-02 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-284?page=all ]
 
Stefan Groschupf closed NUTCH-284:
--

Resolution: Won't Fix

>Yes, I was missing index-basic.

> NullPointerException during index
> -
>
>  Key: NUTCH-284
>  URL: http://issues.apache.org/jira/browse/NUTCH-284
>  Project: Nutch
> Type: Bug

>   Components: indexer
> Versions: 0.8-dev
> Reporter: Stefan Neufeind

>
> For quite a while this "reduce > sort" has been going on. Then it fails. 
> What could be wrong with this?
> 060524 212613 reduce > sort
> 060524 212614 reduce > sort
> 060524 212615 reduce > sort
> 060524 212615 found resource common-terms.utf8 at 
> file:/home/mm/nutch-nightly-prod/conf/common-terms.utf8
> 060524 212615 found resource common-terms.utf8 at 
> file:/home/mm/nutch-nightly-prod/conf/common-terms.utf8
> 060524 212619 Optimizing index.
> 060524 212619 job_jlbhhm
> java.lang.NullPointerException
> at 
> org.apache.nutch.indexer.Indexer$OutputFormat$1.write(Indexer.java:111)
> at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:269)
> at org.apache.nutch.indexer.Indexer.reduce(Indexer.java:253)
> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:282)
> at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:114)
> Exception in thread "main" java.io.IOException: Job failed!
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:341)
> at org.apache.nutch.indexer.Indexer.index(Indexer.java:287)
> at org.apache.nutch.indexer.Indexer.main(Indexer.java:304)




[jira] Closed: (NUTCH-287) Exception when searching with sort

2006-06-02 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-287?page=all ]
 
Stefan Groschupf closed NUTCH-287:
--

Resolution: Won't Fix

http://www.mail-archive.com/nutch-user%40lucene.apache.org/msg04696.html

> Exception when searching with sort
> --
>
>  Key: NUTCH-287
>  URL: http://issues.apache.org/jira/browse/NUTCH-287
>  Project: Nutch
> Type: Bug

>   Components: searcher
> Versions: 0.8-dev
> Reporter: Stefan Neufeind
> Priority: Critical

>
> Running a search with &sort=url works.
> But when using &sort=title I get the following exception.
> 2006-05-25 14:04:25 StandardWrapperValve[jsp]: Servlet.service() for servlet 
> jsp threw exception
> java.lang.RuntimeException: Unknown sort value type!
> at 
> org.apache.nutch.searcher.IndexSearcher.translateHits(IndexSearcher.java:157)
> at 
> org.apache.nutch.searcher.IndexSearcher.search(IndexSearcher.java:95)
> at org.apache.nutch.searcher.NutchBean.search(NutchBean.java:239)
> at org.apache.jsp.search_jsp._jspService(search_jsp.java:257)
> at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:94)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
> at 
> org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:324)
> at 
> org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:292)
> at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:236)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
> at 
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252)
> at 
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
> at 
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:214)
> at 
> org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
> at 
> org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
> at 
> org.apache.catalina.core.StandardContextValve.invokeInternal(StandardContextValve.java:198)
> at 
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:152)
> at 
> org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
> at 
> org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
> at 
> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:137)
> at 
> org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
> at 
> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:118)
> at 
> org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:102)
> at 
> org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
> at 
> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
> at 
> org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
> at 
> org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
> at 
> org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:929)
> at 
> org.apache.coyote.tomcat5.CoyoteAdapter.service(CoyoteAdapter.java:160)
> at 
> org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:799)
> at 
> org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:705)
> at 
> org.apache.tomcat.util.net.TcpWorkerThread.runIt(PoolTcpEndpoint.java:577)
> at 
> org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
> at java.lang.Thread.run(Thread.java:595)
> What is in those lines is:
>   WritableComparable sortValue;   // convert value to writable
>   if (sortField == null) {
> sortValue = new FloatWritable(scoreDocs[i].score);
>   } else {
> Object raw = ((FieldDoc)scoreDocs[i]).fields[0];
> if (raw instanceof Integer) {
>   sortValue = new IntWritable(((Integer)raw).intValue());
> } else if (raw instanceof Float) {
>   sortValue = new FloatWritable(((Float)raw).floatValue());
> } else if (raw instanceof String) {
>   sortValue = new UTF8((String)raw);
> } else {
>   throw new RuntimeException("Unknown sort value type!");
> }
>   }
> So I thought that maybe raw is an instance of something "strange" and tried 
> raw.getClass().getName() or also raw.toString() to track the cause down - but 
> that always resulted in a NullPointerException. So it seems raw is null for 
> some strange reason.
> When I try with "title2" (or somethin

[jira] Commented: (NUTCH-290) parse-pdf: Garbage indexed when text-extraction not allowed

2006-06-02 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-290?page=comments#action_12414448 ] 

Stefan Groschupf commented on NUTCH-290:


If a parser throws an exception (Fetcher, line 261):
 try {
  parse = this.parseUtil.parse(content);
  parseStatus = parse.getData().getStatus();
} catch (Exception e) {
  parseStatus = new ParseStatus(e);
}
if (!parseStatus.isSuccess()) {
  LOG.warning("Error parsing: " + key + ": " + parseStatus);
  parse = parseStatus.getEmptyParse(getConf());
}

then we use the empty parse object, and an empty parse contains just no text; 
see getText:
private static class EmptyParseImpl implements Parse {

private ParseData data = null;

public EmptyParseImpl(ParseStatus status, Configuration conf) {
  data = new ParseData(status, "", new Outlink[0],
   new Metadata(), new Metadata());
  data.setConf(conf);
}

public ParseData getData() {
  return data;
}

public String getText() {
  return "";
}
  }
So the problem should be somewhere else.

> parse-pdf: Garbage indexed when text-extraction not allowed
> ---
>
>  Key: NUTCH-290
>  URL: http://issues.apache.org/jira/browse/NUTCH-290
>  Project: Nutch
> Type: Bug

>   Components: indexer
> Versions: 0.8-dev
> Reporter: Stefan Neufeind
>  Attachments: NUTCH-290-canExtractContent.patch
>
> It seems that garbage (or undecoded text?) is indexed when text-extraction 
> for a PDF is not allowed.
> Example-PDF:
> http://www.task-switch.nl/Dutch/articles/Management_en_Architectuur_v3.pdf




[jira] Commented: (NUTCH-291) OpenSearchServlet should return "date" as well as "lastModified"

2006-06-02 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-291?page=comments#action_12414445 ] 

Stefan Groschupf commented on NUTCH-291:


lastModified will only be indexed if you switch on the index-more plugin.
If you think the way lastModified and date are stored in the index should be 
changed, please submit a patch for MoreIndexingFilter.

> OpenSearchServlet should return "date" as well as "lastModified"
> 
>
>  Key: NUTCH-291
>  URL: http://issues.apache.org/jira/browse/NUTCH-291
>  Project: Nutch
> Type: Improvement

>   Components: web gui
> Versions: 0.8-dev
> Reporter: Stefan Neufeind
>  Attachments: NUTCH-291-unfinished.patch
>
> Currently lastModified is provided by OpenSearchServlet - but only in case 
> the lastModified-date is known.
> Since you can sort by "date" (which is lastModified or, if not present, the 
> fetchdate), it might be useful if OpenSearchServlet could provide "date" as 
> well.




[jira] Commented: (NUTCH-292) OpenSearchServlet: OutOfMemoryError: Java heap space

2006-06-02 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-292?page=comments#action_12414443 ] 

Stefan Groschupf commented on NUTCH-292:


+1, Can someone create a clean patch file?

> OpenSearchServlet: OutOfMemoryError: Java heap space
> 
>
>  Key: NUTCH-292
>  URL: http://issues.apache.org/jira/browse/NUTCH-292
>  Project: Nutch
> Type: Bug

>   Components: web gui
> Versions: 0.8-dev
> Reporter: Stefan Neufeind
> Priority: Critical
>  Attachments: summarizer.diff
>
> java.lang.RuntimeException: java.lang.OutOfMemoryError: Java heap space
>   
> org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:203)
>   org.apache.nutch.searcher.NutchBean.getSummary(NutchBean.java:329)
>   
> org.apache.nutch.searcher.OpenSearchServlet.doGet(OpenSearchServlet.java:155)
>   javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
>   javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
> The URL I use is:
> [...]something[...]/opensearch?query=mysearch&start=0&hitsPerSite=3&hitsPerPage=20&sort=url
> It seems to be a problem specific to the data I'm working with. Moving the 
> start from 0 to 10 or changing the query works fine.
> Or maybe it doesn't have to do with sorting but it's just that I hit one "bad 
> search-result" that has a broken summary?
> !! The problem is repeatable. So if anybody has an idea where to search / 
> what to fix, I can easily try that out !!




[jira] Commented: (NUTCH-288) hitsPerSite-functionality "flawed": problems writing a page-navigation

2006-06-02 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-288?page=comments#action_12414441 ] 

Stefan Groschupf commented on NUTCH-288:


Hi Stefan, 
>Also it does go back page by page until you get to the last result-page
Isn't it possible to calculate the last page instead of using a while loop to 
find it?
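
A minimal sketch of the calculation; totalHits and hitsPerPage are illustrative 
names:

    // ceiling division: index of the last page, counted from 1
    int lastPage = (totalHits + hitsPerPage - 1) / hitsPerPage;
    int startOfLastPage = (lastPage - 1) * hitsPerPage;
    // note: with hitsPerSite dedup the post-dedup total is smaller,
    // so this is only an upper bound on the real last page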


> hitsPerSite-functionality "flawed": problems writing a page-navigation
> --
>
>  Key: NUTCH-288
>  URL: http://issues.apache.org/jira/browse/NUTCH-288
>  Project: Nutch
> Type: Bug

>   Components: web gui
> Versions: 0.8-dev
> Reporter: Stefan Neufeind
>  Attachments: NUTCH-288-OpenSearch-fix.patch
>
> The deduplication-functionality on a per-site-basis (hitsPerSite = 3) leads 
> to problems when trying to offer a page-navigation (e.g. allow the user to 
> jump to page 10). This is because dedup is done after fetching.
> RSS shows a maximum number of 7763 documents (that is without dedup!); I set 
> it to display 10 items per page. My "naive" approach was to estimate I have 
> 7763/10 = 777 pages. But already when moving to page 3 I got no more 
> search results (I guess because of dedup). And when moving to page 10 I got 
> an exception (see below).
> 2006-05-25 16:24:43 StandardWrapperValve[OpenSearch]: Servlet.service() for 
> servlet OpenSearch threw exception
> java.lang.NegativeArraySizeException
> at org.apache.nutch.searcher.Hits.getHits(Hits.java:65)
> at 
> org.apache.nutch.searcher.OpenSearchServlet.doGet(OpenSearchServlet.java:149)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:689)
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
> at 
> org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:252)
> at 
> org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:173)
> at 
> org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:214)
> at 
> org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
> at 
> org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
> at 
> org.apache.catalina.core.StandardContextValve.invokeInternal(StandardContextValve.java:198)
> at 
> org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:152)
> at 
> org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
> at 
> org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
> at 
> org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:137)
> at 
> org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
> at 
> org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:118)
> at 
> org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:102)
> at 
> org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
> at 
> org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
> at 
> org.apache.catalina.core.StandardValveContext.invokeNext(StandardValveContext.java:104)
> at 
> org.apache.catalina.core.StandardPipeline.invoke(StandardPipeline.java:520)
> at 
> org.apache.catalina.core.ContainerBase.invoke(ContainerBase.java:929)
> at 
> org.apache.coyote.tomcat5.CoyoteAdapter.service(CoyoteAdapter.java:160)
> at 
> org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:799)
> at 
> org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.processConnection(Http11Protocol.java:705)
> at 
> org.apache.tomcat.util.net.TcpWorkerThread.runIt(PoolTcpEndpoint.java:577)
> at 
> org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:684)
> at java.lang.Thread.run(Thread.java:595)
> The only workaround I see for the moment: fetch the RSS without 
> deduplication, dedup myself and cache the RSS result to improve performance. 
> But a cleaner solution would IMHO be nice. Is there a performant way of doing 
> deduplication while knowing for sure how many documents are available to 
> view? Surely this would mean deduping all search results first ...

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-286) Handling common error-pages as 404

2006-06-02 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-286?page=comments#action_12414439 ] 

Stefan Groschupf commented on NUTCH-286:


This is difficult to realize, since the http error code is read from the 
response in the fetcher and set into the protocol status; content analysis can 
only be done during parsing. 
Also, such pages normally do not get a high OPIC score and should not be in 
the top search results. 
However, this is a misconfigured http server response, so you may want to open 
a bug in the typo3 issue tracker. 
Should we close this issue?
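Just as an illustration of the idea, a detector over the parsed text could 
look like this (plain java, not the nutch plugin API; the phrase list is made 
up and would have to be configurable):

import java.util.regex.Pattern;

public class SoftErrorDetector {
  // illustrative phrase list for "not found" pages served with http 200
  private static final Pattern NOT_FOUND = Pattern.compile(
      "(requested page (did|does) not exist|page not found|was inaccessible)",
      Pattern.CASE_INSENSITIVE);

  // true if the parsed text looks like an error page despite a 200 status
  public static boolean looksLike404(String parsedText) {
    return NOT_FOUND.matcher(parsedText).find();
  }
}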

> Handling common error-pages as 404
> --
>
>  Key: NUTCH-286
>  URL: http://issues.apache.org/jira/browse/NUTCH-286
>  Project: Nutch
> Type: Improvement

> Reporter: Stefan Neufeind

>
> Idea: Some pages from some software packages/scripts report an "http 200 ok" 
> even though a specific page could not be found. An example I just found is:
> http://www.deteimmobilien.de/unternehmen/nbjmup;Uipnbt/IfsctuAefufjnnpcjmjfo/ef
> That's a typo3 page explaining in its standard layout and wording: "The 
> requested page did not exist or was inaccessible."
> So I had the idea that somebody might create a plugin that could find 
> commonly used formulations for "page does not exist" etc. and turn the page 
> into a 404 before feeding it into the nutch index - although the server 
> responded with status 200 ok.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-282) Showing too few results on a page (Paging not correct)

2006-06-02 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-282?page=comments#action_12414435 ] 

Stefan Groschupf commented on NUTCH-282:


Is that related to the host grouping we discussed? Can we close this bug in 
that case?

> Showing too few results on a page (Paging not correct)
> --
>
>  Key: NUTCH-282
>  URL: http://issues.apache.org/jira/browse/NUTCH-282
>  Project: Nutch
> Type: Bug

>   Components: web gui
> Versions: 0.8-dev
> Reporter: Stefan Neufeind

>
> I did a search and got back the value "itemsPerPage" from opensearch. But 
> the output shows "results 1-8" and I have a total of 46 search results.
> The same happens for the web interface.
> Why aren't "enough" results fetched?
> The problem might be somewhere in the area where Nutch should only display 
> a certain number of results per site.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-289) CrawlDatum should store IP address

2006-06-01 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-289?page=comments#action_12414273 ] 

Stefan Groschupf commented on NUTCH-289:


Andrzej, I'm afraid I was not able to communicate my ideas clearly and we may 
be misunderstanding each other. 
Resolving the ip in ParseOutputFormat would only be necessary for the new 
links discovered in the content. 
Since by default we parse during fetching, we would have the chance to use the 
jvm dns cache, since I guess many new urls point to the same host we fetched a 
particular page from. That means if we do not parse separately we get the best 
jvm cache usage. 
We do not look up the IPs of urls we fetch at this time, since these urls 
already have an ip that was resolved when they were first discovered in a 
parse process. 
The only problem we need to handle is what happens in case the ip of a host 
changes. We can simply look up the ip of a url that throws a protocol error 
and compare the cached and freshly resolved ips.
An alternative approach would be to look up ips during crawldb update, just 
for the new urls.
Sorry, I hope that describes my ideas more clearly. 

My personal point of view is to store the ip in the crawldatum, not in the 
metadata.
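As an aside, the jvm dns cache comes for free when resolving via 
java.net.InetAddress; a minimal sketch (class and method names are mine, not 
from any patch):

import java.net.InetAddress;
import java.net.UnknownHostException;

public class IpResolver {
  // repeated lookups for the same host hit the jvm-level dns cache, which is
  // cheap when many outlinks point back to the host the page came from
  public static String resolve(String host) {
    try {
      return InetAddress.getByName(host).getHostAddress();
    } catch (UnknownHostException e) {
      return null; // caller may retry later or record a protocol error
    }
  }
}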






> CrawlDatum should store IP address
> --
>
>  Key: NUTCH-289
>  URL: http://issues.apache.org/jira/browse/NUTCH-289
>  Project: Nutch
> Type: Bug

>   Components: fetcher
> Versions: 0.8-dev
> Reporter: Doug Cutting

>
> If the CrawlDatum stored the IP address of the host of its URL, then one 
> could:
> - partition fetch lists on the basis of IP address, for better politeness;
> - truncate pages to fetch per IP address, rather than just hostname.  This 
> would be a good way to limit the impact of domain spammers.
> The IP addresses could be resolved when a CrawlDatum is first created for a 
> new outlink, or perhaps during CrawlDB update.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Updated: (NUTCH-293) support for Crawl-delay in Robots.txt

2006-06-01 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-293?page=all ]

Stefan Groschupf updated NUTCH-293:
---

Attachment: crawlDelayv1.patch

A first draft of Crawl-delay support for nutch. The problem I see is that in 
case an ip based delay is configured, it can happen that we use the crawl 
delay of one host for another host running on the same ip.
Feedback is welcome.
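For reference, the de-facto robots.txt syntax the patch targets looks like 
this (the delay is in seconds; the value 5 is just an example):

User-agent: Nutch
Crawl-delay: 5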

> support for Crawl-delay in Robots.txt
> -
>
>  Key: NUTCH-293
>  URL: http://issues.apache.org/jira/browse/NUTCH-293
>  Project: Nutch
> Type: Improvement

>   Components: fetcher
> Versions: 0.8-dev
> Reporter: Stefan Groschupf
> Priority: Critical
>  Attachments: crawlDelayv1.patch
>
> Nutch needs support for Crawl-delay as defined in robots.txt; it is not an 
> official standard but a de-facto standard.
> See:
> http://help.yahoo.com/help/us/ysearch/slurp/slurp-03.html
> Webmasters have started blocking nutch since we do not support it.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Created: (NUTCH-293) support for Crawl-delay in Robots.txt

2006-06-01 Thread Stefan Groschupf (JIRA)
support for Crawl-delay in Robots.txt
-

 Key: NUTCH-293
 URL: http://issues.apache.org/jira/browse/NUTCH-293
 Project: Nutch
Type: Improvement

  Components: fetcher  
Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Critical


Nutch needs support for Crawl-delay as defined in robots.txt; it is not an 
official standard but a de-facto standard.
See:
http://help.yahoo.com/help/us/ysearch/slurp/slurp-03.html
Webmasters have started blocking nutch since we do not support it.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-289) CrawlDatum should store IP address

2006-05-30 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-289?page=comments#action_12413940 ] 

Stefan Groschupf commented on NUTCH-289:


+1
Andrzej, I agree that looking up the ip in ParseOutputFormat, as Doug 
suggested, would be best.
The biggest problem nutch has at the moment is spam. The most frequently seen 
spam method is to set up a dns server that returns the same ip for all 
subdomains and then deliver dynamically generated content. 
The spammers then just randomly generate subdomains within the content. It 
also often happens that they have many urls, but all of them point to the same 
server == ip. 
Buying more ip addresses is possible, but at the moment more expensive than 
buying more domains. 

Limiting the urls by ip is a great approach to prevent the crawler from 
staying in honey pots with tens of thousands of urls pointing to the same ip. 
However, to do so we need to have the ip already at generation time and not 
look it up when fetching. 
We would be able to reuse the ip in the fetcher; we can also try/catch the 
relevant parts in the fetcher and, in case the ip is not available, re-resolve 
it. 
I don't think round-robin dns is a huge problem, since only large sites have 
it, and in such a case each ip is able to handle the requests.
In any case, storing the ip in the crawl-datum and using it for per-ip url 
limits will be a big step forward in the fight against web spam.
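To make the per-ip limit concrete, here is a minimal sketch of the counting 
idea (a hypothetical helper mirroring the existing per-host limit, not taken 
from any patch):

import java.util.HashMap;
import java.util.Map;

public class IpLimiter {
  private final int maxPerIp;
  private final Map<String, Integer> counts = new HashMap<String, Integer>();

  public IpLimiter(int maxPerIp) { this.maxPerIp = maxPerIp; }

  // admit a url for generation unless its ip already reached the limit
  public boolean admit(String ip) {
    Integer seen = counts.get(ip);
    int n = (seen == null) ? 0 : seen.intValue();
    if (n >= maxPerIp) return false;
    counts.put(ip, n + 1);
    return true;
  }
}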

> CrawlDatum should store IP address
> --
>
>  Key: NUTCH-289
>  URL: http://issues.apache.org/jira/browse/NUTCH-289
>  Project: Nutch
> Type: Bug

>   Components: fetcher
> Versions: 0.8-dev
> Reporter: Doug Cutting

>
> If the CrawlDatum stored the IP address of the host of its URL, then one 
> could:
> - partition fetch lists on the basis of IP address, for better politeness;
> - truncate pages to fetch per IP address, rather than just hostname.  This 
> would be a good way to limit the impact of domain spammers.
> The IP addresses could be resolved when a CrawlDatum is first created for a 
> new outlink, or perhaps during CrawlDB update.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-249) black- white list url filtering

2006-04-26 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-249?page=comments#action_12376477 ] 

Stefan Groschupf commented on NUTCH-249:


I mean the class and method naming isn't very good.
Blacklist or blocklist? Whitelist or positive list?
Does this answer the question?

> black- white list url filtering
> ---
>
>  Key: NUTCH-249
>  URL: http://issues.apache.org/jira/browse/NUTCH-249
>  Project: Nutch
> Type: Improvement

>   Components: fetcher
> Versions: 0.8-dev
> Reporter: Stefan Groschupf
> Priority: Trivial
>  Fix For: 0.8-dev
>  Attachments: blackWhiteListV2.patch, blackWhiteListV3.patch
>
> Existing url filter mechanisms need to process each url against each filter 
> pattern. For very large filter sets this may not scale very well.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Updated: (NUTCH-249) black- white list url filtering

2006-04-25 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-249?page=all ]

Stefan Groschupf updated NUTCH-249:
---

Attachment: blackWhiteListV3.patch

A new patch that fixes a bug where too few urls passed the filter.

> black- white list url filtering
> ---
>
>  Key: NUTCH-249
>  URL: http://issues.apache.org/jira/browse/NUTCH-249
>  Project: Nutch
> Type: Improvement

>   Components: fetcher
> Versions: 0.8-dev
> Reporter: Stefan Groschupf
> Priority: Trivial
>  Fix For: 0.8-dev
>  Attachments: blackWhiteListV2.patch, blackWhiteListV3.patch
>
> Existing url filter mechanisms need to process each url against each filter 
> pattern. For very large filter sets this may not scale very well.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Updated: (NUTCH-251) Administration GUI

2006-04-21 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-251?page=all ]

Stefan Groschupf updated NUTCH-251:
---

Attachment: hadoop_nutch_gui_v1.patch
nutch_gui_v1.patch
nutch_gui_plugins_v1.zip

This is an early preview patch of the nutch gui. 
There are known issues; however, it is a starting point from which we can 
continue building a solid administration user interface.


This patch introduces the following functionality:

+ web based administration gui via an embedded web container
+ the gui is fully based on the plugin system, so it is customizable and 
extendable using plugins
+ all plugins can be internationalized
+ introduces the concept of nutch instances, a mechanism to have separately 
configurable nutch deployments using the same code base (e.g. intranet search, 
webpage search)
+ pluggable authentication; currently it comes with a default user - password 
tuple based on the configuration, but for example LDAP integration can easily 
be realized.

The patch comes with the following plugins:
+ admin-listing
++ required by the web ui to show all deployed plugins as tabs on a webpage

+ admin-instance
++ lists all instances and allows creating a new instance

+ admin-configuration
++ configures a nutch instance (the configuration is written as nutch-site.xml 
to disk)

+ admin-inject
++ injects urls into a crawlDb

+ admin-system
++ shows the status of the system

+ admin-job
++ shows the status of jobs

+ admin-crawldb-status
++ shows crawldb entries filtered by status, or shows the status of a given 
url (useful to check whether a page was already fetched)

+ admin-management
++ generate segment
++ fetch segment
++ parse segment (if required)
++ update crawldb
++ invert links
++ index segment
++ delete segment, parse, index etc.

+ admin-scheduling
++ quartz-based cron job management to run a time-driven "generate - fetch - 
updatedb - invertlinks - index" job


Known issues
+ requires hadoop changes
+ locally running jobs cannot be stopped, but distributed jobs can be stopped
+ the index searcher does not use index folders inside segment folders as in 
nutch 0.7, but the gui places the index folder in the segment folder
++ the searcher is unable to find indices
+ "put to search" does not work since the searcher does not support 
dynamically adding index folders
+ the linkdb inverter does not update but overwrites a linkdb - this is a 
general nutch bug but affects the gui as well
+ the nutch gui introduces locking by storing lock files in folders; this 
mechanism is ignored by the nutch command line tools.



It would be great if users could test the gui, report bugs and help to improve 
the patch.
This is a very complex patch and it is difficult to stay in sync with the 
latest changes, so in case we missed something while generating this patch and 
it does not work as expected, please don't blame us but give us some time and 
hints to fix the problems.


Help is welcome on the following tasks:
+ fixing language issues in javadoc, api and bundle files
+ translating bundles into more languages (currently it comes with english and 
german bundles)
+ heavy testing, finding bugs and providing fixes :)
+ writing help texts and documentation

How to:

+ checkout latest nutch sources

+ checkout hadoop sources
+ patch hadoop with the hadoop patch
+ build hadoop jar
+ remove old hadoop jar from nutch/lib
+ place new hadoop jar in nutch/lib


+ uncompress plugin zip file
+ place plugins in nutch/src/plugins (patch not possible since svn does not 
support binary patches)
+ patch nutch with nutch patch
+ start the gui with 'bin/nutch gui' and open http://localhost:50060/general/
+ username and password are "admin" (can be changed in nutch-default.xml)
+ select the "default" instance or create a new instance.



Thanks to everybody who helped to get this implemented and did the first beta 
tests, especially to Marko for hacking all the jsp's!
I suggest adding this patch to a nutch 0.9 branch and adding a gui component 
in jira to go from there.
I really hope I didn't miss anything or upload the wrong files now. :-O

> Administration GUI
> --
>
>  Key: NUTCH-251
>  URL: http://issues.apache.org/jira/browse/NUTCH-251
>  Project: Nutch
> Type: Improvement

> Versions: 0.8-dev
> Reporter: Stefan Groschupf
> Priority: Minor
>  Fix For: 0.8-dev
>  Attachments: hadoop_nutch_gui_v1.patch, nutch_gui_plugins_v1.zip, 
> nutch_gui_v1.patch
>
> Having a web based administration interface would help to make nutch 
> administration and management much more user friendly.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Created: (NUTCH-251) Administration GUI

2006-04-21 Thread Stefan Groschupf (JIRA)
Administration GUI
--

 Key: NUTCH-251
 URL: http://issues.apache.org/jira/browse/NUTCH-251
 Project: Nutch
Type: Improvement

Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Minor
 Fix For: 0.8-dev


Having a web based administration interface would help to make nutch 
administration and management much more user friendly.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Updated: (NUTCH-249) black- white list url filtering

2006-04-17 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-249?page=all ]

Stefan Groschupf updated NUTCH-249:
---

Attachment: blackWhiteListV2.patch

A proof-of-concept of black/white list filtering. I'm looking for beta testers 
and improvement suggestions. (I'm especially looking for terminology 
suggestions.)
Such a filter mechanism can be very useful for vertical search deployments of 
nutch with very large filter sets.

A black/white url pattern database can be created and used to filter urls 
while updating a crawldb, so the crawlDb contains only urls that pass the 
black/white list. In case a url matches a black url prefix it is not written 
to the crawlDb. In case a url matches a white prefix it is written to the 
crawlDb. 
In case a url matches neither a white nor a black prefix it is also not 
written to the crawlDb.

Url filtering happens on a host level, so a url only needs to be checked 
against the patterns for the same host. 
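A minimal sketch of that decision rule (hypothetical names; checking black 
prefixes first is my assumption, not something the patch specifies):

import java.util.List;

public class BWDecision {
  // black match: drop; white match: keep; no match: drop as well
  public static boolean accept(String url, List<String> black,
                               List<String> white) {
    for (String prefix : black)
      if (url.startsWith(prefix)) return false;
    for (String prefix : white)
      if (url.startsWith(prefix)) return true;
    return false;
  }
}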

Usage: 
// inject prefix url patterns (a text file in a folder) that a url should not 
match
bin/nutch org.apache.nutch.crawl.bw.BWInjector bwdb ~/projects/negativeUrls/ 
-black 
// inject prefix url patterns that a url is allowed to match
bin/nutch org.apache.nutch.crawl.bw.BWInjector bwdb ~/projects/positiveUrls/ 
-white 
// update a fetched segment into a database (only urls that pass the black/ 
white filter will be added to the db)
bin/nutch org.apache.nutch.crawl.bw.BWUpdateDb testCrawlDb bwdb 
segments/20060416181635/ 

Known Issues:
Hadoop does not allow different formats within one job, so some format 
conversion overhead is required that currently slows down the processing. 

Any comments are welcome!

> black- white list url filtering
> ---
>
>  Key: NUTCH-249
>  URL: http://issues.apache.org/jira/browse/NUTCH-249
>  Project: Nutch
> Type: Improvement

>   Components: fetcher
> Versions: 0.8-dev
> Reporter: Stefan Groschupf
> Priority: Trivial
>  Fix For: 0.8-dev
>  Attachments: blackWhiteListV2.patch
>
> Existing url filter mechanisms need to process each url against each filter 
> pattern. For very large filter sets this may not scale very well.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Created: (NUTCH-249) black- white list url filtering

2006-04-17 Thread Stefan Groschupf (JIRA)
black- white list url filtering
---

 Key: NUTCH-249
 URL: http://issues.apache.org/jira/browse/NUTCH-249
 Project: Nutch
Type: Improvement

  Components: fetcher  
Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Trivial
 Fix For: 0.8-dev


Existing url filter mechanisms need to process each url against each filter 
pattern. For very large filter sets this may not scale very well.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Updated: (NUTCH-246) segment size is never as big as topN or crawlDB size in a distributed deployment

2006-04-13 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-246?page=all ]

Stefan Groschupf updated NUTCH-246:
---

Attachment: injectWithCurTimeMapper.patch

setFetchTime moved to Mapper.

> segment size is never as big as topN or crawlDB size in a distributed 
> deployment
> -
>
>  Key: NUTCH-246
>  URL: http://issues.apache.org/jira/browse/NUTCH-246
>  Project: Nutch
> Type: Bug

> Versions: 0.8-dev
> Reporter: Stefan Groschupf
> Priority: Minor
>  Fix For: 0.8-dev
>  Attachments: injectWithCurTime.patch, injectWithCurTimeMapper.patch
>
> I didn't reopen NUTCH-136 since this may be related to the hadoop split.
> I tested this on two different deployments (with 10 tasktrackers + 1 
> jobtracker, and with 9 tasktrackers + 1 jobtracker).
> Defining the map and reduce task numbers in a mapred-default.xml (in 
> nutch/conf on all boxes) does not solve the problem.
> We verified that it is neither a problem of maximum urls per host nor a 
> problem of the url filter.
> It looks like the first job of the Generator (Selector) already gets too few 
> entries to process. 
> Maybe this is somehow related to split generation or configuration inside 
> the distributed jobtracker, since it runs in a different jvm than the 
> jobclient.
> However, we were not able to find the source of this problem.
> I think this should be fixed before publishing nutch 0.8. 

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Updated: (NUTCH-246) segment size is never as big as topN or crawlDB size in a distributed deployment

2006-04-12 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-246?page=all ]

Stefan Groschupf updated NUTCH-246:
---

Attachment: injectWithCurTime.patch

Maybe something like this?

> segment size is never as big as topN or crawlDB size in a distributed 
> deployment
> -
>
>  Key: NUTCH-246
>  URL: http://issues.apache.org/jira/browse/NUTCH-246
>  Project: Nutch
> Type: Bug

> Versions: 0.8-dev
> Reporter: Stefan Groschupf
> Priority: Minor
>  Fix For: 0.8-dev
>  Attachments: injectWithCurTime.patch
>
> I didn't reopen NUTCH-136 since this may be related to the hadoop split.
> I tested this on two different deployments (with 10 tasktrackers + 1 
> jobtracker, and with 9 tasktrackers + 1 jobtracker).
> Defining the map and reduce task numbers in a mapred-default.xml (in 
> nutch/conf on all boxes) does not solve the problem.
> We verified that it is neither a problem of maximum urls per host nor a 
> problem of the url filter.
> It looks like the first job of the Generator (Selector) already gets too few 
> entries to process. 
> Maybe this is somehow related to split generation or configuration inside 
> the distributed jobtracker, since it runs in a different jvm than the 
> jobclient.
> However, we were not able to find the source of this problem.
> I think this should be fixed before publishing nutch 0.8. 

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Created: (NUTCH-247) robot parser to restrict.

2006-04-11 Thread Stefan Groschupf (JIRA)
robot parser to restrict.
-

 Key: NUTCH-247
 URL: http://issues.apache.org/jira/browse/NUTCH-247
 Project: Nutch
Type: Bug

  Components: fetcher  
Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Minor
 Fix For: 0.8-dev


If the agent name and the robots agents are not properly configured, the robot 
rules parser uses LOG.severe to log the problem, even though it also resolves 
it. 
Later on, the fetcher thread checks for severe errors and stops if there is 
one.


RobotRulesParser:

if (agents.size() == 0) {
  agents.add(agentName);
  LOG.severe("No agents listed in 'http.robots.agents' property!");
} else if (!((String) agents.get(0)).equalsIgnoreCase(agentName)) {
  agents.add(0, agentName);
  LOG.severe("Agent we advertise (" + agentName
             + ") not listed first in 'http.robots.agents' property!");
}

Fetcher.FetcherThread:

if (LogFormatter.hasLoggedSevere()) // something bad happened
  break;

I suggest using warn or something similar instead of severe to log this 
problem.


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Created: (NUTCH-246) segment size is never as big as topN or crawlDB size in a distributed deployment

2006-04-11 Thread Stefan Groschupf (JIRA)
segment size is never as big as topN or crawlDB size in a distributed 
deployment
-

 Key: NUTCH-246
 URL: http://issues.apache.org/jira/browse/NUTCH-246
 Project: Nutch
Type: Bug

Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Blocker
 Fix For: 0.8-dev


I didn't reopen NUTCH-136 since this may be related to the hadoop split.
I tested this on two different deployments (with 10 tasktrackers + 1 
jobtracker, and with 9 tasktrackers + 1 jobtracker).
Defining the map and reduce task numbers in a mapred-default.xml (in 
nutch/conf on all boxes) does not solve the problem.
We verified that it is neither a problem of maximum urls per host nor a 
problem of the url filter.

It looks like the first job of the Generator (Selector) already gets too few 
entries to process. 
Maybe this is somehow related to split generation or configuration inside the 
distributed jobtracker, since it runs in a different jvm than the jobclient.
However, we were not able to find the source of this problem.

I think this should be fixed before publishing nutch 0.8. 




-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-171) Bring back multiple segment support for Generate / Update

2006-03-30 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-171?page=comments#action_12372600 ] 

Stefan Groschupf commented on NUTCH-171:


Doug, I agree that a 200M segment should work and would be the best way to go. 
But just for your information, we noticed that larger segments crash during 
reducing more often than smaller ones. Maybe this is already solved by one of 
the many hadoop patches of the last two weeks. 
In any case, I see some need (as already discussed) to extend the automatic 
error management.


> Bring back multiple segment support for Generate / Update
> -
>
>  Key: NUTCH-171
>  URL: http://issues.apache.org/jira/browse/NUTCH-171
>  Project: Nutch
> Type: Improvement
> Versions: 0.8-dev
> Reporter: Rod Taylor
> Priority: Minor
>  Attachments: multi_segment.patch
>
> We find it convenient to be able to run generate once for -topN 300M and have 
> multiple independent segments to work with (lower overhead) -- then run 
> update on all segments which succeeded simultaneously.
> This reactivates -numFetchers and fixes updatedb to handle multiple provided 
> segments again.
> Radu Mateescu wrote the attached patch for us with the below description 
> (lightly edited):
> The implementation of -numFetchers in 0.8 improperly plays with the number of 
> reduce tasks in order to generate a given number of fetch lists. Basically, 
> what it does is this: before the second reduce (map-reduce is applied twice 
> for generate), it sets the number of reduce tasks to numFetchers and ideally, 
> because each reduce will create a file like part-0, part-1, etc in 
> the ndfs, we'll end up with the number of desired fetched lists. But this 
> behaviour is incorrect for the following reasons:
> 1. the number of reduce tasks is orthogonal to the number of segments 
> somebody wants to create. The number of reduce tasks should be chosen based 
> on the physical topology rather than the number of segments someone might 
> want in ndfs
> 2. if in nutch-site.xml you specify a value for mapred.reduce.tasks property, 
> the numFetchers seems to be ignored
>  
> Therefore , I changed this behaviour to work like this: 
>  - generate will create numFetchers segments
>  - each reduce task will write in all segments (assuming there are enough 
> values to be written) in a round-robin fashion
> The end results for 3 reduce tasks and 2 segments will look like this :
>  
> /opt/nutch/bin>./nutch ndfs -ls segments
> 060111 17 parsing file:/opt/nutch/conf/nutch-default.xml
> 060111 18 parsing file:/opt/nutch/conf/nutch-site.xml
> 060111 18 Client connection to 192.168.0.1:5466: starting
> 060111 18 No FS indicated, using default:master:5466
> Found 2 items
> /user/root/segments/2006022144-0
> /user/root/segments/2006022144-1
>  
> /opt/nutch/bin>./nutch ndfs -ls segments/2006022144-0/crawl_generate
> 060111 122317 parsing file:/opt/nutch/conf/nutch-default.xml
> 060111 122317 parsing file:/opt/nutch/conf/nutch-site.xml
> 060111 122318 No FS indicated, using default:master:5466
> 060111 122318 Client connection to 192.168.0.1:5466: starting
> Found 3 items
> /user/root/segments/2006022144-0/crawl_generate/part-0  1276
> /user/root/segments/2006022144-0/crawl_generate/part-1  1289
> /user/root/segments/2006022144-0/crawl_generate/part-2  1858
>  
> /opt/nutch/bin>./nutch ndfs -ls segments/2006022144-1/crawl_generate
> 060111 122333 parsing file:/opt/nutch/conf/nutch-default.xml
> 060111 122334 parsing file:/opt/nutch/conf/nutch-site.xml
> 060111 122334 Client connection to 192.168.0.1:5466: starting
> 060111 122334 No FS indicated, using default:master:5466
> Found 3 items
> /user/root/segments/2006022144-1/crawl_generate/part-0  1207
> /user/root/segments/2006022144-1/crawl_generate/part-1  1236
> /user/root/segments/2006022144-1/crawl_generate/part-2  1841

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-233) wrong regular expression hangs reduce process forever

2006-03-16 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-233?page=comments#action_12370686 ] 

Stefan Groschupf commented on NUTCH-233:


Sorry, I don't have such a url, since it happens while reducing a fetch. 
Reducing provides no logging, and the map data is deleted if the job fails 
because of a timeout. :(


> wrong regular expression hangs reduce process forever
> -
>
>  Key: NUTCH-233
>  URL: http://issues.apache.org/jira/browse/NUTCH-233
>  Project: Nutch
> Type: Bug
> Versions: 0.8-dev
> Reporter: Stefan Groschupf
> Priority: Blocker
>  Fix For: 0.8-dev

>
> It looks like the expression ".*(/.+?)/.*?\1/.*?\1/" in regex-urlfilter.txt 
> is not compatible with java.util.regex, which is now used in the regex url 
> filter. 
> Maybe it was missed when the regular expression package was changed.
> The problem was that while reducing a fetch's map output the reducer hangs 
> forever, since the outputformat applied the urlfilter to a url that causes 
> the hang.
> 060315 230823 task_r_3n4zga at 
> java.lang.Character.codePointAt(Character.java:2335)
> 060315 230823 task_r_3n4zga at 
> java.util.regex.Pattern$Dot.match(Pattern.java:4092)
> 060315 230823 task_r_3n4zga at 
> java.util.regex.Pattern$Curly.match1(Pattern.java:
> I changed the regular expression to ".*(/[^/]+)/[^/]+\1/[^/]+\1/" and now 
> the fetch job works. (Thanks to Grant and Chris B. for helping to find the 
> new regex.)
> However, maybe people can review it and suggest improvements. The old regex 
> would match "abcd/foo/bar/foo/bar/foo/", and so does the new one. But the 
> old regex would also match "abcd/foo/bar/xyz/foo/bar/foo/", which the new 
> regex will not match.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Created: (NUTCH-233) wrong regular expression hangs reduce process forever

2006-03-15 Thread Stefan Groschupf (JIRA)
wrong regular expression hangs reduce process forever 
--

 Key: NUTCH-233
 URL: http://issues.apache.org/jira/browse/NUTCH-233
 Project: Nutch
Type: Bug
Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Blocker
 Fix For: 0.8-dev


It looks like the expression ".*(/.+?)/.*?\1/.*?\1/" in regex-urlfilter.txt is 
not compatible with java.util.regex, which is now used in the regex url 
filter. 
Maybe it was missed when the regular expression package was changed.
The problem was that while reducing a fetch's map output the reducer hangs 
forever, since the outputformat applied the urlfilter to a url that causes the 
hang.
060315 230823 task_r_3n4zga at 
java.lang.Character.codePointAt(Character.java:2335)
060315 230823 task_r_3n4zga at 
java.util.regex.Pattern$Dot.match(Pattern.java:4092)
060315 230823 task_r_3n4zga at 
java.util.regex.Pattern$Curly.match1(Pattern.java:

I changed the regular expression to ".*(/[^/]+)/[^/]+\1/[^/]+\1/" and now the 
fetch job works. (Thanks to Grant and Chris B. for helping to find the new 
regex.)
However, maybe people can review it and suggest improvements. The old regex 
would match "abcd/foo/bar/foo/bar/foo/", and so does the new one. But the old 
regex would also match "abcd/foo/bar/xyz/foo/bar/foo/", which the new regex 
will not match.


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Updated: (NUTCH-229) improved handling of plugin folder configuration

2006-03-12 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-229?page=all ]

Stefan Groschupf updated NUTCH-229:
---

Attachment: pluginFolder.patch

A patch to allow using relative paths that are not in the classpath.

> improved handling of plugin folder configuration
> 
>
>  Key: NUTCH-229
>  URL: http://issues.apache.org/jira/browse/NUTCH-229
>  Project: Nutch
> Type: Improvement
> Reporter: Stefan Groschupf
> Priority: Critical
>  Fix For: 0.8-dev
>  Attachments: pluginFolder.patch
>
> Currently nutch only supports absolute paths or relative paths that are part 
> of the classpath. 
> There are cases where it would be useful to be able to use relative paths 
> that are not in the classpath, for example a centralized plugin repository 
> on a shared hdd in a cluster, or running nutch inside an ide, etc.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Created: (NUTCH-229) improved handling of plugin folder configuration

2006-03-12 Thread Stefan Groschupf (JIRA)
improved handling of plugin folder configuration


 Key: NUTCH-229
 URL: http://issues.apache.org/jira/browse/NUTCH-229
 Project: Nutch
Type: Improvement
Reporter: Stefan Groschupf
Priority: Critical
 Fix For: 0.8-dev


Currently nutch only supports absolute paths or relative paths that are part 
of the classpath. 
There are cases where it would be useful to be able to use relative paths that 
are not in the classpath, for example a centralized plugin repository on a 
shared hdd in a cluster, or running nutch inside an ide, etc.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Updated: (NUTCH-226) CrawlDb Filter tool

2006-03-08 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-226?page=all ]

Stefan Groschupf updated NUTCH-226:
---

Attachment: crawlDbFilter.patch

Patch with a tool to filter an existing crawlDb. In any case, back up your 
crawlDb first.

> CrawlDb Filter tool
> ---
>
>  Key: NUTCH-226
>  URL: http://issues.apache.org/jira/browse/NUTCH-226
>  Project: Nutch
> Type: Improvement
> Reporter: Stefan Groschupf
> Priority: Minor
>  Attachments: crawlDbFilter.patch
>
> A tool to filter an existing crawlDb

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Created: (NUTCH-226) CrawlDb Filter tool

2006-03-08 Thread Stefan Groschupf (JIRA)
CrawlDb Filter tool
---

 Key: NUTCH-226
 URL: http://issues.apache.org/jira/browse/NUTCH-226
 Project: Nutch
Type: Improvement
Reporter: Stefan Groschupf
Priority: Minor


A tool to filter an existing crawlDb

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Closed: (NUTCH-222) Exception in thread "main" java.lang.NoClassDefFoundError: invertlink

2006-03-04 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-222?page=all ]
 
Stefan Groschupf closed NUTCH-222:
--

Resolution: Fixed

Hi, 
I guess it is a typo; try "invertlinks". In case the nutch script does not 
know the command, as with "invertlink" in your case, it tries to execute a 
class of that name.
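For reference, with the plural command name and the paths from the report, the 
call would read (assuming the version in use actually provides the invertlinks 
command):

bin/nutch invertlinks taxcrawl/db/ -dir taxcrawl/segments/*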

> Exception in thread "main" java.lang.NoClassDefFoundError: invertlink
> -
>
>  Key: NUTCH-222
>  URL: http://issues.apache.org/jira/browse/NUTCH-222
>  Project: Nutch
> Type: Bug
>   Components: fetcher
> Versions: 0.7.1
>  Environment: Windows, Cygwin, etc.
> Reporter: Richard Braman

>
> When trying to invertlinks before indexing, following the tutorial, I get the 
> following error.
> [EMAIL PROTECTED] /cygdrive/t/nutch-0.7.1
> $ bin/nutch invertlink taxcrawl/db/ -dir taxcrawl/segments/*
> run java in C:\Program Files\Java\jdk1.5.0_04
> Exception in thread "main" java.lang.NoClassDefFoundError: invertlink

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-204) multiple field values in HitDetails

2006-02-27 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-204?page=comments#action_12368038 ] 

Stefan Groschupf commented on NUTCH-204:


Yes that is a good idea. Thanks for getting this into the sources.
Cheers, 
Stefan

> multiple field values in HitDetails
> ---
>
>  Key: NUTCH-204
>  URL: http://issues.apache.org/jira/browse/NUTCH-204
>  Project: Nutch
> Type: Improvement
>   Components: searcher
> Versions: 0.8-dev
> Reporter: Stefan Groschupf
>  Fix For: 0.8-dev
>  Attachments: DetailGetValues070206.patch, NUTCH-204.jc.060227.patch
>
> Improvement as Howie Wang suggested.
> http://mail-archives.apache.org/mod_mbox/lucene-nutch-dev/200601.mbox/[EMAIL 
> PROTECTED]

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-204) multiple field values in HitDetails

2006-02-27 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-204?page=comments#action_12367991 ] 

Stefan Groschupf commented on NUTCH-204:


Jérôme,
After taking another look at the HitDetails object after some time, I noticed 
I had completely overlooked that all values are already there as key:value 
tuples in the HitDetails object. 
The problem is rather that public String getValue(String field) just returns 
the first field matching the field name. Accessing all values is already 
possible using getLength, getField and getValue.
Isn't it?

From my point of view we should keep things as lightweight as possible and 
maybe just add one method getValues to the HitDetails object that could look 
like this:
public String[] getValues(String field) {
  ArrayList arrayList = new ArrayList();
  for (int i = 0; i < length; i++) {
    if (fields[i].equals(field))
      arrayList.add(values[i]);  // collect every value stored under this field
  }
  if (arrayList.size() > 0) {
    return (String[]) arrayList.toArray(new String[arrayList.size()]);
  }
  return null;
}
So I think introducing a new Property object that needs to be instantiated and 
serialized every time is just more overhead we should not introduce. 
HitDetails influences the search performance, and with one more object 
instantiated per HitDetails we would slow this down by invoking the gc twice 
as often as before.
Would you agree to just adding a getValues method to the HitDetails object?
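For illustration, a call site would then read (the field name is 
hypothetical):

String[] anchors = details.getValues("anchor"); // all values, or null if absent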



> multiple field values in HitDetails
> ---
>
>  Key: NUTCH-204
>  URL: http://issues.apache.org/jira/browse/NUTCH-204
>  Project: Nutch
> Type: Improvement
>   Components: searcher
> Versions: 0.8-dev
> Reporter: Stefan Groschupf
>  Fix For: 0.8-dev
>  Attachments: DetailGetValues070206.patch, NUTCH-204.jc.060227.patch
>
> Improvement as Howie Wang suggested.
> http://mail-archives.apache.org/mod_mbox/lucene-nutch-dev/200601.mbox/[EMAIL 
> PROTECTED]

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-204) multiple field values in HitDetails

2006-02-23 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-204?page=comments#action_12367552 ] 

Stefan Groschupf commented on NUTCH-204:


Makes sense, I see; thanks for the clarification.

> multiple field values in HitDetails
> ---
>
>  Key: NUTCH-204
>  URL: http://issues.apache.org/jira/browse/NUTCH-204
>  Project: Nutch
> Type: Improvement
>   Components: searcher
> Versions: 0.8-dev
> Reporter: Stefan Groschupf
>  Fix For: 0.8-dev
>  Attachments: DetailGetValues070206.patch
>
> Improvement as Howie Wang suggested.
> http://mail-archives.apache.org/mod_mbox/lucene-nutch-dev/200601.mbox/[EMAIL 
> PROTECTED]

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-204) multiple field values in HitDetails

2006-02-23 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-204?page=comments#action_12367539 ] 

Stefan Groschupf commented on NUTCH-204:


Wouldn't you end up with something very similar to what we have now, one key 
with multiple values per key?
The Lucene Document provides a getValues, so I do not see any change to the 
lucene API concepts you mentioned in your first post.
http://lucene.apache.org/java/docs/api/org/apache/lucene/document/Document.html#getValues(java.lang.String)
Sorry, I still do not understand your improvement suggestion; can you give 
some more details?

> multiple field values in HitDetails
> ---
>
>  Key: NUTCH-204
>  URL: http://issues.apache.org/jira/browse/NUTCH-204
>  Project: Nutch
> Type: Improvement
>   Components: searcher
> Versions: 0.8-dev
> Reporter: Stefan Groschupf
>  Fix For: 0.8-dev
>  Attachments: DetailGetValues070206.patch
>
> Improvement as Howie Wang suggested.
> http://mail-archives.apache.org/mod_mbox/lucene-nutch-dev/200601.mbox/[EMAIL 
> PROTECTED]

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-204) multiple field values in HitDetails

2006-02-23 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-204?page=comments#action_12367520 ] 

Stefan Groschupf commented on NUTCH-204:


>There is something I don't understand with this patch. The way Lucene manages 
>multi-valued fields is to have many mono-valued Field objects with the same 
>name. My question is: why not keep this logic? 

Sure, that would be possible. My idea was that we don't need all these 
identical keys; they just eat bytes we do not really need to transfer over the 
network. 
HitDetails is a Writable, and with multiple search servers distributed over a 
network it makes sense to minimize the network io, since getting details 
should be as fast as possible. 
Would you agree? However, I agree there are other ways to realize this; if you 
see room for improvements, feel free. In any case I would really love to see 
this feature in the sources. 

> multiple field values in HitDetails
> ---
>
>  Key: NUTCH-204
>  URL: http://issues.apache.org/jira/browse/NUTCH-204
>  Project: Nutch
> Type: Improvement
>   Components: searcher
> Versions: 0.8-dev
> Reporter: Stefan Groschupf
>  Fix For: 0.8-dev
>  Attachments: DetailGetValues070206.patch
>
> Improvement as Howie Wang suggested.
> http://mail-archives.apache.org/mod_mbox/lucene-nutch-dev/200601.mbox/[EMAIL 
> PROTECTED]

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Updated: (NUTCH-213) checkstyle

2006-02-18 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-213?page=all ]

Stefan Groschupf updated NUTCH-213:
---

Attachment: checkstyle.patch
checkstyle-all-4.1.jar

As part of my 'whitespace' learning lesson I added a checkstyle target to the 
build script. The checkstyle setup currently only checks whitespace, but other 
checks can be added later. This target can be helpful for contributors to 
verify that new code is formatted correctly.
It is a separate target that can be called via 'ant checkstyle'. The result is 
rendered to build/checkstyle/checkstyle_report.html.
The patch file contains the text changes and text documents; the jar needs to 
be copied to the lib folder.

> checkstyle
> --
>
>  Key: NUTCH-213
>  URL: http://issues.apache.org/jira/browse/NUTCH-213
>  Project: Nutch
> Type: Improvement
> Versions: 0.8-dev
> Reporter: Stefan Groschupf
> Priority: Minor
>  Attachments: checkstyle-all-4.1.jar, checkstyle.patch
>
> Adding a checkstyle target to the ant build file to help contributors verify 
> whitespace problems.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Created: (NUTCH-213) checkstyle

2006-02-18 Thread Stefan Groschupf (JIRA)
checkstyle
--

 Key: NUTCH-213
 URL: http://issues.apache.org/jira/browse/NUTCH-213
 Project: Nutch
Type: Improvement
Versions: 0.8-dev
Reporter: Stefan Groschupf
Priority: Minor


Adding a checkstyle target to the ant build file to help contributors verify 
whitespace problems.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Updated: (NUTCH-211) FetchedSegments leave readers open

2006-02-16 Thread Stefan Groschupf (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-211?page=all ]

Stefan Groschupf updated NUTCH-211:
---

Attachment: closeable160206.patch

Now also closing linkdb reader and file system, thanks to Raghavendra.
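For context, a rough sketch of what such a shutdown path could look like (the 
field names are assumptions about the FetchedSegments internals, not the 
actual patch):

// sketch only: close every reader a segment holds, then shared resources
public void close() throws IOException {
  if (content != null)   content.close();    // MapFile.Reader for content
  if (parseText != null) parseText.close();  // MapFile.Reader for parse_text
  if (linkdb != null)    linkdb.close();     // linkdb reader
  if (fs != null)        fs.close();         // file system handle
}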

> FetchedSegments leave readers open
> --
>
>  Key: NUTCH-211
>  URL: http://issues.apache.org/jira/browse/NUTCH-211
>  Project: Nutch
> Type: Bug
> Versions: 0.8-dev
> Reporter: Stefan Groschupf
> Assignee: Stefan Groschupf
> Priority: Critical
>  Fix For: 0.8-dev
>  Attachments: closeFetchSegments.patch, closeable160206.patch
>
> I have a case here where the NutchBean is instantiated more than once. I do 
> cache the nutch bean, but in some situations the bean needs to be 
> re-created. The problem is that FetchedSegments leaves open all readers it 
> uses, so an nio exception is thrown as soon as I try to create the NutchBean 
> again. 
> I would suggest adding a close method to FetchedSegments and all involved 
> objects to be able to cleanly shut down the NutchBean.
> Any comments? Would a patch be welcome?
> Caused by: java.nio.channels.ClosedChannelException
> at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:89)
> at sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:272)
> at 
> org.apache.nutch.fs.LocalFileSystem$LocalNFSFileInputStream.seek(LocalFileSystem.java:83)
> at 
> org.apache.nutch.fs.NFSDataInputStream$Checker.seek(NFSDataInputStream.java:66)
> at 
> org.apache.nutch.fs.NFSDataInputStream$PositionCache.seek(NFSDataInputStream.java:162)
> at 
> org.apache.nutch.fs.NFSDataInputStream$Buffer.seek(NFSDataInputStream.java:191)
> at org.apache.nutch.fs.NFSDataInputStream.seek(NFSDataInputStream.java:241)
> at org.apache.nutch.io.SequenceFile$Reader.seek(SequenceFile.java:403)
> at org.apache.nutch.io.MapFile$Reader.seek(MapFile.java:329)
> at org.apache.nutch.io.MapFile$Reader.get(MapFile.java:374)
> at 
> org.apache.nutch.mapred.MapFileOutputFormat.getEntry(MapFileOutputFormat.java:76)
> at 
> org.apache.nutch.searcher.FetchedSegments$Segment.getEntry(FetchedSegments.java:93)
> at 
> org.apache.nutch.searcher.FetchedSegments$Segment.getParseText(FetchedSegments.java:84)
> at 
> org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:147)
> at org.apache.nutch.searcher.NutchBean.getSummary(NutchBean.java:321)

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



[jira] Commented: (NUTCH-211) FetchedSegments leave readers open

2006-02-16 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-211?page=comments#action_12366645 ] 

Stefan Groschupf commented on NUTCH-211:


Raghavendra, I'm not sure whether I also close the linkDB reader; maybe I 
missed that. I will check this later today and may come up with an improved 
version if I missed it. Thanks for catching this.

> FetchedSegments leave readers open
> --
>
>  Key: NUTCH-211
>  URL: http://issues.apache.org/jira/browse/NUTCH-211
>  Project: Nutch
> Type: Bug
> Versions: 0.8-dev
> Reporter: Stefan Groschupf
> Assignee: Stefan Groschupf
> Priority: Critical
>  Fix For: 0.8-dev
>  Attachments: closeFetchSegments.patch
>
> I have a case here where the NutchBean is instantiated more than once. I do 
> cache the nutch bean, but in some situations the bean needs to be 
> re-created. The problem is that FetchedSegments leaves open all readers it 
> uses, so an nio exception is thrown as soon as I try to create the NutchBean 
> again. 
> I would suggest adding a close method to FetchedSegments and all involved 
> objects to be able to cleanly shut down the NutchBean.
> Any comments? Would a patch be welcome?
> Caused by: java.nio.channels.ClosedChannelException
> at sun.nio.ch.FileChannelImpl.ensureOpen(FileChannelImpl.java:89)
> at sun.nio.ch.FileChannelImpl.position(FileChannelImpl.java:272)
> at 
> org.apache.nutch.fs.LocalFileSystem$LocalNFSFileInputStream.seek(LocalFileSystem.java:83)
> at 
> org.apache.nutch.fs.NFSDataInputStream$Checker.seek(NFSDataInputStream.java:66)
> at 
> org.apache.nutch.fs.NFSDataInputStream$PositionCache.seek(NFSDataInputStream.java:162)
> at 
> org.apache.nutch.fs.NFSDataInputStream$Buffer.seek(NFSDataInputStream.java:191)
> at org.apache.nutch.fs.NFSDataInputStream.seek(NFSDataInputStream.java:241)
> at org.apache.nutch.io.SequenceFile$Reader.seek(SequenceFile.java:403)
> at org.apache.nutch.io.MapFile$Reader.seek(MapFile.java:329)
> at org.apache.nutch.io.MapFile$Reader.get(MapFile.java:374)
> at 
> org.apache.nutch.mapred.MapFileOutputFormat.getEntry(MapFileOutputFormat.java:76)
> at 
> org.apache.nutch.searcher.FetchedSegments$Segment.getEntry(FetchedSegments.java:93)
> at 
> org.apache.nutch.searcher.FetchedSegments$Segment.getParseText(FetchedSegments.java:84)
> at 
> org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:147)
> at org.apache.nutch.searcher.NutchBean.getSummary(NutchBean.java:321)

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


