Re: [Fwd: Crawler submits forms?]

2005-12-14 Thread Andrzej Bialecki

Zaheed Haque wrote:


what about the following:

http://issues.apache.org/jira/browse/NUTCH-125
 



On its way ... ;-) I'll add it during this week.

--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-14 Thread Andrzej Bialecki

Doug Cutting wrote:


Andrzej Bialecki wrote:

Ok, I just tested IndexSorter for now. It appears to work correctly, 
at least I get exactly the same results, with the same scores and the 
same explanations, if I run the same queries on the original and on 
the sorted index.



Here's a more complete version, still mostly untested.  This should 
make searches faster.  We'll see how good the results are...


This includes a patch to Lucene to make it easier to write hit 
collectors that collect TopDocs.


I'll test this on a 38M document index tomorrow.



I'll test it soon - one comment, though. Currently you use a subclass of 
RuntimeException to stop the collecting. I think we should come up with 
a better mechanism - throwing exceptions is too costly. Perhaps the 
HitCollector.collect() method should return a boolean to signal whether 
the searcher should continue working.


--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




[jira] Commented: (NUTCH-140) Add alias capability in parse-plugins.xml file that allows mimeType-extensionId mapping

2005-12-14 Thread Stefan Groschupf (JIRA)
[ 
http://issues.apache.org/jira/browse/NUTCH-140?page=comments#action_12360409 ] 

Stefan Groschupf commented on NUTCH-140:


From my point of view this makes things more complicated. Why not just use the 
extension id? Where would the advantage of aliases be?
Maybe the aliases would be more human-readable, but in the end you have to define 
the aliases anyway and still need to look up the extension ids. So I think it is 
just one step more, but maybe I'm missing the advantage. 


 Add alias capability in parse-plugins.xml file that allows 
 mimeType-extensionId mapping
 

  Key: NUTCH-140
  URL: http://issues.apache.org/jira/browse/NUTCH-140
  Project: Nutch
 Type: Improvement
   Components: fetcher
  Environment:  Power Mac OS X 10.4, Dual Processor G5 2.0 Ghz, 1.5 GB RAM, 
 although bug is independent of environment
 Reporter: Chris A. Mattmann
 Assignee: Chris A. Mattmann
 Priority: Minor


  Jerome and I have been talking about an idea to address the current issue 
 raised by Stefan G. about having a mapping of mimeType to a list of pluginIds 
 rather than mimeType to a list of extensionIds in the parse-plugins.xml file. 
 We've come up with the following proposed update that would seemingly fix 
 this problem.
   We propose to have the concept of aliases in the parse-plugins.xml file, 
 defined at the end of the file, something like:
  <parse-plugins>
    <mimeType name="text/html">
      <plugin id="parse-html"/>
    </mimeType>
    ...
    <aliases>
      <alias name="parse-html"
             extension-point="org.apache.nutch.parse.html.HtmlParser"/>
      <alias name="parse-html2" extension-point="my.other.html.Parser"/>
    </aliases>
  </parse-plugins>
 What do you guys think? This approach would be flexible enough to allow the 
 mapping of extensionIds to mimeTypes, but without impacting the current 
 pluginId concept.
 Comments welcome. 

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira



Re: [Fwd: Crawler submits forms?]

2005-12-14 Thread Jérôme Charron
 What do people think if we collect a list of issues and run a voting
 iteration?

+1


vote for issues to fix in 0.7.2

2005-12-14 Thread Stefan Groschupf

The full list of open issues with complete descriptions can be found here:
http://issues.apache.org/jira/secure/IssueNavigator.jspa?view=full&tempMax=30


Please add a +1 under each issue you vote for.
Please keep in mind that this will be more of a maintenance release.

NUTCH-141   jobdetails.jsp doesn't work on webbrowser safari
NUTCH-140   Add alias capability in parse-plugins.xml file that allows mimeType-extensionId mapping
NUTCH-139   Standard metadata property names in the ParseData metadata
NUTCH-138   non-Latin-1 characters cannot be submitted for search
NUTCH-137   footer is not displayed in search result page
NUTCH-136   mapreduce segment generator generates 50% less than expected urls
NUTCH-34    Parsing different content formats
NUTCH-3     multi values of header discarded
NUTCH-134   Summarizer doesn't select the best snippets
NUTCH-132   Add ability to sort on more than one column
NUTCH-131   Non-documented variable: mapred.child.heap.size
NUTCH-98    RobotRulesParser interprets robots.txt incorrectly
NUTCH-129   rtf-parser does not work when opened with wordpad files and saved
NUTCH-120   one bad link on a page kills parsing
NUTCH-128   second configuration nodes overwrites first node
NUTCH-127   incorrect values using -du, or ls does not return items
NUTCH-126   Fetching via https does not work with a proxy (patch)
NUTCH-125   OpenOffice Parser plugin
NUTCH-110   OpenSearchServlet outputs illegal xml characters
NUTCH-36    Chinese in Nutch
NUTCH-123   Cache.jsp sometimes generates NullPointerException
NUTCH-39    pagination in search result
NUTCH-49    Flag for generate to fetch only new pages to complement the -refetchonly flag
NUTCH-94    MapFile.Writer throwing 'File exists error'.
NUTCH-117   Crawl crashes with java.io.IOException: already exists: C:\nutch\crawl.intranet\oct18\db\webdb.new\pagesByURL
NUTCH-122   block numbers need a better random number generator
NUTCH-82    Nutch Commands should run on Windows without external tools
NUTCH-121   SegmentReader for mapred
NUTCH-119   Regexp to extract outlinks incorrect
NUTCH-118   FAQ link points to invalid URL
NUTCH-115   jobtracker.jsp shows too much information
NUTCH-103   Vivisimo like treeview and url redirect
NUTCH-108   tasktracker crashes when reconnecting to a new jobtracker.
NUTCH-113   Disable permanent DNS-to-IP caching for JVM 1.4
NUTCH-111   ndfs.replication is not documented within the nutch-default.xml configuration file.
NUTCH-100   New plugin urlfilter-db
NUTCH-101   RobotRulesParser
NUTCH-96    MapFile.Writer throws directory exists exception if run multiple times in the same JVM or server JVM.
NUTCH-106   Datanode corruption
NUTCH-105   Network error during robots.txt fetch causes file to be ignored
NUTCH-104   Nutch query parser does not support CJK bi-gram segmentation.
NUTCH-102   jobtracker does not start when webapps is in src
NUTCH-95    DeleteDuplicates depends on the order of input segments
NUTCH-92    DistributedSearch incorrectly scores results
NUTCH-87    Efficient site-specific crawling for a large number of sites
NUTCH-91    empty encoding causes exception
NUTCH-90    reduce logging output of IndexSegment
NUTCH-52    Parser plugin for MS Excel files
NUTCH-86    LanguageIdentifier API enhancements
NUTCH-84    Fetcher for constrained crawls
NUTCH-74    French Analyzer Plugin
NUTCH-83    Release deliverable as zip
NUTCH-81    Webapp only works when deployed in root
NUTCH-79    Fault tolerant searching.
NUTCH-64    no results after a restart of a search-server (without tomcat restart)
NUTCH-76    NDFS DataNode advertises localhost as its address
NUTCH-75    Patch for WebDBReader to get more detailed information about WebDBs
NUTCH-73    A page for CSV results
NUTCH-72    Query basic filter with correction feature
NUTCH-70    duplicate pages - virtual hosts in db.
NUTCH-68    A tool to generate arbitrary fetchlists
NUTCH-62    Add html META tag information into metaData in index-more plugin
NUTCH-61    Adaptive re-fetch interval. Detecting unmodified content
NUTCH-55    Create dmoz.org search plugin - incorporate the dmoz.org title/category/description if available
NUTCH-59    meta data support in webdb
NUTCH-25    needs 'character encoding' detector
NUTCH-44    too many search results
NUTCH-42    enhance search.jsp such that it can also return XML
NUTCH-50    Benchmarks & Performance goals
NUTCH-13    If dns points to 127.0.0.1, the url is also crawled
NUTCH-48    Did you mean query enhancement/refinement feature request
NUTCH-47    Configure host filter to do wildcard prefixes - *.redhat.com
NUTCH-45    Log corrupt segments in SegmentMergeTool

Re: vote for issues to fix in 0.7.2

2005-12-14 Thread Matthias Jaekle
NUTCH-134   Summarizer doesn't select the best snippets

+1


NUTCH-98    RobotRulesParser interprets robots.txt incorrectly

+1

NUTCH-120   one bad link on a page kills parsing

+1


NUTCH-95    DeleteDuplicates depends on the order of input segments

+1


NUTCH-13    If dns points to 127.0.0.1, the url is also crawled

+1

NUTCH-45    Log corrupt segments in SegmentMergeTool

+1


Matthias


Re: vote for issues to fix in 0.7.2

2005-12-14 Thread Stefan Groschupf

My personal fav. list
In a day or so I will count all votes and post them.


NUTCH-141   jobdetails.jsp doesn't work on webbrowser safari

+1
NUTCH-140   Add alias capability in parse-plugins.xml file that allows mimeType-extensionId mapping

NUTCH-139   Standard metadata property names in the ParseData metadata

+1

NUTCH-138   non-Latin-1 characters cannot be submitted for search

+1

NUTCH-137   footer is not displayed in search result page
NUTCH-136   mapreduce segment generator generates 50% less than expected urls

NUTCH-34    Parsing different content formats
NUTCH-3     multi values of header discarded

+1

NUTCH-134   Summarizer doesn't select the best snippets
NUTCH-132   Add ability to sort on more than one column
NUTCH-131   Non-documented variable: mapred.child.heap.size
NUTCH-98    RobotRulesParser interprets robots.txt incorrectly
NUTCH-129   rtf-parser does not work when opened with wordpad files and saved

NUTCH-120   one bad link on a page kills parsing

+1

NUTCH-128   second configuration nodes overwrites first node
NUTCH-127   incorrect values using -du, or ls does not return items
NUTCH-126   Fetching via https does not work with a proxy (patch)

+1

NUTCH-125   OpenOffice Parser plugin

+1

NUTCH-110   OpenSearchServlet outputs illegal xml characters

+1

NUTCH-36    Chinese in Nutch
NUTCH-123   Cache.jsp sometimes generates NullPointerException

+1 (may already be fixed)

NUTCH-39    pagination in search result
NUTCH-49    Flag for generate to fetch only new pages to complement the -refetchonly flag

NUTCH-94    MapFile.Writer throwing 'File exists error'.
NUTCH-117   Crawl crashes with java.io.IOException: already exists: C:\nutch\crawl.intranet\oct18\db\webdb.new\pagesByURL

NUTCH-122   block numbers need a better random number generator
NUTCH-82    Nutch Commands should run on Windows without external tools
NUTCH-121   SegmentReader for mapred
NUTCH-119   Regexp to extract outlinks incorrect

+1

NUTCH-118   FAQ link points to invalid URL
NUTCH-115   jobtracker.jsp shows too much information
NUTCH-103   Vivisimo like treeview and url redirect
NUTCH-108   tasktracker crashes when reconnecting to a new jobtracker.
NUTCH-113   Disable permanent DNS-to-IP caching for JVM 1.4
NUTCH-111   ndfs.replication is not documented within the nutch-default.xml configuration file.

NUTCH-100   New plugin urlfilter-db

+1


NUTCH-101   RobotRulesParser
NUTCH-96    MapFile.Writer throws directory exists exception if run multiple times in the same JVM or server JVM.

NUTCH-106   Datanode corruption
NUTCH-105   Network error during robots.txt fetch causes file to be ignored
NUTCH-104   Nutch query parser does not support CJK bi-gram segmentation.

NUTCH-102   jobtracker does not start when webapps is in src
NUTCH-95    DeleteDuplicates depends on the order of input segments
NUTCH-92    DistributedSearch incorrectly scores results
NUTCH-87    Efficient site-specific crawling for a large number of sites
NUTCH-91    empty encoding causes exception

+1


NUTCH-90    reduce logging output of IndexSegment
NUTCH-52    Parser plugin for MS Excel files
NUTCH-86    LanguageIdentifier API enhancements
NUTCH-84    Fetcher for constrained crawls
NUTCH-74    French Analyzer Plugin

+1


NUTCH-83    Release deliverable as zip
NUTCH-81    Webapp only works when deployed in root
NUTCH-79    Fault tolerant searching.
NUTCH-64    no results after a restart of a search-server (without tomcat restart)

NUTCH-76    NDFS DataNode advertises localhost as its address
NUTCH-75    Patch for WebDBReader to get more detailed information about WebDBs

NUTCH-73    A page for CSV results
NUTCH-72    Query basic filter with correction feature
NUTCH-70    duplicate pages - virtual hosts in db.
NUTCH-68    A tool to generate arbitrary fetchlists

+1
NUTCH-62    Add html META tag information into metaData in index-more plugin

++1!

NUTCH-61    Adaptive re-fetch interval. Detecting unmodified content

++1! but is it ready to use?
NUTCH-55    Create dmoz.org search plugin - incorporate the dmoz.org title/category/description if available

NUTCH-59    meta data support in webdb
NUTCH-25    needs 'character encoding' detector
NUTCH-44    too many search results
NUTCH-42    enhance search.jsp such that it can also return XML
NUTCH-50    Benchmarks & Performance goals
NUTCH-13    If dns points to 127.0.0.1, the url is also crawled
NUTCH-48    Did you mean query enhancement/refinement feature request

+1

NUTCH-47    Configure host filter to do wildcard prefixes - *.redhat.com
NUTCH-45    Log corrupt segments in SegmentMergeTool

Re: vote for issues to fix in 0.7.2

2005-12-14 Thread Marko Bauhardt

NUTCH-141   jobdetails.jsp doesn't work on webbrowser safari

+1
:-)

Marko.


translation of Nutch search page

2005-12-14 Thread hind
Hi,
I would like to translate the Nutch index page into Arabic. I have translated
the five files concerned: header, about, search, help and
search_lang.properties. But I didn't find any documents explaining how to make
the translation take effect. Do you have an idea how to make it possible to
search in an Arabic Nutch environment?

Best Regards

Hind OUKERRADI




Re: vote for issues to fix in 0.7.2

2005-12-14 Thread Andrew McNabb
 NUTCH-127 incorrect values using -du, or ls does not return items
NUTCH-127 +1

 NUTCH-121 SegmentReader for mapred
NUTCH-121 +1

 NUTCH-115 jobtracker.jsp shows too much information   
NUTCH-115 +1

 NUTCH-108 tasktracker crashes when reconnecting to a new jobtracker.
NUTCH-108 +1

 NUTCH-111 ndfs.replication is not documented within the nutch-default.xml configuration file.
NUTCH-111 +1


-- 
Andrew McNabb
http://www.mcnabbs.org/andrew/
PGP Fingerprint: 8A17 B57C 6879 1863 DE55  8012 AB4D 6098 8826 6868




Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-14 Thread Doug Cutting

Andrzej Bialecki wrote:
I'll test it soon - one comment, though. Currently you use a subclass of 
RuntimeException to stop the collecting. I think we should come up with 
a better mechanism - throwing exceptions is too costly.


I thought about this, but I could not see a simple way to achieve it. 
And one exception thrown per query is not very expensive.  But it is bad 
style.  Sigh.


Perhaps the 
HitCollector.collect() method should return a boolean to signal whether 
the searcher should continue working.


We don't really want a HitCollector in this case: we want a TopDocs.  So 
the patch I made is required: we need to extend the HitCollector that 
implements TopDocs-based searching.


Long-term, to avoid the 'throw', we'd need to also:

1. Change:
 TopDocs Searchable.search(Query, Filter, int numHits)
   to:
 TopDocs Searchable.search(Query, Filter, int numHits, int maxTotalHits)

2. Add, for back-compatibility:
 TopDocs Searcher.search(Query, Filter, int numHits) {
   return search(query, filter, numHits, Integer.MAX_VALUE);
 }

3. Add a new method:
 /** Return false to stop hit processing. */
 boolean HitCollector.processHit(int doc, float score) {
   collect(doc, score);   // for back-compatibility
   return true;
 }
   Then change all calls to HitCollector.collect to instead call this,
   and deprecate HitCollector.collect.

I think that would do it.  But is it worth it?
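
Sketched out, steps 1-3 would let a bounded collector look roughly like the
following (a hypothetical illustration of the proposal above; processHit and
the maxTotalHits overload are proposed, not existing Lucene API):

 import org.apache.lucene.search.HitCollector;

 // Counts hits and asks the searcher to stop after maxTotalHits,
 // replacing the RuntimeException-based early exit.
 public class BoundedHitCollector extends HitCollector {
   private final int maxTotalHits;
   private int totalHits = 0;

   public BoundedHitCollector(int maxTotalHits) {
     this.maxTotalHits = maxTotalHits;
   }

   // Existing API; kept so old callers still work.
   public void collect(int doc, float score) {
     processHit(doc, score);
   }

   // Proposed API: return false to stop hit processing.
   public boolean processHit(int doc, float score) {
     totalHits++;
     // ... insert (doc, score) into a TopDocs priority queue here ...
     return totalHits < maxTotalHits;
   }
 }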

In the past I've frequently wanted to be able to extend TopDocs-based 
searching, so I think the Lucene patch I've constructed so far is 
generally useful.


Doug


mapreduce fetcher doesn't fetch all urls

2005-12-14 Thread Florent Gluck
When doing a one-pass crawl, I noticed that when I inject more than
~16000 urls, the fetcher only fetches a subset of the set initially
injected.
I use 1 master and 3 slaves with the following properties:
mapred.map.tasks = 30
mapred.reduce.tasks = 6
generate.max.per.host = -1
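
(For concreteness, these overrides would normally live in conf/nutch-site.xml,
roughly as follows, with the values as stated above:

 <property>
   <name>mapred.map.tasks</name>
   <value>30</value>
 </property>
 <property>
   <name>mapred.reduce.tasks</name>
   <value>6</value>
 </property>
 <property>
   <name>generate.max.per.host</name>
   <!-- -1 disables the per-host cap on generated URLs -->
   <value>-1</value>
 </property>
)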

I tried to inject different amount of urls to see around what threshold
I start to see some missing ones.  Here are the results of my tests so far:

#urls
15000 and below: 100% fetched
16000: 15998 fetched (~100%)
25000: 21379 fetched (86%)
50000: 26565 fetched (53%)
100000: 22088 fetched (22%)

After having seen bug NUTCH-136 "mapreduce segment generator generates
50% less than expected urls", I thought it might fix my problem.  I only
applied the 2nd change mentioned in the description (the change in
Generator.java, line 48) since I didn't know how to set the partitioner to
use a normal hashPartitioner.  The fix didn't make any difference.

Then I started debugging the generator to see if all the urls were
generated.  I confirmed they were all generated (did a check w/ 50k), so
the problem lies further down the pipeline.  I assume it's somewhere in
the fetcher, but I'm not sure where yet.  I'm gonna keep investigating.

Has anyone encountered a similar issue?
I read messages of people crawling millions of pages and I wonder why I
seem to be the only one to have this issue.  I'm apparently unable to
fetch more than ~30k pages even though I inject 1 million urls.

Any help would be greatly appreciated.

Thanks,
--Flo


Re: IndexOptimizer (Re: Lucene performance bottlenecks)

2005-12-14 Thread Andrzej Bialecki

Doug Cutting wrote:


Andrzej Bialecki wrote:

Ok, I just tested IndexSorter for now. It appears to work correctly, 
at least I get exactly the same results, with the same scores and the 
same explanations, if I run the same queries on the original and on 
the sorted index.



Here's a more complete version, still mostly untested.  This should 
make searches faster.  We'll see how good the results are...


This includes a patch to Lucene to make it easier to write hit 
collectors that collect TopDocs.


I'll test this on a 38M document index tomorrow.



I tested it on a 5 mln index.

The original index is considered the baseline, i.e. it represents 
normative values for scoring and ranking. These results are compared to 
results from the optimized index, and scores and positions of hits are 
also recorded. Finally, these two lists are matched, and relative 
differences in scoring and ranking are calculated.


At the end, I calculate the top10%, top50% and top100%, defined as the 
percentage of the top-N hits from the optimized index which match the 
top-N hits from the baseline index. Ideally, all these measures should 
be 100%, i.e. all top-N hits from the optimized index should match 
corresponding top-N hits from the baseline index.
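
As a side note, that match measure is easy to pin down in code. A minimal
sketch (my own illustration, not the actual test harness; identifying hits
by URL is an assumption):

 import java.util.HashSet;
 import java.util.List;
 import java.util.Set;

 /** Percentage of the optimized index's top-N hits that also occur
  *  among the baseline index's top-N hits; 100% means the optimized
  *  top-N reproduces the baseline top-N exactly. */
 public static float topNMatch(List<String> baselineTopN,
                               List<String> optimizedTopN) {
   Set<String> baseline = new HashSet<String>(baselineTopN);
   int matches = 0;
   for (String url : optimizedTopN) {
     if (baseline.contains(url)) matches++;
   }
   return 100f * matches / optimizedTopN.size();
 }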


One variable which greatly affects both the recall and the performance 
is the maximum number of hits considered by the TopDocCollector. In my 
tests I used values from 1,000 up to 500,000 (which represents 1/10th 
of the full index in my case).


Now, the results. I collected all test results in a spreadsheet 
(OpenDocument or PDF format), you can download it from:


   http://www.getopt.org/nutch/20051214/nutchPerf.ods
   http://www.getopt.org/nutch/20051214/nutchPerf.pdf

For MAX_HITS=1000 the performance increase was ca. 40-fold, i.e. 
queries which executed in e.g. 500 ms now executed in 10-20 ms 
(perfRate=40). Following intuition, performance drops as we increase 
MAX_HITS, until it reaches more or less the original values (perfRate=1) 
for MAX_HITS=300,000 (for a 5 mln doc index). After that, increasing 
MAX_HITS actually worsens the performance (perfRate < 1) - which can be 
explained by the fact that the standard HitCollector doesn't collect as 
many documents if they score too low.
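
(Reading these figures, perfRate is evidently the ratio of baseline to 
optimized query time: a query that took 500 ms and now takes 12.5 ms has 
perfRate = 500/12.5 = 40.)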


* Single-term Nutch queries (i.e. those which do not produce Lucene 
PhraseQueries) yield relatively good values of topN, even for relatively 
small values of MAX_HITS - however, MAX_HITS=1000 yields all topN=0%. 
The minimum useful value for my index was MAX_HITS=10,000 (perfRate=30), 
and this yields quite acceptable top10=90%, but less acceptable top50 
and top100. Please see the spreadsheet for details.


* Two-term Nutch queries result in complex Lucene BooleanQueries over 
many index fields, including also PhraseQueries. These fared much worse 
than single-term queries: actually, the topN values were very low until 
MAX_HITS was increased to large values, and then all of a sudden all 
topN-s flipped into the 80-90% ranges.


I also noticed that the values of topN depended strongly on the document 
frequency of terms in the query. For a two-term query, where both terms 
have average document frequency, the topN values start from ~50% for low 
MAX_HITS. For a two-term query where one of the terms has a very high 
document frequency, the topN values start from 0% for low MAX_HITS. See 
the spreadsheet for details.


Conclusions: more work is needed... ;-)

--
Best regards,
Andrzej Bialecki 
___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




Re: mapreduce fetcher doesn't fetch all urls

2005-12-14 Thread Stefan Groschupf

- job.setPartitionerClass(PartitionUrlByHost.class); in the generate
method


Yes, this line is the one you need to change. The other stuff can stay 
as it is for now.
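
For readers following along, the change under discussion amounts to
something like this in the Generator job setup (a sketch only;
PartitionUrlByHost is taken from the quoted line above, and the
HashPartitioner class name is assumed from the mapred library):

 // Old: keeps all URLs of a host in one partition, which is what
 // makes per-host polite fetching possible.
 // job.setPartitionerClass(PartitionUrlByHost.class);

 // Workaround from this thread: plain hash partitioning spreads the
 // generated URLs evenly over reduce tasks (but note Doug's politeness
 // caveat later in this thread).
 job.setPartitionerClass(HashPartitioner.class);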



Do I only need to change the last line to using HashPartitioner.class,
or do I need to modify the other 2 references as well?


Then also apply the case-insensitive content properties patch to
0.8. You may need to change 3 other classes (e.g. the fetcher) since the
patch is for 0.7.


Just apply my patch and try to compile; you will see what you need to 
change. Just some changes of new Properties() to new ContentProperties(), 
and maybe the import of this class.



It's much better than what I have right now.  However, it's still not
100% and fetching all the urls would mean implementing some sort of
iterative process until all the urls are finally fetched.
Do you have an idea why we are still missing 10 to 20%?


Well, since I started with dmoz, those are the urls that do not exist 
anymore but are still listed in dmoz. You also have some general errors 
like unable to parse, host down, etc.
So a 10% error rate is not too bad; later on, when you have some hundred 
million pages, you will see that this error rate drops to less than 5%.


Stefan



Re: mapreduce fetcher doesn't fetch all urls

2005-12-14 Thread Florent Gluck
AWESOME !!  =:)

Stefan Groschupf wrote:

 So, with your patch, did you see 100% of urls *attempting* a fetch?

 100% ! :-)




Re: mapreduce fetcher doesn't fetch all urls

2005-12-14 Thread Doug Cutting

Stefan Groschupf wrote:

- job.setPartitionerClass(PartitionUrlByHost.class); in the generate
method



Yes, this line is the one you need to change. The other stuff can stay as 
it is for now.


I don't recommend this change.  It makes your crawler impolite, since 
multiple tasks may reference each host.  Perhaps you simply need to 
increase http.max.delays?  What is this set to?
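
For reference, overriding that setting goes in conf/nutch-site.xml and
would look roughly like this (the value is only an example):

 <property>
   <name>http.max.delays</name>
   <!-- How many times a fetcher thread will wait for a busy host
        before giving up on a URL; with many URLs of one host in a
        single task, a small value is quickly exhausted. -->
   <value>100</value>
 </property>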


Doug