Re: [Fwd: Crawler submits forms?]
Zaheed Haque wrote: what about the following: http://issues.apache.org/jira/browse/NUTCH-125

On its way ... ;-) I'll add it during this week. -- Best regards, Andrzej Bialecki, Information Retrieval, Semantic Web, Embedded Unix, System Integration, http://www.sigram.com Contact: info at sigram dot com
Re: IndexOptimizer (Re: Lucene performance bottlenecks)
Doug Cutting wrote: Andrzej Bialecki wrote: Ok, I just tested IndexSorter for now. It appears to work correctly; at least I get exactly the same results, with the same scores and the same explanations, if I run the same queries on the original and on the sorted index.

Here's a more complete version, still mostly untested. This should make searches faster. We'll see how good the results are... This includes a patch to Lucene to make it easier to write hit collectors that collect TopDocs. I'll test this on a 38M document index tomorrow.

I'll test it soon - one comment, though. Currently you use a subclass of RuntimeException to stop the collecting. I think we should come up with a better mechanism - throwing exceptions is too costly. Perhaps the HitCollector.collect() method should return a boolean to signal whether the searcher should continue working. -- Best regards, Andrzej Bialecki, Information Retrieval, Semantic Web, Embedded Unix, System Integration, http://www.sigram.com Contact: info at sigram dot com
[jira] Commented: (NUTCH-140) Add alias capability in parse-plugins.xml file that allows mimeType-extensionId mapping
[ http://issues.apache.org/jira/browse/NUTCH-140?page=comments#action_12360409 ] Stefan Groschupf commented on NUTCH-140: From my point of view this makes things more complicated. Why not just use the extension id - where would be the advantage of aliases? Maybe the aliases would be more human-readable, but in the end you have to define the aliases anyway and need to look up the extension ids. So I think it is just one step more, but maybe I'm missing the advantage.

Add alias capability in parse-plugins.xml file that allows mimeType-extensionId mapping
Key: NUTCH-140
URL: http://issues.apache.org/jira/browse/NUTCH-140
Project: Nutch
Type: Improvement
Components: fetcher
Environment: Power Mac OS X 10.4, Dual Processor G5 2.0 GHz, 1.5 GB RAM, although bug is independent of environment
Reporter: Chris A. Mattmann
Assignee: Chris A. Mattmann
Priority: Minor

Jerome and I have been talking about an idea to address the current issue raised by Stefan G. about having a mapping of mimeType-list of pluginIds rather than mimeType-list of extensionIds in the parse-plugins.xml file. We've come up with the following proposed update that would seemingly fix this problem. We propose to have the concept of aliases in the parse-plugins.xml file, defined at the end of the file, something like:

    <parse-plugins>
      <mimeType name="text/html">
        <plugin id="parse-html"/>
      </mimeType>
      ...
      <aliases>
        <alias name="parse-html" extension-point="org.apache.nutch.parse.html.HtmlParser"/>
        <alias name="parse-html2" extension-point="my.other.html.Parser"/>
      </aliases>
    </parse-plugins>

What do you guys think? This approach would be flexible enough to allow the mapping of extensionIds to mimeTypes, but without impacting the current pluginId concept. Comments welcome.
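[Editorial aside: to make the two-level lookup concrete, here is a hypothetical sketch of how alias resolution could work at parse time. The class and method names are illustrative only, not the actual Nutch ParserFactory API, and the table contents are taken from the proposed XML above.]

    // Hypothetical sketch of alias resolution for the proposed
    // parse-plugins.xml format. Names are illustrative, not real Nutch API.
    import java.util.HashMap;
    import java.util.Map;

    public class AliasResolver {
      // alias name -> fully qualified extension point, from <aliases>
      private final Map<String, String> aliases = new HashMap<String, String>();
      // mime type -> alias name, from the <mimeType> entries
      private final Map<String, String> mimeToAlias = new HashMap<String, String>();

      public AliasResolver() {
        // Values taken from the proposed parse-plugins.xml above.
        aliases.put("parse-html", "org.apache.nutch.parse.html.HtmlParser");
        mimeToAlias.put("text/html", "parse-html");
      }

      /** Resolve a content type to the extension point that should parse it. */
      public String extensionPointFor(String mimeType) {
        String alias = mimeToAlias.get(mimeType);
        return alias == null ? null : aliases.get(alias);
      }
    }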
Re: [Fwd: Crawler submits forms?]
What do people think about collecting a list of issues and running a voting iteration? +1
vote for issues to fix in 0.7.2
The full list of open issues with complete descriptions can be found here: http://issues.apache.org/jira/secure/IssueNavigator.jspa?view=full&tempMax=30 Please add a +1 in case you vote for the issue, under this issue. Please keep in mind that this will be more a maintenance release.

NUTCH-141 jobdetails.jsp doesn't work on web browser Safari
NUTCH-140 Add alias capability in parse-plugins.xml file that allows mimeType-extensionId mapping
NUTCH-139 Standard metadata property names in the ParseData metadata
NUTCH-138 non-Latin-1 characters cannot be submitted for search
NUTCH-137 footer is not displayed in search result page
NUTCH-136 mapreduce segment generator generates 50% less than expected urls
NUTCH-34 Parsing different content formats
NUTCH-3 multi values of header discarded
NUTCH-134 Summarizer doesn't select the best snippets
NUTCH-132 Add ability to sort on more than one column
NUTCH-131 Non-documented variable: mapred.child.heap.size
NUTCH-98 RobotRulesParser interprets robots.txt incorrectly
NUTCH-129 rtf-parser does not work when files are opened with WordPad and saved
NUTCH-120 one bad link on a page kills parsing
NUTCH-128 second configuration node overwrites first node
NUTCH-127 incorrect values using -du, or ls does not return items
NUTCH-126 Fetching via https does not work with a proxy (patch)
NUTCH-125 OpenOffice Parser plugin
NUTCH-110 OpenSearchServlet outputs illegal xml characters
NUTCH-36 Chinese in Nutch
NUTCH-123 Cache.jsp sometimes generates NullPointerException
NUTCH-39 pagination in search result
NUTCH-49 Flag for generate to fetch only new pages to complement the -refetchonly flag
NUTCH-94 MapFile.Writer throwing 'File exists error'
NUTCH-117 Crawl crashes with java.io.IOException: already exists: C:\nutch\crawl.intranet\oct18\db\webdb.new\pagesByURL
NUTCH-122 block numbers need a better random number generator
NUTCH-82 Nutch Commands should run on Windows without external tools
NUTCH-121 SegmentReader for mapred
NUTCH-119 Regexp to extract outlinks incorrect
NUTCH-118 FAQ link points to invalid URL
NUTCH-115 jobtracker.jsp shows too much information
NUTCH-103 Vivisimo-like treeview and url redirect
NUTCH-108 tasktracker crashes when reconnecting to a new jobtracker
NUTCH-113 Disable permanent DNS-to-IP caching for JVM 1.4
NUTCH-111 ndfs.replication is not documented within the nutch-default.xml configuration file
NUTCH-100 New plugin urlfilter-db
NUTCH-101 RobotRulesParser
NUTCH-96 MapFile.Writer throws directory exists exception if run multiple times in the same JVM or server JVM
NUTCH-106 Datanode corruption
NUTCH-105 Network error during robots.txt fetch causes file to be ignored
NUTCH-104 Nutch query parser does not support CJK bi-gram segmentation
NUTCH-102 jobtracker does not start when webapps is in src
NUTCH-95 DeleteDuplicates depends on the order of input segments
NUTCH-92 DistributedSearch incorrectly scores results
NUTCH-87 Efficient site-specific crawling for a large number of sites
NUTCH-91 empty encoding causes exception
NUTCH-90 reduce logging output of IndexSegment
NUTCH-52 Parser plugin for MS Excel files
NUTCH-86 LanguageIdentifier API enhancements
NUTCH-84 Fetcher for constrained crawls
NUTCH-74 French Analyzer Plugin
NUTCH-83 Release deliverable as zip
NUTCH-81 Webapp only works when deployed in root
NUTCH-79 Fault tolerant searching
NUTCH-64 no results after a restart of a search server (without tomcat restart)
NUTCH-76 NDFS DataNode advertises localhost as its address
NUTCH-75 Patch for WebDBReader to get more detailed information about WebDBs
NUTCH-73 A page for CSV results
NUTCH-72 Query basic filter with correction feature
NUTCH-70 duplicate pages - virtual hosts in db
NUTCH-68 A tool to generate arbitrary fetchlists
NUTCH-62 Add html META tag information into metaData in index-more plugin
NUTCH-61 Adaptive re-fetch interval. Detecting unmodified content
NUTCH-55 Create dmoz.org search plugin - incorporate the dmoz.org title/category/description if available
NUTCH-59 meta data support in webdb
NUTCH-25 needs 'character encoding' detector
NUTCH-44 too many search results
NUTCH-42 enhance search.jsp such that it can also return XML
NUTCH-50 Benchmarks / Performance goals
NUTCH-13 If dns points to 127.0.0.1, the url is also crawled
NUTCH-48 "Did you mean" query enhancement/refinement feature request
NUTCH-47 Configure host filter to do wildcard prefixes - *.redhat.com
NUTCH-45 Log corrupt segments in SegmentMergeTool
Re: vote for issues to fix in 0.7.2
NUTCH-134 Summarizer doesn't select the best snippets +1
NUTCH-98 RobotRulesParser interprets robots.txt incorrectly +1
NUTCH-120 one bad link on a page kills parsing +1
NUTCH-95 DeleteDuplicates depends on the order of input segments +1
NUTCH-13 If dns points to 127.0.0.1, the url is also crawled +1
NUTCH-45 Log corrupt segments in SegmentMergeTool +1

Matthias
Re: vote for issues to fix in 0.7.2
My personal fav. list. In a day or so I will count all votes and post them.

NUTCH-141 jobdetails.jsp doesn't work on web browser Safari +1
NUTCH-140 Add alias capability in parse-plugins.xml file that allows mimeType-extensionId mapping
NUTCH-139 Standard metadata property names in the ParseData metadata +1
NUTCH-138 non-Latin-1 characters cannot be submitted for search +1
NUTCH-137 footer is not displayed in search result page
NUTCH-136 mapreduce segment generator generates 50% less than expected urls
NUTCH-34 Parsing different content formats
NUTCH-3 multi values of header discarded +1
NUTCH-134 Summarizer doesn't select the best snippets
NUTCH-132 Add ability to sort on more than one column
NUTCH-131 Non-documented variable: mapred.child.heap.size
NUTCH-98 RobotRulesParser interprets robots.txt incorrectly
NUTCH-129 rtf-parser does not work when files are opened with WordPad and saved
NUTCH-120 one bad link on a page kills parsing +1
NUTCH-128 second configuration node overwrites first node
NUTCH-127 incorrect values using -du, or ls does not return items
NUTCH-126 Fetching via https does not work with a proxy (patch) +1
NUTCH-125 OpenOffice Parser plugin +1
NUTCH-110 OpenSearchServlet outputs illegal xml characters +1
NUTCH-36 Chinese in Nutch
NUTCH-123 Cache.jsp sometimes generates NullPointerException +1 (may already be fixed)
NUTCH-39 pagination in search result
NUTCH-49 Flag for generate to fetch only new pages to complement the -refetchonly flag
NUTCH-94 MapFile.Writer throwing 'File exists error'
NUTCH-117 Crawl crashes with java.io.IOException: already exists: C:\nutch\crawl.intranet\oct18\db\webdb.new\pagesByURL
NUTCH-122 block numbers need a better random number generator
NUTCH-82 Nutch Commands should run on Windows without external tools
NUTCH-121 SegmentReader for mapred
NUTCH-119 Regexp to extract outlinks incorrect +1
NUTCH-118 FAQ link points to invalid URL
NUTCH-115 jobtracker.jsp shows too much information
NUTCH-103 Vivisimo-like treeview and url redirect
NUTCH-108 tasktracker crashes when reconnecting to a new jobtracker
NUTCH-113 Disable permanent DNS-to-IP caching for JVM 1.4
NUTCH-111 ndfs.replication is not documented within the nutch-default.xml configuration file
NUTCH-100 New plugin urlfilter-db +1
NUTCH-101 RobotRulesParser
NUTCH-96 MapFile.Writer throws directory exists exception if run multiple times in the same JVM or server JVM
NUTCH-106 Datanode corruption
NUTCH-105 Network error during robots.txt fetch causes file to be ignored
NUTCH-104 Nutch query parser does not support CJK bi-gram segmentation
NUTCH-102 jobtracker does not start when webapps is in src
NUTCH-95 DeleteDuplicates depends on the order of input segments
NUTCH-92 DistributedSearch incorrectly scores results
NUTCH-87 Efficient site-specific crawling for a large number of sites
NUTCH-91 empty encoding causes exception +1
NUTCH-90 reduce logging output of IndexSegment
NUTCH-52 Parser plugin for MS Excel files
NUTCH-86 LanguageIdentifier API enhancements
NUTCH-84 Fetcher for constrained crawls
NUTCH-74 French Analyzer Plugin +1
NUTCH-83 Release deliverable as zip
NUTCH-81 Webapp only works when deployed in root
NUTCH-79 Fault tolerant searching
NUTCH-64 no results after a restart of a search server (without tomcat restart)
NUTCH-76 NDFS DataNode advertises localhost as its address
NUTCH-75 Patch for WebDBReader to get more detailed information about WebDBs
NUTCH-73 A page for CSV results
NUTCH-72 Query basic filter with correction feature
NUTCH-70 duplicate pages - virtual hosts in db
NUTCH-68 A tool to generate arbitrary fetchlists +1
NUTCH-62 Add html META tag information into metaData in index-more plugin ++1!
NUTCH-61 Adaptive re-fetch interval. Detecting unmodified content ++1! (but is it ready to use?)
NUTCH-55 Create dmoz.org search plugin - incorporate the dmoz.org title/category/description if available
NUTCH-59 meta data support in webdb
NUTCH-25 needs 'character encoding' detector
NUTCH-44 too many search results
NUTCH-42 enhance search.jsp such that it can also return XML
NUTCH-50 Benchmarks / Performance goals
NUTCH-13 If dns points to 127.0.0.1, the url is also crawled
NUTCH-48 "Did you mean" query enhancement/refinement feature request +1
NUTCH-47 Configure host filter to do wildcard prefixes - *.redhat.com
NUTCH-45 Log corrupt segments in SegmentMergeTool
Re: vote for issues to fix in 0.7.2
NUTCH-141 jobdetails.jsp doesn't work on web browser Safari +1 :-) Marko.
translation of Nutch search page
Hi, I would like to translate the Nutch index page into Arabic. I have translated the five files concerned: header, about, search, help, and search_lang.properties. But I didn't find any documents explaining how to make the translation effective. Could you give me an idea of how to make it possible to search in an Arabic Nutch environment? Best Regards Hind OUKERRADI
Re: vote for issues to fix in 0.7.2
NUTCH-127 incorrect values using -du, or ls does not return items +1
NUTCH-121 SegmentReader for mapred +1
NUTCH-115 jobtracker.jsp shows too much information +1
NUTCH-108 tasktracker crashes when reconnecting to a new jobtracker +1
NUTCH-111 ndfs.replication is not documented within the nutch-default.xml configuration file +1

-- Andrew McNabb http://www.mcnabbs.org/andrew/ PGP Fingerprint: 8A17 B57C 6879 1863 DE55 8012 AB4D 6098 8826 6868
Re: IndexOptimizer (Re: Lucene performance bottlenecks)
Andrzej Bialecki wrote: I'll test it soon - one comment, though. Currently you use a subclass of RuntimeException to stop the collecting. I think we should come up with a better mechanism - throwing exceptions is too costly.

I thought about this, but I could not see a simple way to achieve it. And one exception thrown per query is not very expensive. But it is bad style. Sigh.

Perhaps the HitCollector.collect() method should return a boolean to signal whether the searcher should continue working.

We don't really want a HitCollector in this case: we want a TopDocs. So the patch I made is required: we need to be able to extend the HitCollector that implements TopDocs-based searching. Long-term, to avoid the 'throw', we'd need to also:

1. Change:

    TopDocs Searchable.search(Query, Filter, int numHits)

to:

    TopDocs Searchable.search(Query, Filter, int numHits, int maxTotalHits)

2. Add, for back-compatibility:

    TopDocs Searcher.search(Query query, Filter filter, int numHits) {
      return search(query, filter, numHits, Integer.MAX_VALUE);
    }

3. Add a new method:

    /** Return false to stop hit processing. */
    boolean HitCollector.processHit(int doc, float score) {
      collect(doc, score);  // for back-compatibility
      return true;
    }

Then change all calls to HitCollector.collect to instead call this, and deprecate HitCollector.collect. I think that would do it. But is it worth it? In the past I've frequently wanted to be able to extend TopDocs-based searching, so I think the Lucene patch I've constructed so far is generally useful. Doug
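[Editorial aside: to contrast the two early-termination styles discussed above, here is a minimal sketch. BoundedCollector and CollectionLimitReached are hypothetical names, not part of the actual patch; only HitCollector.collect() is real Lucene API of this era.]

    // Sketch of the two early-termination styles discussed above.
    import org.apache.lucene.search.HitCollector;

    /** Unchecked exception used to abort collection (the current approach). */
    class CollectionLimitReached extends RuntimeException {}

    /** Stops considering hits after maxTotalHits documents have been seen. */
    class BoundedCollector extends HitCollector {
      private final int maxTotalHits;
      private int seen = 0;

      BoundedCollector(int maxTotalHits) {
        this.maxTotalHits = maxTotalHits;
      }

      /** Current approach: abort collection by throwing (one throw per query). */
      public void collect(int doc, float score) {
        if (seen >= maxTotalHits) {
          throw new CollectionLimitReached();
        }
        seen++;
        // ... push (doc, score) onto a priority queue of the top N hits ...
      }

      /** Proposed alternative: signal the searcher to stop via the return value. */
      public boolean processHit(int doc, float score) {
        if (seen >= maxTotalHits) {
          return false;  // stop hit processing, no exception needed
        }
        seen++;
        // ... push (doc, score) onto a priority queue of the top N hits ...
        return true;
      }
    }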
mapreduce fetcher doesn't fetch all urls
When doing a one-pass crawl, I noticed that when I inject more than ~16000 urls, the fetcher only fetches a subset of the set initially injected. I use 1 master and 3 slaves with the following properties:

mapred.map.tasks = 30
mapred.reduce.tasks = 6
generate.max.per.host = -1

I tried injecting different amounts of urls to see around what threshold I start to see some missing ones. Here are the results of my tests so far:

#urls 15000 and below: 100% fetched
16000: 15998 fetched (~100%)
25000: 21379 fetched (86%)
50000: 26565 fetched (53%)
100000: 22088 fetched (22%)

After having seen bug NUTCH-136 (mapreduce segment generator generates 50% less than expected urls), I thought it might fix my problem. I only applied the 2nd change mentioned in the description (the change in Generator.java, line 48), since I didn't know how to set the partition to use a normal HashPartitioner. The fix didn't make any difference. Then I started debugging the generator to see if all the urls were generated. I confirmed they were all generated (did a check with 50k), so the problem lies further down the pipeline. I assume it's somewhere in the fetcher, but I'm not sure where yet. I'm gonna keep investigating. Has anyone encountered a similar issue? I read messages from people crawling millions of pages and I wonder why it seems I'm the only one to have this issue. I'm apparently unable to fetch more than ~30k pages even though I inject 1 million urls. Any help would be greatly appreciated. Thanks, --Flo
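[Editorial aside: for reference, the poster's settings would look something like this in nutch-site.xml. This is a sketch using the property names and values from the post, not the poster's actual file; the <nutch-conf> root element is assumed for Nutch of this era.]

    <?xml version="1.0"?>
    <!-- Sketch of the configuration described above; values from the post. -->
    <nutch-conf>
      <property>
        <name>mapred.map.tasks</name>
        <value>30</value>
      </property>
      <property>
        <name>mapred.reduce.tasks</name>
        <value>6</value>
      </property>
      <property>
        <name>generate.max.per.host</name>
        <value>-1</value> <!-- -1 = no per-host limit during generate -->
      </property>
    </nutch-conf>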
Re: IndexOptimizer (Re: Lucene performance bottlenecks)
Doug Cutting wrote: Andrzej Bialecki wrote: Ok, I just tested IndexSorter for now. It appears to work correctly; at least I get exactly the same results, with the same scores and the same explanations, if I run the same queries on the original and on the sorted index.

Here's a more complete version, still mostly untested. This should make searches faster. We'll see how good the results are... This includes a patch to Lucene to make it easier to write hit collectors that collect TopDocs. I'll test this on a 38M document index tomorrow.

I tested it on a 5 mln index. The original index is considered the baseline, i.e. it represents normative values for scoring and ranking. These results are compared to results from the optimized index, and scores and positions of hits are also recorded. Finally, these two lists are matched, and relative differences in scoring and ranking are calculated. At the end, I calculate the top10%, top50% and top100%, defined as the percentage of the top-N hits from the optimized index which match the top-N hits from the baseline index. Ideally, all these measures should be 100%, i.e. all top-N hits from the optimized index should match the corresponding top-N hits from the baseline index.

One variable which greatly affects both the recall and the performance is the maximum number of hits considered by the TopDocCollector. In my tests I used values between 1,000 and 500,000 (the latter representing 1/10th of the full index in my case).

Now, the results. I collected all test results in a spreadsheet (OpenDocument or PDF format), you can download it from:

http://www.getopt.org/nutch/20051214/nutchPerf.ods
http://www.getopt.org/nutch/20051214/nutchPerf.pdf

For MAX_HITS=1000 the performance increase was ca. 40-fold, i.e. queries which executed in e.g. 500 ms now executed in 10-20 ms (perfRate=40). Following intuition, performance drops as we increase MAX_HITS, until it reaches more or less the original values (perfRate=1) for MAX_HITS=300,000 (for a 5 mln doc index). After that, increasing MAX_HITS actually worsens the performance (perfRate < 1) - which can be explained by the fact that the standard HitCollector doesn't collect as many documents if they score too low.

* Single-term Nutch queries (i.e. those which do not produce Lucene PhraseQueries) yield relatively good values of topN, even for relatively small values of MAX_HITS - however, MAX_HITS=1000 yields all topN=0%. The minimum useful value for my index was MAX_HITS=10,000 (perfRate=30), and this yields a quite acceptable top10=90%, but less acceptable top50 and top100. Please see the spreadsheet for details.

* Two-term Nutch queries result in complex Lucene BooleanQueries over many index fields, including also PhraseQueries. These fared much worse than single-term queries: the topN values were very low until MAX_HITS was increased to large values, and then all of a sudden all topN-s flipped into the 80-90% ranges. I also noticed that the values of topN depended strongly on the document frequency of the terms in the query. For a two-term query where both terms have average document frequency, the topN values start from ~50% for low MAX_HITS. For a two-term query where one of the terms has a very high document frequency, the topN values start from 0% for low MAX_HITS. See the spreadsheet for details.

Conclusions: more work is needed... ;-) -- Best regards, Andrzej Bialecki, Information Retrieval, Semantic Web, Embedded Unix, System Integration, http://www.sigram.com Contact: info at sigram dot com
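[Editorial aside: for concreteness, a small sketch of the topN overlap measure described above - what fraction of the optimized index's top-N hits also appear in the baseline's top-N. The method name and the list-of-document-ids representation are illustrative, not Andrzej's actual test harness.]

    // Sketch of the topN overlap metric: percentage of the optimized
    // index's top-N hits that also occur in the baseline's top-N.
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    public class TopNOverlap {
      public static double topN(List<String> baselineHits,
                                List<String> optimizedHits, int n) {
        Set<String> baselineTopN = new HashSet<String>(
            baselineHits.subList(0, Math.min(n, baselineHits.size())));
        int considered = Math.min(n, optimizedHits.size());
        if (considered == 0) return 0.0;
        int matched = 0;
        for (String hit : optimizedHits.subList(0, considered)) {
          if (baselineTopN.contains(hit)) matched++;  // hit ranks in both top-Ns
        }
        return 100.0 * matched / considered;
      }
    }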
Re: mapreduce fetcher doesn't fetch all urls
- job.setPartitionerClass(PartitionUrlByHost.class); in the generate method

yes, this line is the one you need to change. The other stuff can stay as it is for now.

Do I only need to change the last line to use HashPartitioner.class, or do I need to modify the other 2 references as well?

Then also apply the case-insensitive content properties patch to 0.8. You may need to change 3 other classes (e.g. the fetcher), since the patch is for 0.7. Just apply my patch and try to compile; you will see what you need to change. It's just some changes from new Properties() to ContentProperties(), and maybe the import of this class.

It's much better than what I have right now. However, it's still not 100%, and fetching all the urls would mean implementing some sort of iterative process until all the urls are finally fetched. Do you have an idea why we are still missing 10 to 20%?

Well, since I started with dmoz, those are urls that no longer exist but are still listed in dmoz. You also have some general errors like unable to parse, host down, etc. So a 10% error rate is not too bad; if you later crawl some hundred million pages you will see that this error rate is less than 5%. Stefan
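[Editorial aside: a sketch of the change being discussed - switching the generate job from the host-based partitioner to a plain hash partitioner. The package paths and the wrapper class are assumptions for Nutch 0.8-era mapred, not the actual Generator.java source; note Doug's caveat about politeness in the next message.]

    // Hypothetical sketch of the Generator change discussed above.
    import org.apache.nutch.mapred.JobConf;
    import org.apache.nutch.mapred.lib.HashPartitioner;

    public class GeneratorPartitionerSketch {
      /** Configure the generate job to spread urls evenly across fetch tasks. */
      public static void configure(JobConf job) {
        // Before (polite: all urls of one host go to the same fetch task):
        // job.setPartitionerClass(PartitionUrlByHost.class);
        // After (urls distributed by hash, regardless of host):
        job.setPartitionerClass(HashPartitioner.class);
      }
    }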
Re: mapreduce fetcher doesn't fetch all urls
AWESOME !! =:)

Stefan Groschupf wrote: So, with your patch, did you see 100% of urls *attempting* a fetch? 100% ! :-)
Re: mapreduce fetcher doesn't fetch all urls
Stefan Groschupf wrote: - job.setPartitionerClass(PartitionUrlByHost.class); in the generate method yes, this line is the one you need to change. The other stuff can stay as it is for now.

I don't recommend this change. It makes your crawler impolite, since multiple tasks may reference each host. Perhaps you simply need to increase http.max.delays? What is this set to? Doug
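[Editorial aside: http.max.delays is a real Nutch property controlling how many times a fetcher thread will wait for a busy host before giving up on a url. A sketch of raising it in nutch-site.xml - the value 100 is illustrative, and the <nutch-conf> root element is assumed for Nutch of this era.]

    <?xml version="1.0"?>
    <!-- Sketch: raising http.max.delays so the fetcher retries busy
         hosts more often instead of dropping their urls. -->
    <nutch-conf>
      <property>
        <name>http.max.delays</name>
        <value>100</value>
      </property>
    </nutch-conf>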