refetching interval
Hi, I am using Nutch 0.7 and found the following code in FetchListTool.java:

    private static final long FETCH_GENERATION_DELAY_MS = 7 * 24 * 60 * 60 * 1000;

Does that mean the next refetch time is always 7 days later, regardless of the fetch interval set in nutch-site.xml? I feel puzzled. Could anyone give me a hint? thanks, Michael,
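A note on where the refetch interval actually lives: as far as I can tell, FETCH_GENERATION_DELAY_MS only keeps a page out of newly generated fetchlists for a week after it has been handed to a fetcher; the refetch schedule itself comes from the db configuration, e.g. (property name from the 0.7 nutch-default.xml; the value here is illustrative):

    <property>
      <name>db.default.fetch.interval</name>
      <value>30</value>
      <description>Default number of days between re-fetches of a page.</description>
    </property>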
compile search.jsp
Hi, I made a change to search.jsp under /nutch/src/web/jsp and hoped the change would show up in the Nutch search page. I ran ant war and replaced ROOT.war in tomcat/webapps, and I also tried shutting down and restarting Tomcat. But the Nutch search page stays the same, and bean.LOG.info keeps printing the same output as before, even though I am writing new information. I wonder if I missed a compile step. thanks for your help, Michael,
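One thing that commonly bites here: Tomcat does not re-explode a WAR if the old exploded directory is still present, and it caches compiled JSPs in its work directory. A redeploy sequence that should work (paths are the usual defaults, and the WAR filename is whatever your ant war build produced):

    $TOMCAT_HOME/bin/shutdown.sh
    rm -rf $TOMCAT_HOME/webapps/ROOT $TOMCAT_HOME/webapps/ROOT.war
    rm -rf $TOMCAT_HOME/work/*                  # clear compiled JSPs
    cp build/nutch-*.war $TOMCAT_HOME/webapps/ROOT.war
    $TOMCAT_HOME/bin/startup.sh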
entry point of the Nutch search page
hi, Which JSP file is the entry point for the Nutch search page? I saw Nutch using search(Query query, int numHits, String dedupField, String sortField, boolean reverse) to get the search result, but I am not sure which JSP triggers this function. Is it in the Tomcat container? thanks, Michael,
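For what it's worth, the flow as I understand the 0.7 webapp starts at search.jsp in the deployed WAR, which obtains a NutchBean from the servlet context and calls its search methods, roughly:

    // sketch of what search.jsp does, not the verbatim JSP code
    NutchBean bean = NutchBean.get(application);   // "application" is the JSP's ServletContext
    Query query = Query.parse(queryString);
    Hits hits = bean.search(query, 20);            // simpler overload; the JSP may use the longer one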
Re: Halloween Joke at Google
hi Byron: Did you run LinkAnalysisTool to update the scores in the fetched segment? I guess that gives the most accurate PageRank score; otherwise, in IndexSegment.java, Nutch does the score calculation based on the number of anchor links to the source page. Michael Ji, --- Byron Miller [EMAIL PROTECTED] wrote: We run with fetchlist.score.by.link.count=true and indexer.boost.by.link.count=true. We haven't run a standalone analysis, so it's however the database is updated when we run updatedb (per the recommendations a few months back, when this was found to give pretty close results!). Even though my scale is still much smaller than Google's, it is amazing how closely the results can match! Makes you wonder just how much of the net is useful ;) -byron --- Andrzej Bialecki [EMAIL PROTECTED] wrote: Byron Miller wrote: Actually, to add fuel to the fire, using Nutch out of the box, searching for "miserable failure" yields the same thing. http://www.mozdex.com/search.jsp?query=miserablefailure I'm curious... could you check if the anchors come from the same site, or from different sites? Do you run with fetchlist.score.by.link.count=true and indexer.boost.by.link.count=true? Anyway, that's how PageRank is _supposed_ to work - it should give a higher score to sites that are highly linked, and it should also strongly consider the anchor text as an indication of the page's true subject... ;-) -- Best regards, Andrzej Bialecki
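For anyone wanting to reproduce Byron's setup, these are the two switches in question, set in nutch-site.xml (values shown as he runs them; the shipped defaults are off):

    <property>
      <name>fetchlist.score.by.link.count</name>
      <value>true</value>
    </property>
    <property>
      <name>indexer.boost.by.link.count</name>
      <value>true</value>
    </property>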
searching returns 0 hits
Somehow, my search engine doesn't show results, even though I can see the index from LukeAll. (It worked fine before.) I replaced the ROOT.war file in Tomcat with Nutch's and launched Tomcat from Nutch's segments directory (parallel to the index subdir). Should I reinstall Tomcat? Or is this a Nutch indexing issue? My system is running on Linux. thanks, Michael Ji,

051019 215411 11 query: com
051019 215411 11 searching for 20 raw hits
051019 215411 11 total hits: 0
051019 215449 12 query request from 65.34.213.205
051019 215449 12 query: net
051019 215449 12 searching for 20 raw hits
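One thing worth checking (a guess, but a common cause of zero hits): the webapp locates the index via the searcher.dir property, which defaults to the working directory the servlet container was started from. Tomcat therefore has to be launched from the directory that contains both index/ and segments/, or the location can be pinned in the webapp's nutch-site.xml:

    <property>
      <name>searcher.dir</name>
      <value>/path/to/your/crawl</value>  <!-- illustrative path -->
    </property>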
Re: searching returns 0 hits
hi Gal: I DO stop Tomcat before I do indexing, and restart Tomcat after. I am not sure what the "first server" you mention here is. The Linux box, in my case? thanks, Michael Ji, --- Gal Nitzan [EMAIL PROTECTED] wrote: Hi Michael, at least on my side, every time I run index I must stop the server, then Tomcat, and then restart first the server, then Tomcat. I have asked about this twice on this list but nobody answered. I'm not sure it is the same issue, but try it. Regards, Gal.
next score usage
hi, I saw several discussions about the Distributed Link Analysis Tool before, and I still have a question about the usage of the nextScore field in the Page data structure. It seems the Distributed Link Analysis Tool updates this field via OutlinkWithTarget (as I understand it, that means the link has a target page). But I didn't see how the nextScore field is used in search results, because in IndexSegment.java only the score field of Page is used to generate the boost value of a document when indexing a segment, and that is what affects the search rank of a document in the Lucene index. thanks, Michael Ji,
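To make the question concrete, the indexing side reduces to something like this (a paraphrase of the idea, not the exact 0.7 IndexSegment code):

    // only Page.getScore() feeds ranking; nextScore is never consulted here
    float boost = page.getScore();
    doc.setBoost(boost);   // Lucene folds the document boost into the hit score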
Document Duplication for Multiple Segment Merge
hi, When Nutch's IndexMerger.java is called, the indexes from multiple segment directories are merged into one target directory. I wonder how Lucene deals with the case where identical documents exist in two segments. Is the older document (lower timestamp) deleted? thanks, Michael Ji,
Re: Document Duplication for Multiple Segment Merge
hi Yonik: Does that mean that when two documents have the same MD5 content in two different segments, IndexMerger.java will keep both of them? When I look at the code of IndexSegment.java, it handles MD5 deduping by keeping the one with the higher document ID. So when refetching happens, the old segment should be discarded entirely, and a strategy must be made such that each segment corresponds to a fetchlist with the same interval time. Is that how Nutch handles the refetching case? Michael Ji, --- Yonik Seeley [EMAIL PROTECTED] wrote: There is no concept in Lucene of document identity linked to any fields of a document. You need to handle removal of duplicates yourself. -Yonik
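On the Nutch side, the dedup step is a separate tool run between indexing and merging; in 0.7 it is DeleteDuplicates, invoked roughly like this (arguments from memory of the 0.7 tutorial - check bin/nutch dedup usage):

    bin/nutch dedup segments dedup.tmp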
Re: Document Duplication for Multiple Segment Merge
Sorry, I guess I pointed at the wrong Java class name. I want to confirm whether SegmentMerger.java in Lucene does dedup or not. I traced down a couple of Java classes from SegmentMerger.java, such as SegmentReader.java and IndexWriter.java, and I didn't see a dedup mechanism yet. thanks, Michael Ji, --- Yonik Seeley [EMAIL PROTECTED] wrote: Sorry, I've only briefly looked at Nutch, so you should ask on that mailing list. Lucene doesn't do deduping. -Yonik
Re: crawl db stats
Use DBAdminTool to dump the webdb; that gets you the whole list of Pages in text format, Michael Ji, --- Stefan Groschupf [EMAIL PROTECTED] wrote: Hi, is there any chance to read the statistics of the Nutch 0.8 crawl db, or a trick to get an idea of how many pages have already been crawled? Thanks for the hints. Stefan
Re: crawl db stats
Or you can use segread in bin/nutch to dump a newly fetched segment and see what pages it fetched, Michael Ji, --- Stefan Groschupf [EMAIL PROTECTED] wrote: Which class do you mean? There is the old webdbadmin tool, but I guess this will not work for the new crawl db. The bin/nutch admin command isn't supported anymore. Thanks, Stefan. Am 15.10.2005 um 00:21 schrieb Michael Ji: Use DBAdminTool to dump the webdb; that gets you the whole list of Pages in text format, Michael Ji,
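For reference, the segment dump I have in mind looks like this (0.7-era syntax, segment name illustrative; I haven't verified it against the 0.8 layout Stefan is asking about):

    bin/nutch segread -dump segments/20051015123456 | less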
Re: How can I unsubscribe from the mailing list?
http://lucene.apache.org/nutch/mailing_lists.html --- [EMAIL PROTECTED] wrote: Does anybody know how I can unsubscribe from this mailing list? Thanks, Nima
RE: what contributes to fetch slowing down
Kelvin's OC implementation queues fetch requests according to the host and uses the HTTP 1.1 protocol. It is currently a Nutch patch. Michael Ji, --- Fuad Efendi [EMAIL PROTECTED] wrote: Some suggestions to improve performance:

1. Decrease the randomization of FetchList. Here is a comment from FetchListTool:

/**
 * The TableSet class will allocate a given FetchListEntry
 * into one of several ArrayFiles. It chooses which
 * ArrayFile based on a hash of the URL's domain name.
 *
 * It uses a hash of the domain name so that pages are
 * allocated to a random ArrayFile, but same-host pages
 * go to the same file (for efficiency purposes during
 * fetch).
 *
 * Further, within a given file, the FetchListEntry items
 * appear in random order. This is so that we don't
 * hammer the same site over and over again during fetch.
 *
 * Each table should receive a roughly
 * even number of entries, but all URLs for a specific
 * domain name will be found in a single table. If
 * the dataset is weirdly skewed toward large domains,
 * there may be an uneven distribution.
 */

Same-host pages go to the same file - but they should go in a sequence, without mixing/randomizing with other hosts' pages. We fetch a single URL, then we forget about the existence of the TCP/IP connection; we even forget that the web server created a client process to handle our HTTP requests - this is what Keep-Alive is for. Creating a TCP connection, and additionally creating such a client process on the web server, costs a lot of CPU on both sides, Nutch and the web server. I suggest using a single keep-alive thread to fetch a single host, without randomization.

2. Use/investigate more of the Socket API, such as public void setSoTimeout(int timeout) and public void setReuseAddress(boolean on). I found this in the J2SE API for setReuseAddress (default: false): "When a TCP connection is closed the connection may remain in a timeout state for a period of time after the connection is closed (typically known as the TIME_WAIT state or 2MSL wait state). For applications using a well known socket address or port it may not be possible to bind a socket to the required SocketAddress if there is a connection in the timeout state involving the socket address or port." It probably means that we reach a huge number (65000!) of waiting TCP ports after Socket.close(), and the fetcher threads are blocked by the OS, waiting for it to release some of these ports... Am I right? P.S. Anyway, using the Keep-Alive option is very important not only for us but also for production web sites. Thanks, Fuad

-Original Message- From: Fuad Efendi Sent: Friday, September 30, 2005 10:58 PM To: nutch-dev@lucene.apache.org Subject: RE: what contributes to fetch slowing down. Dear Nutchers, I noticed the same problem twice, with a Pentium Mobile 2GHz / Windows XP / 2GB box, and with a 2x Opteron 252 / SuSE Linux / 4GB box. I have only one explanation, which should probably be mirrored to JIRA: the network. 1. I never had such a problem with The Grinder, http://grinder.sourceforge.net, which is based on the alternate HTTPClient, http://www.innovation.ch/java/HTTPClient/index.html. Apache SF should really review their HttpClient RC3(!!!) accordingly; HTTPClient (upper-HTTP-case) is not alpha, it is a production version... I used Grinder a lot; it allows executing 32 processes with 64 threads each on 2048MB RAM. 2. I found this in the SUN API: java.net.Socket, public void setReuseAddress(boolean on) - please check the API!!! 3. I saw this code in protocol-http: ... HTTP/1.0 ... Why? Why version 1.0??? It should understand the server's replies, such as Connection: close, Connection: keep-alive, etc. 4. By the way, how many file descriptors does UNIX need in order to maintain 65536 network sockets? Respectfully, Fuad. P.S. Sorry guys, I don't have enough time to participate... Could you please test this suspicious behaviour and very strange opinion? Should I create a new bug report at JIRA? SUN's Socket, Apache's HttpClient, UNIX's networking...

-Original Message- From: Daniele Menozzi Sent: Wednesday, September 28, 2005 4:42 PM To: nutch-dev@lucene.apache.org Subject: Re: what contributes to fetch slowing down. On 10:27:55 28/Sep, AJ Chen wrote: I started the crawler with about 2000 sites. The fetcher could achieve 7 pages/sec initially, but the performance gradually dropped to about 2 pages/sec, sometimes even 0.5 pages/sec. The fetch list had 300k pages and I used 500 threads. What are the main causes of this slowing down? I have the same problem; I've tried with different numbers of fetchers (10,20,50,100
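To make Fuad's two Socket suggestions concrete, a minimal sketch using only standard java.net calls (host and timeout values are illustrative):

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.net.Socket;

    // open one reusable connection per host, suitable for HTTP/1.1 keep-alive
    Socket open(String host) throws IOException {
        Socket s = new Socket();
        s.setReuseAddress(true);   // must be set before connect; eases TIME_WAIT rebinding
        s.connect(new InetSocketAddress(host, 80), 10000);  // connect timeout, ms
        s.setSoTimeout(10000);     // read timeout, ms
        return s;                  // send requests with Connection: keep-alive and reuse s
    }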
possibility of adding a customized data field to the Nutch Page class
hi there, I am trying to add a new data field to the Page class, a simple String, so I followed the URL field in the Page class as a template. But when I run WebDBInjector, it gives me the following error messages. It seems readFields() is not reading at the right position. I wonder whether it is feasible to make a change in the Page class, as I understand the Nutch webdb has an advanced structure and operations. From an OO view, all the Page fields should be accessed through the Page class interface, but I just ran into something weird. thanks, Michael Ji,

Exception in thread "main" java.io.EOFException
    at java.io.DataInputStream.readUnsignedShort(DataInputStream.java:310)
    at org.apache.nutch.io.UTF8.readFields(UTF8.java:101)
    at org.apache.nutch.db.Page.readFields(Page.java:146)
    at org.apache.nutch.io.SequenceFile$Reader.next(SequenceFile.java:278)
    at org.apache.nutch.io.MapFile$Reader.next(MapFile.java:349)
    at org.apache.nutch.db.WebDBWriter$PagesByURLProcessor.mergeEdits(WebDBWriter.java:618)
    at org.apache.nutch.db.WebDBWriter$CloseProcessor.closeDown(WebDBWriter.java:557)
    at org.apache.nutch.db.WebDBWriter.close(WebDBWriter.java:1544)
    at org.apache.nutch.db.WebDBInjector.close(WebDBInjector.java:336)
    at org.apache.nutch.db.WebDBInjector.main(WebDBInjector.java:581)
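The EOFException suggests write() and readFields() went out of sync: anything added to Page must be written and read in exactly the same order on both sides, ideally guarded by a version bump so records written before the change still parse. A schematic sketch of the required symmetry (class and field names hypothetical; mirror the real 0.7 Page.java):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.nutch.io.UTF8;

    // schematic only - shows the write/read symmetry, not actual Nutch code
    public class PageLike {
        private static final byte VERSION = 5;      // bumped from 4 for the new field
        private UTF8 url = new UTF8();
        private UTF8 myNewField = new UTF8();       // hypothetical addition

        public void write(DataOutput out) throws IOException {
            out.writeByte(VERSION);
            url.write(out);
            myNewField.write(out);                  // appended last
        }

        public void readFields(DataInput in) throws IOException {
            byte version = in.readByte();
            url.readFields(in);
            if (version >= 5) myNewField.readFields(in);  // old records lack the field
        }
    }

Note also that the injector may be reading records written by the unmodified code, so an existing webdb created before the change will fail to parse unless the version check is in place.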
Re: Index Infos
There is a book called Lucene in Action that has been published; it covers Lucene indexing, michael ji --- Daniele Menozzi [EMAIL PROTECTED] wrote: Hi all, can you please point me to a document describing the indexing step? I haven't found anything in the wiki, and I do not understand very well what happens... Thank you again!!! Menoz
Re: Problems on Crawling
Take a look at this good Nutch doc: http://wiki.apache.org/nutch/DissectingTheNutchCrawler Michael Ji --- Daniele Menozzi [EMAIL PROTECTED] wrote: Hi all, I have questions regarding org.apache.nutch.tools.CrawlTool: I have not really understood the relationship between depth, segments, and fetching. Take for example the tutorial; I understand these 2 steps:

bin/nutch admin db -create
bin/nutch inject db -dmozfile content.rdf.u8 -subset 3000

but when I do this:

bin/nutch generate db segments

what happens? I think a dir called 'segments' is created, and inside it I can find the links I previously injected. OK. Next steps:

bin/nutch fetch $s1
bin/nutch updatedb db $s1

OK, no problems here. But now I cannot understand what happens with this command:

bin/nutch generate db segments

It is the same command as above, but now I've not injected anything into the DB; it only contains the pages I've previously fetched. So does it mean that when I generate a segment, it will automagically be filled with links found in the fetched pages? And where are these links saved? And who saves these links? Thank you so much, this work is really interesting! Menoz
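To answer the flow question in one picture: the whole-web cycle from the 0.7 tutorial is a loop, and updatedb is the step that writes newly discovered outlinks back into the db, so the next generate has fresh URLs to emit (segment names are illustrative):

    bin/nutch generate db segments        # picks due URLs out of the db
    s1=`ls -d segments/2* | tail -1`      # newest segment directory
    bin/nutch fetch $s1                   # fetches pages, records their outlinks
    bin/nutch updatedb db $s1             # folds fetched pages + new links into the db
    # repeat: generate / fetch / updatedb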
Re: NUTCH-61 running issue
hi Andrzej: Thanks for your correction. The patch compiles successfully and runs well on Nutch 0.7. Just a curious question: as stated in NUTCH-61, "...if content is unmodified it doesn't have to be fetched and processed...", yet I did a test refetching a page without content modification, and NUTCH-61 DID parse this page into content/, parse_data/, and parse_text/. I took a look at the code in Fetcher.java:

ProtocolOutput output = protocol.getProtocolOutput(fle);
ProtocolStatus pstat = output.getStatus();
...
switch (pstat.getCode()) {
  ...
  case ProtocolStatus.NOTMODIFIED:
    handleFetch(fle, output);
    break;
  ...
}

Should we just do nothing in the NOTMODIFIED case, which is the flag set when the content MD5 equals the page MD5 in the protocol-http plugin? handleFetch() actually parses and writes the data structures to segments/. Thanks, Michael Ji, --- Andrzej Bialecki [EMAIL PROTECTED] wrote: Michael Ji wrote:

FetchListEntry value = new FetchListEntry();
Page page = (Page) value.getPage().clone();

It seems value is an empty FetchListEntry instance. Will that cause the getPage() clone to fail because the page is null? Please try to replace this logic with the following:

FetchListEntry value = new FetchListEntry();
while (topN > 0 && reader.next(key, value)) {
  Page page = value.getPage();
  if (page != null) {
    Page p = new Page();
    p.set(page);
    page = p;
  }
  if (forceRefetch) {
    Page p = value.getPage();
    // reset fetchTime and MD5, so that the content will
    // always be new and unique.
    p.setNextFetchTime(0L);
    p.setMD5(MD5Hash.digest(p.getURL().toString()));
  }
  tables.append(value);
  topN--;
}

This patchset still needs a lot of thought and work. Even the part that avoids re-fetching unmodified content needs additional thinking - it's easy to end up in a state where Nutch cannot be forced to re-fetch the page, because every time you try it remains unmodified - but you need to refetch the actual data because e.g. you lost that segment data... -- Best regards, Andrzej Bialecki
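If it helps the discussion, a minimal sketch of the "do nothing" variant Michael proposes, assuming the Fetcher switch quoted above (untested; the scheduling bookkeeping is left open on purpose):

    case ProtocolStatus.NOTMODIFIED:
      // content MD5 matched the stored page MD5: skip parsing and skip writing
      // to content/, parse_data/ and parse_text/; only the fetch schedule should
      // advance (exactly how depends on how the patch tracks fetch times)
      break;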
Nutch-87 Setup
hi Matt: Your NUTCH-87 has a good idea, and I believe it provides a solution for a good-sized controlled domain, say hundreds of thousands of sites. I am currently trying to implement it on Nutch 0.7. I have several questions I'd like clarified: 1) Should I create two plug-in classes in Nutch, i.e. one for WhitelistURLFilter and one for WhitelistWriter? 2) I found that Whitelist.java refers to import epile.util.LogLevel; and WhitelistURLFilter.java refers to import epile.crawl.util.StringURL; import epile.util.LogLevel; Do these packages exist in the Nutch lib? If not, should we import a new epile*.jar? 3) If we want to use NUTCH-87, should we change the Nutch core code? I plan to replace all the places where RegexURLFilter appears with WhitelistURLFilter. Is that the right approach? thanks, Michael Ji,
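On question 3: URL filters are normally swapped by activating a different plugin rather than editing core code. If the whitelist filter is packaged as a plugin, enabling it in nutch-site.xml would look roughly like this (urlfilter-whitelist is a hypothetical plugin id; the rest mirrors the usual 0.7 default):

    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-whitelist|parse-(text|html)|index-basic|query-(basic|site|url)</value>
    </property>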
nutch excerpt
hi, There is a class, Summarizer, that accepts a text string and generates the summary for the hits page. My questions are: 1) Where does the raw text string come from? Is the db file in segment/content/ opened? 2) I guess some code must call Summarizer to trigger it, but I didn't find which code does that. thanks, Michael Ji
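My understanding (worth verifying in the 0.7 source) is that summaries are built at query time from the parsed text stored in segment/parse_text/, not from the raw content, via something like:

    // sketch of the query-time path; method names as I recall them from 0.7
    Summarizer summarizer = new Summarizer();
    Summary summary = summarizer.getSummary(parseText.getText(), query);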
link analysis in OC
hi Kelvin: Does OC compute the page score the same way Nutch crawling does? I found that Nutch's indexer computes the document boost value based on the score/anchor data in the segment/fetchlist data structure. I guess OC won't generate this boost score by itself or use its own data structure. So if we want this score saved in the Lucene index, we need to use nutch generate... to get the fetchlist and to build the webdb. That means OC will have to live with Nutch's webdb and other data structures. Is my thought right? thanks, Michael Ji
Re: bot-traps and refetching
hi Kelvin: I believe my previous email about further concerns with controlled crawling confused you a bit, via my immature thoughts. But I believe controlled crawling is generally very important for an efficient vertical crawling application. After reviewing our previous discussion, I think the solutions for bot-traps and refetching in OC might be combined into one: 1) Refetching will look at the FetcherOutput of the last run and queue the URLs according to their domain name (for the HTTP 1.1 protocol), as your FetcherThread does. 2) We might just count the number of URLs within the same domain (on the fly, as they are queued?). If that number goes over a certain threshold, we stop adding new URLs for that domain - equivalent in effect to controlled crawling, but by width. Will it work as proposed? thanks, Michael Ji, --- Kelvin Tan [EMAIL PROTECTED] wrote: Michael, On Sun, 28 Aug 2005 07:31:06 -0700 (PDT), Michael Ji wrote: Hi Kelvin: 2) refetching: If OC's fetchlist is online (memory-resident), the next time we refetch we have to restart from seeds.txt once again. Is it right? Maybe with the current implementation. But if you implement a CrawlSeedSource that reads in the FetcherOutput directory in the Nutch segment, then you can seed a crawl using what's already been fetched.
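A minimal sketch of the per-domain cap in point 2 (plain Java; class, names and threshold are hypothetical, not OC's actual API):

    import java.net.MalformedURLException;
    import java.net.URL;
    import java.util.HashMap;
    import java.util.Map;

    class HostCap {
        private final Map<String, Integer> counts = new HashMap<String, Integer>();
        private final int maxPerHost;

        HostCap(int maxPerHost) { this.maxPerHost = maxPerHost; }

        /** true if the URL may be queued; false once its host is saturated */
        boolean admit(String url) throws MalformedURLException {
            String host = new URL(url).getHost();
            Integer n = counts.get(host);
            if (n == null) n = 0;
            if (n >= maxPerHost) return false;   // bot-trap guard: domain saturated
            counts.put(host, n + 1);
            return true;
        }
    }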
Re: Fetcher for constrained crawls
Hi Jeremy: 1) I guess the solution/patch provided by Kelvin tries to enhance site-fetching performance in several ways; one of these is using HTTP 1.1 features. His crawler works site-depth-wise - a sequence of URLs with the same host. See his concept at http://www.supermind.org/index.php?cat=17 2) I think your approach is based on the existing Nutch scenario, with minimal data-structure modification in the webdb. I am running tests on Kelvin's patch now. I wonder whether you could provide more detail about your patch so that I can test it as well. thanks, Michael Ji --- Jeremy Calvert [EMAIL PROTECTED] wrote: Like Kelvin, I too have been trying to get limited crawl capabilities out of Nutch, and I've come up with a simplistic approach. I'm afraid I haven't had time to try out Kelvin's approach. I extend Page to store a depth and a radius byte. Loosely speaking, depth is the distance you can hop within a given site (based on domainID), and radius is the distance you can hop once you've left the site. You set these when you inject seed URLs. When you create new pages from outgoing links, you call linkedPage.propagateDepthAndRadius(pageWithOutgoingLink) where:

/**
 * @param incoming The pointing page.
 */
public void propagateDepthAndRadius(Page incoming) {
  boolean sameSite = false;
  try {
    sameSite = this.computeDomainID() == incoming.computeDomainID();
  } catch (MalformedURLException e) {
    // oh well, I guess they're different domains.
  }
  if (sameSite && incoming.depth > 0) {
    // same site: decrement depth, maintain radius
    this.depth = (byte) (incoming.depth - 1);
    this.radius = incoming.radius;
  } else {
    // different sites or out of depth: decrement radius
    this.depth = 0;
    this.radius = (byte) (incoming.radius - 1);
  }
}

If the page already exists when you go to add it to the DB (with instruction ADD_PAGE_IFN_PRESENT), you take the max of the existing depth and radius with the newly assigned depth and radius. The overall code modifications are about 30 lines - small additions to WebDBWriter and Page. From there, it's fun and handy to have depth and radius at your disposal when creating the fetchlist. I've written a new FetchListTool to make use of them, to keep out things that are at the end of their constraints and to prioritize pages to fetch. I also perturb the priorities slightly, by 0.001%, so that if I do have enough domains to prevent my fetches from piling up on a single host, I generally do. Impacts: WebDBWriter (12 lines), Page (~20 lines); requires a new or modified FetchListTool. It's a simple and elegant solution for constrained crawls, but it does touch the WebDB. I'm interested to hear people's thoughts, and would be more than happy to contribute a patch. J
bot-traps and refetching
Hi Kelvin: 1) The bot-traps problem for OC: if we have a crawling depth for each starting host, it seems the crawling will terminate in the end (we can decrement the depth value each time an outlink falls within the same host domain). Let me know if my thought is wrong. 2) Refetching: if OC's fetchlist is online (memory-resident), the next time we refetch we have to restart from seeds.txt once again. Is that right? 3) Page content checking: in the OC API I found WebDBContentSeenFilter, which uses the Nutch webdb data structure to see whether the fetched page content has been seen before. That means we have to use Nutch to create a webdb (maybe nutch updatedb) in order to support this function. Is that right? thanks, Michael,
crawling ability of NUTCH-84
hi Kelvin: Just a curious question. As I understand it, the goal for Nutch's global crawling ability is to reach 10 billion pages, based on the MapReduce implementation. OC, seeming to fall in the middle, is for controlled industry-domain crawling. How many sites is its goal - dealing with a couple of thousand sites? I believe what matters most for industry-domain crawling is timely updating, so identifying the content of fetched pages and saving post-parsing time is critical. thanks, Michael Ji,
Re: junit test failed
What does the junit test stand for? A particular patch? Sorry if my question is silly. Michael Ji, --- AJ Chen [EMAIL PROTECTED] wrote: I'm a newcomer, trying out Nutch for vertical search. I downloaded the code and compiled it in Cygwin, but the unit tests failed with the following message:

test-core:
[delete] Deleting directory nutch\trunk\build\test\data
[mkdir] Created dir: nutch\trunk\build\test\data
BUILD FAILED
nutch\trunk\build.xml:173: Could not create task or type of type: junit.

Did I miss anything for junit? Appreciate your help. AJ Chen
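For AJ's error specifically: "Could not create task or type of type: junit" usually means Ant's optional JUnit task can't find junit.jar. A common fix (jar name illustrative - use whatever JUnit version you have):

    cp junit-3.8.1.jar $ANT_HOME/lib/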
Re: newbie questions
I guess you need to delete the doc from the Lucene index to keep it from being hit; you can use Luke (lukeall.jar, an admin tool for Lucene indexes) to manipulate the indexed content; Michael Ji, --- haipeng du [EMAIL PROTECTED] wrote: Does Lucene have a way to delete a document from an index writer? And how could Lucene search for documents that have the value "something" without worrying about which field name carries it? For example: if one document has field name test1 with value "something" and another has field name test2 with value "something", both should be hit. Thanks a lot. -- Haipeng Du, Software Engineer, Comphealth, Salt Lake City
Re: newbie questions
I guess so, but I haven't tried that before, Michael Ji, --- haipeng du [EMAIL PROTECTED] wrote: Yes, that is right. Could I do that from the Lucene API? Thanks a lot. On 8/25/05, Michael Ji [EMAIL PROTECTED] wrote: I guess you need to delete the doc from the Lucene index to keep it from being hit; you can use Luke to manipulate the indexed content. -- Haipeng Du, Software Engineer, Comphealth, Salt Lake City
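Both operations are possible from the Lucene API of that era; a hedged sketch (Lucene 1.4-style calls, field names taken from Haipeng's example):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.queryParser.MultiFieldQueryParser;
    import org.apache.lucene.search.Query;

    // deletion goes through IndexReader, not IndexWriter, in this version
    IndexReader reader = IndexReader.open("index");
    reader.delete(new Term("test1", "something"));   // deletes all docs matching the term
    reader.close();

    // searching the same value across several fields
    Query q = MultiFieldQueryParser.parse(
        "something", new String[] {"test1", "test2"}, new StandardAnalyzer());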
dumping lucene index to text file
hi, I wonder whether I can output the content of the individual files in the index dir in a text format, meaning I can see the text saved in the index files. I saw that Plucene has this ability, but I found it only supports indexes generated by an old Lucene version. Any suggestions you could kindly provide? thanks, Michael Ji
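One way that should work through the Java API itself - note that only stored fields are recoverable this way, and the field names below are the usual Nutch ones, which may differ in your index:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.IndexReader;

    IndexReader reader = IndexReader.open("index");
    for (int i = 0; i < reader.maxDoc(); i++) {
        if (reader.isDeleted(i)) continue;          // skip deleted slots
        Document doc = reader.document(i);
        System.out.println(doc.get("url") + "\t" + doc.get("title"));
    }
    reader.close();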
crawl-urlfilter.txt mechanics
Hi, When I use intranet crawling, i.e. call bin/nutch crawl ..., crawl-urlfilter.txt works - it filters out the URLs that don't match the domain I included. When I take a look at CrawlTool.java, the config files are read into Java Properties by NutchConf.get().addConfResource("crawl-tool.xml"). But when I call each step explicitly myself, i.e. loop over generate segment / fetch / updatedb, crawl-urlfilter.txt doesn't work. My questions are: 1) If I want to control the crawler's behavior in the second case, should I call NutchConf.get()... myself? 2) Where exactly does the URL filter work? In the fetcher? And after being loaded from the .xml and .txt files, is all the configuration data kept in Properties for the lifetime of the Nutch run? thanks, Michael Ji
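If I read the 0.7 config right, the regex filter loads whatever file the urlfilter.regex.file property names; crawl-tool.xml (pulled in only by bin/nutch crawl) points it at crawl-urlfilter.txt, while the step-by-step tools fall back to regex-urlfilter.txt. So for manual runs, either put your rules in regex-urlfilter.txt or redirect the property in nutch-site.xml:

    <property>
      <name>urlfilter.regex.file</name>
      <value>crawl-urlfilter.txt</value>
    </property>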
MD5 in fetchlist / fetcher
hi there, I dumped the contents of segment/fetchlist and segment/fetcher. My curious question is: why isn't the MD5 signature of the page content saved in the fetchlist? In my mind, it would save CPU time when we see that a page is unchanged, because we could skip the parsing process. From my view, if we have the MD5 in the fetchlist, we can do the comparison directly in memory; if the MD5 is only in the fetcher output, we need to look it up in a local file in order to compare it with the MD5 of the newly fetched page content. Did I miss some important point, or is my dump wrong? thanks, Michael Ji

fetchlist:
fetch: true
page: Version: 4
URL: http://www.sina.com/
ID: d6a83e9c17e05d5602709a63c241bf68
Next fetch: Sun Aug 21 20:15:06 CDT 2005
Retries since fetch: 0
Retry interval: 30 days
Num outlinks: 0
Score: 1.0
NextScore: 1.0
anchors: 0

fetcher:
fetch: true
page: Version: 4
URL: http://www.sina.com/
ID: d6a83e9c17e05d5602709a63c241bf68
Next fetch: Sun Aug 21 20:15:06 CDT 2005
Retries since fetch: 0
Retry interval: 30 days
Num outlinks: 0
Score: 1.0
NextScore: 1.0
anchors: 0
Fetch Result:
MD5Hash: 56eae3c2556cb10a00e7346738dcb318
ProtocolStatus: success(1), lastModified=0
FetchDate: Sun Aug 14 20:15:13 CDT 2005
Re: Parse-html should be enhanced!
Would an extension from an existing extension point be a solution? Our ongoing project also needs to handle site-specific crawling cases; we are thinking about extending the current Java classes to fit our usage. Michael Ji, --- Jack Tang [EMAIL PROTECTED] wrote: Hi Nutchers, I think the parse-html parser should be enhanced. In some of my projects (intranet search engines), we only need the content inside specified detectors, filtering out the junk - say, the content between <div class="start-here"> and </div>, or detectors like XPath. Any thoughts on this enhancement? Regards /Jack
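A crude illustration of the kind of extraction Jack describes, using plain string scanning (purely hypothetical - a real implementation would hook into the parse as an extension and handle nesting properly):

    String start = "<div class=\"start-here\">";
    int b = html.indexOf(start);
    int e = (b >= 0) ? html.indexOf("</div>", b) : -1;
    // fall back to the whole page if the markers are absent or malformed
    String content = (b >= 0 && e > b) ? html.substring(b + start.length(), e) : html;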
Merge Lucene to Nutch
As I understand it, Nutch is a crawling/searching application based on Lucene. Just a curious question: when Lucene has a new version/release, how is Lucene merged into Nutch? I didn't see explicit Lucene Java source in the Nutch source tree, and I don't think Nutch reimplements the Lucene low-level API independently. Thanks, Michael Ji
Re: How to extend Nutch?
hi Fuad: I am probably doing the same thing. I think a plug-in is the right place to put my own code, but I am not sure why we would need to touch the other config files. Regards, Michael Ji --- Fuad Efendi [EMAIL PROTECTED] wrote: I need some pre-processing, to add additional fields to Document and to show them on a web page. I probably need to work with plugins and to modify config files... nutch-conf.xsl, nutch-default.xml, nutch-site.xml. Am I right? Thanks. -Original Message- From: Fuad Efendi Sent: Wednesday, August 10, 2005 2:15 PM To: nutch-user@lucene.apache.org Subject: RE: [Nutch-general] How to extend Nutch. So, I need to modify some existing classes, isn't it? -Original Message- From: [EMAIL PROTECTED] Sent: Wednesday, August 10, 2005 1:48 PM Subject: Re: [Nutch-general] How to extend Nutch. Probably IndexingFilter, or HtmlParser for indexing; and for searching I think there is something in org.apache.nutch.search - some class that starts with Raw. I just saw this in the Javadoc earlier. Otis --- Fuad Efendi [EMAIL PROTECTED] wrote: I need specific pre-processing of an html page, to add more fields to Document before storing it in the index, and to modify the web interface accordingly. Where is the base point of extension? Thanks!
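A hedged sketch of the IndexingFilter route Otis mentions - I'm reciting the 0.7 extension point from memory, so verify the exact interface and package paths in the javadoc before coding against it:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.nutch.indexer.IndexingException;
    import org.apache.nutch.indexer.IndexingFilter;
    import org.apache.nutch.pagedb.FetchListEntry;
    import org.apache.nutch.parse.Parse;

    // adds one extra stored field to every document at index time
    public class MyIndexingFilter implements IndexingFilter {
        public Document filter(Document doc, Parse parse, FetchListEntry fle)
                throws IndexingException {
            // Field.UnIndexed = stored but not searchable (Lucene 1.4-era API);
            // field name and value are hypothetical
            doc.add(Field.UnIndexed("myfield", "some value"));
            return doc;
        }
    }

The plugin then has to be activated via plugin.includes in nutch-site.xml, which is why the config files come into it.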
Re: Http Max Delays
I met that problem before; after I changed the http.timeout and max-delay values to 100 times the default settings, the problem was gone. You might look at nutch-default.xml and override the values in nutch-site.xml, Michael, --- Drew Farris [EMAIL PROTECTED] wrote: By any chance are you crawling many pages stored on a single server or a small number of servers? If so, take a look at: http://www.mail-archive.com/nutch-developers%40lists.sourceforge.net/msg04414.html http://www.mail-archive.com/nutch-developers%40lists.sourceforge.net/msg04427.html On 7/27/05, Christophe Noel [EMAIL PROTECTED] wrote: Hello, when I'm fetching, I really get too many HTTP timeouts with the default Nutch parameters. Does anyone have tips to improve that point? Thanks very much. Christophe Noël. www.cetic.be

org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
    at org.apache.nutch.protocol.httpclient.Http.blockAddr(Http.java:133)
    at org.apache.nutch.protocol.httpclient.Http.getProtocolOutput(Http.java:201)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:135)
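The override Michael describes goes in nutch-site.xml. The property names exist in nutch-default.xml; the values below just illustrate his "100 times the default" suggestion:

    <property>
      <name>http.max.delays</name>
      <value>300</value>
    </property>
    <property>
      <name>http.timeout</name>
      <value>1000000</value>  <!-- milliseconds -->
    </property>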
http.max.delays
Hi there: I checked the log file and found that some site links hit the error "Exceeded http.max.delays: retry later". I changed the corresponding value in the conf file, nutch-default.xml, to 300, but that seems still not enough. Will that affect crawling performance? Any ideas? thanks, Michael
Nutch's intranet VS internet crawling
I wonder whether there is any difference between these two. Or must intranet crawling indicate the intranet site explicitly in crawl-urlfilter.txt under /conf? thanks, Michael,
RE: fetching behavior of Nutch
thanks Howie, that guides me, Michael, --- Howie Wang [EMAIL PROTECTED] wrote: There are probably two settings you'll need to tweak in nutch-default.xml: http.content.limit - by default it's 64K; if a page is larger than that, the fetch essentially truncates the file, so you could be missing lots of links that appear later in the page. max.outlinks.per.page - by default it's 100. You might want to increase this, since for pages with something like a nested navigation sidebar with tons of links, it won't get any links from the main part of the page. The *.xml files are fairly descriptive, so just reading through them can be pretty helpful. I don't know if there is a full guide to the config files. Howie. Michael wrote: 1) I did several test runs fetching pages from two websites. The fetching depth is 10. After checking the log files, I found that the number of actually fetched links is very different for the two sites. On one site with lots of news, only the first two depths ran well, fetching only 5 links; the actual number of links on that site is far beyond that. The other site fetched through all 10 rounds and got hundreds of links. I wonder if anyone has similar experience. Should I set up the config files in /conf/? 2) Also, in the Nutch /conf/ directory I found several configuration files. I only modified crawl-urlfilter.txt to accept all URLs (*.*). Is that proper? I haven't touched the other conf files. Is there a guideline on how to use these files? thanks, Michael,
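The overrides Howie suggests would go in nutch-site.xml (property names as he cites them - double-check the exact keys in your version's nutch-default.xml; values are illustrative):

    <property>
      <name>http.content.limit</name>
      <value>262144</value>  <!-- raise the 64K truncation cap -->
    </property>
    <property>
      <name>max.outlinks.per.page</name>
      <value>500</value>
    </property>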
Re: a silly question
hi there: Actually, this is the first time I've set up Nutch. 1) I crawled a website in the Nutch directory. 2) I deleted ROOT in Tomcat and copied nutch*.war to tomcat/webapps. 3) I could run Tomcat successfully before and after copying the *.war file, meaning I could see the default Tomcat homepage before and the Nutch search page after. 4) I changed file permissions to 777 using chmod for both the Nutch and Tomcat folders. Are there any particular files or folders whose permissions I need to take care of? Then, when I type text into the search box and hit the search button, it gives me an "HTTP status 500" error message. Any more suggestions? I found the search engine actually runs search.jsp. Where is this JSP? Somewhere in Nutch, I guess. thanks, Michael, --- Fredrik Andersson [EMAIL PROTECTED] wrote: Have you deleted your old nutch.war (and the directory named nutch) in the Tomcat webapps directory prior to inserting the new war file? Tomcat can act pretty strangely if you don't delete the old x.war and the webapps/x dir. Using port 8080? Permissions on the .jsp and war file OK? Can you access the standard Tomcat site by browsing to localhost:8080 ("Congratulations, you have successfully installed Tomcat" blabla...)? On 7/16/05, Feng (Michael) Ji [EMAIL PROTECTED] wrote: Hi there, I know this question might be better suited to Nutch's user group, but if anyone could help me a bit, I would really appreciate it. I tried to run Nutch on my Linux server and followed each step in the tutorial at http://lucene.apache.org/nutch/tutorial.html I can run Nutch successfully, e.g. crawling and saving the db to my local server. I can load the Tomcat page with the Nutch search home page successfully, but after I hit the search button, Tomcat gives me an HTTP status 500 error... It looks like the JSP doesn't compile properly in the Tomcat container. I think it is not a Tomcat problem; somehow Nutch doesn't compile well, or maybe some Nutch library isn't being accessed properly. Has anyone had a similar experience? Any suggestion would be very helpful. thanks ahead, Michael,