Re: Have anybody thought of replacing CrawlDb with any kind of Rational DB?
Howie Wang wrote:
> Sorry about the previous crappily formatted message. In brief, my point was that a relational DB might perform better for small niche users, plus you get the flexibility of SQL. No more writing custom code to tweak the webdb.
>
> Howie

Generally speaking, I agree that it would be a good option to have, especially for smaller setups, but it would require extensive modifications to many tools in Nutch. Unless you are willing to provide patches that implement it without breaking the large-scale case, I think we should let the matter rest ...

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
RE: Have anybody thought of replacing CrawlDb with any kind of Rational DB?
I definitely don't expect people to write it just because it happens to be useful to me :-)

Call me crazy, but I'm thinking of implementing this when I get some free time (whenever that will be). It seems that I would just need to implement IWebDBWriter and IWebDBReader, and then add a command line option to the tools (something like -mysql) to specify the type of db to instantiate. It would affect about 15 files, but the tools changes would be simple -- a few if statements here and there.

Does that sound right?

Howie
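The shape of the plan above (one writer interface, several backends, a command line switch that picks one) can be sketched as follows. This is only an illustration under stated assumptions: Nutch 0.7 is written in Java, the class names `WebDBWriter`, `InMemoryWebDBWriter`, `SqlWebDBWriter`, the helper `make_writer`, and the `--db` flag are all hypothetical and not Nutch APIs, and `sqlite3` stands in for MySQL so the sketch stays self-contained.

```python
import argparse
import sqlite3
from abc import ABC, abstractmethod


class WebDBWriter(ABC):
    """Hypothetical stand-in for a writer interface such as IWebDBWriter."""

    @abstractmethod
    def add_page(self, url: str, score: float) -> None:
        ...

    @abstractmethod
    def close(self) -> None:
        ...


class InMemoryWebDBWriter(WebDBWriter):
    """Placeholder for the existing (non-relational) backend."""

    def __init__(self) -> None:
        self.pages = {}

    def add_page(self, url: str, score: float) -> None:
        self.pages[url] = score

    def close(self) -> None:
        pass


class SqlWebDBWriter(WebDBWriter):
    """Relational backend; sqlite3 is used here in place of MySQL."""

    def __init__(self, path: str = ":memory:") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS page (url TEXT PRIMARY KEY, score REAL)"
        )

    def add_page(self, url: str, score: float) -> None:
        self.conn.execute(
            "INSERT OR REPLACE INTO page (url, score) VALUES (?, ?)", (url, score)
        )

    def close(self) -> None:
        self.conn.commit()
        self.conn.close()


def make_writer(db_type: str) -> WebDBWriter:
    """The 'few if statements' each tool would need to pick a backend."""
    if db_type == "sql":
        return SqlWebDBWriter()
    return InMemoryWebDBWriter()


def parse_args(argv):
    """Parse the hypothetical --db switch (the thread suggests something like -mysql)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--db", choices=["file", "sql"], default="file")
    return parser.parse_args(argv)
```

A tool would then call `make_writer(parse_args(sys.argv[1:]).db)` once and use only the interface afterwards, which is what keeps the per-tool change small in this sketch.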
Runing a nutch crawler on Eclipse
Hi,

I am getting some errors while trying to compile Nutch using Eclipse:

070413 115922 parsing file:/C:/Documents%20and%20Settings/ecslogon/Desktop/nutch-0.7.2/bin/tmp_build/nutch-default.xml
070413 115922 parsing file:/C:/Documents%20and%20Settings/ecslogon/Desktop/nutch-0.7.2/bin/tmp_build/crawl-tool.xml
070413 115922 parsing file:/C:/Documents%20and%20Settings/ecslogon/Desktop/nutch-0.7.2/bin/tmp_build/nutch-site.xml
070413 115922 No FS indicated, using default:local

Could anyone help me out?

Tanmoy
Re: Have anybody thought of replacing CrawlDb with any kind of Rational DB?
Actually, the Nutch people are kind of autocratic; don't expect more from them. They do what they have decided.

I am waiting for a really stable product with incremental indexing, which detects and adds/removes pages as soon as they are added/removed. But they don't want to do this, and I don't know why. What is their mission?

If we join together to implement this, it would be better. I can work on this as a weekend project. Ping me if you want.

On 4/13/07, Howie Wang wrote:
> I definitely don't expect people to write it just because it happens to be useful to me :-)
>
> Call me crazy, but I'm thinking of implementing this when I get some free time (whenever that will be). It seems that I would just need to implement IWebDBWriter and IWebDBReader, and then add a command line option to the tools (something like -mysql) to specify the type of db to instantiate. It would affect about 15 files, but the tools changes would be simple -- a few if statements here and there.
>
> Does that sound right?
>
> Howie
Re: Have anybody thought of replacing CrawlDb with any kind of Rational DB?
Arun Kaundal wrote:
> Actually, the Nutch people are kind of autocratic; don't expect more from them. They do what they have decided.

Have you submitted patches that have been ignored or rejected?

Each Nutch contributor indeed does what he or she decides. Nutch is not a service organization that implements every feature that someone requests. It is a collaborative project of volunteers. Each contributor adds things they need, and others share the benefits.

> I am waiting for a really stable product with incremental indexing, which detects and adds/removes pages as soon as they are added/removed. But they don't want to do this, and I don't know why.

Perhaps because this is difficult, especially while still supporting large crawls. But if others don't want to implement this, I encourage you to try to implement it, and, if you succeed, contribute it back to the project. That's the way Nutch grows.

> What is their mission? If we join together to implement this, it would be better. I can work on this as a weekend project. Ping me if you want.

You can of course fork Nutch, or start a new project from scratch. But you ought to also consider submitting patches to Nutch, working with other contributors to solve your problems here before abandoning Nutch in favor of another project.

Cheers,
Doug
Re: Have anybody thought of replacing CrawlDb with any kind of Rational DB?
Howie Wang wrote:
> I definitely don't expect people to write it just because it happens to be useful to me :-)
>
> Call me crazy, but I'm thinking of implementing this when I get some free time (whenever that will be). It seems that I would just need to implement IWebDBWriter and IWebDBReader, and then add a command line option to the tools (something like -mysql) to specify the type of db to instantiate. It would affect about 15 files, but the tools changes would be simple -- a few if statements here and there.
>
> Does that sound right?
>
> Howie

You are talking about the codebase from branch 0.7. This branch is not under active development. The current codebase is very different: it uses the MapReduce framework to process data in a distributed fashion. So, there is no single interface for writing the CrawlDb. There is one class for reading the CrawlDb, but usually the data in the DB is not used standalone, but as one of many inputs to a map-reduce job.

To summarize: I think it would be very difficult to do this with the current codebase.

--
Best regards,
Andrzej Bialecki
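To illustrate the point that the CrawlDb is one of several inputs to a map-reduce job rather than something written through a single interface, here is a toy, single-process sketch. It is not Nutch or Hadoop code: the record layout (`(url, status)` pairs), the status strings, and the function names `map_crawldb`, `map_fetch`, `reduce_entry`, and `update_crawldb` are assumptions made for illustration only.

```python
from collections import defaultdict


def map_crawldb(entry):
    """Tag an existing CrawlDb record with its source."""
    url, status = entry
    yield url, ("db", status)


def map_fetch(entry):
    """Tag a fresh fetch result with its source."""
    url, status = entry
    yield url, ("fetch", status)


def reduce_entry(url, values):
    """Merge all records for one URL: a fetch result overrides the old status."""
    status = None
    for source, value in values:
        if source == "fetch":
            return url, value
        status = value
    return url, status


def update_crawldb(crawldb, fetch_results):
    """Run both inputs through map, group by URL, then reduce each group."""
    groups = defaultdict(list)
    for entry in crawldb:
        for key, value in map_crawldb(entry):
            groups[key].append(value)
    for entry in fetch_results:
        for key, value in map_fetch(entry):
            groups[key].append(value)
    return dict(reduce_entry(url, values) for url, values in sorted(groups.items()))
```

In this toy model the updated database is simply the output of the job over both inputs, so there is no single place where a relational writer could be swapped in.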
RE: Have anybody thought of replacing CrawlDb with any kind of Rational DB?
Thanks for the input, Andrzej. Yes, I'm still working off of 0.7. I might still try it since I'm not planning on upgrading for a while, but it sounds like it's not going to port to the current versions.

Howie
WritingPluginExample-0.8 by RicardoJMendez
Hi,

I'm interested in building a Nutch plugin. I am having trouble getting the example 'recommended' plugin to work. I followed all of the steps in http://wiki.apache.org/nutch/WritingPluginExample-0%2e9, confirmed after I ran the top-level ant that build/plugins/recommended contained the plugin.xml and jar file for the 'recommended' plugin, and then tried crawling a single page from a local webserver that contains the test content (with the 'recommended' meta tag) from the example. Although the page got crawled/indexed and I can search for it, I see no evidence of any rank boosting on the explain search link, and when I look at NUTCHDIR/logs/hadoop.log I don't see any indication that the recommended filter got loaded by the crawl. If anyone has suggestions I'd appreciate hearing them.

Also, I noticed a couple of things on the example wiki page that I didn't understand and/or that looked odd:

1. In the section on Getting Ant to Compile Your Plugin, it says to add this line to NUTCHDIR/src/plugin/build.xml:

   <ant dir="reccomended" target="deploy"/>

   There's an extra c in there (typo). (I fixed my local copy before I ran the crawl; telling you in case you want to update the wiki. I don't want to edit it myself until I have actually gotten it working...)

2. In the section on Getting Nutch to Use Your Plugin, it says to add a regex to include the id of the plugin, using the example:

   <value>recommended|protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>

   But the description just above this part says you need to at least include the nutch-extensionpoints plugin (which is not present in this line). I notice from the wiki edit history that you used to have the nutch-extensionpoints plugin in there and removed it, so I'm not sure which way it's supposed to be. Which is correct? (I tried it both with and without nutch-extensionpoints and neither way worked for me.)

Thanks,
Mike Schwartz
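One way to look at the second question is to test plugin ids against the include expression directly. The sketch below assumes the expression has to match a plugin id in full. Nutch is written in Java, so Python's `re.fullmatch` is only a stand-in here, although the pattern is simple enough that both regex dialects should treat it the same way. The list of candidate ids and the helper name `included_plugins` are made up for illustration.

```python
import re

# The include expression quoted from the wiki page in the message above.
PLUGIN_INCLUDES = (
    "recommended|protocol-http|urlfilter-regex|parse-(text|html|js)|"
    "index-basic|query-(basic|site|url)|summary-basic|scoring-opic|"
    "urlnormalizer-(pass|regex|basic)"
)


def included_plugins(pattern, plugin_ids):
    """Return the ids that the pattern matches in full (assumed matching rule)."""
    compiled = re.compile(pattern)
    return [pid for pid in plugin_ids if compiled.fullmatch(pid)]


# Illustrative ids only; a real installation has many more plugins.
candidates = ["recommended", "nutch-extensionpoints", "parse-html", "parse-pdf"]
print(included_plugins(PLUGIN_INCLUDES, candidates))
# prints ['recommended', 'parse-html']
```

Under that assumption the expression from the wiki does not select nutch-extensionpoints by itself. Whether Nutch loads that plugin anyway, for example as a dependency of the others, is not something this check can show.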