Re: Has anybody thought of replacing CrawlDb with any kind of relational DB?

2007-04-13 Thread Andrzej Bialecki

Howie Wang wrote:

Sorry about the previous crappily formatted message. In brief, my
point was that a relational DB might perform better for small niche
users, and plus you get the flexibility of SQL. No more writing custom
code to tweak the webdb.

Howie


Generally speaking, I agree that it would be a good option to have, 
especially for smaller setups - but it would require extensive 
modifications to many tools in Nutch. Unless you are willing to provide 
patches that implement it without breaking the large-scale case, I think 
we should let the matter rest ...



--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



RE: Has anybody thought of replacing CrawlDb with any kind of relational DB?

2007-04-13 Thread Howie Wang
I definitely don't expect people to write it just because it happens to be
useful to me :-) Call me crazy, but I'm thinking of implementing this when I
get some free time (whenever that will be). It seems that I would just need to
implement IWebDBWriter and IWebDBReader, and then add a command line option to
the tools (something like -mysql) to specify the type of db to instantiate. It
would affect about 15 files, but the tools changes would be simple -- a few if
statements here and there. Does that sound right?

Howie
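The plan above can be sketched roughly as follows. The interface here is a simplified stand-in for Nutch 0.7's IWebDBWriter (the real interface has more methods), and this hypothetical SQL-backed writer only accumulates the statements it would issue, so it runs without a live database or JDBC driver:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for Nutch 0.7's IWebDBWriter; the real interface
// has more methods (addPage, addLink, deletePage, close, ...).
interface WebDBWriter {
    void addPage(String url, float score);
    void close();
}

// Sketch of a SQL-backed writer: instead of executing against a live
// database, it records the statements it would run.
class SqlWebDBWriter implements WebDBWriter {
    private final List<String> statements = new ArrayList<>();

    public void addPage(String url, float score) {
        // A real implementation would use a PreparedStatement rather
        // than string concatenation.
        statements.add("INSERT INTO pages (url, score) VALUES ('"
                + url.replace("'", "''") + "', " + score + ")");
    }

    public void close() {
        // A real implementation would commit and close the connection here.
    }

    List<String> statements() { return statements; }
}

public class WebDBSketch {
    public static void main(String[] args) {
        // The -mysql flag Howie mentions would select this writer in the tools.
        SqlWebDBWriter writer = new SqlWebDBWriter();
        writer.addPage("http://example.com/", 1.0f);
        writer.close();
        for (String s : writer.statements()) System.out.println(s);
    }
}
```

The per-tool changes would then reduce to instantiating SqlWebDBWriter instead of the file-based writer when the flag is present.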

Running a Nutch crawler in Eclipse

2007-04-13 Thread Tanmoy Kumar Mukherjee
Hi,

Here are some of the errors I am getting while trying to compile Nutch
using Eclipse:

070413 115922 parsing file:/C:/Documents%20and%20Settings/ecslogon/Desktop/nutch-0.7.2/bin/tmp_build/nutch-default.xml
070413 115922 parsing file:/C:/Documents%20and%20Settings/ecslogon/Desktop/nutch-0.7.2/bin/tmp_build/crawl-tool.xml
070413 115922 parsing file:/C:/Documents%20and%20Settings/ecslogon/Desktop/nutch-0.7.2/bin/tmp_build/nutch-site.xml
070413 115922 No FS indicated, using default:local

Could anyone help me out?
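For what it's worth, the last log line ("No FS indicated, using default:local") is informational rather than an error: no distributed filesystem was configured, so the local one is used. A hypothetical nutch-site.xml override (property name and root element assumed from the 0.7 configuration files) would look like:

```xml
<?xml version="1.0"?>
<nutch-conf>
  <property>
    <name>fs.default.name</name>
    <value>local</value> <!-- or host:port of an NDFS namenode -->
  </property>
</nutch-conf>
```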

Tanmoy  


Re: Has anybody thought of replacing CrawlDb with any kind of relational DB?

2007-04-13 Thread Arun Kaundal

Actually the Nutch people are kind of autocratic; don't expect more from them.
They do what they have decided. I am waiting for a really stable product with
incremental indexing, which detects and adds/removes pages as soon as they are
added/removed. But they don't want to do this, and I don't know why. What is
their mission? If we join together to implement this, it would be better. I
can work on this as a weekend project.

Ping me if you want.


On 4/13/07, Howie Wang [EMAIL PROTECTED] wrote:


I definitely don't expect people to write it just because it happens to be
useful to me :-) Call me crazy, but I'm thinking of implementing this when
I get some free time (whenever that will be). It seems that I would just
need to implement IWebDBWriter and IWebDBReader, and then add a command
line option to the tools (something like -mysql) to specify the type of db
to instantiate. It would affect about 15 files, but the tools changes would
be simple -- a few if statements here and there. Does that sound right?

Howie


Re: Has anybody thought of replacing CrawlDb with any kind of relational DB?

2007-04-13 Thread Doug Cutting

Arun Kaundal wrote:

Actually the Nutch people are kind of autocratic; don't expect more from them.
They do what they have decided.


Have you submitted patches that have been ignored or rejected?

Each Nutch contributor indeed does what he or she decides.  Nutch is not 
a service organization that implements every feature that someone 
requests.  It is a collaborative project of volunteers.  Each 
contributor adds things they need, and others share the benefits.



I am waiting for a really stable product with incremental indexing, which
detects and adds/removes pages as soon as they are added/removed. But they
don't want to do this, and I don't know why.


Perhaps because this is difficult, especially while still supporting 
large crawls.  But if others don't want to implement this, I encourage 
you to try to implement it, and, if you succeed, contribute it back to 
the project.  That's the way Nutch grows.



What is their mission? If we join together to implement this, it would be
better. I can work on this as a weekend project.

Ping me if you want.


You can of course fork Nutch, or start a new project from scratch.  But 
you ought to also consider submitting patches to Nutch, working with 
other contributors to solve your problems here before abandoning 
Nutch in favor of another project.


Cheers,

Doug


Re: Has anybody thought of replacing CrawlDb with any kind of relational DB?

2007-04-13 Thread Andrzej Bialecki

Howie Wang wrote:

I definitely don't expect people to write it just because it happens
to be useful to me :-) Call me crazy, but I'm thinking of implementing
this when I get some free time (whenever that will be). It seems that
I would just need to implement IWebDBWriter and IWebDBReader, and then
add a command line option to the tools (something like -mysql) to
specify the type of db to instantiate. It would affect about 15 files,
but the tools changes would be simple -- a few if statements here and
there. Does that sound right?

Howie


You are talking about the codebase from branch 0.7. This branch is not 
under active development. The current codebase is very different - it 
uses the MapReduce framework to process data in a distributed fashion.


So, there is no single interface for writing the CrawlDb. There is one 
class for reading the CrawlDb, but usually the data in the DB is used 
not standalone, but as one of many inputs to a map-reduce job.


To summarize - I think it would be very difficult to do this with the 
current codebase.
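A toy, framework-free illustration of the pattern described above, in which CrawlDb records are not read standalone but merged by URL with other inputs in a reduce step (the status strings and the two-input shape are simplified for illustration, not Nutch's actual CrawlDatum handling):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class CrawlDbJoinSketch {
    // Merge CrawlDb records with the latest fetch output, keyed by URL --
    // the shape of the reduce step in an updatedb-style map-reduce job.
    static Map<String, String> merge(Map<String, String> crawlDb,
                                     Map<String, String> fetchOutput) {
        // Start from the existing CrawlDb records, then override the
        // status of every URL that appears in the fetch output.
        Map<String, String> updated = new TreeMap<>(crawlDb);
        for (String url : fetchOutput.keySet()) {
            updated.put(url, "db_fetched");
        }
        return updated;
    }

    public static void main(String[] args) {
        Map<String, String> crawlDb = new HashMap<>();
        crawlDb.put("http://a/", "db_unfetched");
        crawlDb.put("http://b/", "db_fetched");

        Map<String, String> fetchOutput = new HashMap<>();
        fetchOutput.put("http://a/", "fetch_success"); // just fetched

        System.out.println(merge(crawlDb, fetchOutput));
        // → {http://a/=db_fetched, http://b/=db_fetched}
    }
}
```

In the real codebase both inputs arrive already sorted and partitioned by the framework, which is why a standalone writer interface like 0.7's has no direct equivalent.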


--
Best regards,
Andrzej Bialecki 



RE: Has anybody thought of replacing CrawlDb with any kind of relational DB?

2007-04-13 Thread Howie Wang
Thanks for the input, Andrzej. Yes, I'm still working off of 0.7. I might
still try it since I'm not planning on upgrading for a while, but it sounds
like it's not going to port to the current versions.

Howie

WritingPluginExample-0.8 by RicardoJMendez

2007-04-13 Thread Mike Schwartz

Hi,

I'm interested in building a Nutch plugin.  I am having trouble 
getting the example recommended plugin to work - I followed all of 
the steps in http://wiki.apache.org/nutch/WritingPluginExample-0%2e9, 
confirmed after I ran the top-level ant that 
build/plugins/recommended contained the plugin.xml and jar file for 
the 'recommended' plugin, and then tried crawling a single page from 
a local webserver that contains the test content (with the 
"recommended" meta tag) from the example.  Although the page got 
crawled/indexed and I can search for it, I see no evidence of any 
rank boosting on the explain search link, and when I look at 
NUTCHDIR/logs/hadoop.log I don't see any indication that the 
recommended filter got loaded by the crawl.


If anyone has suggestions I'd appreciate hearing them.

Also, a couple of things I notice that I didn't understand and/or 
looked odd from the example wiki page:


1. In the section on Getting Ant to Compile Your Plugin, it said to 
add the line into NUTCHDIR/src/plugin/build.xml:

<ant dir="reccomended" target="deploy" />

There's an extra c in there (typo).  (I fixed my local copy before 
I ran the crawl; telling you in case you want to update the wiki; I 
don't want to edit it myself until I have actually gotten it working...)


2. In the section on Getting Nutch to Use Your Plugin it said to 
add a regex to include the id of the plugin, using the example: 
<value>recommended|protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>


But the description just above this part says you need to at least 
include the nutch-extensionpoints plugin (which is not present in 
this line).  I notice from the wiki edit history you used to have the 
nutch-extensionpoints plugin in there and removed it, so I'm not sure 
which way it's supposed to be -- what's correct?


(I tried it both with and without the nutch-extensionpoints and 
neither way worked for me.)


Thanks
 - Mike Schwartz