Re: [Nutch-general] Getting the real data not only the segment files/index

Arun Kaundal Tue, 07 Nov 2006 20:24:31 -0800

Hi Nils,

As far as I know, Nutch does not support this feature to date. If it
does, do let me know. I also need Nutch to support this feature;
otherwise I am planning to move to the same approach you did, using wget
and Lucene.


Keep in touch...
./Arun


On 11/7/06, Nils Höller <[EMAIL PROTECTED]> wrote:


Hi,

I worked with Nutch until last year, and I am now trying to do
something new with it (about continuous queries).

So far I have only used Nutch for building the index and searching in a
generated site map (with the WebDB).

Now I want to use it to build an archive of a certain number of sites.
So I want Nutch to crawl the sites every day (as I used it before) but
also download and save the REAL content of the sites (all HTML and
pictures), so I can work with this real content.

Is there a way to make Nutch also save the content as it is crawled,
instead of only creating the WebDB and index?

At the moment I have a solution with a Perl script, wget, and Lucene,
but it would be perfect if I could use Nutch from now on.

Thanks for your help.

Nils


