cha wrote:
> Hi Sagar,
>
> Thanks for the reply.
>
> Actually, I am trying to dig out the code in the same class, but I am not
> able to figure out where the URLs are read from.
>
> When you dump the database, the file contains:
>
> http://blog.cha.com/	Version: 4
> Status: 2 (DB_fetched)
> Fetch time: Fri Apr 13 15:58:28 IST 2007
> Modified time: Thu Jan 01 05:30:00 IST 1970
> Retries since fetch: 0
> Retry interval: 30.0 days
> Score: 0.062367838
> Signature: 2b4e94ff83b8a4aa6ed061f607683d2e
> Metadata: null
>
> I have figured out the rest, but I am not sure how the URL itself is read.
>
> I just want plain URLs in the text file. Is it also possible to write the
> URLs in some XML format? If yes, then how?
>
> Awaiting your reply,
>
> Chandresh

Hi,

The crawldb is actually a MapFile, with URLs as keys (the Text class) and CrawlDatum objects as values. You can write a generic MapFile reader that extracts the keys and dumps them to a file.
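If plain URLs are all you need, a quick alternative to writing your own reader is to post-process the text dump you already have (produced, for example, by `bin/nutch readdb crawldb -dump <dir>`; the exact command and paths depend on your Nutch version, so treat them as assumptions). A minimal sketch, assuming each record in the dump starts with the URL and the `Version:` field on one tab-separated line, as in your sample, and a hypothetical dump file named `crawldb_dump.txt`:

```shell
# Each record begins with "<url>\tVersion: N", while the remaining
# fields (Status, Fetch time, ...) are on their own lines. Keeping only
# the lines that contain "Version:" and cutting the first tab-separated
# field yields one plain URL per line.
grep 'Version:' crawldb_dump.txt | cut -f1 > urls.txt
```

Once you have one URL per line in `urls.txt`, wrapping them in whatever XML format you need is a trivial follow-up step with sed or a small script.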
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
