The crawldb is a serialization of Hadoop's org.apache.hadoop.io.MapFile
class. This structure consists of two SequenceFiles, one for data and one
for the index. Here is an excerpt from the javadoc of the MapFile class:
A file-based map from keys to values.

<p>A map is a directory containing two files, the <code>data</code> file,
containing all keys and values in the map, and a smaller <code>index</code>
file, containing a fraction of the keys. The fraction is determined by
{@link Writer#getIndexInterval()}.
The MapFile.Reader class reads the contents of a map file. Using this
class you can enumerate all the entries of the map file, and since the
keys of the crawldb are Text objects containing URLs, you can just dump
the keys one by one to another file. Try something like the following
(wrapped in a small demo class):
import java.io.PrintStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;

public class CrawlDbDumper {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    PrintStream out = System.out;
    // args[0] is one map file directory of the crawldb,
    // e.g. crawldb/current/part-00000
    MapFile.Reader reader = new MapFile.Reader(fs, args[0], conf);
    Class keyC = reader.getKeyClass();
    Class valueC = reader.getValueClass();
    while (true) {
      WritableComparable key = null;
      Writable value = null;
      try {
        // create empty key/value instances for reader.next() to fill in
        key = (WritableComparable) keyC.newInstance();
        value = (Writable) valueC.newInstance();
      } catch (Exception ex) {
        ex.printStackTrace();
        System.exit(-1);
      }
      try {
        if (!reader.next(key, value)) {
          break; // end of the map file
        }
        out.println(key);   // the url (a Text object)
        out.println(value); // the CrawlDatum
      } catch (Exception e) {
        e.printStackTrace();
        out.println("Exception occurred. " + e);
        break;
      }
    }
    reader.close();
  }
}
This code is just for demonstration; of course you can customize it for
your needs, for example to print in XML format. You can check the
javadocs of the CrawlDatum, CrawlDb, Text, MapFile and SequenceFile
classes for further insight.
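As a rough sketch of the XML idea (the <entry>, <url> and <datum> element
names here are made up just for illustration, and a real version should
also escape characters such as & and < in the url), you could replace the
two println calls in the loop with:

out.println("<entry>");
out.println("  <url>" + key + "</url>");
out.println("  <datum>" + value + "</datum>");
out.println("</entry>");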
cha wrote:
Hi Enis,
I still can't figure out how it can be done. Can you explain it in more
detail, please?
Regards,
Chandresh
Enis Soztutar wrote:
cha wrote:
Hi Sagar,
Thanks for the reply.
Actually I am trying to dig out the code in the same class, but I am not
able to figure out where the URLs are read from.
When you dump the database, the file contains:
http://blog.cha.com/ Version: 4
Status: 2 (DB_fetched)
Fetch time: Fri Apr 13 15:58:28 IST 2007
Modified time: Thu Jan 01 05:30:00 IST 1970
Retries since fetch: 0
Retry interval: 30.0 days
Score: 0.062367838
Signature: 2b4e94ff83b8a4aa6ed061f607683d2e
Metadata: null
I figured out the rest of it, but I am not sure how the URL name is
read.
I just want the plain URLs in the text file. Is it possible to write the
URLs in some XML format? If so, how?
Awaiting,
Chandresh
Hi, the crawldb is actually a map file, which has URLs as keys (Text
class) and CrawlDatum objects as values. You can write a generic map file
reader which extracts the keys and dumps them to a file.
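For example, since the crawldb keys are Text and the values are
CrawlDatum, the core of such a reader could be as simple as this sketch
(reusing the fs/conf/out setup from the code above; dirName is a
hypothetical variable pointing to one map file of the crawldb, and the
imports needed are org.apache.hadoop.io.Text and
org.apache.nutch.crawl.CrawlDatum):

MapFile.Reader reader = new MapFile.Reader(fs, dirName, conf);
Text key = new Text();
CrawlDatum value = new CrawlDatum();
// next() refills the same key/value objects on each call
while (reader.next(key, value)) {
  out.println(key); // just the plain url, one per line
}
reader.close();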