The crawldb is a serialization of Hadoop's org.apache.hadoop.io.MapFile
class. This structure consists of two SequenceFiles, one for data and one
for the index. Here is an excerpt from the javadoc of the MapFile class:
A file-based map from keys to values.

<p>A map is a directory containing two files, the <code>data</code> file,
containing all keys and values in the map, and a smaller <code>index</code>
file, containing a fraction of the keys. The fraction is determined by
{@link Writer#getIndexInterval()}.
The MapFile.Reader class reads the contents of a map file. Using this
class you can enumerate all the entries of the map file, and since the
keys of the crawldb are Text objects containing URLs, you can just dump
the keys one by one to another file. Try something like the following
(wrapped in a small demo class):
import java.io.PrintStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;

public class CrawlDbDumper {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    PrintStream out = System.out;
    // args[0] is one map file directory of the crawldb,
    // e.g. crawldb/current/part-00000
    MapFile.Reader reader = new MapFile.Reader(fs, args[0], conf);
    Class keyC = reader.getKeyClass();
    Class valueC = reader.getValueClass();
    while (true) {
      WritableComparable key = null;
      Writable value = null;
      try {
        // create empty key/value instances for reader.next() to fill in
        key = (WritableComparable) keyC.newInstance();
        value = (Writable) valueC.newInstance();
      } catch (Exception ex) {
        ex.printStackTrace();
        System.exit(-1);
      }
      try {
        if (!reader.next(key, value)) {
          break; // end of the map file
        }
        out.println(key);   // the url (a Text object)
        out.println(value); // the CrawlDatum
      } catch (Exception e) {
        e.printStackTrace();
        out.println("Exception occurred. " + e);
        break;
      }
    }
    reader.close();
  }
}
This code is just for demonstration; of course you can customize it for
your needs, for example to print in XML format. You can check the
javadocs of the CrawlDatum, CrawlDb, Text, MapFile and SequenceFile
classes for further insight.
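As a rough sketch of the XML idea (the <entry>, <url> and <datum> element
names here are made up just for illustration, and a real version should
also escape characters such as & and < in the url), you could replace the
two println calls in the loop with:

out.println("<entry>");
out.println("  <url>" + key + "</url>");
out.println("  <datum>" + value + "</datum>");
out.println("</entry>");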
cha wrote:
Hi Enis,
I still can't figure out how it can be done. Can you explain it in more
detail, please?
Regards,
Chandresh
Enis Soztutar wrote:
cha wrote:
Hi Sagar,
Thanks for the reply.
Actually I am trying to dig out the code in the same class, but I am not
able to figure out where the URLs are read from.
When you dump the database, the file contains:
http://blog.cha.com/ Version: 4
Status: 2 (DB_fetched)
Fetch time: Fri Apr 13 15:58:28 IST 2007
Modified time: Thu Jan 01 05:30:00 IST 1970
Retries since fetch: 0
Retry interval: 30.0 days
Score: 0.062367838
Signature: 2b4e94ff83b8a4aa6ed061f607683d2e
Metadata: null
I figured out the rest of it, but I am not sure how the URL name is
read.
I just want the plain URLs in the text file. Is it possible to write the
URLs in some XML format? If so, how?
Awaiting,
Chandresh
Hi, the crawldb is actually a map file, which has URLs as keys (Text
class) and CrawlDatum objects as values. You can write a generic map file
reader which extracts the keys and dumps them to a file.
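For example, since the crawldb keys are Text and the values are
CrawlDatum, the core of such a reader could be as simple as this sketch
(reusing the fs/conf/out setup from the code above; dirName is a
hypothetical variable pointing to one map file of the crawldb, and the
imports needed are org.apache.hadoop.io.Text and
org.apache.nutch.crawl.CrawlDatum):

MapFile.Reader reader = new MapFile.Reader(fs, dirName, conf);
Text key = new Text();
CrawlDatum value = new CrawlDatum();
// next() refills the same key/value objects on each call
while (reader.next(key, value)) {
  out.println(key); // just the plain url, one per line
}
reader.close();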