Hello,
I am hoping to crawl about 3000 domains using the Nutch crawler plus the
PrefixURLFilter; however, I have no need to actually index the HTML.
Ideally, I would just like each domain's raw HTML pages saved into separate
directories. We already have a parser that converts the HTML into indexes
for our particular application.
Is there a clean way to accomplish this?
My current idea is to create a Python script (similar to the one already on
the wiki) that essentially loops through the fetch/update cycles until the
desired depth is reached, and then simply never does the real Lucene
indexing and merging -- roughly the loop sketched below.
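Something like this (just a sketch -- it assumes the stock
inject/generate/fetch/updatedb command-line tools and a local crawl/
directory layout, and the paths and depth are placeholders):

#!/usr/bin/env python
# Sketch only: drive the crawl cycles without ever indexing.
# Assumes the usual bin/nutch inject/generate/fetch/updatedb tools and a
# local crawl/ layout -- adjust paths and depth for your own setup.
import os
import subprocess

NUTCH = "bin/nutch"
CRAWLDB = "crawl/crawldb"
SEGMENTS = "crawl/segments"
DEPTH = 3          # placeholder depth

def nutch(*args):
    subprocess.check_call([NUTCH] + list(args))

nutch("inject", CRAWLDB, "urls")          # seed the crawldb from a urls/ dir

for i in range(DEPTH):
    nutch("generate", CRAWLDB, SEGMENTS)
    # the newest directory under crawl/segments is the one just generated
    segment = os.path.join(SEGMENTS, sorted(os.listdir(SEGMENTS))[-1])
    nutch("fetch", segment)
    nutch("updatedb", CRAWLDB, segment)
# ...and that's it -- no invertlinks/index/dedup/merge steps.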
Now, here's the "there must be a better way" part ... I would then execute
the "bin/nutch readseg -dump" tool from Python to extract all the HTML and
headers for each segment, and then, via a regex, write each page's HTML back
out to an .html file stored in a directory named for the domain it came from.
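Roughly like this -- the record markers (Recno::, URL::, Content::) are just
my assumption about what the readseg dump looks like, so they would need
checking against a real dump, and the Content:: section presumably carries
some header/metadata lines before the raw bytes that ought to be stripped:

#!/usr/bin/env python
# Sketch only: split a "bin/nutch readseg -dump" text dump into per-domain
# HTML files.  The Recno::/URL::/Content:: markers are assumptions about the
# dump format -- inspect a real dump and adjust before trusting this.
import os
import re
from urlparse import urlparse   # urllib.parse on Python 3

def write_page(url, html, out_root):
    domain = urlparse(url)[1] or "unknown-domain"        # netloc
    # crude but filesystem-safe file name derived from the URL
    name = re.sub(r"[^A-Za-z0-9._-]", "_", url)[:200] + ".html"
    dirpath = os.path.join(out_root, domain)
    if not os.path.isdir(dirpath):
        os.makedirs(dirpath)
    open(os.path.join(dirpath, name), "w").write(html)

def split_dump(dump_path, out_root):
    url, body, in_content = None, [], False
    for line in open(dump_path):
        if line.startswith("Recno::"):       # start of a new record
            if url and body:
                write_page(url, "".join(body), out_root)
            url, body, in_content = None, [], False
        elif line.startswith("URL::"):
            url = line.split("::", 1)[1].strip()
        elif line.startswith("Content::"):
            in_content = True                # raw content (plus metadata) follows
        elif in_content:
            body.append(line)
    if url and body:
        write_page(url, "".join(body), out_root)

split_dump("dumpdir/dump", "html_by_domain")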
How stupid/slow is this? Any better ideas? I saw that someone previously
mentioned wanting to do something like this, and someone responded that it
was better to just roll your own crawler, or something along those lines. I
doubt that, for some reason. Also, in the future we'd like to take advantage
of the Word/PDF downloading and parsing as well.
Thanks for what appears to be a great crawler!
Sincerely,
John