Hello nutch-users,

I have some well-formed HTML/XML content files. Each file has a corresponding URL associated with it.
These files were crawled from a BBS that requires logging in with a cookie, which is why I did not use Nutch's built-in crawler to fetch them.

My problems are:

1) How do I tell Nutch to import these files correctly? For the XML files, it has to decide which parts are the real content.

2) How do I tell Nutch to treat the corresponding URL as a property associated with each file?

For example, I have two files on local disk:

con01.html => http://www.somewhere.com/someurl.html
con02.xml => http://www.somewhere.com/url02.xml

I want to add these two files to Nutch and have it remember their URLs as well, so they are available in future searches.

Thank you in advance for your help. I have read the help documentation, but it does not explain this well.

Regards,
David

_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
