I need to create a new file in a segment during parsing. Could anyone help me with how to do it?
I see two issues:
1. How do I get the location of the segment currently being processed?
2. I suppose I need to use Hadoop to write the file, but I don't know how to use it.

Some background on the problem, in case there is a better solution: I need to filter pages based on their content, so I cannot use a URLFilter. Furthermore, I need to follow the links from the pages that are to be filtered out. It seems that a solution might be to write a file with the URLs of the pages to drop (those that didn't match my criteria). Then I would apply a URL filter when merging segments (mergesegs). The URL filter would read the file created during parsing and drop all URLs listed in it.

Thanks for your help.
Marcin
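To make the second half of the idea concrete, here is a rough sketch of the drop-list filter in plain Java, with no Nutch or Hadoop dependencies. The class name DropListFilter and its methods are made up for illustration. The only thing borrowed from Nutch is the URLFilter convention that filter() returns the URL to keep it and null to drop it. The sketch reads the list from the local file system; in a real plugin the file would probably have to be read through Hadoop's FileSystem API instead, which is the part I am unsure about.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Nutch: rejects every URL found in a drop list.
class DropListFilter {
    // URLs collected during parsing that should be removed at merge time.
    private final Set<String> dropped = new HashSet<>();

    // Builds the filter from lines of text, one URL per line; blank lines are ignored.
    DropListFilter(List<String> lines) {
        for (String line : lines) {
            String url = line.trim();
            if (!url.isEmpty()) {
                dropped.add(url);
            }
        }
    }

    // Assumption: the drop list is a local UTF-8 text file written during parsing.
    static DropListFilter fromFile(Path file) throws IOException {
        return new DropListFilter(Files.readAllLines(file));
    }

    // Same contract as a Nutch URL filter: null means "drop", otherwise the URL is kept.
    String filter(String url) {
        return dropped.contains(url) ? null : url;
    }
}
```

With a drop list containing http://example.com/bad, filter("http://example.com/bad") returns null, while any URL not in the list is returned unchanged.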
