Hi Mathijs,

On 5-jan-2007, at 11:42, Mathijs Homminga wrote:
I have written a parse-jpg plugin which rescales JPEG images before storing them:


public class JPEGParseFilter implements Parser {
...
 public Parse getParse(Content content) {
   ...
   content.setContent(scaledImage);
 }
}

This works fine when parsing is done at fetch time. So I assume that the Fetcher stores the content after it has been parsed (if parsing is not disabled). However, when I perform a reparse (to scale down the images even further), the content does not seem to be modified.

The parsed content will be saved in the directories parse_data and parse_text in the segment dir. The input directory used is the content directory, which contains the fetched data.

Question 1: Is it true that the code above changes the fetched content before storing it (throwing away the original content)?

No. The original content is never stored. setContent() just modifies the loaded object. All the parse job does is: dir:content -> job:parse -> dir:parse_data and dir:parse_text.
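
To illustrate the "modifies the loaded object" part, here is a toy example in plain Java. It uses no Nutch classes: InMemoryEditDemo, LoadedContent and the map standing in for the content directory are all made up for this sketch. It only shows that changing an object loaded from storage leaves the stored bytes alone unless something writes them back.

```java
import java.util.HashMap;
import java.util.Map;

public class InMemoryEditDemo {

    /** Hypothetical stand-in for a record loaded from the content directory. */
    public static class LoadedContent {
        private byte[] bytes;

        public LoadedContent(byte[] bytes) { this.bytes = bytes; }

        public void setContent(byte[] bytes) { this.bytes = bytes; }

        public byte[] getContent() { return bytes; }
    }

    /** Loads a copy of the stored bytes, much like a job deserializing its input. */
    public static LoadedContent load(Map<String, byte[]> store, String url) {
        return new LoadedContent(store.get(url).clone());
    }

    public static void main(String[] args) {
        // The map plays the role of the stored, already fetched data (an assumption).
        Map<String, byte[]> contentDir = new HashMap<>();
        contentDir.put("http://example.com/a.jpg", new byte[] {1, 2, 3, 4});

        LoadedContent content = load(contentDir, "http://example.com/a.jpg");
        content.setContent(new byte[] {1, 2}); // "rescale" the loaded object only

        System.out.println(content.getContent().length);                       // prints 2
        System.out.println(contentDir.get("http://example.com/a.jpg").length); // prints 4
    }
}
```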

Question 2: Can I run this parse plugin to reparse the images and change the content again (e.g. to make the images smaller, without the need to refetch all content)? Or is the content write-once during fetch-parse time?

Not directly. The parse job is a simple, straightforward write-once operation. To reparse already parsed data you would have to implement your own job, which, for example, takes the parse_data as input directory, reparses the data to a temporary directory and then replaces the original parse_data with the new one.

Take a look at the org.apache.nutch.parse.ParseSegment class to see how the parse job works. Also, take a look at the org.apache.nutch.crawl.CrawlDb and org.apache.nutch.crawl.CrawlDbMerger classes for ways to implement the replacing of an existing directory.
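
To give an idea of the replace step, here is a minimal sketch in plain Java on the local filesystem. It is not the Nutch code: SegmentDirSwap, install() and the ".old" and ".tmp" directory names are made up for illustration, and a real job would go through Hadoop's FileSystem API rather than java.nio.file. The idea is simply to move the existing directory aside, move the new one into place, and then delete the old one.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper: swaps a freshly written directory into place. */
public class SegmentDirSwap {

    /**
     * Replaces 'current' with 'fresh'. The existing data is first moved aside
     * to a sibling ".old" directory, so a failure halfway through still leaves
     * a copy that can be restored by hand.
     */
    public static void install(Path fresh, Path current) throws IOException {
        Path old = current.resolveSibling(current.getFileName() + ".old");
        deleteRecursively(old);           // clear leftovers from an earlier run
        if (Files.exists(current)) {
            Files.move(current, old);     // step 1: move the existing data aside
        }
        Files.move(fresh, current);       // step 2: put the new data in place
        deleteRecursively(old);           // step 3: drop the old data
    }

    /** Deletes a directory tree; does nothing if the path does not exist. */
    static void deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(dir)) {
            // Reverse order puts children before their parent directories.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }

    public static void main(String[] args) throws IOException {
        Path segment = Files.createTempDirectory("segment");

        Path current = Files.createDirectory(segment.resolve("parse_data"));
        Files.write(current.resolve("part-00000"),
                "old parse output".getBytes(StandardCharsets.UTF_8));

        // Pretend the reparse job wrote its output here.
        Path fresh = Files.createDirectory(segment.resolve("parse_data.tmp"));
        Files.write(fresh.resolve("part-00000"),
                "new parse output".getBytes(StandardCharsets.UTF_8));

        install(fresh, current);

        byte[] data = Files.readAllBytes(current.resolve("part-00000"));
        System.out.println(new String(data, StandardCharsets.UTF_8)); // prints "new parse output"
    }
}
```

Both moves stay inside the same parent directory, so each one is a plain rename and does not copy the data.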

Good luck!

--
Regards,

Eelco Lempsink


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
