Steve Severance wrote:
> So now that I have spent a few hours looking into how this works a lot more
> deeply I am even more of a conundrum. The fetcher passes the contents of the
> page to the parsers. It assumes that text will be output from the parsers.
> For instance even the SWF parser returns text. For all binary data, images,
> videos, music, etc... this is problematic. Potentially confounding the
> problem even further in the case of music is that text and binary data can
> come from the same file. Even if that is a problem I am not going to tackle
> it.

Well, Nutch was originally intended as a text search engine. Lucene is a text search library, too - so all it knows is plain text. If you want to use Nutch/Lucene for searching, you will need to bring your data into a plain text format - at least the parts that you want to search against.

Now, when it comes to metadata or other associated binary data, I'm sure we can figure out a way to store it outside the Lucene index, in much the same way that the original content and parseData are already stored outside the Lucene indexes.

-------

I've been thinking about an extension to the current "segment" format which would allow arbitrary parts to be created (and retrieved) - this is actually needed to support a real-life application. It's a simple extension of the current model. Currently segments consist of a fixed number of pre-defined parts (content, crawl_generate, crawl_fetch, parse_data, parse_text). But it shouldn't be too difficult to extend the segment tools and NutchBean to handle segments consisting of these basic parts plus other arbitrary parts.

In your case, you could have an additional segment part that stores post-processed images in binary format (you already have the original ones in content/). Another example: we could convert PDF/DOC/PPT files to HTML, and store this output in an "HTML preview" part.

>
> So there are 3 choices for moving forward with an image search,
>
> 1. All image data can be encoded as strings. I really don't like that choice
> since the indexer will index huge amounts of junk.
> 2. The fetcher can be modified to allow another output for binary data. This
> I think is the better choice although it will be a lot more work. I am not
> sure that this is possible with MapReduce since MapRunnable has only 1
> output.

No, not really - the number of output files is defined by the implementation of OutputFormat. It's true, though, that you can only set a single output location (and then you have to figure out how you want to lay out the various outputs relative to that single location). There are existing OutputFormat implementations that create more than one file at the same time - see ParseOutputFormat.

> 3. Images can be written into another directory for processing. This would
> need more work to automate but is probably non-issue.
>
> I want to do the right thing so that the image search can eventually be in
> the trunk. I don't want to have to change the way a lot of things work in
> the process. Let me know what you all think.

I think we should work together on proposed API changes for this "extensible part" interface, plus probably some changes to the Parse API. I can create a JIRA issue and provide some initial patches.

-- 
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
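
To make the "extensible parts" idea above a bit more concrete, here is a minimal sketch in plain Java. It is not Nutch or Hadoop code: the class SegmentSketch, its methods, and the part names "thumbnails" and "html_preview" are all hypothetical, and records are stored as plain files keyed by a simple name rather than in the real segment file formats. It only illustrates two points from the message: everything is written relative to a single output location, and a segment can hold the five pre-defined parts plus arbitrary additional ones.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Stream;

/** Hypothetical model of a segment directory with arbitrary parts; not part of Nutch. */
final class SegmentSketch {
    // The fixed, pre-defined parts listed in the message.
    static final Set<String> CORE_PARTS = Set.of(
        "content", "crawl_generate", "crawl_fetch", "parse_data", "parse_text");

    // The single output location; every part lives in a subdirectory of it.
    private final Path root;

    SegmentSketch(Path root) {
        this.root = root;
    }

    /** Creates a segment under a fresh temporary directory (for experiments only). */
    static SegmentSketch createTemp() {
        try {
            return new SegmentSketch(Files.createTempDirectory("segment"));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Stores one binary record in the named part, creating the part on demand. */
    void put(String part, String key, byte[] value) {
        try {
            Path dir = Files.createDirectories(root.resolve(part));
            // Assumption: keys are simple file-name-safe strings, not raw URLs.
            Files.write(dir.resolve(key), value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads one record back from the named part. */
    byte[] get(String part, String key) {
        try {
            return Files.readAllBytes(root.resolve(part).resolve(key));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the names of the parts present that are not pre-defined ones, sorted. */
    Set<String> extraParts() {
        try (Stream<Path> entries = Files.list(root)) {
            Set<String> names = new TreeSet<>();
            entries.filter(Files::isDirectory)
                   .map(p -> p.getFileName().toString())
                   .filter(name -> !CORE_PARTS.contains(name))
                   .forEach(names::add);
            return names;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        SegmentSketch segment = SegmentSketch.createTemp();
        segment.put("content", "doc1", new byte[] {1, 2, 3});   // original bytes
        segment.put("thumbnails", "doc1", new byte[] {9});      // post-processed image
        segment.put("html_preview", "doc1", "<p>hi</p>".getBytes());
        System.out.println("extra parts: " + segment.extraParts());
    }
}
```

In a real implementation the per-part writing would presumably happen inside an OutputFormat, the way ParseOutputFormat already writes more than one file, and the segment tools and NutchBean would need to discover the extra parts instead of assuming the fixed list.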