Steve Severance wrote:
> So now that I have spent a few hours looking into how this works a lot more
> deeply I am in even more of a conundrum. The fetcher passes the contents of the
> page to the parsers. It assumes that text will be output from the parsers.
> For instance even the SWF parser returns text. For all binary data, images,
> videos, music, etc... this is problematic. Potentially confounding the
> problem even further in the case of music is that text and binary data can
> come from the same file. Even if that is a problem I am not going to tackle
> it. 


Well, Nutch was originally intended as a text search engine. Lucene is a 
text search library, too - so all it knows is plain text. If you 
want to use Nutch/Lucene for searching you will need to bring your data 
to a plain text format - at least the parts that you want to search against.

Now, when it comes to metadata or other associated binary data, I'm 
sure we can figure out a way to store it outside the Lucene index, much 
as the original content and parseData are already stored outside 
Lucene indexes.

-------

I've been thinking about an extension to the current "segment" format, 
which would allow arbitrary parts to be created (and retrieved) - this 
is actually needed to support a real-life application. It's a simple 
extension of the current model. Currently segments consist of a fixed 
number of pre-defined parts (content, crawl_generate, crawl_fetch, 
parse_data, parse_text). But it shouldn't be too difficult to extend 
segment tools and NutchBean to handle segments consisting of these basic 
parts plus other arbitrary parts.

In your case: you could have an additional segment part that stores 
post-processed images in binary format (you already have the original 
ones in content/). Another example: we could convert PDF/DOC/PPT files 
to HTML, and store this output in the "HTML preview" part.
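
A rough sketch of the idea in plain Java, for illustration only. The class 
SegmentLayout and the part names "image_processed" and "html_preview" are 
hypothetical and not existing Nutch code; only the five core part names come 
from the description above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical model of a segment: the fixed core parts plus arbitrary extra parts. */
class SegmentLayout {
    /** The pre-defined parts that every segment has today. */
    static final List<String> CORE_PARTS = Collections.unmodifiableList(Arrays.asList(
        "content", "crawl_generate", "crawl_fetch", "parse_data", "parse_text"));

    /** Extra parts, kept in registration order. */
    private final Set<String> extraParts = new LinkedHashSet<>();

    /** Registers an additional part; rejects empty names, paths, and clashes with core parts. */
    SegmentLayout addPart(String name) {
        if (name == null || name.isEmpty() || name.contains("/")) {
            throw new IllegalArgumentException("invalid part name: " + name);
        }
        if (CORE_PARTS.contains(name)) {
            throw new IllegalArgumentException("clashes with core part: " + name);
        }
        extraParts.add(name);
        return this;
    }

    /** Returns true for core parts and for registered extra parts. */
    boolean hasPart(String name) {
        return CORE_PARTS.contains(name) || extraParts.contains(name);
    }

    /** Core parts first, then the extra parts in registration order. */
    List<String> allParts() {
        List<String> all = new ArrayList<>(CORE_PARTS);
        all.addAll(extraParts);
        return all;
    }

    public static void main(String[] args) {
        SegmentLayout layout = new SegmentLayout()
            .addPart("image_processed")   // hypothetical: post-processed images
            .addPart("html_preview");     // hypothetical: PDF/DOC/PPT converted to HTML
        System.out.println(layout.allParts());
    }
}
```

Tools that only know the core parts would keep working unchanged, while 
tools that understand an extra part could ask for it by name.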


> 
> So there are 3 choices for moving forward with an image search,
> 
> 1. All image data can be encoded as strings. I really don't like that choice
> since the indexer will index huge amounts of junk.
> 2. The fetcher can be modified to allow another output for binary data. This
> I think is the better choice although it will be a lot more work. I am not
> sure that this is possible with MapReduce since MapRunnable has only 1
> output.

No, not really - the number of output files is defined by the 
OutputFormat implementation. It's true that you can only set a single 
output location (and then you have to decide how to lay out the various 
outputs relative to that location), but there are existing OutputFormat 
implementations that create more than one file at the same time - see 
ParseOutputFormat.
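
To illustrate the "one output location, several files" point, here is a 
minimal plain-Java analogue. It deliberately does not use the Hadoop 
OutputFormat API and is not how ParseOutputFormat is written; the class 
MultiPartWriter, the tab-separated record format, and the file name 
"part-00000" are assumptions made for the sketch.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical writer: a single output directory with one file per named part beneath it. */
class MultiPartWriter implements AutoCloseable {
    private final Path outputDir;
    private final Map<String, BufferedWriter> writers = new LinkedHashMap<>();

    MultiPartWriter(Path outputDir) {
        this.outputDir = outputDir;
    }

    /** Appends one key/value record to the file belonging to the given part. */
    void write(String part, String key, String value) {
        try {
            BufferedWriter w = writers.get(part);
            if (w == null) {
                // Lazily create <outputDir>/<part>/part-00000 on first use.
                Path dir = outputDir.resolve(part);
                Files.createDirectories(dir);
                w = Files.newBufferedWriter(dir.resolve("part-00000"), StandardCharsets.UTF_8);
                writers.put(part, w);
            }
            w.write(key + "\t" + value);
            w.newLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Closes every file that was opened. */
    @Override
    public void close() {
        try {
            for (BufferedWriter w : writers.values()) {
                w.close();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempDirectory("segment");
        try (MultiPartWriter w = new MultiPartWriter(out)) {
            w.write("parse_text", "http://example.com/", "plain text of the page");
            w.write("parse_data", "http://example.com/", "title=Example");
        }
        System.out.println("wrote parts under " + out);
    }
}
```

The caller configures only the one directory; which sub-directories and 
files appear under it is entirely the writer's decision.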


> 3. Images can be written into another directory for processing. This would
> need more work to automate but is probably non-issue.
> 
> I want to do the right thing so that the image search can eventually be in
> the trunk. I don't want to have to change the way a lot of things work in
> the process. Let me know what you all think.

I think we should work together on proposed API changes for this 
"extensible part" interface, plus probably some changes to the Parse 
API. I can create a JIRA issue and provide some initial patches.


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
