Hi Harald, have a look at NUTCH-1785 <https://issues.apache.org/jira/browse/NUTCH-1785>: it's about the same problem.

> a) where does the binary blob appear in NutchDocument and

Just add a NutchField. The value can be of any type, but the indexer must be
able to handle it.

> b) how does it get there?

In Nutch 1.x, adding raw/binary content can only be done within
IndexerMapReduce. Indexing filters do not have the binary content at hand.
In 2.x this is different: an indexing filter can request any field/column
to be added. I didn't try it, but it should be possible to request the raw
content (column has the same name).

Sebastian

2014-07-23 16:29 GMT+02:00 Harald Kirsch <[email protected]>:

> Hi,
>
> coming back to this question. Now I have basically the following
> parse-plugins.xml:
>
> <mimeType name="text/html">
>   <plugin id="parse-html" />
> </mimeType>
>
> All other mime-types shall not be parsed for links. The documents shall be
> sent as-is, i.e. as binary blobs, to the index stage. (To preempt outcries:
> this is a custom index stage that knows how to deal with binary blobs.)
>
> Now where and how will the binary blob be made available within the
> NutchDocument sent to my indexer?
>
> For parsed content I see text coming along in the content field, but
>
> a) where does the binary blob appear in NutchDocument and
> b) how does it get there?
>
> Regards,
> Harald.
>
>
> On 03.07.2014 22:30, Sebastian Nagel wrote:
>
>> Hi Harald,
>>
>>> it is sufficient to only activate the parse-html plugin
>>
>> Yes. If parse-tika is active, other document types
>> (PDFs, etc.) are also searched for links.
>>
>>> or is even this not necessary
>>
>> You need to parse HTML. It's impossible to extract links without
>> parsing HTML. Think of relative links (base URL), <!-- comments -->,
>> <![CDATA[...]]>, and other subtleties which will harm other
>> approaches for link extraction (e.g., regular expressions).
>>
>>> b) provide HTML and all other documents found as such to some external
>>> tool as is, i.e. unparsed.
>>
>> Make sure that the raw content is stored (in segments or WebTable), cf.
>> property fetcher.store.content.
>>
>>> (Is there a more detailed description of what the individual stages of
>>> nutch do beyond the tutorial?)
>>
>> Still a good introduction: Andrzej Białecki's chapter in "Hadoop: The
>> Definitive Guide" by Tom White.
>>
>> Sebastian
>>
>> On 07/01/2014 03:12 PM, Harald Kirsch wrote:
>>
>>> Suppose I want nutch to fetch URLs and
>>>
>>> a) follow links in HTML documents *only*
>>> b) provide HTML and all other documents found as such to some external
>>> tool as is, i.e. unparsed.
>>>
>>> Is it correct that it is sufficient to only activate the parse-html
>>> plugin from all the parse-* plugins or is even this not necessary?
>>>
>>> (Is there a more detailed description of what the individual stages of
>>> nutch do beyond the tutorial?)
>>>
>>> Thanks,
>>> Harald.

