Re: Fetch iframe from HTML (if exists)

Amit Sela Wed, 26 Jun 2013 07:12:02 -0700

How will that affect it? I crawl with no depth (depth 1), so outlinks
don't matter, and it seems that the fetched URLs don't get parsed. Or am I
misunderstanding something?



On Wed, Jun 26, 2013 at 5:06 PM, Markus Jelsma
<markus.jel...@openindex.io> wrote:

> No, order does not matter. Try adding iframe to the ignore_tags
> configuration directive (parser.html.outlinks.ignore_tags) in your
> nutch-site.xml.
>
>
>
> -----Original message-----
> > From:Amit Sela <am...@infolinks.com>
> > Sent: Wednesday 26th June 2013 16:03
> > To: user@nutch.apache.org
> > Subject: Re: Fetch iframe from HTML (if exists)
> >
> > In nutch-site.xml plugin.includes, my custom filter is last and I have
> > no htmlparsefilter.order, so my filter should be applied last, right?
> >
> >
> >
> > On Wed, Jun 26, 2013 at 5:00 PM, Amit Sela <am...@infolinks.com> wrote:
> >
> > > So I managed to create and deploy my plugin, which initially used
> > > content.getContent(), and it worked.
> > > Then I wanted to parse the fetched content as a DocumentFragment (by
> > > iterating over the child nodes).
> > > This doesn't work. I logged DocumentFragment.toString() in the filter
> > > method of MyCustomHtmlParseFilter, and in the Parse MapReduce logs I
> > > see: [#document-fragment: null] for all URLs.
> > >
> > > How do I get Nutch to pass the parsed HTML as a DocumentFragment?
> > > Should I state htmlparsefilter.order in nutch-site.xml? If so, in
> > > what order?
> > >
> > > Thanks.
> > >
> > >
> > >
> > >
> > > On Tue, Jun 25, 2013 at 5:38 PM, Amit Sela <am...@infolinks.com> wrote:
> > >
> > >> Thanks for the prompt answer!
> > >>
> > >>
> > >> On Tue, Jun 25, 2013 at 5:35 PM, Markus Jelsma <
> > >> markus.jel...@openindex.io> wrote:
> > >>
> > >>> Hi,
> > >>>
> > > >>> Do I understand correctly that you want all iframe src attributes
> > > >>> on a given page stored in the iframe field?
> > >>>
> > > >>> The src attributes are not extracted, and there is no facility to do
> > > >>> so right now. You should create your own HTMLParseFilter, loop through
> > > >>> the document looking for iframe tags, and collect the src attributes.
> > > >>> Then add those as parse metadata. You can then index them with the
> > > >>> index-metadata plugin. I'm not sure it supports multi-valued metadata
> > > >>> fields in Nutch 1.6, but it certainly will in 1.7.
> > >>>
> > > >>> Use the bin/nutch parsechecker and indexchecker tools to check if
> > > >>> your plugin works.
> > >>>
> > >>> Cheers
> > >>>
> > >>>
> > >>>
> > >>> -----Original message-----
> > >>> > From:Amit Sela <am...@infolinks.com>
> > >>> > Sent: Tuesday 25th June 2013 16:26
> > >>> > To: user@nutch.apache.org
> > >>> > Subject: Fetch iframe from HTML (if exists)
> > >>> >
> > >>> > Hi all,
> > >>> >
> > > >>> > I'm using Nutch 1.6 with Solr 3.6.2 and I would like to index the
> > > >>> > iframe src field into Solr,
> > > >>> > i.e.,
> > > >>> > <iframe src="something" scrolling="" frameborder="".......>
> > > >>> > So I want to fetch the iframe and index it as iframe so that I
> > > >>> > could find URLs by iframe src.
> > > >>> >
> > > >>> > I'm crawling with no depth over a seed list, and I don't want to
> > > >>> > crawl to the iframe src, just to index and store it.
> > >>> >
> > >>> > I tried adding
> > >>> > <name>urlmeta.tags</name> <value>iframe</value> to nutch-site.xml
> > >>> >
> > >>> > and
> > > >>> > <field name="iframe" type="text_general" stored="true"
> > > >>> > indexed="true" multiValued="true"/> to schema.xml
> > >>> >
> > >>> > and
> > >>> > <field dest="iframe" source="iframe"/> to solrindex-mapping.xml.
> > >>> >
> > >>> > What am I missing ?
> > >>> >
> > >>> > Thanks,
> > >>> >
> > >>> > Amit.
> > >>> >
> > >>>
> > >>
> > >>
> > >
> >
>
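
For reference, the approach suggested earlier in the thread (loop through the
document looking for iframe tags and collect the src attribute) can be
sketched with nothing but the JDK's org.w3c.dom classes. This is only a
minimal sketch under assumptions: IframeSrcCollector and collect() are
hypothetical names, not part of Nutch, and the sample markup is well-formed
XHTML so that the JDK's XML parser can stand in for Nutch's HTML parser. In a
real plugin the DocumentFragment would presumably be the one Nutch hands to
the filter method, and the collected values would be added as parse metadata.
As far as I know, a log line like [#document-fragment: null] does not by
itself mean the fragment is empty: the common DOM implementations print node
name and node value in toString(), and the node value of a fragment is always
null, so checking the child nodes is more telling.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

/** Hypothetical helper (not a Nutch class): collects iframe src values from a DOM tree. */
public class IframeSrcCollector {

    /** Walks the subtree under root depth-first and returns every non-empty iframe src. */
    public static List<String> collect(Node root) {
        List<String> found = new ArrayList<>();
        walk(root, found);
        return found;
    }

    private static void walk(Node node, List<String> found) {
        // HTML parsers differ in how they case tag names, so compare case-insensitively.
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "iframe".equalsIgnoreCase(node.getNodeName())) {
            // getAttribute returns an empty string when the attribute is absent.
            String src = ((Element) node).getAttribute("src");
            if (!src.isEmpty()) {
                found.add(src);
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, found);
        }
    }

    public static void main(String[] args) throws Exception {
        // Assumption: well-formed sample markup, parsed here with the JDK XML parser.
        String html = "<html><body><p>text</p>"
                + "<iframe src=\"http://example.com/a\"></iframe>"
                + "<div><IFRAME src=\"http://example.com/b\"></IFRAME><iframe></iframe></div>"
                + "</body></html>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8)));

        // Move the root element into a fragment to mimic what a parse filter receives.
        DocumentFragment fragment = doc.createDocumentFragment();
        fragment.appendChild(doc.getDocumentElement());

        // The fragment has children even though its node value is null.
        System.out.println("has children: " + fragment.hasChildNodes());
        System.out.println(collect(fragment));
    }
}
```

Running main should list the two src values in document order and skip the
iframe that has no src. Whether several values can be indexed into one field
depends on the index-metadata plugin version, as noted above.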
