Well, just for sport, I tried removing parse-tika, but still nothing...

On Wed, Jun 26, 2013 at 11:25 PM, Julien Nioche <lists.digitalpeb...@gmail.com> wrote:

> I noticed recently that my XPath extraction rules did not work on HTML
> documents with parse-tika but worked a treat with parse-html. Forgot to
> open an issue, my bad. Could be the same problem here.
>
>
> On 26 June 2013 15:26, Amit Sela <am...@infolinks.com> wrote:
>
> > I did succeed in parsing using content and iterating over every line but
> > I'd prefer to do it with DocumentFragment.
> > My plugin.includes has:
> >
> > protocol-httpclient|urlfilter-regex|parse-(text|html|tika|metatags|js)|index-(basic|anchor|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass)|iframemeta
> >
> > So I use parse-html but also tika, text, metatags and js. Maybe it's too
> > much? I copied this configuration from an example I saw. I do know that I
> > use metatags (I index keywords and description) but I'm not sure about the
> > rest...
> >
> >
> > On Wed, Jun 26, 2013 at 5:21 PM, Markus Jelsma
> > <markus.jel...@openindex.io> wrote:
> >
> > > Of course, forget it. What parser do you use? Maybe the old parse-html
> > > doesn't report it back. You can also try to print every element you loop
> > > over and check if it's there or not.
> > >
> > >
> > > -----Original message-----
> > > > From: Amit Sela <am...@infolinks.com>
> > > > Sent: Wednesday 26th June 2013 16:11
> > > > To: user@nutch.apache.org
> > > > Subject: Re: Fetch iframe from HTML (if exists)
> > > >
> > > > How will it affect this? I crawl with no depth (depth 1) so outlinks
> > > > don't matter and it seems that the URLs fetched don't get parsed, or
> > > > am I misunderstanding something?
> > > >
> > > >
> > > > On Wed, Jun 26, 2013 at 5:06 PM, Markus Jelsma
> > > > <markus.jel...@openindex.io> wrote:
> > > >
> > > > > No, order does not matter. Try adding iframe to the ignore_tags
> > > > > configuration directive in your nutch-site.
> > > > > parser.html.outlinks.ignore_tags
> > > > >
> > > > >
> > > > > -----Original message-----
> > > > > > From: Amit Sela <am...@infolinks.com>
> > > > > > Sent: Wednesday 26th June 2013 16:03
> > > > > > To: user@nutch.apache.org
> > > > > > Subject: Re: Fetch iframe from HTML (if exists)
> > > > > >
> > > > > > In nutch-site.xml plugin.includes my custom filter is last and I
> > > > > > have no htmlparsefilter.order so my filter should be applied last,
> > > > > > right?
> > > > > >
> > > > > >
> > > > > > On Wed, Jun 26, 2013 at 5:00 PM, Amit Sela <am...@infolinks.com>
> > > > > > wrote:
> > > > > >
> > > > > > > So I managed to create and deploy my plugin, which initially used
> > > > > > > content.getContent() and it worked.
> > > > > > > Then, I wanted to parse the fetched content as DocumentFragment
> > > > > > > (by iterating over the child nodes).
> > > > > > > This doesn't work. I logged DocumentFragment.toString() in my
> > > > > > > MyCustomHtmlParseFilter in the filter method, and in the Parse
> > > > > > > MapReduce logs I see: [#document-fragment: null] for all URLs.
> > > > > > >
> > > > > > > How do I get Nutch to pass the parsed HTML as DocumentFragment?
> > > > > > > Should I state htmlparsefilter.order in nutch-site.xml? If so, in
> > > > > > > what order?
> > > > > > >
> > > > > > > Thanks.
> > > > > > >
> > > > > > >
> > > > > > > On Tue, Jun 25, 2013 at 5:38 PM, Amit Sela <am...@infolinks.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > >> Thanks for the prompt answer!
> > > > > > >>
> > > > > > >>
> > > > > > >> On Tue, Jun 25, 2013 at 5:35 PM, Markus Jelsma <
> > > > > > >> markus.jel...@openindex.io> wrote:
> > > > > > >>
> > > > > > >>> Hi,
> > > > > > >>>
> > > > > > >>> Do I understand you correctly that you want all iframe src
> > > > > > >>> attributes on a given page stored in the iframe field?
> > > > > > >>>
> > > > > > >>> The src attributes are not extracted and there is no facility
> > > > > > >>> to do so right now. You should create your own HTMLParseFilter,
> > > > > > >>> loop through the document looking for iframe tags and collect
> > > > > > >>> the src attribute. Then add those as parse metadata. You can
> > > > > > >>> then index them with the index-metadata plugin. I'm not sure it
> > > > > > >>> supports multi valued metafields in Nutch 1.6, it sure will in
> > > > > > >>> 1.7.
> > > > > > >>>
> > > > > > >>> Use the bin/nutch parsechecker and indexchecker tools to check
> > > > > > >>> if your plugin works.
> > > > > > >>>
> > > > > > >>> Cheers
> > > > > > >>>
> > > > > > >>>
> > > > > > >>> -----Original message-----
> > > > > > >>> > From: Amit Sela <am...@infolinks.com>
> > > > > > >>> > Sent: Tuesday 25th June 2013 16:26
> > > > > > >>> > To: user@nutch.apache.org
> > > > > > >>> > Subject: Fetch iframe from HTML (if exists)
> > > > > > >>> >
> > > > > > >>> > Hi all,
> > > > > > >>> >
> > > > > > >>> > I'm using Nutch 1.6 with Solr 3.6.2 and I would like to index
> > > > > > >>> > the iframe src field into Solr.
> > > > > > >>> > i.e.,
> > > > > > >>> > <iframe src="something" scrolling="" frameborder="".......>
> > > > > > >>> > So I want to fetch the iframe and index it as iframe so that
> > > > > > >>> > I could find URLs by iframe src.
> > > > > > >>> >
> > > > > > >>> > I'm crawling with no depth over a seed list, and I don't want
> > > > > > >>> > to crawl to the iframe src, just to index and store it.
> > > > > > >>> >
> > > > > > >>> > I tried adding
> > > > > > >>> > <name>urlmeta.tags</name> <value>iframe</value> to
> > > > > > >>> > nutch-site.xml
> > > > > > >>> >
> > > > > > >>> > and
> > > > > > >>> > <field name="iframe" type="text_general" stored="true"
> > > > > > >>> > indexed="true" multiValued="true"/> to schema.xml
> > > > > > >>> >
> > > > > > >>> > and
> > > > > > >>> > <field dest="iframe" source="iframe"/> to
> > > > > > >>> > solrindex-mapping.xml.
> > > > > > >>> >
> > > > > > >>> > What am I missing?
> > > > > > >>> >
> > > > > > >>> > Thanks,
> > > > > > >>> >
> > > > > > >>> > Amit.
> > > > > > >>> >
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
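
For reference, the approach suggested earlier in the thread (loop through the document looking for iframe tags and collect the src attribute) might look roughly like the sketch below. It deliberately uses only the JDK's org.w3c.dom classes so it runs on its own; the class name IframeSrcCollector and the method collect are hypothetical, and in a real plugin the walk would presumably be started from the filter method of the custom parse filter, on the DocumentFragment that Nutch passes in. Two assumptions worth stating: tag names are compared case-insensitively because different HTML parsers may report IFRAME or iframe, and a logged value such as [#document-fragment: null] may not by itself prove the fragment is empty, since a document fragment's node value is always null and some DOM implementations print just the node name and node value in toString(). Walking the children is a more reliable check.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper, not part of Nutch: collects iframe src values from a DOM tree.
public class IframeSrcCollector {

    /** Recursively visits node and its descendants, appending each non-empty iframe src to out. */
    public static void collect(Node node, List<String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "iframe".equalsIgnoreCase(node.getNodeName())) {
            // getAttribute returns an empty string when the attribute is absent.
            String src = ((Element) node).getAttribute("src");
            if (!src.isEmpty()) {
                out.add(src);
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, out);
        }
    }

    public static void main(String[] args) throws Exception {
        // Build a tiny fragment by hand to stand in for the parsed page.
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        DocumentFragment fragment = doc.createDocumentFragment();
        Element body = doc.createElement("BODY");
        Element iframe = doc.createElement("IFRAME");
        iframe.setAttribute("src", "http://example.com/frame");
        body.appendChild(iframe);
        fragment.appendChild(body);

        List<String> sources = new ArrayList<>();
        collect(fragment, sources);
        System.out.println(sources); // prints [http://example.com/frame]
    }
}
```

If the list comes back empty for a page that does contain an iframe, printing every node name visited during the walk (as suggested above) should show whether the parser delivered the element at all. The collected values would then still need to be added to the parse metadata so that index-metadata can pick them up.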