Building such a plugin is not complicated. On our team we built a similar one that reads a list of tags specified in nutch-site.xml. The plugin is not published right now, but it could be. In our case, we store each extracted tag in a Solr field and prefix each field name with "custom-".
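To make the idea concrete, here is a minimal sketch of the extraction step only. It is not our plugin's code: the class name CustomTagExtractor, the extract method and the FIELD_PREFIX constant are made up for illustration, and it uses only the JDK's org.w3c.dom types. It assumes the list of tag names has already been read from the configuration (the property name is not shown here), and it returns the values keyed by the "custom-" prefixed field name. As far as I know, a Nutch HtmlParseFilter receives the parsed page as a DOM DocumentFragment, which is a Node, so a walker like this could be called from there.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

import org.w3c.dom.Node;

/** Hypothetical helper: collects the text of configured tags under "custom-" field names. */
public class CustomTagExtractor {

    /** Prefix added to every field name, as described above. */
    static final String FIELD_PREFIX = "custom-";

    /**
     * Walks the DOM below root and returns, for each configured tag that
     * occurs, the trimmed text of every matching element, in document order.
     * Tag names in the set are expected in lower case.
     */
    public static Map<String, List<String>> extract(Node root, Set<String> tags) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        collect(root, tags, fields);
        return fields;
    }

    private static void collect(Node node, Set<String> tags,
                                Map<String, List<String>> fields) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            // HTML parsers may report element names in upper case, so normalize.
            String name = node.getNodeName().toLowerCase(Locale.ROOT);
            if (tags.contains(name)) {
                String text = node.getTextContent().trim();
                if (!text.isEmpty()) {
                    fields.computeIfAbsent(FIELD_PREFIX + name, k -> new ArrayList<>())
                          .add(text);
                }
            }
        }
        // Recurse into children so nested matches are also found.
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collect(child, tags, fields);
        }
    }
}
```

In a real plugin the resulting map would still have to be copied into the parse metadata and then into the indexed document by an indexing filter, and the "custom-" fields declared (or matched by a dynamic field) in the Solr schema. Those parts are left out of the sketch.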
Regards,

----- Original Message -----
From: "Vishal Sharma" <[email protected]>
To: "user" <[email protected]>
Sent: Friday, November 28, 2014 12:11:33 AM
Subject: Re: How to parse specific html tag in nutch+solr while crawling

Thanks for replying Markus. I'll check that.

Vishal Sharma
TL, SFDC
T: +1 650 288 6711
E: [email protected]
www.grazitti.com

On Thu, Nov 27, 2014 at 11:00 PM, Markus Jelsma <[email protected]> wrote:

> The plugin works on headings only, but if you check the sources, you can
> quickly adapt it to any element/attribute section.
>
> -----Original message-----
> > From: Vishal Sharma <[email protected]>
> > Sent: Thursday 27th November 2014 18:25
> > To: user <[email protected]>
> > Subject: Re: How to parse specific html tag in nutch+solr while crawling
> >
> > Hi Markus,
> >
> > Thank you so much for your reply.
> >
> > Quick question: Will this parse hN tags only, or can we configure it for
> > other html tags also, like <div class="test">?
> >
> > On Thu, Nov 27, 2014 at 10:33 PM, Markus Jelsma <[email protected]> wrote:
> >
> > > You may want to check the headings plugin, it reads content from those
> > > elements and writes them to some field. Very basic.
> > >
> > > -----Original message-----
> > > > From: Vishal Sharma <[email protected]>
> > > > Sent: Thursday 27th November 2014 17:59
> > > > To: user <[email protected]>
> > > > Subject: How to parse specific html tag in nutch+solr while crawling
> > > >
> > > > I tried this on Google also. But, nothing useful. Appreciate any help.
> > > >
> > > > Is there a way to parse specific html tag while doing the crawling with
> > > > nutch and then indexing it to solr.
> > > >
> > > > For example, I don't want all html page to go to content node. I would
> > > > want to parse h1 h2 tags into separate nodes.

