Is there any GOOD documentation on how to pursue the XPath route for making
the Nutch crawl more atomic/specific?

Thanks again,
Mark
---------------------------------------------------------------------------

P. 866.475.0317 x 3244
Bridgepoint Education
INNOVATIVE SOLUTIONS THAT ADVANCE LEARNING SM




On 11/11/13, 11:16 AM, "Markus Jelsma" <[email protected]> wrote:

>Ah yes, this is probably about extracting it from pages, not returning
>it. Headings can be extracted using the headings plugin, which is
>available in 1.7. You can also use XPath for extraction, but there's no
>plugin available for it yet, and it won't work with parse-tika.
> 
>-----Original message-----
>> From:Olle Romo <[email protected]>
>> Sent: Monday 11th November 2013 19:50
>> To: [email protected]
>> Subject: Re: Nutch 1.7 + AJAX Solr returning ALL contents vs. SPECIFIC
>> 
>> Hi Mark,
>> 
>> Not sure if this is exactly what you're looking for but maybe try the
>>whitelist_blacklist_plugin from NUTCH-585
>>https://issues.apache.org/jira/browse/NUTCH-585
>> 
>> Best,
>> Olle
>> 
>> On Nov 11, 2013, at 7:01 PM, "Reyes, Mark" <[email protected]>
>>wrote:
>> 
>> > Hi:
>> > 
>> > I'm using Nutch 1.7 to crawl/index the pages of my domain to Solr, and
>>the JavaScript library AJAX Solr to fetch that index as JSON and then
>>print it to the front-end.
>> > 
>> > My question is whether it's possible to have only specific content
>>returned (e.g. an H2 tag and a p tag) on the search results page, versus
>>all contents of that page.
>> > 
>> > Thank you,
>> > Mark
>> > 
>> > 
>> > IMPORTANT NOTICE: This e-mail message is intended to be received only
>>by persons entitled to receive the confidential information it may
>>contain. E-mail messages sent from Bridgepoint Education may contain
>>information that is confidential and may be legally privileged. Please
>>do not read, copy, forward or store this message unless you are an
>>intended recipient of it. If you received this transmission in error,
>>please notify the sender by reply e-mail and delete the message and any
>>attachments.
>> 
>> 

