Re: Generic xsl parser plugin

Julien Nioche Fri, 26 Sep 2014 01:27:58 -0700

Hi Nima

Thanks for reminding me about this JIRA issue, it hasn't been commented on
for some time and I'd forgotten about it. Judging by the discussion on
NUTCH-978 <https://issues.apache.org/jira/browse/NUTCH-978> things got
stuck when Emmanuel tried to get in touch with Emir (who in the meantime
seems to have stopped using Nutch - see
http://www.atlantbh.com/book-review-web-crawling-and-data-mining-with-apache-nutch/
).


It would indeed be good to get in touch with him; alternatively, Albin's
plugin could be a good starting point. There is clearly a need for this
functionality, and quite a few people are keen to make it happen.
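
To make the idea of an extractor driven by an external rules file concrete, here is a rough sketch of the approach Albin describes (XPath expressions listed in an XML file and mapped to metadata names). It is only an illustration under stated assumptions: it uses the JDK's javax.xml.xpath classes rather than any Nutch API, the class name XPathRuleExtractor and the rules format (`rules`, `field`, `name`, `xpath`) are made up for the example, and it assumes well-formed XHTML input. A real parse filter would work on the DOM that Nutch has already built instead of re-parsing a string.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical example class; not part of Nutch.
public class XPathRuleExtractor {

    // Parse an XML string into a DOM document (input must be well-formed).
    static Document parse(String xml) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    // Read each <field name="..." xpath="..."/> rule and evaluate its XPath
    // against the page, storing the trimmed string value under the rule's name.
    static Map<String, String> extract(String rulesXml, String pageXml)
            throws Exception {
        Document rules = parse(rulesXml);
        Document page = parse(pageXml);
        XPath xpath = XPathFactory.newInstance().newXPath();
        Map<String, String> metadata = new LinkedHashMap<>();
        NodeList fields = rules.getElementsByTagName("field");
        for (int i = 0; i < fields.getLength(); i++) {
            Element field = (Element) fields.item(i);
            String value = xpath.evaluate(field.getAttribute("xpath"), page);
            metadata.put(field.getAttribute("name"), value.trim());
        }
        return metadata;
    }

    public static void main(String[] args) throws Exception {
        // Assumed rules format: one <field> element per metadata entry.
        String rules = "<rules>"
            + "<field name=\"price\" xpath=\"//td[@id='price']\"/>"
            + "<field name=\"summary\" xpath=\"//div[@id='desc']\"/>"
            + "</rules>";
        String page = "<html><body>"
            + "<div id=\"desc\"> A sample product </div>"
            + "<table><tr><td id=\"price\">42 EUR</td></tr></table>"
            + "</body></html>";
        System.out.println(extract(rules, page));
    }
}
```

With something like this, supporting a new site would mean adding a rules file rather than writing a new plugin, and the extracted names could then be picked up at indexing time by index-metadata.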

Thanks

Julien


On 25 September 2014 18:19, Nima Falaki <[email protected]> wrote:

> The reason I think this is the following ticket (see the conversation at
> the bottom between Emmanuel and Lewis John):
>
> https://issues.apache.org/jira/browse/NUTCH-978
>
> On Thu, Sep 25, 2014 at 8:44 AM, Nima Falaki <[email protected]> wrote:
>
>> Hi Julien:
>>
>> I was under the impression that the Nutch community was going to use a
>> generic XSL parser, namely this one:
>> http://www.atlantbh.com/precise-data-extraction-with-apache-nutch/
>> Is the Nutch community going to use it?
>>
>>
>>
>> On Thu, Sep 25, 2014 at 5:49 AM, Julien Nioche <
>> [email protected]> wrote:
>>
>>> Hi Albin,
>>>
>>> You don't need a separate plugin for each HTML structure you
>>> want to parse. You can have a single plugin with multiple HtmlParseFilters.
>>>
>>> Having a generic extractor with the extraction logic configured in an
>>> external file is definitely a good idea and would make a great contribution
>>> to the project. In a nutshell, you haven't missed anything and that wheel
>>> definitely needs inventing ;-)
>>>
>>> Best
>>>
>>> Julien
>>>
>>>
>>> On 25 September 2014 09:24, Albin Vigier <[email protected]> wrote:
>>>
>>>> Hello everybody,
>>>>
>>>> I'm just wondering if it is possible to fetch specific metadata with
>>>> an existing nutch plugin.
>>>>
>>>> Let's take an example.
>>>> I want to extract some metadata from "div" or "td" tags from html
>>>> pages that have specific ids and name them the way I like (this is
>>>> done at parser time).
>>>> Then, at indexer time, I would use index-metadata (a very good plugin)
>>>> to add my custom metadata.
>>>>
>>>> From what I've seen on the wiki and from a quick look at the existing
>>>> plugins, I suppose I have to code my own plugin each time I've got a
>>>> new site (with a new HTML structure). I've already done that by using
>>>> a node walker in a custom HtmlParseFilter, but the extraction can be a
>>>> little bit tedious :)
>>>>
>>>> So on my side I've coded a little plugin that lets me specify
>>>> XPaths in an XML file. But before adding more functionality I'm
>>>> just wondering whether I have missed something.
>>>> This work allowed me to explore some aspects of Nutch, but I don't
>>>> want to reinvent the wheel.
>>>>
>>>> Albin
>>>>
>>>
>>>
>>>
>>> --
>>>
>>> Open Source Solutions for Text Engineering
>>>
>>> http://digitalpebble.blogspot.com/
>>> http://www.digitalpebble.com
>>> http://twitter.com/digitalpebble
>>>
>>
>>
>>
>> --
>>
>>
>>
>> Nima Falaki
>> Software Engineer
>> [email protected]
>>
>>
>
>
> --
>
>
>
> Nima Falaki
> Software Engineer
> [email protected]
>
>


-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
