Re: Generic xsl parser plugin

Albin Vigier Fri, 26 Sep 2014 07:05:16 -0700

Perfect, things move fast in here ;)
I've looked at Emir's code, but there is a limitation: it only accepts
xpath expressions, it is not full xsl. So it doesn't fit my needs. I need to
perform real transformations (with xsl:for-each and custom xsl functions,
not only xpath).
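
To illustrate the difference, here is a minimal sketch (not the plugin's
code) of a real XSL transformation applied to a DOM with the standard JAXP
API. The stylesheet, the class name XslForEachSketch and the sample page are
assumptions made up for the example; it only shows the kind of xsl:for-each
logic that a list of standalone xpath expressions cannot express.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XslForEachSketch {

    // Hypothetical stylesheet: emits one <field> per table cell that has an id.
    static final String XSL =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:output method='xml' omit-xml-declaration='yes'/>"
      + "<xsl:template match='/'>"
      + "<fields>"
      + "<xsl:for-each select='//td[@id]'>"
      + "<field name='{@id}'><xsl:value-of select='normalize-space(.)'/></field>"
      + "</xsl:for-each>"
      + "</fields>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Parses a well-formed (X)HTML string into a DOM, standing in for the
    // DOM that an HTML parser would hand over.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
    }

    // Applies the stylesheet to the DOM and returns the serialized result.
    static String transform(Document dom) throws Exception {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(XSL)));
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(dom), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><table><tr>"
            + "<td id='price'> 42 EUR </td><td id='title'>Some book</td><td>ignored</td>"
            + "</tr></table></body></html>";
        // Prints a <fields> element with one <field> per cell that has an id.
        System.out.println(transform(parse(page)));
    }
}
```

In a real plugin the stylesheet would presumably be loaded from an external
file and compiled once, rather than embedded and recompiled on every call.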


Another thing: when implementing the parser, I ran into a problem when trying
to apply xpath to the already provided DocumentFragment (generated by
htmlParser or tikaParser). It seems that Emir hit a problem too, because he
recreates the whole DOM from the raw content instead of reusing it. He then
cleans up the DOM nodes to XMLize them with yet another html node cleaner
(html cleaner) instead of the already used NekoHtml or TagSoup. I think I'll
start a new thread on this mailing list and ask Emir, because this could be
a performance issue for both of our plugins ;)
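
For reference, a minimal sketch of one possible way to reuse the fragment
without re-parsing the raw content: copy it under a synthetic root element
of a fresh Document and evaluate the xpath there. This assumes the trouble
comes from the fragment not having a document root, which this thread does
not confirm; the class name FragmentXPathSketch, the <root> wrapper and the
sample markup are made up for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class FragmentXPathSketch {

    // Copies the fragment's children under a synthetic <root> element of a
    // new Document, so xpath expressions have a regular document to run on.
    static Document wrap(DocumentFragment fragment) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder().newDocument();
        Element root = doc.createElement("root");
        doc.appendChild(root);
        // importNode(..., true) makes a deep copy; the original fragment is
        // left untouched. Appending the copied fragment moves its children
        // under <root>.
        root.appendChild(doc.importNode(fragment, true));
        return doc;
    }

    // Evaluates the expression against the wrapped copy and returns the
    // string value of the result.
    static String evaluate(DocumentFragment fragment, String expression)
            throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        return xpath.evaluate(expression, wrap(fragment));
    }

    // Builds a small fragment standing in for the one a parser would provide.
    static DocumentFragment sampleFragment() throws Exception {
        Document source = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader("<div id='author'>Jane Doe</div>")));
        DocumentFragment fragment = source.createDocumentFragment();
        fragment.appendChild(source.getDocumentElement());
        return fragment;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(evaluate(sampleFragment(), "//div[@id='author']"));
    }
}
```

The deep copy has a cost of its own, so this is only a starting point for
comparison. Whether element names come out in upper or lower case depends on
how the html parser is configured, so expressions may need adjusting.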

I've written a HOWTO that describes the main mechanism and compares it with
the NodeWalker implementation. I'm doing some cleanup and will then upload
the code:
http://albinscoding.wordpress.com/2014/09/25/xsl-parser-for-apache-nutch/




On Fri, Sep 26, 2014 at 6:01 AM, Nima Falaki <nfal...@popsugar.com> wrote:

> Yes please share. It would be useful.
> On Sep 25, 2014 8:54 PM, "Talat Uyarer" <ta...@uyarer.com> wrote:
>
>> One last thing: I wrote a how-to document for it. :)
>> On Sep 26, 2014 6:52 AM, "Talat Uyarer" <ta...@uyarer.com> wrote:
>>
>>> Hi all,
>>>
>>> I made some changes to Emir's plugin to make it compatible with 2.x. It
>>> is useful; if you need it, I can share my fork.
>>>
>>> Talat
>>> On Sep 26, 2014 6:47 AM, "Nima Falaki" <nfal...@popsugar.com> wrote:
>>>
>>>> Hi:
>>>>
>>>> Yes, it would be very interesting. Let me know what Emir says
>>>>
>>>> Nima
>>>>
>>>> On Thu, Sep 25, 2014 at 12:43 PM, Albinscode <albinsc...@gmail.com>
>>>> wrote:
>>>>
>>>>> Oh thanks Nima, I found this topic last year but I thought the
>>>>> project was dead. I think there is also a brief reference in the nutch
>>>>> wiki, but I cannot find it now.
>>>>>
>>>>> It looks like we have the same xsl approach so it can be interesting
>>>>> to share. I'll try to contact Emir while continuing documenting my small
>>>>> plugin.
>>>>>
>>>>> Thanks again for the valuable information!
>>>>>
>>>>> 2014-09-25 19:19 GMT+02:00 Nima Falaki <nfal...@popsugar.com>:
>>>>>
>>>>>> And the reason why I think this is because of this ticket (Look at
>>>>>> the conversation at the bottom between Emmanuel and Lewis John)
>>>>>>
>>>>>> https://issues.apache.org/jira/browse/NUTCH-978
>>>>>>
>>>>>> On Thu, Sep 25, 2014 at 8:44 AM, Nima Falaki <nfal...@popsugar.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Julien:
>>>>>>>
>>>>>>> I was under the impression that the nutch community was going to use
>>>>>>> a generic xsl parser? This one:
>>>>>>> http://www.atlantbh.com/precise-data-extraction-with-apache-nutch/
>>>>>>> Is the nutch community going to use this?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Sep 25, 2014 at 5:49 AM, Julien Nioche <
>>>>>>> lists.digitalpeb...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Albin,
>>>>>>>>
>>>>>>>> You don't have to have a separate plugin for each html structure
>>>>>>>> you want to parse. You can have a single plugin with multiple
>>>>>>>> HTMLParseFilters.
>>>>>>>>
>>>>>>>> Having a generic extractor with the extraction logic configured in
>>>>>>>> an external file is definitely a good idea and would make a great
>>>>>>>> contribution to the project. In a nutshell, you haven't missed
>>>>>>>> anything and that wheel definitely needs inventing ;-)
>>>>>>>>
>>>>>>>> Best
>>>>>>>>
>>>>>>>> Julien
>>>>>>>>
>>>>>>>>
>>>>>>>> On 25 September 2014 09:24, Albin Vigier <albinsc...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hello everybody,
>>>>>>>>>
>>>>>>>>> I'm just wondering if it is possible to fetch specific metadata
>>>>>>>>> with an existing nutch plugin.
>>>>>>>>>
>>>>>>>>> Let's take an example.
>>>>>>>>> I want to extract some metadata from "div" or "td" tags with
>>>>>>>>> specific ids in html pages and name them the way I like (this is
>>>>>>>>> done at parse time).
>>>>>>>>> Then, at index time, I would use index-metadata (a very good
>>>>>>>>> plugin) to add my custom metadata.
>>>>>>>>>
>>>>>>>>> Currently, from what I've seen on the wiki and from quickly
>>>>>>>>> analyzing the plugins, I suppose I have to code my own plugin each
>>>>>>>>> time I get a new site (with a new html structure). I've already
>>>>>>>>> done that by using a node walker in a custom htmlParseFilter, but
>>>>>>>>> the extraction can be a little bit boring :)
>>>>>>>>>
>>>>>>>>> So on my side I've coded a little plugin that lets me specify
>>>>>>>>> xpaths in an xml file. But before diving into more functionality,
>>>>>>>>> I'm just wondering whether I have missed something.
>>>>>>>>> This work allowed me to explore some aspects of nutch, but I don't
>>>>>>>>> want to reinvent the wheel or overlook an existing solution.
>>>>>>>>>
>>>>>>>>> Albin
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>>
>>>>>>>> Open Source Solutions for Text Engineering
>>>>>>>>
>>>>>>>> http://digitalpebble.blogspot.com/
>>>>>>>> http://www.digitalpebble.com
>>>>>>>> http://twitter.com/digitalpebble
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Nima Falaki
>>>>>>> Software Engineer
>>>>>>> nfal...@popsugar.com
>>>>>>>
>>>>>>>
>>>>>
>>>>
