Re: DIH using values from solrconfig.xml inside data-config.xml

Noble Paul നോബിള്‍ नोब्ळ् Wed, 04 Feb 2009 10:27:38 -0800

The implementation assumed that most users have XML with a
fixed schema. In that case, giving an absolute path is not hard. This
helps us deal with a large subset of use cases rather easily.


We have not added all the features that are possible with a
streaming parser. It is wiser to piggyback on some real XPath engine,
because the demand for full XPath support will always be there.
--Noble
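
For anyone who wants to try the DOM route described in the quoted text
below, here is a minimal sketch of the core step only: parse an XML
string into a DOM and run a full XPath expression such as //para with
the JDK's built-in javax.xml.xpath engine. The class and method names
(DomXPathSketch, extractAll) are made up for illustration and are not
part of DIH. Wiring this into a custom Transformer fed by
PlainTextEntityProcessor is left out, and the whole document is held in
memory, so this is a sketch under those assumptions, not a tested DIH
component.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of DIH: evaluates an XPath expression
// against an XML string by building a DOM tree in memory first.
final class DomXPathSketch {

    // Returns the text content of every node matched by the expression.
    static List<String> extractAll(String xml, String expression) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes =
            (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);

        List<String> values = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            values.add(nodes.item(i).getTextContent());
        }
        return values;
    }

    public static void main(String[] args) throws Exception {
        // Sample document invented for this sketch.
        String xml = "<record><title>T</title><body><para>one</para>"
                   + "<sec><para>two</para></sec></body></record>";
        System.out.println(extractAll(xml, "//para"));
    }
}
```

Inside a Transformer, the returned list could be put into the row under
whatever column name the schema expects. The cost is one full DOM per
document, which is the trade-off against the streaming reader.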

On Wed, Feb 4, 2009 at 5:15 PM, Fergus McMenemie <fer...@twig.me.uk> wrote:
>>: > The solr data field is populated properly. So I guess that bit works.
>>: > I really wish I could use xpath="//para"
>>
>>: The limitation comes from streaming the XML instead of creating a DOM.
>>: XPathRecordReader is a custom streaming XPath parser implementation and
>>: streaming is easy only because we limit the syntax. You can use
>>: PlainTextEntityProcessor which gives the XML as a string to a custom
>>: Transformer. This Transformer can create a DOM, run your XPath query and
>>: populate the fields. It's more expensive but it is an option.
>>
>>Maybe it's just me, but it seems like I'm noticing that as DIH gets used
>>more, many people are noting that the XPath processing in DIH doesn't work
>>the way they expect because it's a custom XPath parser/engine designed for
>>streaming.
>>
>>It seems like it would be helpful to have an alternate processor for
>>people who don't need the streaming support (ie: are dealing with small
>>enough docs that they can load the full DOM tree into memory) that would
>>use the default Java XPath engine (and have fewer caveats/surprises) ... I
>>would think it would probably even make sense for this new XPath processor
>>to be the one we suggest for new users, and only suggest the existing
>>(stream based) processor if they have really big xml docs to deal with.
>>
>>(In hindsight XPathEntityProcessor and XPathRecordReader should probably
>>have been named StreamingXPathEntityProcessor and
>>StreamingXPathRecordReader)
>>
> Four thoughts!
>
> 1) My use case involves a few million XML documents ranging in size
>   from a few K to 500K. 95% of the documents are under 25KBytes,
>   5% of the documents are around 0.5Mbytes. So.. sod it, I think I
>   need a streaming parser.
>
> 2) "streaming XPath parser"? I only half understand all this stuff,
>   but, and this is based on the little bit of SAX stuff I have written,
>   I would have thought that //para was trivial for any kind of
>   streaming XML parser.
>
> 3) Much of the confusion may be arising because the DIH wiki page is
>   not too clear on what is and is not allowed. We need better,
>   more explicit examples. What seems to be allowed is:-
>    <field column="streetname" xpath="/record/address/@c" />
>    <field column="title"      xpath="/record/title" />
>    <field column="date"       xpath="/record/da...@qualifier='pubDate']" />
>   I will add these to the wiki. Just to be sure, I tested
>   xpath="//para". It does not work!
>
> 4) XML documents are either well structured with good separation of
>   data and presentation in which case absolute xpaths work fine.
>   Or older, in my case text documents, which have been forced into
>   XML format with poor structure where the data and presentation
>   is all mixed up. I suspect that the addition of //para would
>   cover many of the use cases, and what was left could be covered
>   by a preceding XSLT transform.
> --
>
> ===============================================================
> Fergus McMenemie               Email:fer...@twig.me.uk
> Techmore Ltd                   Phone:(UK) 07721 376021
>
> Unix/Mac/Intranets             Analyst Programmer
> ===============================================================
>
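
On point 2 in the quoted text: matching an element at any depth is
indeed straightforward with a plain SAX handler. The following is a
minimal sketch using only the JDK's SAX API. It is not how
XPathRecordReader is written, and the class name AnyDepthSaxSketch and
its collect method are invented for illustration. It covers only the
"//name" case, assumes no namespaces, and for nested matches keeps only
the outermost element, which differs from real XPath.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler, not part of DIH: collects the text of every
// element with a given name, wherever it occurs in the document.
final class AnyDepthSaxSketch extends DefaultHandler {
    private final String target;
    private final List<String> values = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private int depth = 0; // > 0 while inside at least one target element

    AnyDepthSaxSketch(String target) {
        this.target = target;
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        if (qName.equals(target)) {
            if (depth == 0) {
                text.setLength(0); // start a fresh buffer for the outermost match
            }
            depth++;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (depth > 0) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Emit a value only when the outermost target element closes.
        if (qName.equals(target) && --depth == 0) {
            values.add(text.toString());
        }
    }

    // Streams the XML once and returns the text of each matching element.
    static List<String> collect(String xml, String elementName) throws Exception {
        AnyDepthSaxSketch handler = new AnyDepthSaxSketch(elementName);
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return handler.values;
    }
}
```

Calling AnyDepthSaxSketch.collect(xml, "para") streams the document once
and never builds a tree, so memory use stays proportional to the largest
matched element rather than to the whole file.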



-- 
--Noble Paul
