Re: What is the best way to index xml data preserving the mark up?

Norberto Meijome Wed, 07 Nov 2007 22:20:54 -0800

On Wed, 7 Nov 2007 20:18:25 -0800 (PST)
David Neubert <[EMAIL PROTECTED]> wrote:


> I am sure this is a 101 question, but I am a bit confused about indexing xml 
> data using SOLR.
> 
> I have rich xml content (books) that needs to be searched at granular levels 
> (specifically paragraph and sentence levels very accurately, no 
> approximations).  My source text has exact <p></p> and <s></s> tags for this 
> purpose.  I have built this app in previous versions (using other search 
> engines) indexing the text twice, (1) where every paragraph was a virtual 
> document and (2) where every sentence was a virtual document -- both 
> extracted from the source file (which was a single xml file for the entire 
> book).  I have of course thought about using an XML engine such as eXist or 
> Xindice, but I prefer the stability, user base and performance that 
> Lucene/SOLR seems to have, and there is also a large body of text that is 
> regular documents and not well-formed XML.
> 
> I am brand new to SOLR (one day) and at a basic level understand SOLR's nice 
> simple xml scheme to add documents:
> 
> <add>
>   <doc>
>     <field name="foo1">foo value 1</field>
>     <field name="foo2">foo value 2</field>
>   </doc>
>   <doc>...</doc>
> </add>
> 
> But my problem is that I believe I need to preserve the xml markup at the 
> paragraph and sentence levels, so I was hoping to create a content field that 
> could just contain the source xml for the paragraph or sentence respectively. 
>  There are reasons for this that I won't go into -- a lot of granular work in 
> this app, accessing pars and sens.
> 
> Obviously an XML mechanism that could leverage the xml structure (via XPath 
> or XPointers) would work great.  Still I think Lucene can do this in a field 
> level way -- and I also can't imagine that users who are indexing XML 
> documents have to go through the trouble of stripping all the markup before 
> indexing?  Hopefully I am missing something basic?
> 
> It would be great to be pointed in the right direction on this matter.
> 
> I think I need something along this line:
> 
> <add>
>   <doc>
>     <field name="foo1">value 1</field>
>     <field name="foo2">value 2</field>
>     ....
>     <field name="content"><an xml stream with embedded source markup></field>
>   </doc>
> </add>
> 
> Maybe the overall question -- is what is the best way to index XML content 
> using SOLR -- is all this tag stripping really necessary?

Crazy/silly idea maybe... could you use dynamic fields, each containing a 
sentence, with the field name carrying a reference to the paragraph it 
belongs to? 
e.g. (not sure if the syntax is correct):

<dynamicField name="s_*" type="text" indexed="true" stored="true" />

Then when you create your document you can define
<doc>
  <field name="s_1_p1">{Sentence #1, Para#1}</field>
  <field name="s_2_p1">{Sentence #2, Para#1}</field>
  <field name="s_3_p1">{Sentence #3, Para#1}</field>
  <field name="s_1_p2">{Sentence #1, Para#2}</field>
[...]
</doc>

I have no idea how scalable that would be. 
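
A minimal sketch (Python, standard library only) of how the add message for 
this naming scheme could be generated from a source file that already has <p> 
and <s> tags. The function name build_add_xml, the "id" field and the sample 
book are assumptions made up for illustration, not anything Solr provides. The 
sentence markup is kept because ElementTree escapes it when it is written out 
as field text.

```python
import xml.etree.ElementTree as ET


def build_add_xml(book_xml: str, doc_id: str) -> str:
    """Build a Solr <add> message with one s_<n>_p<m> field per sentence."""
    book = ET.fromstring(book_xml)
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")

    # "id" is a hypothetical unique-key field; use whatever your schema defines.
    id_field = ET.SubElement(doc, "field", name="id")
    id_field.text = doc_id

    for p_num, para in enumerate(book.iter("p"), start=1):
        for s_num, sent in enumerate(para.iter("s"), start=1):
            # Drop text that follows the closing </s> so that only the
            # sentence element itself is serialized.
            sent.tail = None
            field = ET.SubElement(doc, "field", name=f"s_{s_num}_p{p_num}")
            # Store the sentence with its markup; the "<" and ">" characters
            # are escaped automatically when the message is serialized.
            field.text = ET.tostring(sent, encoding="unicode")

    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    sample = (
        "<book>"
        "<p><s>First <i>one</i>.</s> <s>Second.</s></p>"
        "<p><s>Third.</s></p>"
        "</book>"
    )
    print(build_add_xml(sample, "book-1"))
```

Each field value is the sentence's source XML as an escaped string, so a 
client reading the field back should get the markup verbatim, provided the 
field is stored.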
cheers,
B
_________________________
{Beto|Norberto|Numard} Meijome

Immediate success shouldn't be necessary as a motivation to do the right thing.

I speak for myself, not my employer. Contents may be hot. Slippery when wet. 
Reading disclaimers makes you go blind. Writing them is worse. You have been 
Warned.
