I found an answer to my question, but it comes with a cost. Given an XML file 
like this (simplified to remove extraneous elements and attributes):

<data>
  <user id="[id-num]">
    <message date="[date]">[message text]</message>
    ...
  </user>
  ...
</data>

I can index the user id as a field in documents that represent each of the 
user's messages with this data-config expression:

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8" />
  <document>
    <entity name="message"
            processor="XPathEntityProcessor"
            stream="true"
            forEach="/data/user/message | /data/user"
            url="message-data.xml">
      <field column="id" xpath="/data/user/@id" commonField="true"/>
      <field column="date" xpath="/data/user/message/@date"
             dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss"/>
      <field column="text" xpath="/data/user/message" />
    </entity>
  </document>
</dataConfig>

I didn't realize that commonField would work for cases in which the previously 
encountered field is in an element that encompasses the other elements, but it 
does. The forEach value has to be "/data/user/message | /data/user" in order 
for the user id to be located, since it is not under /data/user/message.

By specifying forEach="/data/user/message | /data/user" I am saying that each 
/data/user or /data/user/message element is a document in the index, but I 
don't really want /data/user elements to be treated this way. As luck would 
have it, those documents are filtered out, but only because date and text are 
required fields that have not been assigned values yet when a document is 
created for a /data/user element, so an exception is thrown and the document 
is rejected. I could live with this, but it's kind of ugly.
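
One way to make that filtering explicit, instead of relying on the 
required-field exception, might be DIH's $skipDoc flag set from a 
ScriptTransformer. This is an untested sketch: the function name skipUserRows 
is made up for illustration, and it assumes that the rows produced for 
/data/user elements arrive with no text value, as described above.

```xml
<dataConfig>
  <!-- Hypothetical helper: mark rows that have no message text so DIH skips them -->
  <script><![CDATA[
    function skipUserRows(row) {
      // Assumption: rows emitted for /data/user carry only the common id field
      if (row.get('text') == null) {
        row.put('$skipDoc', 'true');
      }
      return row;
    }
  ]]></script>
  <dataSource type="FileDataSource" encoding="UTF-8" />
  <document>
    <entity name="message"
            processor="XPathEntityProcessor"
            stream="true"
            forEach="/data/user/message | /data/user"
            transformer="script:skipUserRows"
            url="message-data.xml">
      <!-- Field definitions unchanged from the config above -->
      <field column="id" xpath="/data/user/@id" commonField="true"/>
      <field column="date" xpath="/data/user/message/@date"
             dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss"/>
      <field column="text" xpath="/data/user/message" />
    </entity>
  </document>
</dataConfig>
```

If this behaves as assumed, the /data/user rows would be dropped before they 
reach the schema check, so no exception would be logged for them. A message 
element with no text at all would also be skipped by this test.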

I don't see any other way to do what I need with nested XML elements, though. 
I tried creating nested entities in the data-config file, but each one of them 
is required to have a url attribute, and I think that caused the input file to 
be read twice.

The only other possibility I could see from reading the DataImportHandler 
documentation was to specify an XSL file and change the XML file's structure so 
that the user id attribute is moved down to be an attribute of the message 
element. I'm not sure it's worth it to do something like that for what seems 
like a small problem, and I wonder how much it would slow down the importing of 
a large XML file.
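
For reference, the XSL approach could look roughly like the sketch below. It 
is an XSLT 1.0 identity transform with one extra template that copies the 
user's id down onto each message. The attribute name userId and the file name 
move-user-id.xsl are made up for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" encoding="UTF-8"/>

  <!-- Identity template: copy everything through unchanged by default -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- For each message, add the enclosing user's id as a userId attribute -->
  <xsl:template match="/data/user/message">
    <xsl:copy>
      <xsl:attribute name="userId">
        <xsl:value-of select="../@id"/>
      </xsl:attribute>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```

The entity would then need only the message path in forEach, with the 
stylesheet named in the XPathEntityProcessor's xsl attribute:

```xml
<entity name="message"
        processor="XPathEntityProcessor"
        forEach="/data/user/message"
        xsl="move-user-id.xsl"
        url="message-data.xml">
  <!-- userId is the hypothetical attribute added by the stylesheet above -->
  <field column="id" xpath="/data/user/message/@userId"/>
  <field column="date" xpath="/data/user/message/@date"
         dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss"/>
  <field column="text" xpath="/data/user/message" />
</entity>
```

As far as I understand it, the transformed output is built in full before it 
is parsed, so this may not combine well with stream="true" on a large file.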

Are there any other ways of handling cases like this, where an attribute of an 
outer element is to be included in an index document that corresponds to an 
element nested inside it?
Thanks,
Mike

-----Original Message-----
From: Mike O'Leary [mailto:tmole...@uw.edu] 
Sent: Friday, March 02, 2012 3:30 PM
To: Solr-User (solr-user@lucene.apache.org)
Subject: Including an attribute value from a higher level entity when using DIH 
to index an XML file

I have an XML file that I would like to index. It has a structure similar to 
this:

<data>
  <user id="[id-num]">
    <message date="[date]">[message text]</message>
    ...
  </user>
  ...
</data>

I would like to have the documents in the index correspond to the messages in 
the xml file, and have the user's [id-num] value stored as a field in each of 
the user's documents. I think this means that I have to define an entity for 
message that looks like this:

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8" />
  <document>
    <entity name="message"
            processor="XPathEntityProcessor"
            stream="true"
            forEach="/data/user/message/"
            url="message-data.xml">
      <field column="date" xpath="/data/user/message/@date"
             dateTimeFormat="yyyy-MM-dd'T'hh:mm:ss"/>
      <field column="text" xpath="/data/user/message" />
    </entity>
  </document>
</dataConfig>

but I don't know where to put the field definition for the user id. It would 
look like

<field column="id" xpath="/data/user/@id" />

I can't put it within the message entity, because it is defined with 
forEach="/data/user/message/" and the id field's xpath value is outside of the 
entity's scope. Putting the id field definition there causes a null pointer 
exception. I don't think I want to create a "user" entity that the "message" 
entity is nested inside of. Or is there a way to do that and still have the 
index documents correspond to messages from the file? Are there any attributes 
or attribute values that I haven't run across in my searching that would let 
me do what I need to do?
Thanks,
Mike

