DataImporter using pure solr add XML

2010-10-25 Thread Dario Rigolin
Looking at DataImporter, I'm not sure whether it's possible to import using a
standard <add><doc>... XML document representing a document add operation.
Generating the <add><doc> XML is quite expensive in my application, so I have
cached all those documents in a text column in a MySQL database.
It would be easier for me to push all updated documents directly from the
database instead of going through multiple XML files posted in stream mode to
Solr.

Thank you.

Dario.


Re: DataImporter using pure solr add XML

2010-10-25 Thread Ken Stanley
On Mon, Oct 25, 2010 at 10:12 AM, Dario Rigolin
<dario.rigo...@comperio.it> wrote:

> Looking at DataImporter I'm not sure if it's possible to import using a
> standard <add><doc>... xml document representing a document add operation.
> Generating <add><doc> is quite expensive in my application and I have
> cached all those documents into a text column into MySQL database.
> It will be easier for me to push all updated documents directly from
> Database instead passing via multiple xml files posted in stream mode to
> Solr.
>
> Thank you.
>
> Dario.



Dario,

Technically, nothing is stopping you from using the DIH to import your XML
document(s). However, note that the <add><doc></doc></add> structure is not
required. In fact, you can make up your own structure for the documents, so
long as you configure the DIH to recognize it. At minimum, you should be
able to use something to the effect of:

<dataSource type="FileDataSource" encoding="UTF-8" />

<document>
  <entity
      name="some_unique_name_for_the_entity"
      rootEntity="false"
      dataSource="null"
      processor="FileListEntityProcessor"
      fileName="some_regex_matching_your_files.*\.xml$"
      baseDir="/path/to/xml/files"
      newerThan="${dataimporter.some_unique_name_for_the_entity.last_index_time}"
  >
    <entity
        name="another_unique_entity_name"
        dataSource="some_unique_name_for_the_entity"
        processor="XPathEntityProcessor"
        url="${some_unique_name_for_the_entity.fileAbsolutePath}"
        forEach="/XMLROOT/CHILD_NODE"
        stream="true"
    >
      <!-- An optional list of <field /> definitions if your XML
           schema does not match that of SOLR -->
    </entity>
  </entity>
</document>

The breakdown is as follows:

The <dataSource /> defines the document encoding that SOLR should use for
your XML files.

The top-level <entity /> creates the list of files to parse (which is why the
fileName attribute supports regular expressions). The dataSource attribute
needs to be set to null here (I'm using 1.4.1, and AFAIK this is the same in
1.3 as well). The rootEntity="false" setting is important: it tells SOLR that
it should not try to define fields from this entity.

The second-level <entity /> is where the documents found in the file list
are processed and parsed. The dataSource attribute needs to be the name of
the top-level <entity />. The url attribute is defined as the absolute path
to the file generated by the top-level entity. The forEach attribute is the
key component here; this is the minimum XPath needed to iterate over your
document structure. For example, forEach="/XMLROOT/CHILD_NODE" would iterate
over each CHILD_NODE element in a file like this:

<XMLROOT>
  <CHILD_NODE>
    <field1>data</field1>
    <field2>more data</field2>
    ...
  </CHILD_NODE>
</XMLROOT>
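
If the element names in your XML do not match your SOLR schema, the optional <field /> definitions inside the second-level entity might look something like the following. This is only a sketch: the SOLR field names "title" and "body" are hypothetical, and the XPaths assume the sample structure shown above.

```xml
<entity
    name="another_unique_entity_name"
    dataSource="some_unique_name_for_the_entity"
    processor="XPathEntityProcessor"
    url="${some_unique_name_for_the_entity.fileAbsolutePath}"
    forEach="/XMLROOT/CHILD_NODE"
    stream="true">
  <!-- column: target SOLR field (hypothetical names here);
       xpath: element in the source XML to read the value from -->
  <field column="title" xpath="/XMLROOT/CHILD_NODE/field1" />
  <field column="body"  xpath="/XMLROOT/CHILD_NODE/field2" />
</entity>
```

Each xpath starts with the same path used in forEach, so every CHILD_NODE element yields one set of field values.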

Also note that, in my experience, case sensitivity matters when parsing your
XPath expressions.
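
For the situation in the original question, where the add XML is already cached in a MySQL text column, one possible variation is to read the XML from the database row instead of from files. The sketch below is untested and assumes that your Solr version provides the DIH FieldReaderDataSource and the useSolrAddSchema option of XPathEntityProcessor; the JDBC URL, credentials, table name (cached_docs), and column name (add_xml) are placeholders, not values from this thread.

```xml
<dataConfig>
  <!-- Placeholder connection details; adjust for your database -->
  <dataSource name="db" type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/mydb"
              user="solr" password="secret" />
  <!-- Feeds the child entity from a column of the parent row instead of a file -->
  <dataSource name="fieldReader" type="FieldReaderDataSource" />

  <document>
    <!-- One row per cached document; rootEntity="false" so rows are not documents themselves -->
    <entity name="cached" dataSource="db" rootEntity="false"
            query="SELECT add_xml FROM cached_docs">
      <!-- dataField points at parentEntity.column holding the XML text;
           useSolrAddSchema="true" expects the standard Solr add XML, so no <field /> list is given -->
      <entity name="doc" dataSource="fieldReader"
              processor="XPathEntityProcessor"
              dataField="cached.add_xml"
              useSolrAddSchema="true" />
    </entity>
  </document>
</dataConfig>
```

Whether this is cheaper than posting the cached XML directly to the update handler depends on your setup, so treat it as a starting point rather than a recommendation.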

I hope this helps!

- Ken Stanley