On Mon, Oct 25, 2010 at 10:12 AM, Dario Rigolin <dario.rigo...@comperio.it> wrote:
> Looking at DataImporter I'm not sure if it's possible to import using a
> standard <add><doc>... xml document representing a document add operation.
> Generating <add><doc> is quite expensive in my application and I have cached
> all those documents into a text column into MySQL database.
> It will be easier for me to "push" all updated documents directly from
> Database instead passing via multiple xml files posted in "stream" mode to
> Solr.
>
> Thank you.
>
> Dario.
>

Dario,

Technically, nothing is stopping you from using the DIH to import your XML
document(s). However, note that the <add><doc></doc></add> structure is not
required. In fact, you can make up your own structure for the documents, so
long as you configure the DIH to recognize them. At minimum, you should be
able to use something to the effect of:

<dataConfig>
  <dataSource type="FileDataSource" encoding="UTF-8" />
  <document>
    <entity
      name="some_unique_name_for_the_entity"
      rootEntity="false"
      dataSource="null"
      processor="FileListEntityProcessor"
      fileName="some_regex_matching_your_files.*\.xml$"
      baseDir="/path/to/xml/files"
      newerThan="${dataimporter.some_unique_name_for_the_entity.last_index_time}"
    >
      <entity
        name="another_unique_entity_name"
        dataSource="some_unique_name_for_the_entity"
        processor="XPathEntityProcessor"
        url="${some_unique_name_for_the_entity.fileAbsolutePath}"
        forEach="/XMLROOT/CHILD_NODE"
        stream="true"
      >
        <!-- An optional list of <field /> definitions if your XML
             schema does not match that of Solr -->
      </entity>
    </entity>
  </document>
</dataConfig>

The breakdown is as follows:

The <dataSource /> defines the document encoding that Solr should use for
your XML files.

The top-level <entity /> creates the list of files to parse, which is why
the fileName attribute accepts a regular expression. The dataSource
attribute needs to be set to "null" here (I'm using 1.4.1, and AFAIK this is
the same in 1.3 as well). The rootEntity="false" is important: it tells Solr
that it should not try to define fields from this entity.

The second-level <entity /> is where the documents found in the file list
are processed and parsed. The dataSource attribute needs to be the name of
the top-level <entity />. The url attribute is set to the absolute path of
each file produced by the top-level entity. The forEach is the key component
here; it is the minimum XPath needed to iterate over your document
structure. For example, given:

<XMLROOT>
  <CHILD_NODE>
    <field1>data</field1>
    <field2>more data</field2>
    ...
  </CHILD_NODE>
</XMLROOT>

a forEach of /XMLROOT/CHILD_NODE yields one Solr document for each
<CHILD_NODE> element.

Also note that, in my experience, case sensitivity matters when parsing your
XPath expressions.

I hope this helps!

- Ken Stanley
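
As a sketch of the optional <field /> list mentioned in the configuration
comment, the inner entity for the sample XMLROOT structure might look like
the following. It assumes that field1 and field2 are also field names
defined in your schema.xml (they are placeholders taken from the sample, not
names Solr requires), and it reuses the placeholder entity names from the
configuration.

```xml
<entity
  name="another_unique_entity_name"
  dataSource="some_unique_name_for_the_entity"
  processor="XPathEntityProcessor"
  url="${some_unique_name_for_the_entity.fileAbsolutePath}"
  forEach="/XMLROOT/CHILD_NODE"
  stream="true"
>
  <!-- One Solr document is produced per element matched by forEach. -->
  <!-- "column" is the Solr field name (assumed to exist in schema.xml);
       "xpath" is the full path to the element inside each record. -->
  <field column="field1" xpath="/XMLROOT/CHILD_NODE/field1" />
  <field column="field2" xpath="/XMLROOT/CHILD_NODE/field2" />
</entity>
```

Each xpath repeats the forEach prefix and must match the element names
exactly as they are written in the file, so /xmlroot/child_node/field1 would
not match this sample.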