I'm just wondering what the general consensus is on indexing XML data into Solr 
in terms of parsing and mining the relevant data out of the file and putting 
it into Solr fields. Assume this is the XML file and the resulting Solr 
fields:

XML data:
<mydoc id="1234">
<title>foo</title>
<bar attr1="val1"/>
<baz>garbage data</baz>
</mydoc>

Solr Fields:
Id=1234
Title=foo
Bar=val1

I'd previously set this process up using XSLT and have since tested XMLBeans, 
JAXB, etc. to get the relevant data out. The speed, however, is not acceptable: 
2800 objects take 11 minutes to parse and index into Solr.

The big slowdown appears to be that I'm parsing the data with an XML parser.
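
For reference, the parser-based extraction looks roughly like this (a minimal 
sketch, with Groovy's XmlSlurper standing in for the XSLT/XMLBeans/JAXB variants; 
field names are just illustrative and the actual Solr indexing call is left out):

def extractWithParser(File f) {
    def root = new XmlSlurper().parse(f)      // full XML parse of the file
    [id   : root.@id.text(),                  // <mydoc id="1234">
     title: root.title.text(),                // <title>foo</title>
     bar  : root.bar.@attr1.text()]           // <bar attr1="val1"/>; <baz> is ignored
}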

So now I'm testing mining the data by opening the file as plain text (using 
Groovy) and picking out the relevant data with regular expression matching. With 
that approach I can mine the data and index the 2800 files in 72 seconds.
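
Roughly, the text-mining version looks like this (simplified to just the example 
fields; the regexes only cover the sample document above, and real files would 
need more care with attribute order, whitespace, CDATA, etc.):

def extractWithRegex(File f) {
    def text = f.getText('UTF-8')
    def grab = { pattern ->
        def m = (text =~ pattern)             // java.util.regex.Matcher
        m.find() ? m.group(1) : null
    }
    [id   : grab(/<mydoc\s+id="([^"]*)"/),
     title: grab(/<title>([^<]*)<\/title>/),
     bar  : grab(/<bar\s+attr1="([^"]*)"/)]
}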

So I'm wondering whether the typical solution is to go with a non-XML approach 
like this. It seems to make sense, since the search index just wants to capture 
as much of the data as possible and shouldn't have to rely on the incoming 
documents being XML-compliant.

Thanks in advance for any thoughts on this!
-Kristian