I have some delimited data that I would like to import but am having issues
getting the regex patterns to work properly with Solr. The following is just
one example of the issues I have experienced.

The regex required for this example should be very simple (delimited data).
I have some regex patterns that work fine with online regex test sites like
http://www.regular-expressions.info/javascriptexample.html. However, Solr
doesn't seem to handle them.

I am using the Feb 3, 2010 trunk under Windows.

Questions:
1. What sort of regex parser/flavor is used for splitBy and regex? I
assumed it was Java regex, but it doesn't seem to be.
2. Is there documentation somewhere of what is/isn't acceptable regex
for Solr?
3. Any ideas on how to get this [what I thought was] simple import routine
to work?

Any and all help/suggestions are appreciated.  Thx in advance.

------
Details:


1. my data looks something like this for each record: 
dataA1|^dataA2|?dataB1|^dataB2|?dataC1|^dataC2

2. I want to split out and save dataA1, dataB1, and dataC1 to a multivalued
field and ignore the rest.

3. splitBy seems to work fine, but my regex split doesn't; I end up with
either no data in the field or data that is not split. For example, I get:

<arr name="myfield">
 <str>dataA1|^dataA2</str>
 <str>dataB1|^dataB2</str>
 <str>dataC1|^dataC2</str>
</arr>
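
For comparison, here is a minimal sketch of the two-step result I am after,
written against plain java.util.regex outside of Solr. The class and method
names are made up for illustration, and I am only assuming that splitBy and
regex behave like String.split and Pattern; I have not confirmed that this is
what RegexTransformer does internally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-alone check, not Solr code.
public class SplitThenExtract {

    // Step 1: split the record on the literal "|?".
    // Step 2: for each piece, keep group 1 of (.*)(\|.*), i.e. the text
    //         before the last "|" in that piece.
    static List<String> firstParts(String record) {
        Pattern head = Pattern.compile("(.*)(\\|.*)");
        List<String> out = new ArrayList<>();
        for (String piece : record.split("\\|\\?")) {
            Matcher m = head.matcher(piece);
            // Fall back to the whole piece if it contains no "|".
            out.add(m.matches() ? m.group(1) : piece);
        }
        return out;
    }

    public static void main(String[] args) {
        String record = "dataA1|^dataA2|?dataB1|^dataB2|?dataC1|^dataC2";
        System.out.println(firstParts(record)); // prints [dataA1, dataB1, dataC1]
    }
}
```

In plain Java this gives the three values I want in the multivalued field,
which is why I expected the same combination to work in data-conf.xml.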


4. The relevant parts of my schema include:

<types>
 <fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
 <analyzer type="index">
  <tokenizer class="solr.KeywordTokenizerFactory"/>
  <filter class="solr.SynonymFilterFactory" synonyms="text_synonyms.txt"
ignoreCase="true" expand="true"/>
  <filter class="solr.LowerCaseFilterFactory"/>
 </analyzer>
 <analyzer type="query">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
 </analyzer>
</fieldtype>
...
</types>

<fields>
...
 <field name="mydata" type="text" indexed="true" stored="true"
multiValued="true"/>
...
</fields>

5. my data-conf.xml contains this sort of line:

<document>
...
<entity name="name"
 dataSource="ds-1"
 transformer="TemplateTransformer,RegexTransformer"
 query="select myfield from mytable" >
 <field sourceColName="myfield" column="mydata" splitBy="(\|\?)"
regex="(.*)(\|.*)"/>
 ...
</entity>
</document>

6. patterns I have tried include "(.*)(\|.*)", "(.*)\|(.*)", "(.*)\|.*",
"(.*)(\|^.*)", ...
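
To rule out the patterns themselves, they can be run against one
already-split piece with plain java.util.regex. This is only a sketch under
the assumption that Solr hands the pattern to java.util.regex.Pattern
unchanged; the class name is hypothetical, and the last pattern (with the ^
escaped) is an extra variant added for comparison, not one of the patterns
listed above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-alone check, not Solr code.
public class PatternCheck {

    // Returns group 1 of the first match, or null when the pattern does not match.
    static String group1(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String piece = "dataA1|^dataA2";
        String[] patterns = {
            "(.*)(\\|.*)",
            "(.*)\\|(.*)",
            "(.*)\\|.*",
            "(.*)(\\|^.*)",   // unescaped ^ is a start-of-input anchor, not a literal
            "(.*)(\\|\\^.*)"  // extra variant: ^ escaped so it matches the character
        };
        for (String p : patterns) {
            System.out.println(p + " -> " + group1(p, piece));
        }
    }
}
```

In plain Java the first three patterns and the escaped variant all yield
dataA1, while "(.*)(\|^.*)" yields no match because the unescaped ^ can only
match at the start of the input.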

-- 
View this message in context: 
http://n3.nabble.com/problem-with-RegexTransformer-and-delimited-data-tp713846p713846.html
Sent from the Solr - User mailing list archive at Nabble.com.
