I've managed to get it working with the RegexTransformer when the string is on 
a single line in my Tika entity. But when the string spans multiple lines it 
doesn't work, even though I tried ?s to set the DOTALL flag.

<entity name="tika" processor="TikaEntityProcessor" url="${rec.url}" 
dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html" 
transformer="RegexTransformer">
        <field column="text_html" regex="&lt;body&gt;(.+)&lt;/body&gt;" 
replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text"  />
</entity>
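
For reference, here is a minimal Java sketch of how the DOTALL flag behaves in java.util.regex, the engine that evaluates the regex attribute. The class and method names are made up for illustration. The assumption is that the inline flag is written as (?s) at the very start of the pattern, which in the XML attribute would read regex="(?s)&lt;body&gt;(.+)&lt;/body&gt;".

```java
import java.util.regex.Pattern;

public class DotallSketch {
    // Without (?s), '.' stops at line terminators, so a multi-line body never matches.
    static final Pattern SINGLE_LINE = Pattern.compile("<body>(.+)</body>");
    // With (?s) at the start of the pattern, '.' also matches \n and \r.
    static final Pattern DOTALL = Pattern.compile("(?s)<body>(.+)</body>");

    // Hypothetical helper: replaces every match of p in html with replacement.
    static String replaceBody(Pattern p, String html, String replacement) {
        return p.matcher(html).replaceAll(replacement);
    }

    public static void main(String[] args) {
        String html = "<html><body>line one\nline two</body></html>";
        // The first call leaves the input untouched, the second replaces the body.
        System.out.println(replaceBody(SINGLE_LINE, html, "QQQ"));
        System.out.println(replaceBody(DOTALL, html, "QQQ"));
    }
}
```

If the flag is placed anywhere other than the start of the pattern, it only applies from that point on, which may be why ?s appeared to have no effect.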
                        
Then I tried it like this and got a stack overflow:

<field column="text_html" regex="&lt;body&gt;((.|\n|\r)+)&lt;/body&gt;" 
replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text"  />

In JavaScript this works, but maybe only because I used a small string.
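
The stack overflow is probably caused by the repeated alternation group: java.util.regex matches a construct like (.|\n|\r)+ recursively, roughly one stack frame per matched character, so a full HTML page can exhaust the thread stack. A sketch of one possible workaround, assuming the character class [\s\S] as an equivalent "any character" form; the class name, helper names and page size are illustrative only.

```java
import java.util.regex.Pattern;

public class LargeBodySketch {
    // [\s\S] matches any character, line breaks included, without an alternation group.
    static final Pattern BODY = Pattern.compile("<body>([\\s\\S]+)</body>");

    // Hypothetical helper: replaces the whole body element with the replacement text.
    static String replaceBody(String html, String replacement) {
        return BODY.matcher(html).replaceAll(replacement);
    }

    // Builds a multi-line test page large enough to stress the matcher.
    static String buildPage(int lines) {
        StringBuilder sb = new StringBuilder("<body>");
        for (int i = 0; i < lines; i++) {
            sb.append("some text on a line\r\n");
        }
        return sb.append("</body>").toString();
    }

    public static void main(String[] args) {
        String page = buildPage(20000);
        System.out.println(replaceBody(page, "QQQ"));
        try {
            // The original alternation form; whether it overflows depends on JVM and stack size.
            Pattern.compile("<body>((.|\\n|\\r)+)</body>").matcher(page).replaceAll("QQQ");
            System.out.println("alternation form finished");
        } catch (StackOverflowError e) {
            System.out.println("alternation form overflowed the stack");
        }
    }
}
```

The (?s) flag with a plain (.+) should avoid the recursion in the same way, since a single-character repetition is matched in a loop rather than recursively.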



On 6. Sep 2013, at 2:55 PM, Jack Krupansky wrote:

> Is there any chance that you changed your schema since you indexed the data? 
> If so, re-index the data.
> 
> If a "*" query finds nothing, that implies that the default field is empty. 
> Are you sure the "df" parameter is set to the field containing your data? 
> Show us your request handler definition and a sample of your actual Solr 
> input (Solr XML or JSON?) so that we can see what fields are being populated.
> 
> -- Jack Krupansky
> 
> -----Original Message----- From: Andreas Owen
> Sent: Friday, September 06, 2013 4:01 AM
> To: solr-user@lucene.apache.org
> Subject: Re: charfilter doesn't do anything
> 
> the input string is a normal html page with the word Zahlungsverkehr in it 
> and my query is ...solr/collection1/select?q=*
> 
> On 5. Sep 2013, at 9:57 PM, Jack Krupansky wrote:
> 
>> And show us an input string and a query that fail.
>> 
>> -- Jack Krupansky
>> 
>> -----Original Message----- From: Shawn Heisey
>> Sent: Thursday, September 05, 2013 2:41 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: charfilter doesn't do anything
>> 
>> On 9/5/2013 10:03 AM, Andreas Owen wrote:
>>> I would like to filter / replace a word during indexing, but it doesn't do 
>>> anything and I don't get an error.
>>> 
>>> in schema.xml i have the following:
>>> 
>>> <field name="text_html" type="text_cutHtml" indexed="true" stored="true" 
>>> multiValued="true"/>
>>> 
>>> <fieldType name="text_cutHtml" class="solr.TextField">
>>> <analyzer>
>>> <!--  <tokenizer class="solr.StandardTokenizerFactory"/> -->
>>> <charFilter class="solr.PatternReplaceCharFilterFactory" 
>>> pattern="Zahlungsverkehr" replacement="ASDFGHJK" />
>>> <tokenizer class="solr.KeywordTokenizerFactory"/>
>>> </analyzer>
>>>  </fieldType>
>>> 
>>> My second question: where can I specify that the expression is multiline, 
>>> the way I can append /m to the pattern in JavaScript?
>> 
>> I don't know about your second question.  I don't know if that will be
>> possible, but I'll leave that to someone who's more expert than I.
>> 
>> As for the first question, here's what I have.  Did you reindex?  That
>> will be required.
>> 
>> http://wiki.apache.org/solr/HowToReindex
>> 
>> Assuming that you did reindex, are you trying to search for ASDFGHJK in
>> a field that contains more than just "Zahlungsverkehr"?  The keyword
>> tokenizer might not do what you expect - it tokenizes the entire input
>> string as a single token, which means that you won't be able to search
>> for single words in a multi-word field without wildcards, which are
>> pretty slow.
>> 
>> Note that both the pattern and replacement are case sensitive.  This is
>> how regex works.  You haven't used a lowercase filter, which means that
>> you won't be able to search for asdfghjk.
>> 
>> Use the analysis tab in the UI on your core to see what Solr does to
>> your field text.
>> 
>> Thanks,
>> Shawn 
