I've managed to get it working with the RegexTransformer when the string is on a single line in my Tika entity. But when the string spans multiple lines it doesn't work, even though I tried ?s to set the DOTALL flag.

<entity name="tika" processor="TikaEntityProcessor" url="${rec.url}"
        dataSource="dataUrl" onError="skip" htmlMapper="identity"
        format="html" transformer="RegexTransformer">
  <field column="text_html" regex="&lt;body&gt;(.+)&lt;/body&gt;"
         replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text" />
</entity>

Then I tried it like this and got a stack overflow:

  <field column="text_html" regex="&lt;body&gt;((.|\n|\r)+)&lt;/body&gt;"
         replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text" />

In JavaScript this works, but maybe only because I used a small string.

On 6. Sep 2013, at 2:55 PM, Jack Krupansky wrote:

> Is there any chance that you changed your schema since you indexed the data?
> If so, re-index the data.
>
> If a "*" query finds nothing, that implies that the default field is empty.
> Are you sure the "df" parameter is set to the field containing your data?
> Show us your request handler definition and a sample of your actual Solr
> input (Solr XML or JSON?) so that we can see what fields are being populated.
>
> -- Jack Krupansky
>
> -----Original Message----- From: Andreas Owen
> Sent: Friday, September 06, 2013 4:01 AM
> To: solr-user@lucene.apache.org
> Subject: Re: charfilter doesn't do anything
>
> the input string is a normal html page with the word Zahlungsverkehr in it
> and my query is ...solr/collection1/select?q=*
>
> On 5. Sep 2013, at 9:57 PM, Jack Krupansky wrote:
>
>> And show us an input string and a query that fail.
>>
>> -- Jack Krupansky
>>
>> -----Original Message----- From: Shawn Heisey
>> Sent: Thursday, September 05, 2013 2:41 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: charfilter doesn't do anything
>>
>> On 9/5/2013 10:03 AM, Andreas Owen wrote:
>>> i would like to filter / replace a word during indexing but it doesn't do
>>> anything and i dont get a error.
>>>
>>> in schema.xml i have the following:
>>>
>>> <field name="text_html" type="text_cutHtml" indexed="true" stored="true"
>>>        multiValued="true"/>
>>>
>>> <fieldType name="text_cutHtml" class="solr.TextField">
>>>   <analyzer>
>>>     <!-- <tokenizer class="solr.StandardTokenizerFactory"/> -->
>>>     <charFilter class="solr.PatternReplaceCharFilterFactory"
>>>                 pattern="Zahlungsverkehr" replacement="ASDFGHJK" />
>>>     <tokenizer class="solr.KeywordTokenizerFactory"/>
>>>   </analyzer>
>>> </fieldType>
>>>
>>> my 2. question is where can i say that the expression is multilined like in
>>> javascript i can use /m at the end of the pattern?
>>
>> I don't know about your second question. I don't know if that will be
>> possible, but I'll leave that to someone who's more expert than I.
>>
>> As for the first question, here's what I have. Did you reindex? That
>> will be required.
>>
>> http://wiki.apache.org/solr/HowToReindex
>>
>> Assuming that you did reindex, are you trying to search for ASDFGHJK in
>> a field that contains more than just "Zahlungsverkehr"? The keyword
>> tokenizer might not do what you expect - it tokenizes the entire input
>> string as a single token, which means that you won't be able to search
>> for single words in a multi-word field without wildcards, which are
>> pretty slow.
>>
>> Note that both the pattern and replacement are case sensitive. This is
>> how regex works. You haven't used a lowercase filter, which means that
>> you won't be able to search for asdfghjk.
>>
>> Use the analysis tab in the UI on your core to see what Solr does to
>> your field text.
>>
>> Thanks,
>> Shawn
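
On the multi-line question quoted above: PatternReplaceCharFilterFactory and the DIH RegexTransformer both compile their patterns with java.util.regex, which accepts embedded flags at the start of the expression. `(?s)` is DOTALL and lets `.` match line terminators; `(?m)` is the counterpart of JavaScript's `/m` and only changes where `^` and `$` match. Below is a minimal sketch, untested against the data in this thread. The `<body>` pattern in the charFilter is a hypothetical illustration and is not taken from the schema shown here. Angle brackets are written as `&lt;` and `&gt;` because a literal `<` is not allowed inside an XML attribute value.

```xml
<!-- schema.xml: hypothetical variant of the text_cutHtml type from this thread.
     (?s) turns on DOTALL so that "." also matches \n and \r. -->
<fieldType name="text_cutHtml" class="solr.TextField">
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory"
                pattern="(?s)&lt;body&gt;(.+)&lt;/body&gt;"
                replacement="ASDFGHJK"/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
  </analyzer>
</fieldType>

<!-- data-config.xml: the same embedded flag on the RegexTransformer field,
     placed inside the existing "tika" entity. -->
<field column="text_html" sourceColName="text"
       regex="(?s)&lt;body&gt;(.+)&lt;/body&gt;"
       replaceWith="QQQQQQQQQQQQQQQ"/>
```

A single `.+` under DOTALL also avoids the `(.|\n|\r)+` alternation. In java.util.regex, a repeated group containing an alternation is matched recursively, which is a common cause of a StackOverflowError on long input. Whether the flag alone fixes the case described at the top of this message is not confirmed anywhere in the thread.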