Okay, I can repro the problem. Yes, it appears that the pattern replace char filter does not default to DOTALL mode (sometimes loosely called multiline mode) for pattern matching, so dot does not match line breaks, and <body> on one line and </body> on another line cannot be matched.

Now, whether that is by design or a bug or an option for enhancement is a matter for some committer to comment on.

But, the good news is that you can in fact set the mode you need, DOTALL, in your pattern by starting it with "(?s)", which means that dot accepts line break characters as well.
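To see the effect of the "(?s)" flag in isolation, here is a minimal sketch. It uses Python's `re` module as a stand-in for the Java regex engine that Solr uses; the inline flag behaves the same way in both, although Java's dot excludes more line terminators than Python's (which only excludes "\n"). The sample input is made up for illustration.

```python
import re

# Hypothetical input with <body> and </body> on different lines.
html = "<html>\r\nnav\r\n<body>\r\nonly this\r\n</body>\r\nfooter</html>"

# The pattern from this thread, without and with the (?s) inline flag.
plain = r"^.*<body>(.*)</body>.*$"
dotall = r"(?s)^.*<body>(.*)</body>.*$"

# Without (?s), "." cannot cross the line breaks, so nothing matches
# and the input comes back unchanged.
without_flag = re.sub(plain, r"\1", html)

# With (?s), "." also matches line breaks, so the whole input is
# replaced by the captured body content.
with_flag = re.sub(dotall, r"\1", html)

print(without_flag == html)   # True
print(repr(with_flag))        # '\r\nonly this\r\n'
```

Note that Python writes the replacement as `\1` where the Solr config uses `$1`.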

So, here are my revised field types:

<fieldType name="text_html_body" class="solr.TextField" positionIncrementGap="100" >
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(?s)^.*&lt;body&gt;(.*)&lt;/body&gt;.*$" replacement="$1" />
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="text_html_body_strip" class="solr.TextField" positionIncrementGap="100" >
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="(?s)^.*&lt;body&gt;(.*)&lt;/body&gt;.*$" replacement="$1" />
    <charFilter class="solr.HTMLStripCharFilterFactory" />
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

The first type accepts everything within <body>, including nested HTML formatting, while the latter strips nested HTML formatting as well.

The tokenizer will in fact strip out white space, but that happens after all character filters have completed.
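As a rough illustration of the order of operations (character filters first, then the tokenizer, then token filters), here is a small Python sketch. The three helper functions are hypothetical stand-ins for the Solr components, not their real implementations; in particular the tag stripper is far cruder than HTMLStripCharFilterFactory.

```python
import re

BODY_PATTERN = re.compile(r"(?s)^.*<body>(.*)</body>.*$")

def keep_body(text):
    # Stand-in for the pattern replace char filter: keep only the <body> content.
    return BODY_PATTERN.sub(r"\1", text)

def strip_tags(text):
    # Crude stand-in for the HTML strip char filter: replace each tag with a space.
    return re.sub(r"<[^>]+>", " ", text)

def tokenize(text):
    # Stand-in for tokenizer plus lowercase filter: split on non-word
    # characters (this is where white space disappears) and lowercase.
    return [t.lower() for t in re.findall(r"\w+", text)]

doc = "<html>nav\n<body>\n  <p>A <b>Test</b>.</p>\n</body>\nfooter</html>"

# Like text_html_body: nested markup survives, so tag names become tokens.
tokens_body = tokenize(keep_body(doc))

# Like text_html_body_strip: nested markup is removed before tokenizing.
tokens_strip = tokenize(strip_tags(keep_body(doc)))

print(tokens_body)    # ['p', 'a', 'b', 'test', 'b', 'p']
print(tokens_strip)   # ['a', 'test']
```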

OK, I am getting there now, but if there are newlines involved the regex stops as soon as it reaches a "\r\n", even if I try [\t\r\n.]* in the regex. I have to get rid of the newlines. Why isn't WhitespaceTokenizerFactory the right element for this?
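One likely reason the [\t\r\n.]* attempt fails: inside a character class, "." is a literal period rather than "any character", so that class only accepts tabs, line breaks, and periods. A minimal sketch in Python (used here as a stand-in for Java's regex engine, which treats a dot inside a class the same way), with made-up sample text:

```python
import re

text = "<body>line one\r\nline two</body>"

# Inside [...], "." only matches a literal period, so ordinary letters
# are not allowed and the match fails.
literal_dot = re.search(r"<body>([\t\r\n.]*)</body>", text)

# "Any character, including line breaks" can be written as [\s\S],
# or as "." with the (?s) flag turned on.
any_char_class = re.search(r"<body>([\s\S]*)</body>", text)
dotall = re.search(r"(?s)<body>(.*)</body>", text)

print(literal_dot)                    # None
print(repr(any_char_class.group(1)))  # 'line one\r\nline two'
print(repr(dotall.group(1)))          # 'line one\r\nline two'
```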

Use XML then. Although you will need to escape the XML special characters as I did in the pattern.

The point is simply: Quickly and simply try to find the simple test scenario that illustrates the problem.

I tried, but that isn't working either; it wants a data stream. I'll have to check how to post JSON instead of XML.

Did you at least try the pattern I gave you?

The point of the curl was the data, not how you send the data. You can just use the standard Solr simple post tool.

I've downloaded curl and tried it in the command prompt and PowerShell on my Windows Server 2008 R2 machine; that's why I used my data importer with a single-line HTML file and copy/pasted the lines into schema.xml.

Did you in fact try my suggested example? If not, please do so.

I index HTML pages with a lot of lines, not just a string with the body tag. It doesn't work with proper HTML files, even though I took all the newlines out.

<html>nav-content<body> nur das will ich sehen</body>footer-content</html>

solr update debug output:
"text_html": ["<html>\r\n\r\n<meta name=\"Content-Encoding\" content=\"ISO-8859-1\">\r\n<meta name=\"Content-Type\" content=\"text/html; charset=ISO-8859-1\">\r\n<title></title>\r\n\r\n<body>nav-content nur das will ich sehenfooter-content</body></html>"]

I tried this and it seems to work when added to the standard Solr example in 4.4:

<field name="body" type="text_html_body" indexed="true" stored="true" />

<fieldType name="text_html_body" class="solr.TextField" positionIncrementGap="100" >
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="^.*&lt;body&gt;(.*)&lt;/body&gt;.*$" replacement="$1" />
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

That char filter retains only text between <body> and </body>. Is that what you wanted?

Indexing this data:

curl 'localhost:8983/solr/update?commit=true' -H 'Content-type:application/json' -d '
[{"id":"doc-1","body":"abc <body>A test.</body> def"}]'

And querying with these commands:

curl "http://localhost:8983/solr/select/?q=*:*&indent=true&wt=json";
Shows all data

curl "http://localhost:8983/solr/select/?q=body:test&indent=true&wt=json";
shows the body text

curl "http://localhost:8983/solr/select/?q=body:abc&indent=true&wt=json";
shows nothing (outside of body)

curl "http://localhost:8983/solr/select/?q=body:def&indent=true&wt=json";
shows nothing (outside of body)

curl "http://localhost:8983/solr/select/?q=body:body&indent=true&wt=json";
Shows nothing, HTML tag stripped

In your original query, you didn't show us what your default field, df parameter, was.

Yes, but that filters HTML and not the specific tag I want.

Hmmm, have you looked at:

Not quite the <body>, perhaps, but might it help?

OK, I have HTML pages with <html>.....<!--body-->content i
want....<!--/body-->.....</html>. I want to extract (index, store) only
what is between the body comments. I thought RegexTransformer would be the
best because XPath doesn't work in Tika and I can't nest an
XPathEntityProcessor to use XPath. What I have also found out is that the
HTML parser from Tika cuts my body comments out and tries to make
well-formed HTML, which I would like to switch off.

I've managed to get it working if I use the RegexTransformer and the string
is on a single line in my Tika entity. But when the string spans multiple
lines it isn't working, even though I tried ?s to set the DOTALL flag.

<entity name="tika" processor="TikaEntityProcessor" url="${rec.url}"
dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html">
<field column="text_html" regex="&lt;body&gt;(.+)&lt;/body&gt;"
replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text"  />
</entity>

Then I tried it like this and I get a stack overflow:

<field column="text_html" regex="&lt;body&gt;((.|\n|\r)+)&lt;/body&gt;"
replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text"  />

In JavaScript this works, but maybe only because I used a small string.
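A note on the stack overflow: in Java's regex engine, repeating an alternation such as (.|\n|\r)+ is a well-known way to run into a StackOverflowError on long input, because each repetition of the group adds to the recursion depth. The same match can be expressed without the alternation by using "(?s)" with a plain dot. Below is a minimal sketch of that form applied to the comment markers described above. It is written in Python as a stand-in for Java's regex engine, and the sample page is made up.

```python
import re

# Hypothetical multi-line page using the <!--body--> comment markers.
page = "<html>\nnav\n<!--body-->\ncontent i want\n<!--/body-->\nfooter\n</html>"

# (?s) lets "." span line breaks; ".+?" is reluctant, so it stops at the
# first closing marker instead of running to the last one.
marker = re.compile(r"(?s)<!--body-->(.+?)<!--/body-->")

match = marker.search(page)
content = match.group(1).strip() if match else None

print(content)   # content i want
```

Whether the markers are still present by the time the regex runs depends on what Tika's HTML parsing does to the comments, which this sketch does not address.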

Sounds like we've got an XY problem here.


How about you tell us *exactly* what you'd actually like to have happen
and then we can find a solution for you?

It sounds a little bit like you're interested in stripping all the HTML
tags out.  Perhaps the HTMLStripCharFilter?


Something that I already said: By using the KeywordTokenizer, you won't be able to search for individual words on your HTML input. The entire
input string is treated as a single token, and therefore ONLY exact
entire-field matches (or certain wildcard matches) will be possible.
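To make the difference concrete, here is a toy Python sketch. The functions are hypothetical stand-ins for a keyword tokenizer and a word-splitting tokenizer, not Solr's implementations, and the sample value is made up.

```python
import re

def keyword_tokens(text):
    # Like a keyword tokenizer: the whole input becomes a single token.
    return [text]

def word_tokens(text):
    # Rough stand-in for a word-splitting tokenizer plus lowercasing.
    return re.findall(r"\w+", text.lower())

def term_matches(term, tokens):
    # A plain term query matches only if the term equals an indexed token.
    return term in tokens

value = "A test."

print(term_matches("test", keyword_tokens(value)))     # False
print(term_matches("A test.", keyword_tokens(value)))  # True
print(term_matches("test", word_tokens(value)))        # True
```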


Note that no matter what you do to your data with the analysis chain, Solr will always return the text that was originally indexed in search results. If you need to affect what gets stored as well, perhaps you
need an Update Processor.


