perfect, i tried it before but always at the tail of the expression with no effect. thanks a lot. a last question, do you know how to keep the html comments from being filtered before the transformer has done its work?
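
A minimal sketch of the original goal in this thread (keep only the text between the `<!--body-->` and `<!--/body-->` markers), assuming the comments are still present when the pattern runs. It does not address how to stop Tika from removing them. Python's `re` module stands in for the Java regex engine here, and the sample page and the helper name `extract_marked_body` are made up for illustration.

```python
import re

# Hypothetical sample page using the comment markers described in the thread.
page = (
    "<html>\n<div>nav-content</div>\n"
    "<!--body-->\ncontent i want\n<!--/body-->\n"
    "<div>footer-content</div>\n</html>"
)

# The (?s) flag is placed at the start of the pattern, as suggested in the
# thread, so that the dot also matches line breaks.
marker_pattern = re.compile(r"(?s)^.*<!--body-->(.*)<!--/body-->.*$")


def extract_marked_body(text):
    """Return the text between the markers, or the input unchanged if absent."""
    # Python writes the back-reference as \1; the Solr config uses $1.
    return marker_pattern.sub(r"\1", text)


extracted = extract_marked_body(page)
```

If the markers are missing, the pattern does not match and the text passes through unchanged, which mirrors what happens when a replace pattern fails to match.
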

On 10. Sep 2013, at 3:17 PM, Jack Krupansky wrote:

> Okay, I can repro the problem. Yes, it appears that the pattern replace char
> filter does not default to dot-all mode for pattern matching, so <body> on
> one line and </body> on another line cannot be matched.
>
> Now, whether that is by design or a bug or an option for enhancement is a
> matter for some committer to comment on.
>
> But, the good news is that you can in fact set dot-all mode in your pattern
> by starting it with "(?s)", which means that dot accepts line break
> characters as well.
>
> So, here are my revised field types:
>
> <fieldType name="text_html_body" class="solr.TextField"
>            positionIncrementGap="100">
>   <analyzer>
>     <charFilter class="solr.PatternReplaceCharFilterFactory"
>                 pattern="(?s)^.*&lt;body&gt;(.*)&lt;/body&gt;.*$" replacement="$1" />
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
>
> <fieldType name="text_html_body_strip" class="solr.TextField"
>            positionIncrementGap="100">
>   <analyzer>
>     <charFilter class="solr.PatternReplaceCharFilterFactory"
>                 pattern="(?s)^.*&lt;body&gt;(.*)&lt;/body&gt;.*$" replacement="$1" />
>     <charFilter class="solr.HTMLStripCharFilterFactory" />
>     <tokenizer class="solr.StandardTokenizerFactory"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
>
> The first type accepts everything within <body>, including nested HTML
> formatting, while the latter strips nested HTML formatting as well.
>
> The tokenizer will in fact strip out white space, but that happens after all
> character filters have completed.
>
> -- Jack Krupansky
>
> -----Original Message-----
> From: Andreas Owen
> Sent: Tuesday, September 10, 2013 7:07 AM
> To: solr-user@lucene.apache.org
> Subject: Re: charfilter doesn't do anything
>
> ok i am getting there now but if there are newlines involved the regex stops
> as soon as it reaches a "\r\n" even if i try [\t\r\n.]* in the regex.
> I have to get rid of the newlines. why isn't whitespaceTokenizerFactory the
> right element for this?
>
>
> On 10. Sep 2013, at 1:21 AM, Jack Krupansky wrote:
>
>> Use XML then. Although you will need to escape the XML special characters as
>> I did in the pattern.
>>
>> The point is simply: Quickly and simply try to find the simple test scenario
>> that illustrates the problem.
>>
>> -- Jack Krupansky
>>
>> -----Original Message-----
>> From: Andreas Owen
>> Sent: Monday, September 09, 2013 7:05 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: charfilter doesn't do anything
>>
>> i tried but that isn't working either, it wants a data-stream, i'll have to
>> check how to post json instead of xml
>>
>> On 10. Sep 2013, at 12:52 AM, Jack Krupansky wrote:
>>
>>> Did you at least try the pattern I gave you?
>>>
>>> The point of the curl was the data, not how you send the data. You can just
>>> use the standard Solr simple post tool.
>>>
>>> -- Jack Krupansky
>>>
>>> -----Original Message-----
>>> From: Andreas Owen
>>> Sent: Monday, September 09, 2013 6:40 PM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: charfilter doesn't do anything
>>>
>>> i've downloaded curl and tried it in the command prompt and power shell on
>>> my win 2008r2 server, that's why i used my dataimporter with a single line
>>> html file and copy/pasted the lines into schema.xml
>>>
>>>
>>> On 9. Sep 2013, at 11:20 PM, Jack Krupansky wrote:
>>>
>>>> Did you in fact try my suggested example? If not, please do so.
>>>>
>>>> -- Jack Krupansky
>>>>
>>>> -----Original Message-----
>>>> From: Andreas Owen
>>>> Sent: Monday, September 09, 2013 4:42 PM
>>>> To: solr-user@lucene.apache.org
>>>> Subject: Re: charfilter doesn't do anything
>>>>
>>>> i index html pages with a lot of lines and not just a string with the
>>>> body-tag.
>>>> it doesn't work with proper html files, even though i took all the
>>>> newlines out.
>>>>
>>>> html-file:
>>>> <html>nav-content<body> nur das will ich sehen</body>footer-content</html>
>>>>
>>>> solr update debug output:
>>>> "text_html": ["<html>\r\n\r\n<meta name=\"Content-Encoding\"
>>>> content=\"ISO-8859-1\">\r\n<meta name=\"Content-Type\"
>>>> content=\"text/html;
>>>> charset=ISO-8859-1\">\r\n<title></title>\r\n\r\n<body>nav-content nur das
>>>> will ich sehenfooter-content</body></html>"]
>>>>
>>>>
>>>>
>>>> On 8. Sep 2013, at 3:28 PM, Jack Krupansky wrote:
>>>>
>>>>> I tried this and it seems to work when added to the standard Solr example
>>>>> in 4.4:
>>>>>
>>>>> <field name="body" type="text_html_body" indexed="true" stored="true" />
>>>>>
>>>>> <fieldType name="text_html_body" class="solr.TextField"
>>>>>            positionIncrementGap="100">
>>>>>   <analyzer>
>>>>>     <charFilter class="solr.PatternReplaceCharFilterFactory"
>>>>>                 pattern="^.*&lt;body&gt;(.*)&lt;/body&gt;.*$" replacement="$1" />
>>>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>>>   </analyzer>
>>>>> </fieldType>
>>>>>
>>>>> That char filter retains only text between <body> and </body>. Is that
>>>>> what you wanted?
>>>>>
>>>>> Indexing this data:
>>>>>
>>>>> curl 'localhost:8983/solr/update?commit=true' \
>>>>>   -H 'Content-type:application/json' \
>>>>>   -d '[{"id":"doc-1","body":"abc <body>A test.</body> def"}]'
>>>>>
>>>>> And querying with these commands:
>>>>>
>>>>> curl "http://localhost:8983/solr/select/?q=*:*&indent=true&wt=json"
>>>>> Shows all data
>>>>>
>>>>> curl "http://localhost:8983/solr/select/?q=body:test&indent=true&wt=json"
>>>>> Shows the body text
>>>>>
>>>>> curl "http://localhost:8983/solr/select/?q=body:abc&indent=true&wt=json"
>>>>> Shows nothing (outside of body)
>>>>>
>>>>> curl "http://localhost:8983/solr/select/?q=body:def&indent=true&wt=json"
>>>>> Shows nothing (outside of body)
>>>>>
>>>>> curl "http://localhost:8983/solr/select/?q=body:body&indent=true&wt=json"
>>>>> Shows nothing, HTML tag stripped
>>>>>
>>>>> In your original query, you didn't show us what your default field, df
>>>>> parameter, was.
>>>>>
>>>>> -- Jack Krupansky
>>>>>
>>>>> -----Original Message-----
>>>>> From: Andreas Owen
>>>>> Sent: Sunday, September 08, 2013 5:21 AM
>>>>> To: solr-user@lucene.apache.org
>>>>> Subject: Re: charfilter doesn't do anything
>>>>>
>>>>> yes but that filters html and not the specific tag i want.
>>>>>
>>>>> On 7. Sep 2013, at 7:51 PM, Erick Erickson wrote:
>>>>>
>>>>>> Hmmm, have you looked at:
>>>>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
>>>>>>
>>>>>> Not quite the <body>, perhaps, but might it help?
>>>>>>
>>>>>>
>>>>>> On Fri, Sep 6, 2013 at 11:33 AM, Andreas Owen <a...@conx.ch> wrote:
>>>>>>
>>>>>>> ok i have html pages with <html>.....<!--body-->content i
>>>>>>> want....<!--/body-->.....</html>. i want to extract (index, store) only
>>>>>>> that between the body-comments. i thought regexTransformer would be the
>>>>>>> best because xpath doesn't work in tika and i can't nest an
>>>>>>> XPathEntityProcessor to use xpath.
>>>>>>> what i have also found out is that the
>>>>>>> htmlparser from tika cuts my body-comments out and tries to make well
>>>>>>> formed html, which i would like to switch off.
>>>>>>>
>>>>>>> On 6. Sep 2013, at 5:04 PM, Shawn Heisey wrote:
>>>>>>>
>>>>>>>> On 9/6/2013 7:09 AM, Andreas Owen wrote:
>>>>>>>>> i've managed to get it working if i use the regexTransformer and the
>>>>>>>>> string is on the same line in my tika entity. but when the string is
>>>>>>>>> multilined it isn't working even though i tried ?s to set the flag
>>>>>>>>> dotall.
>>>>>>>>>
>>>>>>>>> <entity name="tika" processor="TikaEntityProcessor" url="${rec.url}"
>>>>>>>>>         dataSource="dataUrl" onError="skip" htmlMapper="identity"
>>>>>>>>>         format="html" transformer="RegexTransformer">
>>>>>>>>>   <field column="text_html" regex="&lt;body&gt;(.+)&lt;/body&gt;"
>>>>>>>>>          replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text" />
>>>>>>>>> </entity>
>>>>>>>>>
>>>>>>>>> then i tried it like this and i get a stackoverflow
>>>>>>>>>
>>>>>>>>> <field column="text_html"
>>>>>>>>>        regex="&lt;body&gt;((.|\n|\r)+)&lt;/body&gt;"
>>>>>>>>>        replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text" />
>>>>>>>>>
>>>>>>>>> in javascript this works but maybe because i only used a small string.
>>>>>>>>
>>>>>>>> Sounds like we've got an XY problem here.
>>>>>>>>
>>>>>>>> http://people.apache.org/~hossman/#xyproblem
>>>>>>>>
>>>>>>>> How about you tell us *exactly* what you'd actually like to have happen
>>>>>>>> and then we can find a solution for you?
>>>>>>>>
>>>>>>>> It sounds a little bit like you're interested in stripping all the HTML
>>>>>>>> tags out. Perhaps the HTMLStripCharFilter?
>>>>>>>>
>>>>>>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
>>>>>>>>
>>>>>>>> Something that I already said: By using the KeywordTokenizer, you won't
>>>>>>>> be able to search for individual words on your HTML input.
>>>>>>>> The entire
>>>>>>>> input string is treated as a single token, and therefore ONLY exact
>>>>>>>> entire-field matches (or certain wildcard matches) will be possible.
>>>>>>>>
>>>>>>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.KeywordTokenizerFactory
>>>>>>>>
>>>>>>>> Note that no matter what you do to your data with the analysis chain,
>>>>>>>> Solr will always return the text that was originally indexed in search
>>>>>>>> results. If you need to affect what gets stored as well, perhaps you
>>>>>>>> need an Update Processor.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Shawn
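
The dot-all behaviour discussed in this thread can be checked outside Solr. The sketch below is an illustration only: it uses Python's `re` module as a stand-in for the Java regex engine that Solr uses, the sample string is made up to resemble the debug output above, and Python writes the replacement as `\1` where the Solr config uses `$1`. The two engines differ in details (for example, which characters count as line terminators), so treat it as a rough check and not as a statement about Solr internals.

```python
import re

# Illustrative multi-line input; "nur das will ich sehen" is the German
# sample text from the thread ("I only want to see this").
html = (
    "<html>\r\n<title></title>\r\n"
    "<body>\r\nnur das will ich sehen\r\n</body>\r\n"
    "footer-content\r\n</html>"
)

# The pattern from the thread, without and with the (?s) dot-all flag.
plain = r"^.*<body>(.*)</body>.*$"
dotall = r"(?s)^.*<body>(.*)</body>.*$"

# Without (?s) the dot cannot cross the newline, so the pattern never
# matches and the text passes through unchanged.
unchanged = re.sub(plain, r"\1", html)

# With (?s) the dot also matches line breaks, so only the body remains.
body_only = re.sub(dotall, r"\1", html)

# Inside a character class the dot is a literal '.', which is why
# [\t\r\n.]* does not mean "any character including newlines".
char_class = r"^[\t\r\n.]*<body>([\t\r\n.]*)</body>[\t\r\n.]*$"
class_match = re.search(char_class, html)
```

Here `unchanged` equals the input, `body_only` holds just the body text plus its surrounding line breaks, and `class_match` is `None` because the class only admits tabs, line breaks, and literal dots.
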