I've downloaded curl and tried it in the command prompt and PowerShell on my Win 2008 R2 server; I couldn't get it to work, which is why I used my data importer with a single-line HTML file and copy/pasted the lines into schema.xml.
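For what it's worth, a likely reason curl misbehaved in the command prompt (my assumption; the thread doesn't show the exact failure): cmd.exe doesn't treat single quotes as quoting characters, so Jack's `-d '...'` example has to be requoted with double quotes, escaping the inner ones:

```
curl "localhost:8983/solr/update?commit=true" -H "Content-type:application/json" -d "[{\"id\":\"doc-1\",\"body\":\"abc <body>A test.</body> def\"}]"
```

Double-quoting the query URLs (as Jack's examples already do) also keeps cmd.exe from treating the `&` in them as a command separator. PowerShell does accept single quotes, but its escaping rules for arguments passed to native executables differ again.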
On 9. Sep 2013, at 11:20 PM, Jack Krupansky wrote:

> Did you in fact try my suggested example? If not, please do so.
>
> -- Jack Krupansky
>
> -----Original Message-----
> From: Andreas Owen
> Sent: Monday, September 09, 2013 4:42 PM
> To: solr-user@lucene.apache.org
> Subject: Re: charfilter doesn't do anything
>
> I index HTML pages with a lot of lines, not just a string with the body tag.
> It doesn't work with proper HTML files, even though I took all the new lines out.
>
> html-file:
> <html>nav-content<body> nur das will ich sehen</body>footer-content</html>
>
> solr update debug output:
> "text_html": ["<html>\r\n\r\n<meta name=\"Content-Encoding\" content=\"ISO-8859-1\">\r\n<meta name=\"Content-Type\" content=\"text/html; charset=ISO-8859-1\">\r\n<title></title>\r\n\r\n<body>nav-content nur das will ich sehenfooter-content</body></html>"]
>
> On 8. Sep 2013, at 3:28 PM, Jack Krupansky wrote:
>
>> I tried this and it seems to work when added to the standard Solr example in 4.4:
>>
>> <field name="body" type="text_html_body" indexed="true" stored="true" />
>>
>> <fieldType name="text_html_body" class="solr.TextField" positionIncrementGap="100">
>>   <analyzer>
>>     <charFilter class="solr.PatternReplaceCharFilterFactory"
>>       pattern="^.*<body>(.*)</body>.*$" replacement="$1" />
>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>     <filter class="solr.LowerCaseFilterFactory"/>
>>   </analyzer>
>> </fieldType>
>>
>> That char filter retains only text between <body> and </body>. Is that what you wanted?
>>
>> Indexing this data:
>>
>> curl 'localhost:8983/solr/update?commit=true' -H 'Content-type:application/json' -d '
>> [{"id":"doc-1","body":"abc <body>A test.</body> def"}]'
>>
>> And querying with these commands:
>>
>> curl "http://localhost:8983/solr/select/?q=*:*&indent=true&wt=json"
>> shows all data
>>
>> curl "http://localhost:8983/solr/select/?q=body:test&indent=true&wt=json"
>> shows the body text
>>
>> curl "http://localhost:8983/solr/select/?q=body:abc&indent=true&wt=json"
>> shows nothing (outside of body)
>>
>> curl "http://localhost:8983/solr/select/?q=body:def&indent=true&wt=json"
>> shows nothing (outside of body)
>>
>> curl "http://localhost:8983/solr/select/?q=body:body&indent=true&wt=json"
>> shows nothing, HTML tag stripped
>>
>> In your original query, you didn't show us what your default field (df parameter) was.
>>
>> -- Jack Krupansky
>>
>> -----Original Message-----
>> From: Andreas Owen
>> Sent: Sunday, September 08, 2013 5:21 AM
>> To: solr-user@lucene.apache.org
>> Subject: Re: charfilter doesn't do anything
>>
>> Yes, but that filters out HTML generally, not just the specific tag I want.
>>
>> On 7. Sep 2013, at 7:51 PM, Erick Erickson wrote:
>>
>>> Hmmm, have you looked at:
>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
>>>
>>> Not quite the <body>, perhaps, but might it help?
>>>
>>> On Fri, Sep 6, 2013 at 11:33 AM, Andreas Owen <a...@conx.ch> wrote:
>>>
>>>> OK, I have HTML pages like <html>.....<!--body-->content I want....<!--/body-->.....</html>, and I want to extract (index, store) only what is between the body comments. I thought regexTransformer would be best, because XPath doesn't work in Tika and I can't nest an XPathEntityProcessor to use XPath. What I have also found out is that the HTML parser from Tika cuts my body comments out and tries to produce well-formed HTML, which I would like to switch off.
>>>>
>>>> On 6. Sep 2013, at 5:04 PM, Shawn Heisey wrote:
>>>>
>>>>> On 9/6/2013 7:09 AM, Andreas Owen wrote:
>>>>>> I've managed to get it working if I use the regexTransformer and the string is on the same line in my tika entity. But when the string is multi-line it doesn't work, even though I tried ?s to set the DOTALL flag.
>>>>>>
>>>>>> <entity name="tika" processor="TikaEntityProcessor" url="${rec.url}"
>>>>>>   dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html"
>>>>>>   transformer="RegexTransformer">
>>>>>>   <field column="text_html" regex="<body>(.+)</body>"
>>>>>>     replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text" />
>>>>>> </entity>
>>>>>>
>>>>>> Then I tried it like this and I get a stack overflow:
>>>>>>
>>>>>> <field column="text_html" regex="<body>((.|\n|\r)+)</body>"
>>>>>>   replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text" />
>>>>>>
>>>>>> In JavaScript this works, but maybe only because I used a small string.
>>>>>
>>>>> Sounds like we've got an XY problem here.
>>>>>
>>>>> http://people.apache.org/~hossman/#xyproblem
>>>>>
>>>>> How about you tell us *exactly* what you'd actually like to have happen, and then we can find a solution for you?
>>>>>
>>>>> It sounds a little bit like you're interested in stripping all the HTML tags out. Perhaps the HTMLStripCharFilter?
>>>>>
>>>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
>>>>>
>>>>> Something that I already said: By using the KeywordTokenizer, you won't be able to search for individual words on your HTML input. The entire input string is treated as a single token, and therefore ONLY exact entire-field matches (or certain wildcard matches) will be possible.
>>>>>
>>>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.KeywordTokenizerFactory
>>>>>
>>>>> Note that no matter what you do to your data with the analysis chain, Solr will always return the text that was originally indexed in search results. If you need to affect what gets stored as well, perhaps you need an Update Processor.
>>>>>
>>>>> Thanks,
>>>>> Shawn
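The regex behavior discussed in this thread can be checked outside Solr: PatternReplaceCharFilterFactory and the DIH RegexTransformer both run on java.util.regex, so a plain-Java sketch (class name and sample inputs are mine, not from the thread) shows why the pattern matches single-line input but silently passes multi-line input through, and how (?s) with reluctant groups avoids both the non-match and the deep backtracking that ((.|\n|\r)+) can trigger:

```java
// Sketch only: uses the same java.util.regex engine that Solr's
// PatternReplaceCharFilterFactory and DIH RegexTransformer rely on.
public class BodyRegexDemo {

    // Jack's pattern: fine for single-line input, but '.' does not match
    // line terminators by default, so it never matches a multi-line file
    // and the input comes through unchanged.
    static String extractDefault(String in) {
        return in.replaceAll("^.*<body>(.*)</body>.*$", "$1");
    }

    // (?s) turns on DOTALL so '.' also matches \r and \n; the reluctant
    // (.*?) groups keep backtracking shallow, unlike ((.|\n|\r)+), which
    // can overflow the stack on large inputs.
    static String extractDotall(String in) {
        return in.replaceAll("(?s)^.*?<body>(.*?)</body>.*$", "$1");
    }

    public static void main(String[] args) {
        String one = "abc <body>A test.</body> def";
        String many = "<html>nav\r\n<body>nur das will ich sehen</body>\r\nfooter</html>";

        System.out.println(extractDefault(one));               // A test.
        System.out.println(extractDefault(many).equals(many)); // true: no match, unchanged
        System.out.println(extractDotall(many));               // nur das will ich sehen
    }
}
```

Note that in a real schema.xml the angle brackets in the pattern attribute would need XML-escaping (&lt;body&gt;), since the pattern lives inside an attribute value.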