i tried but that isn't working either, it wants a data stream. i'll have to check how to post json instead of xml.
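
For posting JSON instead of XML without curl, the request can also be built with a few lines of Python. This is only a sketch under assumptions: the URL is the one from the curl command quoted further down (the Solr example server on localhost:8983), and `build_update_request` is a hypothetical helper name, not part of Solr. It builds the request but does not send it.

```python
import json
import urllib.request

# Assumption: same update URL as in the quoted curl command.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update?commit=true"


def build_update_request(docs, url=SOLR_UPDATE_URL):
    """Build (but do not send) a JSON update request for a list of documents."""
    payload = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_update_request(
        [{"id": "doc-1", "body": "abc <body>A test.</body> def"}]
    )
    # Sending needs a running Solr instance, so it is left commented out:
    # with urllib.request.urlopen(req) as resp:
    #     print(resp.status)
    print(req.get_method(), req.full_url)
```

The important part is the `Content-Type: application/json` header together with a JSON array of documents as the request body, which mirrors what the quoted curl command sends.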

On 10. Sep 2013, at 12:52 AM, Jack Krupansky wrote:

> Did you at least try the pattern I gave you?
>
> The point of the curl was the data, not how you send the data. You can just
> use the standard Solr simple post tool.
>
> -- Jack Krupansky
>
> -----Original Message-----
> From: Andreas Owen
> Sent: Monday, September 09, 2013 6:40 PM
> To: solr-user@lucene.apache.org
> Subject: Re: charfilter doesn't do anything
>
> i've downloaded curl and tried it in the command prompt and PowerShell on my
> win 2008r2 server, that's why i used my dataimporter with a single-line html
> file and copy/pasted the lines into schema.xml
>
> On 9. Sep 2013, at 11:20 PM, Jack Krupansky wrote:
>
>> Did you in fact try my suggested example? If not, please do so.
>>
>> -- Jack Krupansky
>>
>> -----Original Message-----
>> From: Andreas Owen
>> Sent: Monday, September 09, 2013 4:42 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: charfilter doesn't do anything
>>
>> i index html pages with a lot of lines and not just a string with the
>> body-tag.
>> it doesn't work with proper html files, even though i took all the newlines
>> out.
>>
>> html-file:
>> <html>nav-content<body> nur das will ich sehen</body>footer-content</html>
>>
>> solr update debug output:
>> "text_html": ["<html>\r\n\r\n<meta name=\"Content-Encoding\"
>> content=\"ISO-8859-1\">\r\n<meta name=\"Content-Type\" content=\"text/html;
>> charset=ISO-8859-1\">\r\n<title></title>\r\n\r\n<body>nav-content nur das
>> will ich sehenfooter-content</body></html>"]
>>
>> On 8. Sep 2013, at 3:28 PM, Jack Krupansky wrote:
>>
>>> I tried this and it seems to work when added to the standard Solr example
>>> in 4.4:
>>>
>>> <field name="body" type="text_html_body" indexed="true" stored="true" />
>>>
>>> <fieldType name="text_html_body" class="solr.TextField"
>>>            positionIncrementGap="100">
>>>   <analyzer>
>>>     <charFilter class="solr.PatternReplaceCharFilterFactory"
>>>                 pattern="^.*&lt;body&gt;(.*)&lt;/body&gt;.*$" replacement="$1" />
>>>     <tokenizer class="solr.StandardTokenizerFactory"/>
>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>   </analyzer>
>>> </fieldType>
>>>
>>> That char filter retains only text between <body> and </body>. Is that what
>>> you wanted?
>>>
>>> Indexing this data:
>>>
>>> curl 'localhost:8983/solr/update?commit=true' \
>>>      -H 'Content-type:application/json' -d '
>>> [{"id":"doc-1","body":"abc <body>A test.</body> def"}]'
>>>
>>> And querying with these commands:
>>>
>>> curl "http://localhost:8983/solr/select/?q=*:*&indent=true&wt=json"
>>> Shows all data
>>>
>>> curl "http://localhost:8983/solr/select/?q=body:test&indent=true&wt=json"
>>> Shows the body text
>>>
>>> curl "http://localhost:8983/solr/select/?q=body:abc&indent=true&wt=json"
>>> Shows nothing (outside of body)
>>>
>>> curl "http://localhost:8983/solr/select/?q=body:def&indent=true&wt=json"
>>> Shows nothing (outside of body)
>>>
>>> curl "http://localhost:8983/solr/select/?q=body:body&indent=true&wt=json"
>>> Shows nothing, HTML tag stripped
>>>
>>> In your original query, you didn't show us what your default field, df
>>> parameter, was.
>>>
>>> -- Jack Krupansky
>>>
>>> -----Original Message-----
>>> From: Andreas Owen
>>> Sent: Sunday, September 08, 2013 5:21 AM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: charfilter doesn't do anything
>>>
>>> yes but that filters html and not the specific tag i want.
>>>
>>> On 7. Sep 2013, at 7:51 PM, Erick Erickson wrote:
>>>
>>>> Hmmm, have you looked at:
>>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
>>>>
>>>> Not quite the <body>, perhaps, but might it help?
>>>>
>>>> On Fri, Sep 6, 2013 at 11:33 AM, Andreas Owen <a...@conx.ch> wrote:
>>>>
>>>>> ok i have html pages with <html>.....<!--body-->content i
>>>>> want....<!--/body-->.....</html>. i want to extract (index, store) only
>>>>> what is between the body-comments. i thought regexTransformer would be
>>>>> the best because xpath doesn't work in tika and i can't nest an
>>>>> XPathEntityProcessor to use xpath. what i have also found out is that the
>>>>> htmlparser from tika cuts my body-comments out and tries to make
>>>>> well-formed html, which i would like to switch off.
>>>>>
>>>>> On 6. Sep 2013, at 5:04 PM, Shawn Heisey wrote:
>>>>>
>>>>>> On 9/6/2013 7:09 AM, Andreas Owen wrote:
>>>>>>> i've managed to get it working if i use the regexTransformer and the
>>>>>>> string is on the same line in my tika entity. but when the string is
>>>>>>> multilined it isn't working even though i tried ?s to set the flag
>>>>>>> dotall.
>>>>>>>
>>>>>>> <entity name="tika" processor="TikaEntityProcessor" url="${rec.url}"
>>>>>>>         dataSource="dataUrl" onError="skip" htmlMapper="identity"
>>>>>>>         format="html" transformer="RegexTransformer">
>>>>>>>   <field column="text_html" regex="&lt;body&gt;(.+)&lt;/body&gt;"
>>>>>>>          replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text" />
>>>>>>> </entity>
>>>>>>>
>>>>>>> then i tried it like this and i get a stackoverflow
>>>>>>>
>>>>>>> <field column="text_html" regex="&lt;body&gt;((.|\n|\r)+)&lt;/body&gt;"
>>>>>>>        replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text" />
>>>>>>>
>>>>>>> in javascript this works but maybe because i only used a small string.
>>>>>>
>>>>>> Sounds like we've got an XY problem here.
>>>>>>
>>>>>> http://people.apache.org/~hossman/#xyproblem
>>>>>>
>>>>>> How about you tell us *exactly* what you'd actually like to have happen
>>>>>> and then we can find a solution for you?
>>>>>>
>>>>>> It sounds a little bit like you're interested in stripping all the HTML
>>>>>> tags out. Perhaps the HTMLStripCharFilter?
>>>>>>
>>>>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory
>>>>>>
>>>>>> Something that I already said: By using the KeywordTokenizer, you won't
>>>>>> be able to search for individual words on your HTML input. The entire
>>>>>> input string is treated as a single token, and therefore ONLY exact
>>>>>> entire-field matches (or certain wildcard matches) will be possible.
>>>>>>
>>>>>> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.KeywordTokenizerFactory
>>>>>>
>>>>>> Note that no matter what you do to your data with the analysis chain,
>>>>>> Solr will always return the text that was originally indexed in search
>>>>>> results. If you need to affect what gets stored as well, perhaps you
>>>>>> need an Update Processor.
>>>>>>
>>>>>> Thanks,
>>>>>> Shawn
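
One possible factor in the single-line versus multi-line behaviour described in this thread (not confirmed by anyone in it) is that `.` does not match line terminators unless the DOTALL flag is set. The sketch below illustrates this with the pattern from the charFilter above. It uses Python's `re` module purely for illustration; Solr itself uses Java's `java.util.regex`, where the embedded `(?s)` flag likewise enables DOTALL. The sample strings and the helper name `extract_body` are made up for this example.

```python
import re

# Pattern from the quoted charFilter, without any flags.
PATTERN = r"^.*<body>(.*)</body>.*$"

# Hypothetical sample inputs: one single-line, one with \r\n line breaks.
single_line = "abc <body>A test.</body> def"
multi_line = "<html>\r\nnav\r\n<body>A\r\ntest.</body>\r\nfooter</html>"


def extract_body(text, pattern):
    """Replace the whole input with capture group 1 if the pattern matches."""
    return re.sub(pattern, r"\1", text)


if __name__ == "__main__":
    # Single-line input: the pattern matches and only the body text remains.
    print(repr(extract_body(single_line, PATTERN)))
    # Multi-line input without DOTALL: no match, the text comes back unchanged.
    print(extract_body(multi_line, PATTERN) == multi_line)
    # Same input with the embedded DOTALL flag: the body text is extracted.
    print(repr(extract_body(multi_line, "(?s)" + PATTERN)))
```

Prefixing the pattern with `(?s)` is usually preferable to an alternation such as `(.|\n|\r)+`, which in Java's regex engine tends to recurse once per matched character and is a plausible cause of the stack overflow reported above on long pages.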