Re: charfilter doesn't do anything

Jack Krupansky Mon, 09 Sep 2013 16:23:09 -0700

Use XML then, although you will need to escape the XML special characters as I did in the pattern.
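
As a rough illustration of that escaping, here is a minimal Python sketch using the standard library's `xml.sax.saxutils.escape`. The variable names are made up for the example; the pattern is the one quoted later in this thread.

```python
from xml.sax.saxutils import escape

# Regex as you would write it outside of XML (pattern taken from this thread).
raw_pattern = "^.*<body>(.*)</body>.*$"

# escape() replaces &, < and > with their XML entities, which is what
# an attribute value in schema.xml needs for these characters.
escaped_pattern = escape(raw_pattern)

print(escaped_pattern)  # ^.*&lt;body&gt;(.*)&lt;/body&gt;.*$
```

The escaped form is what goes into the XML attribute; the XML parser turns it back into the raw regex before the regex engine sees it.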

The point is simply this: quickly find the simplest test scenario that illustrates the problem.


-- Jack Krupansky

-----Original Message----- From: Andreas Owen

Sent: Monday, September 09, 2013 7:05 PM
To: solr-user@lucene.apache.org
Subject: Re: charfilter doesn't do anything

I tried, but that isn't working either; it wants a data stream. I'll have to check how to post JSON instead of XML.


On 10. Sep 2013, at 12:52 AM, Jack Krupansky wrote:

Did you at least try the pattern I gave you?
The point of the curl was the data, not how you send the data. You can just use the standard Solr simple post tool.
-- Jack Krupansky

-----Original Message----- From: Andreas Owen
Sent: Monday, September 09, 2013 6:40 PM
To: solr-user@lucene.apache.org
Subject: Re: charfilter doesn't do anything

I've downloaded curl and tried it in the command prompt and PowerShell on my Win 2008 R2 server; that's why I used my dataimporter with a single-line HTML file and copy/pasted the lines into schema.xml.

On 9. Sep 2013, at 11:20 PM, Jack Krupansky wrote:

Did you in fact try my suggested example? If not, please do so.

-- Jack Krupansky

-----Original Message----- From: Andreas Owen
Sent: Monday, September 09, 2013 4:42 PM
To: solr-user@lucene.apache.org
Subject: Re: charfilter doesn't do anything

I index HTML pages with a lot of lines, not just a string with the body tag. It doesn't work with proper HTML files, even though I took all the newlines out.

HTML file (the German body text means "I only want to see this"):
<html>nav-content<body> nur das will ich sehen</body>footer-content</html>

Solr update debug output:
"text_html": ["<html>\r\n\r\n<meta name=\"Content-Encoding\" content=\"ISO-8859-1\">\r\n<meta name=\"Content-Type\" content=\"text/html;charset=ISO-8859-1\">\r\n<title></title>\r\n\r\n<body>nav-content nur das will ich sehenfooter-content</body></html>"]

On 8. Sep 2013, at 3:28 PM, Jack Krupansky wrote:

I tried this and it seems to work when added to the standard Solr example in 4.4:
<field name="body" type="text_html_body" indexed="true" stored="true" />
<fieldType name="text_html_body" class="solr.TextField" positionIncrementGap="100" >
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="^.*&lt;body&gt;(.*)&lt;/body&gt;.*$" replacement="$1" />
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
That char filter retains only the text between <body> and </body>. Is that what you wanted?
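
What the analyzer does to the sample document can be approximated outside Solr. The following Python sketch is only a stand-in: it assumes Python's `re` module behaves like the Java regex engine for this pattern, it uses `\w+` as a crude substitute for StandardTokenizer, and the function name is hypothetical.

```python
import re

# Pattern and replacement from the charFilter above; Python writes the
# group reference as \1 where the Solr config uses $1.
BODY_PATTERN = re.compile(r"^.*<body>(.*)</body>.*$")


def approximate_index_terms(text):
    """Rough stand-in for the analyzer: char filter, tokenize, lowercase."""
    filtered = BODY_PATTERN.sub(r"\1", text)
    # \w+ is only a crude approximation of StandardTokenizer.
    return [token.lower() for token in re.findall(r"\w+", filtered)]


terms = approximate_index_terms("abc <body>A test.</body> def")
print(terms)  # ['a', 'test']
```

Under these assumptions only "a" and "test" end up as terms, which is why a query for body:test can match while body:abc, body:def and body:body cannot.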
Indexing this data:
curl 'localhost:8983/solr/update?commit=true' -H 'Content-type:application/json' -d '
[{"id":"doc-1","body":"abc <body>A test.</body> def"}]'

And querying with these commands:

curl "http://localhost:8983/solr/select/?q=*:*&indent=true&wt=json";
Shows all data.

curl "http://localhost:8983/solr/select/?q=body:test&indent=true&wt=json";
Shows the body text.

curl "http://localhost:8983/solr/select/?q=body:abc&indent=true&wt=json";
Shows nothing (outside of body).

curl "http://localhost:8983/solr/select/?q=body:def&indent=true&wt=json";
Shows nothing (outside of body).

curl "http://localhost:8983/solr/select/?q=body:body&indent=true&wt=json";
Shows nothing, HTML tag stripped.

In your original query, you didn't show us what your default field (the df parameter) was.
-- Jack Krupansky

-----Original Message----- From: Andreas Owen
Sent: Sunday, September 08, 2013 5:21 AM
To: solr-user@lucene.apache.org
Subject: Re: charfilter doesn't do anything

Yes, but that filters HTML and not the specific tag I want.

On 7. Sep 2013, at 7:51 PM, Erick Erickson wrote:
Hmmm, have you looked at:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory

Not quite the <body>, perhaps, but might it help?


On Fri, Sep 6, 2013 at 11:33 AM, Andreas Owen <a...@conx.ch> wrote:
OK, I have HTML pages with <html>.....content I want.........</html>.
I want to extract (index, store) only what is between the body-comments.
I thought RegexTransformer would be the best because XPath doesn't work
in Tika and I can't nest an XPathEntityProcessor to use XPath. What I
have also found out is that the HTML parser from Tika cuts my
body-comments out and tries to make well-formed HTML, which I would like
to switch off.

On 6. Sep 2013, at 5:04 PM, Shawn Heisey wrote:
On 9/6/2013 7:09 AM, Andreas Owen wrote:
I've managed to get it working if I use the RegexTransformer and the
string is on the same line in my Tika entity. But when the string is
multiline it isn't working, even though I tried ?s to set the DOTALL flag.
<entity name="tika" processor="TikaEntityProcessor" url="${rec.url}"
        dataSource="dataUrl" onError="skip" htmlMapper="identity" format="html"
        transformer="RegexTransformer">
  <field column="text_html" regex="&lt;body&gt;(.+)&lt;/body&gt;"
         replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text" />
</entity>

Then I tried it like this and I get a stack overflow:
<field column="text_html" regex="<body>((.|\n|\r)+)</body>"
replaceWith="QQQQQQQQQQQQQQQ" sourceColName="text"  />
In JavaScript this works, but maybe only because I used a small string.
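
One way to look at the multi-line case: by default `.` does not match a newline, so `<body>(.+)</body>` only matches when both tags sit on one line. The sketch below uses Python's `re` module as a stand-in for the Java regex engine that Solr runs on (an assumption for illustration; the two engines differ in details), and the sample string is made up. It shows the inline `(?s)` flag, placed at the very start of the pattern, letting `.` cross line breaks without the `(.|\n|\r)+` alternation. In Java, that kind of alternation inside a repeated group is commonly reported to cause a StackOverflowError on long inputs.

```python
import re

# Made-up multi-line sample, similar in shape to the Tika output.
html = "<html>\r\n<body>line one\r\nline two</body>\r\n</html>"

# Without DOTALL, "." stops at "\n", so the body is not found.
single_line = re.search(r"<body>(.+)</body>", html)

# With the inline (?s) flag, "." also matches line breaks.
dotall = re.search(r"(?s)<body>(.+)</body>", html)

print(single_line)             # None
print(repr(dotall.group(1)))   # 'line one\r\nline two'
```

If the same idea carries over, the DIH regex would start with `(?s)` before the escaped `&lt;body&gt;`, but that has not been verified against the setup described here.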
Sounds like we've got an XY problem here.

http://people.apache.org/~hossman/#xyproblem

How about you tell us *exactly* what you'd actually like to have happen
and then we can find a solution for you?

It sounds a little bit like you're interested in stripping all the HTML
tags out.  Perhaps the HTMLStripCharFilter?

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.HTMLStripCharFilterFactory

Something that I already said: By using the KeywordTokenizer, you won't
be able to search for individual words on your HTML input. The entire
input string is treated as a single token, and therefore ONLY exact
entire-field matches (or certain wildcard matches) will be possible.

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.KeywordTokenizerFactory
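
The difference between a single-token field and a word-tokenized field can be sketched in a few lines of Python. This is a conceptual illustration only, not Solr code: the sample text and the `matches` helper are hypothetical, and `\w+` merely approximates a word tokenizer.

```python
import re

text = "A test of the body text"

# Keyword-style tokenization: the whole input becomes one token.
keyword_tokens = [text]

# Word-style tokenization (crude approximation of a word tokenizer).
word_tokens = re.findall(r"\w+", text)


def matches(term, tokens):
    """Exact term lookup, the way a simple term query would behave."""
    return term in tokens


print(matches("test", keyword_tokens))  # False
print(matches("test", word_tokens))     # True
print(matches(text, keyword_tokens))    # True
```

With one token per field, only the complete field value matches; individual words match only once the text has been split into word tokens.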
Note that no matter what you do to your data with the analysis chain,
Solr will always return the text that was originally indexed in search
results.  If you need to affect what gets stored as well, perhaps you
need an Update Processor.

Thanks,
Shawn
