[ https://issues.apache.org/jira/browse/SOLR-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733901#comment-13733901 ]
Uwe Schindler commented on SOLR-4679:
-------------------------------------

bq. I never said that ...

You somehow said:

bq. I defer to your judgement on this

So I assumed that you were still not 100% convinced. Sorry. In any case, I will take the issue.

In my opinion there is more work to be done with this crazy stack of StringBuilders to better handle the ignorableWhitespace when a new field begins/ends. Currently it's inserted after the block end tag, so it would only go one level up in the stack. I have to think a little bit about it, but the fix in your patch is the easiest for now. And the possibly useless whitespace on some lower stacked StringBuilders is generally removed by text analysis.

> HTML line breaks (<br>) are removed during indexing; causes wrong search results
> --------------------------------------------------------------------------------
>
>                 Key: SOLR-4679
>                 URL: https://issues.apache.org/jira/browse/SOLR-4679
>             Project: Solr
>          Issue Type: Bug
>          Components: update
>    Affects Versions: 4.2
>         Environment: Windows Server 2008 R2, Java 6, Tomcat 7
>            Reporter: Christoph Straßer
>            Assignee: Uwe Schindler
>         Attachments: external.htm, SOLR-4679__weird_TIKA-1134.patch,
>                      Solr_HtmlLineBreak_Linz_NotFound.png, Solr_HtmlLineBreak_Vienna.png
>
>
> HTML line breaks (<br>, <BR>, <br/>, ...) seem to be removed during
> extraction of content from HTML files. They need to be replaced with an empty
> space.
> Test file:
> <html>
> <head>
> <title>Test mit HTML-Zeilenschaltungen</title>
> </head>
> <p>
> word1<br>word2<br/>
> Some other words, a special name like linz<br>and another special name -
> vienna
> </p>
> </html>
> The Solr content attribute contains the following text:
> Test mit HTML-Zeilenschaltungen
> word1word2
> Some other words, a special name like linzand another special name - vienna
> So we are not able to find the word "linz".
> We use the ExtractingRequestHandler to put content into Solr.
> (wiki.apache.org/solr/ExtractingRequestHandler)
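
For anyone following along, the reported symptom can be reproduced outside Solr with a plain SAX handler. The sketch below is only an illustration and is not the code in Solr or Tika: the class and method names (BrWhitespaceSketch, TextCollector, extract) are made up, it uses the JDK's XML parser, and it therefore assumes well-formed XHTML input rather than real-world HTML. It shows that collecting only character events glues the words around a <br/> together, while emitting a single space for the element keeps them as separate tokens.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch; NOT the actual Solr/Tika content handler.
class BrWhitespaceSketch {

    /** Collects character data and optionally emits a space for each br element. */
    static final class TextCollector extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private final boolean spaceForBr;

        TextCollector(boolean spaceForBr) {
            this.spaceForBr = spaceForBr;
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            // Without this, the text before and after <br/> is concatenated.
            if (spaceForBr && "br".equalsIgnoreCase(qName)) {
                text.append(' ');
            }
        }

        String text() {
            return text.toString();
        }
    }

    /** Parses well-formed XHTML and returns the collected text. */
    static String extract(String xhtml, boolean spaceForBr) throws Exception {
        TextCollector handler = new TextCollector(spaceForBr);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.text();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<p>word1<br/>word2 linz<br/>and vienna</p>";
        System.out.println(extract(xhtml, false)); // words around <br/> are merged
        System.out.println(extract(xhtml, true));  // words around <br/> stay separate
    }
}
```

With the flag off, "linz" and "and" end up in one token, which matches the "linzand" text in the report; with the flag on, a whitespace tokenizer would see "linz" on its own. How that space should be routed through a stack of per-field StringBuilders is a separate question that this sketch does not attempt to answer.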