[jira] [Commented] (SOLR-4679) HTML line breaks (<br>) are removed during indexing; causes wrong search results

Uwe Schindler (JIRA) Thu, 08 Aug 2013 12:52:52 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733901#comment-13733901
 ]


Uwe Schindler commented on SOLR-4679:
-------------------------------------

bq. I never said that ...

You somehow said:

bq. I defer to your judgement on this

So I assumed that you are still not 100% convinced. Sorry.

In any case I will take the issue. In my opinion there is more work to be done 
with this crazy stack of StringBuilders to better handle the ignorableWhitespace 
when a new field begins/ends. Currently it's inserted after the block end tag, 
so it would go only one level up in the stack. I have to think a little bit 
about it, but the fix in your patch is the easiest for now. And the possibly 
useless whitespace on some lower-stacked StringBuilders is generally removed by 
text analysis.
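
To illustrate the idea of a stack of StringBuilders that keeps ignorable whitespace, here is a minimal sketch. It is not the actual Solr handler or the attached patch: the class name `StackedTextHandler`, the `text()` and `demo()` helpers, and the assumption that a <br> reaches the handler as an ignorableWhitespace event are all hypothetical and only serve the illustration. It uses only the JDK's SAX API.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch, not the real Solr content handler.
class StackedTextHandler extends DefaultHandler {
    // The head of the deque collects text for the innermost open element.
    private final Deque<StringBuilder> stack = new ArrayDeque<>();

    StackedTextHandler() {
        stack.push(new StringBuilder()); // root builder, never popped
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        stack.push(new StringBuilder());
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Fold the finished element's text into its parent, one level up.
        StringBuilder finished = stack.pop();
        stack.peek().append(finished);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        stack.peek().append(ch, start, length);
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        // DefaultHandler ignores this event; here the whitespace is kept
        // so that adjacent words do not run together.
        stack.peek().append(ch, start, length);
    }

    // Text collected in the root builder.
    String text() {
        return stack.peekLast().toString();
    }

    // Feeds the events by hand; assumes a <br> arrives as a newline
    // passed to ignorableWhitespace.
    static String demo() {
        StackedTextHandler h = new StackedTextHandler();
        h.startElement("", "p", "p", null);
        h.characters("linz".toCharArray(), 0, 4);
        h.ignorableWhitespace("\n".toCharArray(), 0, 1);
        h.characters("and".toCharArray(), 0, 3);
        h.endElement("", "p", "p");
        return h.text();
    }
}
```

In this sketch the whitespace lands in whichever builder is on top when the event fires, which is why its position relative to an element's end tag decides which level of the stack receives it.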
                
> HTML line breaks (<br>) are removed during indexing; causes wrong search 
> results
> --------------------------------------------------------------------------------
>
>                 Key: SOLR-4679
>                 URL: https://issues.apache.org/jira/browse/SOLR-4679
>             Project: Solr
>          Issue Type: Bug
>          Components: update
>    Affects Versions: 4.2
>         Environment: Windows Server 2008 R2, Java 6, Tomcat 7
>            Reporter: Christoph Straßer
>            Assignee: Uwe Schindler
>         Attachments: external.htm, SOLR-4679__weird_TIKA-1134.patch, 
> Solr_HtmlLineBreak_Linz_NotFound.png, Solr_HtmlLineBreak_Vienna.png
>
>
> HTML line breaks (<br>, <BR>, <br/>, ...) seem to be removed during 
> extraction of content from HTML-Files. They need to be replaced with an empty 
> space.
> Test-File:
> <html>
> <head>
> <title>Test mit HTML-Zeilenschaltungen</title>
> </head>
> <p>
> word1<br>word2<br/>
> Some other words, a special name like linz<br>and another special name - 
> vienna
> </p>
> </html>
> The Solr-content-attribute contains the following text:
> Test mit HTML-Zeilenschaltungen    
> word1word2
> Some other words, a special name like linzand another special name - vienna
> So we are not able to find the word "linz".
> We use the ExtractingRequestHandler to put content into Solr. 
> (wiki.apache.org/solr/ExtractingRequestHandler)
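
The search failure described in the report can be reproduced without Solr. The sketch below uses a naive whitespace tokenizer as a stand-in for real text analysis; the class name `LineBreakTokens` and its `tokens` method are hypothetical and not part of Solr or Lucene.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for text analysis: lowercase, split on whitespace.
class LineBreakTokens {
    static List<String> tokens(String text) {
        return Arrays.asList(text.trim().toLowerCase().split("\\s+"));
    }
}
```

With the line break dropped, `tokens("a special name like linzand another")` yields the single token "linzand", so a query for "linz" has nothing to match. With the break replaced by whitespace, "linz" and "and" become separate tokens.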

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

