RE: Solr Encoding Issue?
Shawn - Stupid coding error in my java code. Used default charset. Changed to UTF-8 and problem fixed.

Thanks again!

-----Original Message-----
From: Tarala, Magesh
Sent: Wednesday, July 08, 2015 8:11 PM
To: solr-user@lucene.apache.org
Subject: RE: Solr Encoding Issue?

Wow, that makes total sense. Thanks Shawn!! I'll go down this path.

Thanks,
Magesh

-----Original Message-----
From: Shawn Heisey [mailto:apa...@elyograg.org]
Sent: Wednesday, July 08, 2015 7:24 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr Encoding Issue?

On 7/8/2015 6:09 PM, Tarala, Magesh wrote:
> I believe the issue is in solr. The character “à” is getting stored in solr
> as “Ã ”. Notice the space after Ã.
>
> I'm using solrj to ingest the documents into solr. So, one of those could be
> the culprit?

Solr accepts and outputs text in UTF-8. The UTF-8 hex encoding for the à character is C3A0. In the latin1 character set, hex C3 is the Ã character. Similarly, in latin1, hex A0 is a non-breaking space.

So it sounds like your input is encoded as UTF-8, therefore that character in your input source is hex c3a0, but something in your indexing process is incorrectly interpreting the UTF-8 representation as latin1, so it sees it as "Ã ". SolrJ is faithfully converting that input to UTF-8 and sending it to Solr.

Thanks,
Shawn
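The fix reported at the top of this message (replacing the platform default charset with UTF-8) might look roughly like the sketch below. The poster's actual code was never shown, so everything here is an assumption for illustration: the class name ExplicitCharsetRead, its methods, and the idea that the file was read fully into memory before being handed to SolrJ. Only standard JDK calls are used.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper; not taken from the original poster's code.
class ExplicitCharsetRead {

    // Decode raw bytes as UTF-8 no matter what the JVM default charset is.
    static String decodeUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    // Error-prone variant: the result depends on the platform default,
    // which may be a single-byte charset such as latin1 or windows-1252.
    static String decodeWithDefault(byte[] raw) {
        return new String(raw, Charset.defaultCharset());
    }

    // Read a whole file and decode it explicitly as UTF-8.
    static String readUtf8(Path file) throws IOException {
        return decodeUtf8(Files.readAllBytes(file));
    }

    public static void main(String[] args) throws IOException {
        // Write a small UTF-8 sample file; \u00E0 is the character à.
        Path tmp = Files.createTempFile("crm-sample", ".txt");
        Files.write(tmp, "Enter Data in TK Only\u00E0".getBytes(StandardCharsets.UTF_8));

        String text = readUtf8(tmp);
        System.out.println(text.equals("Enter Data in TK Only\u00E0")); // prints "true"

        Files.delete(tmp);
    }
}
```

The same idea applies to readers: passing StandardCharsets.UTF_8 to InputStreamReader or Files.newBufferedReader avoids relying on the default charset.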
RE: Solr Encoding Issue?
Wow, that makes total sense. Thanks Shawn!! I'll go down this path.

Thanks,
Magesh

-----Original Message-----
From: Shawn Heisey [mailto:apa...@elyograg.org]
Sent: Wednesday, July 08, 2015 7:24 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr Encoding Issue?

On 7/8/2015 6:09 PM, Tarala, Magesh wrote:
> I believe the issue is in solr. The character “à” is getting stored in solr
> as “Ã ”. Notice the space after Ã.
>
> I'm using solrj to ingest the documents into solr. So, one of those could be
> the culprit?

Solr accepts and outputs text in UTF-8. The UTF-8 hex encoding for the à character is C3A0. In the latin1 character set, hex C3 is the Ã character. Similarly, in latin1, hex A0 is a non-breaking space.

So it sounds like your input is encoded as UTF-8, therefore that character in your input source is hex c3a0, but something in your indexing process is incorrectly interpreting the UTF-8 representation as latin1, so it sees it as "Ã ". SolrJ is faithfully converting that input to UTF-8 and sending it to Solr.

Thanks,
Shawn
Re: Solr Encoding Issue?
On 7/8/2015 6:09 PM, Tarala, Magesh wrote:
> I believe the issue is in solr. The character “à” is getting stored in solr
> as “Ã ”. Notice the space after Ã.
>
> I'm using solrj to ingest the documents into solr. So, one of those could be
> the culprit?

Solr accepts and outputs text in UTF-8. The UTF-8 hex encoding for the à character is C3A0. In the latin1 character set, hex C3 is the Ã character. Similarly, in latin1, hex A0 is a non-breaking space.

So it sounds like your input is encoded as UTF-8, therefore that character in your input source is hex c3a0, but something in your indexing process is incorrectly interpreting the UTF-8 representation as latin1, so it sees it as "Ã ". SolrJ is faithfully converting that input to UTF-8 and sending it to Solr.

Thanks,
Shawn
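The byte-level reasoning above can be reproduced in a few lines of Java. This is only a sketch of the suspected failure mode, not the poster's actual indexing code; the names MojibakeDemo, misdecode, and hex are made up for illustration. It encodes à as UTF-8, decodes those bytes as latin1 (ISO-8859-1), and re-encodes the result as UTF-8, which is roughly what would end up being sent to Solr.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; names are illustrative only.
class MojibakeDemo {

    // Take the UTF-8 bytes of a string and wrongly decode them as latin1.
    static String misdecode(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Render bytes as upper-case hex, two digits per byte.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "\u00E0"; // the character à

        // UTF-8 encoding of à
        System.out.println(hex(original.getBytes(StandardCharsets.UTF_8))); // C3A0

        // Wrongly decoded as latin1: Ã followed by a non-breaking space
        String garbled = misdecode(original);
        System.out.println(String.format("U+%04X U+%04X",
                (int) garbled.charAt(0), (int) garbled.charAt(1))); // U+00C3 U+00A0

        // Re-encoding the garbled text as UTF-8 yields four bytes instead of two
        System.out.println(hex(garbled.getBytes(StandardCharsets.UTF_8))); // C383C2A0
    }
}
```

Seeing the four-byte sequence C383C2A0 in the indexed data where C3A0 was expected would be consistent with this kind of double encoding.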
RE: Solr Encoding Issue?
Thanks Erick.

I believe the issue is in solr. The character “à” is getting stored in solr as “Ã ”. Notice the space after Ã.

I'm using solrj to ingest the documents into solr. So, one of those could be the culprit?

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Wednesday, July 08, 2015 1:36 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr Encoding Issue?

Attachments are pretty aggressively stripped by the e-mail server, so there's nothing to see, you'll have to paste it somewhere else and provide a link.

Usually, though, this is a character set issue with the browser using a different charset than Solr, it's really the same character, just displayed differently. Shot in the dark though.

Erick

On Wed, Jul 8, 2015 at 10:49 AM, Tarala, Magesh wrote:
> I’m ingesting a .TXT file with HTML content into Solr. The content
> has the following character highlighted below:
>
> The file we get from CRM (also attached):
>
> [image: cid:image001.png@01D0B972.75BE23F0]
>
> After ingesting into solr, I see a different character. This is query
> response from solr management console.
>
> [image: cid:image003.png@01D0B972.D1AED290]
>
> Anybody know how I can prevent this from happening?
>
> Thanks!
RE: Solr Encoding Issue?
Looks like images did not come through. Here's the text...

I'm ingesting a .TXT file with HTML content into Solr. The content has the following character highlighted below:

The file we get from CRM (also attached):

Enter Data in TK Onlyà

After ingesting into solr, I see a different character. This is query response from solr management console.

Enter Data in TK OnlyÃ 

I'm expecting to see à
But I'm seeing Ã 

Anybody know how I can prevent this from happening?

Thanks!
Re: Solr Encoding Issue?
Attachments are pretty aggressively stripped by the e-mail server, so there's nothing to see, you'll have to paste it somewhere else and provide a link.

Usually, though, this is a character set issue with the browser using a different charset than Solr, it's really the same character, just displayed differently. Shot in the dark though.

Erick

On Wed, Jul 8, 2015 at 10:49 AM, Tarala, Magesh wrote:
> I’m ingesting a .TXT file with HTML content into Solr. The content has
> the following character highlighted below:
>
> The file we get from CRM (also attached):
>
> [image: cid:image001.png@01D0B972.75BE23F0]
>
> After ingesting into solr, I see a different character. This is query
> response from solr management console.
>
> [image: cid:image003.png@01D0B972.D1AED290]
>
> Anybody know how I can prevent this from happening?
>
> Thanks!