RE: Solr Encoding Issue?

2015-07-08 Thread Tarala, Magesh
Shawn - Stupid coding error in my Java code. I used the default charset. 
Changed it to UTF-8 and the problem is fixed. 

Thanks again!
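
For anyone hitting the same symptom, here is a minimal sketch of the kind of fix described above. The thread does not show the actual code, so this assumes the file was being read through a Reader created without an explicit charset (for example `new InputStreamReader(in)`, which falls back to the JVM default). The class name `ExplicitCharsetRead` and the method `roundTrip` are hypothetical, and ISO-8859-1 stands in for a non-UTF-8 platform default.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ExplicitCharsetRead {

    // Writes text to a temp file as UTF-8, then reads the first line back
    // using readCharset. Passing the charset explicitly is the point:
    // omitting it makes the result depend on the JVM's default charset.
    public static String roundTrip(String text, Charset readCharset) {
        try {
            Path file = Files.createTempFile("crm-sample", ".txt");
            try {
                Files.write(file, text.getBytes(StandardCharsets.UTF_8));
                try (BufferedReader reader = new BufferedReader(
                        new InputStreamReader(Files.newInputStream(file), readCharset))) {
                    return reader.readLine();
                }
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String line = "Enter Data in TK Only\u00E0"; // ends with a-grave (U+00E0)

        // Correct: the file is UTF-8 and is decoded as UTF-8.
        String good = roundTrip(line, StandardCharsets.UTF_8);
        // Wrong: same bytes decoded as Latin-1 (simulated bad default).
        String bad = roundTrip(line, StandardCharsets.ISO_8859_1);

        System.out.println(good.equals(line));
        System.out.println(bad.endsWith("\u00C3\u00A0")); // A-tilde + NBSP
    }
}
```

The string handed to SolrJ is only as good as this decoding step. Once the text has been decoded with the wrong charset, SolrJ will send the already-garbled characters to Solr as valid UTF-8.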

-Original Message-
From: Tarala, Magesh 
Sent: Wednesday, July 08, 2015 8:11 PM
To: solr-user@lucene.apache.org
Subject: RE: Solr Encoding Issue?

Wow, that makes total sense. Thanks Shawn!! 

I'll go down this path. 

Thanks,
Magesh




RE: Solr Encoding Issue?

2015-07-08 Thread Tarala, Magesh
Wow, that makes total sense. Thanks Shawn!! 

I'll go down this path. 

Thanks,
Magesh

-Original Message-
From: Shawn Heisey [mailto:apa...@elyograg.org] 
Sent: Wednesday, July 08, 2015 7:24 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr Encoding Issue?

On 7/8/2015 6:09 PM, Tarala, Magesh wrote:
> I believe the issue is in solr. The character “à” is getting stored in solr 
> as “Ã ”. Notice the space after Ã.
>
> I'm using solrj to ingest the documents into solr. So, one of those could be 
> the culprit?

Solr accepts and outputs text in UTF-8.  The UTF-8 hex encoding for the à 
character is C3A0.

In the latin1 character set, hex C3 is the Ã character.  Similarly, in latin1, 
hex A0 is a non-breaking space.

So it sounds like your input is encoded as UTF-8, therefore that character in 
your input source is hex c3a0, but something in your indexing process is 
incorrectly interpreting the UTF-8 representation as latin1, so it sees it as 
"Ã ".

SolrJ is faithfully converting that input to UTF-8 and sending it to Solr.

Thanks,
Shawn



Re: Solr Encoding Issue?

2015-07-08 Thread Shawn Heisey
On 7/8/2015 6:09 PM, Tarala, Magesh wrote:
> I believe the issue is in solr. The character “à” is getting stored in solr 
> as “Ã ”. Notice the space after Ã.
>
> I'm using solrj to ingest the documents into solr. So, one of those could be 
> the culprit?

Solr accepts and outputs text in UTF-8.  The UTF-8 hex encoding for the
à character is C3A0.

In the latin1 character set, hex C3 is the Ã character.  Similarly, in
latin1, hex A0 is a non-breaking space.

So it sounds like your input is encoded as UTF-8, therefore that
character in your input source is hex c3a0, but something in your
indexing process is incorrectly interpreting the UTF-8 representation as
latin1, so it sees it as "Ã ".

SolrJ is faithfully converting that input to UTF-8 and sending it to Solr.

Thanks,
Shawn
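
The byte-level explanation above can be reproduced in a few lines of Java. This is an illustrative sketch, not code from the thread; `MojibakeDemo` and its methods are hypothetical names. It encodes the character as UTF-8 and then deliberately decodes the same bytes as Latin-1 (ISO-8859-1).

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {

    // Encode as UTF-8, then (incorrectly) decode the same bytes as Latin-1.
    public static String misdecodeAsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Upper-case hex dump of a byte array, two digits per byte.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "\u00E0"; // a-grave, written as an escape on purpose

        // UTF-8 bytes of the character: C3A0
        System.out.println(hex(original.getBytes(StandardCharsets.UTF_8)));

        // Each UTF-8 byte becomes its own Latin-1 character:
        // U+00C3 (A-tilde) followed by U+00A0 (non-breaking space).
        String garbled = misdecodeAsLatin1(original);
        for (char c : garbled.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

The non-ASCII characters are written as `\u` escapes so that the demo itself does not depend on the encoding used to compile the source file.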



RE: Solr Encoding Issue?

2015-07-08 Thread Tarala, Magesh
Thanks Erick.

I believe the issue is in solr. The character “à” is getting stored in solr as 
“Ã ”. Notice the space after Ã.

I'm using solrj to ingest the documents into solr. So, one of those could be 
the culprit?


-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Wednesday, July 08, 2015 1:36 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr Encoding Issue?

Attachments are pretty aggressively stripped by the e-mail server, so there's 
nothing to see. You'll have to paste it somewhere else and provide a link.

Usually, though, this is a character set issue: the browser is using a 
different charset than Solr. It's really the same character, just displayed 
differently.

Shot in the dark though.

Erick



RE: Solr Encoding Issue?

2015-07-08 Thread Tarala, Magesh
Looks like images did not come through. Here's the text...


I'm ingesting a .TXT file with HTML content into Solr. The content has the 
following character highlighted below:
The file we get from CRM (also attached):
Enter Data in TK Onlyà



After ingesting into solr, I see a different character. This is query response 
from solr management console.
Enter Data in TK OnlyÃ 



I'm expecting to see à
But I'm seeing Ã 

Anybody know how I can prevent this from happening?

Thanks!


Re: Solr Encoding Issue?

2015-07-08 Thread Erick Erickson
Attachments are pretty aggressively stripped by the e-mail server, so
there's nothing to see. You'll have to paste it somewhere else and
provide a link.

Usually, though, this is a character set issue: the browser is using a
different charset than Solr. It's really the same character, just
displayed differently.

Shot in the dark though.

Erick

On Wed, Jul 8, 2015 at 10:49 AM, Tarala, Magesh  wrote:

>  I’m ingesting a .TXT file with HTML content into Solr. The content has
> the following character highlighted below:
>
> The file we get from CRM (also attached):
>
> [image: cid:image001.png@01D0B972.75BE23F0]
>
> After ingesting into solr, I see a different character. This is query
> response from solr management console.
>
> [image: cid:image003.png@01D0B972.D1AED290]
>
> Anybody know how I can prevent this from happening?
>
> Thanks!
>