Don't feel bad: character encoding problems are often said to be among the hardest in software engineering.

There's no simple answer to problems like this since, as Erick said, any tool in your chain could be the culprit. I doubt anyone on this list will be able to guess "the answer" since the question itself hasn't been properly pinned down yet.

My advice is to start as far upstream as you can (where you acquire the data), and make sure you understand how it is encoded. Keep in mind that *it may not be encoded consistently*. Just because it may be declared to be UTF-8, or Shift-JIS, or something, doesn't mean that the characters are actually going to come out sensibly when interpreted in that encoding. You may just be getting garbage.

However, assuming that's not the case, you should be able to determine the character set somehow: look at the HTTP headers; look at the characters themselves. If it's HTML or XML, look at the encoding that may be declared in the beginning of the file itself (in the XML declaration).

Keep in mind that when you look at these things, you are looking at them through the lens of a tool (wget, or Java's HTTP API, your shell, or a text editor) that will have applied its own processing to the characters. My advice is to use a low-level tool like wget, and maybe od or some other hex character-dumper as a sanity check. Maybe try a few different tools to make sure they agree. Understand all the character-set-related options in your tools so that you can try different settings. Learn about character encodings so you can recognize the byte patterns. In the end, you will only be successful if you master your tools.
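As a rough illustration of the hex-dump sanity check, here is a minimal Java sketch that prints raw bytes as hex and tests whether they form well-formed UTF-8. The class and method names (ByteCheck, hexDump, isValidUtf8) are made up for this example, and the sample bytes are hard-coded; in practice you would feed in the bytes saved by wget or read from the response stream before any decoding has happened.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library mentioned in this thread.
class ByteCheck {

    /** Renders bytes as space-separated two-digit hex, like a minimal od. */
    public static String hexDump(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    /** Returns true only if the bytes decode as UTF-8 without malformed input. */
    public static boolean isValidUtf8(byte[] data) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // U+65E5 U+672C encoded as UTF-8.
        byte[] utf8 = "\u65e5\u672c".getBytes(StandardCharsets.UTF_8);
        // Assumption: these four bytes are the same two characters in Shift-JIS.
        byte[] sjis = {(byte) 0x93, (byte) 0xfa, (byte) 0x96, (byte) 0x7b};

        System.out.println(hexDump(utf8) + "  valid UTF-8? " + isValidUtf8(utf8));
        System.out.println(hexDump(sjis) + "  valid UTF-8? " + isValidUtf8(sjis));
    }
}
```

A strict check like this only tells you whether the bytes could be UTF-8; it cannot prove which other encoding they are in, so it complements rather than replaces looking at the headers and the byte patterns.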

Good luck!

-Mike Sokolov

On 11/9/13 2:20 PM, Chris wrote:
I have tried a lot of things and am almost at my wit's end :(


Here is the code I used to get the strings -

String htmlContent = readPage(page.getWebURL().getURL());

I even tried -
Document doc = Jsoup.parse(new URL(url).openStream(), "UTF-8", url);
String htmlContent = doc.html();

and
Document doc = Jsoup.parse(htmlContent, "UTF-8");

No improvement so far, any advice for me please?



function that gets the html ----------------------------------------
public static String readPage(String urlString) {
    try {
        URL url = new URL(urlString);
        DefaultHttpClient client = new DefaultHttpClient();
        client.getParams().setParameter(ClientPNames.COOKIE_POLICY,
                CookiePolicy.BROWSER_COMPATIBILITY);

        HttpGet request = new HttpGet(url.toURI());
        HttpResponse response = client.execute(request);

        if (response.getStatusLine().getStatusCode() == 200
                && response.getEntity().getContentType().toString().contains("text/html")) {
            Reader reader = null;
            try {
                reader = new InputStreamReader(response.getEntity().getContent());

                StringBuffer sb = new StringBuffer();
                {
                    int read;
                    char[] cbuf = new char[1024];
                    while ((read = reader.read(cbuf)) != -1)
                        sb.append(cbuf, 0, read);
                }

                return sb.toString();
            } finally {
                if (reader != null) {
                    try {
                        reader.close();
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
            }
        } else
            return "";
    } catch (Exception e) {
        return "";
    }
}

---------------------------------------------------------------------------



On Wed, Nov 6, 2013 at 2:53 AM, T. Kuro Kurosaka <k...@healthline.com> wrote:

It sounds like the characters were mishandled at index build time.
I would use Luke to see whether a character that appears correctly
when you change the output to Shift-JIS is actually
stored as one Unicode character. I bet it's stored as two characters,
each holding the value that happened to be
the high or low byte of the Shift-JIS character.
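
To make the "one character stored as two" failure concrete, here is a small Java sketch. It assumes that 0x93 0xFA is the Shift-JIS encoding of U+65E5, and it decodes those bytes as ISO-8859-1, which maps every byte to the code point of the same value. The class and method names are invented for the example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not taken from the thread.
class MisdecodeDemo {

    /** Decodes raw bytes as ISO-8859-1, as a tool with the wrong default would. */
    public static String decodeAsLatin1(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Assumption: one Japanese character, U+65E5, encoded as Shift-JIS.
        byte[] sjisBytes = {(byte) 0x93, (byte) 0xFA};

        String wrong = decodeAsLatin1(sjisBytes);

        // One intended character has become two unrelated ones.
        System.out.println("length = " + wrong.length()); // length = 2
        for (int i = 0; i < wrong.length(); i++) {
            System.out.printf("U+%04X%n", (int) wrong.charAt(i));
        }
    }
}
```

If the index contains pairs like U+0093 U+00FA where a single CJK character was expected, that pattern points at a wrong decode somewhere before the document reached Solr.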

There are many possible causes of this. If you are indexing
HTML documents from HTTP servers, the HTTP server may
be configured to send the wrong charset= info in the Content-Type
header. If the document comes directly from a file system
and doesn't have a META header declaring
the charset, then the system assumes a default charset,
which is typically ISO-8859-1 or UTF-8, and misinterprets
Shift-JIS encoded characters.
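
One way to see what the server is declaring is to pull the charset= parameter out of the Content-Type header value yourself. The sketch below is a simplified, hand-rolled parser with made-up names (ContentTypeCharset, charsetOf); it is not the HttpClient API and does not handle every legal header form.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper for illustration only.
class ContentTypeCharset {

    /**
     * Returns the charset named in a Content-Type value such as
     * "text/html; charset=Shift_JIS", or the fallback if none is
     * declared or the JVM does not know the name.
     */
    public static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Covers both illegal and unsupported charset names.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("text/html; charset=UTF-8", StandardCharsets.ISO_8859_1));
        System.out.println(charsetOf("text/html", StandardCharsets.ISO_8859_1));
    }
}
```

The result could then be passed as the second argument to new InputStreamReader(stream, charset). The readPage function quoted in this thread constructs its InputStreamReader without a charset, so it decodes with the JVM's default charset regardless of what the server declares. Remember that the declared charset can itself be wrong, so it is worth comparing against the raw bytes.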

You need to debug to find out where the characters
get corrupted.


On 11/04/2013 11:15 PM, Chris wrote:

Sorry, was away a bit & hence the delay.

I am inserting Java strings into a Java bean class, and then calling the
addBean() method to insert the POJO into Solr.

When I query using either Tomcat or Jetty, I get these special characters.
But I have noticed that if I change the output to "Shift-JIS" encoding, those
characters appear as what I think are Japanese characters.

But this solution doesn't work for all special characters, as I can
still see some of them... Isn't there an encoding that can cover all the
characters, whatever they might be? Any ideas on what I should do?

Regards,
Chris


On Mon, Nov 4, 2013 at 6:27 PM, Erick Erickson <erickerick...@gmail.com>
wrote:

The problem is there are about a dozen places where the character
encoding can be misconfigured. The problem you're seeing above
actually looks like a problem with the character set configured in
your browser; it may have nothing to do with what's actually in Solr.

You might write a small SolrJ program and see if you can dump the contents
in binary and examine them to see...
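
A sketch of the dumping half of that idea follows. It deliberately leaves out the SolrJ query, so the string passed in is assumed to be a field value you have already fetched from Solr; the class and method names are invented for the example.

```java
// Hypothetical helper; fetching the value from Solr is left out on purpose.
class CodePointDump {

    /** Lists every Unicode code point in the string as U+XXXX, space-separated. */
    public static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for a stored field value: "won" + U+2019 + "t", then two U+FFFD.
        String fieldValue = "won\u2019t \uFFFD\uFFFD";
        System.out.println(dump(fieldValue));
    }
}
```

If the dump shows U+FFFD (the Unicode replacement character) inside the stored value, the text was already damaged before or during indexing. If the stored code points look right, the browser or output settings are the more likely suspect.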

Best
Erick


On Sun, Nov 3, 2013 at 6:39 AM, Rajani Maski <rajinima...@gmail.com>
wrote:

How are you extracting the text from the website[1] you are
referring to? Apache Nutch or some other crawler? If so, first check
whether that crawler is giving you data in the correct format before
you invoke the Solr index method.

[1]http://blog.diigo.com/2009/09/28/scheduled-groups-maintenance/

URI encoding should resolve this problem.




On Fri, Nov 1, 2013 at 10:50 AM, Chris <christu...@gmail.com> wrote:

Hi Rajani,
I followed the steps exactly as in

http://zensarteam.wordpress.com/2011/11/25/6-steps-to-configure-solr-on-apache-tomcat-7-0-20/

However, when I send a query to this new instance in Tomcat, I again get
the error -
    <str name="fulltxt">Scheduled Groups Maintenance
In preparation for the new release roll-out,���� Diigo groups won’t be
accessible on Sept 28 (Mon) around midnight 0:00 PST for several
hours.
Stay tuned to say hello to Diigo V4 soon!

location of the text  -
http://blog.diigo.com/2009/09/28/scheduled-groups-maintenance/

same problem at - http://cn.nytimes.com/business/20130926/c26alibaba/

All text in title comes like -

������������������������������������ - ���������������������
������������</str>
      <arr name="text">
        <str>������������������������������������ -
��������������������� ������������</str>
      </arr>


Can you please advise?

Chris




On Tue, Oct 29, 2013 at 11:33 PM, Rajani Maski <rajinima...@gmail.com> wrote:
Hi,

If you are using Apache Tomcat Server, I hope you are not missing the
configuration below:
   <Connector port="port Number" protocol="HTTP/1.1"
              connectionTimeout="20000"
              redirectPort="8443" URIEncoding="UTF-8"/>

I faced a similar issue with Chinese characters and resolved it with the
above config.

Links for reference :


http://zensarteam.wordpress.com/2011/11/25/6-steps-to-configure-solr-on-apache-tomcat-7-0-20/

http://blog.sidu.in/2007/05/tomcat-and-utf-8-encoded-uri-parameters.html#.Um_3P3Cw2X8

Thanks



On Tue, Oct 29, 2013 at 9:20 PM, Chris <christu...@gmail.com> wrote:

  Hi All,
I get characters like -

������������������ - CTA������������ -

in the solr index. I am adding Java beans to solr by the addBean()
function.

This seems to be a character encoding issue. Any pointers on how to
resolve this one?

I have seen that this occurs mostly for Japanese and Chinese characters.
--
-----------------------------------------
T. "Kuro" Kurosaka • Senior Software Engineer


