Re: Problem adding unicoded docs to Solr through SolrJ

2009-04-30 Thread Gunnar Wagenknecht
ahmed baseet wrote:
 I first converted the whole string to a byte array and then used that
 byte array to create a new UTF-8 encoded string like this,

I'm not sure that this is required at all. Java strings have the same
internal representation no matter what they were created from. Thus,
the code snippet you posted is wrong.

 // Encode in Unicode UTF-8
 byte[] utfEncodeByteArray = textOnly.getBytes();
 String utfString = new String(utfEncodeByteArray,
     Charset.forName("UTF-8"));

The expression "textOnly.getBytes()" in particular is wrong. It converts
the String to a sequence of bytes using the JVM's default encoding. Then
you convert those bytes back to a string using the UTF-8 encoding.

You have to check carefully *how* the string "textOnly" is created in
the first place. That's where your UTF-8 issues might come from.
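
For example, if it comes out of a stream, the charset has to be
specified when the bytes are read. Untested sketch (the URL and the
assumption that the page is UTF-8 are just placeholders):

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;

public class ReadPageAsUtf8 {
    public static void main(String[] args) throws Exception {
        // Assumption: the page is served as UTF-8; in general the charset
        // should be taken from the HTTP Content-Type header.
        InputStream in = new URL("http://example.com/page.html").openStream();
        BufferedReader reader =
            new BufferedReader(new InputStreamReader(in, "UTF-8"));
        StringBuilder sb = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            sb.append(line).append('\n');
        }
        reader.close();
        String textOnly = sb.toString(); // correct chars; no re-encoding needed
        System.out.println(textOnly);
    }
}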

-Gunnar

-- 
Gunnar Wagenknecht
gun...@wagenknecht.org
http://wagenknecht.org/



Re: Problem adding unicoded docs to Solr through SolrJ

2009-04-30 Thread Michael Ludwig

ahmed baseet wrote:


I tried something stupid, but it works. I first converted the
whole string to a byte array and then used that byte array to create a
new UTF-8 encoded string like this,

// Encode in Unicode UTF-8
byte[] utfEncodeByteArray = textOnly.getBytes();


This yields a sequence of bytes using the platform's default charset,
which may not be UTF-8. Check:

* String#getBytes()
* String#getBytes(String charsetName)
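
A quick sketch to illustrate the difference (the lengths in the
comments assume a windows-1252 default, purely as an example):

import java.io.UnsupportedEncodingException;

public class GetBytesDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String s = "K\u00e4se"; // "Käse", 4 characters
        byte[] platformBytes = s.getBytes();    // platform default charset
        byte[] utf8Bytes = s.getBytes("UTF-8"); // always UTF-8
        System.out.println(platformBytes.length); // 4 with windows-1252
        System.out.println(utf8Bytes.length);     // 5: 'ä' takes two bytes
    }
}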


String utfString = new String(utfEncodeByteArray,
    Charset.forName("UTF-8"));


Note that strings in Java are always internally encoded in UTF-16, so it
doesn't make much sense to call it utfString, especially if you think
that it is encoded in UTF-8, which it is not.

The above operation is only guaranteed to succeed without losing data
(loss shows up as '?' in the output) when the sequence of bytes is valid
UTF-8, i.e. in this case when the platform encoding you've relied upon
is UTF-8.
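
To make that concrete, a small untested sketch of the lossy round trip
(the Tamil sample string is only an illustration):

import java.nio.charset.Charset;

public class LossyRoundTrip {
    public static void main(String[] args) {
        String tamil = "\u0BA4\u0BAE\u0BBF\u0BB4\u0BCD"; // a Tamil word
        byte[] defaultBytes = tamil.getBytes(); // platform default charset
        // On e.g. windows-1252 the Tamil characters are unmappable and
        // are replaced by '?' before you ever reach the decoder.
        String back = new String(defaultBytes, Charset.forName("UTF-8"));
        System.out.println(back); // "?????" unless the default is UTF-8
    }
}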


then passed the utfString to the function for posting to Solr and it
works perfectly.
But is there any intelligent way of doing all this, like going straight
from a default-encoded string to a UTF-8 encoded string, without going
via a byte array?


It is a feature of java.lang.String that you don't need to know the
encoding, as the string contains characters, not bytes. Only for input
and output are you concerned with encoding. So wherever you're dealing
with encodings, you're dealing with bytes.

And when dealing with bytes on the wire, you're likely concerned with
encodings, for example when the page you read via HTTP comes with a
Content-Type header specifying the encoding, or when you send documents
to the Solr indexer.
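
For example, something along these lines (untested sketch; the charset
parsing here is deliberately naive, a real client should handle quoted
values and extra parameters):

import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

public class FetchWithCharset {
    public static void main(String[] args) throws Exception {
        URLConnection conn = new URL("http://example.com/").openConnection();
        String contentType = conn.getContentType(); // e.g. "text/html; charset=UTF-8"
        String charset = "ISO-8859-1"; // HTTP default when none is declared
        int i = contentType == null ? -1 : contentType.indexOf("charset=");
        if (i != -1) {
            charset = contentType.substring(i + "charset=".length()).trim();
        }
        InputStream in = conn.getInputStream();
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        for (int n; (n = in.read(chunk)) != -1; ) {
            buf.write(chunk, 0, n);
        }
        in.close();
        String page = new String(buf.toByteArray(), charset); // decode exactly once
        System.out.println(page.length());
    }
}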

For more intelligent ways, you could take a look at the class
java.nio.charset.Charset and the methods encode, decode, newEncoder,
newDecoder.
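
A small sketch of what that can look like (newEncoder()/newDecoder()
additionally let you configure the handling of malformed input):

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

public class CharsetDemo {
    public static void main(String[] args) {
        Charset utf8 = Charset.forName("UTF-8");
        ByteBuffer bytes = utf8.encode("K\u00e4se"); // chars -> UTF-8 bytes
        CharBuffer chars = utf8.decode(bytes);       // bytes -> chars
        System.out.println(chars); // Käse
    }
}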

Michael Ludwig


Problem adding unicoded docs to Solr through SolrJ

2009-04-29 Thread ahmed baseet
Hi All,
I'm trying to automate the process of posting XML docs to Solr using SolrJ.
Essentially I'm extracting the text from a given URL, then creating a
Solr document and posting it using the following function:

public void postToSolrUsingSolrj(String rawText, String pageId) {
    String url = "http://localhost:8983/solr";
    CommonsHttpSolrServer server;

    try {
        // Get connection to Solr server
        server = new CommonsHttpSolrServer(url);

        // Set XMLResponseParser: required for older versions like Solr 1.3
        server.setParser(new XMLResponseParser());

        server.setSoTimeout(1000);  // socket read timeout
        server.setConnectionTimeout(100);
        server.setDefaultMaxConnectionsPerHost(100);
        server.setMaxTotalConnections(100);
        server.setFollowRedirects(false);  // defaults to false
        // allowCompression defaults to false.
        // Server side must support gzip or deflate for this to have
        // any effect.
        server.setAllowCompression(true);
        server.setMaxRetries(1); // defaults to 0. > 1 not recommended.

        // WARNING: this will delete all pre-existing Solr index data
        //server.deleteByQuery("*:*"); // delete everything!

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", pageId);
        doc.addField("features", rawText);

        // Add the doc to the Solr server
        server.add(doc);

        // Commit the changes
        server.commit();

    } catch (Exception e) {
        // Swallowing the exception hides errors; at least log it
        e.printStackTrace();
    }
}

In the above, the param rawText is just the HTML stripped of all its
tags, JS, CSS etc., and pageId is the URL for that page. When I'm using
this for English pages it works perfectly fine, but the problem comes up
when I try to index some non-English pages. For them, say pages in Tamil,
the Unicode/UTF-8 encoding seems to create some problem: after indexing
some non-English pages, when I try to search for them from the Solr admin
search interface, it gives the result but the content is not shown in that
language, i.e. Tamil; rather it just displays some characters, I think in
Unicode. The same thing worked fine for pages in English.

Now what I did was just extract the raw text from that HTML page and
manually create an XML file like this

<?xml version="1.0" encoding="UTF-8"?>
<add>
  <doc>
    <field name="id">UTF2TEST</field>
    <field name="name">Test with some UTF-8 encoded characters</field>
    <field name="features">*some tamil unicode text here*</field>
  </doc>
</add>

and posted this from the command line using the post.jar file. Now searching
gives me the result, but unlike last time the browser shows the indexed text
in Tamil itself and not the raw Unicode. So this clearly shows that the
string that I'm using to create the Solr doc seems to have some encoding
issues, right? Or something else? I tried doing something like this also,

// Encode in Unicode UTF-8
utfEncodedText = new String(rawText.getBytes("UTF-8"));

but even this didn't help either.
It seems like some silly problem somewhere which I'm not able to catch. :-)

I'd appreciate it if someone could point out the bug...

Thanks,
Ahmed.


Re: Problem adding unicoded docs to Solr through SolrJ

2009-04-29 Thread Michael Ludwig

ahmed baseet wrote:


public void postToSolrUsingSolrj(String rawText, String pageId) {



doc.addField("features", rawText);



In the above, the param rawText is just the HTML stripped of all its
tags, JS, CSS etc., and pageId is the URL for that page. When I'm using
this for English pages it works perfectly fine, but the problem comes up
when I try to index some non-English pages.


Maybe you're constructing a string without specifying the encoding, so
Java uses your default platform encoding?

String(byte[] bytes)
  Constructs a new String by decoding the specified array of
  bytes using the platform's default charset.

String(byte[] bytes, Charset charset)
  Constructs a new String by decoding the specified array of bytes using
  the specified charset.


Now what I did is just extracted the raw text from that html page and
manually created an xml page like this

<?xml version="1.0" encoding="UTF-8"?>
<add>
  <doc>
    <field name="id">UTF2TEST</field>
    <field name="name">Test with some UTF-8 encoded characters</field>
    <field name="features">*some tamil unicode text here*</field>
  </doc>
</add>

and posted this from the command line using the post.jar file. Now searching
gives me the result, but unlike last time the browser shows the indexed text
in Tamil itself and not the raw Unicode.


Now that's perfect, isn't it?


I tried doing something like this also,



// Encode in Unicode UTF-8
utfEncodedText = new String(rawText.getBytes("UTF-8"));

but even this didn't help either.


No encoding specified, so the default platform encoding is used, which
is likely not what you want. Consider the following example:

package milu;
import java.nio.charset.Charset;
public class StringAndCharset {
  public static void main(String[] args) {
    byte[] bytes = { 'K', (byte) 195, (byte) 164, 's', 'e' };
    System.out.println(Charset.defaultCharset().displayName());
    System.out.println(new String(bytes));
    System.out.println(new String(bytes, Charset.forName("UTF-8")));
  }
}

Output:

windows-1252
KÃ¤se (bad)
Käse (good)

Michael Ludwig