Re: Encoding problem with ExtractRequestHandler for HTML indexing

2010-03-24 Thread Teruhiko Kurosaka
I suppose you mean Extract_ing_RequestHandler.

Out of curiosity, I sent in a Japanese HTML file of EUC-JP encoding,
and it converted to Unicode properly and the index has correct
Japanese words.
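
The conversion described above can be sketched in a few lines of Python. This is only an illustration of the idea (read the declared charset, decode to Unicode, then work in UTF-8), not what ExtractingRequestHandler does internally. The `decode_html` helper, the regular expression, and the sample page are all made up for the example.

```python
import re

# Hypothetical pattern: finds a charset declared in a <meta> tag.
META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE)

def decode_html(raw: bytes, fallback: str = "utf-8") -> str:
    """Decode raw HTML bytes to Unicode using the charset declared in the page."""
    match = META_CHARSET.search(raw[:2048])  # the declaration sits near the top
    charset = match.group(1).decode("ascii") if match else fallback
    return raw.decode(charset, errors="replace")

# A tiny EUC-JP page, built here purely for illustration.
page = ('<html><head><meta http-equiv="Content-Type" '
        'content="text/html; charset=EUC-JP"></head>'
        '<body>日本語</body></html>').encode("euc_jp")

text = decode_html(page)            # Unicode text, independent of the source encoding
utf8_bytes = text.encode("utf-8")   # the same text normalized to UTF-8
```

Once every document has been decoded to Unicode this way, the original byte encoding no longer matters for matching query terms.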

Do your HTML files have a META tag for Content-Type whose value
includes charset=? For example, this is what I have:



On Mar 21, 2010, at 9:45 AM, Ukyo Virgden wrote:

> Hi,
> 
> I'm trying to index HTML documents with different encodings. My HTML files
> are in win-12XX, ISO-8859-X or UTF-8 encoding. The handler correctly parses
> all HTML in their respective encodings and indexes them. However, on the web
> interface I'm developing I enter query terms in UTF-8, which naturally do
> not match content in different encodings. Also, the results I see in
> my web app are not UTF-8 encoded as expected.
> 
> My question: is there any filter I can use to convert all content extracted
> by the handler to UTF-8 prior to indexing?
> 
> Does it make sense to write a filter which would convert tokens to UTF-8, or
> is that even possible with multiple encodings?
> 
> Thanks in advance.
> Ukyo


Teruhiko "Kuro" Kurosaka
RLP + Lucene & Solr = powerful search for global contents



RE: encoding problem

2009-09-01 Thread Bernadette Houghton
Finally resolved the problem! The solution was 3-pronged on my Windows PC:

Added to my.ini under [mysqld]:
default-character-set=utf8
collation_server=utf8_unicode_ci
character_set_server=utf8
skip-character-set-client-handshake

Added to the JAVA_OPTS environment variable:
-Dfile.encoding=UTF-8

Added to beginning of tomcat startup.bat (positioning is important!)
set JAVA_OPTS="-Dfile.encoding=UTF-8"  

Thanks to everyone for their much appreciated help!

Bern

-Original Message-
From: Bernadette Houghton [mailto:bernadette.hough...@deakin.edu.au] 
Sent: Monday, 31 August 2009 9:18 AM
To: 'solr-user@lucene.apache.org'
Subject: RE: encoding problem

Still having a few issues with encoding, although I've been able to resolve the 
particular issue below by just re-editing the affected record. 

The other encoding issue is with Greek characters. With solr turned off in our 
user-facing application, greek characters e.g. α,ω (small alpha, small omega) 
display correctly. But with solr turned on, garbage displays instead. If we 
enter the characters as decimal entities (e.g. &#969;), all displays OK with or without 
solr. Does this suggest anything to anyone??
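
A possible reading of this symptom, sketched in Python: decimal character references are plain ASCII, so they pass unchanged through any layer that picks the wrong charset, while raw Greek letters do not. The function names below are made up for the illustration and nothing here is specific to Solr.

```python
import html

def to_decimal_refs(text: str) -> str:
    """Replace every non-ASCII character with a decimal character reference."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")

def from_refs(markup: str) -> str:
    """Turn character references back into the characters they stand for."""
    return html.unescape(markup)

escaped = to_decimal_refs("α,ω")   # pure ASCII, safe in any ASCII-compatible charset
restored = from_refs(escaped)      # what a browser shows for the escaped form
```

If the escaped form survives and the raw form does not, the data is probably intact and some step between Solr and the page is decoding bytes with the wrong charset.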

TIA
bern


RE: encoding problem

2009-08-30 Thread Bernadette Houghton
Still having a few issues with encoding, although I've been able to resolve the 
particular issue below by just re-editing the affected record. 

The other encoding issue is with Greek characters. With solr turned off in our 
user-facing application, greek characters e.g. α,ω (small alpha, small omega) 
display correctly. But with solr turned on, garbage displays instead. If we 
enter the characters as decimal entities (e.g. &#969;), all displays OK with or without 
solr. Does this suggest anything to anyone??

TIA
bern

-Original Message-
From: Bernadette Houghton [mailto:bernadette.hough...@deakin.edu.au] 
Sent: Friday, 28 August 2009 9:31 AM
To: 'solr-user@lucene.apache.org'; 'yo...@lucidimagination.com'
Subject: RE: encoding problem

Shalin, the XML from solr admin for the relevant field is displaying as -

Moncrieff, Joan, Macauley, Peter and Epps, Janine 2006, “My Universe is Here�: Implications 
For the Future of Academic Libraries From the Results of a Survey of 
Researchers, vol. 38, no. 2, pp. 71-83.
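
The garbage around the title has a recognisable shape: it is what UTF-8 curly quotes look like when their bytes are read as Windows-1252. A minimal Python sketch that reproduces it (the function name is made up; this says nothing about which component does the misreading):

```python
def misread_as_cp1252(text: str) -> str:
    """Encode text as UTF-8, then decode the bytes as if they were Windows-1252."""
    # Byte 0x9D (the last byte of the closing quote) has no meaning in cp1252,
    # so it is replaced with U+FFFD, the replacement character.
    return text.encode("utf-8").decode("cp1252", errors="replace")

title = "\u201cMy Universe is Here\u201d"   # title wrapped in curly quotes
garbled = misread_as_cp1252(title)
```

Here `garbled` comes out with the same leading and trailing debris as the record above, which suggests the stored bytes are valid UTF-8 and only the decoding step is wrong.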


The weird thing is that the title displays OK in one place, but not in the 
"href" bit.

bern


RE: encoding problem

2009-08-27 Thread Bernadette Houghton
Shalin, the XML from solr admin for the relevant field is displaying as -

Moncrieff, Joan, Macauley, Peter and Epps, Janine 2006, “My Universe is Here�: Implications 
For the Future of Academic Libraries From the Results of a Survey of 
Researchers, vol. 38, no. 2, pp. 71-83.


The weird thing is that the title displays OK in one place, but not in the 
"href" bit.

bern


Re: encoding problem

2009-08-27 Thread Yonik Seeley
Have you determined if the problem is on the indexing side or the
query side?  I don't see any reason you should have to set/change any
encoding in the JVM.
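
One way to answer that question, as a rough Python sketch: take the raw response bytes from Solr itself (bypassing the application) and check whether the expected text is present as clean UTF-8 or as the double-encoded form. The `classify` helper is hypothetical, and fetching the bytes is left out.

```python
def classify(raw: bytes, expected: str) -> str:
    """Guess whether `expected` appears in `raw` intact or double-encoded."""
    good = expected.encode("utf-8")
    # The same text after one wrong UTF-8 -> cp1252 -> UTF-8 round trip.
    double = good.decode("cp1252", errors="replace").encode("utf-8")
    if good in raw:
        return "stored correctly"   # look at the query/display side
    if double in raw:
        return "double-encoded"     # damaged before or during indexing
    return "unknown"

clean_response = "<str>\u201cMy Universe is Here\u201d</str>".encode("utf-8")
status = classify(clean_response, "\u201c")
```

If Solr's own output is clean, changing the index or the JVM is unlikely to help and the consuming application is the place to look.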

-Yonik
http://www.lucidimagination.com



On Thu, Aug 27, 2009 at 7:03 PM, Bernadette
Houghton wrote:
> Hi Shalin, strangely, things still aren't working. I've set the JAVA_OPTS 
> through either the GUI or to startup.bat, but absolutely no impact. Have 
> tried reindexing also, but still no impact - results such as -
>
> “My Universe is Here�
>
> bern
>
> -Original Message-
> From: Shalin Shekhar Mangar [mailto:shalinman...@gmail.com]
> Sent: Wednesday, 26 August 2009 5:50 PM
> To: solr-user@lucene.apache.org
> Subject: Re: encoding problem
>
> On Wed, Aug 26, 2009 at 12:52 PM, Bernadette Houghton <
> bernadette.hough...@deakin.edu.au> wrote:
>
>> Thanks for your quick reply, Shalin.
>>
>> Tomcat is running on my Windows machine, but does not appear in Windows
>> Services (as I was expecting it should ... am I wrong?). I'm running it from
>> a startup.bat on my desktop - see below. Do I add the Dfile line to the
>> startup.bat?
>>
>> SOLR is part of the repository software that we are running.
>>
>
> Tomcat respects an environment variable called JAVA_OPTS through which you
> can pass any jvm argument (e.g. heap size, file encoding). Set
> JAVA_OPTS="-Dfile.encoding=UTF-8" either through the GUI or by adding the
> following to startup.bat:
>
> set JAVA_OPTS="-Dfile.encoding=UTF-8"
>
> --
> Regards,
> Shalin Shekhar Mangar.
>


RE: encoding problem

2009-08-27 Thread Bernadette Houghton
Hi Shalin, strangely, things still aren't working. I've set the JAVA_OPTS 
through either the GUI or to startup.bat, but absolutely no impact. Have tried 
reindexing also, but still no impact - results such as -

“My Universe is Here�

bern

-Original Message-
From: Shalin Shekhar Mangar [mailto:shalinman...@gmail.com] 
Sent: Wednesday, 26 August 2009 5:50 PM
To: solr-user@lucene.apache.org
Subject: Re: encoding problem

On Wed, Aug 26, 2009 at 12:52 PM, Bernadette Houghton <
bernadette.hough...@deakin.edu.au> wrote:

> Thanks for your quick reply, Shalin.
>
> Tomcat is running on my Windows machine, but does not appear in Windows
> Services (as I was expecting it should ... am I wrong?). I'm running it from
> a startup.bat on my desktop - see below. Do I add the Dfile line to the
> startup.bat?
>
> SOLR is part of the repository software that we are running.
>

Tomcat respects an environment variable called JAVA_OPTS through which you
can pass any jvm argument (e.g. heap size, file encoding). Set
JAVA_OPTS="-Dfile.encoding=UTF-8" either through the GUI or by adding the
following to startup.bat:

set JAVA_OPTS="-Dfile.encoding=UTF-8"

-- 
Regards,
Shalin Shekhar Mangar.


RE: encoding problem

2009-08-26 Thread Fuad Efendi
If you are complaining about a web application (other than SOLR), probably
behind Apache HTTPD, having an encoding problem, try to troubleshoot it
with Mozilla Firefox and the Live HTTP Headers plugin.


Look at the "Content-Type" HTTP response header (its charset parameter), and
don't forget about the <meta> tag inside the HTML...


-Fuad
http://www.tokenizer.org



-Original Message-
From: Bernadette Houghton [mailto:bernadette.hough...@deakin.edu.au] 
Sent: August-26-09 12:55 AM
To: 'solr-user@lucene.apache.org'
Subject: encoding problem 

We have an encoding problem with our solr application. That is, non-ASCII
chars display fine in SOLR, but as gobbledegook in our application.

Our tomcat server.xml file already contains URIEncoding="UTF-8" under the
relevant <Connector>.
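
For context, the URIEncoding setting controls how the connector percent-decodes the request URI; as far as I know, older Tomcat versions fall back to ISO-8859-1 when it is absent. A small Python sketch of what that difference does to a query term (the term is just an example):

```python
from urllib.parse import quote, unquote

term = "ω"
encoded = quote(term, encoding="utf-8")            # what a UTF-8 web form sends
as_utf8 = unquote(encoded, encoding="utf-8")       # server decodes as UTF-8
as_latin1 = unquote(encoded, encoding="latin-1")   # server decodes as ISO-8859-1
```

With UTF-8 decoding the term comes back unchanged; with ISO-8859-1 it becomes two unrelated characters, so the query can never match the indexed text. This setting only covers the request URI, not response bodies.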

A google search reveals that I should set the encoding for the JVM, but have
no idea how to do this. I'm running Windows, and there is no tomcat process
in my Windows Services.

TIA

Bernadette Houghton, Library Business Applications Developer
Deakin University Geelong Victoria 3217 Australia.
Phone: 03 5227 8230 International: +61 3 5227 8230
Fax: 03 5227 8000 International: +61 3 5227 8000
MSN: bern_hough...@hotmail.com
Email:
bernadette.hough...@deakin.edu.au
Website: http://www.deakin.edu.au
Deakin University CRICOS Provider Code 00113B
(Vic)






Re: encoding problem

2009-08-26 Thread Shalin Shekhar Mangar
On Wed, Aug 26, 2009 at 12:52 PM, Bernadette Houghton <
bernadette.hough...@deakin.edu.au> wrote:

> Thanks for your quick reply, Shalin.
>
> Tomcat is running on my Windows machine, but does not appear in Windows
> Services (as I was expecting it should ... am I wrong?). I'm running it from
> a startup.bat on my desktop - see below. Do I add the Dfile line to the
> startup.bat?
>
> SOLR is part of the repository software that we are running.
>

Tomcat respects an environment variable called JAVA_OPTS through which you
can pass any jvm argument (e.g. heap size, file encoding). Set
JAVA_OPTS="-Dfile.encoding=UTF-8" either through the GUI or by adding the
following to startup.bat:

set JAVA_OPTS="-Dfile.encoding=UTF-8"

-- 
Regards,
Shalin Shekhar Mangar.


RE: encoding problem

2009-08-26 Thread Bernadette Houghton
Thanks for your quick reply, Shalin.

Tomcat is running on my Windows machine, but does not appear in Windows 
Services (as I was expecting it should ... am I wrong?). I'm running it from a 
startup.bat on my desktop - see below. Do I add the Dfile line to the 
startup.bat?

SOLR is part of the repository software that we are running.

Thanks!

BERN

Startup.bat -
@echo off
if "%OS%" == "Windows_NT" setlocal
rem ---
rem Start script for the CATALINA Server
rem
rem $Id: startup.bat 302918 2004-05-27 18:25:11Z yoavs $
rem ---

rem Guess CATALINA_HOME if not defined
set CURRENT_DIR=%cd%
if not "%CATALINA_HOME%" == "" goto gotHome
set CATALINA_HOME=%CURRENT_DIR%
if exist "%CATALINA_HOME%\bin\catalina.bat" goto okHome
cd ..
set CATALINA_HOME=%cd%
cd %CURRENT_DIR%
:gotHome
if exist "%CATALINA_HOME%\bin\catalina.bat" goto okHome
echo The CATALINA_HOME environment variable is not defined correctly
echo This environment variable is needed to run this program
goto end
:okHome

set EXECUTABLE=%CATALINA_HOME%\bin\catalina.bat

rem Check that target executable exists
if exist "%EXECUTABLE%" goto okExec
echo Cannot find %EXECUTABLE%
echo This file is needed to run this program
goto end
:okExec

rem Get remaining unshifted command line arguments and save them in the
set CMD_LINE_ARGS=
:setArgs
if ""%1""=="""" goto doneSetArgs
set CMD_LINE_ARGS=%CMD_LINE_ARGS% %1
shift
goto setArgs
:doneSetArgs

call "%EXECUTABLE%" start %CMD_LINE_ARGS%

:end





Re: encoding problem

2009-08-26 Thread Shalin Shekhar Mangar
On Wed, Aug 26, 2009 at 12:42 PM, Bernadette Houghton <
bernadette.hough...@deakin.edu.au> wrote:

> Hi Shalin, stupid question - I'm an apache/solr newbie - but how do I
> access the JVM???
>

When you execute the java executable, just add -Dfile.encoding=UTF-8 as a
command line argument to the executable.

How are you consuming Solr? You mentioned there is no tomcat, is your solr
client a desktop java application?

-- 
Regards,
Shalin Shekhar Mangar.


RE: encoding problem

2009-08-26 Thread Bernadette Houghton
Hi Shalin, stupid question - I'm an apache/solr newbie - but how do I access 
the JVM???

Regards
Bern


-Original Message-
From: Shalin Shekhar Mangar [mailto:shalinman...@gmail.com] 
Sent: Wednesday, 26 August 2009 5:10 PM
To: solr-user@lucene.apache.org
Subject: Re: encoding problem

On Wed, Aug 26, 2009 at 10:24 AM, Bernadette Houghton <
bernadette.hough...@deakin.edu.au> wrote:

> We have an encoding problem with our solr application. That is, non-ASCII
> chars display fine in SOLR, but as gobbledegook in our application.
>
> Our tomcat server.xml file already contains URIEncoding="UTF-8" under the
> relevant <Connector>.
>
> A google search reveals that I should set the encoding for the JVM, but
> have no idea how to do this. I'm running Windows, and there is no tomcat
> process in my Windows Services.
>

Add the following parameter to the JVM:

-Dfile.encoding=UTF-8

-- 
Regards,
Shalin Shekhar Mangar.


Re: encoding problem

2009-08-26 Thread Shalin Shekhar Mangar
On Wed, Aug 26, 2009 at 10:24 AM, Bernadette Houghton <
bernadette.hough...@deakin.edu.au> wrote:

> We have an encoding problem with our solr application. That is, non-ASCII
> chars display fine in SOLR, but as gobbledegook in our application.
>
> Our tomcat server.xml file already contains URIEncoding="UTF-8" under the
> relevant <Connector>.
>
> A google search reveals that I should set the encoding for the JVM, but
> have no idea how to do this. I'm running Windows, and there is no tomcat
> process in my Windows Services.
>

Add the following parameter to the JVM:

-Dfile.encoding=UTF-8

-- 
Regards,
Shalin Shekhar Mangar.


Re: Encoding problem

2009-04-01 Thread Rui Pereira
Thanks, I detected that same problem.
I have CP 1252 as the system file encoding and was saving the data-config.xml
file in UTF-8. DIH was reading it using the default encoding.
One possible workaround was using InputStream and OutputStream like DIH,
but the files won't be in UTF-8 if the system has a different encoding (not
really good for XML files).
I will get the latest 1.4 build and maintain the files in UTF-8.

On Fri, Mar 27, 2009 at 9:37 PM, Shalin Shekhar Mangar <
shalinman...@gmail.com> wrote:

> On Sat, Mar 28, 2009 at 12:51 AM, Shalin Shekhar Mangar <
> shalinman...@gmail.com> wrote:
>
> >
> > I see that you are specifying the topologyname's value in the query
> itself.
> > It might be a bug in DataImportHandler because it reads the data-config
> as a
> > string from an InputStream. If your default platform encoding is not
> UTF-8,
> > this may be the cause.
> >
>
> I've opened SOLR-1090 to fix this issue.
>
> https://issues.apache.org/jira/browse/SOLR-1090
>
> --
> Regards,
> Shalin Shekhar Mangar.
>


Re: Encoding problem

2009-03-27 Thread Shalin Shekhar Mangar
On Sat, Mar 28, 2009 at 12:51 AM, Shalin Shekhar Mangar <
shalinman...@gmail.com> wrote:

>
> I see that you are specifying the topologyname's value in the query itself.
> It might be a bug in DataImportHandler because it reads the data-config as a
> string from an InputStream. If your default platform encoding is not UTF-8,
> this may be the cause.
>

I've opened SOLR-1090 to fix this issue.

https://issues.apache.org/jira/browse/SOLR-1090

-- 
Regards,
Shalin Shekhar Mangar.


Re: Encoding problem

2009-03-27 Thread Shalin Shekhar Mangar
On Fri, Mar 27, 2009 at 8:41 PM, Rui Pereira wrote:

> I'm having problems with encoding in responses from search queries. The
> encoding problem only occurs in the topologyname field; if an instancename
> has accents it is returned correctly. In all my configurations I have
> UTF-8.
>
> 
> 
>
> 
>  
>  
>  
>  ...
>
>
> As an example, I can have in the response the following result:
>
> 
> 285
> Informática
> 3141
> Inventário
> 
>

I see that you are specifying the topologyname's value in the query itself.
It might be a bug in DataImportHandler because it reads the data-config as a
string from an InputStream. If your default platform encoding is not UTF-8,
this may be the cause.

Can you try running the Solr's (or your servlet-container's) java process
with -Dfile.encoding=UTF-8 and see if that fixes the problem?
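
To make the suspected cause concrete, here is a rough Python sketch of the same mistake: a UTF-8 config file read with a Windows-1252 default. The entity line is a made-up stand-in, not the poster's actual data-config, and the code only imitates the behaviour; it is not DIH code.

```python
import os
import tempfile

# Hypothetical config content; only the accented literal in the query matters.
config_text = "<entity name=\"topology\" query=\"select 'Informática' as topologyname\"/>"

# Save the file as UTF-8, the way the poster's editor did.
with tempfile.NamedTemporaryFile("wb", suffix=".xml", delete=False) as handle:
    handle.write(config_text.encode("utf-8"))
    path = handle.name

try:
    # Reading with the platform default (here forced to cp1252) garbles the accent.
    with open(path, encoding="cp1252") as f:
        misread = f.read()
    # Reading with an explicit UTF-8 decoder keeps it intact.
    with open(path, encoding="utf-8") as f:
        correct = f.read()
finally:
    os.remove(path)
```

The misread version contains the same garbled value shown in the response above, while values that come from the database rather than the config file are unaffected. That would explain why only topologyname is broken.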

-- 
Regards,
Shalin Shekhar Mangar.


Re: Encoding problem

2009-03-27 Thread aerox7

Hi,
I had the same problem with DataImportHandler: I have a UTF-8 MySQL
database, but it seems that DIH imports the data as Latin... So I just use a
Transformer to (re)encode my strings in UTF-8.
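
The re-encoding step can be sketched like this in Python. A real DIH Transformer would be written against Solr's own API, so treat this only as an illustration of the string repair; the function name and the Latin-1 assumption are mine.

```python
def repair_mojibake(value: str, wrong: str = "latin-1") -> str:
    """Undo a UTF-8 string that was decoded with the wrong single-byte charset."""
    try:
        # Recover the original bytes, then decode them as the UTF-8 they really are.
        return value.encode(wrong).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not mojibake of this kind: leave the value untouched.
        return value

fixed = repair_mojibake("InformÃ¡tica")
untouched = repair_mojibake("Inventário")   # already correct, so it is left alone
```

This kind of repair hides the symptom rather than the cause; fixing the connection or file encoding at the source is usually the cleaner option.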


Rui Pereira-2 wrote:
> 
> I'm having problems with encoding in responses from search queries. The
> encoding problem only occurs in the topologyname field; if an instancename
> has accents it is returned correctly. In all my configurations I have
> UTF-8.
> 
> 
> 
> 
> 
>   
>   
>   
>   ...
> 
> 
> As an example, I can have in the response the following result:
> 
> 
> 285
> Informática
> 3141
> Inventário
> 
> 
> 
> Thanks in advance,
>Rui Pereira
> 
> 
