Re: Encoding problem with ExtractRequestHandler for HTML indexing

2010-03-24 Thread Teruhiko Kurosaka
I suppose you mean Extract_ing_RequestHandler.

Out of curiosity, I sent in a Japanese HTML file in EUC-JP encoding;
it was converted to Unicode properly, and the index has the correct
Japanese words.

Do your HTML files have a META tag for Content-Type whose value
includes charset= ? For example, this is what I have:
<meta http-equiv="Content-Type" content="text/html; charset=EUC-JP" />
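
To illustrate why the META tag matters, here is a minimal Python sketch of charset sniffing. It is only an analogy under stated assumptions: decode_html and META_CHARSET are hypothetical names, and this is not the detection logic that ExtractingRequestHandler or Tika actually uses.

```python
import re

# Hypothetical helper for illustration; not Solr's or Tika's detection logic.
META_CHARSET = re.compile(rb"charset\s*=\s*[\"']?\s*([A-Za-z0-9_.:-]+)", re.IGNORECASE)


def decode_html(raw: bytes, fallback: str = "utf-8") -> str:
    """Decode HTML bytes using the charset declared near the top, if any."""
    match = META_CHARSET.search(raw[:4096])  # simplification: first 4 KB only
    charset = match.group(1).decode("ascii") if match else fallback
    try:
        return raw.decode(charset, errors="replace")
    except LookupError:  # unknown charset label, fall back
        return raw.decode(fallback, errors="replace")


page = (
    '<meta http-equiv="Content-Type" content="text/html; charset=EUC-JP" />'
    "<p>日本語</p>"
).encode("euc_jp")
text = decode_html(page)  # a Unicode string, independent of the source encoding
```

Once the bytes are decoded this way, the text is plain Unicode, so the original file encoding no longer matters for matching.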


On Mar 21, 2010, at 9:45 AM, Ukyo Virgden wrote:

 Hi,
 
 I'm trying to index HTML documents with different encodings. My HTML files are
 in win-12XX, ISO-8859-X, or UTF-8 encoding. The handler correctly parses
 all of them in their respective encodings and indexes them. However, in the web
 interface I'm developing, I enter query terms in UTF-8, which naturally do
 not match content in other encodings. Also, the results I see in
 my web app are not UTF-8 encoded as expected.
 
 My question: is there any filter I can use to convert all content extracted
 by the handler to UTF-8 prior to indexing?
 
 Does it make sense to write a filter that would convert tokens to UTF-8, or
 is that even possible with multiple encodings?
 
 Thanks in advance.
 Ukyo


Teruhiko Kuro Kurosaka
RLP + Lucene & Solr = powerful search for global contents



RE: encoding problem

2009-09-01 Thread Bernadette Houghton
Finally resolved the problem! The solution was three-pronged on my Windows PC:

Added to my.ini under [mysqld]:
default-character-set=utf8
collation_server=utf8_unicode_ci
character_set_server=utf8
skip-character-set-client-handshake

Added to the JAVA_OPTS environment variable:
-Dfile.encoding=UTF-8

Added to the beginning of Tomcat's startup.bat (positioning is important!):
set JAVA_OPTS=-Dfile.encoding=UTF-8

Thanks to everyone for their much appreciated help!

Bern



RE: encoding problem

2009-08-30 Thread Bernadette Houghton
Still having a few issues with encoding, although I've been able to resolve the
particular issue below by just re-editing the affected record.

The other encoding issue is with Greek characters. With Solr turned off in our
user-facing application, Greek characters, e.g. α, ω (small alpha, small omega),
display correctly. But with Solr turned on, garbage displays instead. If we
enter the characters as decimal character references (e.g. &#969;), everything
displays OK with or without Solr. Does this suggest anything to anyone?

TIA
bern
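
A small Python sketch of why the decimal form survives: a reference such as &#969; is made only of ASCII characters, so it reads the same under any ASCII-compatible encoding, while the raw UTF-8 bytes of ω change meaning if some layer decodes them with a different charset. That Windows-1252 is the charset involved here is an assumption for illustration, not something the thread confirms.

```python
import html

omega = "ω"  # Greek small omega, U+03C9

# ASCII-only decimal character reference for the same character.
reference = omega.encode("ascii", errors="xmlcharrefreplace").decode("ascii")

# The raw UTF-8 bytes, misread as Windows-1252 (an assumed wrong charset).
garbled = omega.encode("utf-8").decode("cp1252")

# The reference can always be turned back into the character.
restored = html.unescape(reference)
```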

-Original Message-
From: Bernadette Houghton [mailto:bernadette.hough...@deakin.edu.au] 
Sent: Friday, 28 August 2009 9:31 AM
To: 'solr-user@lucene.apache.org'; 'yo...@lucidimagination.com'
Subject: RE: encoding problem

Shalin, the XML from solr admin for the relevant field is displaying as -

<str name="citation_t"><a title="Browse by Author Name for Moncrieff, Joan"
href="/fez/list/author/Moncrieff%2C+Joan/">Moncrieff, Joan</a>, <a
title="Browse by Author Name for Macauley, Peter"
href="/fez/list/author/Macauley%2C+Peter/">Macauley, Peter</a> and <a
title="Browse by Author Name for Epps, Janine"
href="/fez/list/author/Epps%2C+Janine/">Epps, Janine</a> <a title="Browse by
Year 2006" href="/fez/list/year/2006/">2006</a>, <a title="Click to view
Journal, Media Article: &ldquo;My Universe is Here&rdquo;: Implications For the
Future of Academic Libraries From the Results of a Survey of Researchers"
href="/fez/view/changeme:156">“My Universe is Here�: Implications
For the Future of Academic Libraries From the Results of a Survey of
Researchers</a><i></i>, vol. 38, no. 2, pp. 71-83.</str>


The weird thing is that the title displays OK in one place, but not in the 
href bit.

bern


RE: encoding problem

2009-08-27 Thread Bernadette Houghton
Hi Shalin, strangely, things still aren't working. I've tried setting JAVA_OPTS
through either the GUI or startup.bat, but it has had absolutely no impact. I
have also tried reindexing, but still no impact; results still look like this:

“My Universe is Here�

bern
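
One plausible mechanism for output like this, not confirmed in the thread, is UTF-8 bytes being decoded as Windows-1252 somewhere between the index and the browser. The Python sketch below shows that the closing curly quote is hit harder than the opening one, because its last UTF-8 byte (0x9D) has no character in Windows-1252.

```python
title = "\u201cMy Universe is Here\u201d"  # title wrapped in curly quotes

utf8_bytes = title.encode("utf-8")

# Windows-1252 defines no character for byte 0x9D, so strict decoding would
# fail; errors="replace" substitutes U+FFFD (the replacement character).
garbled = utf8_bytes.decode("cp1252", errors="replace")

print(garbled)
```

Because the replacement character discards the original byte, this kind of damage cannot be fully undone afterwards; the fix has to happen where the wrong decoding takes place.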



Re: encoding problem

2009-08-27 Thread Yonik Seeley
Have you determined if the problem is on the indexing side or the
query side?  I don't see any reason you should have to set/change any
encoding in the JVM.

-Yonik
http://www.lucidimagination.com
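
One way to act on this question is to save the raw bytes of a Solr response and of the application's page, then check in which form a known string appears. The sketch below is a hypothetical helper (classify is not a Solr tool), and it assumes Windows-1252 as the wrongly applied charset.

```python
def classify(raw: bytes, expected: str, wrong: str = "cp1252") -> str:
    """Report how `expected` appears in the raw bytes of a response."""
    good = expected.encode("utf-8")
    # UTF-8 bytes misread as `wrong`, then written out as UTF-8 again.
    doubled = good.decode(wrong, errors="replace").encode("utf-8")
    if good in raw:
        return "utf-8"
    if doubled in raw:
        return "double-encoded"
    return "not found"


sample = "α,ω"
clean = ("<str>" + sample + "</str>").encode("utf-8")
# Simulate a layer that misreads the bytes and re-encodes the result.
broken = clean.decode("cp1252", errors="replace").encode("utf-8")

print(classify(clean, sample), classify(broken, sample))
```

If Solr's own response is clean and the application's page is double-encoded, the problem lies on the client side rather than in the index.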






RE: encoding problem

2009-08-27 Thread Bernadette Houghton
Shalin, the XML from solr admin for the relevant field is displaying as -

<str name="citation_t"><a title="Browse by Author Name for Moncrieff, Joan"
href="/fez/list/author/Moncrieff%2C+Joan/">Moncrieff, Joan</a>, <a
title="Browse by Author Name for Macauley, Peter"
href="/fez/list/author/Macauley%2C+Peter/">Macauley, Peter</a> and <a
title="Browse by Author Name for Epps, Janine"
href="/fez/list/author/Epps%2C+Janine/">Epps, Janine</a> <a title="Browse by
Year 2006" href="/fez/list/year/2006/">2006</a>, <a title="Click to view
Journal, Media Article: &ldquo;My Universe is Here&rdquo;: Implications For the
Future of Academic Libraries From the Results of a Survey of Researchers"
href="/fez/view/changeme:156">“My Universe is Here�: Implications
For the Future of Academic Libraries From the Results of a Survey of
Researchers</a><i></i>, vol. 38, no. 2, pp. 71-83.</str>


The weird thing is that the title displays OK in one place, but not in the 
href bit.

bern


RE: encoding problem

2009-08-26 Thread Bernadette Houghton
Hi Shalin, stupid question - I'm an apache/solr newbie - but how do I access 
the JVM???

Regards
Bern


-Original Message-
From: Shalin Shekhar Mangar [mailto:shalinman...@gmail.com] 
Sent: Wednesday, 26 August 2009 5:10 PM
To: solr-user@lucene.apache.org
Subject: Re: encoding problem

On Wed, Aug 26, 2009 at 10:24 AM, Bernadette Houghton 
bernadette.hough...@deakin.edu.au wrote:

 We have an encoding problem with our Solr application. That is, non-ASCII
 characters display fine in Solr, but as gobbledegook in our application.

 Our Tomcat server.xml file already contains URIEncoding="UTF-8" under the
 relevant connector.

 A google search reveals that I should set the encoding for the JVM, but
 have no idea how to do this. I'm running Windows, and there is no tomcat
 process in my Windows Services.


Add the following parameter to the JVM:

-Dfile.encoding=UTF-8

-- 
Regards,
Shalin Shekhar Mangar.


Re: encoding problem

2009-08-26 Thread Shalin Shekhar Mangar
On Wed, Aug 26, 2009 at 12:42 PM, Bernadette Houghton 
bernadette.hough...@deakin.edu.au wrote:

 Hi Shalin, stupid question - I'm an apache/solr newbie - but how do I
 access the JVM???


When you execute the java executable, just add -Dfile.encoding=UTF-8 as a
command line argument to the executable.

How are you consuming Solr? You mentioned there is no Tomcat; is your Solr
client a desktop Java application?

-- 
Regards,
Shalin Shekhar Mangar.


RE: encoding problem

2009-08-26 Thread Bernadette Houghton
Thanks for your quick reply, Shalin.

Tomcat is running on my Windows machine, but does not appear in Windows 
Services (as I was expecting it should ... am I wrong?). I'm running it from a 
startup.bat on my desktop - see below. Do I add the Dfile line to the 
startup.bat?

SOLR is part of the repository software that we are running.

Thanks!

BERN

Startup.bat -
@echo off
if "%OS%" == "Windows_NT" setlocal
rem ---
rem Start script for the CATALINA Server
rem
rem $Id: startup.bat 302918 2004-05-27 18:25:11Z yoavs $
rem ---

rem Guess CATALINA_HOME if not defined
set CURRENT_DIR=%cd%
if not "%CATALINA_HOME%" == "" goto gotHome
set CATALINA_HOME=%CURRENT_DIR%
if exist "%CATALINA_HOME%\bin\catalina.bat" goto okHome
cd ..
set CATALINA_HOME=%cd%
cd %CURRENT_DIR%
:gotHome
if exist "%CATALINA_HOME%\bin\catalina.bat" goto okHome
echo The CATALINA_HOME environment variable is not defined correctly
echo This environment variable is needed to run this program
goto end
:okHome

set EXECUTABLE=%CATALINA_HOME%\bin\catalina.bat

rem Check that target executable exists
if exist "%EXECUTABLE%" goto okExec
echo Cannot find %EXECUTABLE%
echo This file is needed to run this program
goto end
:okExec

rem Get remaining unshifted command line arguments and save them in the
set CMD_LINE_ARGS=
:setArgs
if ""%1""=="""" goto doneSetArgs
set CMD_LINE_ARGS=%CMD_LINE_ARGS% %1
shift
goto setArgs
:doneSetArgs

call "%EXECUTABLE%" start %CMD_LINE_ARGS%

:end





Re: encoding problem

2009-08-26 Thread Shalin Shekhar Mangar
On Wed, Aug 26, 2009 at 12:52 PM, Bernadette Houghton 
bernadette.hough...@deakin.edu.au wrote:

 Thanks for your quick reply, Shalin.

 Tomcat is running on my Windows machine, but does not appear in Windows
 Services (as I was expecting it should ... am I wrong?). I'm running it from
 a startup.bat on my desktop - see below. Do I add the Dfile line to the
 startup.bat?

 SOLR is part of the repository software that we are running.


Tomcat respects an environment variable called JAVA_OPTS through which you
can pass any jvm argument (e.g. heap size, file encoding). Set
JAVA_OPTS=-Dfile.encoding=UTF-8 either through the GUI or by adding the
following to startup.bat:

set JAVA_OPTS=-Dfile.encoding=UTF-8

-- 
Regards,
Shalin Shekhar Mangar.


RE: encoding problem

2009-08-26 Thread Fuad Efendi
If the encoding problem is in a web application other than Solr (probably
sitting behind Apache HTTPD), try troubleshooting it with Mozilla Firefox
and the Live HTTP Headers plugin.


Look at the charset parameter of the Content-Type HTTP response header, and
don't forget about the <meta http-equiv=...> tag inside the HTML.


-Fuad
http://www.tokenizer.org
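
For checking a captured header value offline, a short Python sketch using only the standard library is shown below. The function name charset_from_content_type is made up for this example; the header values are illustrative, not taken from the application in question.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(value: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()  # lower-cased charset, None if absent


declared = charset_from_content_type("text/html; charset=ISO-8859-1")
missing = charset_from_content_type("text/html")
```

If the header declares one charset and the page body is actually in another, or the header and the META tag disagree, browsers will show garbage of the kind described in this thread.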



-Original Message-
From: Bernadette Houghton [mailto:bernadette.hough...@deakin.edu.au] 
Sent: August-26-09 12:55 AM
To: 'solr-user@lucene.apache.org'
Subject: encoding problem 

We have an encoding problem with our Solr application. That is, non-ASCII
characters display fine in Solr, but as gobbledegook in our application.

Our Tomcat server.xml file already contains URIEncoding="UTF-8" under the
relevant connector.

A google search reveals that I should set the encoding for the JVM, but have
no idea how to do this. I'm running Windows, and there is no tomcat process
in my Windows Services.

TIA

Bernadette Houghton, Library Business Applications Developer
Deakin University Geelong Victoria 3217 Australia.
Phone: 03 5227 8230 International: +61 3 5227 8230
Fax: 03 5227 8000 International: +61 3 5227 8000
MSN: bern_hough...@hotmail.com
Email: bernadette.hough...@deakin.edu.au
Website: http://www.deakin.edu.au
Deakin University CRICOS Provider Code 00113B (Vic)






Re: Encoding problem

2009-04-01 Thread Rui Pereira
Thanks, I found the same problem.
My system file encoding is CP1252, and I was saving the data-config.xml file
in UTF-8, while DIH was reading it using the default encoding.
One possible workaround was to use InputStream and OutputStream the way DIH
does, but then the files won't be in UTF-8 if the system has a different
encoding (not really good for XML files).
I will get the latest 1.4 build and keep the files in UTF-8.




Re: Encoding problem

2009-03-27 Thread aerox7

Hi,
I had the same problem with DataImportHandler: I have a UTF-8 MySQL
database, but it seems that DIH imports the data as Latin... So I just use a
Transformer to (re)encode my strings in UTF-8.
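
The re-encoding idea can be sketched in Python as follows. This is not DIH Transformer code: repair_mojibake is a hypothetical helper, and it assumes that "LATIN" means ISO-8859-1 and that the strings were UTF-8 bytes decoded with that charset.

```python
def repair_mojibake(text: str, wrong: str = "latin-1") -> str:
    """Undo UTF-8 text that was mistakenly decoded with `wrong`."""
    try:
        # Recover the original bytes, then decode them as UTF-8.
        return text.encode(wrong).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # not mojibake of this kind; leave it unchanged


broken = "Informática".encode("utf-8").decode("latin-1")
fixed = repair_mojibake(broken)
```

A repair step like this treats the symptom; making the database connection and the importer agree on UTF-8 removes the need for it.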





Re: Encoding problem

2009-03-27 Thread Shalin Shekhar Mangar
On Fri, Mar 27, 2009 at 8:41 PM, Rui Pereira ruipereira...@gmail.com wrote:

 I'm having problems with encoding in responses from search queries. The
 encoding problem only occurs in the topologyname field; if an instancename
 has accents, it is returned correctly. In all my configurations I have
 UTF-8.

 <?xml version="1.0" encoding="UTF-8"?>
 <dataConfig>
 <document name="topologies">
 <entity query="SELECT DISTINCT '3141-' || Sub0.SUBID as id, 'Inventário' as
 topologyname, 3141 as topologyid, Sub0.SUBID as instancekey, Sub0.NAME as
 instancename FROM ...">
  <field column="INSTANCEKEY" name="instancekey"/>
  <field column="ID" name="id"/>
  <field column="TOPOLOGYID" name="topologyid"/>
  <field column="INSTANCENAME" name="instancename"/>
  <field column="TOPOLOGYNAME" name="topologyname"/>...


 As an example, I can have in the response the following result:

 <doc>
 <long name="instancekey">285</long>
 <str name="instancename">Informática</str>
 <long name="topologyid">3141</long>
 <str name="topologyname">Inventário</str>
 </doc>


I see that you are specifying the topologyname's value in the query itself.
It might be a bug in DataImportHandler because it reads the data-config as a
string from an InputStream. If your default platform encoding is not UTF-8,
this may be the cause.

Can you try running Solr's (or your servlet container's) Java process
with -Dfile.encoding=UTF-8 and see if that fixes the problem?

-- 
Regards,
Shalin Shekhar Mangar.
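
The suspected mechanism can be reproduced outside Solr. The Python sketch below is an analogy, not DataImportHandler's Java code: it writes a config fragment as UTF-8, then reads it back once with CP1252 (standing in for a non-UTF-8 platform default) and once with an explicit UTF-8 encoding. The fragment itself is a shortened, made-up version of the query above.

```python
import os
import tempfile

config = "<entity query=\"SELECT 'Inventário' as topologyname\"/>"

# Save the fragment as UTF-8, the way the data-config file was saved.
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", suffix=".xml", delete=False
) as handle:
    handle.write(config)
    path = handle.name

try:
    # Reading with CP1252 simulates relying on a non-UTF-8 platform default.
    with open(path, encoding="cp1252") as wrong:
        misread = wrong.read()
    # Naming the encoding explicitly gives the same result on every platform.
    with open(path, encoding="utf-8") as right:
        correct = right.read()
finally:
    os.remove(path)
```

Only the literal embedded in the config is damaged, which matches the report that values coming from the database (instancename) were returned correctly.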


Re: Encoding problem

2009-03-27 Thread Shalin Shekhar Mangar
On Sat, Mar 28, 2009 at 12:51 AM, Shalin Shekhar Mangar 
shalinman...@gmail.com wrote:


 I see that you are specifying the topologyname's value in the query itself.
 It might be a bug in DataImportHandler because it reads the data-config as a
 string from an InputStream. If your default platform encoding is not UTF-8,
 this may be the cause.


I've opened SOLR-1090 to fix this issue.

https://issues.apache.org/jira/browse/SOLR-1090

-- 
Regards,
Shalin Shekhar Mangar.