Hi Jens,

Thanks for the reply!

I used iconv (thanks for the pointer, I had no idea this tool existed)
and was able to convert all of the articles to and from utf8 without
any errors being generated, so I am pretty sure that the input sources
are valid utf8.
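
A minimal sketch of this kind of check, for anyone who wants to repeat it. It is a variation on what I ran, not the exact command: it converts each file from UTF-8 to UTF-8 and relies on iconv exiting non-zero when it meets a byte sequence it cannot decode. The function name check_utf8 and the ARTICLE_DIR variable are made up for illustration.

```shell
#!/bin/sh
# check_utf8 is a hypothetical helper name, not part of iconv or ferret.
# It prints the name of every file that iconv cannot decode as UTF-8
# and returns non-zero if at least one such file was found.
check_utf8() {
    status=0
    for f in "$@"; do
        # Converting UTF-8 to UTF-8 changes nothing, but iconv still has to
        # decode the input, so it fails on invalid byte sequences.
        if ! iconv -f UTF-8 -t UTF-8 "$f" > /dev/null 2>&1; then
            echo "invalid UTF-8: $f"
            status=1
        fi
    done
    return "$status"
}

# Usage (ARTICLE_DIR is a placeholder for wherever the articles live):
#   check_utf8 "$ARTICLE_DIR"/*.es
```

A file that passes this check can still be rejected by the indexer for other reasons, so this only rules out malformed input.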

I should mention that I am using an old version of ferret, v0.9.6,
which is the last version to have a pure-ruby implementation.  I'm
using this version because I have added some changes which allow me
to specify the scoring algorithm used on a per-search basis.  I
haven't, however, made any changes to the indexing portion of the
application.

I currently have an iconv script creating transliterated ASCII copies of
all my articles, so I am going to try to index over these.  Also, I am
thinking of trying to index using Lucene since there is a chance that
the older version of ferret is compatible with Lucene indexes.
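
Roughly, the transliteration step looks like the sketch below. This is a simplified illustration rather than my actual script: the function name translit_copies and the directory names are placeholders, and it assumes an iconv that supports the //TRANSLIT suffix (GNU iconv does; other implementations may not). What each accented character turns into depends on the iconv implementation and the current locale.

```shell
#!/bin/sh
# translit_copies is a hypothetical helper name.
# Usage: translit_copies OUT_DIR FILE...
# Writes an ASCII-transliterated copy of each FILE into OUT_DIR,
# keeping the original file name.
translit_copies() {
    out_dir=$1
    shift
    mkdir -p "$out_dir"
    for f in "$@"; do
        # //TRANSLIT asks iconv to approximate characters that have no
        # ASCII equivalent instead of stopping at the first one.
        iconv -f UTF-8 -t ASCII//TRANSLIT "$f" > "$out_dir/$(basename "$f")" \
            || echo "warning: iconv reported a problem with $f" >&2
    done
}

# Usage (both directory names are placeholders):
#   translit_copies "$ASCII_DIR" "$ARTICLE_DIR"/*.es
```

Indexing the ASCII copies sidesteps the decoding error, at the cost of losing accents in the indexed terms, so queries would need the same transliteration applied.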

If you have any other suggestions I'd love to hear them, but I
understand that I can't expect much help with such an old version.  Do
you know of a way to specify custom scoring algorithms in the current
versions of ferret?

Best,
Eric

On Monday, May 19, at 23:15, Jens Kraemer wrote:
 > Hi!
 > 
 > Are you *sure* this is all valid UTF8? I don't know how the file
 > command determines this, or whether it is always right.
 > Maybe try to play around with iconv to ensure whatever you send to  
 > Ferret really is UTF8.
 > 
 > Cheers,
 > Jens
 > 
 > On 19.05.2008, at 18:00, Eric Schulte wrote:
 > 
 > > Hi,
 > >
 > > I am trying to index a number of Spanish language text files, but a
 > > large fraction of the files are generating errors like the
 > > following...
 > >
 > > Error: exception 2 not handled: Error decoding input string. Check  
 > > that you have the locale set correctly
 > >
 > > however it looks to me like my locale matches the file type.  Running
 > > the file command on the files returns
 > >
 > > $ file /media/.../raw/abc/20Jan2007_abc_001041_67.es
 > > /media/.../raw/abc/20Jan2007_abc_001041_67.es: UTF-8 Unicode text
 > >
 > > and my locale is
 > >
 > > $ locale
 > > LANG=en_US.UTF-8
 > > LC_CTYPE="en_US.UTF-8"
 > > LC_NUMERIC="en_US.UTF-8"
 > > LC_TIME="en_US.UTF-8"
 > > LC_COLLATE="en_US.UTF-8"
 > > LC_MONETARY="en_US.UTF-8"
 > > LC_MESSAGES="en_US.UTF-8"
 > > LC_PAPER="en_US.UTF-8"
 > > LC_NAME="en_US.UTF-8"
 > > LC_ADDRESS="en_US.UTF-8"
 > > LC_TELEPHONE="en_US.UTF-8"
 > > LC_MEASUREMENT="en_US.UTF-8"
 > > LC_IDENTIFICATION="en_US.UTF-8"
 > > LC_ALL=
 > >
 > >
 > > after enough of these errors are generated, I begin to get errors for
 > > having too many open files, and the indexing fails.
 > >
 > > Error: exception 2 not handled: Too many open files
 > >
 > > Any suggestions would be greatly appreciated.
 > >
 > > Thanks,
 > > Eric
 > > _______________________________________________
 > > Ferret-talk mailing list
 > > [email protected]
 > > http://rubyforge.org/mailman/listinfo/ferret-talk
 > >
 > 
 > --
 > Jens Krämer
 > Finkenlust 14, 06449 Aschersleben, Germany
 > VAT Id DE251962952
 > http://www.jkraemer.net/ - Blog
 > http://www.omdb.org/     - The new free film database
 > 

-- 
schulte
