Hi Jens,

Thanks for the reply!

I used iconv (thanks for the pointer, I had no idea this tool existed) and was able to convert all of the articles to and from UTF-8 without any errors being generated, so I am pretty sure that the input sources are valid UTF-8.

I should mention that I am using an old version of Ferret, v0.9.6, which is the last version to have a pure-Ruby implementation. I'm using this version because I have added some changes which allow me to specify the scoring algorithm used on a per-search basis. I haven't, however, made any changes to the indexing portion of the application.

I currently have an iconv script creating transliterated ASCII copies of all my articles, so I am going to try indexing those instead. I am also thinking of trying to index using Lucene, since there is a chance that the older version of Ferret is compatible with Lucene indexes.

If you have any other suggestions I'd love to hear them, but I understand that I can't expect much help with such an old version. Do you know of a way to specify custom scoring algorithms in the current versions of Ferret?

Best,
Eric

On Monday, May 19, at 23:15, Jens Kraemer wrote:

> Hi!
>
> Are you *sure* this is all valid UTF8? I don't know how the file
> command determines this, and if it always is right.
> Maybe try to play around with iconv to ensure whatever you send to
> Ferret really is UTF8.
>
> Cheers,
> Jens
>
> On 19.05.2008, at 18:00, Eric Schulte wrote:
>
> > Hi,
> >
> > I am trying to index a number of Spanish language text files, but a
> > large fraction of the files are generating errors like the
> > following...
> >
> > Error: exception 2 not handled: Error decoding input string. Check
> > that you have the locale set correctly
> >
> > however it looks to me like my locale matches the file type. Running
> > the file command on the files returns
> >
> > $ file /media/.../raw/abc/20Jan2007_abc_001041_67.es
> > /media/.../raw/abc/20Jan2007_abc_001041_67.es: UTF-8 Unicode text
> >
> > and my locale is
> >
> > $ locale
> > LANG=en_US.UTF-8
> > LC_CTYPE="en_US.UTF-8"
> > LC_NUMERIC="en_US.UTF-8"
> > LC_TIME="en_US.UTF-8"
> > LC_COLLATE="en_US.UTF-8"
> > LC_MONETARY="en_US.UTF-8"
> > LC_MESSAGES="en_US.UTF-8"
> > LC_PAPER="en_US.UTF-8"
> > LC_NAME="en_US.UTF-8"
> > LC_ADDRESS="en_US.UTF-8"
> > LC_TELEPHONE="en_US.UTF-8"
> > LC_MEASUREMENT="en_US.UTF-8"
> > LC_IDENTIFICATION="en_US.UTF-8"
> > LC_ALL=
> >
> > after enough of these errors are generated, I begin to get errors for
> > having too many open files, and the indexing fails.
> >
> > Error: exception 2 not handled: Too many open files
> >
> > Any suggestions would be greatly appreciated.
> >
> > Thanks,
> > Eric
> > _______________________________________________
> > Ferret-talk mailing list
> > [email protected]
> > http://rubyforge.org/mailman/listinfo/ferret-talk
>
> --
> Jens Krämer
> Finkenlust 14, 06449 Aschersleben, Germany
> VAT Id DE251962952
> http://www.jkraemer.net/ - Blog
> http://www.omdb.org/ - The new free film database

--
schulte
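
For anyone following the thread, the UTF-8 validity check discussed above can also be done from Ruby instead of iconv. This is only a rough sketch, not something taken from Ferret or from the scripts mentioned in this message: the helper names valid_utf8? and first_invalid_offset are made up for illustration, and it assumes Ruby 1.9 or later (String#force_encoding and String#valid_encoding?), which may not match the Ruby an old pure-Ruby Ferret setup runs on.

```ruby
# Sketch only: check whether raw bytes decode cleanly as UTF-8.
# Helper names are hypothetical; nothing here calls into Ferret.

# True when the given bytes form a valid UTF-8 string.
def valid_utf8?(bytes)
  bytes.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

# Byte offset of the first invalid sequence, or nil if the input is valid.
def first_invalid_offset(bytes)
  str = bytes.dup.force_encoding(Encoding::UTF_8)
  return nil if str.valid_encoding?

  offset = 0
  str.each_char do |ch|
    # Invalid bytes are yielded as characters that fail valid_encoding?.
    return offset unless ch.valid_encoding?
    offset += ch.bytesize
  end
  nil
end

# When run as a script, report every file given on the command line
# that is not valid UTF-8, with the offset of the first bad byte.
if __FILE__ == $PROGRAM_NAME
  ARGV.each do |path|
    bytes = File.binread(path)
    next if valid_utf8?(bytes)
    puts "#{path}: invalid UTF-8 at byte #{first_invalid_offset(bytes)}"
  end
end
```

Saved under any file name and run with the article paths as arguments, it prints only the files that fail the check, which might help narrow down which inputs trigger the decoding error.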

