I have solved the problem. It is a hack, but it works.
The problem was that item browsing did not work correctly when the values in
the bi_* tables were normalised (a character with a diacritic split into two
characters: first the base character without the diacritic, then the diacritic
itself), and search result sorting did not work correctly when the values in
bi_* were not normalised :)
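
To make "normalised" concrete, here is a minimal sketch using only the standard java.text.Normalizer class. It assumes that the normalisation in question is Unicode canonical decomposition (NFD); the class name NormalizationDemo and the helper decompose are made up for illustration and are not DSpace code.

```java
import java.text.Normalizer;

public class NormalizationDemo {

    /** Returns the NFD (canonically decomposed) form of the input. */
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        // "Cabla" with a precomposed C-with-caron (U+010C) as the first letter
        String composed = "\u010Cabla";
        // After NFD: plain 'C', then the combining caron U+030C, then "abla"
        String decomposed = decompose(composed);

        System.out.println(composed.length());           // 5
        System.out.println(decomposed.length());         // 6
        System.out.println(decomposed.charAt(0) == 'C'); // true
        System.out.println(composed.equals(decomposed)); // false
    }
}
```

The two strings look the same on screen but are different character sequences, so any code that compares or sorts them character by character will treat them differently.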

I use my own class OrderLowerTrim.java for creating sort strings:

public class OrderLowerTrim extends AbstractTextFilterOFD {

    {
        filters = new TextFilter[] { new LowerCaseAndTrim() };
    }

}

So the strings stored in the db are not normalized. I also modified
org.dspace.search.DSIndexer.java to ignore the configuration and always make
normalized sort strings. The change is on line 1166:

// String value = OrderFormat.makeSortString(dcv[0].value, dcv[0].language,
//         so.getType());
String value = (new OrderFormatTitle()).makeSortString(dcv[0].value,
        dcv[0].language);

Sort strings are now always made by OrderFormatTitle, which produces
normalized values.

Graham, thanks again for your response; it was very helpful. Can you
explain why simply writing UTF-8 to the database tables results in very
random sorting for diacritics? I do that now in DSpace and it works.
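
For what it is worth, one ordering that shows up in this thread can be reproduced with plain Java string comparison. The sketch below is only an illustration under the assumption that the strings are stored precomposed and compared code unit by code unit (which is what String.compareTo does, and which matches byte order under the 'C' collation for these names); it is not a statement about what DSpace or the search index does internally. The class name CodePointOrderDemo and the helper sortByCodeUnits are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class CodePointOrderDemo {

    /** Sorts a copy of the input by String's natural (UTF-16 code unit) order. */
    static List<String> sortByCodeUnits(List<String> names) {
        List<String> sorted = new ArrayList<>(names);
        Collections.sort(sorted);
        return sorted;
    }

    public static void main(String[] args) {
        // The author names from the thread, with precomposed characters
        List<String> names = Arrays.asList(
                "\u010Cabla, Michael",        // starts with U+010C
                "creatorlast, creatorfirst",
                "Cablov\u00E1, Barbora",
                "Cabanov\u00E1, Zuzana");

        // U+010C is greater than every ASCII letter, so the name starting
        // with it ends up last, after both 'C...' and 'c...' entries.
        for (String name : sortByCodeUnits(names)) {
            System.out.println(name);
        }
    }
}
```

This gives the same order as Peter's list quoted below, and it also shows why precomposed letters with diacritics land after all unaccented letters when no locale-aware collation is applied.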

2011/5/23 Ladislav Kulhanek <ladislav.kulha...@vsb.cz>

> Thanks for responses.
>
> I created class OrderFormatLocale:
>
>  public class OrderFormatLocale extends AbstractTextFilterOFD {
>        {
>                filters = new TextFilter[] { new LowerCaseAndTrim(),
>                                             new LocaleOrderingFilter() };
>        }
> }
>
> but the sorting was then very strange. For example, the alphabet started
> with B, A came after D, and there were other oddities like this. So I
> modified the class by removing LocaleOrderingFilter, giving this form:
>
> public class OrderFormatLocale extends AbstractTextFilterOFD {
>        {
>                filters = new TextFilter[] { new LowerCaseAndTrim()};
>        }
> }
>
> Then sorting was correct in browsing (by title, author and subject
> too) but became incorrect in search results. When search results are
> sorted by title or author, strings with diacritics are sorted to the
> end, after all letters without diacritics.
>
> 2011/5/19 Graham Triggs <grahamtri...@gmail.com>:
> > Please take a look at a previous post of mine on this subject:
> >
> > http://dspace.2283337.n4.nabble.com/Browse-UTF-8-and-sorting-in-1-5-tp3281449p3281450.html
> > Regards,
> > G
> >
> > On 19 May 2011 15:18, Peter Dietz <pdiet...@gmail.com> wrote:
> >>
> >> Hi Ladislav,
> >> I've noticed that our librarians here are happier with sorting when we
> >> use the collate of C as opposed to utf8/en_US.
> >>
> >> postgres=# create database "dspace" with owner = dspace encoding='utf8'
> >> tablespace=pg_default lc_collate = 'C' lc_ctype='en_US.UTF-8' template
> >> template0;
> >>
> >> I've added these three authors to a test collection that had some sample
> >> data in it, and it gives the results you were expecting:
> >> == Author Name ==
> >> Cabanová, Zuzana
> >> Cablová, Barbora
> >> creatorlast, creatorfirst
> >> Čabla, Michael
> >>
> >>
> >>
> >> Peter Dietz
> >>
> >>
> >>
> >> On Thu, May 19, 2011 at 4:41 AM, Ladislav Kulhanek
> >> <ladislav.kulha...@vsb.cz> wrote:
> >>>
> >>> Hello everybody.
> >>>
> >>> We have data in our DSpace in the Czech language (code "cs" in accordance
> >>> with ISO 639-1) and we have a problem with the order when browsing by
> >>> author, title and subject (the order in search results is correct).
> >>> The Czech alphabet has letters with diacritics, for example "Č"
> >>> (code 0x010C in Unicode). This letter should be ordered between "C"
> >>> and "D", but in DSpace it is ordered in the same place as "C". For
> >>> example, we get the list ordered as
> >>>
> >>> Cabanová, Zuzana
> >>> Čabla, Michael
> >>> Cablová, Barbora
> >>>
> >>> and this list should be
> >>>
> >>> Cabanová, Zuzana
> >>> Cablová, Barbora
> >>> Čabla, Michael
> >>>
> >>> The Czech alphabet also contains the letter "Ch" (it consists of two
> >>> characters). This letter should be ordered between "h" and "i", and
> >>> DSpace orders it correctly. So it looks like DSpace orders
> >>> in accordance with the Czech alphabet, but ignores diacritics.
> >>> We have DSpace 1.7.1, Manakin, db PostgreSQL 8.4 (the database has
> >>> Collation and Ctype set to cs_CZ.UTF-8), and the Tomcat connector has
> >>> URIEncoding="UTF-8". Any idea how to solve it? Thanks.
> >>>
> >>> Ladislav Kulhanek
> >>>
> >>>
> >>>
> ------------------------------------------------------------------------------
> >>> What Every C/C++ and Fortran developer Should Know!
> >>> Read this article and learn how Intel has extended the reach of its
> >>> next-generation tools to help Windows* and Linux* C/C++ and Fortran
> >>> developers boost performance applications - including clusters.
> >>> http://p.sf.net/sfu/intel-dev2devmay
> >>> _______________________________________________
> >>> DSpace-tech mailing list
> >>> DSpace-tech@lists.sourceforge.net
> >>> https://lists.sourceforge.net/lists/listinfo/dspace-tech
> >>
> >>
> >>
> >>
> >>
> >
> >
>