Re: [Dspace-tech] Problem with ordering in browsing
I have solved the problem. It is a hack, but it works.

The problem was that item browsing did not work correctly when the values in the bi_* tables were normalised (each character with a diacritic split into two characters: first the character without the diacritic, then the diacritic itself), while sorting of search results did not work correctly when the values in bi_* were not normalised :)

I use my own class, OrderLowerTrim.java, to create the sort strings:

    public class OrderLowerTrim extends AbstractTextFilterOFD
    {
        {
            // Only lower-case and trim the value
            filters = new TextFilter[] { new LowerCaseAndTrim() };
        }
    }

So the strings in the database are not normalised. I also modified org.dspace.search.DSIndexer.java to ignore the configuration and always make the sort strings normalised. The change is on line 1166:

    //String value = OrderFormat.makeSortString(dcv[0].value, dcv[0].language, so.getType());
    String value = (new OrderFormatTitle()).makeSortString(dcv[0].value, dcv[0].language);

Sort strings there are now always made by OrderFormatTitle, which produces normalised values.

Graham, thanks again for your response, it was very helpful. Can you explain why simply writing UTF-8 to the database tables results in very random sorting for diacritics? I do it now in DSpace and it works.

2011/5/23 Ladislav Kulhanek ladislav.kulha...@vsb.cz
> Thanks for responses. I created class OrderFormatLocale: [...]
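To make the "normalised" form mentioned in this message concrete, here is a minimal sketch using java.text.Normalizer from the JDK. It only illustrates the split of a character into base letter plus combining diacritic (Unicode NFD). Whether the DSpace filters perform exactly this normalisation is an assumption, not something the thread states, and the class name NormaliseDemo is hypothetical.

```java
import java.text.Normalizer;

final class NormaliseDemo {
    /** Decomposes accented characters into base character + combining mark (Unicode NFD). */
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        // "\u010C" is the precomposed letter C with caron, as in "Čabla".
        String original = "\u010Cabla";
        String decomposed = decompose(original);

        System.out.println(original.length());    // 5
        System.out.println(decomposed.length());  // 6
        System.out.println(decomposed.charAt(0)); // C
        // The second character is the combining caron on its own.
        System.out.printf("U+%04X%n", (int) decomposed.charAt(1)); // U+030C
    }
}
```

After decomposition the string starts with a plain C, so a comparison that looks only at base letters places it among the other C entries.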
Re: [Dspace-tech] Problem with ordering in browsing
Thanks for the responses.

I created the class OrderFormatLocale:

    public class OrderFormatLocale extends AbstractTextFilterOFD
    {
        {
            filters = new TextFilter[] { new LowerCaseAndTrim(), new LocaleOrderingFilter() };
        }
    }

but the sorting was then very strange. For example, the alphabet started with B, A came after D, and there were other oddities like this. So I modified the class by removing LocaleOrderingFilter:

    public class OrderFormatLocale extends AbstractTextFilterOFD
    {
        {
            filters = new TextFilter[] { new LowerCaseAndTrim() };
        }
    }

Then sorting was correct in browsing (by title, author and subject), but it became incorrect in search results. When search results are sorted by title or author, strings with diacritics are sorted to the end, after all letters without diacritics.

2011/5/19 Graham Triggs grahamtri...@gmail.com:
> Please take a look at a previous post of mine on this subject: [...]
[Dspace-tech] Problem with ordering in browsing
Hello everybody.

We have data in our DSpace in the Czech language (code cs in ISO 639-1), and we have a problem with the order when browsing by author, title and subject (the order in search results is correct).

The Czech alphabet has letters with diacritics, for example Č (Unicode code point U+010C). This letter should be ordered between C and D, but DSpace orders it in the same place as C. For example, we get the list ordered as

    Cabanová, Zuzana
    Čabla, Michael
    Cablová, Barbora

and the list should be

    Cabanová, Zuzana
    Cablová, Barbora
    Čabla, Michael

The Czech alphabet also contains the letter Ch (it consists of two characters). This letter should be ordered between h and i, and DSpace orders it correctly. So it looks like DSpace orders according to the Czech alphabet but ignores diacritics.

We have DSpace 1.7.1 with Manakin and PostgreSQL 8.4 (the database has Collation and Ctype set to cs_CZ.UTF-8), and the Tomcat connector has URIEncoding=UTF-8.

Any idea how to solve it?

Thanks.
Ladislav Kulhanek
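The ordering asked for here can be illustrated outside DSpace with the JDK's own java.text.Collator. This is only a sketch of locale-aware collation in plain Java, not how DSpace or PostgreSQL sort. It assumes the runtime ships Czech collation rules (standard JDK builds do), and the class name CzechCollationDemo is hypothetical.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class CzechCollationDemo {
    /** Returns a copy of the names sorted with the JDK's collation rules for Czech. */
    static List<String> sortCzech(List<String> names) {
        Collator czech = Collator.getInstance(Locale.forLanguageTag("cs-CZ"));
        List<String> sorted = new ArrayList<>(names);
        sorted.sort(czech); // Collator is a Comparator
        return sorted;
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source file ASCII-only:
        // \u010C = C with caron, \u00E1 = a with acute.
        List<String> names = Arrays.asList(
                "\u010Cabla, Michael",
                "Cablov\u00E1, Barbora",
                "Caban\u006Fv\u00E1, Zuzana");

        // With the Czech rules, the C-caron entry is expected after both C entries.
        for (String name : sortCzech(names)) {
            System.out.println(name);
        }
    }
}
```

With Czech rules in effect, Č is treated as a separate letter after C rather than as C with an accent, which is the behaviour requested for the browse lists.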
Re: [Dspace-tech] Problem with ordering in browsing
Hi Ladislav,

I've noticed that our librarians here are happier with sorting when we use a collation of C as opposed to utf8/en_US.

    postgres=# create database dspace with owner = dspace encoding='utf8' tablespace=pg_default lc_collate = 'C' lc_ctype='en_US.UTF-8' template template0;

I've added these three authors to a test collection that had some sample data in it, and it gives the results you were expecting:

    == Author Name ==
    Cabanová, Zuzana
    Cablová, Barbora
    creatorlast, creatorfirst
    Čabla, Michael

Peter Dietz

On Thu, May 19, 2011 at 4:41 AM, Ladislav Kulhanek ladislav.kulha...@vsb.cz wrote:
> Hello everybody. We have data in our DSpace in czech language [...]
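With lc_collate = 'C', PostgreSQL compares strings byte by byte instead of applying language rules. For the names in this message, Java's natural String order (comparison of UTF-16 code units) gives the same sequence, so the listing above can be reproduced with the following sketch. It only illustrates code-point ordering and is not part of DSpace; the class name CodePointOrderDemo is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class CodePointOrderDemo {
    /** Returns a copy of the names sorted by plain character codes, with no locale rules. */
    static List<String> sortByCharacterCode(List<String> names) {
        List<String> sorted = new ArrayList<>(names);
        sorted.sort(Comparator.naturalOrder()); // String.compareTo
        return sorted;
    }

    public static void main(String[] args) {
        // \u010C = C with caron, \u00E1 = a with acute.
        List<String> names = Arrays.asList(
                "\u010Cabla, Michael",
                "creatorlast, creatorfirst",
                "Cablov\u00E1, Barbora",
                "Cabanov\u00E1, Zuzana");

        // Order: upper-case C (0x43), then lower-case c (0x63), then C-caron (0x10C).
        for (String name : sortByCharacterCode(names)) {
            System.out.println(name);
        }
    }
}
```

The same effect puts every letter with a diacritic after z when only code values are compared, because those letters all have higher code points than the unaccented Latin alphabet. It also separates upper-case from lower-case entries, as the position of "creatorlast" shows.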
Re: [Dspace-tech] Problem with ordering in browsing
Please take a look at a previous post of mine on this subject:

http://dspace.2283337.n4.nabble.com/Browse-UTF-8-and-sorting-in-1-5-tp3281449p3281450.html

Regards,
G

On 19 May 2011 15:18, Peter Dietz pdiet...@gmail.com wrote:
> Hi Ladislav, I've noticed that our librarians here are happier with sorting when we use the collate of C as opposed to utf8/en_US. [...]

--
DSpace-tech mailing list
DSpace-tech@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-tech