Hi, thanks a lot for the hints. Changing the locale did not work, but I now have a better understanding of the problem, and I was able to hack together a "fix" for the StandardTokenizer.
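In case it helps anyone reproduce this: stress marks are easy to miss on screen because they are drawn on top of the preceding letter, so it can help to list the code points of the input. A minimal sketch only; to_code_points is a helper name made up for this mail, not something from CLucene, and it assumes a wchar_t string such as the one QString::toStdWString() returns in my test program.

```cpp
#include <cassert>
#include <cstddef>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper (not part of CLucene): renders every element of a wide
// string as U+XXXX so that combining marks become visible.
// Note: where wchar_t is 16 bits wide (Windows), characters outside the BMP
// show up as two surrogate values.
std::string to_code_points(const std::wstring& text)
{
    std::ostringstream out;
    out << std::uppercase << std::hex << std::setfill('0');
    for (std::size_t i = 0; i < text.size(); ++i) {
        if (i > 0) {
            out << ' ';
        }
        // setw only applies to the next inserted value, i.e. the number.
        out << "U+" << std::setw(4) << static_cast<unsigned long>(text[i]);
    }
    return out.str();
}
```

Called with the letter and its accent, to_code_points(L"\u0430\u0301") returns "U+0430 U+0301", which makes the separate second code point visible.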
Федера́ция: here the а́ character is actually split into а and  ́, where the latter (U+0301 COMBINING ACUTE ACCENT) is not considered alphanumeric by the _istalnum(ch) function. The check in question:

    #define ALNUM (_istalnum(ch) != 0)

Thanks for the help, have a nice day!

On Wed, 24 Jul 2019 at 15:57, Kostka Bořivoj <kos...@tovek.cz> wrote:

> Hi,
>
> The problem should be in StandardTokenizer. Unfortunately I’m not familiar
> with it, as we are using our own tokenizer.
> So I’m just guessing.
>
> 1) It uses _istspace, which is mapped to iswspace. Some time ago I
> discovered that these functions use the standard “C” locale by default
> (and don’t work well with non-English characters).
> We solved this problem by calling setlocale( LC_CTYPE, "" ) during program
> startup. No idea if this helps, but it is easy to try.
>
> 2) I have really bad experience with non-ASCII characters inside
> source code, especially in the multiplatform environment we use (Windows +
> Linux). It should work OK if the file is in UTF-8, but we still had
> BOM/without BOM issues. We encode characters as \uNNNN if we need them in
> source (there are free online converters, like
> https://www.mobilefish.com/services/unicode_escape_sequence_converter/unicode_escape_sequence_converter.php).
>
> Borek
>
> From: Tamás Dömők [mailto:domokta...@gmail.com]
> Sent: Wednesday, July 24, 2019 3:18 PM
> To: clucene-developers@lists.sourceforge.net
> Subject: Re: [CLucene-dev] Wildcard query on a Russian text is not
> working for me
>
> Hi,
>
> I checked my index with Luke. These are the tokens in my index:
>
> 1 content официально
> 1 content росси
> 1 content также
> 1 content федера
> 1 content ция
> 1 content я
> 1 content йская
>
> It's interesting that the word Федера́ция is split into федера and ция.
>
> Yes, TCHAR is defined as wchar_t on my platform(s) (it behaves exactly the
> same on Mac, Linux and Windows for me).
>
> Thanks for this Luke tool, it's awesome.
>
> On Wed, 24 Jul 2019 at 14:49, Kostka Bořivoj <kos...@tovek.cz> wrote:
>
> > Hi again
> >
> > It would be interesting to explore the index content. It seems to me the
> > word “Федера́ция” is treated as two words, Федер and ция (in other
> > words, а́ is treated as a space).
> >
> > You can use Luke (https://code.google.com/archive/p/luke/downloads) to
> > explore the index content.
> >
> > Regards
> >
> > Borek
> >
> > From: Tamás Dömők [mailto:domokta...@gmail.com]
> > Sent: Wednesday, July 24, 2019 11:41 AM
> > To: clucene-developers@lists.sourceforge.net
> > Subject: [CLucene-dev] Wildcard query on a Russian text is not working
> > for me
> >
> > Hi all,
> >
> > I'm trying to index some Russian content and search in this content using
> > the CLucene library (v2.3.3.4-10). It works most of the time, but on some
> > words the wildcard query is not working for me, and I have no idea why.
> >
> > Can anybody help me on this, please?
> >
> > Here is my source code:
> >
> > main.cc:
> >
> > #include <QCoreApplication>
> >
> > #include <QString>
> > #include <QDebug>
> > #include <QScopedPointer>
> >
> > #include <CLucene.h>
> >
> > const TCHAR FIELD_CONTENT[] = L"content";
> > const char INDEX_PATH[] = "/tmp/index";
> >
> > void create_index(const QString &content)
> > {
> >     lucene::analysis::standard::StandardAnalyzer analyzer;
> >     lucene::index::IndexWriter writer(INDEX_PATH, &analyzer, true);
> >
> >     lucene::document::Document doc;
> >     std::wstring content_buffer = content.toStdWString();
> >     doc.add(*_CLNEW lucene::document::Field(FIELD_CONTENT,
> >                                             content_buffer.data(),
> >                                             lucene::document::Field::STORE_NO |
> >                                             lucene::document::Field::INDEX_TOKENIZED |
> >                                             lucene::document::Field::TERMVECTOR_NO,
> >                                             true));
> >     writer.addDocument(&doc);
> >
> >     writer.flush();
> >     writer.close(true);
> > }
> >
> > void search(const QString &query_string)
> > {
> >     lucene::search::IndexSearcher searcher(INDEX_PATH);
> >
> >     lucene::analysis::standard::StandardAnalyzer analyzer;
> >     lucene::queryParser::QueryParser parser(FIELD_CONTENT, &analyzer);
> >     parser.setAllowLeadingWildcard(true);
> >
> >     std::wstring query = query_string.toStdWString();
> >     QScopedPointer< lucene::search::Query > lucene_query(parser.parse(query.c_str(), FIELD_CONTENT, &analyzer));
> >     QScopedPointer< lucene::search::Hits > hits(searcher.search(lucene_query.data()));
> >
> >     TCHAR *query_debug_string(lucene_query->toString());
> >     qDebug() << "found?" << query_string << QString::fromWCharArray(query_debug_string) << (hits->length() > 0);
> >     free(query_debug_string);
> > }
> >
> > int main(int argc, char *argv[])
> > {
> >     QCoreApplication a(argc, argv);
> >
> >     create_index(QString("Росси́я официально также Росси́йская Федера́ция"));
> >
> >     search(QString("noWordLkeThis")); // ok
> >
> >     search(QString("Федера́ция")); // ok
> >     search(QString("Федер*ция")); // ERROR: it should work, but it doesn't
> >     search(QString("Фед*")); // ok
> >     search(QString("Федер")); // ok
> >     search(QString("\"федера ция\"")); // why is this working?
> >
> >     search(QString("официально")); // ok
> >     search(QString("офиц*ьно")); // ok
> >     search(QString("оф*циально")); // ok
> >     search(QString("офици*но")); // ok
> >
> >     return 0;
> > }
> >
> > cluceneutf8.pro:
> >
> > QT -= gui
> >
> > CONFIG += c++11 console
> > CONFIG -= app_bundle
> >
> > CONFIG += link_pkgconfig
> > PKGCONFIG += libclucene-core
> >
> > SOURCES += \
> >     main.cc
> >
> > qmake && make && ./cluceneutf8
> >
> > The output of the program:
> >
> > found? "noWordLkeThis" "content:nowordlkethis" false
> > found? "Федера́ция" "content:\"федера ция\"" true
> > found? "Федер*ция" "content:федер*ция" false
> > found? "Фед*" "content:фед*" true
> > found? "Федер" "content:федер" false
> > found? "\"федера ция\"" "content:\"федера ция\"" true
> > found? "официально" "content:официально" true
> > found? "офиц*ьно" "content:офиц*ьно" true
> > found? "оф*циально" "content:оф*циально" true
> > found? "офици*но" "content:офици*но" true
> >
> > It's built with Qt and qmake, but I also made a non-Qt version; if that
> > would be better to share, I can.
> >
> > So my problem is that I can search for Федера́ция but I can't search
> > for Федер*ция, for example. Other words like официально can be
> > searched anyway.
> >
> > Thanks.
> >
> > --
> > Dömők Tamás
>
> --
> Dömők Tamás

--
Dömők Tamás
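P.S. For the archive, a minimal sketch of the idea behind such a hack, not a patch: keep treating a character as part of a token when it is a combining diacritic. The names is_combining_diacritic and is_token_char are made up for illustration and are not part of CLucene, and std::iswalnum stands in for _istalnum, which is an assumption about how that macro is mapped on a given platform.

```cpp
#include <cassert>
#include <cwctype>

// Hypothetical helper, not part of CLucene: true for code points in the
// Unicode "Combining Diacritical Marks" block (U+0300..U+036F), which
// contains U+0301 COMBINING ACUTE ACCENT.
inline bool is_combining_diacritic(wint_t ch)
{
    return ch >= 0x0300 && ch <= 0x036F;
}

// Hypothetical stand-in for the ALNUM test: a character continues a token if
// the C library classifies it as alphanumeric OR it is a combining diacritic.
// Assumption: std::iswalnum plays the role of _istalnum here.
inline bool is_token_char(wint_t ch)
{
    return std::iswalnum(ch) != 0 || is_combining_diacritic(ch);
}
```

In the tokenizer, the ALNUM macro would then test something like is_token_char(ch) instead of _istalnum(ch) alone. One consequence to be aware of: the accent then stays inside the indexed term, so a query typed without the stress mark would presumably still not match unless the marks are stripped somewhere else in the analysis chain.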
_______________________________________________ CLucene-developers mailing list CLucene-developers@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/clucene-developers