bruns added a comment.

  I think this design has several problems:
  
  1. adding or renaming documents becomes very write heavy
    - consider adding e.g. foobar.png - this will add 7 bigrams. Each bigram 
has an associated list of matching documents. The document list is sorted. This 
means we have 7 read-modify-write operations.
  
  2. Lots of additional data
    - to be fast, we have to keep this data in memory. We also have to fetch it 
from disk at startup.
  
  3. It ties the searching algorithm to the data structure
  
  4. The searching may become inefficient when the data set becomes large
    - The lookup of each bigram is fast
    - You have to lookup several bigrams when the search term becomes longer
    - You have to evaluate all result sets and combine them in some way

INLINE COMMENTS

> fuzzysearchtest.cpp:63
> +}
> +
> +void FuzzySearchTest::testFeatures()

The test cases above are good in general, as these test a small aspect.
Improvement:
These are not different test cases, but iterations of the same test. This calls 
for data driven testing, see:
http://doc.qt.io/qt-5/qttestlib-tutorial2-example.html

> fuzzysearchtest.cpp:69
> +  QMap<FuzzyFeature, FuzzyDataList> correct;
> +
> +  auto make = [](quint8 wid, quint8 len) -> FuzzyDataList {

I think this test should be written in a different way:

- check if the map has the correct size
- check if each entry has the correct docId
- check the terms:

  for (const auto& feat : { "no", "ot", "te", "es"} ) {
      QCOMPARE(exported[feat].size(), 1); // one document
      QCOMPARE(exported[feat][0].wid, 0); // first word
      QCOMPARE(exported[feat][0].len, 5); // strlen("notes")
  }
  ...
  QCOMPARE(exported["md"][0].wid, 3);
  QCOMPARE(exported["md"][0].len, 2);

> fuzzysearchtest.cpp:119
> +  db.unite(FuzzySearch::features(2, file2));
> +  db.unite(FuzzySearch::features(3, file3));
> +  db.unite(FuzzySearch::features(4, file4));

You should have a separate test case here, or even 2 - you should test if the 
merging works as expected:

- for same feats from different documents ("notes")
- for same feat from one documents ("not_es_", "wedn_es_day", i.e. "es")

> fuzzysearchtest.cpp:129
> +  db.unite(FuzzySearch::features(12, file12));
> +
> +  FuzzyDataList list;

If you create a list of the files, you can use a loop for the insertion here.

  QList<QStringList> files = {
      {"notes", "april8", "2018", "md"},
      {"notes", "wednesday", "04092018", "md"},
  ...
      {"LMC2200"}
  }
  ...
  for (size_t i = 0; i < files.size(); i++) {
    db.unite(FuzzySearch::features(i, files[i]));
  }

> fuzzysearchtest.cpp:131
> +  FuzzyDataList list;
> +  FuzzyDataGetter getter = [&](const FuzzyFeature& feat) -> const 
> FuzzyDataList& {
> +    list.m_datalist.clear();

This function itself calls for a unit test - for a given feature, return the 
matching documents
Also, either capture the list, or return it (then, by value, not by reference), 
but don't do both. I strongly prefer return by value here.

> fuzzysearchtest.cpp:143
> +  results = fuzzy.search(QString("wensday"), getter);
> +  QCOMPARE(results, QList<quint64>({ 2 }));
> +

If you create a `QSet<QStringlist> foundFiles = files[result[0]];`, you can 
compare it with `QSet<QStringList>({{"notes", "wednesday", "04092018", "md"}});`
This way, it becomes obvious what you expect as output here.
To make it more obvious what the QStringList refers to, you can use `using 
FileNameTerms = QStringList;`

> CMakeLists.txt:18
>      postingdb.cpp
> +    fuzzydb.cpp
>      postingiterator.cpp

Please remove the db from the patch, this should be a separate patch, after the 
fuzzy matcher itself is paved out.

> database.cpp:25
>  #include "postingdb.h"
> +#include "fuzzydb.h"
>  #include "documentdb.h"

later ...

> database.cpp:108
>       */
> -    mdb_env_set_maxdbs(m_env, 12);
> +    mdb_env_set_maxdbs(m_env, 13);
>  

later ...

> fuzzysearch.cpp:80
> +      // Keep track of the score of each document and its length
> +      auto score = scores.find(doc);
> +      if (score == scores.end()) {

You are matching for document *and* term here, I don't think thats what you 
want.

REPOSITORY
  R293 Baloo

REVISION DETAIL
  https://phabricator.kde.org/D13747

To: michaeleden, vhanda, #baloo
Cc: bruns, kde-frameworks-devel, #baloo, ashaposhnikov, michaelh, astippich, 
spoorun, ngraham, abrahams

Reply via email to