Re: Spell check on a subset of an index ( 'namespace' aware spell checker)

E. van Chastelet Tue, 06 Dec 2011 06:21:29 -0800

I'm still struggling with this.

I've tried to implement the solution mentioned in the previous reply, but unfortunately there is a blocking issue with this: I cannot find a way to create another index from the source index in a way that the new index has the field values in it. The only way to copy a document's field values from one index to another is to have stored fields. But stored fields hold "the original String in its entirety", and not the analyzed String, which I need. Is there another way to copy documents (with at least the spellcheck field) from one index to another?


Recap:

I have a source index holding documents for different namespaces. These documents hold one field (analyzed) that should be used for spell checking. I want to construct a spellchecker index for each namespace separately. To accomplish this, I first get the list of namespaces (each document has a namespace field in the original index). Then, for each namespace, I get the list of documents that match this namespace. Then I'd like to use this subset to construct a spellchecker index.
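The recap above can be modelled in plain Java without any temporary index. This is only a sketch under stated assumptions, not Lucene code: the `postings` map stands in for the inverted index of the analyzed spellcheck field (term to document ids), `namespaceOfDoc` stands in for the namespace field, and the class name `NamespaceDictionaries` is hypothetical. The point it illustrates is that the analyzed terms can be grouped per namespace by walking the postings, so stored field values are not needed.

```java
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** Plain-Java model of an inverted index; no Lucene classes are used (assumption). */
final class NamespaceDictionaries {

    /**
     * @param postings       analyzed term -> ids of the documents that contain it
     * @param namespaceOfDoc namespaceOfDoc[docId] is the namespace of that document
     * @return namespace -> sorted set of analyzed terms used by documents in that namespace
     */
    static Map<String, SortedSet<String>> build(Map<String, int[]> postings,
                                                String[] namespaceOfDoc) {
        Map<String, SortedSet<String>> result = new TreeMap<>();
        for (Map.Entry<String, int[]> entry : postings.entrySet()) {
            for (int docId : entry.getValue()) {
                // Each posting says "this term occurs in docId"; file the term
                // under the namespace that document belongs to.
                String namespace = namespaceOfDoc[docId];
                result.computeIfAbsent(namespace, k -> new TreeSet<>())
                      .add(entry.getKey());
            }
        }
        return result;
    }
}
```

Each resulting term set could then serve as the word list for one per-namespace spellchecker index. How the postings are read from the real index is left out here.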


Regards,
Elmer


On 11/23/2011 03:28 PM, E. van Chastelet wrote:

I currently have an idea to get it done, but it's not a nice solution.
If we have an index Q with all documents for all namespaces, we first extract the list of all terms that appear for the field namespace in Q (this field indicates the namespace of the document).
Then, for each namespace n in the terms list:
 - Get all docs from Q that match +namespace:n
 - Construct a temporary index from these docs
 - Use this temporary index to construct the dictionary, which the SpellChecker can use as input.
 - Call indexDictionary on SpellChecker to create spellcheck index for current namespace.
 - Delete temporary index

We now have separate spell check indexes for each namespace.

Any suggestions for a cleaner solution?

Regards,
Elmer van Chastelet



On 11/10/2011 01:16 PM, E. van Chastelet wrote:
Hi all,
In our project we would like to have the ability to get search results scoped to one 'namespace' (as we call it). This can easily be achieved by using a filter or just an additional must-clause.
For the spellchecker (and our autocompletion, which is a modified spellchecker), the story seems different. The spell checker index is created using a LuceneDictionary, which has an IndexReader as source. We would like to get (spellcheck/autocomplete) suggestions that are scoped to one namespace (i.e. field 'namespace' should have a particular value).
With a single source index containing docs for all namespaces, it seems not possible to create a spellcheck index for each namespace the ordinary way.
Q1: Is there a way to construct a LuceneDictionary from a subset of a single source index (all terms where namespace = %value%)?
Another, maybe better solution is to customize the spellchecker by adding an additional namespace field to the spellchecker index. At query time, an additional must-clause is added, scoping the suggestions to one (or more) namespace(s). The advantage of this is to have a singleton spellchecker (or at least the index reader) for all namespaces. This also means fewer open files in our application (imagine if there are over 1000 namespaces).
Q2: Will there be a significant penalty (say more than 50% slower) for the additional must-clause at query time?
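To make the second idea concrete, here is a small plain-Java model of one shared suggestion structure with a namespace restriction applied at lookup time. It is a sketch under assumptions, not the Lucene SpellChecker: the class `NamespacedSuggester` and its methods are hypothetical, it does prefix completion rather than n-gram spell checking, and the namespace set on each entry merely plays the role of the extra namespace field plus must-clause. It says nothing about the query-time cost asked about in Q2.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical single suggester shared by all namespaces (plain Java, no Lucene). */
final class NamespacedSuggester {

    // word -> namespaces in which the word occurs; models the extra namespace field
    private final TreeMap<String, Set<String>> entries = new TreeMap<>();

    void add(String word, String namespace) {
        entries.computeIfAbsent(word, k -> new HashSet<>()).add(namespace);
    }

    /** Returns up to max words that start with prefix and occur in the given namespace. */
    List<String> suggest(String prefix, String namespace, int max) {
        List<String> out = new ArrayList<>();
        // Keys are sorted, so all words sharing the prefix are contiguous
        // starting at the first key >= prefix.
        for (Map.Entry<String, Set<String>> e : entries.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix) || out.size() >= max) {
                break;
            }
            // The namespace check is the counterpart of the additional must-clause.
            if (e.getValue().contains(namespace)) {
                out.add(e.getKey());
            }
        }
        return out;
    }
}
```

With this shape there is one structure (and, in the real setting, one reader) regardless of how many namespaces exist. The trade-off is that every lookup also has to skip entries belonging to other namespaces.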
Q3: Or can you think of a better solution for this problem? :)
How we currently do it: we use Lucene 3.1 with Hibernate Search and we actually already have auto completion and spell checking scoped to one namespace. This is currently achieved by using index sharding, so each namespace has its own index and reader, and another for spell check and auto completion. Unfortunately there are some downsides to this:
- Our faceting engine has no good support for multiple indexes, so faceting only works on a single namespace
- Needs administration for mapping namespace identifier (String) to index number (integer)
- The number of shards (and thus namespaces) is currently hardcoded. At this moment it is set to 100, and this means Hibernate Search opens up 100 index readers/writers, while only n<100 are in use, and therefore:
  - Many open file descriptors
  - Hard limit on number of namespaces
Therefore it seems better to switch back to having a single index for all namespaces.
Thanks!

Regards,
Elmer van Chastelet


