Re: Spell check on a subset of an index ( 'namespace' aware spell checker)

E. van Chastelet Thu, 08 Dec 2011 03:38:59 -0800

Ian, thank you for your suggestions.

I have looked at TermEnum and TermDocs, but they don't offer a combination of terms and frequencies (used by our autocompleter class) from a filtered set of docs.


Eventually I implemented the following solution:
- In the source index, get all terms for namespace field
- For each namespace ns:
 * copy the source index to a new location
 * remove all documents that match (*:*) -(namespace:ns)
 * construct the spellcheck/autocompletion index from the copied index
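The steps above can be modelled in plain Java to make the data flow explicit. This is only an in-memory sketch, not Lucene code: the class `PerNamespaceDictionaries`, the `Doc` type and all method names are hypothetical, and a list of documents stands in for the source index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

final class PerNamespaceDictionaries {

    /** Hypothetical stand-in for an indexed document. */
    static final class Doc {
        final String namespace;
        final List<String> terms; // analyzed terms of the spellcheck field

        Doc(String namespace, List<String> terms) {
            this.namespace = namespace;
            this.terms = terms;
        }
    }

    /** Step 1: collect all distinct values of the namespace field. */
    static SortedSet<String> namespaces(List<Doc> source) {
        SortedSet<String> result = new TreeSet<>();
        for (Doc d : source) {
            result.add(d.namespace);
        }
        return result;
    }

    /** Steps 2-4: copy the source, drop every doc outside ns, collect the remaining terms. */
    static SortedSet<String> dictionaryFor(List<Doc> source, String ns) {
        List<Doc> copy = new ArrayList<>(source);      // "copy the source index"
        copy.removeIf(d -> !d.namespace.equals(ns));   // "(*:*) -(namespace:ns)"
        SortedSet<String> dictionary = new TreeSet<>();
        for (Doc d : copy) {
            dictionary.addAll(d.terms);
        }
        return dictionary;
    }

    /** Builds one dictionary per namespace; the source list itself is never modified. */
    static Map<String, SortedSet<String>> buildAll(List<Doc> source) {
        Map<String, SortedSet<String>> result = new TreeMap<>();
        for (String ns : namespaces(source)) {
            result.put(ns, dictionaryFor(source, ns));
        }
        return result;
    }
}
```

In the real setup each dictionary would feed a separate spellcheck/autocompletion index, and the cost of copying the full source once per namespace is the obvious drawback of this approach.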

I still need to look for other possibilities where I have 1 spellcheck and 1 autocompletion index for all namespaces, with support for namespace filtering. For the autocompleter this will become more difficult, because it should sort the completions on a frequency field that represents the frequency scoped to one namespace. But this has lower priority atm.

The main goal was to have spellchecking and autocompletion scoped to namespaces, where there is one source index containing all namespaces.
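To illustrate what a single autocompletion structure with namespace-scoped frequencies could look like, here is a rough plain-Java sketch. It is not Lucene code, and `NamespacedAutocompleter`, `add` and `complete` are made-up names. It assumes that namespace identifiers never contain the NUL character, which is used as a separator in the composite key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class NamespacedAutocompleter {
    // Assumption: namespace identifiers never contain NUL, so it is a safe separator.
    private static final char SEP = '\0';

    // One sorted structure for all namespaces: "namespace NUL term" -> frequency in that namespace.
    private final TreeMap<String, Integer> freq = new TreeMap<>();

    /** Records one occurrence of term inside the given namespace. */
    void add(String namespace, String term) {
        freq.merge(namespace + SEP + term, 1, Integer::sum);
    }

    /** Returns up to max completions of prefix, most frequent within the namespace first. */
    List<String> complete(String namespace, String prefix, int max) {
        String from = namespace + SEP + prefix;
        List<Map.Entry<String, Integer>> hits = new ArrayList<>();
        // Keys sharing the prefix "from" are contiguous in sorted order, starting at "from".
        for (Map.Entry<String, Integer> e : freq.tailMap(from, true).entrySet()) {
            if (!e.getKey().startsWith(from)) {
                break;
            }
            hits.add(e);
        }
        // Stable sort: entries with equal frequency keep their alphabetical order.
        hits.sort((x, y) -> Integer.compare(y.getValue(), x.getValue()));
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : hits) {
            if (out.size() == max) {
                break;
            }
            out.add(e.getKey().substring(namespace.length() + 1)); // strip "namespace NUL"
        }
        return out;
    }
}
```

The point of the composite key is that the frequency is stored per (namespace, term) pair, so sorting never mixes counts from different namespaces. An index-based version would need an equivalent per-namespace frequency field on each entry.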


Regards,
Elmer

On 12/06/2011 03:40 PM, Ian Lea wrote:

There are utilities floating around for getting output from analyzers
- would that help?  I think there are some in LIA, probably others
elsewhere.  The idea being that you grab the stored fields from the
index, pass them through your analyzer, grab the output and use that.

Or can you do something with TermEnum and/or TermDocs.  Not sure
exactly what or how though ...


--
Ian.

On Tue, Dec 6, 2011 at 2:20 PM, E. van Chastelet
<evanchaste...@gmail.com>  wrote:

I'm still struggling with this.

I've tried to implement the solution mentioned in previous reply, but
unfortunately there is a blocking issue with this:
I cannot find a way to create another index from the source index in a way
that the new index has the field values in it. The only way to copy
document's field values from one to another index is to have stored fields.
But stored fields hold "the original String in its entirety", and not the
analyzed String, which I need. Is there another way to copy documents with
(at least the spellcheck field) from the one to another index?

Recap:
I have a source index holding documents for different namespaces. These
documents hold one field (analyzed) that should be used for spell checking.
I want to construct a spellchecker index for each namespace separately. To
accomplish this, I first get the list of namespaces (each document has a
namespace field in the original index). Then, for each namespace, I get the
list of documents that match this namespace. Then I'd like to use this
subset to construct a spellchecker index.

Regards,
Elmer


On 11/23/2011 03:28 PM, E. van Chastelet wrote:

I currently have an idea to get it done, but it's not a nice solution.

If we have an index Q with all documents for all namespaces, we first
extract the list of all terms that appear for the field namespace in Q (this
field indicates the namespace of the document).

Then, for each namespace n in the terms list:
  - Get all docs from Q that match +namespace:n
  - Construct a temporary index from these docs
  - Use this temporary index to construct the dictionary, which the
SpellChecker can use as input.
  - Call indexDictionary on SpellChecker to create spellcheck index for
current namespace.
  - Delete temporary index

We now have separate spell check indexes for each namespace.

Any suggestions for a cleaner solution?

Regards,
Elmer van Chastelet



On 11/10/2011 01:16 PM, E. van Chastelet wrote:

Hi all,

In our project we would like to have the ability to get search results scoped
to one 'namespace' (as we call it). This can easily be achieved by using a
filter or just an additional must-clause.
For the spellchecker (and our autocompletion, which is a modified
spellchecker), the story seems different. The spell checker index is created
using a LuceneDictionary, which has an IndexReader as source. We would like
to get (spellcheck/autocomplete) suggestions that are scoped to one
namespace (i.e. field 'namespace' should have a particular value).
With a single source index containing docs for all namespaces, it seems
not possible to create a spellcheck index for each namespace the ordinary
way.
Q1: Is there a way to construct a LuceneDictionary from a subset of a
single source index (all terms where namespace = %value%) ?
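One conceivable way to think about Q1 is to intersect each term's posting list with the set of documents that belong to the namespace. The sketch below models that idea in plain Java only; it makes no claim about what LuceneDictionary supports. The class `FilteredDictionary`, the method `scopedDocFreqs`, the posting map and the `BitSet` of allowed document ids are all hypothetical stand-ins.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.TreeMap;

final class FilteredDictionary {

    /**
     * postings: term -> ids of the docs that contain it (stand-in for walking terms and their docs).
     * allowed:  bit set of doc ids that belong to the wanted namespace (stand-in for a filter result).
     * Returns term -> number of allowed docs containing it; terms without any allowed doc are dropped.
     */
    static Map<String, Integer> scopedDocFreqs(Map<String, int[]> postings, BitSet allowed) {
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, int[]> e : postings.entrySet()) {
            int count = 0;
            for (int docId : e.getValue()) {
                if (allowed.get(docId)) {
                    count++;
                }
            }
            if (count > 0) {
                result.put(e.getKey(), count);
            }
        }
        return result;
    }
}
```

The keys of the returned map would form the namespace-scoped dictionary, and the values give a document frequency restricted to that namespace, which is the number an autocompleter could sort on.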

Another, maybe better solution is to customize the spellchecker by adding
an additional namespace field to the spellchecker index. At query-time, an
additional must-clause is added, scoping the suggestions to one (or more)
namespace(s). The advantage of this is to have a singleton spellchecker (or
at least the index reader) for all namespaces. This also means less open
files by our application (imagine if there are over 1000 namespaces).
Q2: Will there be a significant penalty (say more than 50% slower) for
the additional must-clause at query time?

Q3: Or can you think of a better solution for this problem? :)

How we currently do it: we currently use Lucene 3.1 with Hibernate Search
and we actually already have auto completion and spell checking scoped to
one namespace. This is currently achieved by using index sharding, so each
namespace has its own index and reader, and another for spell check and auto
completion. Unfortunately there are some downsides to this:
- Our faceting engine has no good support for multiple indexes, so
faceting only works on a single namespace
- Needs administration for mapping namespace identifier (String) to index
number (integer)
- The number of shards (and thus namespaces) is currently hardcoded. At
this moment it is set to 100, and this means Hibernate Search opens up 100
index readers/writers, while only n<100 are in use, and therefore:
- Many open file descriptors
- Hard limit on number of namespaces

Therefore it seems better to switch back to having a single index for all
namespaces.

Thanks!

Regards,
Elmer van Chastelet

---------------------------------------------------------------------
To unsubscribe, e-mail:java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail:java-user-h...@lucene.apache.org
