Re: Index merging and optimizing

Erick Erickson Mon, 14 Jan 2008 08:19:27 -0800

OK, I think I'm getting a better handle here. I can't imagine how
it would work to combine indexes that use *different* analyzers
on the *same* field. Regardless of what Lucene did, you
simply could NOT explain the results to a user. To take a simple example,
index part of your data for field1 with KeywordAnalyzer and
part with WhitespaceAnalyzer. Now index the same data in two
different documents, one with each analyzer, i.e. field1 has
"some data" in both docs.

Now doc 1 has the tokens "some" and "data", and doc 2 has the
single token "some data". Searching for "some" would return doc 1.
Searching for "some data" would return doc 2 if the query-time
analyzer keeps the phrase as one token, or doc 1 if it splits the
query into "some" and "data". And no matter what analyzer you
used, searching for "some" would never return doc 2.
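
To make that concrete, here is a minimal sketch in plain Java. It does
not use Lucene at all: keywordTokens and whitespaceTokens are hypothetical
stand-ins for what KeywordAnalyzer and a whitespace analyzer roughly do,
and the "index" is just a map from term to doc ids for field1.

```java
import java.util.*;

// Plain-Java stand-ins for the two analyzers; these are NOT Lucene classes.
class MixedAnalyzerDemo {
    // Keyword-style: the whole field value becomes a single token.
    static List<String> keywordTokens(String text) {
        return Collections.singletonList(text);
    }

    // Whitespace-style: split the value on runs of whitespace.
    static List<String> whitespaceTokens(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    // Record every token of one document in the toy inverted index.
    static void addDoc(Map<String, Set<Integer>> index, int docId, List<String> tokens) {
        for (String token : tokens) {
            index.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
        }
    }

    // Same field value, two documents, two different index-time tokenizers.
    static Map<String, Set<Integer>> buildIndex() {
        Map<String, Set<Integer>> index = new HashMap<>();
        addDoc(index, 1, whitespaceTokens("some data")); // doc 1: "some", "data"
        addDoc(index, 2, keywordTokens("some data"));    // doc 2: "some data"
        return index;
    }

    // Exact term lookup; an empty set means the term was never indexed.
    static Set<Integer> search(Map<String, Set<Integer>> index, String term) {
        return index.getOrDefault(term, Collections.emptySet());
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> index = buildIndex();
        System.out.println("some      -> " + search(index, "some"));      // [1]
        System.out.println("some data -> " + search(index, "some data")); // [2]
    }
}
```

Two documents with identical content never come back for the same
single-term query, which is exactly the behavior you'd have to explain.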

Even assuming Lucene allows this,
how in the world could you explain what the "correct" results
were to a user? Even if Lucene does something reasonable,
it'll still be wrong often enough to make the
system unusable.

And we're not even into stopwords, case, different languages,
or accent folding.

Not to mention how in the world you'd be able to explain what
query analyzer to use <G>....

That said, you *can* use *different* analyzers on *different* fields
in the same index. See PerFieldAnalyzerWrapper. And there's
no restriction that all documents have the *same* fields. So as
long as the field names in the indexes you merge are disjoint, it
could work. *But* I have no real idea whether this is supported, so
you'll have to experiment if you still think it's a good thing.
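
The per-field idea, sketched in plain Java. This is a toy model of the
dispatch that PerFieldAnalyzerWrapper provides, not its actual API; the
class, the field name "id", and the tokenizer functions are all
hypothetical.

```java
import java.util.*;
import java.util.function.Function;

// Toy model of per-field analysis; NOT Lucene's API.
class PerFieldTokenizerDemo {
    private final Function<String, List<String>> defaultTokenizer;
    private final Map<String, Function<String, List<String>>> byField = new HashMap<>();

    PerFieldTokenizerDemo(Function<String, List<String>> defaultTokenizer) {
        this.defaultTokenizer = defaultTokenizer;
    }

    // Override the tokenizer for one named field.
    void register(String field, Function<String, List<String>> tokenizer) {
        byField.put(field, tokenizer);
    }

    // Use the field's own tokenizer if one is registered, else the default.
    List<String> tokenize(String field, String text) {
        return byField.getOrDefault(field, defaultTokenizer).apply(text);
    }

    // Hypothetical setup: "id" is kept whole, every other field is split
    // on whitespace.
    static PerFieldTokenizerDemo example() {
        PerFieldTokenizerDemo demo = new PerFieldTokenizerDemo(
                text -> Arrays.asList(text.trim().split("\\s+")));
        demo.register("id", text -> Collections.singletonList(text));
        return demo;
    }

    public static void main(String[] args) {
        PerFieldTokenizerDemo demo = example();
        System.out.println(demo.tokenize("id", "some data"));   // [some data]
        System.out.println(demo.tokenize("body", "some data")); // [some, data]
    }
}
```

The point is that each field is always analyzed the same way, at index
time and at query time, so the results stay explainable.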

But it also seems that the parallel/not parallel decision is
something you control on the back end, so I'm not sure the user
is involved in the merge question at all. In other words, you could
easily split the indexing task up amongst several machines and/or
processes and combine all the results after all the sub-indexes
were built, thus making your question basically irrelevant.
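
A rough sketch of that back-end flow, again in plain Java rather than
Lucene's merge API. It assumes each sub-index carries a label naming the
analyzer that built it (hypothetical metadata, nothing Lucene stores for
you), and it refuses to merge when the labels differ, which is the rule
you proposed documenting.

```java
import java.util.*;

// Toy sub-index merge; NOT Lucene's API.
class SubIndexMergeDemo {
    static final class SubIndex {
        final String analyzerName;                 // hypothetical label
        final int docCount;                        // docs are numbered 0..docCount-1
        final Map<String, List<Integer>> postings; // term -> local doc ids

        SubIndex(String analyzerName, int docCount, Map<String, List<Integer>> postings) {
            this.analyzerName = analyzerName;
            this.docCount = docCount;
            this.postings = postings;
        }
    }

    // Concatenate sub-indexes, shifting each one's doc ids past the docs
    // already merged. Throws if the analyzer labels do not all match.
    static SubIndex merge(List<SubIndex> parts) {
        if (parts.isEmpty()) {
            throw new IllegalArgumentException("nothing to merge");
        }
        String analyzer = parts.get(0).analyzerName;
        Map<String, List<Integer>> merged = new TreeMap<>();
        int offset = 0;
        for (SubIndex part : parts) {
            if (!part.analyzerName.equals(analyzer)) {
                throw new IllegalArgumentException(
                        "analyzer mismatch: " + analyzer + " vs " + part.analyzerName);
            }
            for (Map.Entry<String, List<Integer>> entry : part.postings.entrySet()) {
                List<Integer> target =
                        merged.computeIfAbsent(entry.getKey(), k -> new ArrayList<>());
                for (int localId : entry.getValue()) {
                    target.add(localId + offset);
                }
            }
            offset += part.docCount;
        }
        return new SubIndex(analyzer, offset, merged);
    }
}
```

Since you build the sub-indexes yourself, you can enforce the check
before anything reaches the search server, and the user never sees it.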

But you still haven't explained what the user is getting from all
this flexibility. I have a hard time understanding the use-case
you're trying to support. If you're trying to build a generic front-end
to allow parameterized Lucene index building, have you looked at
SOLR, which uses XML configuration files to drive the indexing
and searching process? (I haven't used it, but I follow its
users' list.)

Best
Erick

On Jan 14, 2008 10:35 AM, <[EMAIL PROTECTED]> wrote:

> > Then why would you want to combine them?
> >
> > I really think you need to explain what you're trying to accomplish
> > rather than obsess over the details.
>
> I have to create indexes in parallel because the amount of data is very
> large.
> Then I want to merge them into bigger indexes and move them to the search
> server.
>
> (See thread "Max size of index (FSDirectory )" too.)
>
> My question now is, what will happen if one merges indexes which were
> created with different analyzers (the customer can configure the analyzer
> depending on the data which is indexed)?
>
> I think this will produce unpredictable results when searching.
> If so I have to document that only indexes created with the same analyzer
> may be merged.
>
> Thank you
