I hope I got all the details right; if I didn't, please clarify.
Also, I haven't read the entire thread, so if someone already suggested this...
well, it probably means it's the right solution :)
It sounds like you could use Lucene's ParallelCompositeReader, which already handles multiple IndexReaders that are aligned by their internal document IDs. The way it would work, as far as I understand your scenario, is something like the following table (columns denote different indexes). Each index contains a subset of the relevant fields: 'common' contains the common fields, and each language index contains the respective language fields.
DocID         LuceneID   Common   English                   German                    ...
"FirstDoc"    0          A,B,C    EN_words, EN_sentences    DE_words, DE_sentences
"SecondDoc"   1          A,B,C
"ThirdDoc"    2          A,B,C
Each index can contain all relevant fields, or only a subset (e.g. maybe
not all documents have a value for the 'B' field in the 'common' index).
What's absolutely critical here, though, is that the indexes are created very carefully: if e.g. SecondDoc is not translated into German, *you must still add an empty document* to the German index, otherwise the document IDs will not align.
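(A minimal sketch of what building e.g. the German index could look like while keeping the doc IDs aligned -- Lucene 4.x style; GermanIndexBuilder, SourceDoc, allDocsInCommonIndexOrder and the field names are made-up placeholders for your own code:)

    import java.io.File;
    import java.io.IOException;
    import java.util.List;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    class GermanIndexBuilder {
      // Add one document to the German index for *every* document of the
      // 'common' index, in the same order, even when there is no translation.
      static void build(List<SourceDoc> allDocsInCommonIndexOrder) throws IOException {
        Directory germanDir = FSDirectory.open(new File("index-de"));
        IndexWriterConfig cfg =
            new IndexWriterConfig(Version.LUCENE_47, new StandardAnalyzer(Version.LUCENE_47));
        IndexWriter writer = new IndexWriter(germanDir, cfg);
        for (SourceDoc src : allDocsInCommonIndexOrder) {
          Document doc = new Document();
          if (src.hasGermanTranslation()) {
            doc.add(new TextField("DE", src.germanText(), Field.Store.NO));
            doc.add(new TextField("DE_Sentences", src.germanSentences(), Field.Store.NO));
          }
          // no German translation -> still add the (empty) document,
          // otherwise the doc IDs of all following documents would shift
          writer.addDocument(doc);
        }
        writer.close();
      }
    }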
That's exactly how I saw it and what I need to do. So, I'll have a very good look at ParallelCompositeReader.
Lucene does not offer a way to build those indexes though (patches welcome!!).
This answers my question 1. Thanks. :)
I had somehow hoped that there was already support for this kind of situation in Lucene, but oh well. Now at least I know that I won't find a ready-made solution to my problem in the Lucene classes, and that I will have to code one myself, taking inspiration from the Lucene classes that do similar processing.
We started some effort on this a very long time ago in LUCENE-1879 (there's a patch and a discussion of an alternative approach), and there is a very useful suggestion in ParallelCompositeReader's javadocs (use LogDocMergePolicy).
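(For reference, the javadoc hint boils down to configuring every parallel index's writer identically, along these lines -- a sketch, not the only possible settings:)

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.LogDocMergePolicy;
    import org.apache.lucene.util.Version;

    // When creating the IndexWriter of each parallel index: same
    // document-count based merge policy and same flush settings everywhere,
    // so segments are created and merged identically across the indexes.
    IndexWriterConfig cfg =
        new IndexWriterConfig(Version.LUCENE_47, new StandardAnalyzer(Version.LUCENE_47));
    cfg.setMergePolicy(new LogDocMergePolicy());      // merge by doc count, not byte size
    cfg.setMaxBufferedDocs(1000);                     // flush by doc count ...
    cfg.setRAMBufferSizeMB(IndexWriterConfig.DISABLE_AUTO_FLUSH);  // ... never by RAM usage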
Wow, priceless. This gives me some headstart and inspiration. :)
One challenge is how to support multi-threaded indexing, but perhaps this isn't a problem in your application? Since you write that a user will "download the german index", it sounds like the indexes are built offline?
Indeed. The index is built offline, in a single thread, and once it is built, it is read-only. Can't find an easier situation. :)
Another challenge is how to control segment merging, so that the *exact same segments* are merged over the parallel indexes. Again, if your application builds the indexes offline, then this should be easier to accomplish.
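(Since everything is built offline and then read-only, one hedged way to sidestep the merging problem entirely is to collapse each parallel index to a single segment at the end of its build -- 'writer' being the IndexWriter of one of the parallel indexes:)

    // After the offline build of each parallel index (common, EN, DE, ...):
    // one segment per index means all of them trivially end up with the same
    // reader structure, which is what ParallelCompositeReader expects.
    writer.forceMerge(1);   // expensive, but fine for a one-off offline build
    writer.close();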
I assume, though, that when you index e.g. the German documents, the already indexed 'common' fields do not change for a document. If they do, you will need to rebuild the 'common' index too.
Once you achieve a correct parallel index, it is very easy to open a ParallelCompositeReader on any subset of the indexes, e.g. Common+English, Common+German, or Common+English+German, and search it, since the internal document IDs are perfectly aligned.
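(A minimal search-side sketch of that, assuming one directory per index, identically structured indexes, and a Lucene 4.x release; the directory names are made up:)

    import java.io.File;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.ParallelCompositeReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;

    // Open Common+German as if they were one index; because the doc IDs line
    // up, the fields of both indexes belong to the same logical document.
    // Requires the two indexes to have the same segment/doc structure.
    ParallelCompositeReader reader = new ParallelCompositeReader(
        DirectoryReader.open(FSDirectory.open(new File("index-common"))),
        DirectoryReader.open(FSDirectory.open(new File("index-de"))));
    IndexSearcher searcher = new IndexSearcher(reader);
    TopDocs hits = searcher.search(new TermQuery(new Term("DE", "schlafen")), 10);
    reader.close();   // also closes the sub-readers by default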
Shai
Many thanks for the awesome answer and the help (I love you).
As I really, really, really need this to happen, I'm going to start working on it really soon.
I'm definitely not an expert on threads, filesystems, and Lucene inner workings, so I can't promise to contribute a miraculous patch, especially since I won't work on the multi-thread aspect of the problem. But I'll do the best I can and contribute back whatever code I can produce.
Many thanks, again. :)
On Wed, Apr 30, 2014 at 7:07 AM, Jose Carlos Canova <jose.carlos.can...@gmail.com> wrote:
My suggestion is not to worry about the docId. In practice it is an "internal Lucene" id, quite similar to a rowId in a database, and each index may generate a different docId for a translated document (that is the index's own business). You can use your own ID that relates one document to another across the different indexes, mainly because, as you mention, these are translated documents, which in theory can be ranked differently from language to language (it is not guaranteed that a set of documents in different languages spans the same rank order, but I am not 100% sure about this).
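(A small sketch of that suggestion -- storing your own stable key on every document, in every index, and relating documents through it; "myId" and the surrounding writer/searcher variables are made up:)

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;

    // When indexing, in every index, store the same application-level key:
    Document doc = new Document();
    doc.add(new StringField("myId", "FirstDoc", Field.Store.YES));  // stable, not analyzed
    doc.add(new TextField("DE", germanText, Field.Store.NO));
    writer.addDocument(doc);

    // Later, in any other index, find the related document via the same key:
    TopDocs hit = searcher.search(new TermQuery(new Term("myId", "FirstDoc")), 1);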
The second reason is that the internal structure of Lucene may change without warning, and then you lose forward compatibility. I am not a Lucene expert like Schindler, but from reading the documentation I understood that they pay special attention to "internal Lucene" and "experimental Lucene" APIs, which means that internal is "not guaranteed to stay compatible" and experimental "may be removed".
For example, if they (Apache Lucene) discover a more efficient "new manner" of relating documents and change some mechanism, while your application uses an internal mechanism that is highly coupled to Lucene version xxx (marked as "internal"), you can get stuck on a specific version and have to rewrite some code in the future. This might cause some "management conflict" if your project follows continuous integration and you are subordinate to a management structure (bad for you).
I have seen this in several projects that use Lucene: they do not upgrade their Lucene components in their new releases. One example, if I am not wrong, still uses Lucene 3, and there is another that I saw around (e.g. Luke), which suggests that "the project was abandoned because the way it integrated with Lucene was not fully functional".
Another interesting thing is that developing against Lucene's public API is more effective: you guarantee that your product works, and they guarantee that Lucene works too. This is related to design by contract.
Regards.
On Tue, Apr 29, 2014 at 7:11 PM, Olivier Binda <olivier.bi...@wanadoo.fr> wrote:
Hello.
Sorry to bring this up again. I don't want to be rude and I mean no disrespect, but after thinking it through today, I really need (and would really love) to have the answer to the following question:
1) At Lucene indexing time, is it possible to rewrite a read-only index so that some fields are only found in some segments (and how)?
Uwe Schindler suggested using different indexes and a MultiReader for my needs, and it probably answers my second question, better formulated as "Is it possible to restrict an index to some of its segments?", since a CompositeReader with AtomicReaders (or a custom Directory) that reads the aforementioned segments might do the trick.
Yet, if I am not mistaken (please tell me if I am wrong), it doesn't solve my needs, as I have around 300000 documents of the following kind:
READ-ONLY Document:
// common fields shipped with the App that aren't language related
A:
B:
C:
// fields shipped with the English package (a zip)
EN:
EN_Words:
EN_Sentences:
some DocValues
// fields shipped with the German package (a zip)
DE:
DE_Words:
DE_Sentences:
some DocValues
...
There might be hundreds of language packages that my users might use.
If I use different indexes:
indexA for the common stuff,
indexEN for the English package,
indexDE for the German package,
For sure, I will be able to make a big index out of those by using a MultiReader, BUT it really makes a union of the three indexes (right?), which means I'll have 900000 documents, and the documents in indexA won't have any relation to the documents in indexEN (right?), except if I give each document an id in each index and make a join at query time, which is a big no-no, because I use a QueryParser and users may enter queries like "A:gah AND (DE:schlaffen OR EN:sleep)".
Or am I mistaken, and there is a way to create a document in three different indexes that stays related through the same docId?
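(For concreteness, this is the MultiReader behaviour I mean -- it concatenates its sub-readers, so the doc counts add up and the doc IDs are simply offset; commonDir/englishDir/germanDir stand for the three index directories:)

    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.MultiReader;

    // MultiReader appends the readers one after another: with ~300000 docs
    // per index this really does give ~900000 logical documents, and a
    // document in indexA has no built-in relation to its translation in
    // indexEN or indexDE.
    MultiReader union = new MultiReader(
        DirectoryReader.open(commonDir),    // doc IDs 0 .. 299999
        DirectoryReader.open(englishDir),   // doc IDs 300000 .. 599999
        DirectoryReader.open(germanDir));   // doc IDs 600000 .. 899999
    System.out.println(union.maxDoc());     // sum of the three maxDocs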
My solution, if question 1 is possible:
In contrast, if I am able to build my index so that my READ-ONLY documents are stored in
SEGMENT 1
// common fields shipped with the App that aren't language related
A:
B:
C:
SEGMENT 2
// fields shipped with the English package (a zip)
EN:
EN_Words:
EN_Sentences:
some DocValues
SEGMENT 3
// fields shipped with the German package (a zip)
DE:
DE_Words:
DE_Sentences:
some DocValues
I only need to ship SEGMENT 1 with the App and let users download SEGMENT 2 or SEGMENT 3, depending on whether they want English or German, and use a composite reader with atomic readers (right?) to query my frankenstein index with a QueryParser.
Also, in case question 1 is possible, I would really like to know whether it is possible to remap docIds at build time in a read-only index.
An application of this would be:
At day 1, I ship my app with two language packages, English and German (documents are uniquely identified by a docId... or by an external id, thanks to a docId <-> external id map).
At day 2, I ship an additional language package (French), because I'm able to build an index with English, German and French with the same exact docIds for each document as the index shipped at day 1.
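(One hedged way to get the "same exact docIds" across the day-1 and day-2 builds would be to keep the build single-threaded and deletion-free, and to always add the documents in the same deterministic order -- then the internal doc ID is just the insertion order. A sketch, with made-up names sources, byExternalId and toLuceneDoc:)

    import java.util.Collections;
    import java.util.List;

    List<SourceDoc> sources = loadSources();       // hypothetical source data
    Collections.sort(sources, byExternalId);       // a fixed, reproducible order
    for (SourceDoc src : sources) {
      writer.addDocument(toLuceneDoc(src));        // doc ID == position in this loop
    }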
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org