Re: several existential issues about Lucene's filesystem

Samuel LEMOINE Thu, 28 Jun 2007 02:30:16 -0700

Grant Ingersoll wrote:

On Jun 27, 2007, at 8:51 AM, Samuel LEMOINE wrote:
Hi everyone!
I'm doing bibliographic research on Lucene as an intern at Lingway (which uses Lucene in its main product), and I'm currently studying Lucene's file system. There are several things I don't understand about it, and I thought this was the right place to ask (I hope that's actually the case).
The main resource I used is this document:
http://lucene.apache.org/java/2_1_0/fileformats.html
- In the .tvf file (Term Vector file) in Lucene 2.2.0, positions and offsets can optionally be stored in the term vector. I don't understand how this works, since there is only one .tvf file per segment (as far as I understand), and the format described gives no information about which documents each term stored in the term vector appears in (I assume the document-related information is in the .tvd file). The position/offset information seems to be simply a list of addresses, but how is the document it refers to known? Or is there one .tvf file per document?
Yes, offsets and positions can be associated with a term vector. When you ask the IndexReader for a term vector, you give it the document number and, optionally, a field, which it uses to look up the document's location in the tvd file. The tvd file then points to the specific information in the tvf file. Have a look at the TermVectorsReader for details on the implementation.
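To make the lookup path concrete, here is a toy in-memory model in Python. It is only a sketch under stated assumptions: the names `tvd`, `tvf`, and `get_term_vector` and the dictionary layout are illustrative and do not reproduce Lucene's on-disk format or API. The point it shows is that a term vector is reached through the document number, so the positions it holds already belong to one document.

```python
# Toy model, NOT Lucene's real file layout: all names here are hypothetical.

# "tvf": one entry per (document, field) term vector, holding term -> positions.
tvf = [
    {"foo": [1, 30, 65], "bar": [2]},  # entry 0: doc 0, field "body"
    {"foo": [27, 52]},                 # entry 1: doc 2, field "body"
]

# "tvd": for each document number, which tvf entry holds each field's vector.
tvd = {
    0: {"body": 0},
    1: {},            # doc 1 stores no term vectors
    2: {"body": 1},
}


def get_term_vector(doc_number, field):
    """Follow doc number -> tvd entry -> tvf entry, or return None."""
    pointer = tvd[doc_number].get(field)
    if pointer is None:
        return None
    return tvf[pointer]
```

For example, `get_term_vector(2, "body")` yields only the positions recorded for document 2, with no need to separate them from those of document 0.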
- In the .prx file (positions file), payloads are mentioned as a way to attach metadata. What is the purpose of such data? Is there a specific use, or is it only data for the user's own use?
Payloads have a variety of uses. Search the java-dev archive for the word Payload and you will find lots of discussion. I also have a few slides on it in my ApacheCon Europe presentation at http://cnlp.org/presentations/slides/AdvancedLuceneEU.pdf See also http://wiki.apache.org/jakarta-lucene/Payload_Planning
Essentially, it can be used to store information on a term-by-term level, things like font weight, or XML enclosing tag, or Part of Speech. The sky really is the limit (that and your disk space) on what can be stored in a payload.
- Many addresses in many files are stored as deltas. Doesn't that slow down searching the index? I mean, when a keyword is looked up, in order to find its position in the right file, Lucene must find the address of the previous term and add the "delta" address. But the previous term's address is also given as a delta, and so on, so that as far as I understand it, the whole file must be walked back, recursively finding the address of each term. I assume I've misunderstood something, but I don't know what.
Not quite sure what you are asking, but I will take a stab at it. Have a look at the section on the Term Dictionary, specifically the relationship between the tis file and the tii file. The storage mechanism makes it very easy to find where the keyword is in the file so that the rest of the information can be easily looked up.
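The tis/tii relationship can be sketched roughly as follows. This is a simplified Python model, not Lucene's implementation: the function names, the tuple layout, and the interval of 4 are assumptions chosen for the demo (Lucene's own interval is larger). The idea is that a sparse index keeps every Nth term with an absolute pointer, so a lookup decodes at most N deltas instead of walking the whole file.

```python
import bisect

INDEX_INTERVAL = 4  # demo value only; an assumption, not Lucene's default


def build(terms_with_ptrs):
    """Build a delta-coded term list ("tis") and a sparse index ("tii").

    terms_with_ptrs: list of (term, absolute_pointer), sorted by term.
    """
    tis = []  # (term, pointer delta relative to the previous term)
    tii = []  # (term, absolute pointer, position of that term in tis)
    prev = 0
    for i, (term, ptr) in enumerate(terms_with_ptrs):
        if i % INDEX_INTERVAL == 0:
            tii.append((term, ptr, i))
        tis.append((term, ptr - prev))
        prev = ptr
    return tis, tii


def lookup(tis, tii, term):
    """Return the absolute pointer for term, or None if it is absent."""
    index_terms = [t for t, _, _ in tii]
    k = bisect.bisect_right(index_terms, term) - 1  # last index term <= term
    if k < 0:
        return None
    _, ptr, start = tii[k]
    # Scan at most INDEX_INTERVAL entries, summing deltas after the first.
    for i in range(start, min(start + INDEX_INTERVAL, len(tis))):
        t, delta = tis[i]
        if i > start:
            ptr += delta
        if t == term:
            return ptr
    return None
```

The binary search over the small index picks the right block, and only the deltas inside that block are added up.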
HTH,
Grant


--------------------------
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org/tech/lucene.asp

Read the Lucene Java FAQ at http://wiki.apache.org/lucene-java/LuceneFAQ



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Thanks for the resources about payloads, I'll have a look at them.

About the positions/offsets in .tvf, please tell me if I've understood correctly: the .tvd file provides the needed information about the occurrences of each term in documents, and thanks to this information, Lucene is able to determine how many documents contain the term "foo". So the position/offset data in .tvf can simply consist of a concatenated list of positions in the different documents containing "foo"? I mean, if foo appears at positions 1, 30, 65 in doc 0 and at positions 27 and 52 in doc 2, the .tvf will give "1 30 65 27 52" and Lucene relies on .tvd to determine which positions belong to which document? (Or rather "1 29 35 27 25", since these are delta-positions.)
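The delta arithmetic in that last parenthesis can be checked with a short Python sketch. It assumes, as the example above does, that the deltas restart for each document; the helper names are hypothetical and nothing here reads an actual Lucene file.

```python
def delta_encode(positions):
    """Turn ascending absolute positions into gaps from the previous one."""
    prev = 0
    deltas = []
    for p in positions:
        deltas.append(p - prev)
        prev = p
    return deltas


def delta_decode(deltas):
    """Rebuild absolute positions by keeping a running sum of the gaps."""
    total = 0
    positions = []
    for d in deltas:
        total += d
        positions.append(total)
    return positions


doc0 = delta_encode([1, 30, 65])  # positions of "foo" in doc 0
doc2 = delta_encode([27, 52])     # positions of "foo" in doc 2
```

Decoding is a running sum over one document's short list, so it never has to reach back beyond the start of that list.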


Hoping my questions will help other people ^^

Thanks,

Samuel
