Similarity class and searchPayloads

2011-06-08 Thread Alex vB
Hello everybody,

I am just curious about the following case.
Currently, I create a boolean AND query which loads payloads.
In some cases Lucene loads payloads but does not return any hits.

Therefore, I assume that payloads are loaded directly with each doc ID from
the posting list, before the boolean filter is applied. Is that right?
Is it possible to filter the documents first and only then load the payloads?
For example, with three terms I would check in every posting list whether the
current doc ID is present, and only then load the payload.

Or can anybody tell me where exactly in the code Lucene loads payloads?
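
To illustrate what I mean, here is a rough sketch (not the actual Lucene code
path) of filtering first and reading payloads only for documents in the
intersection, written against the flex API used elsewhere in this thread;
field and term names are placeholders:

import org.apache.lucene.index.DocsAndPositionsEnum;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.util.BytesRef;

public class IntersectThenPayload {
    // Leapfrog two postings lists on doc IDs; touch payloads only for matches.
    public static void run(IndexReader ir, String field, String termA, String termB)
            throws java.io.IOException {
        DocsAndPositionsEnum a = MultiFields.getTermPositionsEnum(
                ir, MultiFields.getDeletedDocs(ir), field, new BytesRef(termA));
        DocsAndPositionsEnum b = MultiFields.getTermPositionsEnum(
                ir, MultiFields.getDeletedDocs(ir), field, new BytesRef(termB));
        if (a == null || b == null) return;                // one of the terms is missing
        int docA = a.nextDoc();
        while (docA != DocsAndPositionsEnum.NO_MORE_DOCS) {
            int docB = b.advance(docA);                    // skip forward on the other list
            if (docB == DocsAndPositionsEnum.NO_MORE_DOCS) break;
            if (docB == docA) {                            // doc is in the intersection
                a.nextPosition();                          // payloads are per position
                if (a.hasPayload()) {
                    BytesRef payload = a.getPayload();     // fetched only for matching docs
                    // ... use payload for scoring ...
                }
                docA = a.nextDoc();
            } else {
                docA = a.advance(docB);                    // catch up to the other list
            }
        }
    }
}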

Regards
Alex




Lucene query processing

2011-04-26 Thread Alex vB
Hello everybody,

As far as I know, Lucene processes documents document-at-a-time (DAAT). Depending on
the query, either the intersection or the union of the posting lists is calculated.
For the intersection, only documents occurring in all posting lists are scored. In the
union case every document is scored, which makes it the more expensive operation.

Lucene stores its index in several files. Depending on the query, different
files might be accessed for scoring. For example, a payload query needs to
read payloads from .pos.

What is not clear to me is how term frequencies and payloads are processed.
Assuming I want term frequencies stored, I need to set
setOmitTermFreqAndPositions(false) (see the sketch below).
1) Which queries include term frequencies? I assume all queries, if term
frequencies are stored?
2) Why is fetching payloads so much more expensive than getting term
frequencies? Both are stored in separate files and therefore demand a disk
seek.
3) What value does tf contain if I set setOmitTermFreqAndPositions(true)?
Always 1?
4) How are term freqs and payloads read from disk? In bulk for all remaining
docs at once, or every time a document gets scored?
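
For reference, a minimal sketch of the field setup I mean (Lucene 3.x-style
Field API; field names and text are placeholders):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class FieldSetupSketch {
    public static Document build() {
        Document doc = new Document();
        String bodyText = "some article text";   // placeholder content
        String tagText  = "tag1 tag2";           // placeholder content

        // default: term frequencies and positions are indexed for this field
        Field body = new Field("body", bodyText, Field.Store.NO, Field.Index.ANALYZED);

        // doc IDs only: no tf (treated as 1), no positions and therefore no payloads
        Field tags = new Field("tags", tagText, Field.Store.NO, Field.Index.ANALYZED);
        tags.setOmitTermFreqAndPositions(true);

        doc.add(body);
        doc.add(tags);
        return doc;
    }
}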

Regards
Alex






Re: New codecs keep Freq skip/omit Pos

2011-04-23 Thread Alex vB
> it depends upon the type of query.. what queries are you using for
> this benchmarking and how are you benchmarking?
> FYI: for benchmarking standard query types with wikipedia you might be
> interested in http://code.google.com/a/apache-extras.org/p/luceneutil/

I have 10,000 queries from an AOL data set where the followed link led to
Wikipedia. I benchmark by warming up the IndexSearcher with 5000 of them and
perform the test with the remaining 5000 queries. I just measure the time
needed to execute the queries. I use QueryParser.

> wait, you are indexing payloads for your tests with these other codecs
> when it says "W POS" ?

No, only my last implementation uses payloads; all the others do not. Therefore I
use a payload-aware query for Huffman.

> keep in mind that even adding a single payload to your index slows
> down the decompression of the positions tremendously, because payload
> lengths are intertwined with the positions. For block codecs payloads
> really need to be done differently so that blocks of positions are
> really just blocks of positions. This hasn't yet been fixed for the
> sep nor the fixed layouts, so if you add any payloads, and then
> benchmark positional queries then the results are not realistic.

Oh, I know that payloads slow down query processing, but I wasn't aware of the
block codec problem. I assume that by "not realistic" you mean they will be
slower? Some numbers for Huffman:

segments.gen     20 bytes
fdt           234.6 KB
fdx             1.8 MB
fnm            20 bytes
pos           626.1 MB
pyl             1.7 GB
skp            17.8 MB
tib            39.8 MB
tiv          2028.5 KB
segments_2    268 bytes
doc           214.6 MB

For query processing here I used a PayloadQueryParser and adapted the similarity
according to my payloads.

> No they do not, only if you use a payload based query such as
> PayloadTermQuery. Normal non-positional queries like TermQuery and
> even normal positional queries like PhraseQuery don't fetch payloads
> at all...

Sorry, my question was misleading. I was already focused on a payload-aware
query. When I use one, how exactly is the payload information fetched from
disk? For example, if a query needs to read two posting lists, are all
payloads for them fetched directly, or does Lucene first compute the boolean
intersection and then retrieve the payloads only for documents within that
intersection?
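
For context, this is the kind of payload-aware conjunction I mean; a sketch
using the stock payload queries (PayloadTermQuery/MaxPayloadFunction). Whether
the payloads are decoded before or after the doc-level intersection is exactly
my question, so this only shows the query shape:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.payloads.MaxPayloadFunction;
import org.apache.lucene.search.payloads.PayloadTermQuery;

public class PayloadConjunctionSketch {
    // Both clauses must match (boolean AND); each clause exposes its term's
    // payloads to Similarity.scorePayload via the payload function.
    public static Query build(String field, String termA, String termB) {
        BooleanQuery bq = new BooleanQuery();
        bq.add(new PayloadTermQuery(new Term(field, termA), new MaxPayloadFunction()),
               BooleanClause.Occur.MUST);
        bq.add(new PayloadTermQuery(new Term(field, termB), new MaxPayloadFunction()),
               BooleanClause.Occur.MUST);
        return bq;
    }
}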

> From the description of what you are doing I don't understand how
> payloads fit in because they are per-position? But, I haven't had the
> time to digest the paper you sent yet.

I will try to summarize it and how I adapted it to Lucene. 

I already mentioned the idea of two levels for versioned document
collections. When I parse Wikipedia, I unite all terms of all versions of an
article. From this word bag I extract each distinct term and index it with
Lucene into one document. Frequency information is now "lost" on the first
level but will be stored on the second. This is what I meant by "the first
level contains a posting for a document when a term occurs in at least one
version". For example, if an article has two versions like version1: "a b b"
and version2: "a a a c c", only 'a', 'b' and 'c' are indexed.

For the second level I collected term frequency information during my parsing
step. Those frequencies are stored as a vector in version order; for the above
example the frequency vector for 'a' would be [1,3]. I store these vectors as
payloads, which I see as the "second level". Every distinct term on the first
level receives a single frequency vector on its first position. So I somewhat
abuse payloads.

For query processing I now need to retrieve the docs and the payloads. It
would be optimal to process the posting lists first, ignoring payloads, and
then fetch the payloads (frequency information) only for the remaining docs.
The term frequency is then used for ranking purposes. At the moment I pick the
highest value from the freq vector for ranking, which corresponds to the best
matching version.
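
A small self-contained sketch of such a frequency vector payload (a
hypothetical fixed-width layout, one int per version, purely for illustration;
my real encoding is more compact), including the "take the maximum" ranking
step:

import java.nio.ByteBuffer;

public class FreqVectorPayload {

    // Encodes the per-term frequency vector, one int per version, in version order.
    static byte[] encode(int[] freqsPerVersion) {
        ByteBuffer buf = ByteBuffer.allocate(4 * freqsPerVersion.length);
        for (int f : freqsPerVersion) {
            buf.putInt(f);                   // fixed width; a VInt/Huffman code would be smaller
        }
        return buf.array();
    }

    // Decodes the payload and returns the highest tf over all versions.
    static int maxFreq(byte[] payload) {
        ByteBuffer buf = ByteBuffer.wrap(payload);
        int max = 0;
        while (buf.remaining() >= 4) {
            max = Math.max(max, buf.getInt());
        }
        return max;
    }

    public static void main(String[] args) {
        // version1: "a b b", version2: "a a a c c"  ->  frequency vector for 'a' is [1, 3]
        byte[] payload = encode(new int[] {1, 3});
        System.out.println(maxFreq(payload));   // prints 3
    }
}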

Regards
Alex







Re: New codecs keep Freq skip/omit Pos

2011-04-23 Thread Alex vB
Hi Robert,


the adapted codec is running, but it seems to be incredibly slow. It will take
some time ;)
Here are some performance results:

Indexing scheme                | Index Size                 | Avg. Query Performance | Max. Query Performance
-------------------------------+----------------------------+------------------------+-----------------------
PforDelta2 W Freq W Pos        | 20.6 GB (3.3 GB w/o .pos)  | 81.97 ms               | 1295 ms
PforDelta2 W/O Freq W/O Pos    | 1.6 GB                     | 63.33 ms               | 766 ms
Standard 4.0 W Freq W Pos      | 28.1 GB (8.1 GB w/o .prx)  | 77.71 ms               | 978 ms
Standard 4.0 W/O Freq W/O Pos  | 6.2 GB                     | 59.93 ms               | 718 ms
Standard 3.0 W Freq W Pos      | 28.1 GB (8.1 GB w/o .prx)  | 71.41 ms               | 978 ms
Standard 3.0 W/O Freq W/O Pos  | 6.2 GB                     | 72.72 ms               | 845 ms
PforDelta W Freq W Pos         | 22 GB (5 GB w/o .pos)      | 67.98 ms               | 783 ms
PforDelta W/O Freq W/O Pos     | 3.1 GB                     | 56.08 ms               | 596 ms
Huffman BL10 W Freq W/O Pos    | 2.6 GB                     | 216.29 ms (Mem 14 ms)  | 1338 ms
 
 
 
I am a little bit curious about the Lucene 3.0 performance results, because
the larger index seems to be faster?!? I have already run the test several
times. Are my results realistic at all? I thought PForDelta/2 would outperform
the standard index implementations in query processing.


The last result is my own implementation. I am still trying to get it smaller
because I think I can improve the compression further. For indexing I use
PForDelta2 in combination with payloads; those are causing the higher
runtimes. In memory it looks nice. The gap between my solution and PForDelta
is already 700 MB, so I would say it is an improvement. :D I will have a look
at it again once I have built an index with your adapted implementation.


I still have another question. The basic idea of my implementation is to
create a "two-level" index structure, specialized for versioned document
collections. On the first level I create a posting list entry for a document
whenever a term occurs in one or more of its versions. The second level holds
the corresponding term frequency information. Is it possible to build such a
structure by creating a codec? For query processing it should filter via the
boolean query on the first level and only fetch information from the second
level when the document is in the intersection of the first level. At the
moment I use payloads to "simulate" a two-level structure. Normally all
payloads corresponding to a query get fetched, right?


If such a structure is possible, there are several more implementations with
promising results (Two-Level Diff/MSA in this paper:
http://cis.poly.edu/suel/papers/version.pdf).

Regards Alex




Re: New codecs keep Freq skip/omit Pos

2011-04-22 Thread Alex vB
Wow, cool!

I will give that a try!

Thank you!!

Alex




Re: New codecs keep Freq skip/omit Pos

2011-04-22 Thread Alex vB
I also indexed once with Lucene 3.0. Are those sizes really exactly the
same?

Standard 4.0 W Freq W Pos       28.1 GB
Standard 4.0 W/O Freq W/O Pos    6.2 GB
Standard 3.0 W Freq W Pos       28.1 GB
Standard 3.0 W/O Freq W/O Pos    6.2 GB

Regards
Alex





Re: New codecs keep Freq skip/omit Pos

2011-04-22 Thread Alex vB
Hello Robert,

thank you for the answers! :)
I actually used PatchedFrameOfRef and PatchedFrameOfRef2, so both
implementations are PForDelta variants! Sorry, my mistake.

PatchedFrameOfRef2: PforDelta W/O Freq W/O Pos   1.6 GB 
PatchedFrameOfRef :  Pfor W/O Freq W/O Pos  3.1 GB 

Here are some numbers:
PatchedFrameOfRef2 w/o POS w/o FREQ
segments.gen    20 bytes
_43.fdt        8.1 MB
_43.fdx       64.4 MB
_43.fnm       20 bytes
_43_0.skp    182.6 MB
_43_0.tib     32.3 MB
_43_0.tiv      1.0 MB
segments_2   268 bytes
_43_0.doc      1.3 GB

PatchedFrameOfRef w/o POS w/o FREQ
segments.gen    20 bytes
_43.fdt        8.1 MB
_43.fdx       64.4 MB
_43.fnm       20 bytes
_43_0.skp    182.6 MB
_43_0.tib     32.3 MB
_43_0.tiv      1.1 MB
segments_2   267 bytes
_43_0.doc      2.8 GB

During indexing I use StandardAnalyzer (StandardFilter, LowerCaseFilter,
StopFilter).
Is there somewhere I can get more information about codec creation, or is it
just a matter of digging through the code?

My own implementation needs 2.8 GB of space including FREQ but not POS. This
is why I am asking: I want to compare the results somehow. Compared to 20 GB
it is very nice, and compared to 1.6 GB it is very bad ;).

Regards
Alex





New codecs keep Freq skip/omit Pos

2011-04-21 Thread Alex vB
Hello everybody,

I am currently testing several new Lucene 4.0 codec implementations to
compare against my own solution. The difference is that I am only indexing
frequencies and not positions, and I would like to have this for the other
codecs as well. I know there was already a post on this topic:
http://lucene.472066.n3.nabble.com/Omit-positions-but-not-TF-td599710.html.

I just wanted to ask whether anything has changed, especially for the new
codecs. I had a look at FixedPostingWriterImpl and PostingsConsumer. Are those
the right places for adapting the Pos/Freq handling? What would happen if I
just skip writing positions/payloads? Would it mess up the index?

The written files have different extensions like pyl, skp, pos, doc, etc.
Does "not counting" the pos file give me a correct index size estimate for
W Freq W/O Pos? Or where exactly are term positions written?
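
For the estimate I have in mind I would simply sum the file sizes with and
without the .pos extension; a rough sketch (the directory path is a
placeholder), which of course is only correct if positions really live in
.pos alone:

import java.io.File;

public class IndexSizeEstimate {
    public static void main(String[] args) {
        File dir = new File("/path/to/index");
        long total = 0, withoutPos = 0;
        for (File f : dir.listFiles()) {
            if (!f.isFile()) continue;
            total += f.length();
            if (!f.getName().endsWith(".pos")) {
                withoutPos += f.length();        // approximation for "W Freq W/O Pos"
            }
        }
        System.out.println("total = " + total + " bytes, without .pos = "
                + withoutPos + " bytes");
    }
}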

Regards
Alex

PS: Some results with the current codecs, if someone is interested. I indexed
10% of Wikipedia (English).
Each version is indexed as a document.

Docs                        240179
Versions                   8467927
Distinct terms             3501214
Total terms             1520008204
Avg. versions                35.25
Avg. terms per version      179.50
Avg. terms per doc         6328.65

PforDelta W Freq W Pos 20.6 GB
PforDelta W/O Freq W/O Pos   1.6 GB
Standard 4.0 W Freq W Pos  28.1 GB
Standard 4.0 W/O Freq W/O Pos6.2 GB
Pfor W Freq W Pos 22 GB
Pfor W/O Freq W/O Pos3.1 GB

Performance follows ;)





Lucene 4.0 Payloads

2011-03-17 Thread Alex vB
Hello everybody,

I am currently experimenting with Lucene 4.0 and would like to add payloads.
The payload should be added only once per term, on the first position. My
current code looks like this:

public final boolean incrementToken() throws java.io.IOException {
    String term = characterAttr.toString();

    if (!input.incrementToken()) {
        return false;
    }

    // hmh contains all terms for one document
    if (hmh.checkKey(term)) {                  // check if the hashmap contains the term
        Payload payload = new Payload(hmh.getCompressedData(term)); // get payload data
        payloadAttr.setPayload(payload);       // add payload
        hmh.removeFromIndexingMap(term);       // remove term from hashmap
    }

    return true;
}
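
For comparison, here is a self-contained variant of the same filter that
advances the stream before reading the term (the snippet above reads
characterAttr before input.incrementToken(), so it looks at the previous
token). The map of per-term data stands in for my hmh helper; whether this
ordering is the actual cause of my problem below, I am not sure:

import java.io.IOException;
import java.util.Map;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.index.Payload;

public final class OncePerTermPayloadFilter extends TokenFilter {
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);
    private final Map<String, byte[]> perTermData;  // term -> compressed data, per document

    public OncePerTermPayloadFilter(TokenStream input, Map<String, byte[]> perTermData) {
        super(input);
        this.perTermData = perTermData;
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        String term = termAtt.toString();            // term of the token just produced
        byte[] data = perTermData.remove(term);      // only the first occurrence still has data
        payloadAtt.setPayload(data == null ? null : new Payload(data));
        return true;
    }
}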

Is this a correct way of adding payloads in Lucene 4.0? When I try to
retrieve payloads, I do not get a payload on the first position. For reading
payloads I use this:

DocsAndPositionsEnum tp = MultiFields.getTermPositionsEnum(ir,
        MultiFields.getDeletedDocs(ir), fieldName, new BytesRef(searchString));

while (tp.nextDoc() != tp.NO_MORE_DOCS) {
    if (tp.hasPayload() && counter < 10) {
        Document doc = ir.document(tp.docID());
        BytesRef br = tp.getPayload();
        System.out.println("Found payload \"" + br.utf8ToString() + "\" for document "
                + tp.docID() + " and query " + searchString
                + " in country " + doc.get("country"));
    }
}

As far as I know there are two possibilities for using payloads:
1) during similarity scoring
2) during search

Is there a better/faster way to retrieve payloads during search? Is it
possible to run a normal query and read the payloads from the hits? Is 1 or 2
the faster way to use payloads? Is there example code somewhere for Lucene and
loading payloads?

Regards
Alex




Early Termination

2011-03-15 Thread Alex vB
Hi,

Is Lucene capable of any early-termination techniques during query
processing?
On the forum I only found some information about TimeLimitedCollector. Are
there other implementations?

Regards
Alex




How are stored Fields/Payloads loaded

2011-02-28 Thread Alex vB
Hello everybody,

I am currently unsure how stored data is written to and loaded from the
index. I want to store some binary data for every term of a document, but only
once and not for every position!
Therefore I am not sure whether payloads or stored fields are the better
solution (or the not yet implemented column stride fields feature).

As far as I know, all fields of a document are loaded by Lucene during
search. With large stored fields this can be time consuming, which is why the
possibility exists to load specific fields with a FieldSelector. Maybe I could
create a stored field for each term (up to several thousand fields!) and read
those fields depending on the query term. Is this a common approach?
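
As a concrete example of what I mean by loading specific fields, a minimal
sketch with a FieldSelector (3.x-style API; the field name is a placeholder):

import java.io.IOException;
import java.util.Collections;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.FieldSelector;
import org.apache.lucene.document.MapFieldSelector;
import org.apache.lucene.index.IndexReader;

public class SingleFieldLoader {
    // Loads only the named stored field of a hit instead of all stored fields.
    public static String loadOneField(IndexReader reader, int docId, String fieldName)
            throws IOException {
        FieldSelector onlyOne = new MapFieldSelector(Collections.singletonList(fieldName));
        Document doc = reader.document(docId, onlyOne);
        return doc.get(fieldName);
    }
}
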
The other possibility (as I have implemented it at the moment) is to store
one payload per term, but only on the first term position. Payloads are loaded
only if I retrieve them from a hit, right? So my current posting list looks
like this:
http://lucene.472066.n3.nabble.com/file/n2598739/Payload.png
(picture adapted from M. McCandless, "Fun with Flex")

How will the column stride fields feature (per-document fields) work? It is
not clear to me what "per document" exactly means for the posting list
entries. I think (hope :P) it works like this:
http://lucene.472066.n3.nabble.com/file/n2598739/CSD.png
(picture adapted from M. McCandless, "Fun with Flex")


Do I understand column stride fields correctly? What would give me the best
performance (stored field, payload, CSF)? Are there other ways to retrieve
payloads during search than SpanQuery (I would like to use a normal query
here)?

Regards
Alex




Storing payloads without term-position and frequency

2011-02-02 Thread Alex vB

Hello everybody,

I am currently using Lucene 3.0.2 with payloads. I store extra information
about the term, such as frequencies, in the payloads, and therefore I don't
need the frequencies and term positions normally stored by Lucene. I would
like to set f.setOmitTermFreqAndPositions(true), but then I am not able to
retrieve payloads. Would it be hard to "hack" Lucene for my requirements?
Also, I only store one payload per term, if that makes things easier.

Best regards
Alex



RE: Could not find implementing class

2011-01-25 Thread Alex vB

Hello Uwe,

I recompiled some classes manually in the Lucene sources. Now it's running
fine! Something must have gone wrong there.

Thank you very much!

Best regards
Alex



Re: Could not find implementing class

2011-01-25 Thread Alex vB

Hello Alexander,

Isn't it enough to add the classpath through -cp? If I don't use -cp I can't
compile my project, and I thought that after compiling without errors all
sources were correctly added. In Eclipse I added the Lucene sources the same
way (which works), and I also tried using the jar file. So I seem to find all
classes, but I can't make sense of the error message. The error message is
thrown by the Lucene class DefaultAttributeFactory in
org.apache.lucene.util.AttributeSource. I work under Ubuntu and configured
Java with:

- sudo update-alternatives --config java
- sudo update-java-alternatives -java-6-sun

Greetings
Alex





Could not find implementing class

2011-01-25 Thread Alex vB

Hello everybody,

I used a small indexing example from "Lucene in Action" and can compile and
run the program under Eclipse. If I compile and run it from the console, I get
this error:


java.lang.IllegalArgumentException: Could not find implementing class for
org.apache.lucene.analysis.tokenattributes.TermAttribute
    at org.apache.lucene.util.AttributeSource$AttributeFactory$DefaultAttributeFactory.getClassForInterface(AttributeSource.java:87)
    at org.apache.lucene.util.AttributeSource$AttributeFactory$DefaultAttributeFactory.createAttributeInstance(AttributeSource.java:66)
    at org.apache.lucene.util.AttributeSource.addAttribute(AttributeSource.java:245)
    at org.apache.lucene.index.DocInverterPerThread$SingleTokenAttributeSource.<init>(DocInverterPerThread.java:41)
    at org.apache.lucene.index.DocInverterPerThread$SingleTokenAttributeSource.<init>(DocInverterPerThread.java:36)
    at org.apache.lucene.index.DocInverterPerThread.<init>(DocInverterPerThread.java:34)
    at org.apache.lucene.index.DocInverter.addThread(DocInverter.java:95)
    at org.apache.lucene.index.DocFieldProcessorPerThread.<init>(DocFieldProcessorPerThread.java:62)
    at org.apache.lucene.index.DocFieldProcessor.addThread(DocFieldProcessor.java:88)
    at org.apache.lucene.index.DocumentsWriterThreadState.<init>(DocumentsWriterThreadState.java:43)
    at org.apache.lucene.index.DocumentsWriter.getThreadState(DocumentsWriter.java:739)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:814)
    at org.apache.lucene.index.DocumentsWriter.addDocument(DocumentsWriter.java:802)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1998)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1972)
    at Demo.setUp(Demo.java:86)
    at Demo.main(Demo.java:46)

I compile with javac -cp  Demo.java, which finishes without errors, but
running the program isn't possible. What am I missing? Basically I am just
creating a directory, getting an IndexWriter with an analyzer, etc. Line 86
in Demo.java is writer.addDocument(doc);.

Greetings Alex



Indexing large XML dumps

2011-01-03 Thread Alex vB

Hello everybody,

I am currently indexing Wikipedia dumps and creating an index for versioned
document collections. So far everything is working fine, but I never thought
that single Wikipedia articles would reach a size of around 2 GB!
One article, for example, has 2 versions with an average length of 6
characters each (HUGE in memory!). This means I need a heap space of around
4 GB to perform the indexing, and I would like to decrease my memory
consumption ;).

At the moment I load every Wikipedia article, containing all its versions,
completely into memory. Then I collect some statistical data about the article
in order to store extra information about term occurrences, which is written
into the index as payloads. The statistics are created during a separate
tokenization run which happens before the document is written to the index.
This means I am analyzing my documents twice! :( I know there is a
CachingTokenFilter, but I haven't found out how and where to use it exactly
(I tried it in my Analyzer but stream.reset() seems not to work). Does
somebody have a nice example?

1) Can I somehow avoid loading a complete article to get my statistics?
2) Is it possible to index large files without completely loading them into a
field?
3) How can I avoid parsing an article twice? (A sketch of the
CachingTokenFilter idea I have in mind follows below.)
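
This is roughly how I imagine the CachingTokenFilter approach (Lucene 3.0-style
attributes; the stats map and all names are placeholders): the first pass
consumes the stream, which fills the cache, and collects the statistics;
reset() rewinds the cached tokens; and the same filter is then handed to the
Field, so the text is analyzed only once.

import java.io.IOException;
import java.io.StringReader;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CachingTokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.document.Field;

public class AnalyzeOnce {
    // Returns a Field whose tokens come from the already cached (and counted) stream.
    public static Field analyzedOnce(Analyzer analyzer, String fieldName, String text,
                                     Map<String, Integer> stats) throws IOException {
        TokenStream ts = analyzer.tokenStream(fieldName, new StringReader(text));
        CachingTokenFilter cached = new CachingTokenFilter(ts);

        TermAttribute termAtt = cached.addAttribute(TermAttribute.class);
        while (cached.incrementToken()) {        // pass 1: fills the cache
            String term = termAtt.term();
            Integer old = stats.get(term);
            stats.put(term, old == null ? 1 : old + 1);
        }
        cached.reset();                          // rewind to replay the cached tokens

        return new Field(fieldName, cached);     // pass 2: IndexWriter consumes the cache
    }
}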

Best regards 
Alex





Re: Implementing indexing of Versioned Document Collections

2010-11-16 Thread Alex vB

Hi again,

my payloads are working fine, as I have figured out now (I hadn't seen the
nextPosition method). I really have problems with adding the bitvectors,
though. Currently I am creating them during tokenization. Therefore, as
already mentioned, they are only completely created once all fields have been
tokenized, because I add every new term occurrence to a HashMap and
create/update the linked bitvector during this analysis process. I read in
another post that changing or updating already set payloads isn't possible.
Furthermore, I need to store the payload only ONCE per term and not on every
term position. For example, in the wiki article for April I would have around
5000 term occurrences for the term "April"! Storing it once would save a lot
of memory.

1) Is it possible to pre-analyze fields? Maybe by analyzing twice: the first
time for building the bitvectors (without writing them!) and the second time
for normal index writing with the bitvector payloads.
2) Alternatively, I could still add the bitvectors during tokenization if I
were able to set the current term in my custom filter (extends TokenFilter).
In my HashMap I have pairs of <term, bitvector>, and I could iterate over all
term keys. Is it possible to manually set the current term and the
corresponding payload? I tried something like this after all fields and
streams had been tokenized, without success (a sketch of this idea follows
after question 3):

for (Map.Entry e : map.entrySet()) {
    key = e.getKey();
    value = e.getValue();

    termAtt.setTermBuffer(key);
    bitvectorPayload = new Payload(toByteArray(value));
    payloadAttr.setPayload(bitvectorPayload);
}

3) Can I use payloads without term positions? 
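
Regarding option 2: instead of patching tokens inside the original filter
chain, I could probably emit one synthetic token per <term, bitvector> entry
from a separate TokenStream and index that as an additional field. A rough
sketch (the map values are assumed to be the already serialized bitvectors;
all names are placeholders):

import java.util.Iterator;
import java.util.Map;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.index.Payload;

public final class MapPayloadTokenStream extends TokenStream {
    private final Iterator<Map.Entry<String, byte[]>> it;
    private final TermAttribute termAtt = addAttribute(TermAttribute.class);
    private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);

    public MapPayloadTokenStream(Map<String, byte[]> termToBitvector) {
        this.it = termToBitvector.entrySet().iterator();
    }

    @Override
    public boolean incrementToken() {
        if (!it.hasNext()) {
            return false;
        }
        clearAttributes();
        Map.Entry<String, byte[]> e = it.next();
        termAtt.setTermBuffer(e.getKey());                  // the distinct term
        payloadAtt.setPayload(new Payload(e.getValue()));   // its bitvector, exactly once
        return true;
    }
}

// Usage: doc.add(new Field("secondLevel", new MapPayloadTokenStream(map)));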

If my questions are unclear please tell me! :)

Best regards
Alex






Re: Implementing indexing of Versioned Document Collections

2010-11-16 Thread Alex vB

Hello Pulkit,

thank you for your answer and excuse me for my late reply. I am currently
working on the payload stuff and have implemented my own Analyzer and
Tokenfilter for adding custom payloads. As far as I understand I can add
Payload for every term occurence and write this into the posting list. My
posting list now looks like this:

car -> DocID1, [Payload 1], DocID2, [Payload2]., DocID N, [Payload N]

Where each payload is a BitSet depending on the versions of a document. I
must admit that the index is getting really big at the moment because I am
adding around 8 to 16 bytes with each payload. I have to find a good
compression for the bitvectors. 
Furthermore, I always get the error
org.apache.lucene.index.CorruptIndexException: checksum mismatch in segments
file when I use my own Analyzer. After I disable the checksum test everything
works fine; even Luke isn't giving me an error. Any ideas?
Another problem is the bitvector creation during tokenization. I run through
all versions during the tokenizing step to create my bitvectors (stored in a
HashMap), so my bitvectors are only completely created after the last field
has been analyzed (I added every Wikipedia version as its own field).
Therefore I need to add the payloads after the tokenizing step. Is this
possible? What happens if I add a payload for the current term and add another
payload for the same term later? Is it overwritten or appended?
Greetings
Alex



Implementing indexing of Versioned Document Collections

2010-11-09 Thread Alex vB

Hello everybody,

I would like to implement the paper "Compact Full-Text Indexing of Versioned
Document Collections" [1] by Torsten Suel in Lucene for my diploma thesis.
The basic idea is to create a two-level index structure. On the first level a
document is identified by its document ID, with a posting list entry if the
term exists in at least one version. For every posting on the first level with
term t we have a bitvector on the second level. These bitvectors contain as
many bits as there are versions of the document, and bit i is set to 1 if
version i contains term t; otherwise it remains 0.

http://lucene.472066.n3.nabble.com/file/n1872701/Unbenannt_1.jpg 

This little picture is just for demonstration purposes. It shows a posting
list for the term car, composed of 4 document IDs. If a hit is found in
document 6, another look-up is needed on the second level to get the
corresponding versions (versions 1, 5, 7, 8, 9 and 10 out of 10 versions in
total).

At the moment I am using Wikipedia (the simplewiki dump) as the source, with
a SAXParser, and I can resolve each document with all its versions from the
XML file (the fields are Title, ID and Content, separated for each version).
My problem is that I am unsure how to connect the second level with the first
one and how to store it. The key points needed are:
- information from posting list creation to build the bitvector (term ->
doc -> versions)
- storing the bitvectors
- implementing search on the second level

For the first steps I disabled term frequencies and positions because the
paper doesn't handle them. I would be happy to get any running version at
all. :)
At the moment I can create bitvectors for the documents. I realized this with
a HashMap in TermsHashPerField, where I grab the current term in add() (I hope
this is the correct location for retrieving the inverted list terms). Anyway,
I can create the correct bitvectors and write them into a text file.
Excerpt of the bitvectors from the article "April":
april :     110110111011
never :     0010
ayriway :   010110111011
inclusive : 1000
 

The next step would be storing all bitvectors in the index. At first glance I
would like to use an extra field to store the created bitvectors permanently
in the index. It seems to be the easiest way for a first implementation
without touching the low-level internals of Lucene. Can I add a field after I
have already started writing the document through the IndexWriter? How would I
do this? Or are there any other suggestions for storing them? Another idea is
to extend the index format of Lucene, but this seems a little bit too
difficult for me. Maybe I could write this information into my own file. Could
anybody point me in the right direction? :)
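
To make the "extra field" idea concrete, a minimal sketch of packing one
term's bitvector into bytes and storing it as a binary stored field (manual
packing, since BitSet has no toByteArray() on Java 6; all names are
placeholders, and whether one field per term is sensible is part of my
question):

import java.util.BitSet;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class BitvectorFieldSketch {

    // Packs a version bitvector into bytes, bit i = 1 if version i contains the term.
    static byte[] toBytes(BitSet bits, int numVersions) {
        byte[] out = new byte[(numVersions + 7) / 8];
        for (int i = bits.nextSetBit(0); i >= 0; i = bits.nextSetBit(i + 1)) {
            out[i / 8] |= (byte) (1 << (i % 8));
        }
        return out;
    }

    // Stores the packed bitvector for one term as a binary stored field of the document.
    public static void addBitvectorField(Document doc, String term, BitSet bits,
                                         int numVersions) {
        doc.add(new Field("bitvector_" + term, toBytes(bits, numVersions), Field.Store.YES));
    }
}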

Currently I am focusing on the storing part and will try to extend Lucene's
search after that step.

THX in advance & best regards 
Alex

[1] http://cis.poly.edu/suel/



Detailed file handling on hard disk

2010-09-03 Thread Alex vB

Hello everybody,

I read the paper "Performance of Compressed Inverted List Caching in Search
Engines" (http://www2008.org/papers/pdf/p387-zhangA.pdf), and now I am unsure
how Lucene lays out its structures on the hard disk. I am using Windows as the
OS, and therefore my FSDirectory implementation is based on
java.io.RandomAccessFile.

How is the skipping in the .tis file realized? Does Lucene also use metadata
at the beginning of each block, like in the paper mentioned above on page 388
(there the metadata stores information about how many inverted lists are in
the block and where they start)?

http://lucene.472066.n3.nabble.com/file/n1413062/Block_assignment.jpg 

I ask because I read in another article that I can seek to the correct
position on the hard drive with the byte address using
java.io.RandomAccessFile (which I can read from the .tii file as
"IndexDelta"?).
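
Just to make clear what I mean by seeking with the byte address, a trivial
sketch with java.io.RandomAccessFile (the file name and offset are made-up
placeholders):

import java.io.IOException;
import java.io.RandomAccessFile;

public class SeekExample {
    public static void main(String[] args) throws IOException {
        RandomAccessFile raf = new RandomAccessFile("/path/to/index/_1.tis", "r");
        try {
            long byteOffset = 123456L;      // e.g. an index pointer taken from the .tii file
            raf.seek(byteOffset);           // jump straight to that position
            byte[] buf = new byte[1024];
            int read = raf.read(buf);       // read the term/postings data starting there
            System.out.println("read " + read + " bytes at offset " + byteOffset);
        } finally {
            raf.close();
        }
    }
}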

How do I find the correct position/location of my posting list/document?
Do I need information/metadata about the blocks from the underlying file
system?
And where can I find further information about this stuff? :)

Best regards
Alex