[jira] Commented: (LUCENE-140) docs out of order

2007-01-09 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463243
 ] 

Michael McCandless commented on LUCENE-140:
---

Jed, thanks for testing the fix!

> Alas, this doesn't appear to be the problem. We are still getting
> it, but we do at least have a little more info. We added the doc and
> lastDoc to the IllegalArgEx and we are getting very strange numbers:
>
> java.lang.IllegalStateException: docs out of order (-1764 < 0)
> ...
>
> where doc = -1764 and lastDoc is zero

OK, so we've definitely got something else at play here (bummer!).  That
(negative doc number) is very strange.  I will keep looking at this.  I
will prepare a patch on 1.9.1 that adds some more instrumentation so
we can get more details when you next hit this ...

> We do only use the deleteDocuments(Term) method, so we are not sure
> whether this will truly fix our problem, but we note that that
> method calls deleteDocument(int) based on the TermDocs returned for
> the Term - and maybe they can be incorrect???

Right, but I had thought the docNums coming in from this path would
be correct.  It sounds like we have another issue at play here that
can somehow get even these doc numbers messed up.
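
For reference, the delete-by-term path being discussed looks roughly like
this (a paraphrase of the IndexReader API described here, not the actual
1.9.1 source):

    public int deleteDocuments(Term term) throws IOException {
      TermDocs docs = termDocs(term);      // doc numbers come from this same reader
      if (docs == null) return 0;
      int n = 0;
      try {
        while (docs.next()) {
          deleteDocument(docs.doc());      // so they should be in range for this reader
          n++;
        }
      } finally {
        docs.close();
      }
      return n;
    }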

> Out of interest, apart from changing from 1.4.3 to 1.9.1, in the
> JIRA 3.7 release we changed our default merge factor to 4 from
> 10. We hadn't seen this problem before, and suddenly we have had a
> reasonable number of occurrences.

Interesting.  Maybe try changing the merge factor back to 10 and see if that
suppresses the bug?  It might give us [desperately needed] more data to work
with here!  On the 1.4.3 -> 1.9.1 change, some of the cases above were even
pre-1.4.x (though they could have been from yet another root cause, or
maybe the filesystem), so it's hard to draw firm conclusions on this
front.
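
If it helps, switching the merge factor is a one-line change (a minimal
sketch; the directory/analyzer setup is illustrative, and on older releases
mergeFactor is a public field rather than a setter):

    IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), false);
    writer.setMergeFactor(10);   // JIRA 3.7 had lowered this from the default of 10 to 4
    // ... add documents as usual ...
    writer.close();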



> docs out of order
> -
>
> Key: LUCENE-140
> URL: https://issues.apache.org/jira/browse/LUCENE-140
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: unspecified
> Environment: Operating System: Linux
> Platform: PC
>Reporter: legez
> Assigned To: Michael McCandless
> Attachments: bug23650.txt, corrupted.part1.rar, corrupted.part2.rar
>
>
> Hello,
>   I cannot figure out why (and what) this keeps happening. I get an
> exception:
> java.lang.IllegalStateException: docs out of order
> at org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:219)
> at org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:191)
> at org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:172)
> at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:135)
> at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:88)
> at org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:341)
> at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:250)
> at Optimize.main(Optimize.java:29)
> It happens in both 1.2 and 1.3rc1 (by the way, what happened to that release?
> I can find it neither in the downloads nor in the version list in this form).
> Everything seems OK: I can search the index, but I cannot optimize it. Even
> worse, after this exception, every time I add new documents and close the
> IndexWriter a new segment is created! I think it contains all the documents
> added before, judging by its size.
> My index is quite big: 500,000 docs, about 5 GB of index directory.
> It is _repeatable_. I drop the index and reindex everything. Afterwards I add
> a few docs, try to optimize, and receive the above exception.
> My documents' structure is:
>   static Document indexIt(String id_strony, Reader reader, String data_wydania,
>     String id_wydania, String id_gazety, String data_wstawienia)
>   {
>     Document doc = new Document();
>     doc.add(Field.Keyword("id", id_strony));
>     doc.add(Field.Keyword("data_wydania", data_wydania));
>     doc.add(Field.Keyword("id_wydania", id_wydania));
>     doc.add(Field.Text("id_gazety", id_gazety));
>     doc.add(Field.Keyword("data_wstawienia", data_wstawienia));
>     doc.add(Field.Text("tresc", reader));
>     return doc;
>   }
> Sincerely,
> legez




[jira] Commented: (LUCENE-140) docs out of order

2007-01-09 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463247
 ] 

Michael McCandless commented on LUCENE-140:
---


Doron,

> (1) the sequence of ops brought by Jason is wrong:
>   ...
>
> Problem here is that the docIDs found in (b) may be altered in step
> (d) and so step (f) would delete the wrong docs. In particular, it
> might attempt to delete ids that are out of range. This might
> expose exactly the BitVector problem, and would explain the whole
> thing, but I too cannot see how it explains the delete-by-term case.

Right, the case I fixed only happens when Lucene's
deleteDocument(int docNum) is [slightly] mis-used.  I.e., if you are
"playing by the rules" you would never have hit this bug.  And this
particular use case is indeed incorrect: doc numbers are only valid for
the one reader that you got them from.
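
As a concrete illustration of that rule (a minimal sketch, not taken from
anyone's actual code; the directory, analyzer and document variables are
placeholders):

    IndexReader first = IndexReader.open(dir);
    int docNum = /* a doc number obtained via first, e.g. from its TermDocs */ 0;
    first.close();

    IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), false);
    writer.addDocument(someDoc);     // the index changes; doc numbers are not stable across readers
    writer.close();

    IndexReader second = IndexReader.open(dir);
    second.deleteDocument(docNum);   // wrong: docNum was only valid for "first"
    second.close();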

> I think however that the test Mike added does not expose the docs
> out of order bug - I tried this test without the fix and it only
> fails on the "gotException" assert - if you comment out this assert the
> test passes.

Huh, I see my test case (in IndexReader) indeed hitting the original
"docs out of order" exception.  If I take the current trunk and
comment out the (one line) bounds check in BitVector.set and run that
test, it hits the "docs out of order" exception.

Are you sure you picked up the change (tightening the check from a < to
a <=) in index/SegmentMerger.java?  Because I did indeed find that the
test failed to fail when I first wrote it (but should have).  In digging
into why it didn't fail as expected, I found that the check for
"docs out of order" missed the boundary case of the same doc number
appearing twice in a row.  Once I fixed that, the test failed as expected.
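
A paraphrase of the tightened check being discussed (illustrative names,
not the exact SegmentMerger source):

    final class DocOrderCheck {
      private int lastDoc = -1;

      void next(int doc) {
        // Before the fix this was "doc < lastDoc", which let the same doc
        // number appear twice in a row without tripping the exception.
        if (doc <= lastDoc)
          throw new IllegalStateException(
              "docs out of order (" + doc + " <= " + lastDoc + ")");
        lastDoc = doc;
      }
    }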

> (3) maxDoc() computation in SegmentReader is based (on some paths)
> on RandomAccessFile.length(). IIRC I saw cases (in a previous project)
> where File.length() or RAF.length() (not sure which of the two) did
> not always reflect the real length if the system was very busy IO-wise,
> unless FD.sync() was called (with a performance hit).

Yes, I saw this too.  From the follow-on discussion it sounds like we
haven't found a specific known JVM bug here.  Still, it does make me
nervous that we rely on file length to derive maxDoc.

In general I think we should rely on as little as possible from the
file system (there are so many cross-platform issues/differences) and
instead explicitly store things like maxDoc in the index.  I will
open a separate Jira issue to track this.  I will also instrument this
path in the 1.9.1 patch, just to see whether we are actually hitting
something here (I think it's unlikely, but possible).
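
To make the contrast concrete, a minimal sketch of the two approaches (the
8-bytes-per-document .fdx layout and the names here are assumptions for
illustration, not the actual Lucene source):

    class MaxDocSketch {
      // Derived: infer the document count from the length of the fields index file.
      static int maxDocFromFdxLength(long fdxFileLength) {
        return (int) (fdxFileLength / 8L);   // one 8-byte pointer per stored document
      }

      // Explicit: use a per-segment doc count recorded in the index at write time.
      static int maxDocFromStoredCount(int storedDocCount) {
        return storedDocCount;
      }
    }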



[jira] Created: (LUCENE-767) maxDoc should be explicitly stored in the index, not derived from file length

2007-01-09 Thread Michael McCandless (JIRA)
maxDoc should be explicitly stored in the index, not derived from file length
-

 Key: LUCENE-767
 URL: https://issues.apache.org/jira/browse/LUCENE-767
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 2.0.0, 1.9, 2.0.1, 2.1
Reporter: Michael McCandless
 Assigned To: Michael McCandless
Priority: Minor


This is a spinoff of LUCENE-140.

In general we should rely on "as little as possible" from the file system.  
Right now, maxDoc is derived by checking the file length of the FieldsReader 
index file (.fdx), which makes me nervous.  I think we should explicitly store 
it instead.

Note that there are no known cases where this is actually causing a problem. 
There was some speculation in the discussion of LUCENE-140 that it could be one 
of the possible causes, but in the digging / discussion no specifically 
relevant JVM bugs were found (yet!).  So this would be a defensive fix at this point.




[jira] Commented: (LUCENE-140) docs out of order

2007-01-09 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463249
 ] 

Michael McCandless commented on LUCENE-140:
---

OK, I created LUCENE-767 for the "maxDoc should be explicitly stored in the 
index" issue.





Re: The tvp extension

2007-01-09 Thread Bernhard Messer
Term vectors with positions are written to the "tvf" file like the other 
term vector information; there is no extra file containing term 
vector position information. The "tvp" extension seems to be a relic 
from earlier days when Lucene file extensions were spread across 
several class files, e.g. SegmentReader.java.


I will remove the "tvp" extension so that nobody gets confused. Thanks 
to Nicolas for reporting the bug.


Bernhard


I don't have the sources handy to check, but my guess would be this extension 
is/was for Term Vectors with Positions.
Like somebody else said recently, it would be good to make all these into static finals, 
so we don't have to chase "string" values in the code.

Otis
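
For what it's worth, a minimal sketch of what such constants might look like
(the names are illustrative, not the actual IndexFileNames layout):

    final class IndexFileExtensions {
      static final String COMPOUND = "cfs";
      static final String FIELD_INFOS = "fnm";
      static final String FIELDS_INDEX = "fdx";
      static final String FIELDS_DATA = "fdt";
      static final String VECTORS_INDEX = "tvx";
      static final String VECTORS_DOCUMENTS = "tvd";
      static final String VECTORS_FIELDS = "tvf";
      // ... one constant per extension, referenced everywhere instead of bare strings

      static final String[] ALL = {
        COMPOUND, FIELD_INFOS, FIELDS_INDEX, FIELDS_DATA,
        VECTORS_INDEX, VECTORS_DOCUMENTS, VECTORS_FIELDS,
      };

      private IndexFileExtensions() {}
    }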

- Original Message 
From: Nicolas Lalevée <[EMAIL PROTECTED]>
To: java-dev@lucene.apache.org
Sent: Monday, January 8, 2007 12:10:32 AM
Subject: The tvp extension


Hello,

In o.a.l.index.IndexFileNames.java, there are these lines:

  static final String INDEX_EXTENSIONS[] = new String[] {
  "cfs", "fnm", "fdx", "fdt", "tii", "tis", "frq", "prx", "del",
  "tvx", "tvd", "tvf", "tvp", "gen", "nrm" 
  };


What is the "tvp" extension? I didn't see any occurrence of it in the docs, 
nor in the code. A bug?


cheers,
Nicolas




Beyond Lucene 2.0 Index Design

2007-01-09 Thread Dalton, Jeffery
Hi,

I wanted to start some discussion about possible future Lucene file /
index formats.  This is an extension to the discussion on Flexible
Lucene Indexing discussed on the wiki:
http://wiki.apache.org/jakarta-lucene/FlexibleIndexing

Note: Related sources are listed at the end.

I would like to have the ability to create term-frequency-sorted [Persin, et
al. 1996] or "impact"-sorted [Anh, Moffat 2001, 2006] posting lists (freq
data).  A posting list sorted by term frequency rather than document id
is straightforward (posting design below).  An impact-sorted list is
relatively new (and perhaps unfamiliar).  An impact is a single integer
value for a term in a document that is stored in the posting list and is
computed from the combination of the term frequency, document boost,
field boost, length norms, and other arbitrary scoring features (word
position, font, etc.) -- all local information.
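
As a rough illustration only (this thread does not specify the exact
combination or quantization, so the formula and bit width below are purely
hypothetical):

    // Hypothetical: squash a document-local score for one term into a small
    // integer "impact" that can be stored in the posting list.
    static int impact(float tf, float docBoost, float fieldBoost, float lengthNorm) {
      float raw = (float) Math.sqrt(tf) * docBoost * fieldBoost * lengthNorm;
      int max = (1 << 8) - 1;                       // quantize to 8 bits
      return Math.min(max, Math.round(raw * 32f));  // arbitrary scale, for illustration
    }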

The driving motivation for this change is to avoid reading the entire
posting list from disk for very long posting lists (it also leads to
simplified query-time scoring because the tf, norms, and boosts are
built into the impact).  This would address scalability issues with
large collections that have been seen in the past; back in December 2005
there were two threads, "Lucene Performance Bottlenecks" (Lucene User)
and "IndexOptimizer Re: Lucene Performance Bottlenecks" (Nutch Dev),
where Doug and Andrzej addressed some speed concerns by sorting the
Nutch index based on document boost (IndexSorter and a TopDocsCollector)
[inspired by Long, Suel]. The goal is that an impact-sorted posting list
would address these and other concerns in a generic manner.

Allowing a frequency- or impact-sorted posting list format would lead to
a posting list with the following structure (frequency or impact could be
used interchangeably in the structure; lettering continued from the wiki):

e. / f. [the angle-bracketed structure was stripped by the mail archive; what
remains, "...[docN, freq ,])", suggests each entry pairs a frequency/impact
value with the doc numbers (and positions) that share it]

The positions are delta encoded for compression.  Similarly, the
document numbers for a given frequency/impact are delta encoded.  If you
read Moffat and Persin, the papers show that this achieves compression
comparable to, or even better than, a standard delta encoded docId
sorted index.  The structure lends itself well to early termination,
pruning, etc... where the entire posting list is not read from disk.  
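
A minimal sketch of the delta encoding described above, assuming the doc ids
within one frequency/impact group are already sorted (illustrative only):

    static int[] deltaEncode(int[] sortedDocIds) {
      int[] gaps = new int[sortedDocIds.length];
      int prev = 0;
      for (int i = 0; i < sortedDocIds.length; i++) {
        gaps[i] = sortedDocIds[i] - prev;   // small gaps compress well (e.g. as VInts)
        prev = sortedDocIds[i];
      }
      return gaps;
    }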

This type of Impact sorted structure (or similar concept) seems to be
becoming a "standard" solution described in a lot of new research / text
books on IR for large scale indexes.  It would be great if Lucene
supported something like this someday ;-).

Thanks,

Jeff Dalton

References:
Anh, Moffat. Vector-Space Ranking with Effective Early Termination. 2001.
Anh, Moffat. Impact Transformation: Effective and Efficient Web Retrieval. 2006.
Anh, Moffat. Pruned Query Evaluation Using Pre-Computed Impacts. 2006.
Long, Suel. Optimized Query Execution in Large Search Engines with Global Page Ordering.
Manning, Raghavan, Schutze. Introduction to Information Retrieval, Chapters 2, 7.
http://www-csli.stanford.edu/%7Eschuetze/information-retrieval-book.html
Persin, et al. Filtered Document Retrieval with Frequency-Sorted Indexes. 1996.









[jira] Updated: (LUCENE-140) docs out of order

2007-01-09 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-140:
--

Attachment: LUCENE-140-2007-01-09-instrumentation.patch





[jira] Commented: (LUCENE-140) docs out of order

2007-01-09 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463294
 ] 

Michael McCandless commented on LUCENE-140:
---


Jed, one question: when you tested the fix, you fully rebuilt your
index from scratch, right?  Just want to verify that.  You have to
re-index because once the index is corrupted it will eventually hit
the "docs out of order" exception even if you fix the original cause.

OK I've prepared a patch off 1.9.1 (just attached it).  The patch
passes all unit tests on 1.9.1.

It has the changes I committed to the trunk yesterday, plus
instrumentation (messages printed to a PrintStream) to catch places
where doc numbers are not correct.

All messages I added print to a newly added infoStream static member
of SegmentMerger.  You can do SegmentMerger.setInfoStream(...) to
change it (it defaults to System.err).

Jed if you could get the error to re-occur with this patch and then
post the resulting messages, that would be great.  Hopefully it gives
us enough information to find the source here or at least to have
another iteration with yet more instrumentation.  Thanks!
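
For example, redirecting that output to a file might look like this (a
sketch; it assumes the patch's static setter is reachable from your code,
which may require placing it in the org.apache.lucene.index package if
SegmentMerger is package-private):

    java.io.PrintStream log =
        new java.io.PrintStream(new java.io.FileOutputStream("segment-merger-debug.log"), true);
    SegmentMerger.setInfoStream(log);   // defaults to System.err if not set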







[jira] Commented: (LUCENE-767) maxDoc should be explicitly stored in the index, not derived from file length

2007-01-09 Thread Chuck Williams (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463322
 ] 

Chuck Williams commented on LUCENE-767:
---

Isn't maxDoc always the same as the docCount of the segment, which is stored?  
I.e., couldn't SegmentReader.maxDoc() be equivalently defined as:

  public int maxDoc() {
    return si.docCount;   // docCount is recorded in the SegmentInfo when the segment is written
  }

Since maxDoc == numDocs == docCount for a newly merged segment, and deleting 
documents via a reader changes neither docCount nor maxDoc, it seems to me these 
values should always be the same.

All Lucene tests pass with this definition.  I have code that relies on this 
equivalence and so would appreciate knowledge of any case where this 
equivalence might not hold.






[jira] Commented: (LUCENE-767) maxDoc should be explicitly stored in the index, not derived from file length

2007-01-09 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463335
 ] 

Michael McCandless commented on LUCENE-767:
---

Ooh that's great!  I think your logic is correct.

But I do see one unit test failing when I make that change locally 
(testIndexAndMerge in src/test/org/apache/lucene/index/TestDoc.java).  
Actually, this unit test only fails with my last commit (yesterday) for 
LUCENE-140, because I made the "docs out of order" check more strict 
(to catch a previously missed boundary case), and this test seems to hit that 
boundary case.

However, that test is buggy because it manually creates SegmentInfos with an 
incorrect docCount.  So I will fix the test, and commit your solution above.  
Thanks!





[jira] Created: (LUCENE-768) Exception in deleteDocument, undeleteAll or setNorm in IndexReader can fail to release write lock on close

2007-01-09 Thread Michael McCandless (JIRA)
Exception in deleteDocument, undeleteAll or setNorm in IndexReader can fail to 
release write lock on close
--

 Key: LUCENE-768
 URL: https://issues.apache.org/jira/browse/LUCENE-768
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.0.0, 1.9, 2.0.1, 2.1
Reporter: Michael McCandless
 Assigned To: Michael McCandless
Priority: Minor
 Fix For: 2.1


I hit this while working on LUCENE-140

We have 3 cases in the IndexReader methods above where we have this pattern:

  if (directoryOwner) acquireWriteLock();
  doSomething();
  hasChanges = true;

The problem is that if you hit an exception in doSomething(), and hasChanges was not 
already true, then hasChanges will not have been set to true, yet the write lock 
is held.  If you then try to close the reader without making any other changes, 
the write lock is not released, because IndexReader.close() (well, 
commit()) only releases the write lock if hasChanges is true.

I think the simple fix is to swap the order of hasChanges = true and 
doSomething().  I already fixed one case of this in yesterday's LUCENE-140 
commit; I will fix the other two under this issue.
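
A sketch of the reordering described above (doSomething() stands in for the
body of deleteDocument, undeleteAll or setNorm; this is the pattern from the
description, not the literal source):

    if (directoryOwner) acquireWriteLock();
    hasChanges = true;   // set before the operation, so close()/commit() will release the lock
    doSomething();       // even if this throws, the write lock is no longer orphaned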




Re: [jira] Created: (LUCENE-767) maxDoc should be explicitly stored in the index, not derived from file length

2007-01-09 Thread Grant Ingersoll

Hi Michael,

Can you explain in more detail on this bug why this makes you nervous?

Thanks,
Grant

On Jan 9, 2007, at 6:41 AM, Michael McCandless (JIRA) wrote:

maxDoc should be explicitly stored in the index, not derived from file length
[...]



--
Grant Ingersoll
Center for Natural Language Processing
http://www.cnlp.org

Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/LuceneFAQ







[jira] Resolved: (LUCENE-768) Exception in deleteDocument, undeleteAll or setNorm in IndexReader can fail to release write lock on close

2007-01-09 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-768.
---

Resolution: Fixed





[jira] Commented: (LUCENE-767) maxDoc should be explicitly stored in the index, not derived from file length

2007-01-09 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463358
 ] 

Michael McCandless commented on LUCENE-767:
---


Carrying over from the java-dev list:


Grant Ingersoll wrote:

> Can you explain in more detail on this bug why this makes you nervous?

Well ... the only specific example I have is NFS (always my favorite
example!).

As I understand it, the NFS client typically uses a separate cache to
hold the "attributes" of the file, including file length.  This cache
often has weaker or maybe just "different" guarantees than the "data
cache" that holds the file contents.  So basically you can ask what
the file length is and get a wrong (stale) answer.  EG see
http://nfs.sourceforge.net, which describes Linux's NFS client
approach.  The NFS client on Apple's OS X seems to be even worse!

I think very likely Lucene may not trip up on this specifically since
a reader would only ask for this file's length for the first time once
the file is done being written (ie the commit of segments_N has
occurred) and so hopefully it's not in the attribute cache yet?

I think there may very well be cases of other filesystems where
"checking file length" is risky (that we all just don't know about
(yet!)), which is why I favor using explicit values instead of relying
on file system semantics, whenever possible.

Maybe I'm just too paranoid :)

But for all the places / devices Lucene has gone and will go, relying
on the bare minimum set of IO operations I think will maximize our
overall portability.  Every filesystem has its quirks.






Re: [jira] Commented: (LUCENE-767) maxDoc should be explicitly stored in the index, not derived from file length

2007-01-09 Thread robert engels
It would appear that NFS version 2 is not suitable for Lucene. NFS
version 3 looks like it should work. See
http://nfs.sourceforge.net/#section_a


I will take this opportunity to state again what I've always been
told, and it seems to hold up: using NFS for shared, interactively
updated files is always going to be troublesome. It has been patched
over the years to help, but it just wasn't designed for this from the
beginning.


Unix systems originally didn't even have file system locks. It was assumed that
shared access to shared data would be accomplished via a shared
server - not by sharing access to the data directly. It is far more
efficient and robust to do things this way.


Modifying a shared Lucene directory via NFS directly is always going
to be error prone.


Why not just implement a server/parallel index solution?






Re: [jira] Commented: (LUCENE-767) maxDoc should be explicitly stored in the index, not derived from file length

2007-01-09 Thread robert engels

I think this is the relevant section:

  A8. What is close-to-open cache consistency?

A. Perfect cache coherency among disparate NFS clients is very  
expensive to achieve, so NFS settles for something weaker that  
satisfies the requirements of most everyday types of file sharing.  
Everyday file sharing is most often completely sequential: first  
client A opens a file, writes something to it, then closes it; then  
client B opens the same file, and reads the changes.


So, when an application opens a file stored in NFS, the NFS  
client checks that it still exists on the server, and is permitted to  
the opener, by sending a GETATTR or ACCESS operation. When the  
application closes the file, the NFS client writes back any pending  
changes to the file so that the next opener can view the changes.  
This also gives the NFS client an opportunity to report any server  
write errors to the application via the return code from close().  
This behavior is referred to as close-to-open cache consistency.


Linux implements close-to-open cache consistency by comparing  
the results of a GETATTR operation done just after the file is closed  
to the results of a GETATTR operation done when the file is next  
opened. If the results are the same, the client will assume its data  
cache is still valid; otherwise, the cache is purged.


Close-to-open cache consistency was introduced to the Linux NFS  
client in 2.4.20. If for some reason you have applications that  
depend on the old behavior, you can disable close-to-open support by  
using the "nocto" mount option.


There are still opportunities for a client's data cache to  
contain stale data. The NFS version 3 protocol introduced "weak cache  
consistency" (also known as WCC) which provides a way of checking a  
file's attributes before and after an operation to allow a client to  
identify changes that could have been made by other clients.  
Unfortunately when a client is using many concurrent operations that  
update the same file at the same time, it is impossible to tell  
whether it was that client's updates or some other client's updates  
that changed the file.


For this reason, some versions of the Linux 2.6 NFS client  
abandon WCC checking entirely, and simply trust their own data cache.  
On these versions, the client can maintain a cache full of stale file  
data if a file is opened for write. In this case, using file locking  
is the best way to ensure that all clients see the latest version of  
a file's data.


A system administrator can try using the "noac" mount option to  
achieve attribute cache coherency among multiple clients. Almost  
every client operation checks file attribute information. Usually the  
client keeps this information cached for a period of time to reduce  
network and server load. When "noac" is in effect, a client's file  
attribute cache is disabled, so each operation that needs to check a  
file's attributes is forced to go back to the server. This permits a  
client to see changes to a file very quickly, at the cost of many  
extra network operations.


Be careful not to confuse "noac" with "no data caching." The  
"noac" mount option will keep file attributes up-to-date with the  
server, but there are still races that may result in data incoherency  
between client and server. If you need absolute cache coherency among  
clients, applications can use file locking, where a client purges  
file data when a file is locked, and flushes changes back to the  
server before unlocking a file; or applications can open their files  
with the O_DIRECT flag to disable data caching entirely.


For a better understanding of the compromises faced in the  
design of NFS caching, see Callaghan's "NFS Illustrated."



Re: [jira] Commented: (LUCENE-140) docs out of order

2007-01-09 Thread Doron Cohen
"Michael McCandless (JIRA)" <[EMAIL PROTECTED]> wrote on 09/01/2007 03:32:27:

> > I think however that the test Mike added does not expose the docs
> > out of order bug - I tried this test without the fix and it only
> > fail on the "gotException assert" - if you comment this assert the
> > test pass.
>
> Huh, I see my test case (in IndexReader) indeed hitting the original
> "docs out of order" exception.  If I take the current trunk and
> comment out the (one line) bounds check in BitVector.set and run that
> test, it hits the "docs out of order" exception.
>
> Are you sure you updated the change (to tighten the check to a <= from
> a <) to index/SegmentMerger.java?  Because, I did indeed find that the
> test failed to fail when I first wrote it (but should have).  So in
> digging why it didn't fail as expected, I found that the check for
> "docs out of order" missed the boundary case of the same doc number
> twice in a row.  Once I fixed that, the test failed as expected.
>

That's a moving target...:-)
You're right, I ran without the SegmentMerger-tightened-check, imitating
current 1.9.1 "field experience".





[jira] Updated: (LUCENE-724) Oracle JVM implementation for Lucene DataStore also a preliminary implementation for an Oracle Domain index using Lucene

2007-01-09 Thread Marcelo F. Ochoa (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo F. Ochoa updated LUCENE-724:


Attachment: ojvm-01-09-07.tar.gz

> Oracle JVM implementation for Lucene DataStore also a preliminary 
> implementation for an Oracle Domain index using Lucene
> 
>
> Key: LUCENE-724
> URL: https://issues.apache.org/jira/browse/LUCENE-724
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Store
>Affects Versions: 2.0.0
> Environment: Oracle 10g R2 with the latest patchset; there is a txt file 
> in the lib directory with the required libraries to compile this extension, 
> which for legal reasons I can't redistribute. All these libraries are included 
> in the Oracle home directory.
>Reporter: Marcelo F. Ochoa
>Priority: Minor
> Attachments: ojvm-01-09-07.tar.gz, ojvm-11-28-06.tar.gz, 
> ojvm-12-20-06.tar.gz, ojvm.tar.gz
>
>
> Here is a preliminary implementation of an Oracle JVM Directory data store 
> which replaces the file system with BLOB data storage.
> The reasons to do this are:
>   - Using a traditional file system for storing the inverted index is not a 
> good option for some users.
>   - Using BLOBs to store the inverted index while running Lucene outside the 
> Oracle database performs badly because there are a lot of network 
> round trips and data marshalling.
>   - Indexing relational data stores such as tables with VARCHAR2, CLOB or 
> XMLType with Lucene running outside the database has the same problem as the 
> previous point.
>   - The JVM included inside the Oracle database can scale up to 10,000+ 
> concurrent threads without memory leaks or deadlocks, and all operations on 
> tables happen in the same memory space!
>   With these points in mind, I uploaded the complete Lucene framework into 
> the Oracle JVM and ran the complete JUnit test suite successfully, except 
> for some tests such as the RMI test, which requires special grants to open 
> ports inside the database.
>   Lucene's test cases run faster inside the Oracle database (11g) than 
> on the Sun JDK 1.5, because the classes are automatically JITed after a few 
> executions.
>   I have implemented an OJVMDirectory Lucene store which replaces the file 
> system storage with BLOB-based storage; compared with a RAMDirectory 
> implementation it is a bit slower, but we get all the benefits of BLOB 
> storage (backup, concurrency control, and so on).
>  The OJVMDirectory is cloned from the source at
> http://issues.apache.org/jira/browse/LUCENE-150 (DBDirectory), but with some 
> changes to run faster inside the Oracle JVM.
>  At this moment I am working on a full integration with the SQL engine using 
> the Data Cartridge API, which means using Lucene as a new Oracle Domain Index.
>  With this extension we can create a Lucene inverted index on a table using:
> create index it1 on t1(f2) indextype is LuceneIndex parameters('test');
>  assuming that table t1 has a column f2 of type VARCHAR2, CLOB or 
> XMLType. After this, a query against the Lucene inverted index can be made 
> using a new Oracle operator:
> select * from t1 where contains(f2, 'Marcelo') = 1;
>  The important point here is that this query is integrated with the execution 
> plan of the Oracle database, so in this simple example the Oracle optimizer 
> sees that the column "f2" is indexed with the Lucene Domain index; then, using 
> the Data Cartridge API, Java code running inside the Oracle JVM is executed 
> to open the search, fetch all the ROWIDs that match "Marcelo", and get 
> the rows using the pointer. Here is the output:
> SELECT STATEMENT  ALL_ROWS  3  1  115
>   TABLE ACCESS (BY INDEX ROWID)  LUCENE.T1  3  1  115
>     DOMAIN INDEX  LUCENE.IT1
>  Another benefit of using the Data Cartridge API is that if table T1 has 
> insert, update or delete row operations, a corresponding Java method will be 
> called to automatically update the Lucene index.
>   There is a simple HTML file with some explanation of the code.
>   The install.sql script is not fully tested and must be launched inside the 
> Oracle database, not remotely.
>   Best regards, Marcelo.
> - For Oracle users the big question is: why use Lucene instead of Oracle 
> Text, which is implemented in C?
>   I think the answer is simple: Lucene is open source and anybody 
> can extend it and add the functionality they need.
> - For Lucene users who try to use Lucene as an enterprise search engine, the 
> Oracle JVM provides a highly scalable container which can scale up to 
> 10,000+ concurrent sessions, with the facility of querying tables in the 
> same memory space

[jira] Commented: (LUCENE-724) Oracle JVM implementation for Lucene DataStore also a preliminary implementation for an Oracle Domain index using Lucene

2007-01-09 Thread Marcelo F. Ochoa (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463377
 ] 

Marcelo F. Ochoa commented on LUCENE-724:
-

The latest code includes:
- The Data Cartridge API is used without column data, to reduce the data stored 
on the queue of changes and speed up the synchronize method.
- Query hits are cached, keyed by the index search and the string returned 
by the QueryParser.toString() method.
- If no ancillary operator is used in the select, the score list is not stored.
- A "Stemmer" argument is recognized as a parameter giving the argument for the 
SnowBall analyzer, for example: create index it1 on t1(f2) indextype is 
lucene.LuceneIndex parameters('Stemmer:English');.
- Before installing the ojvm extension it is necessary to execute "ant jar-core" 
in the snowball directory.
- IndexWriter.setUseCompoundFile(false) is called to use multi-file storage 
(faster than the compound file) because there is no file descriptor limitation 
inside the OJVM; BLOBs are used instead of Files.
- Files are marked for deletion and purged when the Sync or Optimize methods 
are called.
- BLOBs are created and populated in one call using the Oracle SQL RETURNING 
clause.
- A testing script using the OE sample schema, with query comparisons against 
an Oracle Text ctxsys.context index.

TODO:
- An ODCI Stats interface implementation to provide the optimizer with 
information about the cost of using the Domain Index.
- A binding for using the FIRST_ROWS(n) optimizer hint.
- A Digester class for loading the DBLP database, for testing very big indexes.
- Support for columns with XDBUriType values.

> Oracle JVM implementation for Lucene DataStore also a preliminary 
> implementation for an Oracle Domain index using Lucene
> 
>
> Key: LUCENE-724
> URL: https://issues.apache.org/jira/browse/LUCENE-724
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Store
>Affects Versions: 2.0.0
> Environment: Oracle 10g R2 with latest patchset, there is a txt file 
> into the lib directory with the required libraries to compile this extension, 
> which for legal issues I can't redistribute. All these libraries are include 
> into the Oracle home directory,
>Reporter: Marcelo F. Ochoa
>Priority: Minor
> Attachments: ojvm-01-09-07.tar.gz, ojvm-11-28-06.tar.gz, 
> ojvm-12-20-06.tar.gz, ojvm.tar.gz
>
>
> Here a preliminary implementation of the Oracle JVM Directory data store 
> which replace a file system by BLOB data storage.
> The reason to do this is:
>   - Using traditional File System for storing the inverted index is not a 
> good option for some users.
>   - Using BLOBs for storing the inverted index while running Lucene outside 
> the Oracle database gives poor performance because there are a lot of network 
> round trips and data marshalling.
>   - Indexing relational data stores such as tables with VARCHAR2, CLOB or 
> XMLType with Lucene running outside the database has the same problem as the 
> previous point.
>   - The JVM included inside the Oracle database can scale up to 10.000+ 
> concurrent threads without memory leaks or deadlocks, and all operations on 
> tables are in the same memory space!!
>   With these points in mind, I uploaded the complete Lucene framework inside 
> the Oracle JVM and ran the complete JUnit test suite successfully, except 
> for some tests such as the RMI test, which requires special grants to open 
> ports inside the database.
>   Lucene's test cases run faster inside the Oracle database (11g) than on 
> the Sun JDK 1.5, because the classes are automatically JITed after some 
> executions.
>   I have implemented an OJVMDirectory Lucene Store which replaces the file 
> system storage with BLOB-based storage; compared with a RAMDirectory 
> implementation it is a bit slower, but we get all the benefits of BLOB 
> storage (backup, concurrency control, and so on).
>  The OJVMDirectory is cloned from the source at
> http://issues.apache.org/jira/browse/LUCENE-150 (DBDirectory) but with some 
> changes to run faster inside the Oracle JVM.
>  At this moment, I am working on a full integration with the SQL Engine using 
> the Data Cartridge API, which means using Lucene as a new Oracle Domain Index.
>  With this extension we can create a Lucene Inverted index in a table using:
> create index it1 on t1(f2) indextype is LuceneIndex parameters('test');
>  assuming that the table t1 has a column f2 of type VARCHAR2, CLOB or 
> XMLType, after this, the query against the Lucene inverted index can be made 
> using a new Oracle operator:
> select * from t1 where contains(f2, 'Marcelo') = 1;
>  the important point here is that this query is integrated with the ex

Re: Beyond Lucene 2.0 Index Design

2007-01-09 Thread Marvin Humphrey


On Jan 9, 2007, at 6:25 AM, Dalton, Jeffery wrote:


e. 
f. ],...[docN, freq
,])


Does the impact have any use after it's used to sort the postings?   
Can we leave it out of the index format and recalculate at merge-time?


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [jira] Commented: (LUCENE-767) maxDoc should be explicitly stored in the index, not derived from file length

2007-01-09 Thread Michael McCandless

robert engels wrote:
It would appear that NFS Version 2 is not suitable for Lucene. NFS 
Version 3 looks like it should work. See 
http://nfs.sourceforge.net/#section_a


I will take this opportunity to state again what I've always been told, 
and it seems to hold up: using NFS for shared, interactively updated 
files is always going to be troublesome. They have patched it over the 
years to help, but it just wasn't designed for this from the beginning.


Unix systems never even had file system locks. It was assumed that 
shared access to shared data would be accomplished via a shared server - 
not by sharing access to the data directly. It is far more efficient and 
robust to do things this way.


Modifying a shared Lucene directory via NFS directly is always going to 
be error prone.


Why not just implement a server/parallel index solution ?


Actually I think now (with lockless commits) Lucene works fine over
NFS, except for the [yes, rather big] remaining issue: LUCENE-710.

But that issue, while clearly scary when you first see it, can be
easily worked around (just refresh your searchers once they hit "Stale
NFS handle").

Even once we resolve that and Lucene works over NFS, I do think the
performance will typically not be "stellar".  At least in my
experience the performance of NFS is surprisingly poor.  So I do think
for users that require high performance a replicated (like Solr)
and/or distributed index solution is probably the way to go.

Anyway, I didn't mean to turn this back into an NFS discussion.  I
just wanted to use NFS as an example of where relying on file length
for something important (maxDocs() in a segment) is possibly
dangerous.

Mike

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-755) Payloads

2007-01-09 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/LUCENE-755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicolas Lalevée updated LUCENE-755:
---

Attachment: payload.patch

> Payloads
> 
>
> Key: LUCENE-755
> URL: https://issues.apache.org/jira/browse/LUCENE-755
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Reporter: Michael Busch
> Assigned To: Michael Busch
> Attachments: payload.patch, payloads.patch
>
>
> This patch adds the possibility to store arbitrary metadata (payloads) 
> together with each position of a term in its posting lists. A while ago this 
> was discussed on the dev mailing list, where I proposed an initial design. 
> This patch has a much improved design with modifications, that make this new 
> feature easier to use and more efficient.
> A payload is an array of bytes that can be stored inline in the ProxFile 
> (.prx). Therefore this patch provides low-level APIs to simply store and 
> retrieve byte arrays in the posting lists in an efficient way. 
> API and Usage
> --   
> The new class index.Payload is basically just a wrapper around a byte[] array 
> together with int variables for offset and length. So a user does not have to 
> create a byte array for every payload, but can rather allocate one array for 
> all payloads of a document and provide offset and length information. This 
> reduces object allocations on the application side.
> In order to store payloads in the posting lists one has to provide a 
> TokenStream or TokenFilter that produces Tokens with payloads. I added the 
> following two methods to the Token class:
>   /** Sets this Token's payload. */
>   public void setPayload(Payload payload);
>   
>   /** Returns this Token's payload. */
>   public Payload getPayload();
> In order to retrieve the data from the index the interface TermPositions now 
> offers two new methods:
>   /** Returns the payload length of the current term position.
>*  This is invalid until {@link #nextPosition()} is called for
>*  the first time.
>* 
>* @return length of the current payload in number of bytes
>*/
>   int getPayloadLength();
>   
>   /** Returns the payload data of the current term position.
>* This is invalid until {@link #nextPosition()} is called for
>* the first time.
>* This method must not be called more than once after each call
>* of {@link #nextPosition()}. However, payloads are loaded 
> lazily,
>* so if the payload data for the current position is not needed,
>* this method may not be called at all for performance reasons.
>* 
>* @param data the array into which the data of this payload is to be
>* stored, if it is big enough; otherwise, a new byte[] array
>* is allocated for this purpose. 
>* @param offset the offset in the array into which the data of this payload
>*   is to be stored.
>* @return a byte[] array containing the data of this payload
>* @throws IOException
>*/
>   byte[] getPayload(byte[] data, int offset) throws IOException;
> Furthermore, this patch introduces the new method 
> IndexOutput.writeBytes(byte[] b, int offset, int length). So far there was 
> only a writeBytes()-method without an offset argument. 
> Implementation details
> --
> - One field bit in FieldInfos is used to indicate if payloads are enabled for 
> a field. The user does not have to enable payloads for a field, this is done 
> automatically:
>* The DocumentWriter enables payloads for a field, if one or more Tokens 
> carry payloads.
>* The SegmentMerger enables payloads for a field during a merge, if 
> payloads are enabled for that field in one or more segments.
> - Backwards compatible: If payloads are not used, then the formats of the 
> ProxFile and FreqFile don't change
> - Payloads are stored inline in the posting list of a term in the ProxFile. A 
> payload of a term occurrence is stored right after its PositionDelta.
> - Same-length compression: If payloads are enabled for a field, then the 
> PositionDelta is shifted one bit. The lowest bit is used to indicate whether 
> the length of the following payload is stored explicitly. If not, i. e. the 
> bit is false, then the payload has the same length as the payload of the 
> previous term occurrence.
> - In order to support skipping on the ProxFile the length of the payload at 
> every skip point has to be known. Therefore the payload length is also stored 
> in the skip list located in the FreqFile. Here the same-length compression is 
> also used: The lowest bit of DocSkip is used to indicate if the payload 
> length is stored for a SkipDatum or if the length is the same as in the last 
> SkipDatum.
> - Payloads are loaded lazily.
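
To make the usage side of the proposal concrete, here is a minimal sketch of a
payload-producing TokenFilter written against the API described above. The
Payload(byte[], offset, length) constructor is assumed from the description of
the wrapper class; the exact signature lives in the patch, not in this text.

import java.io.IOException;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.index.Payload;

/** Attaches a fixed one-byte payload to every token that passes through. */
public class MarkerPayloadFilter extends TokenFilter {
  private final byte[] payloadBytes = new byte[] { 42 };

  public MarkerPayloadFilter(TokenStream input) {
    super(input);
  }

  public Token next() throws IOException {
    Token token = input.next();
    if (token != null) {
      // Assumed constructor: wrap a shared array with offset and length.
      token.setPayload(new Payload(payloadBytes, 0, payloadBytes.length));
    }
    return token;
  }
}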

[jira] Commented: (LUCENE-755) Payloads

2007-01-09 Thread JIRA

[ 
https://issues.apache.org/jira/browse/LUCENE-755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463414
 ] 

Nicolas Lalevée commented on LUCENE-755:


The patch I have just uploaded (payload.patch) is Michael's one (payloads.patch) 
with the customization of how payloads are written and read, exactly as I did 
for LUCENE-662. An IndexFormat is in fact a factory of PayloadWriter and 
PayloadReader, this index format being stored in the Directory instance.

Note that I haven't changed the javadoc or the comments included in 
Michael's patch; it needs some cleanup if somebody is interested in committing 
it.
And sorry for the name of the patch I have uploaded, it is a little bit 
confusing now, and I can't change its name. I will be more careful next time 
when naming my patch files.
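
Since the patch itself isn't quoted here, the following is only a hypothetical
sketch of the factory shape Nicolas describes; the real interface names and
signatures are in payload.patch and may differ.

import java.io.IOException;
import org.apache.lucene.store.IndexInput;
import org.apache.lucene.store.IndexOutput;

// Hypothetical shapes only, for illustration of the idea.
interface PayloadWriter {
  void writePayload(byte[] data, int offset, int length, IndexOutput out)
      throws IOException;
}

interface PayloadReader {
  byte[] readPayload(IndexInput in, byte[] reuse, int offset) throws IOException;
}

interface IndexFormat {
  PayloadWriter newPayloadWriter();
  PayloadReader newPayloadReader();
}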

> Payloads
> 
>
> Key: LUCENE-755
> URL: https://issues.apache.org/jira/browse/LUCENE-755
> Project: Lucene - Java
>  Issue Type: New Feature
>  Components: Index
>Reporter: Michael Busch
> Assigned To: Michael Busch
> Attachments: payload.patch, payloads.patch
>
>
> This patch adds the possibility to store arbitrary metadata (payloads) 
> together with each position of a term in its posting lists. A while ago this 
> was discussed on the dev mailing list, where I proposed an initial design. 
> This patch has a much improved design with modifications, that make this new 
> feature easier to use and more efficient.
> A payload is an array of bytes that can be stored inline in the ProxFile 
> (.prx). Therefore this patch provides low-level APIs to simply store and 
> retrieve byte arrays in the posting lists in an efficient way. 
> API and Usage
> --   
> The new class index.Payload is basically just a wrapper around a byte[] array 
> together with int variables for offset and length. So a user does not have to 
> create a byte array for every payload, but can rather allocate one array for 
> all payloads of a document and provide offset and length information. This 
> reduces object allocations on the application side.
> In order to store payloads in the posting lists one has to provide a 
> TokenStream or TokenFilter that produces Tokens with payloads. I added the 
> following two methods to the Token class:
>   /** Sets this Token's payload. */
>   public void setPayload(Payload payload);
>   
>   /** Returns this Token's payload. */
>   public Payload getPayload();
> In order to retrieve the data from the index the interface TermPositions now 
> offers two new methods:
>   /** Returns the payload length of the current term position.
>*  This is invalid until {@link #nextPosition()} is called for
>*  the first time.
>* 
>* @return length of the current payload in number of bytes
>*/
>   int getPayloadLength();
>   
>   /** Returns the payload data of the current term position.
>* This is invalid until {@link #nextPosition()} is called for
>* the first time.
>* This method must not be called more than once after each call
>* of {@link #nextPosition()}. However, payloads are loaded 
> lazily,
>* so if the payload data for the current position is not needed,
>* this method may not be called at all for performance reasons.
>* 
>* @param data the array into which the data of this payload is to be
>* stored, if it is big enough; otherwise, a new byte[] array
>* is allocated for this purpose. 
>* @param offset the offset in the array into which the data of this payload
>*   is to be stored.
>* @return a byte[] array containing the data of this payload
>* @throws IOException
>*/
>   byte[] getPayload(byte[] data, int offset) throws IOException;
> Furthermore, this patch introduces the new method 
> IndexOutput.writeBytes(byte[] b, int offset, int length). So far there was 
> only a writeBytes()-method without an offset argument. 
> Implementation details
> --
> - One field bit in FieldInfos is used to indicate if payloads are enabled for 
> a field. The user does not have to enable payloads for a field, this is done 
> automatically:
>* The DocumentWriter enables payloads for a field, if one or more Tokens 
> carry payloads.
>* The SegmentMerger enables payloads for a field during a merge, if 
> payloads are enabled for that field in one or more segments.
> - Backwards compatible: If payloads are not used, then the formats of the 
> ProxFile and FreqFile don't change
> - Payloads are stored inline in the posting list of a term in the ProxFile. A 
> payload of a term occurrence is stored right after its PositionDelta.
> - Same-length compression: If payloads are enabled for a field, then the 
> PositionDelta is shifted one bit. The lowest bit is use

[jira] Created: (LUCENE-769) [PATCH] Performance improvement for some cases of sorted search

2007-01-09 Thread Artem Vasiliev (JIRA)
[PATCH] Performance improvement for some cases of sorted search
---

 Key: LUCENE-769
 URL: https://issues.apache.org/jira/browse/LUCENE-769
 Project: Lucene - Java
  Issue Type: Improvement
Affects Versions: 2.0.0
Reporter: Artem Vasiliev


It's a small addition to Lucene that significantly lowers memory consumption 
and improves performance for the scenario of sorted searches with frequent 
index updates and relatively big indexes (>1 mln docs). This solution 
currently supports only single-field sorting (which seems to be quite a 
popular use case). Multiple-field support can be added without much trouble.

The solution is this: documents from the sorting set (instead of the given 
field's values from the whole index - the current FieldCache approach) are 
cached in a WeakHashMap, so the cached items are candidates for GC.  Their 
field values are then fetched from the cache and compared while sorting.
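
The attached patch is not reproduced here, so the following is only a sketch of
the described idea (cache values weakly, fetch lazily per document) using the
stock SortComparatorSource extension point; the real patch may be structured
quite differently.

import java.io.IOException;
import java.util.WeakHashMap;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.ScoreDocComparator;
import org.apache.lucene.search.SortComparatorSource;
import org.apache.lucene.search.SortField;

/** Sorts by a stored field, fetching documents lazily and caching them weakly. */
public class DocCachingComparatorSource implements SortComparatorSource {
  public ScoreDocComparator newComparator(final IndexReader reader,
                                          final String field) throws IOException {
    // doc id -> stored field value; entries are candidates for GC.
    final WeakHashMap cache = new WeakHashMap();
    return new ScoreDocComparator() {
      private String value(int doc) {
        Integer key = new Integer(doc);
        String v = (String) cache.get(key);
        if (v == null) {
          try {
            Document d = reader.document(doc);
            v = d.get(field);
          } catch (IOException e) {
            throw new RuntimeException(e);
          }
          cache.put(key, v);
        }
        return v;
      }
      public int compare(ScoreDoc i, ScoreDoc j) {
        return value(i.doc).compareTo(value(j.doc));
      }
      public Comparable sortValue(ScoreDoc i) {
        return value(i.doc);
      }
      public int sortType() {
        return SortField.STRING;
      }
    };
  }
}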

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Beyond Lucene 2.0 Index Design

2007-01-09 Thread Doron Cohen
Scoring today goes doc-at-a-time - all scorers and term-posting-readers
advance together; once a new doc is processed, scoring of previous docs is
known and final. This allows maintaining a finite size queue for collecting
best hits. Then, for huge collections, having to exhaustively scan all
posting lists becomes a scalability issue.

If I understand the proposal here correctly, it is suggested to order
posting lists by possible score contribution (frequency/impact) - in
particular, two different terms A and B both appearing in two docs X and Y
may have the docs in a different order - X Y for A and Y X for B. This
flexibility/granularity, while enabling early-termination and pruning,
would require a term-at-a-time scoring algorithm, which would be quite a
change in Lucene scoring.

The work cited seems less flexible - only a single value is maintained per
document - e.g. Page Rank - and then document IDs are shuffled so that in
the resulting index the posting lists are still ordered - as today - by
docids, but satisfy: rank(i) >= rank(i+1).

If the latter (single-value-per-doc) approach is taken, one still needs to
decide when to sort - externally, when exporting the (large) index for
search, or internally, as part of merge. If done as part of merge, each
(result) segment is always ordered (by rank); merging would become more
tricky, since you don't want to exhaust memory while merging, but I think
it is doable.
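
To make the term-at-a-time idea concrete, here is a toy sketch, independent of
Lucene's real classes: each posting list is assumed to be pre-sorted by
descending impact, so scoring can stop reading a list once contributions become
negligible.

import java.util.HashMap;
import java.util.Map;

public class TermAtATimeSketch {

  static class Posting {
    final int doc;
    final float impact;   // precomputed score contribution for this term/doc
    Posting(int doc, float impact) { this.doc = doc; this.impact = impact; }
  }

  /** Accumulates scores term by term, skipping the low-impact tail of each list. */
  static Map score(Posting[][] postingsPerTerm, float minImpact) {
    Map accumulators = new HashMap();          // doc id -> partial score
    for (int t = 0; t < postingsPerTerm.length; t++) {
      Posting[] list = postingsPerTerm[t];
      for (int i = 0; i < list.length; i++) {
        if (list[i].impact < minImpact) {
          break;                               // early termination for this term
        }
        Integer key = new Integer(list[i].doc);
        Float old = (Float) accumulators.get(key);
        float base = (old == null) ? 0f : old.floatValue();
        accumulators.put(key, new Float(base + list[i].impact));
      }
    }
    return accumulators;
  }
}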

"Dalton, Jeffery" <[EMAIL PROTECTED]> wrote on 09/01/2007 06:25:33:

> Hi,
>
> I wanted to start some discussion about possible future Lucene file /
> index formats.  This is an extension to the discussion on Flexible
> Lucene Indexing discussed on the wiki:
> http://wiki.apache.org/jakarta-lucene/FlexibleIndexing
>
> Note: Related sources are listed at the end.
>
> I would like to have the ability to create term frequency [Persin, et
> al. 1996] or "impact" sorted [Anh, Moffat 2001,2006] posting lists (freq
> data) . A posting list sorted by Term frequency rather than document id
> is straightforward (posting design below).  An Impact sorted list is
> relatively new (and perhaps unfamiliar).  An Impact is a single integer
> value for a term in a document that is stored in the posting list and is
> computed from the combination of the term frequency, document boost,
> field boost, length norms, and other arbitrary scoring features (word
> position, font, etc...) -- all local information.
>
> The driving motivation for this change is to avoid reading the entire
> posting list from disk for very long posting lists (it also leads to
> simplified query-time scoring because the tf, norms, and boosts are
> built into the impact).  This would address scalability issues with
> large collections that have been seen in the past; back in December 2005
> there were two threads: "Lucene Performance Bottlenecks" (Lucene User)
> and "IndexOptimizer Re: Lucene Performance Bottlenecks" (Nutch Dev)
> where Doug and Andrzej addressed some speed concerns by sorting the
> Nutch index based on Document Boost (IndexSorter and a TopDocsCollector)
> [inspired by Long, Suel]. The goal is that an impact sorted posting list
> would address these and other concerns in a generic manner.
>
> Allowing a frequency or impact sorted posting list format would lead to
> a posting list with the following structure:
> (Frequency or impact could be used interchangeably in the structure.
> Lettering continued from Wiki)
>
> e. 
> f. ],...[docN, freq
> ,])
>
> The positions are delta encoded for compression.  Similarly, the
> document numbers for a given frequency/impact are delta encoded.  If you
> read Moffat and Persin, the papers show that this achieves compression
> comparable to, or even better than, a standard delta encoded docId
> sorted index.  The structure lends itself well to early termination,
> pruning, etc... where the entire posting list is not read from disk.
>
> This type of Impact sorted structure (or similar concept) seems to be
> becoming a "standard" solution described in a lot of new research / text
> books on IR for large scale indexes.  It would be great if Lucene
> supported something like this someday ;-).
>
> Thanks,
>
> Jeff Dalton
>
> References:
> Anh, Moffat. Vector-Space Ranking with Effective Early Termination.
> 2001.
> Anh, Moffat. Impact Transformation: Effective and Efficient Web
> Retrieval. 2006.
> Anh, Moffat. Pruned Query Evaluation Using Pre-Computed Impacts. 2006.
> Long, Suel. Optimized Query Execution in Large Search Engine with Global
> Page Ordering.
> Manning, Raghavan, Schutze. Introduction to Information Retrieval,
> Chapters 2,7.
> http://www-csli.stanford.edu/%7Eschuetze/information-retrieval-book.html
>
> Persin, et al. Filtered Document Retrieval with Frequency-Sorted
> Indexes. 1996.
>
>
>
>
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>



[jira] Updated: (LUCENE-769) [PATCH] Performance improvement for some cases of sorted search

2007-01-09 Thread Artem Vasiliev (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Vasiliev updated LUCENE-769:
--

Attachment: DocCachingSorting.patch

> [PATCH] Performance improvement for some cases of sorted search
> ---
>
> Key: LUCENE-769
> URL: https://issues.apache.org/jira/browse/LUCENE-769
> Project: Lucene - Java
>  Issue Type: Improvement
>Affects Versions: 2.0.0
>Reporter: Artem Vasiliev
> Attachments: DocCachingSorting.patch
>
>
> It's a small addition to Lucene that significantly lowers memory consumption 
> and improves performance for the scenario of sorted searches with frequent 
> index updates and relatively big indexes (>1 mln docs). This solution 
> currently supports only single-field sorting (which seems to be quite a 
> popular use case). Multiple-field support can be added without much trouble.
> The solution is this: documents from the sorting set (instead of the given 
> field's values from the whole index - the current FieldCache approach) are 
> cached in a WeakHashMap, so the cached items are candidates for GC.  Their 
> field values are then fetched from the cache and compared while sorting.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Beyond Lucene 2.0 Index Design

2007-01-09 Thread Dalton, Jeffery
Doron -- you have the idea.  

And yes, it would be a substantial change to Lucene scoring.  Ideally,
Lucene / doc format would be changed in such a way to support both docId
sorted indexes (and doc-at-a-time processing) and frequency/impact
sorted indexes with term-at-a-time or even score-at-a-time processing.
I think we could do a lot to generalize Lucene to support both modes (at
least at index level).

You do a good job describing doc and term at a time processing.
Score-at-a-time processing is where the posting list with the highest
IDF x current Impact score is read -- the next chunk of index that can
contribute the most to overall document score.  

Doron, the less flexible work I assume you mention is Suel -- Global
Page Ordering.  Indeed, I agree, this is inflexible, requires all
documents to be renumbered, and only sorts by Page Rank.  It's too high
level (only applies to web docs) for Lucene to support.  Anh/Moffat's
Impact sorting is quite flexible and generally has wide application
because it provides one score per term in a document - any local
document information can be factored into the score at index creation
time (such as multiple field or positional boosts).

As you say Doron, I think the best approach is for the sort to be
maintained at segment merge time: the ImpactSortedSegmentMerger (or a
similarly named class) would merge-sort the postings by score as it
writes them to disk. It's definitely doable -- in fact I have created
classes to do it, but doing so required significant changes to other
parts of Lucene to calculate the score and get it down to the
DocumentWriter / SegmentMerger level. 
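
As a toy illustration of the merge step such a merger would perform (not actual
Lucene code, and ignoring doc id remapping and on-disk encoding), a k-way merge
of postings already sorted by descending impact could look like this:

import java.util.ArrayList;
import java.util.List;

public class ImpactMergeSketch {

  static class Posting {
    final int doc;
    final float impact;
    Posting(int doc, float impact) { this.doc = doc; this.impact = impact; }
  }

  /** Merges per-segment lists, each pre-sorted by descending impact. */
  static List merge(Posting[][] segments) {
    int[] cursors = new int[segments.length];
    List out = new ArrayList();
    while (true) {
      int best = -1;
      for (int s = 0; s < segments.length; s++) {
        if (cursors[s] < segments[s].length
            && (best == -1
                || segments[s][cursors[s]].impact
                   > segments[best][cursors[best]].impact)) {
          best = s;
        }
      }
      if (best == -1) {
        break;                                  // all segments exhausted
      }
      out.add(segments[best][cursors[best]++]); // doc ids would be remapped here
    }
    return out;
  }
}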

On the search side there needs to be ScoreSortedTermDocs and
ScoreSortedTermPositions objects to read the new format, along with
Scorers to perform scoring and intersection.  

The end product would be a very scalable and flexible solution.  

- Jeff

> -Original Message-
> From: Doron Cohen [mailto:[EMAIL PROTECTED] 
> Sent: Tuesday, January 09, 2007 5:27 PM
> To: java-dev@lucene.apache.org
> Subject: Re: Beyond Lucene 2.0 Index Design
> 
> Scoring today goes doc-at-a-time - all scorers and 
> term-posting-readers advance together; once a new doc is 
> processed, scoring of previous docs is known and final. This 
> allows maintaining a finite size queue for collecting best 
> hits. Then, for huge collections, having to exhaustively scan 
> all posting lists becomes a scalability issue.
> 
> If I understand the proposal here correctly, it is suggested 
> to order posting lists by possible score contribution 
> (frequency/impact) - in particular, two different terms A and 
> B both appearing in two docs X and Y may have the docs in a 
> different order - X Y for A and Y X for B. This 
> flexibility/granularity, while enabling early-termination 
> and pruning, would require a term-at-a-time scoring 
> algorithm, which would be quite a change in Lucene scoring.
> 
> The work cited seems less flexible - only a single value is 
> maintained per document - e.g. Page Rank - and then documents 
> IDs are shuffled so that in the resulted index the posting 
> lists are still ordered - as today - by docids, but satisfy: 
> rank(i) >= rank(i+1).
> 
> If the latter (single-value-per-doc) approach is taken, one still 
> needs to decide when to sort - externally, when exporting the 
> (large) index for search, or internally, as part of merge. If 
> done as part of merge, each
> (result) segment is always ordered (by rank); merging would 
> become more tricky, since you don't want to exhaust memory 
> while merging, but I think it is doable.
> 
> "Dalton, Jeffery" <[EMAIL PROTECTED]> wrote on 
> 09/01/2007 06:25:33:
> 
> > Hi,
> >
> > I wanted to start some discussion about possible future 
> Lucene file / 
> > index formats.  This is an extension to the discussion on Flexible 
> > Lucene Indexing discussed on the wiki:
> > http://wiki.apache.org/jakarta-lucene/FlexibleIndexing
> >
> > Note: Related sources are listed at the end.
> >
> > I would like to have the ability to create term frequency 
> [Persin, et 
> > al. 1996] or "impact" sorted [Anh, Moffat 2001,2006] posting lists 
> > (freq
> > data) . A posting list sorted by Term frequency rather than 
> document 
> > id is straightforward (posting design below).  An Impact 
> sorted list 
> > is relatively new (and perhaps unfamiliar).  An Impact is a single 
> > integer value for a term in a document that is stored in 
> the posting 
> > list and is computed from the combination of the term frequency, 
> > document boost, field boost, length norms, and other 
> arbitrary scoring 
> > features (word position, font, etc...) -- all local information.
> >
> > The driving motivation for this change is to avoid reading 
> the entire 
> > posting list from disk for very long posting lists (it also 
> leads to 
> > simplified query-time scoring because the tf, norms, and boosts are 
> > built into the impact).  This would address scalability issues with 
> >

RE: Beyond Lucene 2.0 Index Design

2007-01-09 Thread Dalton, Jeffery
I'm not sure we fully understand one another, but I'll try to explain
what I am thinking.

Yes, it has use after sorting.  It is used at query time for document
scoring in place of the TF and length norm components  (new scorers
would need to be created).  

Using an impact based index moves most of the scoring from query time to
index time (trades query time flexibility for greatly improved query
search performance).  Because the field boosts, length norm, position
boosts, etc... are incorporated into a single document-term-score, you
can use a single field at search time.  It allows one posting list per
query term instead of the current one posting list per field per query
term (MultiFieldQueryParser wouldn't be necessary in most cases).  In
addition to having fewer posting lists to examine, you often don't need
to read to the end of long posting lists when processing with a
score-at-a-time approach (see Anh/Moffat's Pruned Query Evaluation Using
Pre-Computed Impacts, SIGIR 2006, for details on one potential
algorithm).  

I'm not quite sure what you mean when you mention leaving them out and
re-calculating them at merge time.  

- Jeff

> -Original Message-
> From: Marvin Humphrey [mailto:[EMAIL PROTECTED] 
> Sent: Tuesday, January 09, 2007 2:58 PM
> To: java-dev@lucene.apache.org
> Subject: Re: Beyond Lucene 2.0 Index Design
> 
> 
> On Jan 9, 2007, at 6:25 AM, Dalton, Jeffery wrote:
> 
> > e. 
> > f. ],...[docN, freq
> > ,])
> 
> Does the impact have any use after it's used to sort the postings?   
> Can we leave it out of the index format and recalculate at merge-time?
> 
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/
> 
> 
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-140) docs out of order

2007-01-09 Thread Jed Wesley-Smith (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jed Wesley-Smith updated LUCENE-140:


Attachment: indexing-failure.log

> docs out of order
> -
>
> Key: LUCENE-140
> URL: https://issues.apache.org/jira/browse/LUCENE-140
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: unspecified
> Environment: Operating System: Linux
> Platform: PC
>Reporter: legez
> Assigned To: Michael McCandless
> Attachments: bug23650.txt, corrupted.part1.rar, corrupted.part2.rar, 
> indexing-failure.log, LUCENE-140-2007-01-09-instrumentation.patch
>
>
> Hello,
>   I can not find out, why (and what) it is happening all the time. I got an
> exception:
> java.lang.IllegalStateException: docs out of order
> at
> org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:219)
> at
> org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:191)
> at
> org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:172)
> at 
> org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:135)
> at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:88)
> at 
> org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:341)
> at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:250)
> at Optimize.main(Optimize.java:29)
> It happens either in 1.2 and 1.3rc1 (anyway what happened to it? I can not 
> find
> it neither in download nor in version list in this form). Everything seems 
> OK. I
> can search through index, but I can not optimize it. Even worse after this
> exception every time I add new documents and close IndexWriter new segments is
> created! I think it has all documents added before, because of its size.
> My index is quite big: 500.000 docs, about 5gb of index directory.
> It is _repeatable_. I drop index, reindex everything. Afterwards I add a few
> docs, try to optimize and receive above exception.
> My documents' structure is:
>   static Document indexIt(String id_strony, Reader reader, String 
> data_wydania,
> String id_wydania, String id_gazety, String data_wstawienia)
> {
> Document doc = new Document();
> doc.add(Field.Keyword("id", id_strony ));
> doc.add(Field.Keyword("data_wydania", data_wydania));
> doc.add(Field.Keyword("id_wydania", id_wydania));
> doc.add(Field.Text("id_gazety", id_gazety));
> doc.add(Field.Keyword("data_wstawienia", data_wstawienia));
> doc.add(Field.Text("tresc", reader));
> return doc;
> }
> Sincerely,
> legez

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-140) docs out of order

2007-01-09 Thread Jed Wesley-Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463440
 ] 

Jed Wesley-Smith commented on LUCENE-140:
-

Hi Michael,

Thanks for the patch, applied and recreated. Attached is the log.

To be explicit, we are recreating the index via the IndexWriter ctor with the 
create flag set and then completely rebuilding the index. We are not completely 
deleting the entire directory. There ARE old index files (_*.cfs & _*.del) in 
the directory with updated timestamps that are months old. If I completely 
recreate the directory the problem does go away. This is a fairly trivial 
"fix", but we are still investigating as we want to know if this is indeed the 
problem, how we have come to make it prevalent, and what the root cause is.

Thanks for all the help everyone.

> docs out of order
> -
>
> Key: LUCENE-140
> URL: https://issues.apache.org/jira/browse/LUCENE-140
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: unspecified
> Environment: Operating System: Linux
> Platform: PC
>Reporter: legez
> Assigned To: Michael McCandless
> Attachments: bug23650.txt, corrupted.part1.rar, corrupted.part2.rar, 
> indexing-failure.log, LUCENE-140-2007-01-09-instrumentation.patch
>
>
> Hello,
>   I can not find out, why (and what) it is happening all the time. I got an
> exception:
> java.lang.IllegalStateException: docs out of order
> at
> org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:219)
> at
> org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:191)
> at
> org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:172)
> at 
> org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:135)
> at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:88)
> at 
> org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:341)
> at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:250)
> at Optimize.main(Optimize.java:29)
> It happens either in 1.2 and 1.3rc1 (anyway what happened to it? I can not 
> find
> it neither in download nor in version list in this form). Everything seems 
> OK. I
> can search through index, but I can not optimize it. Even worse after this
> exception every time I add new documents and close IndexWriter new segments is
> created! I think it has all documents added before, because of its size.
> My index is quite big: 500.000 docs, about 5gb of index directory.
> It is _repeatable_. I drop index, reindex everything. Afterwards I add a few
> docs, try to optimize and receive above exception.
> My documents' structure is:
>   static Document indexIt(String id_strony, Reader reader, String 
> data_wydania,
> String id_wydania, String id_gazety, String data_wstawienia)
> {
> Document doc = new Document();
> doc.add(Field.Keyword("id", id_strony ));
> doc.add(Field.Keyword("data_wydania", data_wydania));
> doc.add(Field.Keyword("id_wydania", id_wydania));
> doc.add(Field.Text("id_gazety", id_gazety));
> doc.add(Field.Keyword("data_wstawienia", data_wstawienia));
> doc.add(Field.Text("tresc", reader));
> return doc;
> }
> Sincerely,
> legez

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: .sN (separate norms files) and NO_NORMS

2007-01-09 Thread Grant Ingersoll
I synched the XML with the HTML and committed; changes should show up  
tonight.  The regeneration process is now (almost) the same as Solr's  
and is covered at 
http://wiki.apache.org/jakarta-lucene/HowToUpdateTheWebsite


-Grant

On 1/9/07, Doron Cohen <[EMAIL PROTECTED]> wrote:
Otis Gospodnetic <[EMAIL PROTECTED]> wrote on 08/01/2007  
23:36:46:

> Also, looking at http://lucene.apache.org/java/docs/fileformats.html
> I don't even see any mention of .sN files.

I am almost sure I added that info to fileformats (issue 756).
I'll check what's been with that.


It may be in the xml, but I didn't regenerate or sync the site.

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search  
server


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Job Opportunity (Sunnyvale, CA)

2007-01-09 Thread J. Delgado

(Sorry for the cross-posting)

This is a full-time position with an exciting New Venture (now in
stealth mode) and will be based out of Sunnyvale, CA.

We are looking for a Java Developer with search, social networks, and/or
payment-processing-related experience.

Required Skills:

2+ yrs of industry experience with search technologies/engines like
Lucene/Nutch/Solr, Oracle, Fast, Endeca, etc., as well
as XML and relational database technologies, and/or development of
transactional payment systems (e.g. PayPal).

- Experience with classification, attribute matching and/or
collaborative filtering
- Some exposure to P2P technologies (transactions, communication and
social networks) is highly desirable.
- Understanding of ontologies/taxonomies, keyword libraries, and other
databases to assist search query interpretation and formulation.
- Prefer MS or Computer Science graduate with specialization in
Information Retrieval or Data Mining.
- Willing to train a junior candidate.
- Must be *hands-on*.
- Ability to work quickly and accurately in a high-volume work environment.
- Excellent analytical skills.
- Creativity, intelligence, and integrity.
- Strong work ethic and a high level of professionalism.
- Hands-on design and development skills in Java and J2EE technologies
- Experience in development of large scale Web Portals is a plus.

If interested, please send the resume with contact info and salary
expectations at the earliest.

Less experienced AJAX/Web 2.0 Java Developers are also welcomed to
submit their resume.

Joaquin Delgado, PhD.
[EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-140) docs out of order

2007-01-09 Thread Jed Wesley-Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463470
 ] 

Jed Wesley-Smith commented on LUCENE-140:
-

BTW. We have looked at all the open files referenced by the VM when the 
indexing errors occur, and there does not seem to be any reference to the old 
index segment files, so I am not sure how those files are influencing this 
problem.

> docs out of order
> -
>
> Key: LUCENE-140
> URL: https://issues.apache.org/jira/browse/LUCENE-140
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: unspecified
> Environment: Operating System: Linux
> Platform: PC
>Reporter: legez
> Assigned To: Michael McCandless
> Attachments: bug23650.txt, corrupted.part1.rar, corrupted.part2.rar, 
> indexing-failure.log, LUCENE-140-2007-01-09-instrumentation.patch
>
>
> Hello,
>   I can not find out, why (and what) it is happening all the time. I got an
> exception:
> java.lang.IllegalStateException: docs out of order
> at
> org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:219)
> at
> org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:191)
> at
> org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:172)
> at 
> org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:135)
> at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:88)
> at 
> org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:341)
> at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:250)
> at Optimize.main(Optimize.java:29)
> It happens either in 1.2 and 1.3rc1 (anyway what happened to it? I can not 
> find
> it neither in download nor in version list in this form). Everything seems 
> OK. I
> can search through index, but I can not optimize it. Even worse after this
> exception every time I add new documents and close IndexWriter new segments is
> created! I think it has all documents added before, because of its size.
> My index is quite big: 500.000 docs, about 5gb of index directory.
> It is _repeatable_. I drop index, reindex everything. Afterwards I add a few
> docs, try to optimize and receive above exception.
> My documents' structure is:
>   static Document indexIt(String id_strony, Reader reader, String 
> data_wydania,
> String id_wydania, String id_gazety, String data_wstawienia)
> {
> Document doc = new Document();
> doc.add(Field.Keyword("id", id_strony ));
> doc.add(Field.Keyword("data_wydania", data_wydania));
> doc.add(Field.Keyword("id_wydania", id_wydania));
> doc.add(Field.Text("id_gazety", id_gazety));
> doc.add(Field.Keyword("data_wstawienia", data_wstawienia));
> doc.add(Field.Text("tresc", reader));
> return doc;
> }
> Sincerely,
> legez

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-542) QueryParser doesn't support keywords staring with *

2007-01-09 Thread jianwu chen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463471
 ] 

jianwu chen commented on LUCENE-542:


Hi Erik,
I can't find more information on this issue. Could you provide more information 
or some links about it? I badly need this feature in my project, but right now 
I can't find a solution short of writing another QueryParser from scratch.
Thanks
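
One workaround that does not require a new parser, sketched under the
assumption that a single leading-wildcard term is enough: build the
WildcardQuery directly and skip QueryParser for that clause. Note that a
leading wildcard enumerates every term in the field, so it can be slow and may
hit BooleanQuery's clause limit on large indexes.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.WildcardQuery;

public class LeadingWildcardSketch {
  public static Query leadingWildcard(String field, String pattern) {
    // QueryParser rejects "*test", but WildcardQuery itself accepts it.
    return new WildcardQuery(new Term(field, pattern));
  }

  public static void main(String[] args) {
    BooleanQuery q = new BooleanQuery();
    q.add(leadingWildcard("contents", "*test"), BooleanClause.Occur.MUST);
    System.out.println(q);
  }
}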

> QueryParser doesn't support keywords staring with *
> ---
>
> Key: LUCENE-542
> URL: https://issues.apache.org/jira/browse/LUCENE-542
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: QueryParser
>Affects Versions: 1.9
> Environment: Windows Server 2003, Linux ES 3.0
>Reporter: Colin Yu
>
> It seems that the QueryParser can't handle keywords starting with "*", 
> such as *test. It throws a ParseException. But this syntax is a valid one; 
> even "dir" or "ls" supports it. 

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-140) docs out of order

2007-01-09 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463483
 ] 

Doron Cohen commented on LUCENE-140:


Jed, is it possible that when re-creating the index, while IndexWriter is 
constructed with create=true, FSDirectory is opened with create=false?
I suspect so, because otherwise, old  .del files would have been deleted. 
If indeed so, newly created segments, which have same names as segments in 
previous (bad) runs, when opened, would read the (bad) old .del file. 
This would possibly expose the bug fixed by Michael. 
I may be over speculating here, but if this is the case, it can also explain 
why changing the merge factor from 4 to 10 exposed the problem. 

In fact, let me speculate even further - if indeed when creating the index from 
scratch, the FSDirectory is (mistakenly) opened with create=false, as long as 
you always repeated the same sequencing of adding and deleting docs, you were 
likely to almost not suffer from this mistake, because segments created with 
same names as (old) .del files simply see docs as deleted before the docs are 
actually deleted by the program. The search behaves wrongly, not finding these 
docs before they are actually deleted, but no exception is thrown when adding 
docs. However once the merge factor was changed from 4 to 10, the matching 
between old .del files and new segments (with same names) was broken, and the 
out-of-order exception appeared. 

...and if this is not the case, we would need to look for something else...
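
A minimal sketch of the difference Doron describes, assuming the index is being
rebuilt from scratch (the path and analyzer are placeholders):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class RecreateIndexSketch {
  public static void main(String[] args) throws Exception {
    String path = "/path/to/index";

    // Suspected problem pattern: only the writer is asked to create a fresh
    // index, so stale _*.cfs / _*.del files survive in the directory:
    //   Directory stale = FSDirectory.getDirectory(path, false);
    //   IndexWriter w = new IndexWriter(stale, new StandardAnalyzer(), true);

    // Safer rebuild-from-scratch pattern: let FSDirectory erase the old files
    // too, so new segments can never pick up a leftover .del file by name.
    Directory dir = FSDirectory.getDirectory(path, true);
    IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
    // ... re-add all documents here ...
    writer.close();
  }
}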

> docs out of order
> -
>
> Key: LUCENE-140
> URL: https://issues.apache.org/jira/browse/LUCENE-140
> Project: Lucene - Java
>  Issue Type: Bug
>  Components: Index
>Affects Versions: unspecified
> Environment: Operating System: Linux
> Platform: PC
>Reporter: legez
> Assigned To: Michael McCandless
> Attachments: bug23650.txt, corrupted.part1.rar, corrupted.part2.rar, 
> indexing-failure.log, LUCENE-140-2007-01-09-instrumentation.patch
>
>
> Hello,
>   I can not find out, why (and what) it is happening all the time. I got an
> exception:
> java.lang.IllegalStateException: docs out of order
> at
> org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:219)
> at
> org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:191)
> at
> org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:172)
> at 
> org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:135)
> at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:88)
> at 
> org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:341)
> at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:250)
> at Optimize.main(Optimize.java:29)
> It happens either in 1.2 and 1.3rc1 (anyway what happened to it? I can not 
> find
> it neither in download nor in version list in this form). Everything seems 
> OK. I
> can search through index, but I can not optimize it. Even worse after this
> exception every time I add new documents and close IndexWriter new segments is
> created! I think it has all documents added before, because of its size.
> My index is quite big: 500.000 docs, about 5gb of index directory.
> It is _repeatable_. I drop index, reindex everything. Afterwards I add a few
> docs, try to optimize and receive above exception.
> My documents' structure is:
>   static Document indexIt(String id_strony, Reader reader, String 
> data_wydania,
> String id_wydania, String id_gazety, String data_wstawienia)
> {
> Document doc = new Document();
> doc.add(Field.Keyword("id", id_strony ));
> doc.add(Field.Keyword("data_wydania", data_wydania));
> doc.add(Field.Keyword("id_wydania", id_wydania));
> doc.add(Field.Text("id_gazety", id_gazety));
> doc.add(Field.Keyword("data_wstawienia", data_wstawienia));
> doc.add(Field.Text("tresc", reader));
> return doc;
> }
> Sincerely,
> legez

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]