Re: [jira] Created: (LUCENE-1257) Port to Java5

2008-04-09 Thread Toke Eskildsen
On Tue, 2008-04-08 at 18:48 -0500, robert engels wrote:
 That is opposite of my testing:...
 
 The 'foreach' is consistently faster. The time difference is  
 independent of the size of the array. From what I know about JVM  
 implementations, the foreach version SHOULD always be faster,  
 because no bounds checking needs to be done on the element access...

That's interesting. Even if it doesn't show up in a performance test
right now, it might in later Java versions.

As for your test code, it does not measure performance fairly, because
the foreach loop runs after the old-style loop. I'm sure you'll see
different results if you swap the order of the two tests.
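
To make the point concrete, here is a minimal sketch of such a swapped-order
test (illustrative only; the class name, array size and repetition count are
invented). It times each loop style in its own pass and lets a flag decide
which style runs first, so you can see the warm-up penalty move around:

    public class LoopOrder {
        public static void main(String[] args) {
            int[] a = new int[100000];
            for (int i = 0; i < a.length; i++) a[i] = i;
            int reps = 10000;
            boolean foreachFirst = args.length > 0 && args[0].equals("foreach-first");

            for (int pass = 0; pass < 2; pass++) {
                // Run foreach in the first pass iff the flag says so.
                boolean doForeach = (pass == 0) == foreachFirst;
                long sum = 0;
                long start = System.currentTimeMillis();
                for (int r = 0; r < reps; r++) {
                    if (doForeach) {
                        for (int v : a) sum += v;
                    } else {
                        for (int i = 0; i < a.length; i++) sum += a[i];
                    }
                }
                System.out.println((doForeach ? "foreach: " : "indexed: ")
                        + (System.currentTimeMillis() - start) + " ms (sum=" + sum + ")");
            }
        }
    }

If whichever style runs second keeps winning, the benchmark is measuring JIT
warm-up rather than loop style.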

I'm a big fan of foreach, but I'll have to admit that Steven's
observations seem to be correct. I hope I'll find the time to take
Yonik's advice and run my own test sometime soon.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Optimise Indexing time using lucene..

2008-04-09 Thread Mathieu Lecarme

lucene4varma wrote:

Hi all,

I am new to Lucene and am using it for text search in my web application;
for that I need to index records from a database.
We are using a JDBC directory to store the indexes. The problem is that when I
start indexing the records for the first time, it takes a
huge amount of time. The following is the indexing code:


ResultSet rs = st.executeQuery(query); // returns 2 million records
while (rs.next()) {
    // create a Java object from the row ...
    // index the object into the JDBC directory ...
}

The above process takes a huge amount of time for 2 million records --
approximately 3-4 business days.
Can anyone please suggest an approach by which I could cut down this
time?
  
A JDBC directory is not a good idea; it's only useful when you need a 
central repository.

Use a large maxBufferedDocs in your IndexWriter.
With a large amount of data you'll hit a bottleneck: database reading, 
index writing, RAM for buffered docs, maybe CPU.
If database reading is the slow part and you are in a hurry, you can shard the 
index across multiple computers and, when it's finished, merge all the 
indexes -- with champagne.
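
A rough sketch of that advice, assuming a plain FSDirectory instead of the JDBC
directory (the path, buffer settings, and field names below are invented):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;

    public class BatchIndexer {
        public static void main(String[] args) throws Exception {
            FSDirectory dir = FSDirectory.getDirectory("/path/to/index");
            IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
            writer.setMaxBufferedDocs(10000); // buffer many docs in RAM per flush
            writer.setMergeFactor(30);        // fewer, larger merges during bulk load

            // rs = st.executeQuery(...) as in the original post, one Document per row:
            // while (rs.next()) {
            //     Document doc = new Document();
            //     doc.add(new Field("id", rs.getString("id"),
            //                       Field.Store.YES, Field.Index.UN_TOKENIZED));
            //     doc.add(new Field("body", rs.getString("body"),
            //                       Field.Store.NO, Field.Index.TOKENIZED));
            //     writer.addDocument(doc);
            // }

            writer.optimize();
            writer.close();
            // If you sharded across machines, the merge step is roughly:
            // writer.addIndexes(new Directory[] { shard1, shard2, ... });
        }
    }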


M.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1262) NullPointerException from FieldsReader after problem reading the index

2008-04-09 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12587117#action_12587117
 ] 

Michael McCandless commented on LUCENE-1262:


Those stack traces look like 2.1, not 2.3.1.  Is that right?

Can you post the index that you are using and the code that results in the 2nd 
exception?  I can't get the 2nd exception to happen in a test case...

 NullPointerException from FieldsReader after problem reading the index
 --

 Key: LUCENE-1262
 URL: https://issues.apache.org/jira/browse/LUCENE-1262
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.3.1
Reporter: Trejkaz

 There is a situation where there is an IOException reading from Hits, and 
 then the next time you get a NullPointerException instead of an IOException.
 Example stack traces:
 java.io.IOException: The specified network name is no longer available
   at java.io.RandomAccessFile.readBytes(Native Method)
   at java.io.RandomAccessFile.read(RandomAccessFile.java:322)
   at 
 org.apache.lucene.store.FSIndexInput.readInternal(FSDirectory.java:536)
   at 
 org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:74)
   at 
 org.apache.lucene.index.CompoundFileReader$CSIndexInput.readInternal(CompoundFileReader.java:220)
   at 
 org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:93)
   at 
 org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:34)
   at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:57)
   at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:88)
   at 
 org.apache.lucene.index.SegmentReader.document(SegmentReader.java:344)
   at org.apache.lucene.index.IndexReader.document(IndexReader.java:368)
   at org.apache.lucene.search.IndexSearcher.doc(IndexSearcher.java:84)
   at org.apache.lucene.search.Hits.doc(Hits.java:104)
 That error is fine.  The problem is the next call to doc generates:
 java.lang.NullPointerException
   at 
 org.apache.lucene.index.FieldsReader.getIndexType(FieldsReader.java:280)
   at org.apache.lucene.index.FieldsReader.addField(FieldsReader.java:216)
   at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:101)
   at 
 org.apache.lucene.index.SegmentReader.document(SegmentReader.java:344)
   at org.apache.lucene.index.IndexReader.document(IndexReader.java:368)
   at org.apache.lucene.search.IndexSearcher.doc(IndexSearcher.java:84)
   at org.apache.lucene.search.Hits.doc(Hits.java:104)
 Presumably FieldsReader is caching partially-initialised data somewhere.  I 
 would normally expect the exact same IOException to be thrown for subsequent 
 calls to the method.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1150) The token types of the standard tokenizer is not accessible

2008-04-09 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless updated LUCENE-1150:
---

Fix Version/s: 2.3.2

Backported fix to 2.3.2.

 The token types of the standard tokenizer is not accessible
 ---

 Key: LUCENE-1150
 URL: https://issues.apache.org/jira/browse/LUCENE-1150
 Project: Lucene - Java
  Issue Type: Bug
  Components: Analysis
Affects Versions: 2.3
Reporter: Nicolas Lalevée
Assignee: Michael McCandless
 Fix For: 2.3.2, 2.4

 Attachments: LUCENE-1150.patch, LUCENE-1150.take2.patch


 Since StandardTokenizerImpl is not public, these token types are not 
 accessible:
 {code:java}
 public static final int ALPHANUM          = 0;
 public static final int APOSTROPHE        = 1;
 public static final int ACRONYM           = 2;
 public static final int COMPANY           = 3;
 public static final int EMAIL             = 4;
 public static final int HOST              = 5;
 public static final int NUM               = 6;
 public static final int CJ                = 7;
 /**
  * @deprecated this solves a bug where HOSTs that end with '.' are identified
  * as ACRONYMs. It is deprecated and will be removed in the next
  * release.
  */
 public static final int ACRONYM_DEP       = 8;
 public static final String [] TOKEN_TYPES = new String [] {
   "<ALPHANUM>",
   "<APOSTROPHE>",
   "<ACRONYM>",
   "<COMPANY>",
   "<EMAIL>",
   "<HOST>",
   "<NUM>",
   "<CJ>",
   "<ACRONYM_DEP>"
 };
 {code}
 So no custom TokenFilter can be based on the token type. Actually, even the 
 StandardFilter cannot be written outside the 
 org.apache.lucene.analysis.standard package.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: StandardTokenizerConstants in 2.3

2008-04-09 Thread Antony Bowesman

Thanks Mike/Hoss for the clarification.
Antony


Michael McCandless wrote:


Chris Hostetter wrote:


:  But, StandardTokenizer is public?  It exports those constants for you?

:
: Really?  Sorry, but I can't find them - in 2.3.1 sources, there are no
: references to those statics.  Javadocs have no reference to them in
: StandardTokenizer

I think Michael is forgetting that he re-added those constants to the
trunk after 2.3.1 was released...

https://issues.apache.org/jira/browse/LUCENE-1150


Woops!  I'm sorry Antony -- Hoss is correct.

I didn't realize this missed 2.3.  I'll backport this fix to the 2.3 branch 
so it'll be included when we release 2.3.2 (which I think we should do 
soon -- a lot of little fixes have been backported).


Mike

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: StandardTokenizerConstants in 2.3

2008-04-09 Thread Michael McCandless


Chris Hostetter wrote:


:  But, StandardTokenizer is public?  It exports those constants for you?

:
: Really?  Sorry, but I can't find them - in 2.3.1 sources, there are no
: references to those statics.  Javadocs have no reference to them in
: StandardTokenizer

I think Michael is forgetting that he re-added those constants to the
trunk after 2.3.1 was released...

https://issues.apache.org/jira/browse/LUCENE-1150


Woops!  I'm sorry Antony -- Hoss is correct.

I didn't realize this missed 2.3.  I'll backport this fix to the 2.3  
branch so it'll be included when we release 2.3.2 (which I think we  
should do soon -- a lot of little fixes have been backported).


Mike

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: [jira] Created: (LUCENE-1257) Port to Java5

2008-04-09 Thread Steven A Rowe
Hi Toke,

On 04/09/2008 at 2:43 AM, Toke Eskildsen wrote:
 On Tue, 2008-04-08 at 18:48 -0500, robert engels wrote:
  That is opposite of my testing:...
  
  The 'foreach' is consistently faster. The time difference is
  independent of the size of the array. From what I know about JVM
  implementations, the foreach version SHOULD always be faster,
  because no bounds checking needs to be done on the
  element access...
 
 As for your test-code, then it does not measure performance in a fair
 way, as the foreach runs after the old-style loop. I'm sure you'll see
 different results if you switch the order of the two tests.

My first try at a test looked like Robert's, and exactly as you say, Toke, when 
operating on the same array, the first loop is slower and the second one is 
faster.

Steve

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Storing phrases in index

2008-04-09 Thread palexv

Hello all.
I have a question for the Lucene experts.
I have a set of phrases which I need to store in an index.
Is there a way to store phrases as terms in the index?

What is the best way to build such an index? Should this field be tokenized?

What is the best way to search phrases by mask in such an index? Should I
use BooleanQuery, WildcardQuery or SpanQuery?
How can I avoid the maxClauseCount exception when searching for something like
a*?
-- 
View this message in context: 
http://www.nabble.com/Storing-phrases-in-index-tp16585658p16585658.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Storing phrases in index

2008-04-09 Thread Mathieu Lecarme

palexv wrote:

Hello all.
I have a question for the Lucene experts.
I have a set of phrases which I need to store in an index.
Is there a way to store phrases as terms in the index?


What is the best way to build such an index? Should this field be tokenized?
  

Not tokenized.

What is the best way to search phrases by mask in such an index? Should I
use BooleanQuery, WildcardQuery or SpanQuery?

If you search for the complete phrase, just use a TermQuery; if you search 
for part of a phrase, use a ShingleFilter.


 
How can I avoid the maxClauseCount exception when searching for something like
a*?

Index the indexed terms.
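
A small self-contained sketch of the untokenized approach (the field name and
phrase values are invented):

    import org.apache.lucene.analysis.WhitespaceAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.store.RAMDirectory;

    public class PhraseAsTerm {
        public static void main(String[] args) throws Exception {
            RAMDirectory dir = new RAMDirectory();
            IndexWriter writer = new IndexWriter(dir, new WhitespaceAnalyzer(), true);
            Document doc = new Document();
            // UN_TOKENIZED indexes "new york city" as a single term.
            doc.add(new Field("phrase", "new york city",
                              Field.Store.YES, Field.Index.UN_TOKENIZED));
            writer.addDocument(doc);
            writer.close();

            // An exact phrase lookup is then just a TermQuery on that term.
            IndexSearcher searcher = new IndexSearcher(dir);
            Hits hits = searcher.search(new TermQuery(new Term("phrase", "new york city")));
            System.out.println("matches: " + hits.length());
            searcher.close();
        }
    }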

M.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Flexible indexing design (was Re: Pooling of posting objects in DocumentsWriter)

2008-04-09 Thread Michael Busch

Thanks for your quick answers.

Michael McCandless wrote:

Hi Michael,

I've actually been working on factoring DocumentsWriter, as a first
step towards flexible indexing.



Cool -- yeah, separating DocumentsWriter into multiple classes 
certainly helps with understanding the complex code.



I agree we would have an abstract base Posting class that just tracks
the term text.

Then, DocumentsWriter manages inverting each field, maintaining the
per-field hash of term text -> abstract Posting instances, exposing
the methods to write bytes into multiple streams for a Posting in the
RAM byte slices, and then read them back when flushing, etc.

And then the code that writes the current index format would plug into
this and should be fairly small and easy to understand.  For example,
frq/prx postings and term vectors writing would be two plugins to the
inverted terms API; it's just that term vectors flush after every
document and frq/prx flush when RAM is full.
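
To illustrate the shape such an API might take, a purely hypothetical sketch --
none of these types exist in Lucene, and the names are invented for discussion:

    abstract class Posting {
        String termText;   // the abstract base class only tracks the term text
    }

    interface InvertedTermsConsumer {
        // Called once per new term seen in a field; may return a subclass.
        Posting newPosting(String termText);

        // Called per occurrence; writes bytes into the RAM byte slices.
        void addPosition(Posting posting, int docID, int position)
                throws java.io.IOException;

        // frq/prx would flush when RAM is full; term vectors after each document.
        void flush() throws java.io.IOException;
    }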



I think this makes sense. We also need to come up with a good solution 
for the dictionary, because a term with frq/prx postings needs to store 
two (or three, with a skip list) file pointers in the dictionary, whereas 
e.g. a binary posting list only needs one pointer.



Then there would also be plugins that just tap into the entire
document (don't need inversion), like FieldsWriter.

There are still a lot of details to work out...


Definitely. For example, we should think about the Field APIs. Since we 
don't have global field semantics in Lucene, I wonder how to handle 
conflict cases, e.g. when a document specifies a different posting list 
format than a previous one for the same field. The easiest way would be 
to disallow it and throw an exception, but that is kind of against 
Lucene's current way of dealing with fields. On the other hand, I'm scared 
of the complicated code needed to handle conflicts between all the possible 
combinations of posting list formats. KinoSearch doesn't have to worry about 
this, because it has a static schema (I think?), but it isn't as flexible as Lucene.




The DocumentsWriter does pooling of the Posting instances and I'm 
wondering how much this improves performance.


We should retest this.  I think it was a decent difference in
performance but I don't remember how much.  I think the pooling can
also be made generic (handled by DocumentsWriter).  EG the plugin
could expose a newPosting()  method.



Yeah, but for code simplicity let's really figure out first how much 
pooling helps at all.
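
For reference, a hypothetical sketch of generic pooling handled by
DocumentsWriter, reusing the invented Posting/InvertedTermsConsumer names from
the sketch above:

    final class PostingPool {
        private final java.util.ArrayList<Posting> free =
                new java.util.ArrayList<Posting>();
        private final InvertedTermsConsumer plugin;

        PostingPool(InvertedTermsConsumer plugin) { this.plugin = plugin; }

        Posting get(String termText) {
            if (free.isEmpty()) return plugin.newPosting(termText); // allocate
            Posting p = free.remove(free.size() - 1);               // or recycle
            p.termText = termText;
            return p;
        }

        void recycle(Posting p) { free.add(p); }  // returned after each flush
    }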



Mike

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: [jira] Created: (LUCENE-1257) Port to Java5

2008-04-09 Thread melix

Hi,

I confirm your results. I didn't think there could be a difference using
foreach constructs...

Cedric


Steven A Rowe wrote:
 
 On 04/04/2008 at 4:40 AM, Toke Eskildsen wrote:
 On Wed, 2008-04-02 at 09:30 -0400, Mark Miller wrote:
   - replacement of indexed for loops with for each constructs
  
  Is this always the best idea? Doesn't the for loop construct make an
  iterator, which can be much slower than an indexed for loop?
 
 Only in the case of iterations over collections. For arrays, the foreach
 is syntactic sugar for an indexed for-loop.
 http://java.sun.com/docs/books/jls/third_edition/html/statements.html#14.14.2
 
 I don't think this is actually true.  The text at the above-linked page
 simply says that for-each over an array means the same as an indexed
 loop over the same array.  Syntactic sugar, OTOH, implies that the
 resulting opcode is exactly the same.  When I look at the byte code (using
 javap) for the comparison test I include below, I can see that the indexed
 and for-each loops do not generate the same byte code.
 
 I constructed a simple program to compare the runtime length of the two
 loop control mechanisms, while varying the size of the array.  The test
 program takes command line parameters to control which loop control
 mechanism to use, the size of the array (#elems), and the number of times
 to execute the loop (#iters).  I used a Bash shell script to invoke the
 test program.
 
 Summary of the results: over int[] arrays, indexed loops are faster on
 arrays with fewer than about a million elements.  The fewer the elements,
 the faster indexed loops are relative to for-each loops.  This could be
 explained by a higher one-time setup cost for the for-each loop - above a
 certain array size, the for-each setup cost is lost in the noise.  It
 should be noted, however, that this one-time setup cost is quite small,
 and might be worth the increased code clarity.
 
 Here are the results for three different platforms:
 
   - Best of five iterations for each combination
   - All using the -server JVM option
   - Holding constant #iters * #elems = 10^10
   - Rounding the reported real time to the nearest tenth of a second
   - % Slower = 100 * ((For-each - Indexed) / Indexed)
 
 Platform #1: Windows XP SP2; Intel Core 2 duo [EMAIL PROTECTED]; Java 1.5.0_13
 
  #iters  #elems  For-each  Indexed  % Slower
  ------  ------  --------  -------  --------
  10^9    10^1    22.3s     13.8s    62%
  10^8    10^2    16.0s     13.6s    18%
  10^6    10^4    14.8s     13.0s    14%
  10^4    10^6    12.9s     12.9s    0%
  10^3    10^7    13.4s     13.3s    1%
 
 Platform #2: Debian Linux, 2.6.21.7 kernel; Intel Xeon [EMAIL PROTECTED]; Java
 1.5.0_14
 
  #iters  #elems  For-each  Indexed  % Slower
  ------  ------  --------  -------  --------
  10^9    10^1    33.6s     14.2s    137%
  10^8    10^2    20.4s     13.9s    47%
  10^6    10^4    19.0s     12.7s    50%
  10^4    10^6    12.7s     12.8s    -1%
  10^3    10^7    13.2s     13.2s    0%
 
 Platform #3: Debian Linux, 2.6.21.7 kernel; Intel Xeon [EMAIL PROTECTED]; Java
 1.5.0_10
 
  #iters  #elems  For-each  Indexed  % Slower
  ------  ------  --------  -------  --------
  10^9    10^1    102.7s    73.6s    40%
  10^8    10^2    107.8s    60.0s    80%
  10^6    10^4    105.2s    58.6s    80%
  10^4    10^6    58.8s     53.0s    11%
  10^3    10^7    60.0s     54.1s    11%
 
 
 - ForEachTest.java follows -
 
  import java.util.Date;
  import java.util.Random;
 
  /**
   * This is meant to be called from a shell script that varies the loop style,
   * the number of iterations over the loop, and the number of elements in the
   * array over which the loop iterates, e.g.:
   *
   * cmd="java -server -cp . ForEachTest"
   * for elems in 10 100 10000 1000000 10000000 ; do
   *     iters=$((10000000000/${elems}))
   *     for run in 1 2 3 4 5 ; do
   *         time $cmd --indexed --arraysize $elems --iterations $iters
   *         time $cmd --foreach --arraysize $elems --iterations $iters
   *     done
   * done
   */
  public class ForEachTest {
    static String NL = System.getProperty("line.separator");
    static String usage
        = "Usage: java -server -cp . ForEachTest [ --indexed | --foreach ]"
        + NL + "\t--iterations <num-iterations> --arraysize <array-size>";
 
    public static void main(String[] args) {
      boolean useIndexedLoop = false;
      int size = 0;
      int iterations = 0;
      try {
        for (int argnum = 0 ; argnum < args.length ; ++argnum) {
          if (args[argnum].equals("--indexed")) {
            useIndexedLoop = true;
          } else if (args[argnum].equals("--foreach")) {
            useIndexedLoop = false;
          } else if (args[argnum].equals("--iterations")) {
            iterations = Integer.parseInt(args[++argnum]);
          } else if (args[argnum].equals("--arraysize")) {
            size = Integer.parseInt(args[++argnum]);
 

Re: [jira] Created: (LUCENE-1257) Port to Java5

2008-04-09 Thread robert engels

I think it is going to be highly JVM dependent.

I reworked it to call each twice (and reordered the tests)... the  
foreach is still faster. I also ran it on Windows (under Parallels)  
and got similar results, but in some cases the indexed loop was faster.


Server times are tough to judge because normally the server VM is not  
going to compile a method until it has been hit 10k times, but this can  
be configured...


I think this is a case where you need to make a judgement based on  
expected behavior, as there are probably too many variables.


The 'foreach' should be faster in the general case for arrays as the  
bounds checking can be avoided.


But, I doubt the speed difference is going to matter much either way,  
and eventually the JVM impl will converge to near equal performance.



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



index reopen question

2008-04-09 Thread John Wang
Hi:
I have been reading the 2.3.1 release code and have a few questions
regarding IndexReader reopen:

1)  looking at the code:

if (this.hasChanges || this.isCurrent()) {

  // the index hasn't changed - nothing to do here

  return this;

}


   Shouldn't it be !this.hasChanges?


2) FilterIndexReader calls the ensureOpen() method from the superclass
instead of overriding the method and calling the inner reader's ensureOpen; is
that expected?


3) When you reopen an index, the inner reference count is not updated, is
that ok?


Thanks


-John


Re: [jira] Created: (LUCENE-1257) Port to Java5

2008-04-09 Thread Yonik Seeley
Just for kicks, I tried it on a 64-bit Athlon, linux_x86_64, with a 64-bit
Sun 1.6 JVM and -server.
The explicit loop counter was 50% faster (for N=10... the inner loop)

-Yonik

On Tue, Apr 8, 2008 at 8:21 PM, Yonik Seeley [EMAIL PROTECTED] wrote:
 On Tue, Apr 8, 2008 at 7:48 PM, robert engels [EMAIL PROTECTED] wrote:
   That is opposite of my testing:...
  
The 'foreach' is consistently faster.

  It's consistently slower for me (I tested java5 and java6 both with
  -server on a P4).
  I'm a big fan of testing different methods in different test runs
  (because of hotspot, gc, etc).

  Example results:
  $ c:/opt/jdk16/bin/java -server t 1 10 foreach
  N = 10
  method=foreach len=10 indexed time = 8734

  [EMAIL PROTECTED] /cygdrive/h/tmp
  $ c:/opt/jdk16/bin/java -server t 1 10 iter
  N = 10
  method=iter len=10 indexed time = 7062


  Here's my test code (a modified version of yours):

  public class t {

    public static void main(String[] args) {
      int I = Integer.parseInt(args[0]); // number of passes over the array
      int N = Integer.parseInt(args[1]); // array size
      String method = args[2].intern();  // "foreach" or "iter"

      String[] strings = new String[N];

      for (int i = 0; i < N; i++) {
        strings[i] = Integer.toString(i);
      }

      System.out.println("N = " + N);

      long len = 0;
      long start = System.currentTimeMillis();

      // == comparison is safe here because method was intern()ed above.
      if (method == "foreach")
        for (int i = 0; i < I; i++) {
          for (String s : strings) {
            len += s.length();
          }
        }
      else
        for (int i = 0; i < I; i++) {
          for (int j = 0; j < N; j++) {
            len += strings[j].length();
          }
        }

      System.out.println("method=" + method + " len=" + len
          + " indexed time = " + (System.currentTimeMillis() - start));
    }
  }


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1260) Norm codec strategy in Similarity

2008-04-09 Thread Karl Wettin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12587290#action_12587290
 ] 

Karl Wettin commented on LUCENE-1260:
-

{quote}
As long as the norm remains a fixed size (1 byte) then it doesn't really matter 
whether it's tied to the Similarity or the store itself - it would be nice if the 
index could tell you which normDecoder to use, but it's not any more 
unreasonable to expect the application to keep track of this (if it's not the 
default encoding) since applications already have to keep track of things like 
which Analyzer is compatible with querying this index.

If we want norms to be more flexible, so that apps can pick not only the 
encoding but also the size... then things get more interesting, but it's still 
feasible to say: if you customize this, you have to make your reading apps and 
your writing apps smart enough to know about your customization.
{quote}

I like the idea of an index that is completely self-aware of norm encoding, 
what payloads mean, etc. 

{quote}
I also want to move it to the instance scope so I can have multiple indices 
with unique norm span/resolutions created from the same classloader.
{quote}

My use case is really about document boost and not normalization. 

So another solution to this is to introduce a (variable bit sized?) document 
boost file and completely separate it from the norms, instead of the current 
situation where normalization and document boost are baked together as the 
same thing. There would then be no need to touch the norms encoding; the 
default resolution is good enough for /normalization/. It would fix several 
caveats with norms as I see it. 



 Norm codec strategy in Similarity
 -

 Key: LUCENE-1260
 URL: https://issues.apache.org/jira/browse/LUCENE-1260
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.3.1
Reporter: Karl Wettin
 Attachments: LUCENE-1260.txt


 The static span and resolution of the 8-bit norms codec might not fit 
 all applications. 
 My use case requires that 100f-250f is discretized into 60 bags instead of the 
 default... 10?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Flexible indexing design

2008-04-09 Thread Marvin Humphrey

On Apr 9, 2008, at 6:35 AM, Michael Busch wrote:

We also need to come up with a good solution for the dictionary,  
because a term with frq/prx postings needs to store two (or three  
 with a skip list) file pointers in the dictionary, whereas e.g.  
 a binary posting list only needs one pointer.


This is something I'm working on as well, and I hope we can solve a  
couple of design problems I've been turning over in my mind for some  
time.


In KS, the information Lucene stores in the frq/prx files is carried  
in one postings file per field, as discussed previously.  However, I  
made the additional change of breaking out skip data into a separate  
file (shared across all fields).  Isolating skip data sacrifices some  
locality of reference, but buys substantial gains in simplicity and  
compartmentalization.  Individual Posting subclasses, each of which  
defines a file format, don't have to know about skip algorithms at  
all.  :)  Further, improvements in the skip algorithm only require  
changes to the .skip file, and falling back to PostingList_Next still  
works if the .skip file becomes corrupted since .skip carries only  
optimization info and no real data.


For reasons I won't go into here, KS doesn't need to put a field  
number in its TermInfo, but it does need the doc freq, plus file  
positions for the postings file, the skip file, and the primary  
Lexicon file.  (Lexicon is the KS term dictionary class, akin to  
Lucene's TermEnum.)


  struct kino_TermInfo {
      kino_VirtualTable* _;
      kino_ref_t         ref;
      chy_i32_t          doc_freq;
      chy_u64_t          post_filepos;
      chy_u64_t          skip_filepos;
      chy_u64_t          lex_filepos;
  };

There are two problems.

First is that I'd like to extend indexing with arbitrary subclasses of  
SegDataWriter, and I'd like these classes to be able to put their own  
file position bookmarks (or possibly other data) into TermInfo.   
Making TermInfo hash-based would probably do it, but there would be  
nasty performance and memory penalties since TermInfo objects are  
numerous.


So, what's the best way to allow multiple, unrelated classes to extend  
TermInfo and the term dictionary file format?  Is it to break up  
TermInfo information horizontally rather than vertically, so that  
instead of a single array of TermInfo objects, we have a flexible  
stack of arrays of 64-bit integers representing file positions?
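
In Java terms, the horizontal layout might look something like this
hypothetical sketch (all names invented):

    final class TermInfoColumns {
        private final int numTerms;
        // One long[] column of file positions per registered writer.
        private final java.util.Map<String, long[]> columns =
                new java.util.HashMap<String, long[]>();

        TermInfoColumns(int numTerms) { this.numTerms = numTerms; }

        // Each SegDataWriter-style component registers its own column,
        // instead of forcing a new field into a shared TermInfo object.
        long[] addColumn(String writerName) {
            long[] filePointers = new long[numTerms];
            columns.put(writerName, filePointers);
            return filePointers;
        }

        long filePointer(String writerName, int termNum) {
            return columns.get(writerName)[termNum];
        }
    }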


The second problem is how to share a term dictionary over a cluster.   
It would be nice to be able to plug modules into IndexReader that  
represent clusters of machines but that are dedicated to specific  
tasks: one cluster could be dedicated to fetching full documents and  
applying highlighting; another cluster could be dedicated to scanning  
through postings and finding/scoring hits; a third cluster could store  
the entire term dictionary in RAM.


A centralized term dictionary held in RAM would be particularly handy  
for sorting purposes.  The problem is that the file pointers of a term  
dictionary are specific to indexes on individual machines.  A shared  
dictionary in RAM would have to contain pointers for *all* clients,  
which isn't really workable.


So, just how do you go about assembling task specific clusters?  The  
stored documents cluster is easy, but the term dictionary and the  
postings are hard.


For example, we should think about the Field APIs. Since we don't  
have global field semantics in Lucene I wonder how to handle  
conflict cases, e.g. when a document specifies a different posting  
list format than a previous one for the same field. The easiest way  
would be to not allow it and throw an exception. But this is kind of  
against Lucene's way of dealing with fields currently. But I'm  
scared of the complicated code to handle conflicts of all the  
possible combinations of posting list formats.


Yeah. Lucene's field definition conflict-resolution code is gnarly  
already. :(


KinoSearch doesn't have to worry about this, because it has a static  
schema (I think?), but isn't as flexible as Lucene.


Earlier versions of KS did not allow the addition of new fields on the  
fly, but this has been changed.  You can now add fields to an existing  
Schema object like so:


for my $doc (@docs) {
# Dynamically define any new fields as 'text'.
for my $field ( keys %$doc ) {
$schema-add_field( $field = 'text' );
}
$invindexer-add_doc($doc);
}

See the attached sample app for that snippet in context.

Here are some current differences between KS and Lucene:

  * KS doesn't yet purge *old* dynamic field definitions which have
become obsolete.  However, that should be possible to add later,
as a sweep triggered during full optimization.
  * You can't change the definition of an existing field.
  * Documents are hash-based, so you can't have multiple fields with
the same name within one document object.  However, I consider
that capability a misfeature of 

[jira] Updated: (LUCENE-1262) NullPointerException from FieldsReader after problem reading the index

2008-04-09 Thread Trejkaz (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Trejkaz updated LUCENE-1262:


Affects Version/s: (was: 2.3.1)
   2.2

Whoops.  I don't think it's 2.1; it must be 2.2.

I'll try to reproduce this standalone, but first I need a way to make 
readInternal throw an exception.  I presume you were using some kind of custom 
store implementation to do that.  I'll see if I can make it happen under 2.2 and 
then try the same thing under 2.3.1 to confirm whether it still breaks.


 NullPointerException from FieldsReader after problem reading the index
 --

 Key: LUCENE-1262
 URL: https://issues.apache.org/jira/browse/LUCENE-1262
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.2
Reporter: Trejkaz

 There is a situation where there is an IOException reading from Hits, and 
 then the next time you get a NullPointerException instead of an IOException.
 Example stack traces:
 java.io.IOException: The specified network name is no longer available
   at java.io.RandomAccessFile.readBytes(Native Method)
   at java.io.RandomAccessFile.read(RandomAccessFile.java:322)
   at 
 org.apache.lucene.store.FSIndexInput.readInternal(FSDirectory.java:536)
   at 
 org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:74)
   at 
 org.apache.lucene.index.CompoundFileReader$CSIndexInput.readInternal(CompoundFileReader.java:220)
   at 
 org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:93)
   at 
 org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:34)
   at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:57)
   at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:88)
   at 
 org.apache.lucene.index.SegmentReader.document(SegmentReader.java:344)
   at org.apache.lucene.index.IndexReader.document(IndexReader.java:368)
   at org.apache.lucene.search.IndexSearcher.doc(IndexSearcher.java:84)
   at org.apache.lucene.search.Hits.doc(Hits.java:104)
 That error is fine.  The problem is the next call to doc generates:
 java.lang.NullPointerException
   at 
 org.apache.lucene.index.FieldsReader.getIndexType(FieldsReader.java:280)
   at org.apache.lucene.index.FieldsReader.addField(FieldsReader.java:216)
   at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:101)
   at 
 org.apache.lucene.index.SegmentReader.document(SegmentReader.java:344)
   at org.apache.lucene.index.IndexReader.document(IndexReader.java:368)
   at org.apache.lucene.search.IndexSearcher.doc(IndexSearcher.java:84)
   at org.apache.lucene.search.Hits.doc(Hits.java:104)
 Presumably FieldsReader is caching partially-initialised data somewhere.  I 
 would normally expect the exact same IOException to be thrown for subsequent 
 calls to the method.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1262) NullPointerException from FieldsReader after problem reading the index

2008-04-09 Thread Trejkaz (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Trejkaz updated LUCENE-1262:


Affects Version/s: (was: 2.2)
   2.1

Okay, I'll eat my words now: it is indeed 2.1, as the version doesn't have 
openInput(String,int) in it.

Anyway, an update: I've managed to reproduce it on any text index by simulating 
random network outages.  I keep a flag which I set to true.  The trick is that 
the wrapping IndexInput implementation *randomly* throws IOException while the 
flag is true -- if it always throws IOException the problem doesn't occur.  
If it throws randomly then the problem occurs occasionally, and it always seems 
to be for larger queries (I'm using MatchAllDocsQuery now).

I'll see if I can tweak the code to make it more likely to happen and then 
start working up to each version of Lucene to see if it stops happening 
somewhere.
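
For the record, a sketch of the kind of wrapper described above -- my
reconstruction, not the actual test code; the class name and failure rate are
invented, and a matching Directory would return one of these from openInput():

    import java.io.IOException;
    import java.util.Random;
    import org.apache.lucene.store.IndexInput;

    public class FlakyIndexInput extends IndexInput {
        public static volatile boolean failing = false;
        private final IndexInput delegate;
        private final Random random = new Random();

        public FlakyIndexInput(IndexInput delegate) {
            this.delegate = delegate;
        }

        private void maybeFail() throws IOException {
            // Failing only *sometimes* is what exposes the bug; failing on
            // every read never leaves FieldsReader half-initialised.
            if (failing && random.nextInt(10) == 0) {
                throw new IOException("The specified network name is no longer available");
            }
        }

        public byte readByte() throws IOException {
            maybeFail();
            return delegate.readByte();
        }

        public void readBytes(byte[] b, int offset, int len) throws IOException {
            maybeFail();
            delegate.readBytes(b, offset, len);
        }

        public void seek(long pos) throws IOException { delegate.seek(pos); }
        public long getFilePointer() { return delegate.getFilePointer(); }
        public long length() { return delegate.length(); }
        public void close() throws IOException { delegate.close(); }

        public Object clone() {
            return new FlakyIndexInput((IndexInput) delegate.clone());
        }
    }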


 NullPointerException from FieldsReader after problem reading the index
 --

 Key: LUCENE-1262
 URL: https://issues.apache.org/jira/browse/LUCENE-1262
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.1
Reporter: Trejkaz

 There is a situation where there is an IOException reading from Hits, and 
 then the next time you get a NullPointerException instead of an IOException.
 Example stack traces:
 java.io.IOException: The specified network name is no longer available
   at java.io.RandomAccessFile.readBytes(Native Method)
   at java.io.RandomAccessFile.read(RandomAccessFile.java:322)
   at 
 org.apache.lucene.store.FSIndexInput.readInternal(FSDirectory.java:536)
   at 
 org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:74)
   at 
 org.apache.lucene.index.CompoundFileReader$CSIndexInput.readInternal(CompoundFileReader.java:220)
   at 
 org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:93)
   at 
 org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:34)
   at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:57)
   at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:88)
   at 
 org.apache.lucene.index.SegmentReader.document(SegmentReader.java:344)
   at org.apache.lucene.index.IndexReader.document(IndexReader.java:368)
   at org.apache.lucene.search.IndexSearcher.doc(IndexSearcher.java:84)
   at org.apache.lucene.search.Hits.doc(Hits.java:104)
 That error is fine.  The problem is the next call to doc generates:
 java.lang.NullPointerException
   at 
 org.apache.lucene.index.FieldsReader.getIndexType(FieldsReader.java:280)
   at org.apache.lucene.index.FieldsReader.addField(FieldsReader.java:216)
   at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:101)
   at 
 org.apache.lucene.index.SegmentReader.document(SegmentReader.java:344)
   at org.apache.lucene.index.IndexReader.document(IndexReader.java:368)
   at org.apache.lucene.search.IndexSearcher.doc(IndexSearcher.java:84)
   at org.apache.lucene.search.Hits.doc(Hits.java:104)
 Presumably FieldsReader is caching partially-initialised data somewhere.  I 
 would normally expect the exact same IOException to be thrown for subsequent 
 calls to the method.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1260) Norm codec strategy in Similarity

2008-04-09 Thread Hoss Man (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12587435#action_12587435
 ] 

Hoss Man commented on LUCENE-1260:
--

bq. My use case is really about document boost and not normalization.

bq. So another solution to this is to introduce a (variable bit sized?) 
document boost file and completely separate it from the norms instead...

1) "norms" is a vague term.  Currently lengthNorm is folded in with field 
boosts and doc boosts to form a generic fieldNorm ... I assumed you were 
interested in a more general way to improve the resolution of fieldNorm.

2) your description of general-purpose variable-sized document boosting sounds 
exactly like LUCENE-1231 ... in the long run, utilities using LUCENE-1231 (or 
something like it) to replace field boosts and length norms might make the 
most sense as a way to eliminate the current static norm encoding and put more 
flexibility in the hands of users.

 Norm codec strategy in Similarity
 -

 Key: LUCENE-1260
 URL: https://issues.apache.org/jira/browse/LUCENE-1260
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.3.1
Reporter: Karl Wettin
 Attachments: LUCENE-1260.txt


 The static span and resolution of the 8-bit norms codec might not fit 
 all applications. 
 My use case requires that 100f-250f is discretized into 60 bags instead of the 
 default... 10?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1262) IndexOutOfBoundsException from FieldsReader after problem reading the index

2008-04-09 Thread Trejkaz (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Trejkaz updated LUCENE-1262:


Affects Version/s: (was: 2.1)
   2.3.1
  Summary: IndexOutOfBoundsException from FieldsReader after 
problem reading the index  (was: NullPointerException from FieldsReader after 
problem reading the index)

I managed to reproduce the problem as-is under version 2.2.

For 2.3 the problem has changed -- instead of a NullPointerException it is now 
an IndexOutOfBoundsException:

Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 52, 
Size: 34
at java.util.ArrayList.RangeCheck(ArrayList.java:547)
at java.util.ArrayList.get(ArrayList.java:322)
at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:260)
at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:154)
at 
org.apache.lucene.index.SegmentReader.document(SegmentReader.java:659)
at org.apache.lucene.index.IndexReader.document(IndexReader.java:525)
at org.apache.lucene.search.IndexSearcher.doc(IndexSearcher.java:92)
at org.apache.lucene.search.Hits.doc(Hits.java:167)
at Test.main(Test.java:24)

Will attach my test program in a moment.


 IndexOutOfBoundsException from FieldsReader after problem reading the index
 ---

 Key: LUCENE-1262
 URL: https://issues.apache.org/jira/browse/LUCENE-1262
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.3.1
Reporter: Trejkaz

 There is a situation where there is an IOException reading from Hits, and 
 then the next time you get a NullPointerException instead of an IOException.
 Example stack traces:
 java.io.IOException: The specified network name is no longer available
   at java.io.RandomAccessFile.readBytes(Native Method)
   at java.io.RandomAccessFile.read(RandomAccessFile.java:322)
   at 
 org.apache.lucene.store.FSIndexInput.readInternal(FSDirectory.java:536)
   at 
 org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:74)
   at 
 org.apache.lucene.index.CompoundFileReader$CSIndexInput.readInternal(CompoundFileReader.java:220)
   at 
 org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:93)
   at 
 org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:34)
   at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:57)
   at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:88)
   at 
 org.apache.lucene.index.SegmentReader.document(SegmentReader.java:344)
   at org.apache.lucene.index.IndexReader.document(IndexReader.java:368)
   at org.apache.lucene.search.IndexSearcher.doc(IndexSearcher.java:84)
   at org.apache.lucene.search.Hits.doc(Hits.java:104)
 That error is fine.  The problem is the next call to doc generates:
 java.lang.NullPointerException
   at 
 org.apache.lucene.index.FieldsReader.getIndexType(FieldsReader.java:280)
   at org.apache.lucene.index.FieldsReader.addField(FieldsReader.java:216)
   at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:101)
   at 
 org.apache.lucene.index.SegmentReader.document(SegmentReader.java:344)
   at org.apache.lucene.index.IndexReader.document(IndexReader.java:368)
   at org.apache.lucene.search.IndexSearcher.doc(IndexSearcher.java:84)
   at org.apache.lucene.search.Hits.doc(Hits.java:104)
 Presumably FieldsReader is caching partially-initialised data somewhere.  I 
 would normally expect the exact same IOException to be thrown for subsequent 
 calls to the method.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Updated: (LUCENE-1262) IndexOutOfBoundsException from FieldsReader after problem reading the index

2008-04-09 Thread Trejkaz (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Trejkaz updated LUCENE-1262:


Attachment: Test.java

Attaching a test program to reproduce the problem under 2.3.1.

It occurs in approximately 1 in every 4 executions for any reasonably large text 
index (really small ones don't seem to trigger it, so I couldn't attach a text 
index with it).  The number of fields may be related; looking at the 
IndexOutOfBoundsException numbers, it seems that the indexes we have happen to 
have a large number of fields.


 IndexOutOfBoundsException from FieldsReader after problem reading the index
 ---

 Key: LUCENE-1262
 URL: https://issues.apache.org/jira/browse/LUCENE-1262
 Project: Lucene - Java
  Issue Type: Bug
  Components: Index
Affects Versions: 2.3.1
Reporter: Trejkaz
 Attachments: Test.java


 There is a situation where there is an IOException reading from Hits, and 
 then the next time you get a NullPointerException instead of an IOException.
 Example stack traces:
 java.io.IOException: The specified network name is no longer available
   at java.io.RandomAccessFile.readBytes(Native Method)
   at java.io.RandomAccessFile.read(RandomAccessFile.java:322)
   at 
 org.apache.lucene.store.FSIndexInput.readInternal(FSDirectory.java:536)
   at 
 org.apache.lucene.store.BufferedIndexInput.readBytes(BufferedIndexInput.java:74)
   at 
 org.apache.lucene.index.CompoundFileReader$CSIndexInput.readInternal(CompoundFileReader.java:220)
   at 
 org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:93)
   at 
 org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:34)
   at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:57)
   at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:88)
   at 
 org.apache.lucene.index.SegmentReader.document(SegmentReader.java:344)
   at org.apache.lucene.index.IndexReader.document(IndexReader.java:368)
   at org.apache.lucene.search.IndexSearcher.doc(IndexSearcher.java:84)
   at org.apache.lucene.search.Hits.doc(Hits.java:104)
 That error is fine.  The problem is the next call to doc generates:
 java.lang.NullPointerException
   at 
 org.apache.lucene.index.FieldsReader.getIndexType(FieldsReader.java:280)
   at org.apache.lucene.index.FieldsReader.addField(FieldsReader.java:216)
   at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:101)
   at 
 org.apache.lucene.index.SegmentReader.document(SegmentReader.java:344)
   at org.apache.lucene.index.IndexReader.document(IndexReader.java:368)
   at org.apache.lucene.search.IndexSearcher.doc(IndexSearcher.java:84)
   at org.apache.lucene.search.Hits.doc(Hits.java:104)
 Presumably FieldsReader is caching partially-initialised data somewhere.  I 
 would normally expect the exact same IOException to be thrown for subsequent 
 calls to the method.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1260) Norm codec strategy in Similarity

2008-04-09 Thread Karl Wettin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12587445#action_12587445
 ] 

Karl Wettin commented on LUCENE-1260:
-

{quote}
1) "norms" is a vague term. Currently lengthNorm is folded in with field 
boosts and doc boosts to form a generic fieldNorm ... I assumed you were 
interested in a more general way to improve the resolution of fieldNorm.
{quote}

I still am, but mainly because it is at the moment the simplest (and only) way 
to get better document boost resolution.




 Norm codec strategy in Similarity
 -

 Key: LUCENE-1260
 URL: https://issues.apache.org/jira/browse/LUCENE-1260
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.3.1
Reporter: Karl Wettin
 Attachments: LUCENE-1260.txt


 The static span and resolution of the 8-bit norms codec might not fit 
 all applications. 
 My use case requires that 100f-250f is discretized into 60 bags instead of the 
 default... 10?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



[jira] Commented: (LUCENE-1260) Norm codec strategy in Similarity

2008-04-09 Thread Karl Wettin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12587446#action_12587446
 ] 

Karl Wettin commented on LUCENE-1260:
-

I notice there is a typo in the patch. And there is no test case for 
SimpleNormCodec. I'll come up with that too.

 Norm codec strategy in Similarity
 -

 Key: LUCENE-1260
 URL: https://issues.apache.org/jira/browse/LUCENE-1260
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Search
Affects Versions: 2.3.1
Reporter: Karl Wettin
 Attachments: LUCENE-1260.txt


 The static span and resolution of the 8-bit norms codec might not fit 
 all applications. 
 My use case requires that 100f-250f is discretized into 60 bags instead of the 
 default... 10?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]