Re: lucene farsi problem

2008-05-06 Thread Vizzini

Sorry for cross-posting, but why the word 'Farsi' instead of 'Persian'?  No
one says Lucene français, Español, or Deutsch - so why Farsi?

Please read the following article; I found it quite enlightening.
http://www.cais-soas.com/CAIS/Languages/persian_not_farsi.htm

PV






Re: Are those runtime errors about the jdk, or lucene's jar, or my code?

2008-05-06 Thread crspan
Thanks so much, Mike. Those runtime errors were caused by one index that was
somehow corrupted during scp. It has nothing to do with Lucene 2.3.2.


For those who come across this thread:

   Please run "CheckIndex" first.

That would have saved me many hours of fruitless debugging.
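For example, something like this from the command line (adjust the jar name
and the index path for your setup):

    java -cp lucene-core-2.3.2.jar org.apache.lucene.index.CheckIndex /path/to/index

It walks every segment and reports any corruption it finds; I believe it also
takes a -fix option that drops corrupt segments (losing the documents in
them), so take a backup of the index first.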

Cheers,
Charlie


Michael McCandless wrote:


Hi,

Could you run org.apache.lucene.index.CheckIndex on your index and 
post the result?


Are these exceptions easily reproduced starting from scratch (new index)? 






RE: How to make a query that associates 2 index files

2008-05-06 Thread Michael Siu
Yes, there is a many-to-one mapping to the content index, and the size of the
content data varies, say from 1 KB to multiple GB. That is why it is not wise
to repeat the same content in an index document.

Thanks for pointing out that the doc IDs are not constant. Yes, the keys to
the content are generated on the fly and are unique per piece of content.

Thanks.
-m

-Original Message-
From: Erick Erickson [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, May 06, 2008 12:46 PM
To: java-user@lucene.apache.org
Subject: Re: How to make a query that associates 2 index files

Sure, just include different fields in different docs in your index.
Then, when you search, since each term is on a field, docs without
that field are excluded from the search.

But this is really not very different in terms of a solution than
your earlier one. You still have the issue of searching the index
once to get the keys, then using those keys as part of another
search.

Are you saying that there is a many-to-one mapping to your content
index? And how much data are we talking about here? 10M, 100G?
This makes an enormous difference in your options.

I hope you're aware that Lucene doc IDs can change as the index is
modified, so I assume your "keys" are something you created and
NOT the Lucene doc IDs.
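For example, something along these lines (an untested sketch against the 2.x
API; the field names are just placeholders):

    IndexWriter writer = new IndexWriter(FSDirectory.getDirectory("/path/to/index"),
        new StandardAnalyzer(), true);  // true = create a new index

    // a "content" doc: your own stable key, NOT the Lucene doc ID
    Document contentDoc = new Document();
    contentDoc.add(new Field("key", "Abc", Field.Store.YES, Field.Index.UN_TOKENIZED));
    contentDoc.add(new Field("content", "blah blah 123 ...", Field.Store.NO, Field.Index.TOKENIZED));
    writer.addDocument(contentDoc);

    // an "access" doc in the same index just uses different fields
    Document accessDoc = new Document();
    accessDoc.add(new Field("key", "Abc", Field.Store.YES, Field.Index.UN_TOKENIZED));
    accessDoc.add(new Field("who", "David", Field.Store.YES, Field.Index.UN_TOKENIZED));
    writer.addDocument(accessDoc);
    writer.close();

A query on who: can then never match the content docs, because they simply
don't have that field.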

Best
Erick

On Tue, May 6, 2008 at 3:03 PM, Michael Siu <[EMAIL PROTECTED]> wrote:

> My problem is: the [content] value can be huge. Duplicating it in more than
> one index document wastes disk space (and search time?). In addition, when
> new documents are added to the second index, it will be faster to just
> index the linked [content] once (in the first index file), and any
> subsequent reference to the same [content] will not need to be re-indexed.
>
> In fact, I do not really need to physically separate the indices into 2
> files, because Lucene supports heterogeneous documents in an index file. I
> have no idea of how that works in a search. Does anyone know?
>
> Thanks in advance.
>
>
>
> -Original Message-
> From: Erick Erickson [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, May 06, 2008 9:36 AM
> To: java-user@lucene.apache.org
> Subject: Re: How to make a query that associates 2 index files
>
> You don't. You really have to roll your own solution here; there's
> no "inter-index" awareness that I know of in Lucene.
>
> Typically, people either do a half-half solution (that is, put the
> text search in Lucene and leave the DB parts in the DB) or
> de-normalize the data in a Lucene index so you don't have
> to even try to do things cross-index.
>
> And then there's Marcelo Ochoa (I may have the spelling wrong) who's
> put together some way to embed Lucene in an Oracle database, but
> that's magic to me.
>
> Why do you want to arrange your index this way in the first place?
> Perhaps there's a more specific answer waiting out there...
>
> Best
> Erick
>
> On Tue, May 6, 2008 at 12:14 PM, Michael Siu <[EMAIL PROTECTED]>
> wrote:
>
> > Hi,
> >
> >
> >
> > I am a newbie to Lucene. I have a question about making a query that
> > associates
> > 2 index files:
> >
> >
> >
> > - One index has the content index for a list of documents and a key to
> the
> > document. That means the Lucene document of this index contains 2
> fields:
> >
> > the 'content' and the 'key'.
> >
> > - Another index has some data indexed and associated with the 'key' in
> the
> > previous index. The Lucene document of this index contains several
> fields:
> >
> > the 'who' that contains some data and the 'key' that _points_ to the
> > document in the first index.
> >
> >
> >
> > Sample data:
> >
> > Index_1:   [key] [content]
> >
> >Abc   "blah blah 123 ..."
> >
> >Xyz   "123 321 a nice day ..."
> >
> >
> >
> > Index_2:   [who]    [accessed]    [key]
> >
> >    David    1/1/2007      Abc
> >
> >    Someone  1/2/2005      Abc
> >
> >    Guess    12/1/2000     Xyz
> >
> >    Harry    1/1/2008      Abc
> >
> >    Sandra   1/1/2003      Xyz
> >
> >
> >
> > As shown, the [key] field in Index_2 has repeated values that _point_ to
> > the
> > [key] values in Index_1. How do I make a query for the following:
> >
> >
> >
> > Find out all documents in Index_2:
> >
> > - [who] is in range of 'David' to 'Guess' and
> >
> > - [accessed] in range '1/1/1900' to '1/1/2010' and
> >
> > - [key] associated [content] in Index_1 that contains the term 'blah'
> >
> >
> >
> > I know this is more of a SQL-like query. Is Lucene capable of doing this
> > type of query that needs associations among index files?
> >
> >
> >
> > Thanks in advance.
> >
> >
> >
> > - m
> >
> >
> >
> >
> >
> >
> >
> >
>
>
>
>





Re: Help to solve an issue when upgrading Lucene-Oracle integration to lucene 2.3.1

2008-05-06 Thread Marcelo Ochoa
Hi Mike:
  Well, the problem is consistent, but to test the code and the
project you need an Oracle 11g database :(
  I don't know why the computation of the bufferUpto variable is wrong in
the last step; during all the other calls pool.buffers.length is
consistently 366, so I assume that it is OK. And the value 8192 for
bufferUpto is suspicious, because it looks like a bit shift or an overrun.
  How is the bufferUpto variable computed?
  Is it computed from segment info already written to disk, or is it in
memory at this point?
  I am still investigating the problem by adding more debugging information.
  Best regards, Marcelo.
On Tue, May 6, 2008 at 7:00 PM, Michael McCandless
<[EMAIL PROTECTED]> wrote:
>
>  Hi Marcelo,
>
>  Hmmm something is not right.
>
>  Somehow the byte slices, which DocumentsWriter uses to hold the postings in
> RAM, became corrupt.
>
>  Is this easily reproduced?
>
>  Mike
>
>  Marcelo Ochoa wrote:
>
> >
> >
> >
> > Hi Lucene experts:
> >  I am working on upgrading the Lucene-Oracle integration project to the
> > latest Lucene 2.3.1 code.
> >  After correcting a minor issue in the OJVMDirectory file implementation I
> > have the integration running with the latest 2.3.1 code.
> >  But it only works with small indexes, I think indexes that are smaller
> > than the memory threshold.
> >  If I index a big table, I get this exception:
> >  Exception in thread "Root Thread" java.lang.ArrayIndexOutOfBoundsException
> >    at org.apache.lucene.index.DocumentsWriter$ByteSliceReader.nextSlice(DocumentsWriter.java)
> >    at org.apache.lucene.index.DocumentsWriter$ByteSliceReader.readByte(DocumentsWriter.java)
> >    at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java)
> >    at org.apache.lucene.index.DocumentsWriter.appendPostings(DocumentsWriter.java)
> >    at org.apache.lucene.index.DocumentsWriter.writeSegment(DocumentsWriter.java:2011)
> >    at org.apache.lucene.index.DocumentsWriter.flush(DocumentsWriter.java:548)
> >    at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:2497)
> >    at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:2397)
> >    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java)
> >    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java)
> >    at org.apache.lucene.indexer.TableIndexer.index(TableIndexer.java)
> >    at org.apache.lucene.indexer.LuceneDomainIndex.ODCIIndexCreate(LuceneDomainIndex.java:477)
> >
> >  Something is wrong in the nextSlice method :(
> >  I added some System.out info and, before the exception is thrown,
> > the index and arrays have this information:
> >
> > .nextSlice (previous)
> > limit: 11393 buffer.length: 32768
> > level: 0 nextLevelArray.length: 10
> > bufferUpto: 147 pool.buffers.length: 366
> > .nextSlice (current)
> > limit: 6189 buffer.length: 32768
> > level: 0 nextLevelArray.length: 10
> > bufferUpto: 8192 pool.buffers.length: 366
> >
> >   As you can see, the bufferUpto variable has the value 8192 while
> > pool.buffers is an array of 366 elements; this causes the exception.
> > The nextSlice() method is:
> >public void nextSlice() {
> >
> >  // Skip to our next slice
> >  System.out.println(".nextSlice");
> >  System.out.println("limit: "+limit+" buffer.length: "+buffer.length);
> >  final int nextIndex = ((buffer[limit]&0xff)<<24) +
> > ((buffer[1+limit]&0xff)<<16) + ((buffer[2+limit]&0xff)<<8) +
> > (buffer[3+limit]&0xff);
> >
> >  System.out.println("level: "+level+" nextLevelArray.length:
> > "+nextLevelArray.length);
> >  level = nextLevelArray[level];
> >  final int newSize = levelSizeArray[level];
> >
> >  bufferUpto = nextIndex / BYTE_BLOCK_SIZE;
> >  bufferOffset = bufferUpto * BYTE_BLOCK_SIZE;
> >
> >  System.out.println("bufferUpto: "+bufferUpto+"
> > pool.buffers.length: "+pool.buffers.length);
> >  buffer = pool.buffers[bufferUpto];
> >  upto = nextIndex & BYTE_BLOCK_MASK;
> >
> >  if (nextIndex + newSize >= endIndex) {
> >// We are advancing to the final slice
> >assert endIndex - nextIndex > 0;
> >limit = endIndex - bufferOffset;
> >  } else {
> >// This is not the final slice (subtract 4 for the
> >// forwarding address at the end of this new slice)
> >limit = upto+newSize-4;
> >  }
> >}
> >
> >IndexWriter InfoStream information is:
> >  RAM: now flush @ usedMB=53.016 allocMB=53.016 triggerMB=53
> > IW 0 [Root Thread]:   flush: segment=_0 docStoreSegment=_0
> > docStoreOffset=0 flushDocs=true flushDeletes=false
> > flushDocStores=false numDocs=85
> > 564 numBufDelTerms=0
> > IW 0 [Root Thread]:   index before flush
> > flush postings as segment _0 numDocs=85564
> > . nextSlice output here.
> > docWriter: now abort
> > IW 1 [Root Thread]: hit exception flushing segment _0
> > docWriter: now abort
> > IFD [Root Thread]: now checkpoint "segments_1" [0 segments ; isCommit =
> false]
> > IFD [Root Thread]: refre

Re: Help to solve an issue when upgrading Lucene-Oracle integration to lucene 2.3.1

2008-05-06 Thread Michael McCandless


Hi Marcelo,

Hmmm something is not right.

Somehow the byte slices, which DocumentsWriter uses to hold the  
postings in RAM, became corrupt.


Is this easily reproduced?

Mike

Marcelo Ochoa wrote:

Hi Lucene experts:
  I am working on upgrading the Lucene-Oracle integration project to the
latest Lucene 2.3.1 code.
 After correcting a minor issue in the OJVMDirectory file implementation I
have the integration running with the latest 2.3.1 code.
 But it only works with small indexes, I think indexes that are smaller
than the memory threshold.
 If I index a big table, I get this exception:
 Exception in thread "Root Thread" java.lang.ArrayIndexOutOfBoundsException
        at org.apache.lucene.index.DocumentsWriter$ByteSliceReader.nextSlice(DocumentsWriter.java)
        at org.apache.lucene.index.DocumentsWriter$ByteSliceReader.readByte(DocumentsWriter.java)
        at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java)
        at org.apache.lucene.index.DocumentsWriter.appendPostings(DocumentsWriter.java)
        at org.apache.lucene.index.DocumentsWriter.writeSegment(DocumentsWriter.java:2011)
        at org.apache.lucene.index.DocumentsWriter.flush(DocumentsWriter.java:548)
        at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:2497)
        at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:2397)
        at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java)
        at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java)
        at org.apache.lucene.indexer.TableIndexer.index(TableIndexer.java)
        at org.apache.lucene.indexer.LuceneDomainIndex.ODCIIndexCreate(LuceneDomainIndex.java:477)


  Something is wrong in the nextSlice method :(
  I added some System.out info and, before the exception is thrown,
the index and arrays have this information:

.nextSlice (previous)
limit: 11393 buffer.length: 32768
level: 0 nextLevelArray.length: 10
bufferUpto: 147 pool.buffers.length: 366
.nextSlice (current)
limit: 6189 buffer.length: 32768
level: 0 nextLevelArray.length: 10
bufferUpto: 8192 pool.buffers.length: 366

   As you can see, the bufferUpto variable has the value 8192 while
pool.buffers is an array of 366 elements; this causes the exception.
The nextSlice() method is:
public void nextSlice() {

  // Skip to our next slice
  System.out.println(".nextSlice");
  System.out.println("limit: "+limit+" buffer.length: "+buffer.length);

  final int nextIndex = ((buffer[limit]&0xff)<<24) +
((buffer[1+limit]&0xff)<<16) + ((buffer[2+limit]&0xff)<<8) +
(buffer[3+limit]&0xff);

  System.out.println("level: "+level+" nextLevelArray.length:
"+nextLevelArray.length);
  level = nextLevelArray[level];
  final int newSize = levelSizeArray[level];

  bufferUpto = nextIndex / BYTE_BLOCK_SIZE;
  bufferOffset = bufferUpto * BYTE_BLOCK_SIZE;

  System.out.println("bufferUpto: "+bufferUpto+"
pool.buffers.length: "+pool.buffers.length);
  buffer = pool.buffers[bufferUpto];
  upto = nextIndex & BYTE_BLOCK_MASK;

  if (nextIndex + newSize >= endIndex) {
// We are advancing to the final slice
assert endIndex - nextIndex > 0;
limit = endIndex - bufferOffset;
  } else {
// This is not the final slice (subtract 4 for the
// forwarding address at the end of this new slice)
limit = upto+newSize-4;
  }
}

IndexWriter InfoStream information is:
  RAM: now flush @ usedMB=53.016 allocMB=53.016 triggerMB=53
IW 0 [Root Thread]:   flush: segment=_0 docStoreSegment=_0
docStoreOffset=0 flushDocs=true flushDeletes=false
flushDocStores=false numDocs=85
564 numBufDelTerms=0
IW 0 [Root Thread]:   index before flush
flush postings as segment _0 numDocs=85564
. nextSlice output here.
docWriter: now abort
IW 1 [Root Thread]: hit exception flushing segment _0
docWriter: now abort
IFD [Root Thread]: now checkpoint "segments_1" [0 segments ; isCommit = false]

IFD [Root Thread]: refresh [prefix=_0]: removing newly created
unreferenced file "_0.tii"
IFD [Root Thread]: delete "_0.tii"
IFD [Root Thread]: refresh [prefix=_0]: removing newly created
unreferenced file "_0.fdx"
IFD [Root Thread]: delete "_0.fdx"
IFD [Root Thread]: refresh [prefix=_0]: removing newly created
unreferenced file "_0.fnm"
IFD [Root Thread]: delete "_0.fnm"
IFD [Root Thread]: refresh [prefix=_0]: removing newly created
unreferenced file "_0.fdt"
IFD [Root Thread]: delete "_0.fdt"
IFD [Root Thread]: refresh [prefix=_0]: removing newly created
unreferenced file "_0.prx"
IFD [Root Thread]: delete "_0.prx"
IFD [Root Thread]: refresh [prefix=_0]: removing newly created
unreferenced file "_0.frq"
IFD [Root Thread]: delete "_0.frq"
IFD [Root Thread]: refresh [prefix=_0]: removing newly created
unreferenced file "_0.tis"
IFD [Root Thread]: delete "_0.tis"
IW 1 [Root Thread]: now flush at close
IW 1 [Root Thread]:   flush: segment=null docStoreSegment=null
docStoreOffset=0 flushDocs=false flushDelet

Help to solve an issue when upgrading Lucene-Oracle integration to lucene 2.3.1

2008-05-06 Thread Marcelo Ochoa
Hi Lucene experts:
  I am working on upgrading the Lucene-Oracle integration project to the
latest Lucene 2.3.1 code.
 After correcting a minor issue in the OJVMDirectory file implementation I
have the integration running with the latest 2.3.1 code.
 But it only works with small indexes, I think indexes that are smaller
than the memory threshold.
 If I index a big table, I get this exception:
 Exception in thread "Root Thread" java.lang.ArrayIndexOutOfBoundsException
        at org.apache.lucene.index.DocumentsWriter$ByteSliceReader.nextSlice(DocumentsWriter.java)
        at org.apache.lucene.index.DocumentsWriter$ByteSliceReader.readByte(DocumentsWriter.java)
        at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java)
        at org.apache.lucene.index.DocumentsWriter.appendPostings(DocumentsWriter.java)
        at org.apache.lucene.index.DocumentsWriter.writeSegment(DocumentsWriter.java:2011)
        at org.apache.lucene.index.DocumentsWriter.flush(DocumentsWriter.java:548)
        at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:2497)
        at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:2397)
        at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java)
        at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java)
        at org.apache.lucene.indexer.TableIndexer.index(TableIndexer.java)
        at org.apache.lucene.indexer.LuceneDomainIndex.ODCIIndexCreate(LuceneDomainIndex.java:477)

  Something is wrong in the nextSlice method :(
  I added some System.out info and, before the exception is thrown,
the index and arrays have this information:

.nextSlice (previous)
limit: 11393 buffer.length: 32768
level: 0 nextLevelArray.length: 10
bufferUpto: 147 pool.buffers.length: 366
.nextSlice (current)
limit: 6189 buffer.length: 32768
level: 0 nextLevelArray.length: 10
bufferUpto: 8192 pool.buffers.length: 366

   As you can see, the bufferUpto variable has the value 8192 while
pool.buffers is an array of 366 elements; this causes the exception.
The nextSlice() method is:
public void nextSlice() {

  // Skip to our next slice
  System.out.println(".nextSlice");
  System.out.println("limit: "+limit+" buffer.length: "+buffer.length);
  final int nextIndex = ((buffer[limit]&0xff)<<24) +
((buffer[1+limit]&0xff)<<16) + ((buffer[2+limit]&0xff)<<8) +
(buffer[3+limit]&0xff);

  System.out.println("level: "+level+" nextLevelArray.length:
"+nextLevelArray.length);
  level = nextLevelArray[level];
  final int newSize = levelSizeArray[level];

  bufferUpto = nextIndex / BYTE_BLOCK_SIZE;
  bufferOffset = bufferUpto * BYTE_BLOCK_SIZE;

  System.out.println("bufferUpto: "+bufferUpto+"
pool.buffers.length: "+pool.buffers.length);
  buffer = pool.buffers[bufferUpto];
  upto = nextIndex & BYTE_BLOCK_MASK;

  if (nextIndex + newSize >= endIndex) {
// We are advancing to the final slice
assert endIndex - nextIndex > 0;
limit = endIndex - bufferOffset;
  } else {
// This is not the final slice (subtract 4 for the
// forwarding address at the end of this new slice)
limit = upto+newSize-4;
  }
}

IndexWriter InfoStream information is:
  RAM: now flush @ usedMB=53.016 allocMB=53.016 triggerMB=53
IW 0 [Root Thread]:   flush: segment=_0 docStoreSegment=_0
docStoreOffset=0 flushDocs=true flushDeletes=false
flushDocStores=false numDocs=85
564 numBufDelTerms=0
IW 0 [Root Thread]:   index before flush
flush postings as segment _0 numDocs=85564
. nextSlice output here.
docWriter: now abort
IW 1 [Root Thread]: hit exception flushing segment _0
docWriter: now abort
IFD [Root Thread]: now checkpoint "segments_1" [0 segments ; isCommit = false]
IFD [Root Thread]: refresh [prefix=_0]: removing newly created
unreferenced file "_0.tii"
IFD [Root Thread]: delete "_0.tii"
IFD [Root Thread]: refresh [prefix=_0]: removing newly created
unreferenced file "_0.fdx"
IFD [Root Thread]: delete "_0.fdx"
IFD [Root Thread]: refresh [prefix=_0]: removing newly created
unreferenced file "_0.fnm"
IFD [Root Thread]: delete "_0.fnm"
IFD [Root Thread]: refresh [prefix=_0]: removing newly created
unreferenced file "_0.fdt"
IFD [Root Thread]: delete "_0.fdt"
IFD [Root Thread]: refresh [prefix=_0]: removing newly created
unreferenced file "_0.prx"
IFD [Root Thread]: delete "_0.prx"
IFD [Root Thread]: refresh [prefix=_0]: removing newly created
unreferenced file "_0.frq"
IFD [Root Thread]: delete "_0.frq"
IFD [Root Thread]: refresh [prefix=_0]: removing newly created
unreferenced file "_0.tis"
IFD [Root Thread]: delete "_0.tis"
IW 1 [Root Thread]: now flush at close
IW 1 [Root Thread]:   flush: segment=null docStoreSegment=null
docStoreOffset=0 flushDocs=false flushDeletes=false
flushDocStores=false numDocs=0 numBufDelTerms=0
IW 1 [Root Thread]:   index before flush
IW 1 [Root Thread]: close: wrote segments file "segments_2"
IFD [Root Thread]: now checkpoint "segments_2" [0 segments ; isCommit = t

Re: How to make a query that associates 2 index files

2008-05-06 Thread Erick Erickson
Sure, just include different fields in different docs in your index.
Then, when you search, since each term is on a field, docs without
that field are excluded from the search.

But this is really not very different in terms of a solution than
your earlier one. You still have the issue of searching the index
once to get the keys, then using those keys as part of another
search.

Are you saying that there is a many-to-one mapping to your content
index? And how much data are we talking about here? 10M, 100G?
This makes an enormous difference in your options.

I hope you're aware that Lucene doc IDs can change as the index is
modified, so I assume your "keys" are something you created and
NOT the Lucene doc IDs.

Best
Erick

On Tue, May 6, 2008 at 3:03 PM, Michael Siu <[EMAIL PROTECTED]> wrote:

> My problem is: the [content] value can be huge. Duplicating it in more than
> one index document wastes disk space (and search time?). In addition, when
> new documents are added to the second index, it will be faster to just
> index the linked [content] once (in the first index file), and any
> subsequent reference to the same [content] will not need to be re-indexed.
>
> In fact, I do not really need to physically separate the indices into 2
> files, because Lucene supports heterogeneous documents in an index file. I
> have no idea of how that works in a search. Does anyone know?
>
> Thanks in advance.
>
>
>
> -Original Message-
> From: Erick Erickson [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, May 06, 2008 9:36 AM
> To: java-user@lucene.apache.org
> Subject: Re: How to make a query that associates 2 index files
>
> You don't. You really have to roll your own solution here; there's
> no "inter-index" awareness that I know of in Lucene.
>
> Typically, people either do a half-half solution (that is, put the
> text search in Lucene and leave the DB parts in the DB) or
> de-normalize the data in a Lucene index so you don't have
> to even try to do things cross-index.
>
> And then there's Marcelo Ochoa (I may have the spelling wrong) who's
> put together some way to embed Lucene in an Oracle database, but
> that's magic to me.
>
> Why do you want to arrange your index this way in the first place?
> Perhaps there's a more specific answer waiting out there...
>
> Best
> Erick
>
> On Tue, May 6, 2008 at 12:14 PM, Michael Siu <[EMAIL PROTECTED]>
> wrote:
>
> > Hi,
> >
> >
> >
> > I am a newbie to Lucene. I have a question about making a query that
> > associates
> > 2 index files:
> >
> >
> >
> > - One index has the content index for a list of documents and a key to
> the
> > document. That means the Lucene document of this index contains 2
> fields:
> >
> > the 'content' and the 'key'.
> >
> > - Another index has some data indexed and associated with the 'key' in
> the
> > previous index. The Lucene document of this index contains several
> fields:
> >
> > the 'who' that contains some data and the 'key' that _points_ to the
> > document in the first index.
> >
> >
> >
> > Sample data:
> >
> > Index_1:   [key] [content]
> >
> >Abc   "blah blah 123 ..."
> >
> >Xyz   "123 321 a nice day ..."
> >
> >
> >
> > Index_2:   [who]    [accessed]    [key]
> >
> >    David    1/1/2007      Abc
> >
> >    Someone  1/2/2005      Abc
> >
> >    Guess    12/1/2000     Xyz
> >
> >    Harry    1/1/2008      Abc
> >
> >    Sandra   1/1/2003      Xyz
> >
> >
> >
> > As shown, the [key] field in Index_2 has repeated values that _point_ to
> > the
> > [key] values in Index_1. How do I make a query for the following:
> >
> >
> >
> > Find out all documents in Index_2:
> >
> > - [who] is in range of 'David' to 'Guess' and
> >
> > - [accessed] in range '1/1/1900' to '1/1/2010' and
> >
> > - [key] associated [content] in Index_1 that contains the term 'blah'
> >
> >
> >
> > I know this is more of a SQL-like query. Is Lucene capable of doing this
> > type of query that needs associations among index files?
> >
> >
> >
> > Thanks in advance.
> >
> >
> >
> > - m
> >
> >
> >
> >
> >
> >
> >
> >
>
>
>
>


RE: How to make a query that associates 2 index files

2008-05-06 Thread Michael Siu
My problem is: the [content] value can be huge. Duplicating it in more than
one index document wastes disk space (and search time?). In addition, when
new documents are added to the second index, it will be faster to just index
the linked [content] once (in the first index file), and any subsequent
reference to the same [content] will not need to be re-indexed.

In fact, I do not really need to physically separate the indices into 2
files, because Lucene supports heterogeneous documents in an index file. I
have no idea of how that works in a search. Does anyone know?

Thanks in advance.



-Original Message-
From: Erick Erickson [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, May 06, 2008 9:36 AM
To: java-user@lucene.apache.org
Subject: Re: How to make a query that associates 2 index files

You don't. You really have to roll your own solution here; there's
no "inter-index" awareness that I know of in Lucene.

Typically, people either do a half-half solution (that is, put the
text search in Lucene and leave the DB parts in the DB) or
de-normalize the data in a Lucene index so you don't have
to even try to do things cross-index.

And then there's Marcelo Ochoa (I may have the spelling wrong) who's
put together some way to embed Lucene in an Oracle database, but
that's magic to me.

Why do you want to arrange your index this way in the first place?
Perhaps there's a more specific answer waiting out there...

Best
Erick

On Tue, May 6, 2008 at 12:14 PM, Michael Siu <[EMAIL PROTECTED]>
wrote:

> Hi,
>
>
>
> I am a newbie to Lucene. I have a question about making a query that
> associates
> 2 index files:
>
>
>
> - One index has the content index for a list of documents and a key to the
> document. That means the Lucene document of this index contains 2 fields:
>
> the 'content' and the 'key'.
>
> - Another index has some data indexed and associated with the 'key' in the
> previous index. The Lucene document of this index contains several fields:
>
> the 'who' that contains some data and the 'key' that _points_ to the
> document in the first index.
>
>
>
> Sample data:
>
> Index_1:   [key] [content]
>
>Abc   "blah blah 123 ..."
>
>Xyz   "123 321 a nice day ..."
>
>
>
> Index_2:   [who]    [accessed]    [key]
>
>    David    1/1/2007      Abc
>
>    Someone  1/2/2005      Abc
>
>    Guess    12/1/2000     Xyz
>
>    Harry    1/1/2008      Abc
>
>    Sandra   1/1/2003      Xyz
>
>
>
> As shown, the [key] field in Index_2 has repeated values that _point_ to
> the
> [key] values in Index_1. How do I make a query for the following:
>
>
>
> Find out all documents in Index_2:
>
> - [who] is in range of 'David' to 'Guess' and
>
> - [accessed] in range '1/1/1900' to '1/1/2010' and
>
> - [key] associated [content] in Index_1 that contains the term 'blah'
>
>
>
> I know this is more of a SQL-like query. Is Lucene capable of doing this
> type of query that needs associations among index files?
>
>
>
> Thanks in advance.
>
>
>
> - m
>
>
>
>
>
>
>
>





Re: Are those runtime errors about the jdk, or lucene's jar, or my code?

2008-05-06 Thread Michael McCandless


Hi,

Could you run org.apache.lucene.index.CheckIndex on your index and  
post the result?


Are these exceptions easily reproduced starting from scratch (new  
index)?


More responses/questions below:

crspan wrote:



-- OS: Linux lg99 2.6.5-7.276-smp #1 SMP Fri Sep 28 20:33:22 AKDT  
2007 x86_64 x86_64 x86_64 GNU/Linux


-- Lucene:  2.3.2  (tried 2.2.0 as well, since the index was built  
around 2.2.0, jdk1.6.0_01 )


Do you see these same exceptions when you run on Lucene 2.2.0?

-- JDK:  Sun jdk1.6.0_06 ( from jdk-6u6-linux-x64.bin ) &  Sun  
jdk1.5.0_15 ( from jdk-1_5_0_15-linux-amd64.bin)
( both installed locally in the user's home directory WITHOUT root  
privilege.)


-- Source code:

System.out.print("\n\n Range = "+range+"\nQuery = "+q.toString()+"\n") ;

tds = is.search( q, (Filter)null, range );

-- Stack traces (1):

Range = 500
Query = TEXT:illeg^30.820824 TEXT:technolog^22.290413  
TEXT:transfer^33.307804 TEXT:bipartisan^20.942562  
TEXT:laboratori^18.500801 TEXT:norm^21.193087  
TEXT:counterintellig^29.724474 TEXT:spi^19.285275  
TEXT:lab^20.497044 TEXT:american^11.090684 TEXT:question^11.929131  
TEXT:review^14.588552 TEXT:obtain^17.56319 TEXT:commun^12.5947275  
TEXT:nation^10.737445 TEXT:offici^11.375352 TEXT:rep^17.646774  
TEXT:contribut^15.35846 TEXT:report^11.633566  
TEXT:congress^14.976282 TEXT:justic^16.433678 TEXT:govern^12.003913  
TEXT:declassifi^31.553194 TEXT:campaign^14.959521  
TEXT:inform^14.187338 TEXT:compani^13.717714  
TEXT:classifi^23.613848 TEXT:washington^13.995003  
TEXT:hugh^23.138725 TEXT:issu^14.177698 TEXT:space^18.239595
TEXT:1996^16.198292 TEXT:rocket^21.983511 TEXT:administr^17.11987
TEXT:satellit^21.777317 TEXT:nuclear^20.927034  
TEXT:republican^18.929497 TEXT:committe^18.195517  
TEXT:intellig^21.868582 TEXT:hous^17.309698 TEXT:democrat^18.528954  
TEXT:investig^19.492653 TEXT:panel^22.208527 TEXT:senat^20.456139  
TEXT:chines^19.726551 TEXT:sensit^23.52441 TEXT:secur^20.280426  
TEXT:depart^21.874023 TEXT:missil^25.32581 TEXT:illeg^27.417799  
TEXT:loral^41.9551 TEXT:transfer^33.933247



QueryString:
illeg^30.820824 technolog^22.290413 transfer^33.307804
Error: java.lang.ArrayIndexOutOfBoundsException:  
132704java.lang.ArrayIndexOutOfBoundsException: 132704
   at org.apache.lucene.search.BooleanScorer2$Coordinator.coordFactor(BooleanScorer2.java:55)
   at org.apache.lucene.search.BooleanScorer2.score(BooleanScorer2.java:358)
   at org.apache.lucene.search.BooleanScorer2.score(BooleanScorer2.java:320)
   at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:146)
   at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:113)

   at org.apache.lucene.search.Searcher.search(Searcher.java:132)
   at org.cr.search.TrecQueryRelevanceFeedback.main(TrecQueryRelevanceFeedback.java:785)


I'm not sure what could cause this one.


-- Source code:

 TermFreqVector[] termsV = reader.getTermFreqVectors(docID);


-- Stack traces (2):

Range = 500
Query = TEXT:oceanograph^68.48028 TEXT:vessel^43.191563


QueryString:
oceanograph^68.48028 vessel^43.191563
Error: java.lang.ArrayIndexOutOfBoundsExceptionjava.lang.ArrayIndexOutOfBoundsException
   at org.apache.lucene.index.TermVectorsReader.readTermVector(TermVectorsReader.java:353)
   at org.apache.lucene.index.TermVectorsReader.readTermVectors(TermVectorsReader.java:287)
   at org.apache.lucene.index.TermVectorsReader.get(TermVectorsReader.java:232)
   at org.apache.lucene.index.SegmentReader.getTermFreqVectors(SegmentReader.java:981)
   at org.cr.rf.RelevanceFeedback.RelFeedbackWeight(RelevanceFeedback.java:145)
   at org.cr.search.TrecQueryRelevanceFeedback.main(TrecQueryRelevanceFeedback.java:789)




This looks like index corruption.  If you run CheckIndex it should  
detect this.  Did you hit any previous exceptions when writing to  
this index?  Is it possible to send me a copy of the index?




--  Other Info:

* This index can be searched in other programs in the same environment.
* The same program runs just fine on Windows (1.6.0-b105 & 1.5.0_03-b07)
and on HP-UX (1.5.0.05) without those runtime errors.


This is very strange.  These same programs that create the above two
exceptions run fine on Windows & HP-UX?  Did you copy the index
between these machines, or is it the very same index on a shared mount?



-- My questions:



-? What is your reading of those two stack traces?


-? Where did

QueryString:
illeg^30.820824 technolog^22.290413 transfer^33.307804

get printed? It is NOT from my code, so is it part of Lucene's
error message? Which line in Lucene produces this printout?


I don't know!



-?? As you can see from the line

System.out.print("\n\n Range = "+range+"\nQuery = "+q.toString()+"\n") ;


it just printed the 50 terms in the query:

Range = 500
Query = TEXT:illeg^30.820824 TEXT:technolog^22.290413  
TEXT:transfer^33.307804 TEXT:bipar

Are those runtime errors about the jdk, or lucene's jar, or my code?

2008-05-06 Thread crspan



-- OS: Linux lg99 2.6.5-7.276-smp #1 SMP Fri Sep 28 20:33:22 AKDT 2007 
x86_64 x86_64 x86_64 GNU/Linux


-- Lucene:  2.3.2  (tried 2.2.0 as well, since the index was built 
around 2.2.0, jdk1.6.0_01 )


-- JDK:  Sun jdk1.6.0_06 ( from jdk-6u6-linux-x64.bin ) &  Sun 
jdk1.5.0_15 ( from jdk-1_5_0_15-linux-amd64.bin)
( both installed locally in the user's home directory WITHOUT root 
privilege.)


-- Source code:

System.out.print("\n\n Range = "+range+"\nQuery = "+q.toString()+"\n") ;
tds = is.search( q, (Filter)null, range );

-- Stack traces (1):

Range = 500
Query = TEXT:illeg^30.820824 TEXT:technolog^22.290413 
TEXT:transfer^33.307804 TEXT:bipartisan^20.942562 
TEXT:laboratori^18.500801 TEXT:norm^21.193087 
TEXT:counterintellig^29.724474 TEXT:spi^19.285275 TEXT:lab^20.497044 
TEXT:american^11.090684 TEXT:question^11.929131 TEXT:review^14.588552 
TEXT:obtain^17.56319 TEXT:commun^12.5947275 TEXT:nation^10.737445 
TEXT:offici^11.375352 TEXT:rep^17.646774 TEXT:contribut^15.35846 
TEXT:report^11.633566 TEXT:congress^14.976282 TEXT:justic^16.433678 
TEXT:govern^12.003913 TEXT:declassifi^31.553194 TEXT:campaign^14.959521 
TEXT:inform^14.187338 TEXT:compani^13.717714 TEXT:classifi^23.613848 
TEXT:washington^13.995003 TEXT:hugh^23.138725 TEXT:issu^14.177698 
TEXT:space^18.239595 TEXT:1996^16.198292 TEXT:rocket^21.983511 
TEXT:administr^17.11987 TEXT:satellit^21.777317 TEXT:nuclear^20.927034 
TEXT:republican^18.929497 TEXT:committe^18.195517 
TEXT:intellig^21.868582 TEXT:hous^17.309698 TEXT:democrat^18.528954 
TEXT:investig^19.492653 TEXT:panel^22.208527 TEXT:senat^20.456139 
TEXT:chines^19.726551 TEXT:sensit^23.52441 TEXT:secur^20.280426 
TEXT:depart^21.874023 TEXT:missil^25.32581 TEXT:illeg^27.417799 
TEXT:loral^41.9551 TEXT:transfer^33.933247



QueryString:
illeg^30.820824 technolog^22.290413 transfer^33.307804
Error: java.lang.ArrayIndexOutOfBoundsException: 132704java.lang.ArrayIndexOutOfBoundsException: 132704
   at org.apache.lucene.search.BooleanScorer2$Coordinator.coordFactor(BooleanScorer2.java:55)
   at org.apache.lucene.search.BooleanScorer2.score(BooleanScorer2.java:358)
   at org.apache.lucene.search.BooleanScorer2.score(BooleanScorer2.java:320)
   at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:146)
   at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:113)

   at org.apache.lucene.search.Searcher.search(Searcher.java:132)
   at org.cr.search.TrecQueryRelevanceFeedback.main(TrecQueryRelevanceFeedback.java:785)






-- Source code:

 TermFreqVector[] termsV = reader.getTermFreqVectors(docID);

-- Stack traces (2):

Range = 500
Query = TEXT:oceanograph^68.48028 TEXT:vessel^43.191563


QueryString:
oceanograph^68.48028 vessel^43.191563
Error: java.lang.ArrayIndexOutOfBoundsExceptionjava.lang.ArrayIndexOutOfBoundsException
   at org.apache.lucene.index.TermVectorsReader.readTermVector(TermVectorsReader.java:353)
   at org.apache.lucene.index.TermVectorsReader.readTermVectors(TermVectorsReader.java:287)
   at org.apache.lucene.index.TermVectorsReader.get(TermVectorsReader.java:232)
   at org.apache.lucene.index.SegmentReader.getTermFreqVectors(SegmentReader.java:981)
   at org.cr.rf.RelevanceFeedback.RelFeedbackWeight(RelevanceFeedback.java:145)
   at org.cr.search.TrecQueryRelevanceFeedback.main(TrecQueryRelevanceFeedback.java:789)




--  Other Info:

* This index can be searched in other programs in the same environment.
* The same program runs just fine on Windows (1.6.0-b105 & 1.5.0_03-b07)
and on HP-UX (1.5.0.05) without those runtime errors.




-- My questions:



-? What is your reading of those two stack traces?


-? Where did

QueryString:
illeg^30.820824 technolog^22.290413 transfer^33.307804

get printed? It is NOT from my code, so is it part of Lucene's error
message? Which line in Lucene produces this printout?



-?? As you can see from the line

System.out.print("\n\n Range = "+range+"\nQuery = "+q.toString()+"\n") ;

it just printed the 50 terms in the query:

Range = 500
Query = TEXT:illeg^30.820824 TEXT:technolog^22.290413 
TEXT:transfer^33.307804 TEXT:bipartisan^20.942562 
TEXT:laboratori^18.500801 TEXT:norm^21.193087 
TEXT:counterintellig^29.724474 TEXT:spi^19.285275 TEXT:lab^20.497044 
TEXT:american^11.090684 TEXT:question^11.929131 TEXT:review^14.588552 
TEXT:obtain^17.56319 TEXT:commun^12.5947275 TEXT:nation^10.737445 
TEXT:offici^11.375352 TEXT:rep^17.646774 TEXT:contribut^15.35846 
TEXT:report^11.633566 TEXT:congress^14.976282 TEXT:justic^16.433678 
TEXT:govern^12.003913 TEXT:declassifi^31.553194 TEXT:campaign^14.959521 
TEXT:inform^14.187338 TEXT:compani^13.717714 TEXT:classifi^23.613848 
TEXT:washington^13.995003 TEXT:hugh^23.138725 TEXT:issu^14.177698 
TEXT:space^18.239595 TEXT:1996^16.198292 TEXT:rocket^21.983511 
TEXT:administr^17.11987 TEXT:satellit^21.777317 TEXT:nuclear^20.927034 
TEXT:republican^18.929497 TEXT:committe^18.195517 
TE

Re: Postcode/zipcode search

2008-05-06 Thread mark harwood
Can you not convert all postcodes to coordinates and do actual distance-based
matching?

You will have to pay Royal Mail or third-party suppliers to get hold of the PAF
data required for this geocoding (despite having funded this already as a UK
taxpayer <g>).
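Once each record carries geocoded "lat"/"lon" fields, the distance check is
just haversine; a rough, untested sketch (the field names and the idea of
post-filtering hits are my own suggestion, not anything built into Lucene):

    // Haversine distance in km between two lat/lon points given in degrees.
    // Use it to post-filter hits whose stored lat/lon fields fall outside
    // the radius around the query postcode's coordinates.
    static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 6371.0 * 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
    }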

Cheers
Mark

- Original Message 
From: Chris Mannion <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Tuesday, 6 May, 2008 5:28:25 PM
Subject: Postcode/zipcode search

Hi all

I've got a bit of a niggling problem with how one of my searches is working
as opposed to how my users would like it to work.  We're indexing on UK
postcodes, which are in the format of a 3 or 4 character area code followed
by a 3 or 4 character street-specific code, e.g. "NW10 7NY" or "M11 1LQ".
We originally had the values being indexed as tokenized and used a very
simple search string in the format "postcode:xxx xxx", with no grouping or
boosting or fuzzy searching, just a straight search on whatever the user
entered.  This had the benefit of finding exact matches to searches and
allowing us to search just on the area part of the code to return all
records with that area code, e.g. a search on "NW2" returning anything
starting NW2, like "NW2 6TB", "NW2 1ER", etc.

However, the downside to that was that searches could also return records
only tenuously related to what was searched for, e.g. a search for "NW10 7NY"
would also return a record with a postcode "SE9 6NY" because of the slight
match of the "NY".  Obviously this was technically correct, but users
complained because their searches were returning records from completely
different areas.  Our first step to put this right was to take off the
tokenization of the field, which we also weren't happy with, so we have
continued to fiddle.

The current status is as follows - we index the values by stripping out
spaces and tokenizing them, and use a KeywordAnalyzer.  In searching we also
strip spaces from the search term entered and search with a KeywordAnalyzer.
Searches for full postcodes, e.g. "NW10 7NY", find all exact matches but
also any full values that are partial matches (e.g. some records just have
"NW10" as their postcode field and the "NW10 7NY" search pulls them back
too), but searches for partial postcodes, e.g. "NW10", still only find
exact matches, i.e. they only pull back those records that have just "NW10"
as their postcode, rather than anything *starting* with NW10 as we'd like
them to do.

Can anyone help me get this working in the way we need it to, please?

-- 
Chris Mannion
iCasework and LocalAlert implementation team
0208 144 4416









Re: Postcode/zipcode search

2008-05-06 Thread AJ Weber
Maybe I'm oversimplifying it, and maybe this isn't what you desire, but...

What about breaking the postcode into two (or three) different fields?  Seems 
easy to parse on the ingestion-side, as you just break the string at the 
"middle" space.  Then store "postal_area", "postal_street", and optionally the 
original, full "postalcode".  (Probably do not need to tokenize the first two, 
maybe the last one.)
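On the ingestion side, something like this (an untested sketch; the field
names are just the ones suggested above):

    // Split "NW10 7NY" at the middle space into area + street parts.
    String postcode = "NW10 7NY";
    int space = postcode.indexOf(' ');
    String area = space < 0 ? postcode : postcode.substring(0, space);
    String street = space < 0 ? "" : postcode.substring(space + 1);

    Document doc = new Document();
    doc.add(new Field("postal_area", area, Field.Store.YES, Field.Index.UN_TOKENIZED));
    doc.add(new Field("postal_street", street, Field.Store.YES, Field.Index.UN_TOKENIZED));
    // keep the original too, tokenized, for the "broaden your search" fallback
    doc.add(new Field("postalcode", postcode, Field.Store.YES, Field.Index.TOKENIZED));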

Then, and here's where you may throw this idea out entirely, it depends on how 
your searching application/page is set up.  You'd need to apply the values
entered by the user appropriately.  If they enter 2-3 chars with no spaces,
search on the "postal_area" field.  If they enter > 4 chars (including a 
space), you could, again, split the string at the space and search on the two 
individual fields.

If you kept the original, full "postalcode" field, you could always put a link 
on the search results (or maybe only if zero results are returned) saying, 
"Didn't find what you're looking for?  Click here to broaden your search!"  -- 
And in that case send the whole query-string against the postalcode field.

Dunno.  Just an idea.  Good Luck!

-AJ

  - Original Message - 
  From: Chris Mannion 
  To: java-user@lucene.apache.org 
  Sent: Tuesday, May 06, 2008 12:28 PM
  Subject: Postcode/zipcode search


  Hi all

  I've got a bit of a niggling problem with how one of my searches is working
  as opposed to how my users would like it to work.  We're indexing on UK
  postcodes, which are in the format of a 3 or 4 character area code followed
  by a 3 or 4 character street-specific code, e.g. "NW10 7NY" or "M11 1LQ".
  We originally had the values being indexed as tokenized and used a very
  simple search string in the format "postcode:xxx xxx", with no grouping or
  boosting or fuzzy searching, just a straight search on whatever the user
  entered.  This had the benefit of finding exact matches to searches and
  allowing us to search just on the area part of the code to return all
  records with that area code, e.g. a search on "NW2" returning anything
  starting NW2, like "NW2 6TB", "NW2 1ER", etc.

  However, the downside to that was that searches could also return records
  only tenuously related to what was searched for, e.g. a search for "NW10 7NY"
  would also return a record with a postcode "SE9 6NY" because of the slight
  match of the "NY".  Obviously this was technically correct, but users
  complained because their searches were returning records from completely
  different areas.  Our first step to put this right was to take off the
  tokenization of the field, which we also weren't happy with, so we have
  continued to fiddle.

  The current status is as follows - we index the values by stripping out
  spaces and tokenizing them, and use a KeywordAnalyzer.  In searching we also
  strip spaces from the search term entered and search with a KeywordAnalyzer.
  Searches for full postcodes, e.g. "NW10 7NY", find all exact matches but
  also any full values that are partial matches (e.g. some records just have
  "NW10" as their postcode field and the "NW10 7NY" search pulls them back
  too), but searches for partial postcodes, e.g. "NW10", still only find
  exact matches, i.e. they only pull back those records that have just "NW10"
  as their postcode, rather than anything *starting* with NW10 as we'd like
  them to do.

  Can anyone help me get this working in the way we need it to, please?

  -- 
  Chris Mannion
  iCasework and LocalAlert implementation team
  0208 144 4416


RE: Postcode/zipcode search

2008-05-06 Thread Will Johnson
You could split up the field into 2 separate fields:

Postcode:NW10 7NY -> post1:NW10 post2:7NY

Then rewrite users' queries using the same logic: i.e. if they enter 1 term,
'NW10', it gets rewritten to post1:NW10; if they enter 2 terms, post1:NW10 AND
post2:7NY.

It also lets you do fuzzy searches, i.e. post1:NW10 post2:7?Y and so on.
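A sketch of the rewrite (untested, against the 2.x API; post1/post2 as above):

    String userInput = "NW10 7NY";
    String[] parts = userInput.trim().split("\\s+");

    // one term -> area field only; two terms -> both fields ANDed
    BooleanQuery q = new BooleanQuery();
    q.add(new TermQuery(new Term("post1", parts[0])), BooleanClause.Occur.MUST);
    if (parts.length > 1) {
        q.add(new TermQuery(new Term("post2", parts[1])), BooleanClause.Occur.MUST);
    }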

- will

-Original Message-
From: Chris Mannion [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, May 06, 2008 12:28 PM
To: java-user@lucene.apache.org
Subject: Postcode/zipcode search

Hi all

I've got a bit of a niggling problem with how one of my searches is working
as opposed to how my users would like it to work.  We're indexing on UK
postcodes, which are in the format of a 3 or 4 character area code followed
by a 3 or 4 character street-specific code, e.g. "NW10 7NY" or "M11 1LQ".
We originally had the values being indexed as tokenized and used a very
simple search string in the format "postcode:xxx xxx", with no grouping or
boosting or fuzzy searching, just a straight search on whatever the user
entered.  This had the benefit of finding exact matches to searches and
allowing us to search just on the area part of the code to return all
records with that area code, e.g. a search on "NW2" returning anything
starting NW2, like "NW2 6TB", "NW2 1ER", etc.

However, the downside to that was that searches could also return records
only tenuously related to what was searched for, e.g. a search for "NW10 7NY"
would also return a record with a postcode "SE9 6NY" because of the slight
match of the "NY".  Obviously this was technically correct, but users
complained because their searches were returning records from completely
different areas.  Our first step to put this right was to take off the
tokenization of the field, which we also weren't happy with, so we have
continued to fiddle.

The current status is as follows - we index the values by stripping out
spaces and tokenizing them, and use a KeywordAnalyzer.  In searching we also
strip spaces from the search term entered and search with a KeywordAnalyzer.
Searches for full postcodes, e.g. "NW10 7NY", find all exact matches but
also any full values that are partial matches (e.g. some records just have
"NW10" as their postcode field and the "NW10 7NY" search pulls them back
too), but searches for partial postcodes, e.g. "NW10", still only find
exact matches, i.e. they only pull back those records that have just "NW10"
as their postcode, rather than anything *starting* with NW10 as we'd like
them to do.

Can anyone help me get this working in the way we need it to, please?

-- 
Chris Mannion
iCasework and LocalAlert implementation team
0208 144 4416





Re: Multiple Field search

2008-05-06 Thread Erick Erickson
Well, it's the one I'd use. Whether it's the best or not is...er...not so
certain.

Erick
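
P.S. For the archives, the uber-field trick quoted below might look roughly
like this (an untested sketch; writer is an open IndexWriter):

    Document doc = new Document();
    // store (but do NOT index) the display fields
    doc.add(new Field("title", "The greatest hits",
        Field.Store.YES, Field.Index.NO));
    doc.add(new Field("description", "Collection of the best music from The Beatles.",
        Field.Store.YES, Field.Index.NO));
    // index (but do NOT store) the combined search field
    doc.add(new Field("uber", "The greatest hits Collection of the best music from The Beatles.",
        Field.Store.NO, Field.Index.TOKENIZED));
    writer.addDocument(doc);

    // then search: uber:(+greatest +hits +beatles)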

On Tue, May 6, 2008 at 12:37 PM, Kelvin Foo Chuan Lyi <[EMAIL PROTECTED]>
wrote:

> Thanks... that's what I thought of ... but was wondering if that was the
> best method to do so... I guess it is then... :)
>
>
>
>
> On Wed, May 7, 2008 at 12:32 AM, Erick Erickson <[EMAIL PROTECTED]>
> wrote:
>
> > One of my favorite quotes from Roger Zelazny... "postulating
> > infinity, the rest is easy".
> >
> > In this case, "infinity" is how you break up your query. The easy part
> is
> > making your search return what you want.
> >
> > Assuming you know that you want "greatest" and
> > "hits" to go against the title field and "beatles" to
> > go against description, your query looks something
> > like:
> >
> > +title:greatest +title:hits +description:beatles
> >
> > But knowing you want to break up the query like that is the hard part.
> >
> > Sometimes you can make it work well enough by submitting all
> > terms against both fields with an OR clause, something like:
> >
> > title:(+greatest +hits +beatles) description:(+greatest +hits +beatles)
> > (note, OR is implied between title and description)
> >
> > which would not work in your case. Another technique is to dump
> > all the words into a single uber-field *as well as* your individual
> > fields, so the search
> > uber:(+greatest +hits +beatles) would work for you. Note that if
> > you index (but do NOT store) the uber field and store (but do NOT index)
> > the text and description fields, your index stays about the same
> size.
> >
> > Anyway, you need to carefully define how your searches *should* work,
> > then define your index structure IMO.
> >
> > Best
> > Erick
> >
> > On Tue, May 6, 2008 at 12:07 PM, Kelvin Foo Chuan Lyi <[EMAIL PROTECTED]
> >
> > wrote:
> >
> > > I'm new to lucene and have a question on how to create a query for the
> > > following example... Say I have two fields, Title and Description,
> with
> > > the
> > > following data
> > >
> > > Item 1
> > > Title: The greatest hits
> > > Description : Collection of the best music from The Beatles.
> > >
> > > Item 2
> > > Title: U2 collections
> > > Description : Greatest hits collection of U2
> > >
> > >
> > > In my search, I want to search for 'greatest hits beatles'. The result
> > > should return only Item 1.  What should the query look like?
> > >
> > >
> > > Thanks.
> > >
> > >
> > >
> > >
> > > --
> > > Sayonara,
> > > Kelvin
> > >
> >
>
>
>
> --
> Sayonara,
> Kelvin
>


Re: Postcode/zipcode search

2008-05-06 Thread Erick Erickson
Have you looked at PrefixQuery? If that doesn't work for you, could you give
a few more examples of expected inputs and outputs?
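E.g. something like this (untested):

    // matches anything whose postcode field starts with "NW10"
    IndexSearcher searcher = new IndexSearcher("/path/to/index");
    Query q = new PrefixQuery(new Term("postcode", "NW10"));
    Hits hits = searcher.search(q);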

Best
Erick

On Tue, May 6, 2008 at 12:28 PM, Chris Mannion <[EMAIL PROTECTED]>
wrote:

> Hi all
>
> I've got a bit of a niggling problem with how one of my searches is working
> as opposed to how my users would like it to work.  We're indexing on UK
> postcodes, which are in the format of a 3 or 4 character area code followed
> by a 3 or 4 character street-specific code, e.g. "NW10 7NY" or "M11 1LQ".
> We originally had the values being indexed as tokenized and used a very
> simple search string in the format "postcode:xxx xxx", with no grouping or
> boosting or fuzzy searching, just a straight search on whatever the user
> entered.  This had the benefit of finding exact matches to searches and
> allowing us to search just on the area part of the code to return all
> records with that area code, e.g. a search on "NW2" returning anything
> starting NW2, like "NW2 6TB", "NW2 1ER", etc.
>
> However, the downside to that was that searches could also return records
> only tenuously related to what was searched for, e.g. a search for "NW10 7NY"
> would also return a record with a postcode "SE9 6NY" because of the slight
> match of the "NY".  Obviously this was technically correct, but users
> complained because their searches were returning records from completely
> different areas.  Our first step to put this right was to take off the
> tokenization of the field, which we also weren't happy with, so we have
> continued to fiddle.
>
> The current status is as follows - we index the values by stripping out
> spaces and tokenizing them, and use a KeywordAnalyzer.  In searching we also
> strip spaces from the search term entered and search with a KeywordAnalyzer.
> Searches for full postcodes, e.g. "NW10 7NY", find all exact matches but
> also any full values that are partial matches (e.g. some records just have
> "NW10" as their postcode field and the "NW10 7NY" search pulls them back
> too), but searches for partial postcodes, e.g. "NW10", still only find
> exact matches, i.e. they only pull back those records that have just "NW10"
> as their postcode, rather than anything *starting* with NW10 as we'd like
> them to do.
>
> Can anyone help me get this working in the way we need it to, please?
>
> --
> Chris Mannion
> iCasework and LocalAlert implementation team
> 0208 144 4416
>


Re: How to make a query that associates 2 index files

2008-05-06 Thread Chris Lu
No easy way unless you merge your 2 indexes into:

Index:  [who]    [accessed]    [key]  [content]

        David    1/1/2007      Abc    "blah blah 123 ..."

        Someone  1/2/2005      Abc    "blah blah 123 ..."

        Guess    12/1/2000     Xyz    "123 321 a nice day ..."

        Harry    1/1/2008      Abc    "blah blah 123 ..."

        Sandra   1/1/2003      Xyz    "123 321 a nice day ..."


Anyway, a Lucene index is more like a database index: it is only efficient for
particular query execution paths. If you have special requirements, you will
have to restructure your index for performance.
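Building the merged documents might look like this (an untested sketch;
contentForKey() is a made-up helper that looks up the content for a key, and
dates are stored as sortable yyyymmdd strings):

    // one Lucene document per access record, with the content denormalized in
    Document doc = new Document();
    doc.add(new Field("who", "David", Field.Store.YES, Field.Index.UN_TOKENIZED));
    doc.add(new Field("accessed", "20070101", Field.Store.YES, Field.Index.UN_TOKENIZED));
    doc.add(new Field("key", "Abc", Field.Store.YES, Field.Index.UN_TOKENIZED));
    doc.add(new Field("content", contentForKey("Abc"), Field.Store.NO, Field.Index.TOKENIZED));
    writer.addDocument(doc);

With that in place, the whole question becomes a single query, e.g.:

    +who:[David TO Guess] +accessed:[19000101 TO 20100101] +content:blah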

-- 
Chris Lu
-
Instant Scalable Full-Text Search On Any Database/Application
site: http://www.dbsight.net
demo: http://search.dbsight.com
Lucene Database Search in 3 minutes:
http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
DBSight customer, a shopping comparison site, (anonymous per request) got
2.6 Million Euro funding!

On Tue, May 6, 2008 at 9:14 AM, Michael Siu <[EMAIL PROTECTED]> wrote:

> Hi,
>
>
>
> I am a newbie to Lucene. I have a question about making a query that
> associates
> 2 index files:
>
>
>
> - One index has the content index for a list of documents and a key to the
> document. That means the Lucene document of this index contains 2 fields:
>
> the 'content' and the 'key'.
>
> - Another index has some data indexed and associated with the 'key' in the
> previous index. The Lucene document of this index contains several fields:
>
> the 'who' that contains some data and the 'key' that _points_ to the
> document in the first index.
>
>
>
> Sample data:
>
> Index_1:   [key] [content]
>
>Abc   "blah blah 123 ..."
>
>Xyz   "123 321 a nice day ..."
>
>
>
> Index_2:   [who]    [accessed]    [key]
>
>    David    1/1/2007      Abc
>
>    Someone  1/2/2005      Abc
>
>    Guess    12/1/2000     Xyz
>
>    Harry    1/1/2008      Abc
>
>    Sandra   1/1/2003      Xyz
>
>
>
> As shown, the [key] field in Index_2 has repeated values that _point_ to
> the
> [key] values in Index_1. How do I make a query for the following:
>
>
>
> Find out all documents in Index_2:
>
> - [who] is in range of 'David' to 'Guess' and
>
> - [accessed] in range '1/1/1900' to '1/1/2010' and
>
> - [key] associated [content] in Index_1 that contains the term 'blah'
>
>
>
> I know this is more of a SQL-like query. Is Lucene capable of doing this
> type of query that needs associations among index files?
>
>
>
> Thanks in advance.
>
>
>
> - m
>
>
>
>
>
>
>
>


Re: Postcode/zipcode search

2008-05-06 Thread Grant Ingersoll
You might have a look at using a phrase query when you have more than
one term in the query, in addition to your term queries, but giving the
phrase query more weight (i.e. give an exact match more weight), and
keep your original tokenization process.

Something like:
"NW10 7NY"^5 OR NW10 OR 7NY

or even downweighting the individual terms.  Thus, exact matches on
the full phrase will weigh much higher, and you can still do
individual term matching for the single-term case (NW10).
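In code, roughly (untested; assumes the terms come out of your analyzer
lowercased):

    // "NW10 7NY"^5 OR NW10 OR 7NY, with the exact phrase boosted
    PhraseQuery phrase = new PhraseQuery();
    phrase.add(new Term("postcode", "nw10"));
    phrase.add(new Term("postcode", "7ny"));
    phrase.setBoost(5.0f);

    BooleanQuery q = new BooleanQuery();
    q.add(phrase, BooleanClause.Occur.SHOULD);
    q.add(new TermQuery(new Term("postcode", "nw10")), BooleanClause.Occur.SHOULD);
    q.add(new TermQuery(new Term("postcode", "7ny")), BooleanClause.Occur.SHOULD);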


-Grant

On May 6, 2008, at 12:28 PM, Chris Mannion wrote:


Hi all

I've got a bit of a niggling problem with how one of my searches is working
as opposed to how my users would like it to work.  We're indexing on UK
postcodes, which are in the format of a 3 or 4 character area code followed
by a 3 or 4 character street-specific code, e.g. "NW10 7NY" or "M11 1LQ".
We originally had the values being indexed as tokenized and used a very
simple search string in the format "postcode:xxx xxx", with no grouping or
boosting or fuzzy searching, just a straight search on whatever the user
entered.  This had the benefit of finding exact matches to searches and
allowing us to search just on the area part of the code to return all
records with that area code, e.g. a search on "NW2" returning anything
starting NW2, like "NW2 6TB", "NW2 1ER", etc.

However, the downside to that was that searches could also return records
only tenuously related to what was searched for, e.g. a search for "NW10 7NY"
would also return a record with a postcode "SE9 6NY" because of the slight
match of the "NY".  Obviously this was technically correct, but users
complained because their searches were returning records from completely
different areas.  Our first step to put this right was to take off the
tokenization of the field, which we also weren't happy with, so we have
continued to fiddle.

The current status is as follows - we index the values by stripping  
out
spaces and tokeniing them and use a keywordAnalyzer.  In searching  
we also

strip spaces from the search term entered and search with a
KeywordAnalyzer.  Searches for full postcodes, e.g. "NW10 7NY" find
all
exact matches but also any full values that are partial matches  
(e.g. some
records just have "NW10" as their postcode field and the "NW10 7NY"  
search
pulls them back too), but searches for partial postcodes e.g. "NW10"  
still
only finds exact matches, e.g. it only pulls back those records that
have
just "NW10" as their postcode, rather than anything *starting* with  
NW10 as

we'd like it to do.

Can anyone help me get this working in the way we need it to, please?

--
Chris Mannion
iCasework and LocalAlert implementation team
0208 144 4416


--
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ







-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Multiple Field search

2008-05-06 Thread Kelvin Foo Chuan Lyi
Thanks... that's what I thought of... but I was wondering if that was the
best method to do so... I guess it is then... :)




On Wed, May 7, 2008 at 12:32 AM, Erick Erickson <[EMAIL PROTECTED]>
wrote:

> One of my favorite quotes from Roger Zelazny... "postulating
> infinity, the rest is easy".
>
> In this case, "infinity" is how you break up your query. The easy part is
> making your search return what you want.
>
> Assuming you know that you want "greatest" and
> "hits" to go against the title field and "beatles" to
> go against description, your query looks something
> like:
>
> +title:greatest +title:hits +description:beatles
>
> But knowing you want to break up the query like that is the hard part.
>
> Sometimes you can make it work well enough by submitting all
> terms against both fields with an OR clause, something like:
>
> title:(+greatest +hits +beatles) description:(+greatest +hits +beatles)
> (note, OR is implied between title and description)
>
> which would not work in your case. Another technique is to dump
> all the words into a single uber-field *as well as* your individual
> fields, so the search
> uber:(+greatest +hits +beatles) would work for you. Note that if
> you index (but do NOT store) the uber field and store (but do NOT index)
> the title and description fields, your index stays about the same size.
>
> Anyway, you need to carefully define how your searches *should* work,
> then define your index structure IMO.
>
> Best
> Erick
>
> On Tue, May 6, 2008 at 12:07 PM, Kelvin Foo Chuan Lyi <[EMAIL PROTECTED]>
> wrote:
>
> > I'm new to lucene and have a question on how to create a query for the
> > following example... Say I have two fields, Title and Description, with
> > the
> > following data
> >
> > Item 1
> > Title: The greatest hits
> > Description : Collection of the best music from The Beatles.
> >
> > Item 2
> > Title: U2 collections
> > Description : Greatest hits collection of U2
> >
> >
> > In my search, I want to search for 'greatest hits beatles'. The result
> > should return only Item 1. What should the query look like?
> >
> >
> > Thanks.
> >
> >
> >
> >
> > --
> > Sayonara,
> > Kelvin
> >
>



-- 
Sayonara,
Kelvin


Re: How to make a query that associates 2 index files

2008-05-06 Thread Erick Erickson
You don't. You really have to roll your own solution here, there's
no "inter-index" awareness that I know of in Lucene.

Typically, people either do a half-half solution (that is, put the
text search in Lucene and leave the DB parts in the DB) or
de-normalize the data in a Lucene index so you don't have
to even try to do things cross-index.
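
To make the de-normalization concrete: every row of your Index_2 becomes one
Lucene document that also repeats the joined key and content, so a single
query can do all the work. A rough, untested sketch built from the sample
data in your mail below:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// one document per "access" row, with the joined content repeated
Document doc = new Document();
doc.add(new Field("who", "David", Field.Store.YES, Field.Index.UN_TOKENIZED));
doc.add(new Field("accessed", "20070101", Field.Store.YES, Field.Index.UN_TOKENIZED));
doc.add(new Field("key", "Abc", Field.Store.YES, Field.Index.UN_TOKENIZED));
doc.add(new Field("content", "blah blah 123 ...", Field.Store.NO, Field.Index.TOKENIZED));

A single query along the lines of
+who:[david TO guess] +accessed:[19000101 TO 20100101] +content:blah
then needs no cross-index plumbing at all.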

And then there's Marcelo Ochoa (I may have the spelling wrong) who's
put together some way to embed Lucene in an Oracle database, but
that's magic to me.

Why do you want to arrange your index this way in the first place?
Perhaps there's a more specific answer waiting out there...

Best
Erick

On Tue, May 6, 2008 at 12:14 PM, Michael Siu <[EMAIL PROTECTED]>
wrote:

> Hi,
>
>
>
> I am a newbie to Lucene. I have a question about making a query that
> associates
> 2 index files:
>
>
>
> - One index has the content index for a list of documents and a key to the
> document. That means the Lucene document of this index contains 2 fields:
>
> the 'content' and the 'key'.
>
> - another index has some data indexed and associated with the 'key' in the
> previous index. The Lucene document of this index contains several fields:
>
> the 'who' that contains some data and the 'key' that _points_ to the
> document in the first index.
>
>
>
> Sample data:
>
> Index_1:   [key] [content]
>
>Abc   "blah blah 123 ..."
>
>Xyz   "123 321 a nice day ..."
>
>
>
> Index_2:   [who][accessed]   [key]
>
>David1/1/2007 Abc
>
>Someone  1/2/2005 Abc
>
>Guess12/1/2000Xyz
>
>Harry1/1/2008 Abc
>
>Sandra   1/1/2003 Xyz
>
>
>
> As shown, the [key] field in Index_2 has repeated values that _point_ to
> the
> [key] values in Index_1. How do I make a query for the following:
>
>
>
> Find out all documents in Index_2:
>
> - [who] is in range of 'David' to 'Guess' and
>
> - [accessed] in range '1/1/1900' to '1/1/2010' and
>
> - [key] associated [content] in Index_1 that contains the term 'blah'
>
>
>
> I know this is more of an SQL-like query. Is Lucene capable of doing this type
> of
> query that needs associations among index files?
>
>
>
> Thanks in advance.
>
>
>
> - m
>
>
>
>
>
>
>
>


Re: Multiple Field search

2008-05-06 Thread Erick Erickson
One of my favorite quotes from Roger Zelazny... "postulating
infinity, the rest is easy".

In this case, "infinity" is how you break up your query. The easy part is
making your search return what you want.

Assuming you know that you want "greatest" and
"hits" to go against the title field and "beatles" to
go against description, your query looks something
like:

+title:greatest +title:hits +description:beatles

But knowing you want to break up the query like that is the hard part.

Sometimes you can make it work well enough by submitting all
terms against both fields with an OR clause, something like:

title:(+greatest +hits +beatles) description:(+greatest +hits +beatles)
(note, OR is implied between title and description)

which would not work in your case. Another technique is to dump
all the words into a single uber-field *as well as* your individual
fields, so the search
uber:(+greatest +hits +beatles) would work for you. Note that if
you index (but do NOT store) the uber field and store (but do NOT index)
the title and description fields, your index stays about the same size.
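
A rough, untested sketch of that arrangement at indexing time (the variable
names are placeholders):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

Document doc = new Document();
// stored for display only, not searched directly
doc.add(new Field("title", title, Field.Store.YES, Field.Index.NO));
doc.add(new Field("description", description, Field.Store.YES, Field.Index.NO));
// searched but not stored: every word lands in one catch-all field
doc.add(new Field("uber", title + " " + description,
                  Field.Store.NO, Field.Index.TOKENIZED));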

Anyway, you need to carefully define how your searches *should* work,
then define your index structure IMO.

Best
Erick

On Tue, May 6, 2008 at 12:07 PM, Kelvin Foo Chuan Lyi <[EMAIL PROTECTED]>
wrote:

> I'm new to lucene and have a question on how to create a query for the
> following example... Say I have two fields, Title and Description, with
> the
> following data
>
> Item 1
> Title: The greatest hits
> Description : Collection of the best music from The Beatles.
>
> Item 2
> Title: U2 collections
> Description : Greatest hits collection of U2
>
>
> In my search, I want to search for 'greatest hits beatles'. The result
> should return only Item 1. What should the query look like?
>
>
> Thanks.
>
>
>
>
> --
> Sayonara,
> Kelvin
>


Postcode/zipcode search

2008-05-06 Thread Chris Mannion
Hi all

I've got a bit of a niggling problem with how one of my searches is working
as opposed to how my users would like it to work.  We're indexing on UK
postcodes, which are in the format of a 3 or 4 character area code followed
by a 3 or 4 character street specific code, e.g. "NW10 7NY" or "M11 1LQ".
We originally had the values being indexed as tokenized and used a very
simple search string in the format "postcode:xxx xxx", with no grouping or
boosting or fuzzy searching, just a straight search on whatever the user
answered.  This had the benefit of finding exact matches to searches and
allowing us to search just on the area part of the code to return all
records with that area code, eg a search on "NW2" returning anything
starting NW2, like "NW2 6TB", "NW2 1ER" etc etc.

However, the downside to that was that searches could also return records
only tenuously related to what was searched for, eg. a search for "NW10 7NY"
would also return a record with a postcode "SE9 6NY" because of the slight
match of the "NY".  Obviously this was technically correct but users
complained because their searches were returning records from completely
different areas.  Our first step to put this right was to take off the
tokenization of the field, which we also weren't happy with so have
continued to fiddle.

The current status is as follows - we index the values by stripping out
spaces and tokenizing them and use a KeywordAnalyzer.  In searching we also
strip spaces from the search term entered and search with a
KeywordAnalyzer.  Searches for full postcodes, e.g. "NW10 7NY" find all
exact matches but also any full values that are partial matches (e.g. some
records just have "NW10" as their postcode field and the "NW10 7NY" search
pulls them back too), but searches for partial postcodes e.g. "NW10" still
only finds exact matches, e.g. it only pulls back those records that have
just "NW10" as their postcode, rather than anything *starting* with NW10 as
we'd like it to do.
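
In code, what we're doing now boils down to roughly this (simplified and
typed from memory; indexDir, postcode and userInput are placeholders):

import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;

// indexing: strip spaces; KeywordAnalyzer keeps the whole value as one token
IndexWriter writer = new IndexWriter(indexDir, new KeywordAnalyzer(), false);
Document doc = new Document();
doc.add(new Field("postcode", postcode.replaceAll("\\s+", ""),
                  Field.Store.YES, Field.Index.TOKENIZED));
writer.addDocument(doc);

// searching: strip spaces from the user's input and use the same analyzer
Query q = new QueryParser("postcode", new KeywordAnalyzer())
              .parse(userInput.replaceAll("\\s+", ""));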

Can anyone help me get this working in the way we need it to, please?

-- 
Chris Mannion
iCasework and LocalAlert implementation team
0208 144 4416


How to make a query that associates 2 index files

2008-05-06 Thread Michael Siu
Hi,

 

I am a newbie to Lucene. I have a question about making a query that associates
2 index files:

 

- One index has the content index for a list of documents and a key to the
document. That means the Lucene document of this index contains 2 fields:

the 'content' and the 'key'.

- another index has some data indexed and associated with the 'key' in the
previous index. The Lucene document of this index contains several fields:

the 'who' that contains some data and the 'key' that _points_ to the
document in the first index.

 

Sample data:

Index_1:   [key] [content] 

Abc   "blah blah 123 ..."

Xyz   "123 321 a nice day ..."

 

Index_2:   [who][accessed]   [key]

David1/1/2007 Abc

Someone  1/2/2005 Abc

Guess12/1/2000Xyz

Harry1/1/2008 Abc

Sandra   1/1/2003 Xyz

 

As shown, the [key] field in Index_2 has repeated values that _point_ to the
[key] values in Index_1. How do I make a query for the following: 

 

Find out all documents in Index_2:

- [who] is in range of 'David' to 'Guess' and

- [accessed] in range '1/1/1900' to '1/1/2010' and

- [key] associated [content] in Index_1 that contains the term 'blah' 

 

I know this is more of an SQL-like query. Is Lucene capable of doing this type of
query that needs associations among index files?

 

Thanks in advance.

 

- m

 

 

 



Multiple Field search

2008-05-06 Thread Kelvin Foo Chuan Lyi
I'm new to lucene and have a question on how to create a query for the
following example... Say I have two fields, Title and Description, with the
following data

Item 1
Title: The greatest hits
Description : Collection of the best music from The Beatles.

Item 2
Title: U2 collections
Description : Greatest hits collection of U2


In my search, I want to search for 'greatest hits beatles'. The result
should return only Item 1. What should the query look like?


Thanks.




-- 
Sayonara,
Kelvin


Re: Filtering a SpanQuery

2008-05-06 Thread Paul Elschot
Eran,

Op Tuesday 06 May 2008 10:15:10 schreef Eran Sevi:
> Hi,
>
> I am looking for a way to filter a SpanQuery according to some other
> query (on another field from the one used for the SpanQuery). I need
> to get access to the spans themselves of course.
>  I don't care about the scoring of the filter results and just need
> the positions of hits found in the documents that match the filter.

I think you'll have to implement the filtering on the Spans yourself.
That's not really difficult, just use Spans.skipTo().
The code to do that could look something like this (untested):

Spans spans = yourSpanQuery.getSpans(reader);
BitSet bits = yourFilter.bits(reader);
int filterDoc = bits.nextSetBit(0);
while ((filterDoc >= 0) && spans.skipTo(filterDoc)) {
  // spans.skipTo() advances to the first match at or after filterDoc
  boolean more = true;
  while (more && (spans.doc() == filterDoc)) {
    // use spans.start() and spans.end() here
    // ...
    more = spans.next();
  }
  if (!more) {
    break;
  }
  // move the filter to the next set bit at or after the current span's doc
  filterDoc = bits.nextSetBit(spans.doc());
}

Please check the javadocs of java.util.BitSet; there may
be an off-by-one error in the arguments to nextSetBit().

Regards,
Paul Elschot


>
> I tried looking through the archives and found some reference to a
> SpanQueryFilter patch, however I don't see how it can help me achieve
> what I want to do. This class receives only one query parameter
> (which I guess is the actual query) and not a query and a filter for
> example.
>
> Any help about how I can achieve this will be appreciated.
>
> Thanks,
> Eran.



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: lucene farsi problem

2008-05-06 Thread esra

Hi Steven,

I tried the class and it works fine with the locale parameter "ar".

Actually we are using "fa" for Farsi and "ar" for Arabic.
I have added a little control for the locale parameter in my code and now I
can see the correct results.
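
Roughly, the control amounts to this (an untested sketch; requestedLocale,
term, lowerTerm and upperTerm stand in for whatever your code passes around):

import java.text.Collator;
import java.util.Locale;

// the JRE ships no Farsi ("fa") collation rules, so fall back to Arabic ("ar")
Locale locale = "fa".equals(requestedLocale) ? new Locale("ar")
                                             : new Locale(requestedLocale);
Collator collator = Collator.getInstance(locale);
boolean inRange = collator.compare(term, lowerTerm) >= 0
               && collator.compare(term, upperTerm) <= 0;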

Thank you very much for your help.

Esra.



Steven A Rowe wrote:
> 
> Hi Esra,
> 
> I have attached a patch to LUCENE-1279 containing a new class:
> CollatingRangeQuery.
> 
> The patch also contains a test class: TestCollatingRangeQuery.  One of the
> test methods checks for the Farsi range you were having trouble with.
> 
> It should be mentioned that according to Collator.getAvailableLocales(),
> neither Java 1.4.2 nor Java 1.5.0 contains Farsi collation information. 
> However, in the test class I use the Arabic Locale, which seems to
> properly collate the non-Arabic Farsi letter U+0698, and hopefully behaves
> well with other Farsi letters.  If you find that this is not the case, you
> can look into writing collation rules using RuleBasedCollator - you should
> be able to directly specify the proper letter orderings for Farsi;
> CollatingRangeQuery would also have to supply a constructor that takes in
> a Collator instead of a Locale.
> 
> Please give the class a try and post back about how it works.
> 
> Thanks,
> Steve
> 
> On 05/03/2008 at 8:33 AM, esra wrote:
>> 
>> Hi Steven,
>> 
>> thanks for your help
>> 
>> Esra
>> 
>> 
>> Steven A Rowe wrote:
>> > 
>> > Hi Esra,
>> > 
>> > I have created an issue for this - see
>> > .
>> > 
>> > I'll try to take a crack at a patch this weekend.
>> > 
>> > Steve
>> > 
>> > On 05/02/2008 at 12:55 PM, esra wrote:
>> > > 
>> > > Hi Steven ,
>> > > 
>> > > Yes, you are right; sorry, I am a bit confused.
>> > > 
>> > > I checked again and the correct one is "zhe"/U+698.
>> > > 
>> > > It seems the word is in the range but my customer says it
>> > > shouldn't be.
>> > > 
>> > > I think the problem occurs because "zhe" is a Persian letter
>> > > outside the Arabic alphabet. In the Farsi alphabet this letter
>> > > does not come after the "س" letter, but its Unicode code point
>> > > is bigger than that of "س", and the searcher compares code
>> > > points.
>> > > 
>> > > Esra
>> > > 
>> > > 
>> > > Steven A Rowe wrote:
>> > > > 
>> > > > Hi Esra,
>> > > > 
>> > > > You are *still* incorrectly referring to the glyph with three dots
>> > > > over it:
>> > > > 
>> > > > On 05/02/2008 at 12:18 PM, esra wrote:
>> > > > > yes the correct one is "ژ "/"ze"/U+632.
>> > > > 
>> > > > "ژ" is *not* "ze"/U+632 - it is "zhe"/U+698.
>> > > > 
>> > > > Have you increased the font size?  Can you see the difference
>> > > > between these two?:
>> > > > 
>> > > > "ژ"/"zhe"/U+698
>> > > > "ز"/"ze"/U+632
>> > > > 
>> > > > > my problem is when I do a search for the "د-ژ" range. The result
>> > > > > is "ساب ووفر" and this word's first letter is "س" and its unicode
>> > > > > is "U+633" and it is not in the [ U+062F - U+0632 ] range.
>> > > > 
>> > > > Like I keep saying, in the above description, you're using the
>> > > > glyph "ژ"/"zhe"/U+698, while at the same time incorrectly
>> > > > referring to it as "ze"/U+632.
>> > > > 
>> > > > I don't mean to continually bang on about this - if you're *sure*
>> > > > that when you search, you're using the character represented by the
>> > > > glyph with one dot (and not three), i.e. "ز"/"ze"/U+632, then the
>> > > > problem lies elsewhere.
>> > > > 
>> > > > Steve
>> > > > 
>> > > > On 05/02/2008 at 12:18 PM, esra wrote:
>> > > > > yes the correct one is "ژ "/"ze"/U+632.
>> > > > > 
>> > > > > my problem is when I do a search for the "د-ژ" range. The result
>> > > > > is "ساب ووفر" and this word's first letter is "س" and its unicode
>> > > > > is "U+633" and it is not in the [ U+062F - U+0632 ] range.
>> > > > > 
>> > > > > am i wrong?
>> > > > > 
>> > > > > Esra
>> > > > > 
>> > > > > Steven A Rowe wrote:
>> > > > > > 
>> > > > > > Hi Esra,
>> > > > > > 
>> > > > > > I still think you're wrong :).
>> > > > > > 
>> > > > > > On 05/02/2008 at 9:31 AM, esra wrote:
>> > > > > > > > ژ = U+632
>> > > > > > 
>> > > > > > According to the website you linked to, the above character,
>> > > > > > which has three dots over it, is named "zhe", and its Unicode
>> > > > > > code point is U+698. (I had to increase the font size to see
>> > > > > > the three dots.)
>> > > > > > 
>> > > > > > I think you are confusing "ژ"/"zhe"/U+698 with the letter
>> > > > > > "ز"/"ze"/U+632, which has just one dot over it.
>> > > > > > 
>> > > > > > Unless you were mistaken in all of your emails when you
>> > > > > > included the character "ژ"/"zhe" instead of "ز"/"ze", then
>> > > > > > what I said in my previous email still stands: there is no
>> > > > > > problem here.
>> > > > > > 
>> > > > > > Steve
>> > > > > > 
>> > > > > > On 05/02/2008 at 9:31 AM, esra wrote:
>> > > > > > > 
>> > > > > > > Hi Steven,
>> > > > > > > 
>> > > >

Re: index corruption with latest lucene

2008-05-06 Thread Gopikrishnan Subramani
Thanks Mike. Sorry, I should have mentioned that I'm using 1.6.0_04. I
happened to look at the thread a while ago and used -Xbatch, but that didn't
help, which made me think maybe it's a different issue. I'll try with -Xint
before downgrading to 1.6.0_03 to be doubly sure.

-Gopi


On 5/6/08, Michael McCandless <[EMAIL PROTECTED]> wrote:
>
>
> Are you using JRE 1.6.0_04 or 1.6.0_05?
>
> This sounds exactly the same as this:
>
>http://www.gossamer-threads.com/lists/lucene/java-user/59650
>
> If it is the same issue, which seems to be a bug in the hotspot compiler,
> downgrading to JRE 1.6.0_03, or running Java with -Xbatch (forces up-front
> compilation) or -Xint (disables compilation) works around it.
>
> Can you test either of these and report back?  Thanks.
>
> Mike
>
> Gopikrishnan Subramani wrote:
>
> > [ Sorry if I'm hijacking this thread, if you feel this error is
> > unrelated to
> > this thread, I'll move this to a separate thread. ]
> >
> > Even after upgrading to 2.3.1 I'm running into index corruption
> > problems.
> > I'm posting below the exception that is generated while searching. The
> > stack
> > trace looks like,
> >
> >
> > org.apache.lucene.index.CorruptIndexException: doc counts differ for
> > segment
> > _kk: fieldsReader shows 72670 but segmentInfo shows 72671
> >at org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:313)
> >at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:262)
> >at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:230)
> >at org.apache.lucene.index.DirectoryIndexReader$1.doBody(DirectoryIndexReader.java:73)
> >at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:636)
> >at org.apache.lucene.index.DirectoryIndexReader.open(DirectoryIndexReader.java:63)
> >at org.apache.lucene.index.IndexReader.open(IndexReader.java:209)
> >at org.apache.lucene.index.IndexReader.open(IndexReader.java:173)
> >at org.apache.lucene.search.IndexSearcher.<init>(IndexSearcher.java:48)
> >
> >
> > About the setup: all documents in these indexes have the same set of
> > fields, but some fields are not added if the value is null. We have over
> > 500 indexes and they are indexed incrementally on a daily basis. The
> > index is updated in place with autocommit turned on. A single thread
> > writes to the index and once all the documents are updated, the index is
> > committed. About 5 indexes are getting corrupted per week on average and
> > a full index fixes the problem. This is proving to be a lot of pain and
> > any help in identifying the problem is much appreciated.
> >
> > thanks,
> > Gopi
> >
> > On 5/6/08, Mark Miller <[EMAIL PROTECTED]> wrote:
> >
> > >
> > > I am getting even more confused. I luckily found a copy of one of the
> > > corrupted test indices that I had made on 4/28/08...lucky as it's the
> > > only one I have ever made :) It doesn't have the problem. This is very
> > > interesting to me, because the other site that has the problem has
> > > been
> > > in action for months now. Both were running with my previous version
> > > of
> > > Lucene, which was a trunk build from around when 2.3 was released I
> > > think. Just seems odd that the test index was corrupted so recently.
> > >
> > >
> > > So I am a bit stuck...it's probably my own problem though, so unless
> > > someone else sees it, I'll just report back if/when I find out more.
> > >
> > > - Mark
> > >
> > >
> > > On Mon, 2008-05-05 at 18:07 -0400, Michael McCandless wrote:
> > >
> > > > Mark,
> > > >
> > > > Which exact version of the JRE are you using?
> > > >
> > > > Mike
> > > >
> > > > Mark Miller wrote:
> > > >
> > > > > On Mon, 2008-05-05 at 17:26 -0400, Michael McCandless wrote:
> > > > >
> > > > > > Actually that stack trace looks like it's from trunk, not from
> > > > > > 2.3.2
> > > > > > (pre)?  OK, I think you said it's from "post 2.3 trunk".
> > > > > >
> > > > >
> > > > > Right...the Lucene that showed the problem was built from a trunk
> > > > > grab
> > > > > late last week. One of the problem indexes was built with a 2.0 or
> > > > > 2.1
> > > > > and the other was built with a post 2.3 trunk (but weeks (prob
> > > > > months)
> > > > > before the one i grabbed late last week :) )
> > > > >
> > > > >
> > > > > > Another question: is autoCommit false or true?
> > > > > >
> > > > > false
> > > > >
> > > > >
> > > > >
> > > > > If I can get you an affected index I will.
> > > > >
> > > > > - mark
> > > > >
> > > > >
> > > > >
> > > > > > More responses below:
> > > > > >
> > > > > > Mark Miller wrote:
> > > > > >
> > > > > > > On Mon, 2008-05-05 at 16:32 -0400, Michael McCandless wrote:
> > > > > > >
> > > > > > > > Hi Mark,
> > > > > > > >
> > > > > > > > Not good!
> > > > > > > >
> > > > > > > > Can you describe how this index was created?  Did you use
> > > > > > > > multiple
> > > > > > > > thre

Re: index corruption with latest lucene

2008-05-06 Thread Michael McCandless


Could you provide more detail on how you hit these two exceptions?   
Are they reproducible from scratch (creating a new index)?


Are you using multiple threads against IndexWriter?  Is autoCommit  
true or false?  Any prior exceptions hit?  Do your documents have  
varying number/configuration of fields?  Is the index copied  
somewhere after being written?


Can you run with assertions enabled to see if something is tripped?

Maybe run CheckIndex on the index?
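
For example, something along these lines (untested; adjust the classpath and
jar name to your setup):

java -ea:org.apache.lucene... -cp your-classpath YourApp
    (runs your code with Lucene's assertions enabled)

java -cp lucene-core-2.3.2.jar org.apache.lucene.index.CheckIndex /path/to/index
    (prints per-segment diagnostics for the index)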

Mike

crspan wrote:

Coincidence, or is it from 2.3.2?

env:
lucene 2.3.2
jdk1.6.0_06 & jdk1.5.0_15


QueryString:
illeg^30.820824 technolog^22.290413 transfer^33.307804
Error: java.lang.ArrayIndexOutOfBoundsException: 132704
java.lang.ArrayIndexOutOfBoundsException: 132704
at org.apache.lucene.search.BooleanScorer2$Coordinator.coordFactor(BooleanScorer2.java:55)
at org.apache.lucene.search.BooleanScorer2.score(BooleanScorer2.java:358)
at org.apache.lucene.search.BooleanScorer2.score(BooleanScorer2.java:320)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:146)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:113)
at org.apache.lucene.search.Searcher.search(Searcher.java:132)
at org.cr.search.TrecQueryRelevanceFeedback.main(TrecQueryRelevanceFeedback.java:776)



QueryString:
oceanograph^68.48028 vessel^43.191563
Error: java.lang.ArrayIndexOutOfBoundsException
java.lang.ArrayIndexOutOfBoundsException
at java.lang.System.arraycopy(Native Method)
at org.apache.lucene.index.TermVectorsReader.readTermVector(TermVectorsReader.java:353)
at org.apache.lucene.index.TermVectorsReader.readTermVectors(TermVectorsReader.java:287)
at org.apache.lucene.index.TermVectorsReader.get(TermVectorsReader.java:232)
at org.apache.lucene.index.SegmentReader.getTermFreqVectors(SegmentReader.java:981)
at org.cr.rf.RelevanceFeedback.RelFeedbackWeight(RelevanceFeedback.java:134)
at org.cr.search.TrecQueryRelevanceFeedback.main(TrecQueryRelevanceFeedback.java:781)





Mark Miller wrote:

Any recent changes that would expose index corruption?

I am getting two new errors when trying to search:

nullpointer fieldsreaders line 260

indexoutofbounds on fieldinfo line 185

I am kind of screwed, because reindexing fixes this, but I can't
reindex!


Any ideas?


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: index corruption with latest lucene

2008-05-06 Thread Michael McCandless


Are you using JRE 1.6.0_04 or 1.6.0_05?

This sounds exactly the same as this:

http://www.gossamer-threads.com/lists/lucene/java-user/59650

If it is the same issue, which seems to be a bug in the hotspot  
compiler, downgrading to JRE 1.6.0_03, or running Java with -Xbatch  
(forces up-front compilation) or -Xint (disables compilation) works  
around it.
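
Concretely, something like this (the classpath and main class are
placeholders):

java -Xbatch -cp your-classpath YourIndexer
    (compiles methods up front, no background compilation)

java -Xint -cp your-classpath YourIndexer
    (pure interpreter; slow, but rules the JIT out entirely)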


Can you test either of these and report back?  Thanks.

Mike

Gopikrishnan Subramani wrote:
[ Sorry if I'm hijacking this thread, if you feel this error is  
unrelated to

this thread, I'll move this to a separate thread. ]

Even after upgrading to 2.3.1 I'm running into index corruption  
problems.
I'm posting below the exception that is generated while searching.  
The stack

trace looks like,


org.apache.lucene.index.CorruptIndexException: doc counts differ for segment
_kk: fieldsReader shows 72670 but segmentInfo shows 72671
at org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:313)
at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:262)
at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:230)
at org.apache.lucene.index.DirectoryIndexReader$1.doBody(DirectoryIndexReader.java:73)
at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:636)
at org.apache.lucene.index.DirectoryIndexReader.open(DirectoryIndexReader.java:63)
at org.apache.lucene.index.IndexReader.open(IndexReader.java:209)
at org.apache.lucene.index.IndexReader.open(IndexReader.java:173)
at org.apache.lucene.search.IndexSearcher.<init>(IndexSearcher.java:48)


About the setup: all documents in these indexes have the same set of
fields, but some fields are not added if the value is null. We have over
500 indexes and they are indexed incrementally on a daily basis. The
index is updated in place with autocommit turned on. A single thread
writes to the index and once all the documents are updated, the index is
committed. About 5 indexes are getting corrupted per week on average and
a full index fixes the problem. This is proving to be a lot of pain and
any help in identifying the problem is much appreciated.

thanks,
Gopi

On 5/6/08, Mark Miller <[EMAIL PROTECTED]> wrote:


I am getting even more confused. I luckily found a copy of one of the
corrupted test indices that I had made on 4/28/08...lucky as it's the
only one I have ever made :) It doesn't have the problem. This is  
very
interesting to me, because the other site that has the problem has  
been
in action for months now. Both were running with my previous  
version of

Lucene, which was a trunk build from around when 2.3 was released I
think. Just seems odd that the test index was corrupted so recently.


So I am a bit stuck...it's probably my own problem though, so unless
someone else sees it, I'll just report back if/when I find out more.

- Mark


On Mon, 2008-05-05 at 18:07 -0400, Michael McCandless wrote:

Mark,

Which exact version of the JRE are you using?

Mike

Mark Miller wrote:

On Mon, 2008-05-05 at 17:26 -0400, Michael McCandless wrote:
Actually that stack trace looks like it's from trunk, not from  
2.3.2

(pre)?  OK, I think you said it's from "post 2.3 trunk".


Right...the Lucene that showed the problem was built from a
trunk grab
late last week. One of the problem indexes was built with a 2.0  
or 2.1
and the other was built with a post 2.3 trunk (but weeks (prob  
months)

before the one i grabbed late last week :) )



Another question: is autoCommit false or true?

false



If I can get you an affected index I will.

- mark




More responses below:

Mark Miller wrote:

On Mon, 2008-05-05 at 16:32 -0400, Michael McCandless wrote:

Hi Mark,

Not good!

Can you describe how this index was created?  Did you use  
multiple

threads on one IndexWriter?  Multiple sessions of IndexWriter
appending to the index?  addIndexes*?  Is the index copied  
from one

place to another after being written and before being searched?


Both sites were created by a single thread on a single  
IndexWriter.

Updates are done through multiple threads and one IndexWriter. No
addIndexes. Index was never copied, always same path.



If you run CheckIndex, what does it report?


This was my next move...unfortunately, someone accidentally  
kicked

off a
complete reindex before I could do it. From what I can tell by  
the

stack
trace, its a per doc problem...I am guessing I could have
printed the
ids of the problem docs and just reindex those? I have to deal  
with

this
at many other sites, so that may be my attack...I cannot reindex
everything to fix.


It would be great to know if that workaround works (and indeed  
it's a

per-doc issue).  I'd also love to know how many docs are affected,
when you hit this.

If there's any way to zip up the index and send it to me, even  
just
the files for the one segment that has the corrupted doc,  
that'd be

great.




Re: index corruption with latest lucene

2008-05-06 Thread Gopikrishnan Subramani
[ Sorry if I'm hijacking this thread, if you feel this error is unrelated to
this thread, I'll move this to a separate thread. ]

Even after upgrading to 2.3.1 I'm running into index corruption problems.
I'm posting below the exception that is generated while searching. The stack
trace looks like,


org.apache.lucene.index.CorruptIndexException: doc counts differ for segment
_kk: fieldsReader shows 72670 but segmentInfo shows 72671
at org.apache.lucene.index.SegmentReader.initialize(SegmentReader.java:313)
at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:262)
at org.apache.lucene.index.SegmentReader.get(SegmentReader.java:230)
at org.apache.lucene.index.DirectoryIndexReader$1.doBody(DirectoryIndexReader.java:73)
at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:636)
at org.apache.lucene.index.DirectoryIndexReader.open(DirectoryIndexReader.java:63)
at org.apache.lucene.index.IndexReader.open(IndexReader.java:209)
at org.apache.lucene.index.IndexReader.open(IndexReader.java:173)
at org.apache.lucene.search.IndexSearcher.<init>(IndexSearcher.java:48)


About the setup: all documents in these indexes have the same set of
fields, but some fields are not added if the value is null. We have over 500
indexes and they are indexed incrementally on a daily basis. The index is
updated in place with autocommit turned on. A single thread writes to the
index and once all the documents are updated, the index is committed. About 5
indexes are getting corrupted per week on average and a full index fixes
the problem. This is proving to be a lot of pain and any help in identifying
the problem is much appreciated.
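
In case it helps, the daily write cycle is essentially this (a simplified,
untested sketch; indexDir, Doc, todaysUpdates and toLuceneDoc are made-up
names):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;

// single writer thread; this constructor leaves autoCommit at its default (true)
IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), false);
for (Doc d : todaysUpdates) {          // hypothetical daily update feed
    writer.updateDocument(new Term("id", d.id), toLuceneDoc(d));
}
writer.close();                        // final commit once everything is updated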

thanks,
Gopi

On 5/6/08, Mark Miller <[EMAIL PROTECTED]> wrote:
>
> I am getting even more confused. I luckily found a copy of one of the
> corrupted test indices that I had made on 4/28/08...lucky as it's the
> only one I have ever made :) It doesn't have the problem. This is very
> interesting to me, because the other site that has the problem has been
> in action for months now. Both were running with my previous version of
> Lucene, which was a trunk build from around when 2.3 was released I
> think. Just seems odd that the test index was corrupted so recently.
>
>
> > So I am a bit stuck...it's probably my own problem though, so unless
> > someone else sees it, I'll just report back if/when I find out more.
>
> - Mark
>
>
> On Mon, 2008-05-05 at 18:07 -0400, Michael McCandless wrote:
> > Mark,
> >
> > Which exact version of the JRE are you using?
> >
> > Mike
> >
> > Mark Miller wrote:
> > > On Mon, 2008-05-05 at 17:26 -0400, Michael McCandless wrote:
> > >> Actually that stack trace looks like it's from trunk, not from 2.3.2
> > >> (pre)?  OK, I think you said it's from "post 2.3 trunk".
> > >
> > > Right...the Lucene that showed the problem was built from a trunk grab
> > > late last week. One of the problem indexes was built with a 2.0 or 2.1
> > > and the other was built with a post 2.3 trunk (but weeks (prob months)
> > > before the one i grabbed late last week :) )
> > >
> > >>
> > >> Another question: is autoCommit false or true?
> > > false
> > >
> > >
> > >
> > > If I can get you an affected index I will.
> > >
> > > - mark
> > >
> > >
> > >>
> > >> More responses below:
> > >>
> > >> Mark Miller wrote:
> > >>> On Mon, 2008-05-05 at 16:32 -0400, Michael McCandless wrote:
> >  Hi Mark,
> > 
> >  Not good!
> > 
> >  Can you describe how this index was created?  Did you use multiple
> >  threads on one IndexWriter?  Multiple sessions of IndexWriter
> >  appending to the index?  addIndexes*?  Is the index copied from one
> >  place to another after being written and before being searched?
> > >>>
> > >>> Both sites were created by a single thread on a single IndexWriter.
> > >>> Updates are done through multiple threads and one IndexWriter. No
> > >>> addIndexes. Index was never copied, always same path.
> > >>>
> > 
> >  If you run CheckIndex, what does it report?
> > >>>
> > >>> This was my next move...unfortunately, someone accidentally kicked
> > >>> off a
> > >>> complete reindex before I could do it. From what I can tell by the
> > >>> stack
> > >>> trace, its a per doc problem...I am guessing I could have
> > >>> printed the
> > >>> ids of the problem docs and just reindex those? I have to deal with
> > >>> this
> > >>> at many other sites, so that may be my attack...I cannot reindex
> > >>> everything to fix.
> > >>
> > >> It would be great to know if that workaround works (and indeed it's a
> > >> per-doc issue).  I'd also love to know how many docs are affected,
> > >> when you hit this.
> > >>
> > >> If there's any way to zip up the index and send it to me, even just
> > >> the files for the one segment that has the corrupted doc, that'd be
> > >> great.
> > >>
> > 
> >  Any prior exceptions on this index?
> > >>>
> > >>> Not that I can reca

Filtering a SpanQuery

2008-05-06 Thread Eran Sevi
Hi,

I am looking for a way to filter a SpanQuery according to some other query
(on another field from the one used for the SpanQuery). I need to get access
to the spans themselves of course.
 I don't care about the scoring of the filter results and just need the
positions of hits found in the documents that match the filter.

I tried looking through the archives and found some reference to a
SpanQueryFilter patch, however I don't see how it can help me achieve what I
want to do. This class receives only one query parameter (which I guess is
the actual query) and not a query and a filter for example.

Any help about how I can achieve this will be appreciated.

Thanks,
Eran.