Re: query question

2007-08-16 Thread Mohammad Norouzi
Yes Karl, when I explore the index with Luke I can see the terms.
For example, I have a field named patientResult; it contains the value "Ca.
Oxalate:many" and also other values such as "Ca. Oxalate:few", etc.

The problem is that when I run this query: patientResult:(Ca. Oxalate:few)
the result is
84329 Ca. Oxalate:few
112519 Ca. Oxalate:many
139141 Ca. Oxalate:many
394321 Ca. Oxalate:few
397671 Ca. Oxalate:nod
387549 Ca. Oxalate: mod

However, this is not the required result. Another problem is that when I run
patientResult:Oxalate or patientResult:Oxalate*, no results are returned!

Let me tell you that I have extended MultiFieldQueryParser to override its
methods, and in the getFieldQuery(...) method I return a TermQuery.

I don't know what I did wrong.
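One likely culprit: in Lucene's query syntax the colon is the field separator, so inside patientResult:(Ca. Oxalate:few) the embedded "Oxalate:few" is not parsed as a literal term. A minimal sketch of escaping such values before parsing (the helper and its special-character set are illustrative, modeled on what QueryParser.escape does; verify the exact set against your Lucene version):

```java
public class QueryEscape {
    // Characters special to the Lucene query parser (approximate set).
    private static final String SPECIALS = "\\+-!():^[]\"{}~*?";

    public static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (SPECIALS.indexOf(c) >= 0) sb.append('\\'); // backslash-escape
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The colon would otherwise be read as a field/term separator.
        System.out.println(escape("Ca. Oxalate:few")); // prints Ca. Oxalate\:few
    }
}
```

With the value escaped (or indexed and queried as a single un-analyzed term), an exact TermQuery for the stored token "Ca. Oxalate:few" should match only the intended documents.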




On 8/15/07, karl wettin <[EMAIL PROTECTED]> wrote:
>
>
> 15 aug 2007 kl. 07.18 skrev Mohammad Norouzi:
>
> > I am using WhitespaceAnalyzer and the query is " icdCode:H* " but
> > there is
> > no result however I know that there are many documents with this
> > field value
> > such as H20, H20.5 etc. this field is tokenized and indexed
> > what is
> > wrong with this?
> > when I test this query with Luke it will return no result as well.
>
> Can you also use Luke to inspect documents you know should contain these
> terms and make sure it really is in there?
>
> --
> karl
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


-- 
Regards,
Mohammad
--
see my blog: http://brainable.blogspot.com/
another in Persian: http://fekre-motefavet.blogspot.com/


getting term offset information for fields with multiple value entiries

2007-08-16 Thread duiduder

Hello,

I have an index with an 'actor' field; for each actor there is a single
field value entry, e.g.

stored/compressed,indexed,tokenized,termVector,termVectorOffsets,termVectorPosition

movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo)
movie_actors:Miguel Bosé
movie_actors:Anna Lizaran (as Ana Lizaran)
movie_actors:Raquel Sanchís
movie_actors:Angelina Llongueras

I try to get the term offset, e.g. for 'angelina' with

TermPositionVector termPositionVector = (TermPositionVector)
    reader.getTermFreqVector(docNumber, "movie_actors");
int iTermIndex = termPositionVector.indexOf("angelina");
TermVectorOffsetInfo[] termOffsets = termPositionVector.getOffsets(iTermIndex);


I get one TermVectorOffsetInfo for the field - with offset numbers that are
bigger than a single field entry.
I guessed that Lucene reports the offsets as if all values were concatenated,
which would be the single (virtual) string:

movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo)Miguel BoséAnna Lizaran (as Ana Lizaran)Raquel SanchísAngelina Llongueras

This fits in nearly no situation, so my second guess was that Lucene adds a
virtual delimiter between the individual field entries for offset
calculation. With a delimiter added, the result would be:

movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo) Miguel Bosé Anna Lizaran 
(as Ana Lizaran) Raquel Sanchís Angelina Llongueras
(note the ' ' between each actor name)

..this also doesn't fit every situation - now there are too many delimiters.
So I further guessed that Lucene doesn't add a delimiter in every case, and
appended one only when the last character of an entry is an alphanumeric one,
with:
StringBuilder strbAttContent = new StringBuilder();
for (String strAttValue : m_luceneDocument.getValues(strFieldName))
{
    strbAttContent.append(strAttValue);
    if (strbAttContent.substring(strbAttContent.length() - 1).matches("\\w"))
        strbAttContent.append(' ');
}

where I get the result (virtual) entry:
movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo)Miguel BoséAnna Lizaran (as Ana Lizaran)Raquel Sanchís Angelina Llongueras

This fits for ~96% of all my queries... but it is still not exactly the way
Lucene calculates the offset value for fields with multiple value entries.


..maybe the problem is that there are special characters in my database
(e.g. the 'é' in 'Bosé') that my '\w' doesn't match. I have looked at this
specific situation, but accounting for that one character doesn't solve the
problem.
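The guesses above can be probed systematically without re-indexing: compute the hypothetical start offset of each value under a given per-value gap and compare against the offsets Lucene actually reports. Below is a pure-string sketch of the concatenation model (the gap size is the unknown being probed; one further hypothesis worth checking against Lucene's source is that the running offset is based on the end offset of the last *token* of the previous value, which would explain mismatches when a value ends in punctuation the analyzer strips):

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetGuess {
    /**
     * Hypothetical start offset of each field value, assuming Lucene
     * concatenates the values and inserts a fixed gap between them.
     * The gap is an assumption to experiment with (0, 1, ...).
     */
    public static List<Integer> valueStarts(String[] values, int gap) {
        List<Integer> starts = new ArrayList<>();
        int base = 0;
        for (String v : values) {
            starts.add(base);
            base += v.length() + gap;
        }
        return starts;
    }

    public static void main(String[] args) {
        String[] actors = {
            "Mayrata O'Wisiedo (as Mairata O'Wisiedo)",
            "Miguel Bose",                 // accent dropped for ASCII safety
            "Anna Lizaran (as Ana Lizaran)"
        };
        System.out.println(valueStarts(actors, 1)); // prints [0, 41, 53]
    }
}
```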


How does Lucene calculate these offsets? I also searched the source code,
but couldn't find the right place.


Thanks in advance!

Christian Reuschling





- --
__
Christian Reuschling, Dipl.-Ing.(BA)
Software Engineer

Knowledge Management Department
German Research Center for Artificial Intelligence DFKI GmbH
Trippstadter Straße 122, D-67663 Kaiserslautern, Germany

Phone: +49.631.20575-125
mailto:[EMAIL PROTECTED]  http://www.dfki.uni-kl.de/~reuschling/

- Legal Company Information Required by German Law--
Geschäftsführung: Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster (Vorsitzender)
  Dr. Walter Olthoff
Vorsitzender des Aufsichtsrats: Prof. Dr. h.c. Hans A. Aukes
Amtsgericht Kaiserslautern, HRB 2313
__




Re: out of order

2007-08-16 Thread Michael McCandless

Well then that is particularly spooky!!

And, hopefully, possible/easy to reproduce.  Thanks.

Mike

"testn" <[EMAIL PROTECTED]> wrote:
> 
> I use RAMDirectory and the error often shows a low number. Last time it
> happened with the message "7<=7". Next time it happens, I will try to
> capture the stacktrace.
> 
> 
> 
> Michael McCandless-2 wrote:
> > 
> > 
> > "testn" <[EMAIL PROTECTED]> wrote:
> >> 
> >> Using Lucene 2.2.0, I still sporadically got doc out of order error. I
> >> indexed all of my stuff in one thread. Do you have any idea why it
> >> happens?
> > 
> > Hm, that is not good.  I thought we had finally fixed this with
> > LUCENE-140.  Though un-corrected disk errors could in theory lead to
> > this too.
> > 
> > Are you able to easily reproduce it?  Can you post the full exception?
> > 
> > Mike
> > 
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> > 
> > 
> > 
> 
> -- 
> View this message in context:
> http://www.nabble.com/out-of-order-tf4276385.html#a12173705
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> 
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 




Re: Can I do boosting based on term postions?

2007-08-16 Thread vini

Hi Shailendra,

Could you please send the same class file to my Gmail account too?

Regards
vini

Shailendra Sharma wrote:
> 
> Ah, Good way !
> 
> On 8/4/07, Paul Elschot <[EMAIL PROTECTED]> wrote:
>>
>> On Friday 03 August 2007 20:35, Shailendra Sharma wrote:
>> > Paul,
>> >
>> > If I understand Cedric right, he wants to have different boosting
>> depending
>> > on search term positions in the document. By using SpanFirstQuery he
>> will
>> > only be able to consider in terms till particular position;
>>
>>
>> > but he won't be
>> > able to do something like following:
>> >   a) Give 100% boosting to matching in first 100 words.
>> >   b) Give 80% boosting to matching in next 100 words.
>> >   c) Give 60% boosting to matching in next 100 words.
>>
>> > Though it can be done by writing DisjunctionMaxQuery having multiple
>> > SpanFirstQuery with different boosting - but I see it as a workaround
>> only
>> > and not the direct and efficient solution.
>>
>> You're right, but SpanFirstQuery needs only a minor modification
>> for this to work.
>>
>> This modification to SpanFirstQuery would be that the Spans
>> returned by SpanFirstQuery.getSpans() must always return 0
>> from its start() method. Then the slop passed to sloppyFreq(slop)
>> would be the distance from the beginning of the indexed field
>> to the end of the Spans of the SpanQuery passed to SpanFirstQuery.
>>
>> Then the following should work:
>>
>> Term firstTerm =  ;
>>
>> SpanFirstQuery sfq = new SpanFirstQuery(
>>   new SpanTermQuery( firstTerm),
>>   Integer.MAX_VALUE) {
>> ...
>> public Similarity getSimilarity() {
>> return new Similarity() {
>> ...
>> public float sloppyFreq(int slop) {
>>   return (slop < 100)  ? 1.0f
>>: (slop < 200) ? 0.8f
>>: (slop < 300) ? 0.6f
>>: 0.4f ; // etc. etc.
>> 
>>
>>
>> Actually, I'm a bit surprised that SpanFirstQuery does not work that
>> way now.
>>
>> Regards,
>> Paul Elschot
>>
>>
>> >
>> > Cedric,
>> >
>> > I am sending you the implementation of SpanTermQuery to your gmail
>> > account (lucene
>> > mailing list is bouncing email with attachment). I have named the class
>> as
>> > VSpanTermQuery (I have followed the same package hierarchy as lucene).
>> You
>> > also need to extend VSimilarity class - which would require
>> implementation
>> > of method scoreSpan(..).
>> >
>> > Let me know how it went. Though I did a testing for it, but before
>> > submitting to contrib, I need to do extensive testing.
>> >
>> > Thanks,
>> > Shailendra
>> >
>> > On 8/3/07, Paul Elschot <[EMAIL PROTECTED]> wrote:
>> > >
>> > > Cedric,
>> > >
>> > > You can choose the end limit for SpanFirstQuery yourself.
>> > >
>> > > Regards,
>> > > Paul Elschot
>> > >
>> > >
>> > > On Friday 03 August 2007 05:38, Cedric Ho wrote:
>> > > > Hi Paul,
>> > > >
>> > > > Isn't SpanFirstQuery only match those with position less than a
>> > > > certain end position?
>> > > >
>> > > > I am rather looking for a query that would score a document higher
>> for
>> > > > terms appear near the start but not totally discard those with
>> terms
>> > > > appear near the end.
>> > > >
>> > > > Regards,
>> > > > Cedric
>> > > >
>> > > > On 8/2/07, Paul Elschot <[EMAIL PROTECTED]> wrote:
>> > > > > Cedric,
>> > > > >
>> > > > > SpanFirstQuery could be a solution without payloads.
>> > > > > You may want to give it your own Similarity.sloppyFreq() .
>> > > > >
>> > > > > Regards,
>> > > > > Paul Elschot
>> > > > >
>> > > > > On Thursday 02 August 2007 04:07, Cedric Ho wrote:
>> > > > > > Thanks for the quick response =)
>> > > > > >
>> > > > > > On 8/1/07, Shailendra Sharma <[EMAIL PROTECTED]>
>> wrote:
>> > > > > > > Yes, it is easily doable through "Payload" facility. During
>> > > indexing
>> > > > > process
>> > > > > > > (mainly tokenization), you need to push this extra
>> information
>> in
>> > > each
>> > > > > > > token. And then you can use BoostingTermQuery for using
>> Payload
>> > > value
>> > > to
>> > > > > > > include Payload in the score. You also need to implement
>> > > Similarity
>> > > for
>> > > > > this
>> > > > > > > (mainly scorePayload method).
>> > > > > >
>> > > > > > If I store, say a custom boost factor as Payload, does it means
>> that
>> > > I
>> > > > > > will store one more byte per term per document in the index
>> file? So
>> > > > > > the index file would be much larger?
>> > > > > >
>> > > > > > >
>> > > > > > > Other way can be to extend SpanTermQuery, this already
>> calculates
>> > > the
>> > > > > > > position of match. You just need to do something to use this
>> > > position
>> > > > > value
>> > > > > > > in the score calculation.
>> > > > > >
>> > > > > > I see that SpanTermQuery takes a TermPositions from the
>> indexReader
>> > > > > > and I can get the term position from there. However I am not
>> sure
>> > > how
>> > > > > > to incorporate it into the score calculation. Would you mind
>> give a
>> > > > > > little more detail on this?
>> > > > > >
>> > > > > > >
>> > > > > > >

Re: Question about highlighting returning nothing

2007-08-16 Thread Donna L Gresh
Actually I don't think I'm having trouble-- as I mentioned,
my text is *not* stored, so to do highlighting I retrieve the
text from the database, apply the appropriate analyzer, 
and do the highlighting. It seems to be working exactly as
it should. My problem was that in a few cases, the document
has been removed from the database (but not from the index)
so when I queried the database using the identifier for the "best
hit" from the index, nothing
was being returned. Passing "nothing" to the highlighter 
resulted in, of course, nothing, so I was getting no highlighted
text. Once I updated my index to be in synch with the database,
I no longer had any empty returns from the highlighter.

Donna L. Gresh
Services Research, Mathematical Sciences Department
IBM T.J. Watson Research Center
(914) 945-2472
http://www.research.ibm.com/people/g/donnagresh
[EMAIL PROTECTED]




"Lukas Vlcek" <[EMAIL PROTECTED]> 
08/15/2007 03:49 PM
Please respond to
java-user@lucene.apache.org


To
java-user@lucene.apache.org
cc

Subject
Re: Question about highlighting returning nothing






Donna,

I have been investigating highlighters in Lucene recently. The humble
experience I've gained so far is that highlighting is a completely
different task from the indexing/searching tandem. This simple fact is not
obvious to a lot of people. In your particular case it would be helpful if
you could post more technical details about your system settings. Not only
is it important whether the field to be highlighted is stored, but also
whether you allow query rewriting and what kind of queries you are using
(Prefix, Wildcard, etc.).

Just my 2 cents.

Lukas

On 8/15/07, Donna L Gresh <[EMAIL PROTECTED]> wrote:
>
> Well, in my case the highlighting was returning nothing because of (my
> favorite acronym) PBCAK--
>
> I don't store the text in the index, so I have to retrieve it separately
> (from a database) for the highlighting, and my database was not in sync
> with the index, so in a few cases the document in the index had been
> deleted from the database--thus a score, but no document text.
>
> But I guess my original question remains; under what conditions would 
the
> highlighter return nothing? Only if no terms matched?
>
> Donna
>



Re: Reply: Indexing correctly?

2007-08-16 Thread John Paul Sondag
I've started to redo tests one at a time to see what exactly caused the
decreased index time.  Using the absolute path instead of the relative path
to the data doesn't seem to have made a significant difference, but using
StringBuffers (with a default of 25) made a huge change.  I still have
to try having the RAMDir as the only change to see what happens.

--JP


On 8/15/07, Erick Erickson <[EMAIL PROTECTED]> wrote:
>
> OK, what worked? Using a RAMDir?
>
> Erick
>
> On 8/15/07, John Paul Sondag <[EMAIL PROTECTED]> wrote:
> >
> > It worked!  My indexing time went from over 6 hours to 592
> seconds!  Thank
> > you guys so much!
> >
> > --JP
> >
> > On 8/14/07, karl wettin <[EMAIL PROTECTED]> wrote:
> > >
> > >
> > > 14 aug 2007 kl. 21.34 skrev John Paul Sondag:
> > >
> > > > What exactly is a RAMDirectory, I didn't see it mentioned on that
> > > > page.  Is
> > > > there example code of using it?   Do I just create a Ram Directory
> > > > and then
> > > > use it like it's a normal directory?
> > >
> > > Yes, it is just like FSDirectory, but resides in RAM and is not
> > > persistent.
> > >
> > > 
> > >
> > > --
> > > karl
> > >
> > >
> > >
> > >
> > >
> > > -
> > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > For additional commands, e-mail: [EMAIL PROTECTED]
> > >
> > >
> >
>


Stemmed terms/common terms

2007-08-16 Thread Alf Eaton

A couple of questions about term frequencies and stemming:

- What's the best way to get the most common unstemmed form of a  
Porter-stemmed word from the index? For example given the stem  
'walk', find that 'walking' is the most common full word in the index.


- Is there a way to get a list of all the terms in the index (or  
maybe just the top n) ordered by descending frequency of usage? I  
imagine it's related to docFreq, but can't see how to get a list of  
terms in all documents.
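For the second question: once term -> document-frequency pairs are in hand (in Lucene 2.x they would presumably come from iterating reader.terms() and calling docFreq() on the TermEnum), ranking them is plain collection work. This is a sketch of the selection step only; the map contents are made-up sample data:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TopTerms {
    /** The n terms with the highest document frequency, descending. */
    public static List<Map.Entry<String, Integer>> topN(Map<String, Integer> docFreq, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(docFreq.entrySet());
        entries.sort((a, b) -> b.getValue() - a.getValue()); // descending by docFreq
        return entries.subList(0, Math.min(n, entries.size()));
    }

    public static void main(String[] args) {
        // term -> docFreq; in Lucene this would be filled from the TermEnum.
        Map<String, Integer> df = new HashMap<>();
        df.put("walk", 3);
        df.put("walking", 7);
        df.put("walked", 2);
        System.out.println(topN(df, 2)); // prints [walking=7, walk=3]
    }
}
```

For a large index you would keep only a bounded min-heap of size n rather than sorting the full term list.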


I'm using PyLucene and Solr, so if there are easy solutions in either  
of those that would be ideal.


Thanks,
alf.






Re: Question about highlighting returning nothing

2007-08-16 Thread Lukas Vlcek
Donna,

Now I understand what you are saying (seems that I had PBCAK as well ;-)

As for your last question: ...under what conditions would the highlighter
return nothing? Only if no terms matched?

I remember finding that the highlighter can return null or an empty string
in different situations. I think it depends on the Analyzer used or
something like that...

BR
Lukas

On 8/16/07, Donna L Gresh <[EMAIL PROTECTED]> wrote:
>
> Actually I don't think I'm having trouble-- as I mentioned,
> my text is *not* stored, so to do highlighting I retrieve the
> text from the database, apply the appropriate analyzer,
> and do the highlighting. It seems to be working exactly as
> it should. My problem was that in a few cases, the document
> has been removed from the database (but not from the index)
> so when I queried the database using the identifier for the "best
> hit" from the index, nothing
> was being returned. Passing "nothing" to the highlighter
> resulted in, of course, nothing, so I was getting no highlighted
> text. Once I updated my index to be in synch with the database,
> I no longer had any empty returns from the highlighter.
>
> Donna L. Gresh
> Services Research, Mathematical Sciences Department
> IBM T.J. Watson Research Center
> (914) 945-2472
> http://www.research.ibm.com/people/g/donnagresh
> [EMAIL PROTECTED]
>
>
>
>
> "Lukas Vlcek" <[EMAIL PROTECTED]>
> 08/15/2007 03:49 PM
> Please respond to
> java-user@lucene.apache.org
>
>
> To
> java-user@lucene.apache.org
> cc
>
> Subject
> Re: Question about highlighting returning nothing
>
>
>
>
>
>
> Donna,
>
> I have been investigating highlighters in Lucene recently. The humble
> experience I've gained so far is that highlighting is a completely
> different task from the indexing/searching tandem. This simple fact is
> not obvious to a lot of people. In your particular case it would be
> helpful if you could post more technical details about your system
> settings. Not only is it important whether the field to be highlighted
> is stored, but also whether you allow query rewriting and what kind of
> queries you are using (Prefix, Wildcard, etc.).
>
> Just my 2 cents.
>
> Lukas
>
> On 8/15/07, Donna L Gresh <[EMAIL PROTECTED]> wrote:
> >
> > Well, in my case the highlighting was returning nothing because of (my
> > favorite acronym) PBCAK--
> >
> > I don't store the text in the index, so I have to retrieve it separately
> > (from a database) for the highlighting, and my database was not in sync
> > with the index, so in a few cases the document in the index had been
> > deleted from the database--thus a score, but no document text.
> >
> > But I guess my original question remains; under what conditions would
> the
> > highlighter return nothing? Only if no terms matched?
> >
> > Donna
> >
>
>


Re: Question about highlighting returning nothing

2007-08-16 Thread mark harwood
Highlighter deliberately returns null so the calling app can tell when the text 
wasn't successfully highlighted.

Situations when this can happen are:

1) The text is out of synch with the index (the scenario you encountered)
2) The choice of analyzer used to tokenize the text differs from that used by 
the query parser
3) The document was matched on un-highlightable criteria e.g. a range query 
which the QueryParser will have turned into a ConstantScoreQuery wrapping a 
filter for performance reasons - no terms in the range are visible to the 
highlighter under these circumstances as the criteria becomes a bitset rather 
than a list of terms in the rewritten query.


Cheers
Mark


- Original Message 
From: Lukas Vlcek <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Thursday, 16 August, 2007 4:06:36 PM
Subject: Re: Question about highlighting returning nothing

Donna,

Now I understand what you are saying (seems that I had PBCAK as well ;-)

As for your last question: ...under what conditions would the highlighter
return nothing? Only if no terms matched?

I remember that I found that highlighter can return null or empty string in
different situations. I think it depends on the Analyzer used or something
like that...

BR
Lukas

On 8/16/07, Donna L Gresh <[EMAIL PROTECTED]> wrote:
>
> Actually I don't think I'm having trouble-- as I mentioned,
> my text is *not* stored, so to do highlighting I retrieve the
> text from the database, apply the appropriate analyzer,
> and do the highlighting. It seems to be working exactly as
> it should. My problem was that in a few cases, the document
> has been removed from the database (but not from the index)
> so when I queried the database using the identifier for the "best
> hit" from the index, nothing
> was being returned. Passing "nothing" to the highlighter
> resulted in, of course, nothing, so I was getting no highlighted
> text. Once I updated my index to be in synch with the database,
> I no longer had any empty returns from the highlighter.
>
> Donna L. Gresh
> Services Research, Mathematical Sciences Department
> IBM T.J. Watson Research Center
> (914) 945-2472
> http://www.research.ibm.com/people/g/donnagresh
> [EMAIL PROTECTED]
>
>
>
>
> "Lukas Vlcek" <[EMAIL PROTECTED]>
> 08/15/2007 03:49 PM
> Please respond to
> java-user@lucene.apache.org
>
>
> To
> java-user@lucene.apache.org
> cc
>
> Subject
> Re: Question about highlighting returning nothing
>
>
>
>
>
>
> Donna,
>
> I have been investigating highlighters in Lucene recently. The humble
> experience I've gained so far is that highlighting is a completely
> different task from the indexing/searching tandem. This simple fact is
> not obvious to a lot of people. In your particular case it would be
> helpful if you could post more technical details about your system
> settings. Not only is it important whether the field to be highlighted
> is stored, but also whether you allow query rewriting and what kind of
> queries you are using (Prefix, Wildcard, etc.).
>
> Just my 2 cents.
>
> Lukas
>
> On 8/15/07, Donna L Gresh <[EMAIL PROTECTED]> wrote:
> >
> > Well, in my case the highlighting was returning nothing because of (my
> > favorite acronym) PBCAK--
> >
> > I don't store the text in the index, so I have to retrieve it separately
> > (from a database) for the highlighting, and my database was not in sync
> > with the index, so in a few cases the document in the index had been
> > deleted from the database--thus a score, but no document text.
> >
> > But I guess my original question remains; under what conditions would
> the
> > highlighter return nothing? Only if no terms matched?
> >
> > Donna
> >
>
>









Re: out of order

2007-08-16 Thread testn

Here you go

 -> Error during the indexing : docs out of order (0 <= 0 )
org.apache.lucene.index.CorruptIndexException: docs out of order (0 <= 0 )
at
org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:368)
at
org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:325)
at
org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:297)
at
org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:261)
at
org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:98)
at
org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:1883)
at
org.apache.lucene.index.IndexWriter.flushRamSegments(IndexWriter.java:1741)
at
org.apache.lucene.index.IndexWriter.flushRamSegments(IndexWriter.java:1733)
at
org.apache.lucene.index.IndexWriter.maybeFlushRamSegments(IndexWriter.java:1727)
at
org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1004)
at
org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:983)
at
org.springmodules.lucene.index.factory.SimpleLuceneIndexWriter.addDocument(SimpleLuceneIndexWriter.java:44)
at
org.springmodules.lucene.index.object.database.DefaultDatabaseIndexer.doHandleRequest(DefaultDatabaseIndexer.java:306)
at
org.springmodules.lucene.index.object.database.DefaultDatabaseIndexer.index(DefaultDatabaseIndexer.java:354)


Michael McCandless-2 wrote:
> 
> 
> Well then that is particularly spooky!!
> 
> And, hopefully, possible/easy to reproduce.  Thanks.
> 
> Mike
> 
> "testn" <[EMAIL PROTECTED]> wrote:
>> 
>> I use RAMDirectory and the error often shows a low number. Last time it
>> happened with the message "7<=7". Next time it happens, I will try to
>> capture the stacktrace.
>> 
>> 
>> 
>> Michael McCandless-2 wrote:
>> > 
>> > 
>> > "testn" <[EMAIL PROTECTED]> wrote:
>> >> 
>> >> Using Lucene 2.2.0, I still sporadically got doc out of order error. I
>> >> indexed all of my stuff in one thread. Do you have any idea why it
>> >> happens?
>> > 
>> > Hm, that is not good.  I thought we had finally fixed this with
>> > LUCENE-140.  Though un-corrected disk errors could in theory lead to
>> > this too.
>> > 
>> > Are you able to easily reproduce it?  Can you post the full exception?
>> > 
>> > Mike
>> > 
>> > -
>> > To unsubscribe, e-mail: [EMAIL PROTECTED]
>> > For additional commands, e-mail: [EMAIL PROTECTED]
>> > 
>> > 
>> > 
>> 
>> -- 
>> View this message in context:
>> http://www.nabble.com/out-of-order-tf4276385.html#a12173705
>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>> 
>> 
>> -
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]
>> 
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/out-of-order-tf4276385.html#a12184067
Sent from the Lucene - Java Users mailing list archive at Nabble.com.





Re: Stemmed terms/common terms

2007-08-16 Thread Grant Ingersoll


On Aug 16, 2007, at 10:17 AM, Alf Eaton wrote:


A couple of questions about term frequencies and stemming:

- What's the best way to get the most common unstemmed form of a  
Porter-stemmed word from the index? For example given the stem  
'walk', find that 'walking' is the most common full word in the index.


Are both in the index?  I would think this is going to take some  
application specific logic, since Lucene doesn't inherently track  
these relations.  You might be able to string something together  
using some of the regular expression/wildcard queries, but it is  
going to take some work on your part.


Another approach might be to put some mechanisms in place during  
analysis that track this information.
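One concrete form of that tracking mechanism (a sketch, not a Lucene API - the hook point would presumably be a custom TokenFilter that sees each original token next to its stem): keep a side map from stem to the original surface forms and their counts, then answer "most common unstemmed form" from the map.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class StemFormTracker {
    // stem -> (original form -> occurrence count); filled during analysis.
    private final Map<String, Map<String, Integer>> forms = new HashMap<>();

    public void record(String original, String stem) {
        forms.computeIfAbsent(stem, k -> new HashMap<>())
             .merge(original.toLowerCase(), 1, Integer::sum);
    }

    /** Most frequent unstemmed form seen for the given stem, or null. */
    public String mostCommonForm(String stem) {
        Map<String, Integer> counts = forms.get(stem);
        if (counts == null) return null;
        return Collections.max(counts.entrySet(),
                Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        StemFormTracker t = new StemFormTracker();
        t.record("walking", "walk");
        t.record("walking", "walk");
        t.record("walked", "walk");
        System.out.println(t.mostCommonForm("walk")); // prints walking
    }
}
```

The map could be persisted alongside the index and rebuilt whenever the index is rebuilt.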




- Is there a way to get a list of all the terms in the index (or  
maybe just the top n) ordered by descending frequency of usage? I  
imagine it's related to docFreq, but can't see how to get a list of  
terms in all documents.


Have a look at Luke if you just want the info as part of a UI.  Also,
I _believe_ Solr has added a LukeRequestHandler (see
http://wiki.apache.org/solr/LukeRequestHandler); not sure if it does
everything you are looking for, but it might be a place to start.
You might ask your question on the Solr mailing list.




I'm using PyLucene and Solr, so if there are easy solutions in  
either of those that would be ideal.


Thanks,
alf.






--
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ






Re: Stemmed terms/common terms

2007-08-16 Thread Alf Eaton


On 16 Aug 2007, at 17:06, Grant Ingersoll wrote:



On Aug 16, 2007, at 10:17 AM, Alf Eaton wrote:


A couple of questions about term frequencies and stemming:

- What's the best way to get the most common unstemmed form of a  
Porter-stemmed word from the index? For example given the stem  
'walk', find that 'walking' is the most common full word in the  
index.


Are both in the index?  I would think this is going to take some  
application specific logic, since Lucene doesn't inherently track  
these relations.  You might be able to string something together  
using some of the regular expression/wildcard queries, but it is  
going to take some work on your part.


Hmm, no - the stemmed token is indexed and the full field is stored.  
I guess that means running a search for the stem and then using the  
same logic as a highlighter to find and extract the actual terms from  
each document.


Another approach might be to put some mechanisms in place during  
analysis that track this information.


How would you recommend doing this - using positionIncrement to store  
the stem and the original word at the same position, perhaps?
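Yes - emitting the original token and then the stem with a position increment of 0 stacks both at the same position (the same mechanism synonym filters use), so phrase queries keep working against either form. The bookkeeping can be modeled without Lucene; Token here is a hypothetical stand-in for Lucene's token class:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class StackedTokens {
    static final class Token {
        final String text;
        final int posIncr; // models Lucene's positionIncrement attribute
        Token(String text, int posIncr) { this.text = text; this.posIncr = posIncr; }
    }

    /** Maps each absolute position to the tokens indexed there; an
     *  increment of 0 stacks a token on top of the previous one. */
    public static Map<Integer, List<String>> positions(List<Token> stream) {
        Map<Integer, List<String>> byPos = new TreeMap<>();
        int pos = -1;
        for (Token t : stream) {
            pos += t.posIncr;
            byPos.computeIfAbsent(pos, k -> new ArrayList<>()).add(t.text);
        }
        return byPos;
    }

    public static void main(String[] args) {
        List<Token> stream = Arrays.asList(
            new Token("walking", 1),   // original form
            new Token("walk", 0),      // stem, same position
            new Token("home", 1));
        System.out.println(positions(stream));
        // prints {0=[walking, walk], 1=[home]}
    }
}
```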


alf.




Re: Question about highlighting returning nothing

2007-08-16 Thread Lukas Vlcek
Hi,

What I meant was that the highlighter can return either null or an empty
string, so one should check for null first and then also for "". At least
that is my observation...

Lukas

On 8/16/07, mark harwood <[EMAIL PROTECTED]> wrote:
>
> Highlighter deliberately returns null so the calling app can tell when the
> text wasn't successfully highlighted.
>
> Situations when this can happen are:
>
> 1) The text is out of synch with the index (the scenario you encountered)
> 2) The choice of analyzer used to tokenize the text differs from that used
> by the query parser
> 3) The document was matched on un-highlightable criteria e.g. a range
> query which the QueryParser will have turned into a ConstantScoreQuery
> wrapping a filter for performance reasons - no terms in the range are
> visible to the highlighter under these circumstances as the criteria becomes
> a bitset rather than a list of terms in the rewritten query.
>
>
> Cheers
> Mark
>
>
> - Original Message 
> From: Lukas Vlcek <[EMAIL PROTECTED]>
> To: java-user@lucene.apache.org
> Sent: Thursday, 16 August, 2007 4:06:36 PM
> Subject: Re: Question about highlighting returning nothing
>
> Donna,
>
> Now I understand what you are saying (seems that I had PBCAK as well ;-)
>
> As for your last question: ...under what conditions would the highlighter
> return nothing? Only if no terms matched?
>
> I remember that I found that highlighter can return null or empty string
> in
> different situations. I think it depends on the Analyzer used or something
> like that...
>
> BR
> Lukas
>
> On 8/16/07, Donna L Gresh <[EMAIL PROTECTED]> wrote:
> >
> > Actually I don't think I'm having trouble-- as I mentioned,
> > my text is *not* stored, so to do highlighting I retrieve the
> > text from the database, apply the appropriate analyzer,
> > and do the highlighting. It seems to be working exactly as
> > it should. My problem was that in a few cases, the document
> > has been removed from the database (but not from the index)
> > so when I queried the database using the identifier for the "best
> > hit" from the index, nothing
> > was being returned. Passing "nothing" to the highlighter
> > resulted in, of course, nothing, so I was getting no highlighted
> > text. Once I updated my index to be in synch with the database,
> > I no longer had any empty returns from the highlighter.
> >
> > Donna L. Gresh
> > Services Research, Mathematical Sciences Department
> > IBM T.J. Watson Research Center
> > (914) 945-2472
> > http://www.research.ibm.com/people/g/donnagresh
> > [EMAIL PROTECTED]
> >
> >
> >
> >
> > "Lukas Vlcek" <[EMAIL PROTECTED]>
> > 08/15/2007 03:49 PM
> > Please respond to
> > java-user@lucene.apache.org
> >
> >
> > To
> > java-user@lucene.apache.org
> > cc
> >
> > Subject
> > Re: Question about highlighting returning nothing
> >
> >
> >
> >
> >
> >
> > Donna,
> >
> > I have been investigating highlighters in Lucene recently a bit. The
> > humble experience I've learned so far is that highlighting is a
> > completely different task from the indexing/searching tandem. This
> > simple fact is not obvious to a lot of people. In your particular case
> > it would be helpful if you can post more technical details about your
> > system settings. Not only is it important whether the field to be
> > highlighted is stored, but also whether you allow for query rewrite and
> > what kind of queries you are using (Prefix, Wildcard ... etc).
> >
> > Just my 2 cents.
> >
> > Lukas
> >
> > On 8/15/07, Donna L Gresh <[EMAIL PROTECTED]> wrote:
> > >
> > > Well, in my case the highlighting was returning nothing because of (my
> > > favorite acronym) PBCAK--
> > >
> > > I don't store the text in the index, so I have to retrieve it
> separately
> > > (from a database) for the highlighting, and my database was not in
> sync
> > > with the index, so in a few cases the document in the index had been
> > > deleted from the database--thus a score, but no document text.
> > >
> > > But I guess my original question remains; under what conditions would
> > the
> > > highlighter return nothing? Only if no terms matched?
> > >
> > > Donna
> > >
> >
> >
>
>
>
>
>
>   ___
> Want ideas for reducing your carbon footprint? Visit Yahoo! For Good
> http://uk.promotions.yahoo.com/forgood/environment.html
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


[Fwd: Exception in MultiLevelSkipListReader$SkipBuffer.readByte]

2007-08-16 Thread Scott Montgomerie
I'm getting an ArrayIndexOutOfBoundsException in
MultiLevelSkipListReader$SkipBuffer. This happens sporadically, on a
fairly small index (18 MB, about 30,000 documents). The index is
subject to a lot of adds and deletes, some of them concurrently. It
happens after about 4 days of heavy usage. I was able to isolate a copy
of the index that causes the exception, and I can reproduce the
exception cleanly in a Junit test.
I can see that readByte(), where the error is occurring, has no bounds
checking, therefore I assume that the data in there must be correct?
Hence, the index has obviously become corrupted. Further, optimizing
the index fixes the problem.

The problem is reproducible in the working system. As I said, around 4-5
days after optimization, the same error occurs sporadically.
Any ideas?

Oh and this is Lucene 2.2.0, jdk 1.5.0_12.

The code from the junit test that calls this is pretty simple:

Query profileQuery = new TermQuery(new
Term(IndexFields.bookmark_profile_id, "1"));
Hits h = searcher.search(profileQuery, filterPrivate());

searcher is a plain old IndexSearcher, and filterPrivate() returns a
QueryFilter based on a 2-term BooleanQuery.


Full stack trace:

Exception in thread "MultiSearcher thread #2"
java.lang.ArrayIndexOutOfBoundsException: 14
at
org.apache.lucene.index.MultiLevelSkipListReader$SkipBuffer.readByte(MultiLevelSkipListReader.java:258)
at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:57)
at
org.apache.lucene.index.DefaultSkipListReader.readSkipData(DefaultSkipListReader.java:110)
at
org.apache.lucene.index.MultiLevelSkipListReader.loadNextSkip(MultiLevelSkipListReader.java:140)
at
org.apache.lucene.index.MultiLevelSkipListReader.skipTo(MultiLevelSkipListReader.java:110)
at
org.apache.lucene.index.SegmentTermDocs.skipTo(SegmentTermDocs.java:164)
at org.apache.lucene.index.MultiTermDocs.skipTo(MultiReader.java:413)
at org.apache.lucene.search.TermScorer.skipTo(TermScorer.java:145)
at
org.apache.lucene.util.ScorerDocQueue.topSkipToAndAdjustElsePop(ScorerDocQueue.java:120)
at
org.apache.lucene.search.DisjunctionSumScorer.skipTo(DisjunctionSumScorer.java:229)
at
org.apache.lucene.search.BooleanScorer2.skipTo(BooleanScorer2.java:381)
at
org.apache.lucene.search.ConjunctionScorer.doNext(ConjunctionScorer.java:63)
at
org.apache.lucene.search.ConjunctionScorer.next(ConjunctionScorer.java:58)
at
org.apache.lucene.search.BooleanScorer2.score(BooleanScorer2.java:327)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:146)
at org.apache.lucene.search.Searcher.search(Searcher.java:118)
at org.apache.lucene.search.Searcher.search(Searcher.java:97)
at
org.apache.lucene.search.QueryWrapperFilter.bits(QueryWrapperFilter.java:50)
at
org.apache.lucene.search.CachingWrapperFilter.bits(CachingWrapperFilter.java:58)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:133)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:113)
at
org.apache.lucene.search.MultiSearcherThread.run(ParallelMultiSearcher.java:250)

java.lang.NullPointerException
at
org.apache.lucene.search.MultiSearcherThread.hits(ParallelMultiSearcher.java:280)
at
org.apache.lucene.search.ParallelMultiSearcher.search(ParallelMultiSearcher.java:83)
at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:74)
at org.apache.lucene.search.Hits.<init>(Hits.java:53)
at org.apache.lucene.search.Searcher.search(Searcher.java:46)

Thanks.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [Fwd: Exception in MultiLevelSkipListReader$SkipBuffer.readByte]

2007-08-16 Thread Yonik Seeley
I wonder if this is related to
https://issues.apache.org/jira/browse/LUCENE-951

If it's easy enough for you to reproduce, could you try the trunk
version of Lucene and see if it's fixed?

-Yonik

On 8/16/07, Scott Montgomerie <[EMAIL PROTECTED]> wrote:
> I'm getting an ArrayIndexOutOfBoundsException in
> MultiLevelSkipListReader$SkipBuffer. This happens sporadically, on a
> fairly small index (18 MB, about 30,000 documents). The index is
> subject to a lot of adds and deletes, some of them concurrently. It
> happens after about 4 days of heavy usage. I was able to isolate a copy
> of the index that causes the exception, and I can reproduce the
> exception cleanly in a Junit test.
> I can see that readByte(), where the error is occurring, has no bounds
> checking, therefore I assume that the data in there must be correct?
> Hence, the index has obviously become corrupted. Further, optimizing
> the index fixes the problem.
>
> The problem is reproducible in the working system. As I said, around 4-5
> days after optimization, the same error occurs sporadically.
> Any ideas?
>
> Oh and this is Lucene 2.2.0, jdk 1.5.0_12.
>
> The code from the junit test that calls this is pretty simple:
>
> Query profileQuery = new TermQuery(new
> Term(IndexFields.bookmark_profile_id, "1"));
> Hits h = searcher.search(profileQuery, filterPrivate());
>
> search is a plain old IndexSearcher, and filterPrivate() returns a
> QueryFilter based on a 2-term BooleanQuery.
>
>
> Full stack trace:
>
> Exception in thread "MultiSearcher thread #2"
> java.lang.ArrayIndexOutOfBoundsException: 14
> at
> org.apache.lucene.index.MultiLevelSkipListReader$SkipBuffer.readByte(MultiLevelSkipListReader.java:258)
> at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:57)
> at
> org.apache.lucene.index.DefaultSkipListReader.readSkipData(DefaultSkipListReader.java:110)
> at
> org.apache.lucene.index.MultiLevelSkipListReader.loadNextSkip(MultiLevelSkipListReader.java:140)
> at
> org.apache.lucene.index.MultiLevelSkipListReader.skipTo(MultiLevelSkipListReader.java:110)
> at
> org.apache.lucene.index.SegmentTermDocs.skipTo(SegmentTermDocs.java:164)
> at org.apache.lucene.index.MultiTermDocs.skipTo(MultiReader.java:413)
> at org.apache.lucene.search.TermScorer.skipTo(TermScorer.java:145)
> at
> org.apache.lucene.util.ScorerDocQueue.topSkipToAndAdjustElsePop(ScorerDocQueue.java:120)
> at
> org.apache.lucene.search.DisjunctionSumScorer.skipTo(DisjunctionSumScorer.java:229)
> at
> org.apache.lucene.search.BooleanScorer2.skipTo(BooleanScorer2.java:381)
> at
> org.apache.lucene.search.ConjunctionScorer.doNext(ConjunctionScorer.java:63)
> at
> org.apache.lucene.search.ConjunctionScorer.next(ConjunctionScorer.java:58)
> at
> org.apache.lucene.search.BooleanScorer2.score(BooleanScorer2.java:327)
> at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:146)
> at org.apache.lucene.search.Searcher.search(Searcher.java:118)
> at org.apache.lucene.search.Searcher.search(Searcher.java:97)
> at
> org.apache.lucene.search.QueryWrapperFilter.bits(QueryWrapperFilter.java:50)
> at
> org.apache.lucene.search.CachingWrapperFilter.bits(CachingWrapperFilter.java:58)
> at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:133)
> at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:113)
> at
> org.apache.lucene.search.MultiSearcherThread.run(ParallelMultiSearcher.java:250)
>
> java.lang.NullPointerException
> at
> org.apache.lucene.search.MultiSearcherThread.hits(ParallelMultiSearcher.java:280)
> at
> org.apache.lucene.search.ParallelMultiSearcher.search(ParallelMultiSearcher.java:83)
> at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:74)
> at org.apache.lucene.search.Hits.(Hits.java:53)
> at org.apache.lucene.search.Searcher.search(Searcher.java:46)
>
> Thanks.
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Stemmed terms/common terms

2007-08-16 Thread Alf Eaton

On 16 Aug 2007, at 15:17, Alf Eaton wrote:


- Is there a way to get a list of all the terms in the index (or  
maybe just the top n) ordered by descending frequency of usage? I  
imagine it's related to docFreq, but can't see how to get a list of  
terms in all documents.


Thanks to http://tinyurl.com/2gndww I worked out how to do this (to  
get a list of terms and their frequency) with PyLucene:


terms = reader.terms()
while terms.next():
  term = terms.term()
  if term.field() == 'title':
    print '%s - %d' % (term.text(), reader.docFreq(term))
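
To answer the earlier question about the top-n terms, the same walk can collect (term, docFreq) pairs and then sort them by descending frequency. A sketch of the sorting step in plain Java (the class and method names here are illustrative, not a Lucene API; the map stands in for the pairs gathered while iterating reader.terms()):

```java
import java.util.*;

public class TopTerms {
    // Return the top-n entries of a term -> docFreq map, ordered by
    // descending frequency.
    static List<Map.Entry<String, Integer>> topN(Map<String, Integer> freqs, int n) {
        List<Map.Entry<String, Integer>> entries =
            new ArrayList<Map.Entry<String, Integer>>(freqs.entrySet());
        Collections.sort(entries, new Comparator<Map.Entry<String, Integer>>() {
            public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
                return b.getValue() - a.getValue(); // descending by frequency
            }
        });
        return entries.subList(0, Math.min(n, entries.size()));
    }

    public static void main(String[] args) {
        Map<String, Integer> freqs = new HashMap<String, Integer>();
        freqs.put("lucene", 42);
        freqs.put("index", 17);
        freqs.put("query", 99);
        // "query" sorts first, then "lucene", then "index"
        System.out.println(topN(freqs, 2));
    }
}
```

For a large index you would want a bounded priority queue instead of sorting every term, but the idea is the same.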


alf.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Possible to expose similarity as a property in hits collection?

2007-08-16 Thread Michael Barbarelli
Hello all.

I am trying to get at the raw difference that Lucene uses -- the result of
the fail-fast Levenshtein distance algorithm.  I believe that it is
calculated in FuzzyTermEnum.java (FuzzyTermEnum.cs).
In the application I have built upon Lucene, I would like to expose
similarity as the score, instead of the default score Lucene generates,
which I believe is based on term frequency, etc.
Ideally, I would like to get at the FuzzyQuery/edit-distance-related numbers
from the Hits object, not just the normalized score.

Is it possible to do this?  Anyone have any experience with this?
Incidentally, I am using Lucene.NET.

Thanks,
Mike
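
For what it's worth, the similarity FuzzyTermEnum reports is derived from edit distance -- I believe 1 - distance/min(length) in the 2.x sources, though check your version. A self-contained sketch of that calculation, using the plain dynamic-programming Levenshtein distance rather than Lucene's fail-fast variant:

```java
public class EditSimilarity {
    // Classic dynamic-programming Levenshtein distance.
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,   // deletion
                                            d[i][j - 1] + 1),  // insertion
                                   d[i - 1][j - 1] + cost);    // substitution
            }
        }
        return d[a.length()][b.length()];
    }

    // Similarity on the scale FuzzyTermEnum uses (assumption: matches
    // the 2.x formula 1 - distance / min(len)).
    static float similarity(String term, String target) {
        return 1.0f - ((float) distance(term, target))
                / Math.min(term.length(), target.length());
    }

    public static void main(String[] args) {
        // "lucene" vs "lucent" is one edit apart: similarity 1 - 1/6
        System.out.println(similarity("lucene", "lucent"));
    }
}
```

If you only need the number per matched term, recomputing it this way from the query term and the matched term may be simpler than digging it out of Hits, which only carries the normalized score.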


Re: query question

2007-08-16 Thread testn

Can you post your code? Make sure that when you use a wildcard in your custom
query parser, it generates either a WildcardQuery or a PrefixQuery
correctly.
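
For reference, the stock QueryParser picks the query type from the wildcard characters in the term. A plain-Java sketch of that classification (an approximation of the real rules, which live in QueryParser's getWildcardQuery/getPrefixQuery handling -- a custom getFieldQuery that always returns a TermQuery never reaches them):

```java
public class QueryKind {
    // A term with a single trailing '*' becomes a PrefixQuery, any other
    // '*' or '?' a WildcardQuery, otherwise a plain TermQuery.
    static String classify(String term) {
        int star = term.indexOf('*');
        int quest = term.indexOf('?');
        if (star < 0 && quest < 0) return "TermQuery";
        if (quest < 0 && star == term.length() - 1) return "PrefixQuery";
        return "WildcardQuery";
    }

    public static void main(String[] args) {
        System.out.println(classify("Oxalate"));   // TermQuery
        System.out.println(classify("Oxalate*"));  // PrefixQuery
        System.out.println(classify("H?0"));       // WildcardQuery
    }
}
```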


is_maximum wrote:
> 
> Yes karl, when I explore the index by Luke I can see the terms
> for example I have a field namely, patientResult, it contains value "Ca.
> Oxalate:many" and also other values such as "Ca. Oxalate:few" etc.
> 
> the problems are when I put this query: patientResult:(Ca. Oxalate:few)
> the result is
> 84329 Ca. Oxalate:few
> 112519 Ca. Oxalate:many
> 139141 Ca. Oxalate:many
> 394321 Ca. Oxalate:few
> 397671 Ca. Oxalate:nod
> 387549 Ca. Oxalate: mod
> 
> however this is not the required result but another problem is when I put
> patientResult:Oxalate or patientResult:Oxalate* no result will return!!!
> 
> let me tell you that I have extended MultiFieldQueryParser to override its
> methods, and in the getFieldQuery(...) method I return a TermQuery.
> 
> I don't know what I did wrong?
> 
> 
> 
> 
> On 8/15/07, karl wettin <[EMAIL PROTECTED]> wrote:
>>
>>
>> 15 aug 2007 kl. 07.18 skrev Mohammad Norouzi:
>>
>> > I am using WhitespaceAnalyzer and the query is " icdCode:H* " but
>> > there is
>> > no result however I know that there are many documents with this
>> > field value
>> > such as H20, H20.5 etc. this field is tokenized and indexed
>> > what is
>> > wrong with this?
>> > when I test this query with Luke it will return no result as well.
>>
>> Can you also use Luke to inspect documents you know should contain these
>> terms and make sure it really is in there?
>>
>> --
>> karl
>>
>> -
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]
>>
>>
> 
> 
> -- 
> Regards,
> Mohammad
> --
> see my blog: http://brainable.blogspot.com/
> another in Persian: http://fekre-motefavet.blogspot.com/
> 
> 

-- 
View this message in context: 
http://www.nabble.com/query-question-tf4271198.html#a12185271
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: out of order

2007-08-16 Thread Michael McCandless

OK.  Is it possible to capture this as a small test case?

Maybe also call IndexWriter.setInfoStream(System.out) and capture details on
what segments are being merged?

Can you shed some light on how the application is using Lucene?  Are you doing
deletes as well as adds?  Opening readers against this RAMDirectory?  Closing/
opening writers at different times?  Any changes to the default parameters
(mergeFactor, maxBufferedDocs, etc.)?

Mike

"testn" <[EMAIL PROTECTED]> wrote:
> 
> Here you go
> 
>  -> Error during the indexing : docs out of order (0 <= 0 )
> org.apache.lucene.index.CorruptIndexException: docs out of order (0 <= 0
> )
> at
> org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:368)
> at
> org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:325)
> at
> org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:297)
> at
> org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:261)
> at
> org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:98)
> at
> org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:1883)
> at
> org.apache.lucene.index.IndexWriter.flushRamSegments(IndexWriter.java:1741)
> at
> org.apache.lucene.index.IndexWriter.flushRamSegments(IndexWriter.java:1733)
> at
> org.apache.lucene.index.IndexWriter.maybeFlushRamSegments(IndexWriter.java:1727)
> at
> org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1004)
> at
> org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:983)
> at
> org.springmodules.lucene.index.factory.SimpleLuceneIndexWriter.addDocument(SimpleLuceneIndexWriter.java:44)
> )
> at
> org.springmodules.lucene.index.object.database.DefaultDatabaseIndexer.doHandleRequest(DefaultDatabaseIndexer.java:306)
> at
> org.springmodules.lucene.index.object.database.DefaultDatabaseIndexer.index(DefaultDatabaseIndexer.java:354)
> 
> 
> Michael McCandless-2 wrote:
> > 
> > 
> > Well then that is particularly spooky!!
> > 
> > And, hopefully, possible/easy to reproduce.  Thanks.
> > 
> > Mike
> > 
> > "testn" <[EMAIL PROTECTED]> wrote:
> >> 
> >> I use RAMDirectory and the error often shows a low number. Last time it
> >> happened with message "7<=7". Next time it happens, I will try to capture
> >> the stacktrace.
> >> 
> >> 
> >> 
> >> Michael McCandless-2 wrote:
> >> > 
> >> > 
> >> > "testn" <[EMAIL PROTECTED]> wrote:
> >> >> 
> >> >> Using Lucene 2.2.0, I still sporadically got doc out of order error. I
> >> >> indexed all of my stuff in one thread. Do you have any idea why it
> >> >> happens?
> >> > 
> >> > Hm, that is not good.  I thought we had finally fixed this with
> >> > LUCENE-140.  Though un-corrected disk errors could in theory lead to
> >> > this too.
> >> > 
> >> > Are you able to easily reproduce it?  Can you post the full exception?
> >> > 
> >> > Mike
> >> > 
> >> > -
> >> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> >> > For additional commands, e-mail: [EMAIL PROTECTED]
> >> > 
> >> > 
> >> > 
> >> 
> >> -- 
> >> View this message in context:
> >> http://www.nabble.com/out-of-order-tf4276385.html#a12173705
> >> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> >> 
> >> 
> >> -
> >> To unsubscribe, e-mail: [EMAIL PROTECTED]
> >> For additional commands, e-mail: [EMAIL PROTECTED]
> >> 
> > 
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> > 
> > 
> > 
> 
> -- 
> View this message in context:
> http://www.nabble.com/out-of-order-tf4276385.html#a12184067
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> 
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: out of order

2007-08-16 Thread testn

Does it help you to know that I create an empty index before starting the real
operation?

IndexWriter writer = new IndexWriter(directory, new
SimpleAnalyzer(), true);
writer.close();
/* add new index afterward */

This is to clean up the index, since springmodules doesn't support a way to
clear the existing index.



Michael McCandless-2 wrote:
> 
> 
> OK.  Is it possible to capture this as small test case?
> 
> Maybe also call IndexWriter.setInfoStream(System.out) and capture details
> on
> what segments are being merged?
> 
> Can you shed some light on how the application is using Lucene?  Are you
> doing
> deletes as well as adds?  Opening readers against this RAMDirectory? 
> Closing/
> opening writers at different times?  Any changes to the default parameters
> (mergeFactor, maxBufferedDocs, etc.)?
> 
> Mike
> 
> "testn" <[EMAIL PROTECTED]> wrote:
>> 
>> Here you go
>> 
>>  -> Error during the indexing : docs out of order (0 <= 0 )
>> org.apache.lucene.index.CorruptIndexException: docs out of order (0 <= 0
>> )
>> at
>> org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:368)
>> at
>> org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:325)
>> at
>> org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:297)
>> at
>> org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:261)
>> at
>> org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:98)
>> at
>> org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:1883)
>> at
>> org.apache.lucene.index.IndexWriter.flushRamSegments(IndexWriter.java:1741)
>> at
>> org.apache.lucene.index.IndexWriter.flushRamSegments(IndexWriter.java:1733)
>> at
>> org.apache.lucene.index.IndexWriter.maybeFlushRamSegments(IndexWriter.java:1727)
>> at
>> org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1004)
>> at
>> org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:983)
>> at
>> org.springmodules.lucene.index.factory.SimpleLuceneIndexWriter.addDocument(SimpleLuceneIndexWriter.java:44)
>> )
>> at
>> org.springmodules.lucene.index.object.database.DefaultDatabaseIndexer.doHandleRequest(DefaultDatabaseIndexer.java:306)
>> at
>> org.springmodules.lucene.index.object.database.DefaultDatabaseIndexer.index(DefaultDatabaseIndexer.java:354)
>> 
>> 
>> Michael McCandless-2 wrote:
>> > 
>> > 
>> > Well then that is particularly spooky!!
>> > 
>> > And, hopefully, possible/easy to reproduce.  Thanks.
>> > 
>> > Mike
>> > 
>> > "testn" <[EMAIL PROTECTED]> wrote:
>> >> 
>> >> I use RAMDirectory and the error often shows the low number. Last time
>> it
>> >> happened with message "7<=7". Next time it happens, I will try to
>> capture
>> >> the stacktrace.
>> >> 
>> >> 
>> >> 
>> >> Michael McCandless-2 wrote:
>> >> > 
>> >> > 
>> >> > "testn" <[EMAIL PROTECTED]> wrote:
>> >> >> 
>> >> >> Using Lucene 2.2.0, I still sporadically got doc out of order
>> error. I
>> >> >> indexed all of my stuff in one thread. Do you have any idea why it
>> >> >> happens?
>> >> > 
>> >> > Hm, that is not good.  I thought we had finally fixed this with
>> >> > LUCENE-140.  Though un-corrected disk errors could in theory lead to
>> >> > this too.
>> >> > 
>> >> > Are you able to easily reproduce it?  Can you post the full
>> exception?
>> >> > 
>> >> > Mike
>> >> > 
>> >> >
>> -
>> >> > To unsubscribe, e-mail: [EMAIL PROTECTED]
>> >> > For additional commands, e-mail: [EMAIL PROTECTED]
>> >> > 
>> >> > 
>> >> > 
>> >> 
>> >> -- 
>> >> View this message in context:
>> >> http://www.nabble.com/out-of-order-tf4276385.html#a12173705
>> >> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>> >> 
>> >> 
>> >> -
>> >> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> >> For additional commands, e-mail: [EMAIL PROTECTED]
>> >> 
>> > 
>> > -
>> > To unsubscribe, e-mail: [EMAIL PROTECTED]
>> > For additional commands, e-mail: [EMAIL PROTECTED]
>> > 
>> > 
>> > 
>> 
>> -- 
>> View this message in context:
>> http://www.nabble.com/out-of-order-tf4276385.html#a12184067
>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>> 
>> 
>> -
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]
>> 
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/out-of-order-tf4276385.html#a12186579
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

Document Similarities lucene(particularly using doc id's)

2007-08-16 Thread Lokeya

Hi All,

I have the following set up: a) Indexed a set of docs. b) Ran the 1st query and
got top docs. c) Fetched the ids from that and stored them in a data structure.
d) Ran the 2nd query, got top docs, fetched ids and stored them in a data
structure.

Now I have 2 sets of doc ids, (set 1) and (set 2).

I want to find out the document content similarity between these 2 sets (just
using the doc id information which I have).

Qn 1: Is it possible using any Lucene APIs? In that case, can you point me
to the appropriate APIs? I did a search at
http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/index.html
but couldn't find anything.

Qn 2: If this can't be done, then what's the best way to do this using already
available Lucene APIs?

Thanks in Advance.
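
As far as I know there is no single Lucene API call for this. One common approach is to build a term-frequency vector for each doc id (e.g. via IndexReader.getTermFreqVector(docId, field), provided term vectors were stored at index time) and compare vectors with cosine similarity. A self-contained sketch of the comparison step, with plain maps standing in for the term vectors:

```java
import java.util.*;

public class CosineSimilarity {
    // Cosine similarity between two term -> frequency maps, e.g. built
    // from IndexReader.getTermFreqVector(docId, field) for each doc id.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += (double) e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) dot += (double) e.getValue() * other;
        }
        for (int v : b.values()) normB += (double) v * v;
        if (normA == 0 || normB == 0) return 0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        Map<String, Integer> d1 = new HashMap<String, Integer>();
        d1.put("oxalate", 2); d1.put("few", 1);
        Map<String, Integer> d2 = new HashMap<String, Integer>();
        d2.put("oxalate", 2); d2.put("many", 1);
        System.out.println(cosine(d1, d2)); // shared "oxalate" term drives the score
    }
}
```

To compare the two sets you could average the pairwise similarities, or compare centroid vectors; that choice depends on what "set similarity" means for your application.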
-- 
View this message in context: 
http://www.nabble.com/Document-Similarities-lucene%28particularly-using-doc-id%27s%29-tf4281286.html#a12186723
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



tell snowballfilter not to stem certain words?

2007-08-16 Thread Donna L Gresh
Apologies if this is in the FAQ or elsewhere available but I could not 
find this.

Can I provide a list of words that should *not* be stemmed by the
SnowballFilter? My analyzer looks like this:

analyzer = new StandardAnalyzer(stopwords) {
    public TokenStream tokenStream(String fieldName, java.io.Reader reader) {
        return new SnowballFilter(super.tokenStream(fieldName, reader),
                                  "English");
    }
};

It is removing the trailing "S" from some words that I don't want this to
happen to--

Donna



Re: tell snowballfilter not to stem certain words?

2007-08-16 Thread Erick Erickson
Not that I know of. I suspect you'll have to write a filter that returns
the stemmed or unstemmed token based on membership in your list
of words not to stem.

Best
Erick

On 8/16/07, Donna L Gresh <[EMAIL PROTECTED]> wrote:
>
> Apologies if this is in the FAQ or elsewhere available but I could not
> find this.
>
> Can I provide a list of words that should *not* be stemmed by the
> SnowballFilter? My analyzer looks like this:
>
> analyzer = new StandardAnalyzer(stopwords) {
> public TokenStream tokenStream(String fieldName, java.io.Reader
> reader) {
>   return new SnowballFilter(super.tokenStream(fieldName,reader),
> "English");
> }
> };
>
> It is removing the trailing "S" from some words which I don't want to have
> this happen for--
>
> Donna
>
>
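
Erick's suggestion can be sketched outside Lucene: pass a token through unchanged when it is in an exclusion set, otherwise stem it. The trailing-"s" stripper below is a hypothetical stand-in for the real SnowballFilter; in an actual TokenFilter the same membership check would wrap the call to the Snowball stemmer:

```java
import java.util.*;

public class StemExclusion {
    // Stand-in stemmer: strip a trailing "s". A real implementation
    // would delegate to SnowballFilter / the Snowball stemmer instead.
    static String stem(String word) {
        return word.endsWith("s") ? word.substring(0, word.length() - 1) : word;
    }

    // The core of the exclusion filter: stem only tokens that are
    // not in the do-not-stem set.
    static String filter(String token, Set<String> noStem) {
        return noStem.contains(token) ? token : stem(token);
    }

    public static void main(String[] args) {
        Set<String> noStem = new HashSet<String>(Arrays.asList("sales"));
        System.out.println(filter("sales", noStem));   // prints "sales" (kept as-is)
        System.out.println(filter("reports", noStem)); // prints "report" (stemmed)
    }
}
```

Remember the same exclusion logic must run at both index and query time, or the excluded words won't match themselves.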


Re: out of order

2007-08-16 Thread Michael McCandless

Hmmm.  It is interesting, because that specific call (using
IndexWriter to "create" an index) was one of the causes in
LUCENE-140.  But I'm pretty sure we fixed that cause.  As part of
LUCENE-140 we also added further checks to catch re-using of an old
.del file at a lower level, and you're not hitting those.

After you close that IndexWriter, can you list the files in your
directory (that's a RAMDirectory right?)?  Something like this:

import java.util.Arrays;

String[] l = directory.list();
Arrays.sort(l);
for(int i=0;i<l.length;i++) System.out.println(l[i]);

"testn" <[EMAIL PROTECTED]> wrote:
> 
> Does it help you to find out if I create an empty index before start the
> real
> operation?
> 
> IndexWriter writer = new IndexWriter(directory, new
> SimpleAnalyzer(), true);
> writer.close();
> /* add new index afterward */
> 
> This is to clean up the index since springmodules doesn't support the way
> to
> clear up the existing index.
> 
> 
> 
> Michael McCandless-2 wrote:
> > 
> > 
> > OK.  Is it possible to capture this as small test case?
> > 
> > Maybe also call IndexWriter.setInfoStream(System.out) and capture details
> > on
> > what segments are being merged?
> > 
> > Can you shed some light on how the application is using Lucene?  Are you
> > doing
> > deletes as well as adds?  Opening readers against this RAMDirectory? 
> > Closing/
> > opening writers at different times?  Any changes to the default parameters
> > (mergeFactor, maxBufferedDocs, etc.)?
> > 
> > Mike
> > 
> > "testn" <[EMAIL PROTECTED]> wrote:
> >> 
> >> Here you go
> >> 
> >>  -> Error during the indexing : docs out of order (0 <= 0 )
> >> org.apache.lucene.index.CorruptIndexException: docs out of order (0 <= 0
> >> )
> >> at
> >> org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:368)
> >> at
> >> org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:325)
> >> at
> >> org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:297)
> >> at
> >> org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:261)
> >> at
> >> org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:98)
> >> at
> >> org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:1883)
> >> at
> >> org.apache.lucene.index.IndexWriter.flushRamSegments(IndexWriter.java:1741)
> >> at
> >> org.apache.lucene.index.IndexWriter.flushRamSegments(IndexWriter.java:1733)
> >> at
> >> org.apache.lucene.index.IndexWriter.maybeFlushRamSegments(IndexWriter.java:1727)
> >> at
> >> org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1004)
> >> at
> >> org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:983)
> >> at
> >> org.springmodules.lucene.index.factory.SimpleLuceneIndexWriter.addDocument(SimpleLuceneIndexWriter.java:44)
> >> )
> >> at
> >> org.springmodules.lucene.index.object.database.DefaultDatabaseIndexer.doHandleRequest(DefaultDatabaseIndexer.java:306)
> >> at
> >> org.springmodules.lucene.index.object.database.DefaultDatabaseIndexer.index(DefaultDatabaseIndexer.java:354)
> >> 
> >> 
> >> Michael McCandless-2 wrote:
> >> > 
> >> > 
> >> > Well then that is particularly spooky!!
> >> > 
> >> > And, hopefully, possible/easy to reproduce.  Thanks.
> >> > 
> >> > Mike
> >> > 
> >> > "testn" <[EMAIL PROTECTED]> wrote:
> >> >> 
> >> >> I use RAMDirectory and the error often shows the low number. Last time
> >> it
> >> >> happened with message "7<=7". Next time it happens, I will try to
> >> capture
> >> >> the stacktrace.
> >> >> 
> >> >> 
> >> >> 
> >> >> Michael McCandless-2 wrote:
> >> >> > 
> >> >> > 
> >> >> > "testn" <[EMAIL PROTECTED]> wrote:
> >> >> >> 
> >> >> >> Using Lucene 2.2.0, I still sporadically got doc out of order
> >> error. I
> >> >> >> indexed all of my stuff in one thread. Do you have any idea why it
> >> >> >> happens?
> >> >> > 
> >> >> > Hm, that is not good.  I thought we had finally fixed this with
> >> >> > LUCENE-140.  Though un-corrected disk errors could in theory lead to
> >> >> > this too.
> >> >> > 
> >> >> > Are you able to easily reproduce it?  Can you post the full
> >> exception?
> >> >> > 
> >> >> > Mike
> >> >> > 
> >> >> >
> >> -
> >> >> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> >> >> > For additional commands, e-mail: [EMAIL PROTECTED]
> >> >> > 
> >> >> > 
> >> >> > 
> >> >> 
> >> >> -- 
> >> >> View this message in context:
> >> >> http://www.nabble.com/out-of-order-tf4276385.html#a12173705
> >> >> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> >> >> 
> >> >> 
> >> >> -
> >> >> To unsubscribe, e-mail: [EMAIL PROTECTED]
> >> >> For additional commands, e-mail: [EMAIL PROTECTED]
> >> >> 
> >> > 
> >> > --

Re: out of order

2007-08-16 Thread Chris Hostetter
: After you close that IndexWriter, can you list the files in your
: directory (that's a RAMDirectory right?)?  Something like this:

The OP said this was a fairly small RAMDirectory index, right?  Would it be
worthwhile to just write the whole thing to disk and post it online so
people could see every byte of every file?

(I'm all thumbs when it comes to index internals and the file formats, but
I'm just tossing it out there as an idea)



-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: tell snowballfilter not to stem certain words?

2007-08-16 Thread karl wettin


16 aug 2007 kl. 20.34 skrev Donna L Gresh:


Apologies if this is in the FAQ or elsewhere available but I could not
find this.

Can I provide a list of words that should *not* be stemmed by the
SnowballFilter?


If it is a static list, simply add it as an exception in the snowball  
code and recompile to Java.



--
karl

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: out of order

2007-08-16 Thread testn

There are two files:
1. segments_2 [-1, -1, -3, 0, 0, 1, 20, 112, 39, 17, -80, 0, 0, 0, 0, 0, 0,
0, 0] 
2. segments.gen [-1, -1, -1, -2, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0,
0, 2]

but this is when the index is built properly.


hossman wrote:
> 
> : After you close that IndexWriter, can you list the files in your
> : directory (that's a RAMDirectory right?)?  Something like this:
> 
> The OP said this was a fairly small RAMDirectory index right?  would it be
> worth while to just write the whole thing to disk and post it online so
> people could see every byte of every file?
> 
> (i'm all thumbs when it comes to index internals and the file formats, but
> i'm just tossing it out there as an idea)
> 
> 
> 
> -Hoss
> 
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/out-of-order-tf4276385.html#a12187972
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: out of order

2007-08-16 Thread Michael McCandless

OK, that's clean (no leftover files).  So this cause does not seem to
be the same cause as LUCENE-140.

Can you capture the exact docs you are adding (all indexed fields) and
try to replay them to see if the same exception is reproducible?

Have you seen this happen on a different machine?  (Just in case, I
admit rather remote and hopeful on my part ;), that you have bad RAM
in your machine).

Mike

"testn" <[EMAIL PROTECTED]> wrote:
> 
> There are two files:
> 1. segments_2 [-1, -1, -3, 0, 0, 1, 20, 112, 39, 17, -80, 0, 0, 0, 0, 0,
> 0,
> 0, 0] 
> 2. segments.gen [-1, -1, -1, -2, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0,
> 0,
> 0, 2]
> 
> but this one when the index is done done properly.
> 
> 
> hossman wrote:
> > 
> > : After you close that IndexWriter, can you list the files in your
> > : directory (that's a RAMDirectory right?)?  Something like this:
> > 
> > The OP said this was a fairly small RAMDirectory index, right?  Would it be
> > worthwhile to just write the whole thing to disk and post it online so
> > people could see every byte of every file?
> > 
> > (i'm all thumbs when it comes to index internals and the file formats, but
> > i'm just tossing it out there as an idea)
> > 
> > 
> > 
> > -Hoss
> > 
> > 
> > -
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
> > 
> > 
> > 
> 
> -- 
> View this message in context:
> http://www.nabble.com/out-of-order-tf4276385.html#a12187972
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> 
> 
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Location of SpanRegexQuery

2007-08-16 Thread dontspamterry

Hi,

While researching support for wildcards in a PhraseQuery, I see various
references to SpanRegexQuery which is not part of the 2.2 distribution. I
checked the Lucene site to see if it's some add-on jar, but couldn't find
anything so I'm wondering where can I obtain the .class/jar file(s) for this
class?

Thanks,
-Terry
-- 
View this message in context: 
http://www.nabble.com/Location-of-SpanRegexQuery-tf4281915.html#a12188851
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Location of SpanRegexQuery

2007-08-16 Thread Erick Erickson
It should already be on your disk with the distribution. Try
/contrib/regex.

Lots of things are rooted in contrib, and I've never had to
find any other jars from the Lucene site, they've all
been in contrib

Hope this helps
Erick

On 8/16/07, dontspamterry <[EMAIL PROTECTED]> wrote:
>
>
> Hi,
>
> While researching support for wildcards in a PhraseQuery, I see various
> references to SpanRegexQuery which is not part of the 2.2 distribution. I
> checked the Lucene site to see if it's some add-on jar, but couldn't find
> anything so I'm wondering where can I obtain the .class/jar file(s) for
> this
> class?
>
> Thanks,
> -Terry
> --
> View this message in context:
> http://www.nabble.com/Location-of-SpanRegexQuery-tf4281915.html#a12188851
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


Re: Location of SpanRegexQuery

2007-08-16 Thread dontspamterry

And so it is! My bad - guess I should have paid more attention to the README
file which clearly explains the contents  :P

-Terry


Erick Erickson wrote:
> 
> It should already be on your disk with the distribution. Try
> /contrib/regex.
> 
> Lots of things are rooted in contrib, and I've never had to
> find any other jars from the Lucene site, they've all
> been in contrib
> 
> Hope this helps
> Erick
> 
> On 8/16/07, dontspamterry <[EMAIL PROTECTED]> wrote:
>>
>>
>> Hi,
>>
>> While researching support for wildcards in a PhraseQuery, I see various
>> references to SpanRegexQuery which is not part of the 2.2 distribution. I
>> checked the Lucene site to see if it's some add-on jar, but couldn't find
>> anything so I'm wondering where can I obtain the .class/jar file(s) for
>> this
>> class?
>>
>> Thanks,
>> -Terry
>> --
>> View this message in context:
>> http://www.nabble.com/Location-of-SpanRegexQuery-tf4281915.html#a12188851
>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>
>>
>> -
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]
>>
>>
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Location-of-SpanRegexQuery-tf4281915.html#a12189507
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: getting term offset information for fields with multiple value entiries

2007-08-16 Thread Grant Ingersoll

Hi Christian,

Is there any way you can post a complete, self-contained example,
preferably as a JUnit test?  I think it would be useful to know more
about how you are indexing (i.e. what Analyzer, etc.).
The offsets should be taken from whatever is set on the Token
during Analysis.  I, too, am trying to remember where in the code
this is taking place.


Also, what version of Lucene are you using?

-Grant

On Aug 16, 2007, at 5:50 AM, [EMAIL PROTECTED] wrote:



Hello,

I have an index with an 'actor' field; for each actor there exists a
single field value entry, e.g.


stored/compressed,indexed,tokenized,termVector,termVectorOffsets,termVectorPosition


movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo)
movie_actors:Miguel Bosé
movie_actors:Anna Lizaran (as Ana Lizaran)
movie_actors:Raquel Sanchís
movie_actors:Angelina Llongueras

I try to get the term offset, e.g. for 'angelina' with

termPositionVector = (TermPositionVector) reader.getTermFreqVector 
(docNumber, "movie_actors");

int iTermIndex = termPositionVector.indexOf("angelina");
TermVectorOffsetInfo[] termOffsets = termPositionVector.getOffsets 
(iTermIndex);



I get one TermVectorOffsetInfo for the field - with offset numbers that
are bigger than one single field entry.
I guessed that Lucene gives the offset numbers as if all values were
concatenated, which would be the single (virtual) string:

movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo)Miguel BoséAnna Lizaran (as Ana Lizaran)Raquel SanchísAngelina Llongueras


This fits in nearly no situation, so my second guess was that Lucene adds
some virtual delimiters between the single field entries for offset
calculation. I added a delimiter, so the result would be:


movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo) Miguel Bosé Anna Lizaran (as Ana Lizaran) Raquel Sanchís Angelina Llongueras

(note the ' ' between each actor name)

...this also doesn't fit in every situation - there are too many
delimiters now. So I further guessed that Lucene doesn't add a delimiter
in every situation, and added one only when the last character of an
entry was not alphanumeric, with:

StringBuilder strbAttContent = new StringBuilder();
for (String strAttValue : m_luceneDocument.getValues(strFieldName))
{
    strbAttContent.append(strAttValue);
    if (strbAttContent.substring(strbAttContent.length() - 1).matches("\\w"))
        strbAttContent.append(' ');
}

where I get the result (virtual) entry:
movie_actors:Mayrata O'Wisiedo (as Mairata O'Wisiedo)Miguel BoséAnna Lizaran (as Ana Lizaran)Raquel Sanchís Angelina Llongueras


This fits in ~96% of all my queries, but still it's not 100% the way
Lucene calculates the offset value for fields with multiple value entries.


...maybe the problem is that there are special characters inside my
database (e.g. the 'é' in 'Bosé') that my '\w' doesn't match.
I have looked at this specific situation, but considering this one
character doesn't solve the problem.



How does Lucene calculate these offsets? I also searched inside the
source code, but couldn't find the correct place.
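One plausible model - an assumption to test against the numbers above, not a statement about Lucene's internals - is that the running offset continues from the end offset of the last *token* of the previous value rather than from the end of the value string, which would explain why trailing non-word characters such as the ')' in "(as Mairata O'Wisiedo)" break the string-length arithmetic. A self-contained sketch of that bookkeeping, with a trivial whitespace tokenizer standing in for the real analyzer:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MultiValueOffsets {

    static class Offset {
        final int start, end;
        Offset(int start, int end) { this.start = start; this.end = end; }
    }

    // Compute (start, end) offsets for every token across all values of a
    // multi-valued field, continuing the count from the end offset of the
    // previous value's LAST TOKEN (the assumed model, not verified against
    // the Lucene source).
    static List<Offset> offsets(String[] values) {
        List<Offset> result = new ArrayList<Offset>();
        Pattern word = Pattern.compile("\\S+"); // stand-in for the analyzer
        int base = 0;
        for (String value : values) {
            Matcher m = word.matcher(value);
            int lastEnd = 0;
            while (m.find()) {
                result.add(new Offset(base + m.start(), base + m.end()));
                lastEnd = m.end(); // trailing non-token chars are not counted
            }
            base += lastEnd;
        }
        return result;
    }
}
```

If this model matches the TermVectorOffsetInfo numbers for the actor entries above, the ~4% of failing cases should be exactly the values whose last token ends before the end of the string.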



Thanks in advance!

Christian Reuschling





- --
__ 


Christian Reuschling, Dipl.-Ing.(BA)
Software Engineer

Knowledge Management Department
German Research Center for Artificial Intelligence DFKI GmbH
Trippstadter Straße 122, D-67663 Kaiserslautern, Germany

Phone: +49.631.20575-125
mailto:[EMAIL PROTECTED]  http://www.dfki.uni-kl.de/~reuschling/

- Legal Company Information Required by German  
Law--
Geschäftsführung: Prof. Dr. Dr. h.c. mult. Wolfgang Wahlster  
(Vorsitzender)

  Dr. Walter Olthoff
Vorsitzender des Aufsichtsrats: Prof. Dr. h.c. Hans A. Aukes
Amtsgericht Kaiserslautern, HRB 2313
__ 



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



--
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [Fwd: Exception in MultiLevelSkipListReader$SkipBuffer.readByte]

2007-08-16 Thread Scott Montgomerie
I just tried it with the latest nightly build, the problem still happens.

I think it must have to do with a corrupted index somehow.   I've also
noticed, as a separate issue, that after this period of time (4-5 days),
certain documents aren't indexed correctly.  For example, I will do a query:

Query for Field 1 with value A returns a list of documents.  In these
documents is a document with Field 2 with value B.
Query for Field 2 with value B returns 0 documents.

Therefore the index on Field 2 is somehow missing certain documents.  Is
this possible?

Yonik Seeley wrote:
> I wonder if this is related to
> https://issues.apache.org/jira/browse/LUCENE-951
>
> If it's easy enough for you to reproduce, could you try the trunk
> version of Lucene and see if it's fixed?
>
> -Yonik
>
> On 8/16/07, Scott Montgomerie <[EMAIL PROTECTED]> wrote:
>   
>> I'm getting an ArrayIndexOutOfBoundsException in
>> MultiLevelSkipListReader$SkipBuffer. This happens sporadically, on a
>> fairly small index (18 MB, about 30,000 documents). The index is
>> subject to a lot of adds and deletes, some of them concurrently. It
>> happens after about 4 days of heavy usage. I was able to isolate a copy
>> of the index that causes the exception, and I can reproduce the
>> exception cleanly in a Junit test.
>> I can see that readByte(), where the error is occurring, has no bounds
>> checking, therefore I assume that the data in there must be correct?
>> Hence, the index has obviously become corrupted. Further, optimizing
>> the index fixes the problem.
>>
>> The problem is reproducible in working system. As I said, around 4-5
>> days after optimization, the same error occurs sporadically.
>> Any ideas?
>>
>> Oh and this is Lucene 2.2.0, jdk 1.5.0_12.
>>
>> The code from the junit test that calls this is pretty simple:
>>
>> Query profileQuery = new TermQuery(new
>> Term(IndexFields.bookmark_profile_id, "1"));
>> Hits h = searcher.search(profileQuery, filterPrivate());
>>
>> search is a plain old IndexSearcher, and filterPrivate() returns a
>> QueryFilter based on a 2-term BooleanQuery.
>>
>>
>> Full stack trace:
>>
>> Exception in thread "MultiSearcher thread #2"
>> java.lang.ArrayIndexOutOfBoundsException: 14
>> at
>> org.apache.lucene.index.MultiLevelSkipListReader$SkipBuffer.readByte(MultiLevelSkipListReader.java:258)
>> at org.apache.lucene.store.IndexInput.readVInt(IndexInput.java:57)
>> at
>> org.apache.lucene.index.DefaultSkipListReader.readSkipData(DefaultSkipListReader.java:110)
>> at
>> org.apache.lucene.index.MultiLevelSkipListReader.loadNextSkip(MultiLevelSkipListReader.java:140)
>> at
>> org.apache.lucene.index.MultiLevelSkipListReader.skipTo(MultiLevelSkipListReader.java:110)
>> at
>> org.apache.lucene.index.SegmentTermDocs.skipTo(SegmentTermDocs.java:164)
>> at org.apache.lucene.index.MultiTermDocs.skipTo(MultiReader.java:413)
>> at org.apache.lucene.search.TermScorer.skipTo(TermScorer.java:145)
>> at
>> org.apache.lucene.util.ScorerDocQueue.topSkipToAndAdjustElsePop(ScorerDocQueue.java:120)
>> at
>> org.apache.lucene.search.DisjunctionSumScorer.skipTo(DisjunctionSumScorer.java:229)
>> at
>> org.apache.lucene.search.BooleanScorer2.skipTo(BooleanScorer2.java:381)
>> at
>> org.apache.lucene.search.ConjunctionScorer.doNext(ConjunctionScorer.java:63)
>> at
>> org.apache.lucene.search.ConjunctionScorer.next(ConjunctionScorer.java:58)
>> at
>> org.apache.lucene.search.BooleanScorer2.score(BooleanScorer2.java:327)
>> at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:146)
>> at org.apache.lucene.search.Searcher.search(Searcher.java:118)
>> at org.apache.lucene.search.Searcher.search(Searcher.java:97)
>> at
>> org.apache.lucene.search.QueryWrapperFilter.bits(QueryWrapperFilter.java:50)
>> at
>> org.apache.lucene.search.CachingWrapperFilter.bits(CachingWrapperFilter.java:58)
>> at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:133)
>> at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:113)
>> at
>> org.apache.lucene.search.MultiSearcherThread.run(ParallelMultiSearcher.java:250)
>>
>> java.lang.NullPointerException
>> at
>> org.apache.lucene.search.MultiSearcherThread.hits(ParallelMultiSearcher.java:280)
>> at
>> org.apache.lucene.search.ParallelMultiSearcher.search(ParallelMultiSearcher.java:83)
>> at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:74)
>> at org.apache.lucene.search.Hits.<init>(Hits.java:53)
>> at org.apache.lucene.search.Searcher.search(Searcher.java:46)
>>
>> Thanks.
>>
>>
>> -
>> To unsubscribe, e-mail: [EMAIL PROTECTED]
>> For additional commands, e-mail: [EMAIL PROTECTED]
>>
>>
>> 
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>
>   

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Nested concept fields

2007-08-16 Thread Chris Hostetter

:   sent:(expired num[1 TO 5] "days ago")
:
: I don't see how to do this using either Lucene's QueryParser or the
: QsolParser. Is it possible to do it using the Query API (and the appropriate
: indexing changes)?

take a look at Span queries, particularly SpanNearQuery ... that can do
pretty much everything you describe, assuming creative indexing.  the one
thing that's not quite ready-made would be the range query -- you'd need
to write a kind of SpanRangeQuery that rewrites itself into a SpanOrQuery
containing all of the numbers in the range.
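The rewrite Hoss describes can be sketched without any Lucene classes: enumerate every term that falls in the range, then (in real code) wrap each one in a SpanTermQuery and feed the whole list to a SpanOrQuery. Assuming the numbers were indexed as plain decimal strings, the enumeration step is just:

```java
import java.util.ArrayList;
import java.util.List;

public class SpanRangeTerms {

    // Enumerate every term a hypothetical SpanRangeQuery over [low, high]
    // would expand to.  In actual Lucene code each string would become a
    // SpanTermQuery and the list would go into a SpanOrQuery; that wrapping
    // is omitted here to keep the sketch self-contained.
    static List<String> expand(int low, int high) {
        List<String> terms = new ArrayList<String>();
        for (int i = low; i <= high; i++) {
            terms.add(Integer.toString(i));
        }
        return terms;
    }
}
```

For the `num[1 TO 5]` example this yields five terms, so the expansion stays cheap; for wide ranges the same approach would generate one clause per value, which is the usual caveat with enumerate-and-OR rewrites.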



-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: [Fwd: Exception in MultiLevelSkipListReader$SkipBuffer.readByte]

2007-08-16 Thread Michael Busch
Scott Montgomerie wrote:
> I just tried it with the latest nightly build, the problem still happens.
> 
> I think it must have to do with a corrupted index somehow.   I've also
> noticed, as a separate issue, that after this period of time (4-5 days),
> certain documents aren't indexed correctly.  For example, I will do a query:
> 
> Query for Field 1 with value A returns a list of documents.  In these
> documents is a document with Field 2 with value B.
> Query for Field 2 with value B returns 0 documents.
> 
> Therefore the index on Field 2 is somehow missing certain documents.  Is
> this possible?
> 

Hi Scott,

hmm that doesn't look good. I haven't seen this problem before. Could
you send the index (preferably compressed as .zip) to me (not the
mailing list), please? And your JUnit test that hits this exception, so
that I can debug it?

Thanks for your help!

- Michael

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: Question on custom scoring

2007-08-16 Thread Chris Hostetter
: document in the scoring formula, and I thought the CustomScoreQuery would be
: useful, but I am realizing that it may not be easy because the "relevance"
: score from Lucene has no absolute meaning. The relevance score could be 5 or
: 500 and there is no way for me to gauge what that number means and how much I
: should weigh the "popularity" value relative to it when computing the custom

it's definitely tricky, and not something that can be decided universally
regardless of your data (or query structure) -- like all "fuzzy" logic you
have to try lots of different use cases and find something that works
well.  And yes, you have to re-evaluate your choices over time as your
index changes.



-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]