Re: Count for a keyword occurrence in a file
On Thursday 29 April 2004 08:14, Nader S. Henein wrote:
> Tricky, scoring has to do with the frequency of the occurrence of the word
> as opposed to the amount of words in the file in general (somebody correct
> me if I'm wrong), so short of an educated approximation, you could hack

Lucene uses two frequencies for a term: the number of documents in the index in which it occurs (the basis for IDF), and the number of times the term occurs within a document.

> the indexer to dynamically store the frequency of a word (oh so
> unadvisable). Personally I recommend the educated approximation, because
> you could index the document with the number of words in it (you would
> have to make sure you're not using the stop-word analyzer or the Porter stemmer)
> and then, based on the score, reverse engineer the result you want.
>
> Nader Henein
>
> -----Original Message-----
> From: hemal bhatt [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, April 28, 2004 5:50 PM
> To: Lucene Users List
> Subject: Count for a keyword occurrence in a file
>
> Hi,
>
> How can I get a count of the score given by Hits.score()? I.e. I want to
> know how many times a keyword occurs in a file. Any help on this would be
> appreciated.

The easiest way is to use an IndexReader. I don't know what you mean by "file" (index or document), but you can get both frequencies mentioned above from an IndexReader, possibly using skipTo() to move to the document in question. The methods are docFreq(Term) and termDocs(Term).

Regards,
Ype
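A minimal sketch of the IndexReader approach described above, against the Lucene 1.3/1.4-era API; the index path and the field name are assumptions:

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;

    public class TermCounts {
        public static void main(String[] args) throws Exception {
            IndexReader reader = IndexReader.open("/path/to/index"); // hypothetical location
            Term term = new Term("contents", "keyword");              // hypothetical field and term

            // Number of documents in the index containing the term (the IDF basis).
            System.out.println("document frequency: " + reader.docFreq(term));

            // Number of occurrences of the term inside each matching document.
            TermDocs termDocs = reader.termDocs(term);
            while (termDocs.next()) {
                System.out.println("doc " + termDocs.doc() + ": " + termDocs.freq() + " occurrences");
            }
            termDocs.close();
            reader.close();
        }
    }

termDocs.skipTo(n) can jump straight to a known document number instead of iterating over all matching documents.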
Bug in Luke
Hi! Searching does not work correctly with the RussianAnalyzer when it extracts stems: it only finds words whose form coincides with the stem. A wildcard search, for example, gives a different result. Thanks, Vladimir.
RE: Count for a keyword occurrence in a file
Tricky; scoring has to do with the frequency of the occurrence of the word as opposed to the number of words in the file in general (somebody correct me if I'm wrong), so short of an educated approximation, you could hack the indexer to dynamically store the frequency of a word (oh so unadvisable). Personally I recommend the educated approximation: you could index the document with the number of words in it (you would have to make sure you're not using the stop-word analyzer or the Porter stemmer) and then, based on the score, reverse engineer the result you want.

Nader Henein

-----Original Message-----
From: hemal bhatt [mailto:[EMAIL PROTECTED]
Sent: Wednesday, April 28, 2004 5:50 PM
To: Lucene Users List
Subject: Count for a keyword occurrence in a file

Hi,

How can I get a count of the score given by Hits.score()? I.e. I want to know how many times a keyword occurs in a file. Any help on this would be appreciated.

regards
Hemal Bhatt
RE: Documents when the same search is done many times
The short answer is: it's up to you :-) Lucene doesn't know which field is your primary key (you're thinking like a DB programmer). If you add the new document with ID="one" without deleting the old one from the index, then when you search you'll get two documents, "pig" and "mongoose"; but if you delete all documents with ID="one" and then index your new document, you'll only get "mongoose".

From a DBA perspective, Lucene is like a table with a unique ID on each document (that being the Lucene-assigned doc ID, which changes every time you optimize but nevertheless remains unique), and all other columns, whether indexed, tokenized, stored or not, can bear repetition. So if you want to implement a unique key like ID on your Lucene index, you'll have to do a little delete based on that ID field every time you insert a new document into the index. It's quite simple and I've been doing it for a few years now without fail.

Hope this helps

Nader Henein
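A sketch of the delete-then-add pattern described above, against the Lucene 1.3-era API; the directory path, field names and analyzer are assumptions:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.Term;

    public class UpdateById {
        public static void update(String indexDir, String id, String content) throws Exception {
            // 1. Delete any existing documents carrying this application-level ID.
            IndexReader reader = IndexReader.open(indexDir);
            reader.delete(new Term("id", id));
            reader.close();

            // 2. Add the replacement document (create=false so the index is appended to, not wiped).
            IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), false);
            Document doc = new Document();
            doc.add(Field.Keyword("id", id));        // stored, indexed, not tokenized: exact-match key
            doc.add(Field.Text("content", content)); // tokenized and indexed
            writer.addDocument(doc);
            writer.close();
        }
    }

The IndexReader is closed before the IndexWriter is opened because deletions and additions both need the index write lock.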
DEFAULT_OPERATOR_AND
Hi! I have lucene1.4-rc3-dev. TestQueryParser works with RussianAnalyzer(RussianCharsets.CP1251) and Russian terms.

...
public Query getQueryDOA(String query, Analyzer a) throws Exception {
  if (a == null)
    a = new RussianAnalyzer(RussianCharsets.CP1251);
    // a = new SimpleAnalyzer();
  QueryParser qp = new QueryParser("field", a);
  qp.setOperator(QueryParser.DEFAULT_OPERATOR_AND);
  return qp.parse(query);
}
...

In reality QueryParser still behaves as QueryParser.DEFAULT_OPERATOR_OR even after setting QueryParser.DEFAULT_OPERATOR_AND. For example:

1. Query (after setting DEFAULT_OPERATOR_AND): term1 term2 term3 -- Result: term1 OR term2 OR term3
2. Query: +term1 +term2 +term3 -- Result: term1 AND term2 AND term3

Please help me solve this problem. Thanks, Vladimir.
Re: Combining text search + relational search
On Wednesday 28 April 2004 11:00, [EMAIL PROTECTED] wrote:
> Basically I want to limit the results of the text search by the rows that
> are returned in a relational search of other attribute data related to the
> document. The text of the document is just like any other attribute, it
> just needs to be queried differently. Does that make sense?

Yes. But why not just store the textual content in one Lucene field, and the metadata in one or more separate fields? You can then easily build queries to combine searches. And as long as metadata values are normalized, the added index size is probably insignificant compared to the fully indexed text content.

-+ Tatu +-
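A sketch of the single-index approach suggested above, with a free-text field plus normalized metadata fields; the field names, values and the particular metadata constraint are invented for illustration (Lucene 1.3-era API):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class ContentPlusMetadata {
        // Index time: full text in one field, normalized metadata in keyword fields.
        public static Document buildDoc(String text, String author, String dept) {
            Document doc = new Document();
            doc.add(Field.Text("contents", text));        // tokenized, indexed
            doc.add(Field.Keyword("author", author));     // untokenized, exact value
            doc.add(Field.Keyword("department", dept));
            return doc;
        }

        // Search time: free-text query ANDed with a metadata constraint.
        public static Hits search(IndexSearcher searcher, String userQuery, String dept)
                throws Exception {
            Query text = QueryParser.parse(userQuery, "contents", new StandardAnalyzer());
            BooleanQuery combined = new BooleanQuery();
            combined.add(text, true, false);                                        // required
            combined.add(new TermQuery(new Term("department", dept)), true, false); // required
            return searcher.search(combined);
        }
    }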
Documents when the same search is done many times
What happens in this situation: I query a field "id" for "one" and, say, I get a Document object (object A) from a search which has a field "content" with the value "pig", and that object persists forever. Then a new index is written with a document with "id"="one" and "content"="mongoose". Another search does the same query of "id" for "one". Will this new search return the same object A or a new object? If they are different, will examining object A show that the "content" field has changed? Thanks
Created LockObtainTimedOut wiki page
I just created a LockObtainTimedOut wiki entry... feel free to add. I just entered the Tomcat issue with java.io.tmpdir as well. http://wiki.apache.org/jakarta-lucene/LockObtainTimedOut Peace! -- Kevin A. Burton
Re: lucene applicability and performance
Greg,

> Yes, see RemoteSearchable and MultiSearcher in org.apache.lucene.search.
> (See the javadoc on the website)

I meant ParallelMultiSearcher. Good night, Ype
Re: lucene applicability and performance
Greg,

On Wednesday 28 April 2004 21:44, Greg Conway wrote:
> Hello. Apologies if this has come up before, I'm new to the list and
> didn't see anything in the archives that exactly matched my situation.

It has, but each situation is different. Try this: http://jakarta.apache.org/lucene/docs/benchmarks.html

> I am considering using Lucene to index and search a large collection of
> small documents in a specialized domain -- probably only a few thousand
> unique terms spanning across anywhere from one million to ten million
> small source documents. I hope to be able to get ranked search results
> back in less than 400 msec.
>
> I suspect one issue I may face is index density owing to the large
> numbers of documents and relatively small vocabulary. That, in turn,
> may be a drag on query processing. I am working on strategies to
> ameliorate that somewhat but it may be difficult.

A text search engine is your best bet in this situation.

> In the meantime, I'm looking for some gut reactions from the experts
> before I take this to the next stage. Can Lucene scale well to this
> kind of situation?

Yes.

> Can I realistically hope to get anywhere near my performance targets?

Yes.

> Will I have to distribute pieces of the index across several machines,
> parallelize my retrievals, and merge the results to do so?

That's more difficult to say. You'll need to try.

> If so, does Lucene already support that or will I have to develop that
> logic in house? (Seems like I saw a reference somewhere that such a
> feature was coming soon, but I'm not sure when or how it will be
> implemented.)

Yes, Lucene already supports it: see RemoteSearchable and MultiSearcher in org.apache.lucene.search (see the javadoc on the website), so no, you won't have to develop that logic in house. But first make sure that the Analyzer you use for indexing fits your needs.

Have fun,
Ype
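A minimal sketch of splitting an index into pieces and searching them together with MultiSearcher, as suggested above; the index locations and field name are assumptions, and a RemoteSearchable obtained over RMI could stand in for either IndexSearcher to move a piece to another machine (the ParallelMultiSearcher mentioned in the follow-up is a drop-in replacement that queries the pieces concurrently):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.MultiSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.Searchable;

    public class SplitIndexSearch {
        public static void main(String[] args) throws Exception {
            Searchable[] pieces = {
                new IndexSearcher("/indexes/part1"),   // hypothetical index locations
                new IndexSearcher("/indexes/part2")
            };
            MultiSearcher searcher = new MultiSearcher(pieces);
            Query q = QueryParser.parse("some query terms", "contents", new StandardAnalyzer());
            Hits hits = searcher.search(q);
            System.out.println(hits.length() + " hits across all pieces");
            searcher.close();
        }
    }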
Re: 'Lock obtain timed out' even though NO locks exist...
Gus Kormeier wrote:
> Not sure if our installation is the same or not, but we are also using
> Tomcat. I had a similar problem last week; it occurred after Tomcat went
> through a hard restart and some software errors had the website hammered.
> I found the lock file in /usr/local/tomcat/temp/ using locate. According
> to the README.txt this is a directory created for the JVM within Tomcat.
> So it is a system temp directory, just inside Tomcat.

Man... you ROCK! I didn't even THINK of that... Hm... I wonder if we should include the name of the lock file in the Exception; within Tomcat that would probably have saved me a lot of time :) Either that or we can put this in the wiki.

Kevin
Bug in Sandbox - Berkeley DB
IndexReader.delete(int docid) doesn't work with the Berkeley DB implementation of org.apache.lucene.store.Directory. This error message appears when closing an IndexReader which has a deletion:

PANIC: Invalid argument

I get this stack trace:

java.io.IOException: DB_RUNRECOVERY: Fatal error, run database recovery
  at org.apache.lucene.store.db.Block.put(Block.java:128)
  at org.apache.lucene.store.db.DbOutputStream.close(DbOutputStream.java:111)
  at org.apache.lucene.util.BitVector.write(BitVector.java:155)
  at org.apache.lucene.index.SegmentReader$1.doBody(SegmentReader.java:162)
  at org.apache.lucene.store.Lock$With.run(Lock.java:148)
  at org.apache.lucene.index.SegmentReader.doClose(SegmentReader.java:157)
  at org.apache.lucene.index.IndexReader.close(IndexReader.java:422)

Help!

- andy g

Code that triggers this:

// dbdir is a working DbDirectory, docid is a search result
IndexReader read = IndexReader.open(dbdir);
read.delete(docid);
read.close();
RE: 'Lock obtain timed out' even though NO locks exist...
Not sure if our installation is the same or not, but we are also using Tomcat. I had a similar problem last week; it occurred after Tomcat went through a hard restart and some software errors had the website hammered. I found the lock file in /usr/local/tomcat/temp/ using locate. According to the README.txt this is a directory created for the JVM within Tomcat. So it is a system temp directory, just inside Tomcat.

Hope that helps,
-Gus
Re: 'Lock obtain timed out' even though NO locks exist...
James Dunn wrote:
> Which version of lucene are you using? In 1.2, I believe the lock file
> was located in the index directory itself. In 1.3, it's in your system's
> tmp folder.

Yes... 1.3, and I have a script that removes the locks from both dirs... This is only one process so it's just fine to remove them.

> Perhaps it's a permission problem on either one of those folders. Maybe
> your process doesn't have write access to the correct folder and is thus
> unable to create the lock file?

I thought about that too... I have plenty of disk space so that's not an issue. Also did a chmod -R so that should work too.

> You can also pass lucene a system property to increase the lock timeout
> interval, like so:
>
> -Dorg.apache.lucene.commitLockTimeout=60000
>
> or
>
> -Dorg.apache.lucene.writeLockTimeout=60000

I'll give that a try... good idea.

Kevin
Re: 'Lock obtain timed out' even though NO locks exist...
Kevin A. Burton wrote:
> Actually this is exactly the problem... I ran some single index tests and
> a single process seems to read from it. The problem is that we were
> running under Tomcat with diff webapps for testing and didn't run into
> this problem before. We had an 11G index that just took a while to open
> and during this open Lucene was creating a lock. I wasn't sure that Tomcat
> was multithreading this so maybe it is and it's just taking longer to open
> the lock in some situations.

This is strange... after removing all the webapps (besides 1), Tomcat still refuses to allow Lucene to open this index, with "Lock obtain timed out". If I open it up from the console it works just fine. I'm only doing it with one index and a ulimit -n, so it's not a files issue. Memory is 1G for Tomcat. If I figure this out I will be sure to send a message to the list. This is a strange one.

Kevin
lucene applicability and performance
Hello. Apologies if this has come up before; I'm new to the list and didn't see anything in the archives that exactly matched my situation.

I am considering using Lucene to index and search a large collection of small documents in a specialized domain -- probably only a few thousand unique terms spanning across anywhere from one million to ten million small source documents. I hope to be able to get ranked search results back in less than 400 msec.

I suspect one issue I may face is index density owing to the large number of documents and relatively small vocabulary. That, in turn, may be a drag on query processing. I am working on strategies to ameliorate that somewhat but it may be difficult.

In the meantime, I'm looking for some gut reactions from the experts before I take this to the next stage. Can Lucene scale well to this kind of situation? Can I realistically hope to get anywhere near my performance targets? Will I have to distribute pieces of the index across several machines, parallelize my retrievals, and merge the results to do so? If so, does Lucene already support that or will I have to develop that logic in house? (Seems like I saw a reference somewhere that such a feature was coming soon, but I'm not sure when or how it will be implemented.)

Any help, tips, references, or advice would be welcome and appreciated. Thank you!

Regards,
Greg
Re: 'Lock obtain timed out' even though NO locks exist...
[EMAIL PROTECTED] wrote:
> It is possible that a previous operation on the index left the lock open.
> Leaving the IndexWriter or Reader open without closing them (in a finally
> block) could cause this.

Actually this is exactly the problem... I ran some single index tests and a single process seems to read from it. The problem is that we were running under Tomcat with diff webapps for testing and didn't run into this problem before. We had an 11G index that just took a while to open, and during this open Lucene was creating a lock. I wasn't sure that Tomcat was multithreading this, so maybe it is and it's just taking longer to open the lock in some situations.

Kevin
Re: 'Lock obtain timed out' even though NO locks exist...
Which version of lucene are you using? In 1.2, I believe the lock file was located in the index directory itself. In 1.3, it's in your system's tmp folder.

Perhaps it's a permission problem on either one of those folders. Maybe your process doesn't have write access to the correct folder and is thus unable to create the lock file?

You can also pass lucene a system property to increase the lock timeout interval, like so:

-Dorg.apache.lucene.commitLockTimeout=60000

or

-Dorg.apache.lucene.writeLockTimeout=60000

The above sets the timeout to one minute.

Hope this helps,

Jim

--- "Kevin A. Burton" <[EMAIL PROTECTED]> wrote:
> I've noticed this really strange problem on one of our boxes. It's
> happened twice already.
>
> We have indexes where when Lucene starts it says 'Lock obtain timed out'
> ... however NO locks exist for the directory.
>
> There are no other processes present and no locks in the index dir or /tmp.
>
> Is there any way to figure out what's going on here?
>
> Looking at the index it seems just fine... But this is only a brief
> glance. I was hoping that if it was corrupt (which I don't think it is)
> that lucene would give me a better error than "Lock obtain timed out"
>
> Kevin
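A hedged sketch of setting the same timeouts from code instead of the command line; this assumes the properties are read when Lucene's index classes initialize, so they must be set before the first IndexWriter or IndexReader is created:

    // Programmatic equivalent of the -D flags above (values are in milliseconds).
    System.setProperty("org.apache.lucene.writeLockTimeout", "60000");
    System.setProperty("org.apache.lucene.commitLockTimeout", "60000");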
RE: 'Lock obtain timed out' even though NO locks exist...
It is possible that a previous operation on the index left the lock open. Leaving the IndexWriter or Reader open without closing them (in a finally block) could cause this.

Anand

-----Original Message-----
From: Kevin A. Burton [mailto:[EMAIL PROTECTED]
Sent: Wednesday, April 28, 2004 2:57 PM
To: Lucene Users List
Subject: 'Lock obtain timed out' even though NO locks exist...

I've noticed this really strange problem on one of our boxes. It's happened twice already. We have indexes where when Lucene starts it says 'Lock obtain timed out'... however NO locks exist for the directory. There are no other processes present and no locks in the index dir or /tmp. Is there any way to figure out what's going on here? Looking at the index it seems just fine... But this is only a brief glance. I was hoping that if it was corrupt (which I don't think it is) that lucene would give me a better error than "Lock obtain timed out".

Kevin
'Lock obtain timed out' even though NO locks exist...
I've noticed this really strange problem on one of our boxes. It's happened twice already.

We have indexes where when Lucene starts it says 'Lock obtain timed out'... however NO locks exist for the directory.

There are no other processes present and no locks in the index dir or /tmp.

Is there any way to figure out what's going on here?

Looking at the index it seems just fine... But this is only a brief glance. I was hoping that if it was corrupt (which I don't think it is) that Lucene would give me a better error than "Lock obtain timed out".

Kevin
RE: ArrayIndexOutOfBoundsException
Philippe, thanks for the reply. I didn't FTP my index anywhere, but your response does make it seem that my index is in fact corrupted somehow.

Does anyone know of a tool that can verify the validity of a Lucene index, and/or possibly repair it? If not, does anyone have any idea how difficult it would be to write one?

Thanks,

Jim

--- Phil brunet <[EMAIL PROTECTED]> wrote:
> Hi.
>
> I had this problem when I transferred a Lucene index by FTP in "ASCII"
> mode. Using binary mode, I never had such a problem.
>
> Philippe
Re: Combining text search + relational search
Create a Lucene index from the data in the DB, and make sure to include the PKs in one of the fields (use Field.Keyword). Then query your RDBMS and get back the ResultSet. Then get the PK from each ResultSet row and use it to construct a Lucene BooleanQuery, which should include your original query string ANDed with the returned PKs combined with OR. That is, if I understand what you are trying to do :)

Otis

--- [EMAIL PROTECTED] wrote:
> Basically I want to limit the results of the text search by the rows that
> are returned in a relational search of other attribute data related to the
> document. The text of the document is just like any other attribute, it
> just needs to be queried differently. Does that make sense?
>
> Thanks
> Mike
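A sketch of the approach described above, restricting the text query to the primary keys returned by a SQL query; the table, column and field names are invented for illustration, and note that BooleanQuery caps the number of clauses (1024 by default), so very large PK sets would need a different approach:

    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;

    public class TextPlusRelational {
        public static Hits search(Connection conn, IndexSearcher searcher, String userQuery)
                throws Exception {
            // 1. Relational part: collect the PKs that satisfy the metadata constraints.
            BooleanQuery pkFilter = new BooleanQuery();
            Statement st = conn.createStatement();
            ResultSet rs = st.executeQuery("SELECT doc_id FROM documents WHERE status = 'published'");
            while (rs.next()) {
                // each PK clause is optional on its own, so the group behaves as an OR
                pkFilter.add(new TermQuery(new Term("pk", rs.getString(1))), false, false);
            }
            rs.close();
            st.close();

            // 2. Text part, ANDed with the PK group.
            Query text = QueryParser.parse(userQuery, "contents", new StandardAnalyzer());
            BooleanQuery combined = new BooleanQuery();
            combined.add(text, true, false);      // required
            combined.add(pkFilter, true, false);  // required: at least one of the PKs must match
            return searcher.search(combined);
        }
    }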
Re: Combining text search + relational search
Basically I want to limit the results of the text search by the rows that are returned in a relational search of other attribute data related to the document. The text of the document is just like any other attribute, it just needs to be queried differently. Does that make sense?

Thanks
Mike

On 04/28/2004 10:38 AM, Stephane James Vaucher wrote:
> I'm a bit confused why you want this.
Re: Read past EOF and negative bufferLength problem (1.4 rc2)
Daniel,

Everything works fine with the latest CVS version of Lucene. It looks like the bug I hit was the one that you referenced in your email, which is now fixed. Thanks for your help.

. .. . ...joe

Daniel Naber wrote:
> On Tuesday, 27 April 2004 21:00, Joe Berkovitz wrote:
> > Using Lucene 1.4 rc2 I've run into a fatal problem:
>
> Could you try with the latest version from CVS? Several severe problems
> have been fixed, but I'm not sure if yours was one of them. Also see
> http://issues.apache.org/bugzilla/show_bug.cgi?id=27587
Re: Combining text search + relational search
I'm a bit confused why you want this.

As far as I know, relational db searches will return exact matches without a measure of relevancy. To measure relevancy, you need a search engine. For your results to be coherent, you would have to put everything in the Lucene index.

As for memory consumption: for searching, if the index is on disk, then the memory footprint depends on the type of queries you use. For indexing, it depends on whether you use a tmp RAMDirectory to do merges; otherwise, memory consumption is minimal.

HTH
sv

On Wed, 28 Apr 2004 [EMAIL PROTECTED] wrote:
> I need to somehow allow users to do a text search and query relational
> database attributes at the same time. The attributes are basically
> metadata about the documents that the text search will be performed on.
> I have the text of the documents indexed in Lucene. Does anyone have any
> advice or examples? I also need to make sure I don't gobble up all the
> memory on our server.
>
> Thanks
> Mike
RE: Re-associate a token with its source
Thank you, but I think I didn't explain my problem clearly enough. I have four positions (top, bottom, right and left) for each of the words of the document, so I would have to store in the index the content of the page with the positions in the middle:

org.apache.lucene.document.Field#UnIndexed("content", "house 1142 1231 3212 2214 dog 2213 2432 3214 2134 ...")

In order to get the values after a search I would need to parse the returned document to find the positions that are next to the searched word. I have seen that the class Token has four properties (beginColumn, beginLine, endColumn and endLine) and I don't know if it is possible to use them to store, for each token, the position that I want. I think this approach is not the correct one, so any help on this would be appreciated.

Olaia.

-----Original Message-----
From: Stephane James Vaucher [mailto:[EMAIL PROTECTED]
Sent: Tuesday, 27 April 2004 21:46
To: Lucene Users List
Subject: Re: Re-associate a token with its source

When indexing, use UnIndexed fields to store this data in your document: org.apache.lucene.document.Field#UnIndexed(String name, String value). Add the fields using org.apache.lucene.document.Document.add(Field). After your search, you can get the field value from Document Hits.doc(int). You can retrieve your stored values using String Document.get(String name).

HTH,
sv

On Tue, 27 Apr 2004, Olaia Vázquez Sánchez wrote:
> Hello
>
> I have documents in XML in which, for each word, I have four positions
> (top, bottom, left and right) that would let me highlight this word in a
> jpg image. I want to index these XML documents and highlight the results
> of the queries in the image, so I need to store these positions for each
> word inside the index.
>
> I was searching for how I can use the Token fields to store these
> attributes, but I didn't find any example where these fields are used.
>
> Thanks,
>
> Olaia Vázquez
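One hedged sketch of a workaround along the lines suggested above: index the page text normally for searching, and keep the per-word coordinates in a stored-only (UnIndexed) field that the application parses after a hit. The field names, coordinate encoding and image name are all invented for illustration:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class PageDocument {
        public static Document build(String pageText, String wordCoords, String imageName) {
            Document doc = new Document();
            doc.add(Field.Text("contents", pageText));      // tokenized and indexed, used for searching
            doc.add(Field.UnIndexed("coords", wordCoords));  // stored only, e.g. "house=1142,1231,3212,2214\ndog=2213,2432,3214,2134"
            doc.add(Field.UnIndexed("image", imageName));    // e.g. "page-001.jpg"
            return doc;
        }
    }

After a search, Hits.doc(i).get("coords") returns the stored string, which the application can parse to find the rectangle of the matched word and highlight it on the image.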
Combining text search + relational search
I need to somehow allow users to do a text search and query relational database attributes at the same time. The attributes are basically metadata about the documents that the text search will be performed on. I have the text of the documents indexed in Lucene. Does anyone have any advice or examples? I also need to make sure I don't gobble up all the memory on our server.

Thanks
Mike
[Lucene] XML Indexing
XMLIndexingDemo seems not able to index traditional Chinese characters: I can only search for English text and not Chinese. In fact, my XML document contains both Chinese and English text. How can I fix this problem? Is it necessary for me to convert the Chinese characters from BIG5 to UTF-8 before doing the file indexing? If it is, then how can we do it? This problem doesn't happen when indexing bilingual HTML files (Chinese & English) with the Lucene demo HTML parser.
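A hedged sketch of one likely fix: decode the XML file with its real charset (Big5 here, as an assumption about the source files) when building the Document, so the analyzer sees proper Unicode. Java strings are Unicode internally, so no separate conversion to UTF-8 is needed for indexing:

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.Reader;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;

    public class Big5Indexing {
        public static Document buildDoc(File f) throws IOException {
            // Decode the bytes using the file's actual encoding instead of the platform default.
            Reader content = new InputStreamReader(new FileInputStream(f), "Big5");
            Document doc = new Document();
            doc.add(Field.Text("contents", content));    // tokenized and indexed, not stored
            doc.add(Field.Keyword("path", f.getPath()));
            return doc;
        }
    }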
Count for a keyword occurrence in a file
Hi,

How can I get a count of the score given by Hits.score()? I.e. I want to know how many times a keyword occurs in a file. Any help on this would be appreciated.

regards
Hemal Bhatt
Re: Segments file get deleted?!
Hi,

Thanks for the reply. I got that error in my previous build; now I don't see it at all. Also, I couldn't retain the log. I will definitely come back if I see it again. Anyway, below is my machine config: Windows XP Personal Ed., 512MB, P4. My app server is Resin 2.1.12. I will definitely come up with more details when I get it again. Thanks again.

Surya

----- Original Message -----
From: "Nader S. Henein" <[EMAIL PROTECTED]>
To: "'Lucene Users List'" <[EMAIL PROTECTED]>
Sent: Monday, April 26, 2004 12:42 PM
Subject: RE: Segments file get deleted?!

Can you give us a bit of background? We've been using Lucene since the first stable release 2 years ago, and I've never had segments disappear on me. First of all, can you provide some background on your setup, and secondly, when you say "a certain period of time", how much time are we talking about here, and does that interval coincide with your indexing schedule? You may have the create flag on the Indexer set to true, so it simply recreates the index at every update and deletes whatever was there; of course, if there are no files to index at any point it will just give you a blank index.

Nader Henein

-----Original Message-----
From: Surya Kiran [mailto:[EMAIL PROTECTED]
Sent: Monday, April 26, 2004 7:48 AM
To: [EMAIL PROTECTED]
Subject: Segments file get deleted?!

Hi all, we have implemented our portal search using Lucene. It works fine, but after a certain period of time the Lucene "segments" file gets deleted. Eventually all searches fail. Can anyone guess where the error could be? Thanks a lot.

Regards
Surya.
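A small sketch of the create-flag point raised above: only create a fresh index when none exists yet, otherwise append, so a scheduled update cannot silently wipe the existing segments. The index path and analyzer are assumptions:

    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;

    public class SafeWriterOpen {
        public static IndexWriter open(String indexDir) throws Exception {
            // create=true wipes any existing index, so only use it when nothing is there yet.
            boolean create = !IndexReader.indexExists(new File(indexDir));
            return new IndexWriter(indexDir, new StandardAnalyzer(), create);
        }
    }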
RE: ArrayIndexOutOfBoundsException
Hi.

I had this problem when I transferred a Lucene index by FTP in "ASCII" mode. Using binary mode, I never had such a problem.

Philippe

> From: James Dunn <[EMAIL PROTECTED]>
> Subject: ArrayIndexOutOfBoundsException
> Date: Mon, 26 Apr 2004 12:15:39 -0700 (PDT)
>
> Hello all,
>
> I have a web site whose search is driven by Lucene 1.3. I've been doing
> some load testing using JMeter and occasionally I will see the exception
> below when the search page is under heavy load.
>
> Has anyone seen similar errors during load testing?
>
> I've seen some posts with similar exceptions and the general consensus is
> that this error means that the index is corrupt. I'm not sure my index is
> corrupt, however. I can run all the queries I use for load testing under
> normal load and I don't appear to get this error.
>
> Is there any way to verify that a Lucene index is corrupt or not?
>
> Thanks,
>
> Jim
>
> java.lang.ArrayIndexOutOfBoundsException: 53 >= 52
>   at java.util.Vector.elementAt(Vector.java:431)
>   at org.apache.lucene.index.FieldInfos.fieldInfo(FieldInfos.java:135)
>   at org.apache.lucene.index.FieldsReader.doc(FieldsReader.java:103)
>   at org.apache.lucene.index.SegmentReader.document(SegmentReader.java:275)
>   at org.apache.lucene.index.SegmentsReader.document(SegmentsReader.java:112)
>   at org.apache.lucene.search.IndexSearcher.doc(IndexSearcher.java:107)
>   at org.apache.lucene.search.MultiSearcher.doc(MultiSearcher.java:100)
>   at org.apache.lucene.search.MultiSearcher.doc(MultiSearcher.java:100)
>   at org.apache.lucene.search.Hits.doc(Hits.java:130)
Re: status of LARM project
Kelvin is all correct. A few years ago there were no quality open source crawlers available; there are now a number of very good ones. Archive.org's crawler is available, and there is Larbin, Nutch, etc. LARM works, it's just not maintained any more.

Otis

--- Kelvin Tan <[EMAIL PROTECTED]> wrote:
> As far as I know, LARM is defunct. I read somewhere, perhaps apocryphal,
> that Clemens got a job which wasn't supportive of his continued
> development on LARM. AFAIK there aren't any other active developers of
> LARM (at least at the time it branched off to SF).
>
> Otis recently posted to use Nutch instead of LARM.
>
> Kelvin
>
> On 28 Apr 2004 09:44:04 +0800, Sebastian Ho said:
> > Hi
> >
> > I have looked at the LARM website and I get different results.
> >
> > http://nagoya.apache.org/wiki/apachewiki.cgi?LuceneLARMPages
> > It says that development has stopped for this project.
> >
> > LARM hosted on SourceForge: the last message in the mailing list was
> > dated 2003. Is it still supported and active?
> >
> > LARM hosted on Apache: it says the project has moved to SourceForge.
> >
> > Can anyone here who is active in LARM comment on the status?
> >
> > Regards
> >
> > Sebastian Ho