Re[2]: Is IndexSearcher thread safe?

2005-03-01 Thread Yura Smolsky
Hello, Volodymyr.

VB An additional question.
VB If I'm sharing one instance of IndexSearcher between different threads,
VB is it OK to just drop this instance to the GC?
VB Because I don't know whether some thread is still using this searcher or is
VB done with it.

It is safe to share one instance among many threads, and it should be
safe to drop the old object to the GC.

But I have discovered one strange fact. When you have an IndexSearcher on a
big index, the IndexSearcher object takes a lot of memory (900 MB), and
when you create a new IndexSearcher after deleting all references to the
old IndexSearcher, the memory consumed by the old IndexSearcher is never
freed.
What can the community say about this strange fact?
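
For what it's worth, a rough sketch (class, variable and path names are made
up, not from this thread) of swapping searchers while explicitly closing the
old one: the old IndexSearcher holds an IndexReader with open files, norms
and sort caches, and closing it before dropping the reference should let the
GC reclaim that memory. Closing while another thread is still searching on
the old instance needs care (e.g. a short grace period or reference counting):

    import java.io.IOException;
    import org.apache.lucene.search.IndexSearcher;

    public class SearcherSwap {
        private IndexSearcher searcher;   // shared, read by the searching threads

        // Replace the shared searcher and release the old one's resources.
        public synchronized void reopen(String indexPath) throws IOException {
            IndexSearcher old = searcher;
            searcher = new IndexSearcher(indexPath);
            if (old != null) {
                // Closes the underlying IndexReader: file handles, norms and
                // sort caches; after this the GC can reclaim the old instance.
                old.close();
            }
        }
    }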

Yura Smolsky.






IndexSearcher and IndexWriter on 2 CPUs

2005-02-28 Thread Yura Smolsky
Hello.

I have a dual-CPU box with RH Linux. I run two processes on this box:

1. IndexWriter, which adds new documents to the index constantly, 24/7/365
:)
2. IndexSearcher, which performs searches on this index.

Sometimes the writer begins to merge the index (this is caused by mergeFactor
and the structure of the Lucene index) inside the addDocument method. When a
merge begins, my writer process takes both CPUs' time (180-200% in total).
Actually, most of that time goes to I/O operations.

When a merge operation begins, all searches performed by the
IndexSearcher on this computer slow down severely, because all the CPU
time goes to the first process.

How can I give the second process more CPU time, or how can I reduce the I/O
time of the first process?

Maybe I can tweak something in the index configuration.
I have set:
   writer.mergeFactor = 2
   writer.minMergeDocs = 2500
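
For reference, a sketch of how those knobs look in Java (the values below are
only illustrative, not a recommendation from this thread; maxMergeDocs is my
own assumption as a way to keep routine merges small and defer the big
consolidation to an explicit optimize()):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    public class WriterTuning {
        public static void main(String[] args) throws Exception {
            // "/path/to/index" is a made-up path; false = open an existing index
            IndexWriter writer =
                new IndexWriter("/path/to/index", new StandardAnalyzer(), false);

            writer.mergeFactor = 10;      // merge less often, in bigger batches,
                                          // than with mergeFactor = 2
            writer.minMergeDocs = 2500;   // buffer this many docs in RAM before
                                          // a new on-disk segment is written
            writer.maxMergeDocs = 100000; // don't merge segments beyond this size
                                          // during addDocument(); defer the big
                                          // consolidation to an explicit optimize()

            // ... addDocument() calls ...
            writer.close();
        }
    }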


Yura Smolsky.






Re[2]: sorted search

2005-02-24 Thread Yura Smolsky
Hello, Erik.

If I need to store hour and minute, then I need to put the date into the
following integer format:
YYYYMMDDHHMM
?
Will it be faster than the current solution?
And will I still be able to do range queries (from date A to date B)?
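
For reference, a rough sketch (not from this thread; field names, paths and
query terms are made up) of sorting numerically while keeping range queries,
assuming "modified" is indexed as an untokenized YYYYMMDD keyword, which still
fits in an int. A minute-resolution value like YYYYMMDDHHMM no longer fits in
an int, so that would need a STRING sort or a separate field:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.*;

    public class DateSortSketch {
        public static void main(String[] args) throws Exception {
            IndexSearcher searcher = new IndexSearcher("/path/to/index");

            // Sort by the numeric date field, newest first; an int cache is
            // much smaller than a String cache for 40 million documents.
            Sort sort = new Sort(new SortField("modified", SortField.INT, true));

            // Range query from date A to date B on the same lexicographically
            // ordered values (inclusive on both ends).
            Query dateRange = new RangeQuery(new Term("modified", "20050101"),
                                             new Term("modified", "20050228"),
                                             true);

            Query text = QueryParser.parse("good", "content",
                                           new StandardAnalyzer());
            BooleanQuery query = new BooleanQuery();
            query.add(text, true, false);       // required
            query.add(dateRange, true, false);  // required

            Hits hits = searcher.search(query, sort);
            System.out.println(hits.length() + " hits");
        }
    }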

EH Sorting by String uses up lots more RAM than a numeric sort.  If you
EH use a numeric (yet lexicographically orderable) date format (e.g. 
EH YYYYMMDD) you'll see better performance most likely.

EH Erik


EH On Feb 24, 2005, at 1:01 PM, Yura Smolsky wrote:

 Hello, lucene-user.

 I have an index with many documents, more than 40 million.
 Each document has a DateField (it is the timestamp of the document).

 I need the most recent results only. I use a single instance of
 IndexSearcher.
 When I perform a sorted search on this index:
   Sort sort = new Sort();
   sort.setSort(new SortField[] { new SortField("modified",
     SortField.STRING, true) });
   Hits hits = searcher.search(
     QueryParser.parse("good", "content", new StandardAnalyzer()), sort);

 then the search speed is not good.

 Today I tried the search without sorting by modified, only sorting by
 relevance. The speed was much better!

 I think that sorting by DateField is very slow. Maybe I am doing something
 wrong with this kind of sorted search? Can you give me advice about
 this?

 Thanks.

 Yura Smolsky.








Yura Smolsky.






Re[2]: sorted search

2005-02-24 Thread Yura Smolsky
Hello, Erik.

About memory usage...
A DateField value takes a 9-character string in memory ('000ic64p7').
How much memory will be taken by such a string?

How much memory will be taken by an integer?
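
A back-of-the-envelope estimate (the per-String overhead below is an
assumption for a typical 32-bit JVM, not a measured number) of what the sort
cache costs for the 40 million documents mentioned earlier:

    public class SortCacheEstimate {
        public static void main(String[] args) {
            long docs = 40000000L;   // index size mentioned in this thread

            // String sort: the field cache keeps one String object per document
            // (reference + object header + 9-char array; ~60 bytes assumed).
            long stringCache = docs * (4 + 56);

            // Int sort: the field cache keeps one int[] entry per document.
            long intCache = docs * 4;

            System.out.println("String sort cache ~ " + (stringCache >> 20) + " MB");
            System.out.println("int sort cache    ~ " + (intCache >> 20) + " MB");
        }
    }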

EH Sorting by String uses up lots more RAM than a numeric sort.  If you
EH use a numeric (yet lexicographically orderable) date format (e.g. 
EH YYYYMMDD) you'll see better performance most likely.

EH Erik


EH On Feb 24, 2005, at 1:01 PM, Yura Smolsky wrote:

 Hello, lucene-user.

 I have an index with many documents, more than 40 million.
 Each document has a DateField (it is the timestamp of the document).

 I need the most recent results only. I use a single instance of
 IndexSearcher.
 When I perform a sorted search on this index:
   Sort sort = new Sort();
   sort.setSort(new SortField[] { new SortField("modified",
     SortField.STRING, true) });
   Hits hits = searcher.search(
     QueryParser.parse("good", "content", new StandardAnalyzer()), sort);

 then the search speed is not good.

 Today I tried the search without sorting by modified, only sorting by
 relevance. The speed was much better!

 I think that sorting by DateField is very slow. Maybe I am doing something
 wrong with this kind of sorted search? Can you give me advice about
 this?

 Thanks.

 Yura Smolsky.








Yura Smolsky.






Re[2]: Search Performance

2005-02-18 Thread Yura Smolsky
Hello, Michael.

btw, you can recreate the IndexSearcher every 5|10|30|60|X minutes
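
A rough sketch of that idea combined with the IndexReader.getCurrentVersion()
check David describes below (class and path names are made up); the searcher
is reopened only when the index version has actually changed:

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.IndexSearcher;

    public class CachedSearcher {
        private final String indexPath;   // made-up location of the index
        private IndexSearcher searcher;
        private long version = -1;

        public CachedSearcher(String indexPath) {
            this.indexPath = indexPath;
        }

        // Reopen only when the index version has actually changed.
        public synchronized IndexSearcher get() throws IOException {
            long current = IndexReader.getCurrentVersion(indexPath);
            if (searcher == null || current != version) {
                IndexSearcher old = searcher;
                searcher = new IndexSearcher(indexPath);
                version = current;
                if (old != null) {
                    old.close();   // careful: searches still running on 'old' will fail
                }
            }
            return searcher;
        }
    }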

MC My index is changing in real time constantly... in this case I guess this
MC will not work for me. Any suggestions?

MC Michael

MC -Original Message-
MC From: David Townsend [mailto:[EMAIL PROTECTED] 
MC Sent: Friday, February 18, 2005 11:50 AM
MC To: Lucene Users List
MC Subject: RE: Search Performance

MC IndexSearchers are thread safe, so you can use the same object on multiple
MC requests.  If the index is static and not constantly updating, just keep one
MC IndexSearcher for the life of the app.  If the index changes and you need
MC that instantly reflected in the results, you need to check whether the index
MC has changed; if it has, create a new cached IndexSearcher.  To check for
MC changes you'll need to monitor the version number of the index, obtained via

MC IndexReader.getCurrentVersion(Index Name)

MC David

MC -Original Message-
MC From: Stefan Groschupf [mailto:[EMAIL PROTECTED]
MC Sent: 18 February 2005 16:15
MC To: Lucene Users List
MC Subject: Re: Search Performance


MC Try a singleton pattern or a static field.

MC Stefan

MC Michael Celona wrote:

I am creating new IndexSearchers... how do I cache my IndexSearcher...

Michael

-Original Message-
From: David Townsend [mailto:[EMAIL PROTECTED] 
Sent: Friday, February 18, 2005 11:00 AM
To: Lucene Users List
Subject: RE: Search Performance

Are you creating new IndexSearchers or IndexReaders on each search?  Caching
your IndexSearchers has a dramatic effect on speed.

David Townsend

-Original Message-
From: Michael Celona [mailto:[EMAIL PROTECTED]
Sent: 18 February 2005 15:55
To: Lucene Users List
Subject: Search Performance


What is the single best way to improve search performance?  I have
an index in the 2 GB range stored on the local file system of the searcher.
Under a load test of 5 simultaneous users my average search time is ~4700
ms.  Under a load test of 10 simultaneous users my average search time is
~1 ms.  I have given the JVM 2 GB of memory and am using dual 3 GHz
Xeons.  Any ideas?

 

Michael







Yura Smolsky.






big index and multi threaded IndexSearcher

2005-02-16 Thread Yura Smolsky
Hello.

I use PyLucene, the Python port of Lucene.

I have a problem using a big index (50 GB) with IndexSearcher
from many threads.
I use IndexSearcher from PyLucene's PythonThread. It's really a wrapper
around a Java/libgcj thread that Python is tricked into thinking
is one of its own.

The core of the problem:
When I have many threads (more than 5), I receive this exception:
  File "/usr/lib/python2.4/site-packages/PyLucene.py", line 2241, in search
    def search(*args): return _PyLucene.Searcher_search(*args)
ValueError: java.lang.OutOfMemoryError
   No stacktrace available

When I decrease the number of threads to 3 or even 1, then the search works.
How can having many threads cause this exception?..

I have 2 GB of memory. So with one thread the process takes around
1200-1300 MB.

Andi Vajda suggested that "there may be overhead involved in having
multiple threads against a given index."

Does anyone here have experience in handling big indexes with many
threads?

Any ideas are appreciated.
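
One workaround I may try (just a sketch of my own, in Java terms, nothing
from this thread; class names are made up): cap the number of concurrent
searches with a semaphore so that per-search allocations stay bounded:

    import java.util.concurrent.Semaphore;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    public class ThrottledSearcher {
        private final IndexSearcher searcher;
        private final Semaphore permits;

        public ThrottledSearcher(IndexSearcher searcher, int maxConcurrent) {
            this.searcher = searcher;
            this.permits = new Semaphore(maxConcurrent);
        }

        // At most maxConcurrent searches run at once; extra threads wait
        // instead of piling up per-search allocations until the JVM throws
        // java.lang.OutOfMemoryError.
        public Hits search(Query query) throws Exception {
            permits.acquire();
            try {
                return searcher.search(query);
            } finally {
                permits.release();
            }
        }
    }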

Yura Smolsky.






Re[2]: big index and multi threaded IndexSearcher

2005-02-16 Thread Yura Smolsky
Hello, PA.


 Does anyone here have experience in handling big indexes with many
 threads?
P What about turning the problem around and splitting your index into
P several chunks? Then you could search those (smaller) indices in
P parallel and consolidate the final result, no?

Well, I don't have 6 CPUs in one box :)

Yura Smolsky.






Re[2]: big index and multi threaded IndexSearcher

2005-02-16 Thread Yura Smolsky
Hello, Erik.

EH Are you using multiple IndexSearcher instances?  Or only one and
EH sharing it across multiple threads?

EH If using a single shared IndexSearcher instance doesn't help, it may be
EH beneficial to port your code to Java and try it there.

I have a single instance of IndexSearcher and I pass a reference to it to each
thread. I will port the code to Java if no other ideas come to
mind...

EH On Feb 16, 2005, at 3:04 PM, Yura Smolsky wrote:

 Hello.

 I use PyLucene, the Python port of Lucene.

 I have a problem using a big index (50 GB) with IndexSearcher
 from many threads.
 I use IndexSearcher from PyLucene's PythonThread. It's really a wrapper
 around a Java/libgcj thread that Python is tricked into thinking
 is one of its own.

 The core of the problem:
 When I have many threads (more than 5), I receive this exception:
   File "/usr/lib/python2.4/site-packages/PyLucene.py", line 2241, in search
     def search(*args): return _PyLucene.Searcher_search(*args)
 ValueError: java.lang.OutOfMemoryError
    No stacktrace available

 When I decrease the number of threads to 3 or even 1, then the search works.
 How can having many threads cause this exception?..

 I have 2 GB of memory. So with one thread the process takes around
 1200-1300 MB.

 Andi Vajda suggested that "there may be overhead involved in having
 multiple threads against a given index."

 Does anyone here have experience in handling big indexes with many
 threads?


Yura Smolsky.






Highlighter: how to specify text from external source?

2005-02-08 Thread Yura Smolsky
Hello, lucene-user.

If I do not store the text fields in the index, is there a way to specify the
values for the Highlighter from an external source, and how?
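
In case it helps the discussion, a rough sketch of what I am trying to do
(field names, the query and the text source are made up). As far as I can
tell, the sandbox Highlighter takes the text as a plain String plus a
TokenStream, so the text could come from anywhere, not only a stored field:

    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.highlight.Highlighter;
    import org.apache.lucene.search.highlight.QueryScorer;

    public class ExternalTextHighlight {
        public static void main(String[] args) throws Exception {
            Analyzer analyzer = new StandardAnalyzer();
            Query query = QueryParser.parse("good", "content", analyzer);

            // Text fetched from outside the index (database, file system, ...);
            // the field name "content" only tells the analyzer how to tokenize.
            String externalText = "full document text loaded from my own storage";

            Highlighter highlighter = new Highlighter(new QueryScorer(query));
            String fragments = highlighter.getBestFragments(
                analyzer.tokenStream("content", new StringReader(externalText)),
                externalText, 3, " ... ");
            System.out.println(fragments);
        }
    }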

Thanks in advance.

Yura Smolsky






ParallelMultiSearcher and many RemoteSearchers

2005-02-05 Thread Yura Smolsky
Hello, lucene-user.

Does anyone have an idea whether ParallelMultiSearcher and many
RemoteSearchers would be a way to get fast search on an index distributed
across many servers?

For example, I have 5 servers with an index of 50 GB on each server.
The indexes are updated interactively. I want to run on a 6th server a
ParallelMultiSearcher which will be connected to the other 5 servers
through RemoteSearchers.

Is it okay to go with the RMI-based RemoteSearcher class in this case?..
I am concerned about the response time and speed of the system...
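
In Java terms, a rough sketch of the setup I have in mind (host names and
the RMI binding name are made up; it assumes an RMI registry is already
running on each index server): each box exports a RemoteSearchable, and the
front end wraps the five stubs in a ParallelMultiSearcher:

    import java.rmi.Naming;
    import org.apache.lucene.search.*;

    public class DistributedSearch {
        // On each of the 5 index servers (registry assumed to be running):
        public static void bind(String indexPath) throws Exception {
            RemoteSearchable remote =
                new RemoteSearchable(new IndexSearcher(indexPath));
            Naming.rebind("//localhost/Searchable", remote);
        }

        // On the 6th (front-end) server:
        public static Searcher connect(String[] hosts) throws Exception {
            Searchable[] shards = new Searchable[hosts.length];
            for (int i = 0; i < hosts.length; i++) {
                shards[i] = (Searchable) Naming.lookup("//" + hosts[i] + "/Searchable");
            }
            // Queries all shards in parallel and merges the results.
            return new ParallelMultiSearcher(shards);
        }
    }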

Yura Smolsky







Re[2]: Disk space used by optimize

2005-02-04 Thread Yura Smolsky
Hello, Doug.

 There is a big difference between using the compound index format and
 multiple files. I have tested it on a big index (45 GB). When I used the
 compound file format, optimize took 3 times more space, because the *.cfs
 needs to be unpacked.

 Now I use the non-compound file format. It needs about twice as much
 disk space.
DC Perhaps we should add something to the javadocs noting this?

Sure. I was a bit confused about optimizing the compound file format because I
had no info about space usage during optimization.
More info in the javadocs will save somebody's time :)


Yura Smolsky







Re[2]: Disk space used by optimize

2005-01-30 Thread Yura Smolsky
Hello, Otis.

There is a big difference between using the compound index format and
multiple files. I have tested it on a big index (45 GB). When I used the
compound file format, optimize took 3 times more space, because the *.cfs
needs to be unpacked.

Now I use the non-compound file format. It needs about twice as much
disk space.

OG Have you tried using the multifile index format?  Now I wonder if there
OG is actually a difference in disk space consumed by optimize() when you
OG use the multifile and compound index formats...

OG Otis

OG --- Kauler, Leto S [EMAIL PROTECTED] wrote:

 Our copy of LIA is in the mail ;)
 
 Yes the final three files are: the .cfs (46.8 MB), deletable (4 bytes),
 and segments (29 bytes).
 
 --Leto
 
 
 
  -Original Message-
  From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] 
  
  Hello,
  
  Yes, that is how optimize works - copies all existing index 
  segments into one unified index segment, thus optimizing it.
  
  see hit #1:
 http://www.lucenebook.com/search?query=optimize+disk+space
  
  However, three times the space sounds a bit too much, or I
  made a mistake in the book. :)
  
  You said you end up with 3 files - .cfs is one of them, right?
  
  Otis
  
  
  --- Kauler, Leto S [EMAIL PROTECTED] wrote:
  
   
   Just a quick question:  after writing an index and then calling
   optimize(), is it normal for the index to expand to about three times
   the size before finally compressing?
   
   In our case the optimise grinds the disk, expanding the index into
   many files of about 145MB total, before compressing down to three
   files of about 47MB total.  That must be a lot of disk activity for
   the people with multi-gigabyte indexes!
   
   Regards,
   Leto
 


Yura Smolsky,







IndexWriter.addIndexes()

2005-01-19 Thread Yura Smolsky
Hello, lucene-user.

Is there a way to do an index merge without optimization?..

Yura Smolsky,







Re[2]: RemoteSearcher

2005-01-07 Thread Yura Smolsky
Hello, Otis.

Interesting. Nutch doesn't use RemoteSearchable because RemoteSearchable is not
very useful? I mean, is it suitable for distributing the search across many
servers in parallel or not? Will it give us good performance?

So we have RemoteSearchable in the sources, but no one uses it. :)

I ask this question because I use PyLucene (a very good port to
Python), and I would need to figure out a lot of things to implement something
like RemoteSearchable in omniORBpy (CORBA).  I have a big index (3,000,000 docs)
and many fields. I have noticed that search is becoming slower. I want to
distribute the index across many servers. Is RemoteSearchable worth it?

BTW, is there a working demo of Nutch with a big index?

OG Nutch (nutch.org) has a pretty sophisticated infrastructure for
OG distributed searching, but it doesn't use RemoteSearcher.


 Does anyone know of an application based on RemoteSearcher that
 distributes an index across many servers?
 


Yura Smolsky,







RemoteSearcher

2005-01-05 Thread Yura Smolsky
Hello.

Does anyone know of an application based on RemoteSearcher that
distributes an index across many servers?

Yura Smolsky,







IndexWriter.optimize()

2004-12-09 Thread Yura Smolsky
Hello, lucene-user.

I used FSDirectory as the storage for the index, and I used the optimize()
method of IndexWriter to optimize the index for faster access.

Now I use DbDirectory (Berkeley DB) as the storage. Does it make sense to
use the optimize() method on an index stored in this storage?..

What does optimize() actually do?

Yura Smolsky



