RE: constant scoring queries

2005-05-10 Thread Robert Engels
I did nearly the exact same thing in my "derived" Lucene. But in order
to limit modifications to the Lucene core, I created a QueryCache class, and
have derived versions of Prefix and Range query consult the class, passing
in the IndexReader and query to see if there is a cached result. I also
call QueryCache.clear(IndexReader) when the IndexReader goes out of scope.
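
A simplified sketch of the idea (not the actual class; it assumes the cached
Query implements equals()/hashCode(), which the standard query classes do,
and the per-reader LRU limit is only illustrative):

import java.util.BitSet;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Query;

public class QueryCache {
  private static final int MAX_ENTRIES_PER_READER = 64;   // illustrative limit
  private static final Map byReader = new HashMap();      // IndexReader -> LRU map

  public static synchronized BitSet get(IndexReader reader, Query query) {
    Map queries = (Map) byReader.get(reader);
    return queries == null ? null : (BitSet) queries.get(query);
  }

  public static synchronized void put(IndexReader reader, Query query, BitSet bits) {
    Map queries = (Map) byReader.get(reader);
    if (queries == null) {
      queries = new LinkedHashMap(16, 0.75f, true) {       // access-ordered = LRU
        protected boolean removeEldestEntry(Map.Entry eldest) {
          return size() > MAX_ENTRIES_PER_READER;
        }
      };
      byReader.put(reader, queries);
    }
    queries.put(query, bits);
  }

  // called when the IndexReader goes out of scope
  public static synchronized void clear(IndexReader reader) {
    byReader.remove(reader);
  }
}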

Will there be a problem with associating the cache with the IndexSearcher
instances, since it seems that common Lucene code uses code similar to

IndexSearcher searcher = new IndexSearcher(reader);

every time they need to perform a search?

It is REALLY efficient for automatic caching of common range queries and
prefix queries, as I think many users of Lucene use a range query to
look for documents modified in the "last n days". The ONLY overhead is extra
memory usage (since without the cache the query needs to be executed as is),
but the size of the LRU cache can be controlled via a property.

-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Tuesday, May 10, 2005 3:40 PM
To: java-dev@lucene.apache.org
Subject: constant scoring queries


Background: In http://issues.apache.org/bugzilla/show_bug.cgi?id=34673,
Yonik Seeley proposes a ConstantScoreQuery, based on a Filter.  And in
http://www.mail-archive.com/lucene-dev@jakarta.apache.org/msg08007.html
I proposed a mechanism to promote the use of Filters.  Through all of
this, Paul Elschot has hinted that there might be a better way.

Here's another proposal, tackling many of the same issues:

1. Add two methods to Query.java:

   public boolean constantScoring();
   public void constantScoring(boolean);

   When constantScoring(), the boost() is the score for matches.

2. Add two methods to Searcher.java:

   public BitSet cachedBitSet(Query) { return null; }
   public void cacheBitSet(Query, BitSet) {}

   IndexSearcher overrides these to maintain an LRU cache of bitsets.

3. Modify BooleanQuery so that, when constantScoring(), TooManyClauses
is not thrown.

4. Modify BooleanScorer to, if constantScoring(),
   - check Searcher for a cached bitset
   - failing that, create a bitset
   - evaluate clauses serially, saving results in bitset
   - cache the bitset
   - use the bitset to handle doc(), next() and skipTo();

5. TermQuery and PhraseQuery could be similarly modified, so that, when
constant scoring, bitsets are cached for very common terms (e.g., >5% of
documents).

With these changes, WildcardQuery, PrefixQuery, RangeQuery etc., when
declared to be constant scoring, will operate much faster and never
throw TooManyClauses.  We can add an option (the default?) to
QueryParser to make these constant scoring.

Also, instead of BitSet we could use an interface:

   public interface DocIdSet {
     void add(int docId);
     boolean contains(int docId);
     int next(int docId);
   }

to permit sparse representations.
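
For example, a sparse implementation could be a sorted int array.  A rough
sketch, assuming doc ids are added in increasing order (as they would be when
filling the set from a TermDocs scan):

public class SortedIntDocIdSet implements DocIdSet {
  private int[] docs = new int[16];
  private int size = 0;

  public void add(int docId) {
    if (size == docs.length) {
      int[] grown = new int[docs.length * 2];
      System.arraycopy(docs, 0, grown, 0, size);
      docs = grown;
    }
    docs[size++] = docId;          // assumes ascending insertion order
  }

  public boolean contains(int docId) {
    int i = search(docId);
    return i < size && docs[i] == docId;
  }

  public int next(int docId) {     // first contained id >= docId, or -1
    int i = search(docId);
    return i < size ? docs[i] : -1;
  }

  // index of the first element >= docId (binary search)
  private int search(int docId) {
    int lo = 0, hi = size;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (docs[mid] < docId) lo = mid + 1; else hi = mid;
    }
    return lo;
  }
}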

Thoughts?

Doug





RE: constant scoring queries

2005-05-11 Thread Robert Engels
Would there be any way to "rewrite" the cached queries as documents are
added?

By this I mean, if a user runs an "expensive" range query that gets cached,
then another user adds a document that should be included in the cached
query, the addDocument() method would "update" the cached query. I think
this is useful when using the "rolling index" method of performing
incremental updates to the index, so cached queries could remain valid after
a roll.

I think a callback interface similar to

updateQuery(Term t, BitSet newdocs);

would work. The query is then free to ignore the update if the term does not
fall inside the query. This check is obviously query specific.
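
A rough sketch of what I mean (the names are hypothetical; a cached range
query would only OR in the new bits when the term falls inside its bounds):

import java.util.BitSet;
import org.apache.lucene.index.Term;

// addDocument() would pass each indexed term plus the new doc ids to every
// live cache entry.
public interface CachedQueryUpdater {
  void updateQuery(Term t, BitSet newDocs);
}

class CachedRangeEntry implements CachedQueryUpdater {
  private final String field;
  private final String lower, upper;   // inclusive bounds, for illustration
  private final BitSet cachedBits;

  CachedRangeEntry(String field, String lower, String upper, BitSet cachedBits) {
    this.field = field;
    this.lower = lower;
    this.upper = upper;
    this.cachedBits = cachedBits;
  }

  public void updateQuery(Term t, BitSet newDocs) {
    if (!t.field().equals(field)) return;                        // wrong field: ignore
    if (t.text().compareTo(lower) < 0 || t.text().compareTo(upper) > 0) return;
    cachedBits.or(newDocs);                                       // term is in range
  }
}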


-Original Message-
From: Yonik Seeley [mailto:[EMAIL PROTECTED]
Sent: Tuesday, May 10, 2005 9:42 PM
To: java-dev@lucene.apache.org
Subject: Re: constant scoring queries


Hey now... you're going to obsolete all my in-house code and put me
out of a job ;-)

Could you elaborate on the advantage of having say a TermQuery that
could be either normal-scoring or constant-scoring vs two different
Query classes for doing this?  They seem roughly equivalent.


> 1. Add two methods to Query.java:
>
>public boolean constantScoring();
>public void constantScoring(boolean);
>
>When constantScoring(), the boost() is the score for matches.

That seems fine.

> 2. Add two methods to Searcher.java:
>
>public BitSet cachedBitSet(Query) { return null; }
>public void cacheBitSet(Query, BitSet) {}
>
>IndexSearcher overrides these to maintain an LRU cache of bitsets.

Yup, that's what I have.
Things should be extensible and use a caching interface - the default
implementation being an LRU cache, but users could use their own
implementations to get LFU behavior or whatever.

> 3. Modify BooleanQuery so that, when constantScoring(), TooManyClauses
> is not thrown.

This is good, but not sufficient for RangeQuery.  If
RangeQuery.constantScoring(), then it should not rewrite to a
BooleanQuery at all.  Depending on the RangeQuery, just the creation
of a BooleanQuery that matches it is too heavyweight.
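
For illustration, a constant-scoring RangeQuery could fill its bits straight
from the term dictionary, roughly like this sketch (no BooleanQuery and no
per-term clause objects, so no TooManyClauses):

import java.io.IOException;
import java.util.BitSet;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;
import org.apache.lucene.index.TermEnum;

public class RangeBits {
  // lower/upper are inclusive bounds, for illustration
  public static BitSet bits(IndexReader reader, String field,
                            String lower, String upper) throws IOException {
    BitSet bits = new BitSet(reader.maxDoc());
    TermEnum terms = reader.terms(new Term(field, lower));   // first term >= lower
    TermDocs termDocs = reader.termDocs();
    try {
      do {
        Term term = terms.term();
        if (term == null || !term.field().equals(field)) break;  // off the field
        if (term.text().compareTo(upper) > 0) break;             // past the range
        termDocs.seek(term);
        while (termDocs.next()) {
          bits.set(termDocs.doc());
        }
      } while (terms.next());
    } finally {
      termDocs.close();
      terms.close();
    }
    return bits;
  }
}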

> Also, instead of BitSet we could use an interface:
>
>public interface DocIdSet {
>  void add(int docId);
>  boolean contains(int docId);
>  int next(int docId);
>}
>
> to permit sparse representations.

Definitely a DocIdSet.  It's called DocSet in my code and has a bitset
implementation and a compact implementation that's an int hash set
(unordered cause I just use it as a filter now).  Here is the basic
interface:

public interface DocSet {
  public int size();
  public boolean exists(int docid);
  public DocIterator iterator();
  public BitSet getBits();
  public long memSize();
  public DocSet intersection(DocSet other);
  public int intersectionSize(DocSet other);
  public DocSet union(DocSet other);
  public int unionSize(DocSet other);
}


I would separate out int next(int docId) into an iterator.  It may be
more efficient to iterate over certain structures if you can maintain
state about where you are (and this may even be true of a BitSet).
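
Something like this sketch of the iterator (the names are just illustrative):

// A stateful cursor over a doc id set; next()/doc() mirror the TermDocs style
// and let an implementation remember its position instead of re-searching.
public interface DocIterator {
  boolean next();               // advance to the next doc id; false when exhausted
  int doc();                    // the current doc id
  boolean skipTo(int target);   // advance to the first doc id >= target
}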

-Yonik




optimized reopen?

2005-05-11 Thread Robert Engels
Is there any way to optimize the closing & reopening of an Index?

Since IndexReader.open() opens a MultiReader if there are multiple
segments, it seems a reopen() method could be implemented, which detects
which segments are the same as the current open index, and then passes those
SegmentReaders to a new MultiReader rather than creating new SegmentReaders
for the unmodified segments.

Does this sound feasible? Would it improve the performance?

Thanks


RE: optimized reopen?

2005-05-11 Thread Robert Engels
I was pretty sure reopen() would need to return a new IndexReader instance,
because otherwise the 'indices never change' paradigm would break, but it
would seem that I could share the existing SegmentReaders with the new
IndexReader.

Why do you think the SegmentReaders cannot be shared?

R

-Original Message-
From: Yonik Seeley [mailto:[EMAIL PROTECTED]
Sent: Wednesday, May 11, 2005 4:50 PM
To: java-dev@lucene.apache.org; [EMAIL PROTECTED]
Subject: Re: optimized reopen?


Things are cached using an IndexReader as the key, so you would have
to be careful not to break the current behaviour (that an
IndexReader's view of an index doesn't change - deletes from that
specific reader aside).

Maybe you could invoke reopen() on an existing IndexReader and it
would return a new IndexReader that shares sub-readers that haven't
changed?  But I don't think that sub-readers can be shared like this
right now... the difficulty may lie in deleted-docs.

-Yonik


On 5/11/05, Robert Engels <[EMAIL PROTECTED]> wrote:
> Is there any way to optimize the closing & reopening of an Index?
>
> Since the IndexReader.open() opens a MultiReader is there are multiple
> segments, it seems a reopen() method could be implemented, which detects
> which segments are the same as the current open index, and then passes
those
> SegementReaders to a new Multireader rather than creating new
SegmentReaders
> for the unmodified segments.
>
> Does this sounds feasible? Would it improve the performance?
>
> Thanks
>
>





RE: optimized reopen?

2005-05-11 Thread Robert Engels
I did an implementation of IndexReader.reopen() in my modified Lucene lib.

It is over 200% faster than closing and opening the index.

I just made the semantics of IndexReader.reopen() be that the original
IndexReader is now closed, and can no longer be used (since I cannot close()
the original, otherwise the SegmentReaders will close, so I close the unused
SegmentReaders directly). In order to make this work I also needed to make
commit() public, since the commit() needs to be performed prior to the
segment merging, and I needed to add a package method to MultiReader to get
the underlying subReaders.
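
The structure of reopen() is roughly the following outline (segmentNames(),
openSegmentReader() and newMultiReader() are placeholders for the
package-private SegmentInfos / SegmentReader / MultiReader machinery, so this
is only a sketch, not working code):

import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.Directory;

public class ReopenSketch {

  // currentByName maps segment name -> sub-reader of the currently open index
  public static IndexReader reopen(Directory dir, Map currentByName) throws IOException {
    String[] names = segmentNames(dir);                 // read the new segments file
    IndexReader[] subReaders = new IndexReader[names.length];
    Map reused = new HashMap();
    for (int i = 0; i < names.length; i++) {
      IndexReader existing = (IndexReader) currentByName.get(names[i]);
      subReaders[i] = (existing != null) ? existing     // unchanged segment: share it
                                         : openSegmentReader(dir, names[i]);
      reused.put(names[i], subReaders[i]);
    }
    for (Iterator it = currentByName.keySet().iterator(); it.hasNext();) {
      String name = (String) it.next();
      if (!reused.containsKey(name)) {                  // segment was merged away
        ((IndexReader) currentByName.get(name)).close();
      }
    }
    return newMultiReader(subReaders);
  }

  // --- placeholders for package-private Lucene machinery ---
  private static String[] segmentNames(Directory dir) throws IOException {
    throw new UnsupportedOperationException("placeholder");
  }
  private static IndexReader openSegmentReader(Directory dir, String name) throws IOException {
    throw new UnsupportedOperationException("placeholder");
  }
  private static IndexReader newMultiReader(IndexReader[] subs) {
    throw new UnsupportedOperationException("placeholder");
  }
}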

Although this may not be useful to many, it works well in our case, since we
use a single shared reader and a single exclusive writer in a server process
(with external locking), so we can guarantee that when the IndexReader is
"rolled", there are no references to the previous reader.

Robert




RE: One Byte is Seven bits too many? - A Design suggestion

2005-05-22 Thread Robert Engels
I have always thought that the norms should be an interface, rather than
fixed, as there are many uses of lucene where norms are not necessary, and
the memory overhead is substantial.

-Original Message-
From: Arvind Srinivasan [mailto:[EMAIL PROTECTED]
Sent: Sunday, May 22, 2005 7:05 PM
To: java-dev@lucene.apache.org
Subject: One Byte is Seven bits too many? - A Design suggestion


One Byte is Seven bits too many? - A Design suggestion

Hi,

The norm takes up 1 byte of storage per document per field.  While this may
seem very small, a simple calculation shows that the IndexSearcher can
consume lots of memory when it caches the norms. Further, the current
implementation loads up the norms in memory as soon as the segments get
loaded.  Here are the calculations:

For Medium sized archives
docs=40Million, Fields=2  =>  80MB memory
docs=40Million, Fields=10 => 400MB memory
docs=40Million, Fields=20 => 800MB memory

For larger sized archives

docs=400Million, Fields=2  =>  800MB memory
docs=400Million, Fields=10 =>  ~4GB memory
docs=400Million, Fields=20 =>  ~8GB memory


To further compound the issues, we have found JVM performance drops when the
memory
that it manages increases.

While the storage itself may not be a concern, the runtime memory requirement
can use some optimization, especially for a large number of fields.
The fields themselves may fall into one of 3 categories:

 (a) Tokenized fields with huge variance in number of tokens,
     example - HTML page, Mail Body etc.
 (b) Tokenized fields with very little variance in number of tokens,
     example - HTML Page Title, Mail Subject etc.
 (c) Fixed tokenized fields,
     example - Department, City, State etc.


The one byte usage is very applicable for (a) and not for (b) or (c).  In
typical usage, field increases can be attributed to (b) and (c).

Two solutions come to mind:

(1) If there is forethought in the field design, then one can prefix the
field tokens and thereby reduce the number of fields.  Of course, this will
add the overhead of an embarrassing explanation to every query writer of why
to add a prefix to every token.  If, however, this prefixing can be done
underneath, it may work, but it is still not elegant.

(2)  The norm for (c) has only two values. One is 0 when the field is not
present, and the other value is a fixed one.  In this scenario, the memory
requirement is only one bit per doc per field. I would argue that even for
(b) one can approximate the value with one bit and not lose much in the
ranking of documents.


Several implementation options are possible:

(a) We can implement the approximation at the time of writing the index
(backward compatibility has to be considered)
(b) Use a bitset instead of an array for search purposes.  I have been
wanting to do this for the last 6 months, but have not found time yet.  If I
do, I will submit an implementation.


Also, if a field is mandatory, then the 0 scenario never occurs, and in this
situation we can use a single constant to represent the array. Maybe one byte
is 8 bits too many :-))
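
For case (c), the in-memory representation could be as small as the following
sketch (one bit per document plus a single precomputed norm byte, e.g. the
value returned by Similarity.encodeNorm() for the fixed length):

import java.util.BitSet;

// One bit per document for a fixed-length field: the bit says "field present",
// and every present document shares the same precomputed norm byte.
public class OneBitNorms {
  private final BitSet present;
  private final byte fixedNorm;   // e.g. Similarity.encodeNorm(lengthNorm), computed once

  public OneBitNorms(int maxDoc, byte fixedNorm) {
    this.present = new BitSet(maxDoc);
    this.fixedNorm = fixedNorm;
  }

  public void setPresent(int doc) { present.set(doc); }

  public byte norm(int doc) { return present.get(doc) ? fixedNorm : 0; }
}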


Arvind.





RE: major searching performance improvement

2005-05-25 Thread Robert Engels
I will look at separating it out. I wanted to get initial feedback before
moving on.

1. I agree that the initialValue() is the way to go. I'll make the changes.

2. I agree that creating an NioFSDirectory is better than modifying
FSDirectory. I originally felt that memory mapped files would be the fastest,
but they also require OS calls; the "caching" code is CONSIDERABLY faster,
since it does not need to do any JNI or make OS calls.

3. I think a "simple" fix for the case you cite is to add an additional
'max size' parameter, which controls the maximum size of the cache for each
'segment file'; using the mergeFactor and compound files, you can easily
compute what this max would be based on available memory and expected index
size (number of files). The problem with a SoftCache and indices of that
size is that the JVM memory consumption would still grow to the limit
before it discarded anything (which may be ideal in some cases).

As for creating a CachingDirectory that can cache any Directory, that should
be feasible as well, but I am not sure it would perform as well as the
direct internal cache version.

Robert

-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Wednesday, May 25, 2005 4:20 PM
To: java-dev@lucene.apache.org
Subject: Re: major searching performance improvement


Robert Engels wrote:
> Attached are files that dramatically improve the searching performance
> (2x improvement on several hardware configurations!) in a multithreaded,
> high concurrency environment.

This looks like some good stuff!  Can you perhaps break it down into
independent, layered patches?  That way it would be easier to discuss
and integrate them.

> The change has 3 parts:
>
> 1) remove synchronization required in SegmentReader document. This
> required changes to FieldsReader to handle concurrent access.

This makes good sense.  Stylistically, I would prefer the cloning be
done in ThreadLocal.initialValue().  That way if another method ever
needs the input streams the cloning code need not be altered.
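
Something along these lines (a sketch only; it relies on Lucene's store
InputStream being cloneable, and the holder class itself is hypothetical):

import org.apache.lucene.store.InputStream;

// Per-thread clone of the shared fields stream, created lazily in
// initialValue() so any code that needs the stream gets a thread-private copy.
public class FieldsStreamHolder {
  private final InputStream fieldsStream;   // the shared stream owned by FieldsReader

  private final ThreadLocal localStream = new ThreadLocal() {
    protected Object initialValue() {
      return fieldsStream.clone();          // Lucene's store InputStream supports clone()
    }
  };

  public FieldsStreamHolder(InputStream fieldsStream) {
    this.fieldsStream = fieldsStream;
  }

  public InputStream stream() {
    return (InputStream) localStream.get();
  }
}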

> 2) change FSDirectory to use a 'nio' to improve concurrency. Changed to
> use NioFile. This class has some workaround because under Windows, the
> FileChannel is not fully reentrant, and so allocates multiple handles
> per physical file - this code can be removed under non-Windows
> systems. This also required changes to InputStream to allow for reading
> at a direct offset.

Could you please explore making this a new Directory class, extending
rather than replacing FSDirectory?  That would make it easier for folks
to evaluate.  Look at MMapDirectory for an example.

Also, did you compare the performance of this to MMapDirectory?  That
already uses nio, and should thus avoid the thread contention of
FSDirectory.  However it does not scale well on 32-bit machines whose
address space limits indexes to 4GB.

Finally, for Windows-specific code, you can check
org.apache.lucene.util.Constants.WINDOWS at runtime.

> 3) move disk buffering into the Java layer to avoid the overhead of OS
> calls. The buffer percentage can be configured to store the entire index
> in memory. Running with as little as a 10% cache, the performance is
> dramatically improved. Reading larger blocks also improves the
> performance in most cases, but can actually degrade performance if doing
> very small reads. Using the cache implies that you have configured the
> JVM to have as much heap space available as the percent of index size on
> the disk. The NioFile can be easily changed to use a "soft" cache to
> avoid the potential of OutOfMemoryExceptions.

It would be nice if this functionality could be layered on any
Directory.  Did you consider making a CachingDirectory that one can wrap
around an existing Directory implementation, that keeps an LRU cache of
data?  Even 10% by default will probably break a lot of applications.
At the Internet Archive I frequently search 100GB indexes on machines with
just 1GB of RAM.  So I am leery of enabling
this by default.
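
The cache inside such a CachingDirectory could be as simple as this sketch
(an access-ordered LinkedHashMap keyed by file name and block number; the
wrapper plumbing around the Directory API is omitted, and the key scheme is
just illustrative):

import java.util.LinkedHashMap;
import java.util.Map;

public class BlockCache {
  private final Map cache;

  public BlockCache(final int maxBlocks) {
    // access-ordered LinkedHashMap gives LRU eviction once maxBlocks is exceeded
    cache = new LinkedHashMap(16, 0.75f, true) {
      protected boolean removeEldestEntry(Map.Entry eldest) {
        return size() > maxBlocks;
      }
    };
  }

  public synchronized byte[] get(String file, long blockNumber) {
    return (byte[]) cache.get(file + "#" + blockNumber);
  }

  public synchronized void put(String file, long blockNumber, byte[] block) {
    cache.put(file + "#" + blockNumber, block);
  }
}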

Cheers,

Doug




RE: major searching performance improvement

2005-05-25 Thread Robert Engels
My understanding of memory mapped files is that the file is assigned
an address range in the virtual address space, and using the MMU/paging
facilities the file is mapped into that range. Java cannot work with memory
pointers directly, so there are at minimum some JNI calls made when
filling the ByteBuffer (vs. C, which can read the memory addresses
directly).  I believe the "direct" buffers in nio improve this somewhat, but
still not to the efficiency of C.

I think the only caching performed for memory mapped files is the same that
is performed for any file by the OS, i.e. when a page of the mapped file
needs to be brought into main memory, it may be available in the disk cache,
and will be retrieved from there rather than from disk.

I agree with the soft cache... what I planned to do was change the
MemoryLRUCache to have a maximum hard size, and a maximum soft size, and
when blocks are evicted from the hard cache, they are put into the soft
cache using soft references. The soft cache will have a maximum number of
elements so that the soft cache will not grow to the maximum heap size.
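
Roughly this shape (a simplified sketch of the two-level idea, not the actual
MemoryLRUCache; keys and values are left as plain Objects):

import java.lang.ref.SoftReference;
import java.util.LinkedHashMap;
import java.util.Map;

// Entries evicted from the hard LRU drop into a bounded soft-reference map,
// so the GC may reclaim them under pressure but they are not lost immediately.
public class TwoLevelCache {
  private final Map hard;   // strong references, access-ordered LRU
  private final Map soft;   // SoftReference values, access-ordered LRU

  public TwoLevelCache(final int maxHard, final int maxSoft) {
    soft = new LinkedHashMap(16, 0.75f, true) {
      protected boolean removeEldestEntry(Map.Entry eldest) { return size() > maxSoft; }
    };
    hard = new LinkedHashMap(16, 0.75f, true) {
      protected boolean removeEldestEntry(Map.Entry eldest) {
        if (size() > maxHard) {
          soft.put(eldest.getKey(), new SoftReference(eldest.getValue()));
          return true;    // evict from the hard level
        }
        return false;
      }
    };
  }

  public synchronized Object get(Object key) {
    Object value = hard.get(key);
    if (value != null) return value;
    SoftReference ref = (SoftReference) soft.remove(key);
    if (ref == null) return null;
    value = ref.get();
    if (value != null) hard.put(key, value);   // promote back to the hard level
    return value;
  }

  public synchronized void put(Object key, Object value) {
    hard.put(key, value);
  }
}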

Robert

-Original Message-
From: Yonik Seeley [mailto:[EMAIL PROTECTED]
Sent: Wednesday, May 25, 2005 9:13 PM
To: java-dev@lucene.apache.org; [EMAIL PROTECTED]
Subject: Re: major searching performance improvement


Looks like really great stuff Robert!

> 2. I agree that creating NioFSDirectory rather than modifying FSDirectory.
I
> originally felt the memory mapped files would be the fastest, but it also
> requires OS calls, the "caching" code is CONSIDERABLY faster, since it
does
> not need to do any JNI, or make OS calls.

What do you mean by OS calls required by memory mapped files?

I'm not 100% sure how mmap works in java, but in C the only OS calls
are typically to set up and tear down the mapping.  Reads from the
mmaped region that are already in memory proceed at the same speed as
reads from any other piece of memory.


> The problem with a SoftCache and indices of that
> size, is that the JVM memory consumption would still grow to the limit
> before it discarded anything (which may be ideal in some cases).

Soft caches aren't great, esp with apps that generate a lot of garbage.
What might be better is a multi-level LRU that spills over into soft
references at a certain point.

-Yonik




RE: Lucene vs. Ruby/Odeum

2005-06-01 Thread Robert Engels
I read his complete article... he still doesn't have a clue.

The opening and closing of the IndexReaders is just creating garbage which
is distorting the memory consumption, and ruining the performance - it is
akin to starting it from the command-line to perform each search. As you
state, the Java memory consumption will continue to grow until it hits the
Xmx limit before it even attempts to purge, and even then it may not release
the memory back to the OS. Even small one-time-use strings will eventually
show huge memory consumption until the runtime needs the memory.

My understanding from reviewing the Lucene code is that Lucene caches
VERY LITTLE information in memory - seemingly just the skip table for terms -
and relies on the OS's disk cache for performance (other than the new
enhancements I posted that move a cache into the Lucene directory). Maybe
somebody has more information here?

Robert Engels

-Original Message-
From: Daniel Naber [mailto:[EMAIL PROTECTED]
Sent: Wednesday, June 01, 2005 5:07 PM
To: java-dev@lucene.apache.org
Subject: Re: Lucene vs. Ruby/Odeum


On Tuesday 17 May 2005 04:41, Otis Gospodnetic wrote:

> http://www.zedshaw.com/projects/ruby_odeum/performance.html

Here's a follow up:
http://www.zedshaw.com/projects/ruby_odeum/odeum_lucene_part2.html

Now the claim is that Lucene is faster than Ruby/Odeum but it takes 36
times more memory. However, I cannot find any information on how exactly
Lucene was started. It's no surprise that Java requires much memory and
doesn't clean up if it never comes close to the limit set with -Xmx.

Regards
 Daniel

--
http://www.danielnaber.de




RE: Lucene vs. Ruby/Odeum

2005-06-01 Thread Robert Engels
One more thing, his statement that "why returning 20 documents would perform
any better than returning all of them" (paraphrased), shows complete
ignorance of proper Lucene usage.

R




RE: Lucene vs. Ruby/Odeum

2005-06-01 Thread Robert Engels
I think I am going to start a new Blog - "Zed's an Idiot".

-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Wednesday, June 01, 2005 6:39 PM
To: java-dev@lucene.apache.org
Subject: Re: Lucene vs. Ruby/Odeum



On Jun 1, 2005, at 6:07 PM, Daniel Naber wrote:

> On Tuesday 17 May 2005 04:41, Otis Gospodnetic wrote:
>
>
>> http://www.zedshaw.com/projects/ruby_odeum/performance.html
>>
>
> Here's a follow up:
> http://www.zedshaw.com/projects/ruby_odeum/odeum_lucene_part2.html
>
> Now the claim is that Lucene is faster than Ruby/Odeum but it takes 36
> times more memory. However, I cannot find any information on how  
> exactly
> Lucene was started. It's no surprise that Java requires much memory  
> and
> doesn't clean up if it never comes close to the limit set with -Xmx.

I went around several times in e-mail with Zed, the author of this  
comparison after his follow-up.  His paraphrasing of me in there is  
only partially sort of what I said to him.  He's instantiating an  
IndexSearcher inside a tight loop which I told him was a very bad  
thing to do with Lucene and that his loops are so tight that garbage  
collection isn't getting a chance to kick in.  He doesn't currently  
believe some of this from me, and also feels that adjusting the code  
to make Lucene happy is being unfair.

I wish the RubyLucene folks would hurry up and get a port over there  
so that we could compare against Ruby/Odeum "fairly" :)

 Erik





RE: Lucene vs. Ruby/Odeum

2005-06-01 Thread Robert Engels
Sorry if you thought my comment was destructive or counter-productive.

I read all of Zed's posts on the subject and I feel he certainly presents a
strong anti-Java, if not anti-Lucene bias - maybe just pro Ruby. The funny
thing is that I am not a Java zealot by any means, and I am a firm believer
in the "right tool for the job", but Zed's analysis is similar to testing
screwdrivers, and then determining that one hammers nails way better than
another.

If you do not even adhere to the principle designer's "guidelines to proper
usage", your tests are meaningless. It's akin to using a new flat screen
monitor and claiming "boy, it has a fuzzy picture", because you didn't
follow the instructions that said "remove protective film before using".

I just get frustrated when people use their "advanced methods" to prove
their point (even though the statistics are very basic), but avoid the use
of common sense. The adage "garbage in, garbage out" will always hold true.

Zed is using a very constrained test - which is probably very UNCOMMON in
the real world of server based systems, to attempt to discern the relative
performance characteristics of Lucene/Java/Ruby/etc. The tests may be
applicable in his poorly designed environment, but he presents his limited
finding as "gospel", and that it should hold true in all cases. I quote...
"For the people who have no clue (also known as "Executives") here's the
information you need to tell all your employees they need to adopt the
latest and greatest thing without ever having to understand anything you
read. Cheaper than an article in CIO magazine and even has big words like
"standard deviation"." and then goes on to present his "statistically
correct" performance numbers.

As an aside, in my performance testing of Lucene using JProfiler, it seems
to me that the only ways to greatly improve Lucene's performance come
from 2 areas:

1. optimizing the JVM array/looping/JIT constructs/capabilities to avoid
bounds checking/improve performance
2. improve function call overhead

Other than that, other changes will require a significant change in the code
structure (manually unrolling loops), at the sacrifice of
readability/maintainability.

R


-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Wednesday, June 01, 2005 8:46 PM
To: java-dev@lucene.apache.org
Subject: Re: Lucene vs. Ruby/Odeum


Robert - Please tone it down.  Zed is aware of this thread and
perhaps even seeing this message.  There is no need to resort to such
verbiage - Zed and I have been communicating and he is a fan of
Lucene and has proven in his last entry that Lucene is faster than
Ruby/Odeum even with the massive memory issue he notes (and has been
properly informed of what he's doing incorrectly in that situation).

Speaking for myself - I want the most accurate, flexible, and fastest
search system possible regardless of platform or language.  Certainly
I want it to be Lucene, but I welcome competition and those that go
to the extensive effort of collecting data and making studies such as
Zed has.  The Lucene community can help keep this type of competition
healthy and positive by educating folks in proper Lucene usage and
responding in kind regardless of the mistakes, attitudes, or flame-
bait we may encounter.

 Erik

On Jun 1, 2005, at 7:48 PM, Robert Engels wrote:

> I think I am going to start a new Blog - "Zed's an Idiot".
>
> -Original Message-
> From: Erik Hatcher [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, June 01, 2005 6:39 PM
> To: java-dev@lucene.apache.org
> Subject: Re: Lucene vs. Ruby/Odeum
>
>
>
> On Jun 1, 2005, at 6:07 PM, Daniel Naber wrote:
>
>
>> On Tuesday 17 May 2005 04:41, Otis Gospodnetic wrote:
>>
>>
>>
>>> http://www.zedshaw.com/projects/ruby_odeum/performance.html
>>>
>>>
>>
>> Here's a follow up:
>> http://www.zedshaw.com/projects/ruby_odeum/odeum_lucene_part2.html
>>
>> Now the claim is that Lucene is faster than Ruby/Odeum but it
>> takes 36
>> times more memory. However, I cannot find any information on how
>> exactly
>> Lucene was started. It's no surprise that Java requires much memory
>> and
>> doesn't clean up if it never comes close to the limit set with -Xmx.
>>
>
> I went around several times in e-mail with Zed, the author of this
> comparison after his follow-up.  His paraphrasing of me in there is
> only partially sort of what I said to him.  He's instantiating an
> IndexSearcher inside a tight loop which I told him was a very bad
> thing to do with Lucene and that his loops are so tight that garbage
> collection isn't getting a chance to kick in.  He d

RE: Lucene vs. Ruby/Odeum

2005-06-02 Thread Robert Engels
There are still very SIGNIFICANT problems with his tests.

1. The environment is not "real", except for possibly desktop searching.
Whether the JVM needs 64m or 128m to perform adequately is immaterial, given
the price of RAM and the ease of expansion. It would be akin to saying
"let's all try to write programs that work with 64k of memory", why
needlessly constrain yourself? I would take performance, readability,
maintainability, and reliability over memory consumption any day. If the
point of the test is to show that a Java based searching system needs more
memory than a script language on top of a C db library, who cares? The
smallest possible Java program I can write, shows a VM size of 8.5mb under
Windows XP - which is larger than the whole of Odeum/Ruby. There is a lot
that the Java system provides that isn't useful in this particular use case,
but I run Lucene in a multithreaded, server environment, with a test index
of 350mb, and I can run it in as little as 19mb - but why would I want to?

2. The search is always for the same word. The Odeum database based version
will almost certainly cache all the required data and index blocks in memory
after the first run, avoiding all calls to the OS. Since Lucene performs no
local caching (without my mods), it will ALWAYS require trips to the OS.
Also, each run of Lucene is going to generate garbage without question. A
properly designed non-Java db will almost certainly generate no increased
memory usage in the constrained case. Running the tests using multiple
threads on random words would be far more interesting.

Lastly, for what it's worth (and that's probably not much!) - if Odeum were
the "better search engine", you could do a Java -> Odeum mapping and I
GUARANTEE the Java implementation using the latest JIT JVM compilers would
be faster than the Ruby one. Also, where's my cross platform GUI for
displaying the search results?

The best developers attempt to use the right tool for the job. Ruby is a
GREAT scripting language, and is perfect for all the things scripting
languages are good for. Let's leave it at that.

-Original Message-
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Thursday, June 02, 2005 5:09 AM
To: java-dev@lucene.apache.org
Subject: Re: Lucene vs. Ruby/Odeum


Zed has updated his second part with more experiments with different
JVM's and memory settings:

 http://www.zedshaw.com/projects/ruby_odeum/odeum_lucene_part2.html

On Jun 2, 2005, at 12:27 AM, Robert Engels wrote:
> I read all of Zed's posts on the subject and I feel he certainly
> presents a
> strong anti-Java

Most definitely an anti-Java leaning - but at least he's working on
being objective about it by measuring things :)

> , if not anti-Lucene bias - maybe just pro Ruby.

He's quite pro-Lucene, and most definitely pro-Ruby.  I consider
myself in those categories myself.

> If you do not even adhere to the principle designer's "guidelines
> to proper
> usage", your tests are meaningless. It's akin to using a new flat
> screen
> monitor and claiming "boy, it has a fuzzy picture", because you didn't
> follow the instructions that said "remove protective film before
> using".

I concur with your sentiment and I've done what I can via e-mail with
him to educate him on my experience with Lucene and JVM garbage
collection.  I'd encourage anyone who has the the time and
inclination to take him up on the request to show how to do it better
since he's made his code available.

> Zed is using a very constrained test - which is probably very
> UNCOMMON in
> the real world of server based systems, to attempt to discern the
> relative
> performance characteristics of Lucene/Java/Ruby/etc. The tests may be
> applicable in his poorly designed environment, but he presents his
> limited
> finding as "gospel", and that it should hold true in all cases. I
> quote...
> "For the people who have no clue (also known as "Executives")
> here's the
> information you need to tell all your employees they need to adopt the
> latest and greatest thing without ever having to understand
> anything you
> read. Cheaper than an article in CIO magazine and even has big
> words like
> "standard deviation"." and then goes on to present his "statistically
> correct" performance numbers.

Don't get me wrong - Zed is using inflammatory language.  We should
work to not lower ourselves to speaking in that same tone but rather
objectively and nicely point out the errors of his ways.  He's open
to that despite his caustic tone - at least from the e-mail exchanges
I've had with him.

 Erik





RE: Lucene vs. Ruby/Odeum

2005-06-02 Thread Robert Engels
One more thing: I did some simple tests with my caching enhancements. Using
a similar test (performing the search for the same word over and over),
there was a 100% performance improvement, so I would expect Lucene to
blow the doors off Odeum in this case.

This is why 'test cases' are so easy to manipulate. I am sure there are
parameters for Odeum that allow you to increase its index & data block
cache sizes, but the minimum/defaults may be enough to hold all of the data
necessary for the test. As the test coverage gets wider, allocating more
buffer space will usually compensate, and give similar performance numbers.





RE: Search deadlocking under load

2005-07-13 Thread Robert Engels
I had posted an NioFile and caching system that greatly increases the
parallelism of Lucene. On some platforms (Windows), though, the low-level
NIO channel is not completely thread-safe so it can still block, although the
code has some work-arounds for this problem.

You can never achieve "100% parallel", as a thread will block doing disk io
at some point in the driver (unless everything is in the disk cache), but
even without this case, unless you have the same number of processors as
threads, there will always be a "blocking/waiting".

If the time to perform a search is greater than the time needed for a
certain # of requests per second, you will always generate more threads,
unless you limit the number of threads accepting requests in the first
place.
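
The core of the NioFile idea is a positional read, sketched below
(FileChannel.read(ByteBuffer, position) carries its own offset, so there is
no shared file pointer to seek and synchronize on; as noted above, Windows
may still need the multiple-handle workaround):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

// Each caller reads at an absolute offset; no shared seek, no synchronized block.
public class PositionalFile {
  private final FileChannel channel;

  public PositionalFile(String path) throws IOException {
    channel = new RandomAccessFile(path, "r").getChannel();
  }

  public int read(byte[] b, int offset, int len, long filePosition) throws IOException {
    ByteBuffer buf = ByteBuffer.wrap(b, offset, len);
    int total = 0;
    while (buf.hasRemaining()) {
      int n = channel.read(buf, filePosition + total);   // positional read
      if (n == -1) throw new IOException("read past EOF");
      total += n;
    }
    return total;
  }
}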

R



-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 13, 2005 5:53 PM
To: java-dev@lucene.apache.org
Cc: Nathan Brackett
Subject: RE: Search deadlocking under load


This may be better for [EMAIL PROTECTED]

I've looked at the source of that method, but I don't see a way of
removing that synchronized block.  Maybe somebody else has ideas, but
it looks like the synchronization is there to ensure the file being
read is read in fully, without some other thread modifying it "under
the reader's feet".

Otis

--- Nathan Brackett <[EMAIL PROTECTED]> wrote:

> Otis,
>
> After further testing it turns out that the 'deadlock' we're
> encountering is
> not a deadlock at all, but a result of resin hitting its maximum
> number of
> allowed threads.  We bumped up the max-threads in the config and it
> fixed
> the problem for a certain amount of load, but we'd much prefer to go
> after
> the source of the problem, namely:
>
> As the number of threads hitting lucene increases, contention for
> locks
> increases, meaning the average response time increases. This places
> us in a
> downward spiral of performance because as the incoming number of hits
> per
> second stays constant, the response time increases, meaning that the
> total
> number of threads inside resin doing work will increase.  This
> problem
> compounds itself, escalating the number of threads in resin until we
> crash.
>
>
> Admittedly this is a pretty harsh test (~~20 hits per second
> triggering
> complex searches, which starts fine but then escalates to > 150
> threads as
> processing slows down but number of incoming hits per second does
> not)
>
> Our ultimate goal, however, is to have each search be completely and
> 100%
> parallel.
>
> The point of contention seems to be the method below:
>
> FSDirectory.java:486 (class FSInputStream)
>
>
>
>   protected final void readInternal(byte[] b, int offset, int len)
>   throws IOException {
>   synchronized (file) {
>   long position = getFilePointer();
>   if (position != file.position) {
>   file.seek(position);
>   file.position = position;
>   }
>   int total = 0;
>   do {
>   int i = file.read(b, offset+total, len-total);
>   if (i == -1)
>   throw new IOException("read past EOF");
>   file.position += i;
>   total += i;
>   } while (total < len);
>   }
>   }
>
>
>
>
> The threads are usually all lined up to reach this.  Why are so many
> threads
> backed up behind the same instance of FSInputStream.readInternal?
> Shouldn't
> each search have a different input stream?  What would you suggest as
> the
> best path to achieve 100% parallel searching?  Here's a sample of our
> thread
> dump, you can see 2 threads waiting for the same
> FSInputStream$Descriptor
> (which is the synchronized(file) above):
>
> "tcpConnection-8080-11" daemon prio=5 tid=0x08304600 nid=0x8304800
> waiting
> for monitor entry [bf494000..bf494d08]
> at
>
org.apache.lucene.store.FSInputStream.readInternal(FSDirectory.java:412)
> - waiting to lock <0x2f2b7a38> (a
> org.apache.lucene.store.FSInputStream$Descriptor)
> at
> org.apache.lucene.store.InputStream.refill(InputStream.java:158)
> at
> org.apache.lucene.store.InputStream.readByte(InputStream.java:43)
> at
> org.apache.lucene.store.InputStream.readVInt(InputStream.java:83)
> at
>
org.apache.lucene.index.SegmentTermDocs.read(SegmentTermDocs.java:126)
> at
> org.apache.lucene.search.TermScorer.next(TermScorer.java:55)
> at
> org.apache.lucene.search.BooleanScorer.next(BooleanScorer.java:112)
> at org.apache.lucene.search.Scorer.score(Scorer.java:37)
> at
> org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:92)
> at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:64)
> at org.apache.lucene.search.Hits.(Hits.java:43)
> at org.apache.lucene.search.Searcher.search(Searcher.java:33)
> at org.apache.lucene.search.Searcher

RE: lucene API

2005-08-17 Thread Robert Engels
I think you should leave the API as is, and write a custom XML writer for
Lucene search results. The request is trivial since you can simply pass the
single query string. I would not write wrapper beans just to use the built-in
serialization support.

The custom XML writer will be MUCH, MUCH faster, as you do not need to
create an XML document first. The XML needed to support a Lucene search
result is quite trivial.
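
A sketch of the kind of writer I mean (it assumes the stored field names are
valid XML element names and only emits the score plus a few stored fields):

import java.io.IOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.search.Hits;

public class HitsXmlWriter {
  public static String toXml(Hits hits, String[] fields, int max) throws IOException {
    StringBuffer xml = new StringBuffer("<results total=\"" + hits.length() + "\">");
    int n = Math.min(max, hits.length());
    for (int i = 0; i < n; i++) {
      Document doc = hits.doc(i);
      xml.append("<hit score=\"").append(hits.score(i)).append("\">");
      for (int f = 0; f < fields.length; f++) {
        String v = doc.get(fields[f]);
        if (v != null) {
          xml.append('<').append(fields[f]).append('>')
             .append(escape(v))
             .append("</").append(fields[f]).append('>');
        }
      }
      xml.append("</hit>");
    }
    xml.append("</results>");
    return xml.toString();
  }

  // minimal XML text escaping
  private static String escape(String s) {
    StringBuffer sb = new StringBuffer(s.length());
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (c == '<') sb.append("&lt;");
      else if (c == '>') sb.append("&gt;");
      else if (c == '&') sb.append("&amp;");
      else sb.append(c);
    }
    return sb.toString();
  }
}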

R

-Original Message-
From: Maros Ivanco [mailto:[EMAIL PROTECTED]
Sent: Wednesday, August 17, 2005 8:27 AM
To: java-dev@lucene.apache.org
Cc: Marek Baumerth
Subject: lucene API




Hi there,

   I am creating a search solution based on Lucene. A part of the solution
   is a Lucene web service. Even though the Lucene API is very straightforward
   to use on a local machine, I found creation of the Lucene WS to be extremely
   difficult. The problem is caused by the API, which very often does not obey
   even trivial coding conventions (getter and setter names, for instance).
   As a result, the jax-rpc subsystem is unable to produce correct
   serializers and deserializers. From my point of view, there are several
   possibilities to solve the problem:

 1., write envelopes for nonserializable objects (Document, Field, Term,
 ...)
 2., write custom serializers and deserializers
 3., change lucene API

1. requires creation of many envelopes, since most of the Lucene classes do
not obey JavaBean semantics; thus, they are not serializable by jax-rpc.
Wrapping and unwrapping of Lucene objects takes extra processing. Moreover,
the un/wrapping is sometimes not possible, because some objects (Filters,
for instance) do not expose their full state. I am doing this right now
(and doing, and doing,... I cannot see the end :-)).

2. requires creation of a de/serializer for every nonserializable class. Also
requires extra configuration + generation of factories. Twice as difficult
as solution 1. Moreover, this solution renders the resulting WS nonportable.

3. requires changing the Lucene API in a way that will allow either direct
de/serialization or, at least, solution 1. Since Lucene is open
source, everybody can make changes, but the real question is whether the
changes will become part of the official distribution. If not, the overall
search solution will remain stuck with the current release.

I would personally prefer the third one, but only if the changes will make
it into the official release. Our company plans to deploy several instances
of the solution. There is a certain probability that my employer will
contribute some resources (my time) to the project. The question is
whether the current development community is willing to accept this
kind of change, and if I can participate. So, is it?

Maros.





RE: lucene API

2005-08-18 Thread Robert Engels
Sorry, but I don't think you have a clear understanding of web services.

In order to design a really useful web service it is almost always best to
create a web service API that is drastically different (usually much
simpler) from the underlying system's API.

Think of the web service interface to most financial systems. The "records"
they return are usually a small subset of the information maintained by the
system - if they tried to expose the underlying system directly it would
take years for a single client to be developed.

Most webservices must be consumed by systems that have no concept of
objects, inheritance, etc. so attempting to expose a complex system
interface using automated tools and have it work with all clients is
futile - it may be possible, but it certainly wouldn't be very useful.

For an example, take a look at the google web api. There are no methods to
perform inserts, not that they couldn't allow it, they just decided not to.

In the same vein, there is no reason you cannot develop simpler, flat
objects and interfaces for Lucene (a web doc, a web search request, a web
insert request).

Lucene is a library api. Most web service APIs use a far simpler "consumer"
interface.

Robert Engels

-Original Message-
From: Maros Ivanco [mailto:[EMAIL PROTECTED]
Sent: Thursday, August 18, 2005 11:40 AM
To: java-dev@lucene.apache.org
Subject: RE: lucene API



-"Robert Engels" <[EMAIL PROTECTED]> wrote: -

>To: 
>From: "Robert Engels" <[EMAIL PROTECTED]>
>Date: 17.08.2005 16:20
>Subject: RE: lucene API
>
>I think you should leave the API as is, and write a custom XML writer
>for
>lucene search results. The request is trivial since you can simple
>pass the
>single string.

Well, actually that is not true. You assume that all I need to do is the
search. But I also need to expose the index reading and index writing APIs. I
need the reading API to remove expired documents, do index management, and so
forth. I need the writing API to write documents from a robot which resides
on a different machine. Anyway, even exposing only the search API is not
trivial in the case where you want to search in multiple languages (you have
to use custom analyzers), specify ordering, or filter. Even if you did not,
there is always the Hits object, which is (by its nature) nonserializable.

>I would not write wrapper beans just to use the
>built-in
>serialization support.
>
>The custom XML writer will be MUCH, MUCH faster, as you do not need
>to create an XML document first.

Perhaps faster to run (I must admit you have a point here), but definitely
not faster to develop. Usage of custom de/serializers is also discouraged by
the jax-rpc spec (which is not an issue), and is not well documented.
Anyway, it is impossible to write de/serializers for certain Lucene classes.
For instance, you cannot write de/serializers for
org.apache.lucene.search.Filter and its subclasses,
because the inner state of their instances is not accessible. Other examples
of nonserializable classes include Hits, Sort, SortField, ScoreDoc, TopDocs,
...

So, to summarize, I do not see the Lucene API as ready for integration with
other systems, even though it may be very effective when used standalone.

Maros.

>The XML needed to support a Lucene
>search
>result is quite trivial.
>
>R
>
>-Original Message-
>From: Maros Ivanco [mailto:[EMAIL PROTECTED]
>Sent: Wednesday, August 17, 2005 8:27 AM
>To: java-dev@lucene.apache.org
>Cc: Marek Baumerth
>Subject: lucene API
>
>
>
>
>Hi there,
>
>   I am creating a search solution based on Lucene. A part of the
>solution
>   is Lucene web service. Even though the Lucene API is very
>straitforward
>   to use on a local machine, I found creation of Lucene WS to be
>extremely
>   difficult. The problem causes the API, which very often does not
>obey
>   even trivial coding conventions (getter and setter names, for
>instance).
>   As a result, the jax-rpc subsystem is unable to produce correct
>   serializers and deserializers. From my point of view, there are
>serveral
>   possibilities to solve the problem:
>
> 1., write envelopes for nonserializable objects (Document, Field,
>Term,
> ...)
> 2., write custom serializers and deserializers
> 3., change lucene API
>
>1. requires creation of many envelopes, since most of the Lucene
>classes do
>not obey JavaBean semantics thus, they are not serializable by
>jax-rpc.
>Wrapping and uwrapping of Lucene objects takes extra processing.
>Moreover,
>the un/wrapping is sometimes not possible, because some objects
>(Filters,
>for instance) do not exhibit their full state. I am doing this right
>now
>(and doing, and doing,... I cannot see the end :-)).
>
>2. requires creation of de/serializer for every nonserializable
>c

RE: Lucene does NOT use UTF-8.

2005-08-28 Thread Robert Engels
Sorry, but I think you are barking up the wrong tree... and your tone is
quite bizarre. My personal OPINION is that your "script" language is an
abomination, and anyone that develops in it is clearly hurting the
advancement of all software - but that is another story, and doesn't matter
much to the discussion - in a similar fashion your choice of words is
clearly not going to help matters.

Just because Lucene uses a proprietary encoding that is efficient for Java,
does not make it non-portable. It is certainly not "Java only" by any
stretch - all you need to know is that a Java "character" is always 2 bytes.
It may be less efficient to decode in other languages, but I don't think the
original Lucene designers were too worried about the efficiencies of other
languages/platforms. The API documentation may have been inaccurate, but
there has been no attempt to "hide" the process - it is still completely
"open".

All that being said, it is trivial to make the VInt the number of bytes, and
use the built-in UTF-8 encoders/decoders available in Java.

Using String.getBytes("UTF-8") and the new String(byte[], "UTF-8")
constructor is all that is needed.
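
Sketched against the existing stream classes (assuming the current
writeVInt()/writeBytes()/readVInt()/readBytes() helpers), it would look
roughly like:

// Writing: the length is the number of UTF-8 *bytes*, followed by the raw bytes.
public void writeString(String s) throws IOException {
  byte[] bytes = s.getBytes("UTF-8");
  writeVInt(bytes.length);
  writeBytes(bytes, bytes.length);
}

// Reading: read that many bytes, decode in one step.
public String readString() throws IOException {
  int length = readVInt();
  byte[] bytes = new byte[length];
  readBytes(bytes, 0, length);
  return new String(bytes, "UTF-8");
}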

R

-Original Message-
From: Marvin Humphrey [mailto:[EMAIL PROTECTED]
Sent: Sunday, August 28, 2005 8:57 PM
To: java-dev@lucene.apache.org
Subject: Re: Lucene does NOT use UTF-8.


Ken Krugler sent a reply to the user list.  In an effort to keep all
the developers informed, I'm sending my reply to the developer list
and including his entire original post below my sig.

Ken writes...

 > Since a null in the
 > middle of a string is rare, as is a character outside of the BMP, a
 > quick scan of the text should be sufficient to determine if it can be
 > written as-is.

Let's see.  I think we are looking at two scans, (one index(), one
regex), or a regex that uses alternation.  I strongly suspect two scans
are faster.

  if (   (index($string, "\xC0\x80") != -1)
      or ($string =~ /[\xF0-\xF7]/)   # only exists in 4-byte UTF-8
  ) {
      # Process string...
  }

That would tell us whether the string needed to be specially encoded for
Java's sake on output.  Yes, I suspect that's considerably more
efficient than always converting first to UTF-16 and then to "Modified
UTF-8".

It's also completely unnecessary, as you'll see from the patch below,
so I'm going to press ahead and make these XS ports of InputStream and
OutputStream work with legal UTF-8.

It would actually make a lot more sense for Plucene if the integer at
the head of a string measured *bytes* instead of either Unicode code
points or Java chars.  Then it's just a straight up copy!  No scanning
OR decoding required.

(Hmm... I wonder if there's a way to make Lucene work quickly if the
VInt were redefined to be "length in bytes"...)

Speaking of which, the Lucene file formats document also says this...

 "Lucene writes strings as a VInt representing the length,
followed by
 the character data."

The ambiguity of the word "length" in this sentence left me scratching
my head.  Length in bytes or length in UTF-8 characters?  Of course
the real answer is... neither. :\

It's length in Java chars, or, if you prefer to further Sun's
disinformation campaign, ;) "Modified UTF-8 characters".  If the Lucene
docs had stated "Java chars" explicitly, I would have had a better idea
about why the value of that VInt is what it is -- a Java-specific
quirk at odds with a widely-accepted standard -- and about what
it was going to take to adhere to the spec.

 > I'd need to look at the code more, but using something other than the
 > Java serialized format would probably incur a performance penalty for
 > the Java implementation. Or at least make it harder to handle the
 > strings using the standard Java serialization support.

I believe that the following true-UTF-8 replacement for the
readChars function is at least as fast as the current implementation,
unless your text contains characters outside the BMP.  It's incomplete,
because my Java expertise is quite limited, but it should be
conceptually sound.  The algo is adapted from C code supplied by the
Unicode consortium.

http://www.unicode.org/Public/PROGRAMS/CVTUTF/ConvertUTF.c

   static final byte[] TRAILING_BYTES_FOR_UTF8 = {
   0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
   0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
   0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
   0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
   0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
   0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
   1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,
   2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3,4,4,4,4,5,5,5,5
   };

   public final void readChars(char[] buffer, int start, int length)
throws IOException {
 int end = start + length; // No longer a final int

RE: Lucene does NOT use UTF-8.

2005-08-29 Thread Robert Engels
I think the VInt should be the numbers of bytes to be stored using the UTF-8
encoding.

It is trivial to use the String methods identified before to do the
conversion. The String(char[]) allocates a new char array.

For performance, you can use the actual CharSet encoding classes - avoiding
all of the lookups performed by the String class.
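
A rough sketch of that approach (an illustration only, not the actual patch;
the buffer size is arbitrary and a real version would grow the buffer on
overflow instead of throwing):

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;

// Sketch: encode a String to UTF-8 into a reusable ByteBuffer via
// CharsetEncoder, avoiding the per-call lookups done by String.getBytes().
public class ReusableUtf8Encoder {
    private final CharsetEncoder encoder = Charset.forName("UTF-8").newEncoder();
    private final ByteBuffer bytes = ByteBuffer.allocate(16 * 1024); // reused

    public ByteBuffer encode(String s) {
        encoder.reset();
        bytes.clear();
        CoderResult result = encoder.encode(CharBuffer.wrap(s), bytes, true);
        if (result.isOverflow())
            throw new IllegalStateException("buffer too small for " + s.length() + " chars");
        encoder.flush(bytes);
        bytes.flip();        // position = 0, limit = encoded byte count
        return bytes;        // bytes.remaining() is the UTF-8 length to write
    }
}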

-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Monday, August 29, 2005 4:24 PM
To: java-dev@lucene.apache.org
Subject: Re: Lucene does NOT use UTF-8.


Ken Krugler wrote:
> The remaining issue is dealing with old-format indexes.

I think that revving the version number on the segments file would be a
good start.  This file must be read before any others.  Its current
version is -1 and would become -2.  (All positive values are version 0,
for back-compatibility.)  Implementations can be modified to pass the
version around if they wish to be back-compatible, or they can simply
throw exceptions for old format indexes.

I would argue that the length written be the number of characters in the
string, rather than the number of bytes written, since that can minimize
string memory allocations.

> I'm going to take this off-list now [ ... ]

Please don't.  It's better to have a record of the discussion.

Doug

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Lucene does NOT use UTF-8.

2005-08-30 Thread Robert Engels
That method should easily be changed to

public final String readString() throws IOException {
    int length = readVInt();
    return new String(readBytes(length), "UTF-8");
}

readBytes() could reuse the same array if it was large enough. Then only the
single char[] is created in the String code.

-Original Message-
From: Yonik Seeley [mailto:[EMAIL PROTECTED]
Sent: Tuesday, August 30, 2005 11:28 AM
To: java-dev@lucene.apache.org
Subject: Re: Lucene does NOT use UTF-8.


> How will the difference impact String memory allocations? Looking at the
> String code, I can't see where it would make an impact.


This is from Lucene InputStream:
public final String readString() throws IOException {
    int length = readVInt();
    if (chars == null || length > chars.length)
        chars = new char[length];
    readChars(chars, 0, length);
    return new String(chars, 0, length);
}

If you know the length in bytes, you still have to allocate that many chars
(even though the number of chars may be less than the number of bytes). Not
a big deal IMHO.

A bigger pain is on the writing side, where you can't stream things because
you don't know what the length is going to be (in either bytes *or* UTF-8
chars).

So it turns out that Java's 16 bit chars were just a waste... it's still a
multibyte format *and* it takes up more space. UTF-8 would have been nice -
no conversions necessary.


-Yonik Now hiring -- http://tinyurl.com/7m67g


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Lucene does NOT use UTF-8.

2005-08-30 Thread Robert Engels
I think you guys are WAY overcomplicating things, or you just don't know
enough about the Java class libraries.

If you use the java.nio.charset.CharsetEncoder class, then you can reuse the
byte[] array, and then it is a simple write of the length, and a blast copy
of the required number of bytes to the OutputStream (which will either fit
or expand its byte[]). You can perform all of this WITHOUT creating new
byte[] or char[] (as long as the existing one is large enough to fit the
encoded/decoded data).

There is no need to use any sort of file position mark/reset stuff.

R




-Original Message-
From: Ken Krugler [mailto:[EMAIL PROTECTED]
Sent: Tuesday, August 30, 2005 11:54 AM
To: java-dev@lucene.apache.org
Subject: RE: Lucene does NOT use UTF-8.


>I think the VInt should be the numbers of bytes to be stored using the
UTF-8
>encoding.
>
>It is trivial to use the String methods identified before to do the
>conversion. The String(char[]) allocates a new char array.
>
>For performance, you can use the actual CharSet encoding classes - avoiding
>all of the lookups performed by the String class.

Regardless of what underlying support is used, if you want to write
out the VInt value as UTF-8 bytes versus Java chars, the Java String
has to either be converted to UTF-8 in memory first, or pre-scanned.
The first is a memory hit, and the second is a performance hit. I
don't know the extent of either, but it's there.

Note that since the VInt is a variable size, you can't write out the
bytes first and then fill in the correct value later.

-- Ken


>-Original Message-
>From: Doug Cutting [mailto:[EMAIL PROTECTED]
>Sent: Monday, August 29, 2005 4:24 PM
>To: java-dev@lucene.apache.org
>Subject: Re: Lucene does NOT use UTF-8.
>
>
>Ken Krugler wrote:
>>  The remaining issue is dealing with old-format indexes.
>
>I think that revving the version number on the segments file would be a
>good start.  This file must be read before any others.  Its current
>version is -1 and would become -2.  (All positive values are version 0,
>for back-compatibility.)  Implementations can be modified to pass the
>version around if they wish to be back-compatible, or they can simply
>throw exceptions for old format indexes.
>
>I would argue that the length written be the number of characters in the
>string, rather than the number of bytes written, since that can minimize
>string memory allocations.
>
>>  I'm going to take this off-list now [ ... ]
>
>Please don't.  It's better to have a record of the discussion.
>
>Doug


--
Ken Krugler
TransPac Software, Inc.

+1 530-470-9200

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Lucene does NOT use UTF-8.

2005-08-30 Thread Robert Engels
A bit more clarity...

Using CharBuffer and ByteBuffer allows for easy reuse and expansion. You
also need to use the CharSetDecoder class as well.

-Original Message-
From: Robert Engels [mailto:[EMAIL PROTECTED]
Sent: Tuesday, August 30, 2005 12:40 PM
To: java-dev@lucene.apache.org
Subject: RE: Lucene does NOT use UTF-8.


I think you guys are WAY overcomplicating things, or you just don't know
enough about the Java class libraries.

If you use the java.nio.charset.CharsetEncoder class, then you can reuse the
byte[] array, and then it is a simple write of the length, and a blast copy
of the required number of bytes to the OutputStream (which will either fit
or expand its byte[]). You can perform all of this WITHOUT creating new
byte[] or char[] (as long as the existing one is large enough to fit the
encoded/decoded data).

There is no need to use any sort of file position mark/reset stuff.

R




-Original Message-
From: Ken Krugler [mailto:[EMAIL PROTECTED]
Sent: Tuesday, August 30, 2005 11:54 AM
To: java-dev@lucene.apache.org
Subject: RE: Lucene does NOT use UTF-8.


>I think the VInt should be the numbers of bytes to be stored using the
UTF-8
>encoding.
>
>It is trivial to use the String methods identified before to do the
>conversion. The String(char[]) allocates a new char array.
>
>For performance, you can use the actual CharSet encoding classes - avoiding
>all of the lookups performed by the String class.

Regardless of what underlying support is used, if you want to write
out the VInt value as UTF-8 bytes versus Java chars, the Java String
has to either be converted to UTF-8 in memory first, or pre-scanned.
The first is a memory hit, and the second is a performance hit. I
don't know the extent of either, but it's there.

Note that since the VInt is a variable size, you can't write out the
bytes first and then fill in the correct value later.

-- Ken


>-Original Message-
>From: Doug Cutting [mailto:[EMAIL PROTECTED]
>Sent: Monday, August 29, 2005 4:24 PM
>To: java-dev@lucene.apache.org
>Subject: Re: Lucene does NOT use UTF-8.
>
>
>Ken Krugler wrote:
>>  The remaining issue is dealing with old-format indexes.
>
>I think that revving the version number on the segments file would be a
>good start.  This file must be read before any others.  Its current
>version is -1 and would become -2.  (All positive values are version 0,
>for back-compatibility.)  Implementations can be modified to pass the
>version around if they wish to be back-compatible, or they can simply
>throw exceptions for old format indexes.
>
>I would argue that the length written be the number of characters in the
>string, rather than the number of bytes written, since that can minimize
>string memory allocations.
>
>>  I'm going to take this off-list now [ ... ]
>
>Please don't.  It's better to have a record of the discussion.
>
>Doug


--
Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Lucene does NOT use UTF-8.

2005-08-30 Thread Robert Engels
Not true. You do not need to pre-scan it.

When you use CharSet encoder, it will write the bytes to a buffer (expanding
as needed). At the end of the encoding you can get the actual number of
bytes needed.

The pseudo-code is

use CharsetEncoder to write String to ByteBuffer
write VInt using the ByteBuffer's encoded length (its position after the encode)
write bytes using ByteBuffer.array()

better yet, use NIO so you can pass the ByteBuffer directly.


-Original Message-
From: Yonik Seeley [mailto:[EMAIL PROTECTED]
Sent: Tuesday, August 30, 2005 12:56 PM
To: java-dev@lucene.apache.org; [EMAIL PROTECTED]
Subject: Re: Lucene does NOT use UTF-8.


> I think you guys are WAY overcomplicating things, or you just don't know
> enough about the Java class libraries.


People were just pointing out that if the vint isn't String.length(), then
one has to either buffer the entire string, or pre-scan it.

It's a valid point, and CharsetEncoder doesn't change that.

 -Yonik Now hiring -- http://tinyurl.com/7m67g


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Lucene does NOT use UTF-8.

2005-08-30 Thread Robert Engels
Since the buffer can be reused, seems that is the proper choice, and the
"increased memory" you cited originally is not an issue.

-Original Message-
From: Yonik Seeley [mailto:[EMAIL PROTECTED]
Sent: Tuesday, August 30, 2005 1:07 PM
To: java-dev@lucene.apache.org; [EMAIL PROTECTED]
Subject: Re: Lucene does NOT use UTF-8.


On 8/30/05, Robert Engels <[EMAIL PROTECTED]> wrote:
>
> Not true. You do not need to pre-scan it.


What I previously wrote, with emphasis on key words added:
"one has to *either* buffer the entire string, *or* pre-scan it."

-Yonik Now hiring -- http://tinyurl.com/7m67g


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Eliminating norms ... completley

2005-10-07 Thread Robert Engels
I did exactly this in my custom lucene, since the array of a byte per
document is extremely wasteful in a lot of applications. I just changed the
code to return null from getNorms() and modified the callers to treat a null
array as always 1 for any document.
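
A hypothetical caller-side sketch of that convention (getNorms() is from the
derived code; the stock reader exposes norms(), and the null-means-1.0 rule
is the modification being described, not standard behavior):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Similarity;

// Sketch: treat a missing norms array as "norm is 1.0 for every document".
public class NormLookup {
    public static float norm(IndexReader reader, String field, int doc)
            throws IOException {
        byte[] norms = reader.norms(field);   // null in the modified code
        return norms == null ? 1.0f : Similarity.decodeNorm(norms[doc]);
    }
}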

-Original Message-
From: Chris Hostetter [mailto:[EMAIL PROTECTED]
Sent: Friday, October 07, 2005 4:18 PM
To: java-dev@lucene.apache.org
Subject: Eliminating norms ... completley



Yonik and I have been looking at the memory requirements of an application
we've got.  We use a lot of indexed fields, primarily so I can do a lot
of numeric tests (using RangeFilter).   When I say "a lot" I mean
around 8,000 -- many of which are not used by all documents in the index.

Now there are some basic usage changes I can make to cut this number in
half, and some more complex biz rule changes I can make to get the number
down some more (at the expense of flexibility) but even then we'd have
around 1,000 -- which is still a lot more than the recommended "handful"

After discussing some options, I asked the question "Remind me again why
having lots of indexed fields makes the memory requirements jump up --
even if only a few documents use some field?" and Yonik reminded me about
the norm[] -- an array of bytes representing the field boost + length
boost for each document.  One of these arrays exists for every indexed
field.

So then I asked the $50,000,000 question:  "Is there any way to get rid of
this array for certain fields? ... or any way to get rid of it completely
for every field in a specific index?"

This may sound like a silly question for most IR applications where you
want length normalization to contribute to your scores, but in this
particular case most of these fields are only used to store a single numeric
value. To be certain, there are some fields we have (or may add in the
future) that could benefit from having a norms[] ... but if it had to be
an all-or-nothing thing we could certainly live without them.

It seems to me, that in an ideal world, deciding whether or not you wanted
to store norms for a field would be like deciding whether you wanted to
store TermVectors for a field.  I can imagine a Field.isNormStored()
method ... but that seems like a pretty significant change to the existing
code base.


Alternately, I started wondering if it would be possible to write our own
IndexReader/IndexWriter subclasses that would ignore the norm info
completely (with maybe an optional list of field names the logic should be
limited to), and return nothing but fixed values for any parts of the code
base that wanted them.  Looking at SegmentReader and MultiReader this
looked very promising (especially considering the way SegmentReader uses a
system property to decide which actual class to use).  But I was less
enthusiastic when I started looking at IndexWriter and the DocumentWriter
classes ... there doesn't seem to be any clean way to subclass the
existing code base to eliminate the writing of the norms to the Directory
(curses those final classes, and private final methods).


So I'm curious what you guys think...

  1) Regarding the root problem: is there any other things you can think
 of besides norms[] that would contribute to the memory foot print
 needed by a large number of indexed fields?
  2) Can you think of a clean way for individual applications to eliminate
 norms (via subclassing the lucene code base - ie: no patching)
  3) Yonik is currently looking into what kind of patch it would take to
 optionally turn off norms (I'm not sure if he's looking at doing it
 "per field" or "per index").  Is that the kind of thing that would
 even be considered for getting committed?

--

---
"Oh, you're a tricky one."Chris M Hostetter
 -- Trisha Weir[EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Eliminating norms ... completley

2005-10-10 Thread Robert Engels
Doesn't this cause a problem for highly interactive and large indexes? Since
every update to the index requires the rewriting of the norms, and
constructing a new array.

How expensive is maintaining the norms on disk, at least in regards to
index merging?



-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Monday, October 10, 2005 2:15 PM
To: java-dev@lucene.apache.org
Subject: Re: Eliminating norms ... completley


Chris Hostetter wrote:
>   2) Can you think of a clean way for individual applications to eliminate
>  norms (via subclassing the lucene code base - ie: no patching)

Can't you simply subclass FilterIndexReader and override norms() to
return a cached dummy array of Similarity.encodeNorm(1.0f) for those
fields whose norms you don't want?  You'd still have to have a single
array of bytes, but no longer one per field.
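
A minimal sketch of that subclass (assuming the era's signatures; a complete
version would also cover the norms(String, byte[], int) variant and per-field
selection):

import java.io.IOException;
import java.util.Arrays;
import org.apache.lucene.index.FilterIndexReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Similarity;

// Sketch: hand back one shared "all 1.0" norms array instead of a real
// per-field array.
public class NoNormsReader extends FilterIndexReader {
    private byte[] fakeNorms;

    public NoNormsReader(IndexReader in) {
        super(in);
    }

    public byte[] norms(String field) throws IOException {
        return fakeNorms();               // same dummy array for every field
    }

    private synchronized byte[] fakeNorms() {
        if (fakeNorms == null) {
            fakeNorms = new byte[maxDoc()];
            Arrays.fill(fakeNorms, Similarity.encodeNorm(1.0f));
        }
        return fakeNorms;
    }
}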

Doug

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Are Non-consecutive Document IDs feasible?

2005-10-11 Thread Robert Engels
Just add another field to document that is your "external" document
identifier, which is what the request is essentially asking for - another
layer of indirection between identifiers and physical locations in the
index.
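
A small sketch of that extra layer of indirection (the field name
"externalId" and the id value are illustrative, nothing Lucene mandates):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.TermQuery;

// Sketch: carry a stable application-level id alongside Lucene's internal
// docID, and look documents up by that id instead.
public class ExternalIdExample {
    static Document withExternalId(String externalId) {
        Document doc = new Document();
        doc.add(Field.Keyword("externalId", externalId)); // stored, indexed, untokenized
        // ... add the rest of the document's fields here ...
        return doc;
    }

    static TermQuery byExternalId(String externalId) {
        return new TermQuery(new Term("externalId", externalId));
    }
}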

-Original Message-
From: Shane O'Sullivan [mailto:[EMAIL PROTECTED]
Sent: Tuesday, October 11, 2005 10:59 AM
To: java-dev@lucene.apache.org
Subject: Are Non-consecutive Document IDs feasible?


Hi all,

As far as I understand today, Lucene assigns docIDs to documents according
to the order in which the documents are added to the index. Hence, docIDs
are assigned by the engine in a sequential manner, without gaps. This order
of document identifiers then determines the order of the postings in the
postings lists, i.e. all postings lists are sorted by docID. It also means
that the same document appearing in two different indices would probably not
have the same docID (unless some extreme care was taken to insert documents
in the same order).

There are situations where the application wants to determine the docID for
the index, i.e. to control the ordering of occurrences in the postings
lists. This is useful to ensure, for example, that a document has a stable
and consistent document identifier regardless of insertion order to an
index.

In either case, the application would want to pass into the index the
numeric identifier of the document. However, such identifiers may not be
sequential, i.e. it's possible that there would be a document with docID M
without there being any document whose docID is M-1.

Q1. How difficult would it be to change Lucene to accept the docIDs from the
application, and not care about any possible gaps those ids may have?
One possible problem is that since the Doc Ids could become very large, and
are non-sequential, creating a single array for them all would not be
feasible.

Q2. Does Lucene's search code depend on the fact that document IDs are
sequential?

Thanks

Shane


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: [jira] Commented: (LUCENE-414) Java NIO patch against Lucene 1.9

2005-10-26 Thread Robert Engels
The reason for using Nio and not IO is IO requires multiple file handles per 
file. There are already numerous bugs/work-arounds in Lucene to limit the use 
of file handles (as this is a OS limited resource), so I did not wish to 
further increase the number of file descriptors needed.

Your statement that a raid system would be needed to exploit the added 
concurrency is not exactly correct. By using multiple threads, even if the disk 
is busy handling a request, the OS can combine the pending requests and perform 
more efficient reads to the disk subsystem when it becomes available.

I also dispute the performance numbers cited. In my testing the 'user level' 
cache improved performance of query operations nearly 100%. I will write a 
testcase to demonstrate the increased performance. This testcase can be written 
independent of Lucene.

-Original Message-
From: Doug Cutting (JIRA) [mailto:[EMAIL PROTECTED]
Sent: Wednesday, October 26, 2005 4:31 PM
To: java-dev@lucene.apache.org
Subject: [jira] Commented: (LUCENE-414) Java NIO patch against Lucene
1.9


[ 
http://issues.apache.org/jira/browse/LUCENE-414?page=comments#action_12356015 ] 

Doug Cutting commented on LUCENE-414:
-

The channels should all be opened when the IndexInput is created, as files can 
subsequently get deleted.

Also, I'm not sure why this uses nio.  Classic io would also permit you to have 
multiple file handles per file, for more parallel io.  So you could just patch 
FSDirectory to permit that, no?

Finally, if files are on a single drive, then the concurrency improvements are 
probably negligible.  This would only really pay off with a RAID, where 
different parts of a file are stored on different physical devices.  Or am I 
missing something?

> Java NIO patch against Lucene 1.9
> -
>
>  Key: LUCENE-414
>  URL: http://issues.apache.org/jira/browse/LUCENE-414
>  Project: Lucene - Java
> Type: Bug
>   Components: Store
> Versions: unspecified
>  Environment: Operating System: All
> Platform: All
> Reporter: Chris Lamprecht
> Assignee: Lucene Developers
>  Attachments: MemoryLRUCache.java, NioFile.java, nio-lucene-1.9.patch
>
> Robert Engels previously submitted a patch against Lucene 1.4 for a Java NIO-
> based Directory implementation.  It also included some changes to FSDirectory 
> to allow better concurrency when searching from multiple threads.  The 
> complete thread is at:
> http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200505.mbox/%
> [EMAIL PROTECTED]
> This thread ended with Doug Cutting suggesting that someone port Robert's 
> changes to the SVN trunk.  This is what I've done in this patch.
> There are two parts to the patch.  The first part modifies FieldsReader, 
> CompoundFileReader, and SegmentReader, to allow better concurrency when 
> reading an index.  The second part includes the new NioFSDirectory 
> implementation, and makes small changes to FSDirectory and IndexInput to 
> accomodate this change.  I'll put a more detailed outline of the changes to 
> each file in a separate message.
> To use the new NioFSDirectory, set the system property 
> org.apache.lucene.FSDirectory.class to 
> org.apache.lucene.store.NioFSDirectory.  This will cause 
> FSDirectory.getDirectory() to return an NioFSDirectory instance.  By default, 
> NioFile limits the number of concurrent channels to 4, but you can override 
> this by setting the system property org.apache.lucene.nio.channels.  
> I did some performance tests with these patches.  The biggest improvement 
> came 
> from the concurrency improvements.  NioFSDirectory performed about the same 
> as 
> FSDirectory (with the concurrency improvements).  
> I ran my tests under Fedora Core 1; uname -a reports:
> Linux myhost 2.4.22-1.2199.nptlsmp #1 SMP Wed Aug 4 11:48:29 EDT 2004 i686 
> i686 i386 GNU/Linux
> The machine is a dual xeon 2.8GHz with 4GB RAM, and the tests were run 
> against 
> a 9GB compound index file.  The tests were run "hot" -- with everything 
> already cached by linux's filesystem cache.  The numbers are:
> FSDirectory without patch:  13.3 searches per second
> FSDirectory WITH concurrency patch: 14.3 searches per second
> Both tests were run with 6 concurrent threads, which gave the highest numbers 
> in each case.  I suspect that the concurrency improvements would make a 
> bigger 
> difference on a more realistic test where the index isn't all cached in RAM 
> already, since the I/O happens whild holding the sychronized lock.  Patches 
> to 
> follow...
> Thoughts?

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http

RE: [jira] Commented: (LUCENE-414) Java NIO patch against Lucene 1.9

2005-10-26 Thread Robert Engels
You are correct, this is to get around JDK bug 6265734. (The bug was
originally cited by me, but the test code attached to the bug seems to bear
out that my assessment is correct). It should be documented in the code that
this is a work-around (and it does increase the number of file handles needed).
I will look into whether or not using multiple RandomAccessFiles has any
performance difference.

I am not sure how to benchmark this. I state this from my understanding of
optimizing disk subsystems, but I am sure it is very hardware dependent. I
do know by reading the SCSI documentation, and other UltraATA documentation
that the controller will coalesce requests, so you need to get multiple
requests to the controller. If the thread blocks in java, you will never get
multiple requests to the controller.

I am working on a performance test case for the caching right now...



-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Wednesday, October 26, 2005 4:51 PM
To: java-dev@lucene.apache.org
Subject: Re: [jira] Commented: (LUCENE-414) Java NIO patch against
Lucene 1.9


Robert Engels wrote:
> The reason for using Nio and not IO is IO requires multiple file handles
per file. There are already numerous bugs/work-arounds in Lucene to limit
the use of file handles (as this is a OS limited resource), so I did not
wish to further increase the number of file descriptors needed.

Yes, but it appears to me that the submitted NioFile class opens a new
file handle per channel.  So I don't see how this addresses that.

> Your statement that a raid system would be needed to exploit the added
concurrency is not exactly correct. By using multiple threads, even if the
disk is busy handling a request, the OS can combine the pending requests and
perform more efficient reads to the disk subsystem when it becomes
available.

Perhaps.  It would be nice to see a benchmark demonstrating this.

> I also dispute the performance numbers cited. In my testing the 'user
level' cache improved performance of query operations nearly 100%. I will
write a testcase to demonstrate the increased performance. This testcase can
be written independent of Lucene.

Can you provide your benchmark results?

Thanks,

Doug

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: bytecount as String and prefix length

2005-10-31 Thread Robert Engels
All of the JDK source is available via download from Sun.

-Original Message-
From: Marvin Humphrey [mailto:[EMAIL PROTECTED]
Sent: Monday, October 31, 2005 6:31 PM
To: java-dev@lucene.apache.org
Subject: Re: bytecount as String and prefix length


I wrote...

> I think I'll take a crack at a custom charsToUTF8 converter algo.

Still no luck.  Still 20% slower than the current implementation.   
The algo is below, for reference.

It's entirely possible that my patches are doing something dumb  
that's causing this, given my limited experience with Java.  But if  
that's not the case, I can think of two other explanations.

One is that the passage of the text through an intermediate buffer  
before blasting it out is considerably more expensive than anticipated.

The other is that the pre-allocation of a char[] array based on the  
length VInt yields a significant benefit over the standard techniques  
for reading in UTF-8.  That wouldn't be hard to believe.  Without  
that number, there's a lot of guesswork involved.  English requires  
about 1.1 bytes per UTF-8 code point; Japanese, 3.  Multiple memory  
allocation ops may be required as bytes get read in, especially if  
the final String object kicked out HAS to use the bare minimum amount  
of memory.  I don't suppose there's any way for me to snoop just  
what's happening under the hood in these CharsetDecoder classes or  
String constructors, is there?

Scanning through a SegmentTermEnum with next() doesn't seem to be any  
slower with a byte-based TermBuffer, and my index-1000-wikipedia-docs  
benchmarker doesn't slow down that much when IndexInput is changed to  
use a String constructor that accepts UTF-8 bytes rather than chars.   
However, it's possible that the modified toTerm method of TermBuffer  
is a bottleneck, as it also uses the UTF-8 String constructor.  It  
doesn't get exercised under SegmentTermEnum.next(), but during  
merging of segments I believe it sees plenty of action -- maybe a lot  
more than IndexInput's readString.

So my next step is to write a utf8ToString method that's as efficient  
as I can make it.  After that... I dunno, I'm running out of ideas.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


   public static final ByteBuffer stringToUTF8(
 String s, int start, int length, ByteBuffer byteBuf) {
 byteBuf.clear();
 int i = start;
 int j = 0;
 try {
   final int end = start + length;
   byte[] bytes = byteBuf.array();
   for ( ; i < end; i++) {
 final int code = (int)s.charAt(i);
 if (code < 0x80)
   bytes[j++] = (byte)code;
 else if (code < 0x800) {
   bytes[j++] = (byte)(0xC0 | (code >> 6));
   bytes[j++] = (byte)(0x80 | (code & 0x3F));
 } else if (code < 0xD800 || code > 0xDFFF) {
   bytes[j++] = (byte)(0xE0 | (code >>> 12));
   bytes[j++] = (byte)(0x80 | ((code >> 6) & 0x3F));
   bytes[j++] = (byte)(0x80 | (code & 0x3F));
 } else {
   // surrogate pair
   int utf32;
   // confirm valid high surrogate
   if (code < 0xDC00 && (i < end-1)) {
 utf32 = ((int)s.charAt(i+1));
 // confirm valid low surrogate and write pair
 if (utf32 >= 0xDC00 && utf32 <= 0xDFFF) {
   utf32 = ((code - 0xD7C0) << 10) + (utf32 & 0x3FF);
   i++;
   bytes[j++] = (byte)(0xF0 | (utf32 >>> 18));
   bytes[j++] = (byte)(0x80 | ((utf32 >> 12) & 0x3f));
   bytes[j++] = (byte)(0x80 | ((utf32 >> 6) & 0x3F));
   bytes[j++] = (byte)(0x80 | (utf32 & 0x3F));
   continue;
 }
   }
   // replace unpaired surrogate or out-of-order low surrogate
   // with substitution character
   bytes[j++] = (byte)0xEF;
   bytes[j++] = (byte)0xBF;
   bytes[j++] = (byte)0xBD;
 }
   }
 }
 catch (ArrayIndexOutOfBoundsException e) {
   // guess how many more bytes it will take, plus 10%
   float charsProcessed = (float)(i - start);
   float bytesPerChar = (j / charsProcessed) * 1.1f;

   float charsLeft = length - charsProcessed;
   float targetSize
 = (float)byteBuf.capacity() + bytesPerChar * charsLeft + 1.0f;

   return stringToUTF8(s, start, length, ByteBuffer.allocate((int) 
targetSize));
 }
 byteBuf.position(j);
 return byteBuf;
   }





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: word count in a doc

2005-10-31 Thread Robert Engels
it is the frequency().
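
For example, a sketch using TermDocs (the index path and field name are
placeholders):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

// Sketch: print the within-document frequency of "hello" for each matching doc.
public class TermFreqExample {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("/path/to/index");
        TermDocs termDocs = reader.termDocs(new Term("contents", "hello"));
        while (termDocs.next()) {
            // freq() is how many times the term occurs in this document
            System.out.println("doc=" + termDocs.doc() + " freq=" + termDocs.freq());
        }
        termDocs.close();
        reader.close();
    }
}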

-Original Message-
From: jacky [mailto:[EMAIL PROTECTED]
Sent: Monday, October 31, 2005 9:49 PM
To: java-dev@lucene.apache.org
Subject: word count in a doc


hi,
  e.g in test.txt:
  hello world, hello friend.

  When i search word "hello", can i get the count 2 by lucene?
  I found the word count is a field of index data structure, but i can't
found the api to get it.
 Best Regards.


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Faking index merge by modifying segments file?

2005-11-01 Thread Robert Engels
Problem is the terms need to be sorted in a single segment.

-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Tuesday, November 01, 2005 1:52 AM
To: java-dev@lucene.apache.org
Subject: Faking index merge by modifying segments file?


Hello,

I spent most of today talking to some people about Lucene, and one of
them said how they would really like to have an "instantaneous index
merge", and how he is thinking he could achieve that by simply opening
segments file of one index, and adding segment names of the other
index/indices, plus adjusting the segment size (SegSize in
fileformats.html), thus creating a single (but unoptimized) index.

Any reactions to that?

I imagine this isn't quite that simple to implement, as one would have
to renumber all documents, in order to avoid having multiple documents
with the same document id.

Can anyone think of any other problems with this approach, or perhaps
offer ideas for possible document renumbering?

Thanks,
Otis


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Faking index merge by modifying segments file?

2005-11-01 Thread Robert Engels
The solution we came up with is (I think) a bit better, since it does not
require any copying of files.

Since MultiSegmentReader already does the segment/document # offsetting, and
a segment does not change after written, we created a reopen() method that
reopens an existing index, (knowing which segments are the same, so those do
not need to be opened again).

For a general library, some sort of reference counting for an open segment
would need to be implemented. (Since our index management is behind another
layer this is not needed).
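
A hypothetical sketch of that reference counting (none of these names exist
in the stock library; it only illustrates keeping a segment reader open
while any reopened composite still uses it):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;

public class RefCountedReader {
    private final IndexReader reader;
    private int refCount = 1;               // the creator holds the first reference

    public RefCountedReader(IndexReader reader) {
        this.reader = reader;
    }

    public synchronized IndexReader incRef() {
        refCount++;                         // a reopened composite shares this segment
        return reader;
    }

    public synchronized void decRef() throws IOException {
        if (--refCount == 0)
            reader.close();                 // last reference gone, really close it
    }
}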

-Original Message-
From: Kevin Oliver [mailto:[EMAIL PROTECTED]
Sent: Tuesday, November 01, 2005 11:18 AM
To: java-dev@lucene.apache.org
Subject: RE: Faking index merge by modifying segments file?


Hello Otis,

I worked on a similar issue a couple on months ago. I've included our
email conversation below.

Hopefully, your thread will prompt more interest from the mailing list.

-Kevin



Sort of -- but only within a very controlled situation, and with some
hackery, can you comment out both of them.

Here's what I had to do -- in pseudo code.

Create a new IndexWriter subclass, call it IndexWriter2 that gets its
segmentCounter initialized to the real (the actual pre-existing index)
index's segmentCounter - 1. I suspect this part is not very robust, as I
don't completely understand why I needed to subtract 1 (it has something
to do with the temporary RAMDirectory that gets used before actually
getting written to disk).

Add the documents into our IndexWriter2 so that they get properly
written to a separate place on the file system and get the correct
"next" segment names that would appear in the real index.

Within the addIndexes() loop over the dirs, you move all the newly
created files from their current location over to the real indexes file
directory. This part in particular feels very hacked.

Finally, instead of calling optimize() at the end of addIndexes(), you
rewrite the segments file so that it includes all these new segments.


One other note is that if I wasn't using compound files, then I _think_
I could just rename the all of the files when they get moved into the
real index's file directory. But, compound files create their internal
files using the segmentName that the segment was created with, thus
creating a mismatch when you rename it externally.

-Kevin


-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Thursday, August 11, 2005 3:36 PM
To: java-dev@lucene.apache.org
Subject: Re: Avoiding segment merges during indexing

Kevin - are you saying that you can just comment out the 2 optimize()
calls and addIndexes(Directory[]) will keep working?  I don't recall
why there are optimize() calls again, but I know several people had
issues with it...

Otis

--- Kevin Oliver <[EMAIL PROTECTED]> wrote:

> This is a proposal that is in need of some insights.
>
> In an effort to speed up adding documents to an existing index, we
> are
> pursuing using IndexWriter.addIndexes(Directory[]). In theory this
> should work great -- you index your new documents into a new
> Directory,
> then add them into to your existing directory, saving you the time
> spent
> merging segments that would be caused by the normal
> IndexWriter.addDocument(Document) calls during indexing.
>
> However, addIndexes() has the property that it calls optimize() both
> before and after adding the new directories. This wipes out the
> performance boost, and then some.
>
> So I found a way to work around this, but I don't like what I've had
> to
> do and I was wondering if anybody has any ideas on what could be done
> to
> make this more pleasant.
>
> It appears that by getting the new segment files into the existing
> directory, with the correct segment names, it will work without all
> of
> the optimize calls. Unfortunately, getting the segment names right
> and
> getting the files into the right location is a big ugly hack and is
> quite fragile.
>
> Is there a better way? I think maybe some explanation into why the 2
> optimizes are there would help my understanding. Is there a clean way
> of
> doing what I'm proposing? Is there some hidden catch I'm missing and
> I've been going down the wrong path?
>
> It seems to me this would be a great benefit to anyone who does
> indexing
> on existing indexes and wants it to be fast.
>
> Thanks,
> Kevin Oliver
>


-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Monday, October 31, 2005 11:52 PM
To: java-dev@lucene.apache.org
Subject: Faking index merge by modifying segments file?

Hello,

I spent most of today talking to some people about Lucene, and one of
them said how they would really like to have an "instantaneous index
merge", and how he is thinking he could achieve that by simply opening
segments file of one index, and adding segment names of the other
index/indices, plus adjusting the segment size (SegSize in
fileformats.html), thus creating a single (but unoptimized) index.

Any r

RE: Faking index merge by modifying segments file?

2005-11-02 Thread Robert Engels
They only need to be sorted if segA and segB were combined, so in your case
this is not needed.

I am not sure that what you are describing is any different than how
MultiReader works, and it does not need to perform any file copying or
linking.

Just create the new index. Write the documents. And open all indexes using a
MultiReader? Maybe I am missing something, but I see that as a simple way of
doing what you are trying to do.
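
A sketch of that (assuming MultiReader's public IndexReader[] constructor;
the index paths are placeholders):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.search.IndexSearcher;

// Sketch: search two physically separate indexes as one, with no file
// copying or segment renaming; docIDs are offset per sub-reader.
public class SearchTwoIndexes {
    public static void main(String[] args) throws Exception {
        IndexReader[] readers = {
            IndexReader.open("/path/indexA"),
            IndexReader.open("/path/indexB")
        };
        IndexSearcher searcher = new IndexSearcher(new MultiReader(readers));
        // ... run queries against searcher ...
        searcher.close();
    }
}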


-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Wednesday, November 02, 2005 5:14 AM
To: java-dev@lucene.apache.org
Subject: RE: Faking index merge by modifying segments file?


Hello,

--- Robert Engels <[EMAIL PROTECTED]> wrote:

> Problem is the terms need to be sorted in a single segment.

Are you referring to Term Dictionary (.tis and .tii files as described
at http://lucene.apache.org/java/docs/fileformats.html )?  If so, is
that really true?

I don't have an optimized Lucene multi-file index handy to look at, but
.tis and .tii files are "per segment" files, so wouldn't a set of .tis
and .tii files from multiple indices be equivalent to a set of .tis and
.tii files from multiple segments of a single index?

For example, if we have two indices, A and B, both optimized, we have:

A: segA.tis   (this may contain terms bar and foo)
   segA.tii
   ...
   segments   (this would list segA)

B: segB.tis   (this may contain terms piggy and bank)
   segB.tii
   ...
   segments   (this would list segB)

Wouldn't that be the same as a single index, say index C:

C: segA.tis   (this may contain terms bar and foo)
   segA.tii
   segB.tis   (this may contain terms piggy and bank)
   segB.tii
   ...
   segments   (this would list segments segA and segB)


That is really what I am talking about: take all index files of index A
and all index files of segment B and stick them in a new index dir for
a new index C.  Then open segments files of index A and index B, pull
out segment names and other information from there, and write a new
segments file with that information in index dir for that new index C.

This sounds like it should be possible, except for docId clashes - if
index A had a document with Id 100 and index B also has a document with
Id 100, after my index file copying, index C will end up having 2
documents with Id 100, and that won't work.  So, documents in C would
have to be renumbered (re-assigned Ids), as they get renumbered during
optimization, but without rewriting all index files in index C.

Does this sound right?

Also, I may not need to actually copy/move files around, if I just make
use of sym/hard links.

Thanks,
Otis


> -Original Message-
> From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, November 01, 2005 1:52 AM
> To: java-dev@lucene.apache.org
> Subject: Faking index merge by modifying segments file?
>
>
> Hello,
>
> I spent most of today talking to some people about Lucene, and one of
> them said how they would really like to have an "instantaneous index
> merge", and how he is thinking he could achieve that by simply
> opening
> segments file of one index, and adding segment names of the other
> index/indices, plus adjusting the segment size (SegSize in
> fileformats.html), thus creating a single (but unoptimized) index.
>
> Any reactions to that?
>
> I imagine this isn't quite that simple to implement, as one would
> have
> to renumber all documents, in order to avoid having multiple
> documents
> with the same document id.
>
> Can anyone think of any other problems with this approach, or perhaps
> offer ideas for possible document renumbering?
>
> Thanks,
> Otis
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Put field in database vs. Lucene

2005-11-02 Thread Robert Engels
If you do not put them in Lucene, performing any sort of AND search will be
VERY difficult, and/or VERY slow.
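
For example, the kind of AND query that becomes awkward when half the fields
live in a database (a sketch; the field names are illustrative):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

// Sketch: both clauses are required, so both fields must be in the same
// Lucene index for the query to be evaluated efficiently.
public class AndAcrossFields {
    static BooleanQuery build() {
        BooleanQuery query = new BooleanQuery();
        query.add(new TermQuery(new Term("content", "lucene")), true, false); // required
        query.add(new TermQuery(new Term("author", "smith")), true, false);   // required
        return query;
    }
}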

-Original Message-
From: Mario Alejandro M. [mailto:[EMAIL PROTECTED]
Sent: Wednesday, November 02, 2005 4:08 PM
To: Lucene Developers List
Subject: Put field in database vs. Lucene


I have a requirement to mix structured and unstructured data. Also, the
unstructured data has metadata.

If I have:

Id
Title
Author
Content
DateCreated
DateAccesed

And a lot of schemas like that... so should I put all fields except Content
in a database, or in Lucene?

--
Mario Alejandro Montoya
MCP
www.solucionesvulcano.com 
Get your dynamic Web site!


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Capitalized Method Names?

2005-11-25 Thread Robert Engels
I noticed that there are a few capitalized method names in FastCharStream.

Is there a reason for this? It is not according to Java standards.



Re: "Advanced" query language

2005-12-02 Thread Robert Engels
I don't see the value in this. Whatever is generating the XML could just as
easily create/instantiate the query objects.
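
For instance, a span query of the kind the classic syntax cannot express is
only a few lines of code to build directly (a sketch; the field and terms
are examples):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

// Sketch: "constant" within 3 positions of "scoring", in order.
public class ProgrammaticQuery {
    static SpanNearQuery build() {
        SpanQuery[] clauses = {
            new SpanTermQuery(new Term("content", "constant")),
            new SpanTermQuery(new Term("content", "scoring"))
        };
        return new SpanNearQuery(clauses, 3, true);
    }
}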

I would much rather see the query parser migrated to an internal parser (that 
would be easier to maintain), and develop a syntax that allowed easier use of 
the most common/powerful features.

-Original Message-
From: mark harwood <[EMAIL PROTECTED]>
Sent: Dec 2, 2005 10:03 AM
To: java-dev@lucene.apache.org
Subject: "Advanced" query language

There seems to be a growing gap between Lucene
functionality and the query language offered by
QueryParser (eg no support for regex queries, span
queries, "more like this", filter queries,
minNumShouldMatch etc etc).

Closing this gap is hard when:
a) The availability of Javacc+Lucene skills is a
bottleneck 
b) The syntax of the query language makes it difficult
to add new features eg rapidly running out of "special
characters"

I don't think extending the existing query
parser/language is necessarily useful and I see it
being used purely to support the classic "simple
search engine" syntax. 

Unfortunately the fall-back position for applications
which require more complex queries is to "just write
some Java code to instantiate the Query objects
programmatically." This is OK but I think there is
value in having an advanced search syntax capable of
supporting the latest Lucene features and expressed in
XML. It's worth considering why it's useful to have a
String-representable form for queries:
1) Queries can be stored eg in audit logs or "saved
queries" used for tasks like auto-categorization
2) Clients built in languages other than Java can
issue queries to a Lucene server
3) I can decouple a request from the code that
implements the query when distributing software e.g my
applet may not want Lucene dragging down to the client

Currently we cannot easily do the above for any
"complex" queries  because they are not easily
persisted (yes, we could serialize Query objects but
that seems messy and does not solve points 2 and 3).

We can potentially use XML in the same way ANT does
i.e. a declarative way of invoking an extensible list
of Java-implemented features. A query interpreter is
used to instantiate the configured Java Query objects
and populates them with settings from the XML in a
generic fashion (using reflection) eg:

   [the XML markup of the example was stripped by the mail archive; only the
   sample query text survived:]

   Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Morbi eget ante
   blandit quam faucibus posuere. Vivamus porta, elit fringilla venenatis
   consequat, neque lectus gravida dolor, sed cursus nunc elit non lorem.
   Nullam congue orci id eros. Nunc aliquet posuere enim.


Do people feel this would be a worthwhile endeavour?
I'm not sure if enough people feel pain around the
points 1-3 outlined above to make it worth pursuing.


Cheers
Mark




___ 
How much free photo storage do you get? Store your holiday 
snaps for FREE with Yahoo! Photos http://uk.photos.yahoo.com

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]





-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



NioFile cache performance

2005-12-08 Thread Robert Engels



I finally got around to writing a testcase to verify the numbers I presented.
The following testcase and results are for the lowest level disk operations.
On my machine reading from the cache, vs. going to disk (even when the data
is in the OS cache), is 30%-40% faster. Since Lucene makes extensive use of
disk IO and often reads the same data (e.g. reading the terms), a localized
user-level cache can provide significant performance benefits.

Using a 4mb file (so I could "guarantee" the disk data would be in the OS
cache as well), the test shows the following results.

Most of the CPU time is actually used during the synchronization with
multiple threads. I hacked together a version of MemoryLRUCache that used a
ConcurrentHashMap from JDK 1.5, and it was another 50% faster! At a minimum,
if the ReadWriteLock class were modified to use the 1.5 facilities, some
significant additional performance gains should be realized.
 

filesize is 4194304
non-cached time = 10578, avg = 0.010578
non-cached threaded (3 threads) time = 32094, avg = 0.010698
cached time = 6125, avg = 0.006125
cache hits 996365
cache misses 3635
cached threaded (3 threads) time = 20734, avg = 0.0069116
cache hits 3989089
cache misses 10911
When using the shared test (which is more like the Lucene usage, since a
single "file" is shared by multiple threads), the difference is even more
dramatic with multiple threads (since the cache size is effectively reduced
by the number of threads). This test also shows the value of using multiple
file handles when using multiple threads to read a single file (rather than
using a shared file handle).
filesize is 4194304
non-cached time = 10594, avg = 0.010594
non-cached threaded (3 threads) time = 42110, avg = 0.014036
cached time = 6047, avg = 0.006047
cache hits 996827
cache misses 3173
cached threaded (3 threads) time = 20079, avg = 0.006693
cache hits 3995776
cache misses 4224
package org.apache.lucene.util;

import java.io.*;
import java.util.Random;

import junit.framework.TestCase;

public class TestNioFilePerf extends TestCase {
    static final String FILENAME = "testfile.dat";
    static final int BLOCKSIZE = 2048;
    static final int NBLOCKS = 2048; // 4 mb file
    static final int NREADS = 50;
    static final int NTHREADS = 3;

    static {
        System.setProperty("org.apache.lucene.CachePercent","90");
    }

    public void setUp() throws Exception {
        FileOutputStream f = new FileOutputStream(FILENAME);
        Random r = new Random();

        byte[] block = new byte[BLOCKSIZE];
        for(int i=0;i<

package org.apache.lucene.util;

/**
 * A read/write lock. Allows unlimited simultaneous readers, or a single writer. A thread with
 * the write lock implicitly owns a read lock as well.
 */
public class ReadWriteLock {
    int readlocks = 0;
    int writelocks = 0;
    Thread writethread = null;

    public synchronized void readLock() {
        while(true) {
            if(writelocks==0 || (Thread.currentThread()==writethread) ) {
                readlocks++;
                return;
            } else {
                try {
                    wait();
                } catch (InterruptedException e) {
                }
            }
        }
    }

    public synchronized void readUnlock() {
        readlocks--;
        notifyAll();
    }

    public synchronized void writeLock() {
        while(true) {
            if(tryWriteLock())
                return;
            try {
                wait();
            } catch (InterruptedException e) {
            }
        }
    }

    /**
     * try to get the write lock
     *
     * @return true if the write lock could be acquired, else false
     */
    public synchronized boolean tryWriteLock() {
        if(readlocks==0 && (writelocks==0 || writethread == Thread.currentThread())) {
            writethread = Thread.currentThread();
            writelocks++;
            return true;
        }
        return false;
    }

    public synchronized void writeUnlock() {
        if(writelocks==0)
            throw new IllegalStateException("caller does not own write lock");
        if(--writelocks==0)
            writethread=null;
        notifyAll();
    }

    /**
     * checks if the calling thread owns the write lock
     *
     * @return true if the calling thread owns the write lock
     */
    public synchronized boolean ownsWriteLock() {
        return Thread.currentThread()==writethread;
    }
}
package org.apache.lucene.util;

import java.io.*;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

/**
 * wrapper for NIO FileChannel in order to circumvent problems with multiple threads reading the
 * same FileChannel, and to provide local cache. The current Windows implementation of FileChannel
 * has some synchronization even when performing positioned reads. See JDK bug #6265734.
 * 
 * The NioFile contains internal caching to red

RE: NioFile cache performance

2005-12-08 Thread Robert Engels
As a follow-up...

The real performance benefit comes in a shared server environment, where the
Lucene process runs along side other processes - i.e. competes for the use
of the OS file cache. Since the Lucene process can be configured with a
dedicated memory pool, using facilities like NioFile allows for a large
dedicated application cache - similar to how databases buffer data/index
blocks and don't rely on the OS to do so.

If the Lucene process (we wrap Lucene in a server "process") is the "only"
process on the server, the OS cache will likely perform well-enough for most
applications.

I will attempt to get some performance numbers using/not using NioFile
performing actual Lucene queries.
  -Original Message-----
  From: Robert Engels [mailto:[EMAIL PROTECTED]
  Sent: Thursday, December 08, 2005 10:37 AM
  To: Lucene-Dev
  Subject: NioFile cache performance


  I finally got around to writing a testcase to verify the numbers I
presented. The following testcase and results are for the lowest level disk
operations. On my machine reading from the cache, vs. going to disk (even
when the data is in the OS cache) is 30%-40% faster. Since Lucene makes
extensive use of disk IO and often reads the same data (e.g. reading the
terms), a localized user-level cache can provide significant performance
benefits.

  Using a 4mb file (so I could be "guarantee" the disk data would be in the
OS cache as well), the test shows the following results.

  Most of the CPU time is actually used during the synchronization with
multiple threads. I hacked together a version of MemoryLRUCache that used a
ConcurrentHashMap from JDK 1.5, and it was another 50% faster ! At a
minimum, if the ReadWriteLock class was modified to use the 1.5 facilities
some significant additional performance gains should be realized.

  filesize is 4194304

  non-cached time = 10578, avg = 0.010578

  non-cached threaded (3 threads) time = 32094, avg = 0.010698

  cached time = 6125, avg = 0.006125

  cache hits 996365

  cache misses 3635

  cached threaded (3 threads) time = 20734, avg = 0.0069116

  cache hits 3989089

  cache misses 10911

  When using the shared test (which is more like the lucene usage, since a
single "file" is shared by multiple threads), the difference is even more
dramatic with multiple threads (since the cache size is effectively reduced
by the number of threads). This test also shows the value of using multiple
file handles when using multiple threads to read a single file (rather than
using a shared file handle).

  filesize is 4194304

  non-cached time = 10594, avg = 0.010594

  non-cached threaded (3 threads) time = 42110, avg = 0.014036

  cached time = 6047, avg = 0.006047

  cache hits 996827

  cache misses 3173

  cached threaded (3 threads) time = 20079, avg = 0.006693

  cache hits 3995776

  cache misses 4224


RE: NioFile cache performance

2005-12-08 Thread Robert Engels



I modified MemoryLRUCache to use the attached ConcurrentHashMap.java and ran
under 1.4.2_10:

filesize is 4194304
non-cached time = 11140, avg = 0.01114
non-cached threaded (3 threads) time = 35485, avg = 0.011828
cached time = 6109, avg = 0.006109
cache hits 996138
cache misses 3862
cached threaded (3 threads) time = 17281, avg = 0.0057605
cache hits 3985911
cache misses 14089

with the shared test:

filesize is 4194304
non-cached time = 11266, avg = 0.011266
non-cached threaded (3 threads) time = 46734, avg = 0.015578
cached time = 6094, avg = 0.006094
cache hits 996133
cache misses 3867
cached threaded (3 threads) time = 16500, avg = 0.0055
cache hits 3994999
cache misses 5001

I then ran the tests using JDK 1.5.0_06 using the built-in ConcurrentHashMap:

filesize is 4194304
non-cached time = 10515, avg = 0.010515
non-cached threaded (3 threads) time = 30688, avg = 0.010229
cached time = 7031, avg = 0.007031
cache hits 996742
cache misses 3258
cached threaded (3 threads) time = 17468, avg = 0.0058226667
cache hits 3989122
cache misses 10878

with the shared test:

filesize is 4194304
non-cached time = 10187, avg = 0.010187
non-cached threaded (3 threads) time = 44000, avg = 0.014666
cached time = 6234, avg = 0.006234
cache hits 996315
cache misses 3685
cached threaded (3 threads) time = 16766, avg = 0.005588
cache hits 3995081
cache misses 4919

Surprisingly, the 1.4.2_10 version performed as well as (if not better than)
the JDK 1.5 version.

Also, I am only running on a single processor box (non-hyper-threaded), so it
would be interesting to see the numbers on a true multi-processor box. My
thinking is that the cached version will be MUCH faster than the non-cached
one, as many more context switches into the OS will be avoided.

  -Original Message-
  From: Paul Smith [mailto:[EMAIL PROTECTED]
  Sent: Thursday, December 08, 2005 1:54 PM
  To: java-dev@lucene.apache.org
  Subject: Re: NioFile cache performance

  > Most of the CPU time is actually used during the synchronization with
  > multiple threads. I hacked together a version of MemoryLRUCache that used
  > a ConcurrentHashMap from JDK 1.5, and it was another 50% faster! At a
  > minimum, if the ReadWriteLock class was modified to use the 1.5
  > facilities some significant additional performance gains should be
  > realized.

  Would you be able to run the same test in JDK 1.4 but use the
  util.concurrent compatibility pack? (supposedly the same classes in Java5)
  It would be nice to verify whether the gain is the result of the different
  ConcurrentHashMap vs the different JDK itself.

  Paul Smith
/*
  File: ConcurrentHashMap

  Written by Doug Lea. Adapted and released, under explicit
  permission, from JDK1.2 HashMap.java and Hashtable.java which
  carries the following copyright:

 * Copyright 1997 by Sun Microsystems, Inc.,
 * 901 San Antonio Road, Palo Alto, California, 94303, U.S.A.
 * All rights reserved.
 *
 * This software is the confidential and proprietary information
 * of Sun Microsystems, Inc. ("Confidential Information").  You
 * shall not disclose such Confidential Information and shall use
 * it only in accordance with the terms of the license agreement
 * you entered into with Sun.

  History:
  Date   WhoWhat
  26nov2000  dl   Created, based on ConcurrentReaderHashMap
  12jan2001  dl   public release
  17nov2001  dl   Minor tunings
  24oct2003  dl   Segment implements Serializable
  23jun2004  dl   Avoid bad array sizings in view toArray methods
*/

package org.apache.lucene.util;

import java.util.Map;
import java.util.AbstractMap;
import java.util.AbstractSet;
import java.util.AbstractCollection;
import java.util.Collection;
import java.util.Set;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.Enumeration;
import java.util.NoSuchElementException;

import java.io.Serializable;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;


/**
 * A version of Hashtable supporting concurrency for both retrievals and updates:
 *
 * Retrievals
 *
 *  Retrievals may overlap updates.  (This is the same policy as
 * ConcurrentReaderHashMap.)  Successful retrievals using get(key) and
 * containsKey(key) usually run without locking. Unsuccessful
 * retrievals (i.e., when the key is not present) do involve brief
 * synchronization (locking).  Because retrieval operations can
 * ordinarily overlap with update operations (i.e., put, remove, and
 * their derivatives), retrievals can only be guaranteed to return the
 * results of the most recently completed operations holding
 * upon their onset. Retrieval operations may or may not return
 * results reflecting in-progress writing operations.  However, the
 * retrieval operations do always return consistent r

RE: NioFile cache performance

2005-12-09 Thread Robert Engels
As stated in a previous email - good idea.

All of the code and testcases were attached to the original email.

The testcases were the answer to a request for such (at least a month ago if
not longer).

-Original Message-
From: DM Smith [mailto:[EMAIL PROTECTED]
Sent: Friday, December 09, 2005 7:07 AM
To: java-dev@lucene.apache.org
Subject: Re: NioFile cache performance


John Haxby wrote:

> Robert Engels wrote:
>
>> Using a 4mb file (so I could be "guarantee" the disk data would be in
>> the OS cache as well), the test shows the following results.
>
>
> Which OS?   If it's Linux, what kernel version and distro?   What
> hardware (disk type, controller etc).
>
> It's important to know: I/O (and caching) is very different between
> Linux 2.4 and 2.6.   The choice of I/O scheduler can also make a
> significant difference on 2.6, depending on the workload.   The type
> of disk and its controller is also important -- and when you get
> really picky, the mobo model number.
>
> I don't dispute your finding for a second, but it would be good to run
> the same test on other platforms to get comparative data: not least
> because you can get the kind of I/O time improvement you're seeing on
> some workloads on different versions of the Linux kernel.

I think that the results were informative from a comparative basis on a
single machine. It compared different techniques and showed their
relative performance on that machine.

I also agree that the architecture of the machine can play an important
part in how code performs. I wrote a piece of software that ran well on
a 4-way, massive raid configuration, with gobs of ram only to have it
re-targeted to a 1-way, small ram box, where it had to be rewritten to
run at all.

Perhaps, it would be good to establish guidelines for reporting
performance, including the posting of test data and test code.

This may encourage others to download the data and code, perform the
test and report the results.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: NioFile cache performance

2005-12-09 Thread Robert Engels
That is why "guarantee" is in quotes. Any OS worth its salt should end up
caching a "massively read" 4mb file (unless you are running on an OS with
1mb of memory). If the file was 400mb, you probably have a different story.

I welcome (and would hope) others to run the tests on various OSs and
various hardware configs.

I would expect similar results on other platforms, as most of the
improvement is due to keeping the machine in user context and avoiding
context switches to OS level code.


-Original Message-
From: John Haxby [mailto:[EMAIL PROTECTED]
Sent: Friday, December 09, 2005 4:24 AM
To: java-dev@lucene.apache.org
Subject: Re: NioFile cache performance


Robert Engels wrote:

> Using a 4mb file (so I could be "guarantee" the disk data would be in
> the OS cache as well), the test shows the following results.

Which OS?   If it's Linux, what kernel version and distro?   What
hardware (disk type, controller etc).

It's important to know: I/O (and caching) is very different between
Linux 2.4 and 2.6.   The choice of I/O scheduler can also make a
significant difference on 2.6, depending on the workload.   The type of
disk and its controller is also important -- and when you get really
picky, the mobo model number.

I don't dispute your finding for a second, but it would be good to run
the same test on other platforms to get comparative data: not least
because you can get the kind of I/O time improvement you're seeing on
some workloads on different versions of the Linux kernel.

jch

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: "Advanced" query language

2005-12-19 Thread Robert Engels
Why not just write a Quark storage component based upon Lucene, and then you
get XPath/XQuery compliance?

And leave the custom (and simple) XML-based persistence mechanism for Lucene
queries as proprietary.

-Original Message-
From: Joaquin Delgado [mailto:[EMAIL PROTECTED]
Sent: Monday, December 19, 2005 6:44 PM
To: java-dev@lucene.apache.org
Subject: Re: "Advanced" query language


Comments in-line

Wolfgang Hoschek wrote:

> Yes, there are interesting impls out there. I've myself implemented
> XQuery fulltext search via extension functions build on Lucene. See
> http://dsd.lbl.gov/nux/index.html#Google-like%20realtime%20fulltext%
> 20search%20via%20Apache%20Lucene%20engine
>
> However, rather than targetting fulltext search of infrequent queries
> over huge persistent data archives (historic search), Nux targets
> streaming fulltext search of huge numbers of queries over
> comparatively small transient realtime data (prospective search),
> e.g. 10 queries/sec ballpark. Think XML router. That's probably
> distinctly different than what many (most?) other folks would like to
> do, and requires a different, somewhat non-standard, architecture.
>
> [The underlying lucene code lives in lucene SVN in the lucene/contrib/
> memory module, the remainder is in Nux.]
>
> Implementing XQuery in full compliance with the spec is a rather
> gigantic undertaking. Separating the XQuery language and the fulltext
> language greatly simplified the system design, and made it more
> flexible and extensible.

[JOAQUIN] One of the arguable advantage of this new XQuery FT draft is
that the semantics (http://www.w3.org/TR/xquery-full-text/#tq-semantics)
are defined using XQuery  functions, thus it is relatively easy to build
a "dumb" XQuery-FT compliant engine using these definitions :-)  Here is
a Java based XQuery engine developed in Cornell that satisfies most of
the working draft's requirements:
http://www.cs.cornell.edu/database/Quark/quark_main.html

> Further, consider that fulltext search capabilities are typically
> quite open ended and context/application specific. Seems to me that
> that's one of the reasons why lucene is more a set of interfaces and
> diverse building blocks than a complete end user system. I find it
> difficult to believe that making the fulltext language an *integral
> part of XQuery* will enable sufficient "extension points" to prove
> meaningful to end users and implementors. Standards evolve at a
> glacial pace; it effectively means that most or all flexibility is
> lost. I tend to think that the W3C is jumping the gun and attempting
> to standardize what is more an R&D concept than a well understood set
> of capabilities across a wide range of actual real world use cases,
> and it does so in a non-modular manner.

Full-text search remains open-ended and context/app specific, thus it
makes sense to leave Lucene as is and still have, for example Nutch.
However the moment you are promoting INTEROPERABILITY with other
search/retrieval systems by XMLizing the query input and the result
output, like Mark is, then it makes sense to adhere to standards and the
standard to query XML is XQuery. Because of the nature of the data (XML)
full-text becomes a *must* requirement of the standard. If Mark comes up
with yet another query language with some custom tags it would be
denying the fact that search systems need to communicate among themselves and
thus re-inventing the wheel. Besides, almost 80% of all full-text
operators (Boolean, wildcards, proximity, etc.) just differ in syntax
from one search engine to another. Just look at another "Common Query
Language" now being used by the Library of Congress
(http://www.loc.gov/standards/sru/cql/) for federated search.

Maybe I'm being too ambitious here, but if we have an implementation of an
XQuery-FT compliant XQuery engine on top of Lucene indices, or at the minimum
_Lucene could interpret XPath queries_ where element node labels are
equivalent to Lucene fields, we can begin thinking of exposing Lucene sources
to more sophisticated and distributed XQuery engines, thus providing full XML
support on any Lucene based system. Unfortunately Lucene does not support
nested fields, but that is OK for now.
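
As a toy illustration of that last idea (and nothing more - the field-naming
convention and the handling of nesting are assumptions, not a worked-out
design), a single element path could simply be flattened into a Lucene field
name:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// Toy mapping of an element path to a Lucene field; e.g.
// toQuery("/book/title", "lucene") becomes a TermQuery on field "book.title".
// Ancestry/nesting beyond simple paths is deliberately ignored here.
public class XPathToLucene {
    public static Query toQuery(String path, String word) {
        String field = path.replaceFirst("^/", "").replace('/', '.');
        return new TermQuery(new Term(field, word));
    }
}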

-- Joaquin

>
> On Dec 17, 2005, at 5:43 PM, [EMAIL PROTECTED] wrote:
>
>> Paul and  Wolfang,
>>
>> Thank you very much for your input. I think there are two distinct
>> problems that have emerged from this thread:
>> 1) The ability to create efficient structures to index and query  XML
>> documents (element, attributes and corresponding values) with a
>> full-text query language and operators. After all, XML is text. As
>> Paul pointed out people have already tried this with Lucene.
>> 2) The need for a standard query language like XQuery aiming at
>> system interoperability in the now XMLized world that has the same
>> effect that SQL had in the relational world.
>>
>> While I can see how in the SQL case extension functions can be used
>> to implement full-text capab

RE: JE Directory/XA Transactions

2005-12-29 Thread Robert Engels
I think JE transactions are held completely in memory, so this may be an
issue - although I have not reviewed your implementation yet... :)

-Original Message-
From: Andi Vajda [mailto:[EMAIL PROTECTED]
Sent: Thursday, December 29, 2005 5:14 PM
To: java-dev@lucene.apache.org
Subject: Re: JE Directory/XA Transactions



On Thu, 29 Dec 2005, Donovan Aaron wrote:

> I recently ported Andi Vajda's DbDirectory to the Java Edition of
> Berkeley DB.  The main reason being the JCA connector and XA
> transactions.  Initial results are great and shown below.  I'm new to
> contributing.  What is the procedure for making this code available?

If you attach the Apache 2.0 license, I'd be glad to add your files
alongside
the C DB based implementation, since I'm currently the maintainer of the
'db'
Java Lucene contrib area.

Your timings look great but did you run DbDirectory in DB_AUTO_COMMIT mode
or
did you bracket the operations with your own transaction ? (it should make a
big difference to use a transaction for the whole indexing operation,
including IndexWriter.optimize(), instead of running in DB_AUTO_COMMIT mode,
where each write call gets its own transaction).

Andi..

>
> JEDirectory
>
> Writing files byte by byte
> 1453 total milliseconds to create, 5214 kb/s
> 722 total milliseconds to read, 10494 kb/s
> 80 total milliseconds to delete
> 2255 total milliseconds
> Writing files as one byte array
> 1032 total milliseconds to create, 7918 kb/s
> 311 total milliseconds to read, 26274 kb/s
> 60 total milliseconds to delete
> 1403 total milliseconds
>
>
> DbDirectory
>
> Writing files byte by byte
> 9879 total milliseconds to create, 766 kb/s
> 601 total milliseconds to read, 12607 kb/s
> 9188 total milliseconds to delete
> 19668 total milliseconds
> Writing files as one byte array
> 12304 total milliseconds to create, 664 kb/s
> 281 total milliseconds to read, 29079 kb/s
> 9689 total milliseconds to delete
> 22274 total milliseconds
>
> Regards,
> Aaron Donovan
>
>

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: indexreader refresh

2006-01-04 Thread Robert Engels
I proposed and posted a patch for this long ago. Only thing missing would be
some sort of reference counting for segments (rather than the 'stayopen'
flag).

  /**
   * reopens the IndexReader, possibly reusing the segments for greater
efficiency. The original IndexReader instance
   * is closed, and the reference is no longer valid
   *
   * @return the new IndexReader
   */
  public IndexReader reopen() throws IOException {
  if(!(this instanceof MultiReader))
  return IndexReader.open(directory);

  MultiReader mr = (MultiReader) this;

  final IndexReader[] oldreaders = mr.getReaders();
  final boolean[] stayopen = new boolean[oldreaders.length];

      synchronized (directory) {  // in- & inter-process sync
  return (IndexReader)new Lock.With(
  directory.makeLock(IndexWriter.COMMIT_LOCK_NAME),
  IndexWriter.COMMIT_LOCK_TIMEOUT) {
  public Object doBody() throws IOException {
SegmentInfos infos = new SegmentInfos();
infos.read(directory);
if (infos.size() == 1) {  // index is optimized
  return new SegmentReader(infos, infos.info(0),
closeDirectory);
} else {
  IndexReader[] readers = new IndexReader[infos.size()];
  for (int i = 0; i < infos.size(); i++) {
      for (int j = 0; j < oldreaders.length; j++) {
          ...

-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Wednesday, January 04, 2006 12:30 PM
To: java-dev@lucene.apache.org
Subject: Re: indexreader refresh


Amol Bhutada wrote:
> If I have a reader and searcher on a indexdata folder and another
> indexwriter writing documents to the same indexdata folder, do I need to
> close existing reader and searcher and create new so that newly indexed
> data comes into search effect?

[ moved from user to dev list]

This is a frequent request.  While opening an all-new IndexReader is
effective, it is not always efficient.  It might be nice to support a
more efficient means of re-opening an index.

Perhaps we should add a few new IndexReader methods, as follows:

/** If reader's index has not been changed, return
   * reader, otherwise return a new {@link IndexReader}
   * reading the latest version of the index
   */
public static IndexReader open(IndexReader reader) throws IOException {
   if (reader.isCurrent()) {
 // unchanged: return existing
 return reader;
   }

   // try to incrementally create new reader
   IndexReader result = reader.reopen(reader);
   if (result != null) {
 return result;
   }

   // punt, opening an entirely new reader
   return IndexReader.open(reader.directory());
}

/** Return a new IndexReader reading the current state
   * of the index, re-using reader's resources, or null if this
   * is not possible.
   */
protected IndexReader reopen(IndexReader reader) {
   return null;
}

Then we can add implementations of reopen to SegmentReader and
MultiReader that attempt to re-use the existing, already opened
segments.  This should mostly be simple, but there are a few tricky
issues, like detecting whether an already-open segment has had
deletions, and deciding when to close obsolete segments.
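
For what it's worth, a minimal sketch of how a caller might use the proposed
static open(IndexReader) to keep a searcher fresh. The open(IndexReader)
overload is the proposed API above, not something that exists today;
SearcherHolder is a made-up name, and the close() call glosses over exactly
the "when to close obsolete segments" problem - real code would need some
reference counting first:

import java.io.IOException;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;

public class SearcherHolder {
    private IndexReader reader;
    private IndexSearcher searcher;

    public SearcherHolder(IndexReader reader) {
        this.reader = reader;
        this.searcher = new IndexSearcher(reader);
    }

    public synchronized IndexSearcher getSearcher() throws IOException {
        IndexReader latest = IndexReader.open(reader);  // proposed API: returns reader if unchanged
        if (latest != reader) {
            reader.close();          // unsafe if another thread still searches the old reader
            reader = latest;
            searcher = new IndexSearcher(reader);
        }
        return searcher;
    }
}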

Does this sound like it would make a good addition?  Does someone want
to volunteer to implement it?

Doug

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Save to database...

2006-01-05 Thread Robert Engels
There are impls in the contrib area that do not need to retrieve the entire index
from the db in order to query (they store blocks of files in the db,
instead of blocks on disk).

I also developed an implementation that did not use blocks but rather a
custom index persistence mechanism.

There can be several advantages with this:

1. centralized/automated backup
2. db usually can perform more optimized caching of blocks
3. use multiple "nearly diskless" query servers with a single database
4. transactional updates to the index are possible (although index writes
are supposedly transactional in std lucene, I have encountered some index
corruption with hard failures - I think it is because the files are not
"synced" when flushed/closed).


-Original Message-
From: Steven Pannell [mailto:[EMAIL PROTECTED]
Sent: Thursday, January 05, 2006 2:22 AM
To: 'java-dev@lucene.apache.org'
Subject: RE: Save to database...


Look in the old archive mails and you will find a few people have tried this
out.  There is even some code around.

I have tried this, and to be honest it does not make much sense. The real
problem is performance: it just takes too long to keep getting the index from
the database for performing the query.  In the end I realised this was not
the way to go and just use the standard filesystem and memory cache for
this. Much faster.

The only reason for doing this would be if you could not easily reproduce
the index and thus wanted to make sure you had some kind of permanent copy.


Out of interest why do you want to store in the DB?

Steve.



-Original Message-
From: Aditya Liviandi [mailto:[EMAIL PROTECTED]
Sent: 05 January 2006 07:15
To: java-dev@lucene.apache.org
Subject: Save to database...



How would I go about altering lucene so that the index is saved to a
database instead?

(or has it been done? Wouldn't want to reinvent the wheel there.)



--
This email is confidential and may be privileged.  If you are not the
intended recipient, please delete it and notify us immediately. Please do
not copy or use it for any purpose, or disclose its contents to any other
person. Thank you.
--

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Save to database...

2006-01-05 Thread Robert Engels
Yes, that is what I did in the "custom" persistence.

There are some not so trivial problems to solve though. Normally you cannot
seek with BLOBs (a lot of JDBC/db impls will read the entire BLOB in all
cases) so efficiently reading the postings can be difficult, although you
can store the postings using a startdoc, enddoc, postings schema, which will
allow skipTo() to function.
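
To make the startdoc/enddoc idea concrete, here is a hypothetical JDBC read
path (the table and column names are invented for the sketch) in which
skipTo(target) only has to fetch the one block whose range covers the target
document:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class PostingsRangeReader {
    // Returns the encoded postings block covering targetDoc for the given term,
    // or null if the term has no postings in that range.
    public byte[] blockFor(Connection con, String term, int targetDoc) throws SQLException {
        PreparedStatement ps = con.prepareStatement(
            "SELECT postings FROM term_postings " +
            "WHERE term = ? AND startdoc <= ? AND enddoc >= ?");
        try {
            ps.setString(1, term);
            ps.setInt(2, targetDoc);
            ps.setInt(3, targetDoc);
            ResultSet rs = ps.executeQuery();
            return rs.next() ? rs.getBytes(1) : null;
        } finally {
            ps.close();
        }
    }
}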

The biggest problem is updating the postings efficiently - if you allow the
reuse of internal document numbers. If you don't allow a reuse, then you
need to periodically (RARELY!) compact the database while it is offline.
This will be a TIME CONSUMING process.

The Lucene standard index format is very efficient because it does not update,
but rather builds new indexes, and searches the indexes together.

-Original Message-
From: Aditya Liviandi [mailto:[EMAIL PROTECTED]
Sent: Thursday, January 05, 2006 2:43 AM
To: java-dev@lucene.apache.org; [EMAIL PROTECTED]
Subject: RE: Save to database...



What I meant is instead of saving the indexes into files, could I save
them as tables in a database?

I would think there would be a FieldNames table, a TermDictionary table,
a TermFrequency table and a TermPosition table (at the least)...

Has this been done before?


-Original Message-----
From: Robert Engels [mailto:[EMAIL PROTECTED]
Sent: Thursday, January 05, 2006 4:36 PM
To: java-dev@lucene.apache.org
Subject: RE: Save to database...

There are impl in the contrib that do not need to retrieve the entire
index
from the db in order to query (there store blocks of files in the db,
instead of blocks on disk).

I also developed an implementation that did not use blocks but rather a
custom index persistence mechanism.

There can be several advantages with this:

1. centralized/automated backup
2. db usually can perform more optimized caching of blocks
3. use multiple "nearly diskless" query servers with a single database
4. transactional updates to the index are possible (although index
writes
are supposedly transactional in std lucene, I have encountered some
index
corruption with hard failures - I think it is because the files are not
"synced" when flushed/closed).


-Original Message-
From: Steven Pannell [mailto:[EMAIL PROTECTED]
Sent: Thursday, January 05, 2006 2:22 AM
To: 'java-dev@lucene.apache.org'
Subject: RE: Save to database...


Look in the old archive mails and you will find a few people have tried
this
out.  There is even some code around.

I have tried this, and to be honest it does not make much sense. The
real
problem is performance it just takes too long to keep getting the index
from
the database for performing the query.  In the end I realised this was
not
the way to go and just use the standard filesystem and memory cache for
this. Much faster.

The only reason for doing this would be if you could not easily
reproduce
the index and thus wanted to make sure you had some kind of permananet
copy.


Out of interest why do you want to store in the DB?

Steve.



-Original Message-
From: Aditya Liviandi [mailto:[EMAIL PROTECTED]
Sent: 05 January 2006 07:15
To: java-dev@lucene.apache.org
Subject: Save to database...



How would I go about altering lucene so that the index is saved to a
database instead?

(or has it been done? Wouldn't want to reinvent the wheel there.)



--
This email is confidential and may be privileged.  If you are not the
intended recipient, please delete it and notify us immediately. Please
do
not copy or use it for any purpose, or disclose its contents to any
other
person. Thank you.
--

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


--
This email is confidential and may be privileged.  If you are not the
intended recipient, please delete it and notify us immediately. Please do
not copy or use it for any purpose, or disclose its contents to any other
person. Thank you.
--


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Save to database...

2006-01-05 Thread Robert Engels
Well, I think they are supposed to be - that is the reason the segments file
is written last. If the segments file is not updated the only problem
becomes orphaned files, but the index SHOULD still be consistent.

The Lucene index format is quite ingenious in the simplicity of how it does
this.

As I said, I think the underlying reason why it does not work sometimes is
that a hard crash (computer failure) occurs before the OS has written ALL
the in-memory buffers to disk (so the segments file might be committed but
the earlier segments are not) - corruption. I think writing the files in 'sync'
mode would prevent this (at the cost of performance).
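
The 'sync' idea boils down to something like the following sketch.
SyncedFileWriter is a hypothetical class, not what FSDirectory does today, and
whether the cost is acceptable is exactly the open question:

import java.io.IOException;
import java.io.RandomAccessFile;

public class SyncedFileWriter {
    private final RandomAccessFile file;

    public SyncedFileWriter(String path) throws IOException {
        this.file = new RandomAccessFile(path, "rw");
    }

    public void write(byte[] buf, int len) throws IOException {
        file.write(buf, 0, len);
    }

    public void close() throws IOException {
        file.getFD().sync();   // block until the OS has flushed its buffers to disk
        file.close();
    }
}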

-Original Message-
From: Andi Vajda [mailto:[EMAIL PROTECTED]
Sent: Thursday, January 05, 2006 10:32 AM
To: java-dev@lucene.apache.org; [EMAIL PROTECTED]
Subject: RE: Save to database...



On Thu, 5 Jan 2006, Robert Engels wrote:

> 4. transactional updates to the index are possible (although index writes
> are supposedly transactional in std lucene, I have encountered some index
> corruption with hard failures - I think it is because the files are not
> "synced" when flushed/closed).

They may be transactional but they're not ACID unless the underlying
Directory
implementation is. As far as I can tell, FSDirectory is not.

Andi..

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: [jira] Commented: (LUCENE-140) docs out of order

2006-01-10 Thread Robert Engels
Possibly "virus scanner" software interfering with the writing/renaming/copying 
of the index files???

-Original Message-
From: Doug Cutting (JIRA) [mailto:[EMAIL PROTECTED]
Sent: Tuesday, January 10, 2006 11:29 AM
To: java-dev@lucene.apache.org
Subject: [jira] Commented: (LUCENE-140) docs out of order


[ 
http://issues.apache.org/jira/browse/LUCENE-140?page=comments#action_12362354 ] 

Doug Cutting commented on LUCENE-140:
-

File corruption could cause this.  Please look in your system logs to see if 
there are any reports of problems accessing the drive that stores this index.

> docs out of order
> -
>
>  Key: LUCENE-140
>  URL: http://issues.apache.org/jira/browse/LUCENE-140
>  Project: Lucene - Java
> Type: Bug
>   Components: Index
> Versions: unspecified
>  Environment: Operating System: Linux
> Platform: PC
> Reporter: legez
> Assignee: Lucene Developers
>  Attachments: bug23650.txt
>
> Hello,
>   I can not find out, why (and what) it is happening all the time. I got an
> exception:
> java.lang.IllegalStateException: docs out of order
> at
> org.apache.lucene.index.SegmentMerger.appendPostings(SegmentMerger.java:219)
> at
> org.apache.lucene.index.SegmentMerger.mergeTermInfo(SegmentMerger.java:191)
> at
> org.apache.lucene.index.SegmentMerger.mergeTermInfos(SegmentMerger.java:172)
> at 
> org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:135)
> at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:88)
> at 
> org.apache.lucene.index.IndexWriter.mergeSegments(IndexWriter.java:341)
> at org.apache.lucene.index.IndexWriter.optimize(IndexWriter.java:250)
> at Optimize.main(Optimize.java:29)
> It happens either in 1.2 and 1.3rc1 (anyway what happened to it? I can not 
> find
> it neither in download nor in version list in this form). Everything seems 
> OK. I
> can search through index, but I can not optimize it. Even worse after this
> exception every time I add new documents and close IndexWriter new segments is
> created! I think it has all documents added before, because of its size.
> My index is quite big: 500.000 docs, about 5gb of index directory.
> It is _repeatable_. I drop index, reindex everything. Afterwards I add a few
> docs, try to optimize and receive above exception.
> My documents' structure is:
>   static Document indexIt(String id_strony, Reader reader, String 
> data_wydania,
> String id_wydania, String id_gazety, String data_wstawienia)
> {
> Document doc = new Document();
> doc.add(Field.Keyword("id", id_strony ));
> doc.add(Field.Keyword("data_wydania", data_wydania));
> doc.add(Field.Keyword("id_wydania", id_wydania));
> doc.add(Field.Text("id_gazety", id_gazety));
> doc.add(Field.Keyword("data_wstawienia", data_wstawienia));
> doc.add(Field.Text("tresc", reader));
> return doc;
> }
> Sincerely,
> legez

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: [jira] Created: (LUCENE-487) Database as a lucene index target

2006-01-11 Thread Robert Engels
Since no code has been posted, I'll just ask the question...

Does your implementation use the Blob "seek" functions when reading and 
writing, or does it read/write the blob in its entirety.

If it is the latter, your solution is only acceptable for the smallest of 
Lucene indexes. If it is the former, it would be interesting to see the results 
using various db & drivers, as many JDBC blob impls do not support this 
functionality, and read/write the blob completely behind the scenes.
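
In JDBC terms the distinction is roughly the following (a sketch only; whether
getBytes() is actually a server-side range read rather than a full fetch
depends entirely on the driver):

import java.sql.Blob;
import java.sql.SQLException;

public class BlobSeekExample {
    // Range ("seek") read: ask the driver for just len bytes at a file offset.
    // Blob positions are 1-based, so the 0-based file offset is shifted by one.
    public static byte[] readAt(Blob blob, long offset, int len) throws SQLException {
        return blob.getBytes(offset + 1, len);
    }
}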

-Original Message-
From: Amir Kibbar (JIRA) [mailto:[EMAIL PROTECTED]
Sent: Wednesday, January 11, 2006 12:35 PM
To: java-dev@lucene.apache.org
Subject: [jira] Created: (LUCENE-487) Database as a lucene index target


Database as a lucene index target
-

 Key: LUCENE-487
 URL: http://issues.apache.org/jira/browse/LUCENE-487
 Project: Lucene - Java
Type: New Feature
  Components: Store  
Versions: 1.9
 Environment: MySql (version 4.1 an up), Oracle (version 8.1.7 and up)
Reporter: Amir Kibbar
Priority: Minor


I've written an extension for the Directory object called DBDirectory, that 
allows you to read and write a Lucene index to a database instead of a file 
system.

This is done using blobs. Each blob represents a "file". Also, each blob has a 
name which is equivalent to the filename and a prefix, which is equivalent to a 
directory on a file system. This allows you to create multiple Lucene indexes 
in a single database schema.

The solution uses two tables:
LUCENE_INDEX - which holds the index files as blobs
LUCENE_LOCK - holds the different locks

Attached is my proposed solution. This solution is still very basic, but it 
does the job.
The solution supports Oracle and mysql

To use this solution:

1. Place the files:
- DBDirectory in src/java/org/apache/lucene/store
- TestDBIndex in src/test/org/apache/lucene/index
- objects-mysql.sql in src/db
- objects-oracle.sql in src/db

2. Edit the parameters for the database connection in TestDBIndex

3. Create the database tables using the objects-mysql.sql script (assuming 
you're using mysql)

4. Build Lucene

5. Run TestDBIndex with the database driver in the classpath

I've tested the solution on mysql, but it *should* work on Oracle, I will test 
that in a few days.

Amir

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: [jira] Created: (LUCENE-487) Database as a lucene index target

2006-01-11 Thread Robert Engels
You are better off using the one that has already been contributed then. It
uses JDBC and breaks the file into blocks. Much more efficient. Sorry to say
but your solution/code is inferior to what already exists.

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Behalf Of
Amir Kibbar
Sent: Wednesday, January 11, 2006 1:37 PM
To: java-dev@lucene.apache.org; [EMAIL PROTECTED]
Subject: Re: [jira] Created: (LUCENE-487) Database as a lucene index
target


Robert,

My solution is the latter. If it is possible to do it using blob seek, I
will attempt to do it next.

Amir

On 1/11/06, Robert Engels <[EMAIL PROTECTED]> wrote:
>
> Since no code has been posted, I'll just ask the question...
>
> Does your implementation use the Blob "seek" functions when reading and
> writing, or does it read/write the blob in its entirety.
>
> If it is the latter, your solution is only acceptable for the smallest of
> Lucene indexes. If it is the former, it would be interesting to see the
> results using various db & drivers, as many JDBC blob impls do not support
> this functionality, and read/write the blob completely behind the scenes.
>
> -Original Message-
> From: Amir Kibbar (JIRA) [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, January 11, 2006 12:35 PM
> To: java-dev@lucene.apache.org
> Subject: [jira] Created: (LUCENE-487) Database as a lucene index target
>
>
> Database as a lucene index target
> -
>
>  Key: LUCENE-487
>  URL: http://issues.apache.org/jira/browse/LUCENE-487
>  Project: Lucene - Java
> Type: New Feature
>   Components: Store
> Versions: 1.9
> Environment: MySql (version 4.1 an up), Oracle (version 8.1.7 and up)
> Reporter: Amir Kibbar
> Priority: Minor
>
>
> I've written an extension for the Directory object called DBDirectory,
> that allows you to read and write a Lucene index to a database instead of
a
> file system.
>
> This is done using blobs. Each blob represents a "file". Also, each blob
> has a name which is equivalent to the filename and a prefix, which is
> equivalent to a directory on a file system. This allows you to create
> multiple Lucene indexes in a single database schema.
>
> The solution uses two tables:
> LUCENE_INDEX - which holds the index files as blobs
> LUCENE_LOCK - holds the different locks
>
> Attached is my proposed solution. This solution is still very basic, but
> it does the job.
> The solution supports Oracle and mysql
>
> To use this solution:
>
> 1. Place the files:
> - DBDirectory in src/java/org/apache/lucene/store
> - TestDBIndex in src/test/org/apache/lucene/index
> - objects-mysql.sql in src/db
> - objects-oracle.sql in src/db
>
> 2. Edit the parameters for the database connection in TestDBIndex
>
> 3. Create the database tables using the objects-mysql.sql script (assuming
> you're using mysql)
>
> 4. Build Lucene
>
> 5. Run TestDBIndex with the database driver in the classpath
>
> I've tested the solution on mysql, but it *should* work on Oracle, I will
> test that in a few days.
>
> Amir
>
> --
> This message is automatically generated by JIRA.
> -
> If you think it was sent incorrectly contact one of the administrators:
>http://issues.apache.org/jira/secure/Administrators.jspa
> -
> For more information on JIRA, see:
>http://www.atlassian.com/software/jira
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Filter

2006-01-26 Thread Robert Engels
I think the interface I proposed is simpler and handles more cases easily.

interface SearchFilter {
boolean include(int doc);
}

It seems your interface requires that the SearchFilter know all of the query
results beforehand. I am not sure this works well with the partial result
sets that Lucene supports.

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] Behalf Of Chris Hostetter
Sent: Thursday, January 26, 2006 1:09 PM
To: java-dev@lucene.apache.org
Subject: Re: Filter



The subject of revamping the Filter API to support more compact filter
representations has come up in the past ... At least one patch comes to
mind that helps with the issue...

   https://issues.apache.org/jira/browse/LUCENE-328

...i'm not intimately familiar with that code, but if i recall correctly
from the last time i read it, it doesn't propose any actual API changes
just some utilities to reduce memory usage.

Reading your post has me thinking about this whole issue again,
particularly the subject of Filters that are straight forward enough they
could be implimented as simple iterators with very little state and what
API changes could be made to support the interface you describe and still
be backwards compatible.

One thing that comes to mind (that i don't remember suggesting before, but
perhaps someone else has suggested it before) is that since Filter is an
bastract class which people arecurrently required to subclass, we could
follow a migration path something like this...

  1) add a SearchFilter interface like the one you describe to the core
 code base
  2) add the following method declaration to the Filter class...
public SearchFilter getSearchFilter(IndexReader) throws IOException
 ...implement this method by calling bits, and returning an instance
 of a thin inner class that wraps the BitSet
  3) indicate that Filter.bits() is deprecated.
  4) change all existing calls to Filter.bits() in the core lucene code
 base to call Filter.getSearchFilter and do whatever iterating is
     necessary.
  5) gradually reimplement all of the concrete instances of Filter in
 the core lucene code base so they override the getSearchFilter method
 with something that returns a more "iterator" style SearchFilter,
     and implement their bits() method to use the SearchFilter to build up
 the bit set if clients call it directly.
  6) wait a suitable amount of time.
  7) remove Filter.bits() and all of the concrete implementations from the
 lucene core.

...i think that would be a fairly straightforward and practical way to
execute such a change.  The big question in my mind is what the
"SearchFilter" interface should look like.  what you propose is along the
usage lines of "iterate over your ScoreDocs, and for each one test it
against the filter" ... but i'm not convinced that it wouldn't make more
sense to say "ask the filter what the next viable doc is, now score it",
ala...

  public interface SearchFilter {
  /** returns doc ids that pass the filter, in increasing order.
   * returns 0 once there are no more docs.
   */
      int getNextFilteredDoc();
  }
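
For comparison, roughly what the "thin inner class that wraps the BitSet" from
step 2 could look like against that iterator-style interface (nothing here is
committed API; note that returning 0 when exhausted, as in the draft above,
would clash with doc 0 being a legal id, so this sketch returns -1):

import java.util.BitSet;

interface SearchFilter {
    /** doc ids that pass the filter, in increasing order; -1 once exhausted. */
    int getNextFilteredDoc();
}

class BitSetSearchFilter implements SearchFilter {
    private final BitSet bits;
    private int next = 0;

    BitSetSearchFilter(BitSet bits) {
        this.bits = bits;
    }

    public int getNextFilteredDoc() {
        int doc = bits.nextSetBit(next);
        if (doc < 0) {
            return -1;            // no more set bits
        }
        next = doc + 1;
        return doc;
    }
}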


thoughts?


: Date: Thu, 26 Jan 2006 14:35:44 +0100
: From: Morus Walter <[EMAIL PROTECTED]>
: Reply-To: java-dev@lucene.apache.org
: To: java-dev@lucene.apache.org
: Subject: Filter
:
: Hi,
:
: I would like to suggest a more general filter interface which could be
: added as an alternative to the current bitset filters.
: (Replacing the bitset filters would only be possible if api changes were
: acceptable).
:
: While bitset based filters are useful in many use cases the restriction
: of filters to using bitsets prevents other solutions.
: Especially since the introduction of field caches for sorting it's easy
: to implement filters directly based on field values.
:
: So I suggest to add a general filter interface that requires a filter
: just to provide a filter-method that takes a ScoreDoc and returns
: true or false if the document passes the filter or is rejected.
: This would be basically
: public interface SearchFilter {
: boolean filter(ScoreDoc doc);
: }
:
: Thus a filter could be implemented using a bitset or it could get a
: field cache and check the documents value based on that or in any
: other way.
: Providing a ScoreDoc to the filter (instead of the document id alone)
: allows to write filters that modify the score instead of
: accepting/rejecting documents.
:
: Use cases include
: - Filtering based on document values
:   E.g. a date filter. This can be done by the current bitset based
: filters but if the date ranges vary from query to query and the index
: change rate is low, using a field cache on the dates seems better than
: creating a bitset for each range.
: - Modifying the score
:   E.g. a scoring that degrades the score based on a date field to prefer
: new documents over old ones. This is not the same as sorting by date
: since an old but goo

RE: How do I send search query to Multiple search Indexes ?

2006-02-03 Thread Robert Engels
Read the book "Lucene in Action".

-Original Message-
From: Vikas Khengare [mailto:[EMAIL PROTECTED]
Sent: Friday, February 03, 2006 12:14 AM
To: java-user@lucene.apache.org; java-dev@lucene.apache.org;
java-commits@lucene.apache.org
Subject: How do I send search query to Multiple search Indexes ?


 
> Hi Friends
>  
> How do I send one search query to multiple search Indexes which 
> are on remote machines ?
>  
> Which Technology will help me (AJAX / simple Servlet) ?
>  
> Thanks... in advance
>  
> I hope I will get result from experts like you
>  
> Best Regards
> [ [EMAIL PROTECTED] ]
>
>   

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: [jira] Created: (LUCENE-529) TermInfosReader and other + instance ThreadLocal => transient/odd memory leaks => OutOfMemoryException

2006-03-22 Thread Robert Engels
There is only a single TermInfoReader per index. In order to share this 
instance with multiple threads, and avoid the overhead of creating new 
enumerators for each request, the enumerator for the thread is stored in a 
thread local. Normally, in a server application, threads are pooled, so new 
threads are not constantly created and destroyed, so the memory leak is 
insignificant.

The same reasoning holds true for the SegmentReader class.
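
The pattern being described, boiled down (the names are illustrative, not the
actual TermInfosReader fields):

public class SharedSegmentReader {
    // one shared reader instance per segment; each searching thread caches its own
    // enumerator here so concurrent searches neither lock nor re-create it
    private final ThreadLocal enumerators = new ThreadLocal();

    Enumerator getEnumerator() {
        Enumerator e = (Enumerator) enumerators.get();
        if (e == null) {
            e = new Enumerator();   // relatively expensive to build, hence the cache
            enumerators.set(e);
        }
        return e;
    }

    static class Enumerator { /* stands in for the per-thread term enumerator */ }
}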


-Original Message-
From: Andy Hind (JIRA) [mailto:[EMAIL PROTECTED]
Sent: Wednesday, March 22, 2006 11:07 AM
To: java-dev@lucene.apache.org
Subject: [jira] Created: (LUCENE-529) TermInfosReader and other +
instance ThreadLocal => transient/odd memory leaks =>
OutOfMemoryException


TermInfosReader and other + instance ThreadLocal => transient/odd memory leaks 
=>  OutOfMemoryException 


 Key: LUCENE-529
 URL: http://issues.apache.org/jira/browse/LUCENE-529
 Project: Lucene - Java
Type: Bug
  Components: Index  
Versions: 1.9
 Environment: Lucene 1.4.3 with 1.5.0_04 JVM or newer..will aplpy to 1.9 
code 
Reporter: Andy Hind


TermInfosReader uses an instance level ThreadLocal for enumerators.
This is a transient/odd memory leak in lucene 1.4.3-1.9 and applies to current 
JVMs, 
not just an old JVM issue as described in the finalizer of the 1.9 code.

There is also an instance level thread local in SegmentReaderwhich will 
have the same issue.
There may be other uses which also need to be fixed.

I don't understand the intended use for these variables.however

Each ThreadLocal has its own hashcode used for look up, see the ThreadLocal 
source code. Each instance of TermInfosReader will be creating an instance of 
the thread local. All this does is create an instance variable on each thread 
when it accesses the thread local. Setting it to null in the finaliser will set 
it to null on one thread, the finalizer thread, where it has never been 
created.  There is no point to this :-(

I assume there is a good concurrency reason why an instance variable can not be 
used...

I have not used multi-threaded searching, but I have used a lot of threads each 
making searchers and searching.
1.4.3 has a clear memory leak caused by this thread local. This use case above 
is definitely solved by setting the thread local to null in the close(). This 
at least has a chance of being on the correct thread :-) 
I know reusing Searchers would help but that is my choice and I will get to 
that later  

Now you wnat to know why

Thread locals are stored in a table of entries. Each entry is *weak reference* 
to the key (Here the TermInfosReader instance)  and a *simple reference* to the 
thread local value. When the instance is GCed its key becomes null. 
This is now a stale entry in the table.
Stale entries are cleared up in an ad hoc way and until they are cleared up the 
value will not be garbage collected.
Until the instance is GCed it is a valid key and its presence may cause the 
table to expand.
See the ThreadLocal code.

So if you have lots of threads, all creating thread locals rapidly, you can get 
each thread holding a large table of thread locals which all contain many stale 
entries and preventing some objects from being garbage collected. 
The limited GC of the thread local table is not enough to save you from running 
out of memory.  

Summary:

- remove finalizer()
- set the thread local to null in close() 
  - values will be available for gc 

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: [jira] Created: (LUCENE-529) TermInfosReader and other + instance ThreadLocal => transient/odd memory leaks => OutOfMemoryException

2006-03-22 Thread Robert Engels
There was a small mistake - there is a single TermInfoReader per segment.

-Original Message-
From: Robert Engels [mailto:[EMAIL PROTECTED]
Sent: Wednesday, March 22, 2006 11:37 AM
To: java-dev@lucene.apache.org
Subject: RE: [jira] Created: (LUCENE-529) TermInfosReader and other +
instance ThreadLocal => transient/odd memory leaks =>
OutOfMemoryException


There is only a single TermInfoReader per index. In order to share this 
instance with multiple threads, and avoid the overhead of creating new 
enumerators for each request, the enumerator for the thread is stored in a 
thread local. Normally, in a server application, threads are pooled, so new 
threads are not constantly created and destroyed, so the memory leak is 
insiginificant.

The same reasoning holds true for the SegmentReader class.


-Original Message-
From: Andy Hind (JIRA) [mailto:[EMAIL PROTECTED]
Sent: Wednesday, March 22, 2006 11:07 AM
To: java-dev@lucene.apache.org
Subject: [jira] Created: (LUCENE-529) TermInfosReader and other +
instance ThreadLocal => transient/odd memory leaks =>
OutOfMemoryException


TermInfosReader and other + instance ThreadLocal => transient/odd memory leaks 
=>  OutOfMemoryException 


 Key: LUCENE-529
 URL: http://issues.apache.org/jira/browse/LUCENE-529
 Project: Lucene - Java
Type: Bug
  Components: Index  
Versions: 1.9
 Environment: Lucene 1.4.3 with 1.5.0_04 JVM or newer..will aplpy to 1.9 
code 
Reporter: Andy Hind


TermInfosReader uses an instance level ThreadLocal for enumerators.
This is a transient/odd memory leak in lucene 1.4.3-1.9 and applies to current 
JVMs, 
not just an old JVM issue as described in the finalizer of the 1.9 code.

There is also an instance level thread local in SegmentReaderwhich will 
have the same issue.
There may be other uses which also need to be fixed.

I don't understand the intended use for these variables.however

Each ThreadLocal has its own hashcode used for look up, see the ThreadLocal 
source code. Each instance of TermInfosReader will be creating an instance of 
the thread local. All this does is create an instance variable on each thread 
when it accesses the thread local. Setting it to null in the finaliser will set 
it to null on one thread, the finalizer thread, where it has never been 
created.  There is no point to this :-(

I assume there is a good concurrency reason why an instance variable can not be 
used...

I have not used multi-threaded searching, but I have used a lot of threads each 
making searchers and searching.
1.4.3 has a clear memory leak caused by this thread local. This use case above 
is definitely solved by setting the thread local to null in the close(). This 
at least has a chance of being on the correct thread :-) 
I know reusing Searchers would help but that is my choice and I will get to 
that later  

Now you wnat to know why

Thread locals are stored in a table of entries. Each entry is *weak reference* 
to the key (Here the TermInfosReader instance)  and a *simple reference* to the 
thread local value. When the instance is GCed its key becomes null. 
This is now a stale entry in the table.
Stale entries are cleared up in an ad hoc way and until they are cleared up the 
value will not be garbage collected.
Until the instance is GCed it is a valid key and its presence may cause the 
table to expand.
See the ThreadLocal code.

So if you have lots of threads, all creating thread locals rapidly, you can get 
each thread holding a large table of thread locals which all contain many stale 
entries and preventing some objects from being garbage collected. 
The limited GC of the thread local table is not enough to save you from running 
out of memory.  

Summary:

- remove finalizer()
- set the thread local to null in close() 
  - values will be available for gc 

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



query parsing

2006-03-22 Thread Robert Engels
Using lucene 1.4.3, if I use the query

+cat AND -dog

it parses to

+cat -dog

and works correctly.

If I use

(+cat) AND (-dog)

it parses to

+(+cat) +(-dog)

and returns no results.

Is this a known issue?

RE: [jira] Created: (LUCENE-529) TermInfosReader and other + instance ThreadLocal => transient/odd memory leaks => OutOfMemoryException

2006-03-22 Thread Robert Engels
Creating and destroying threads is one of the worst performing operations,
and should be avoided at ALMOST all costs.

I do not see this problem in my server impl of Lucene, internally
multithreaded, and accessed via multiple threads from a Tomcat server. I
have to assume many (most?) users of Lucene are doing so in a multithreaded
server environment.

I reviewed the bugs in java.sun related to memory leaks with ThreadLocal's.
I don't think any of them apply in this case.

Maybe you could provide a simplified ThreadLocal testcase that demonstrates
the 'out of memory' condition?

Are you sure that you do not have a modified version of Lucene that is
somehow maintaining a reference back to the ThreadLocal from the ThreadLocal's
value, as this is a known JDK issue
(http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6254531). I don't see this
bug as being applicable to the 1.9.1 or 1.4.3 code.

Did you try running your server using 1.4.3? (our server code is based off
the 1.4.3 codeset at this time).


-Original Message-
From: Andy Hind [mailto:[EMAIL PROTECTED]
Sent: Wednesday, March 22, 2006 12:48 PM
To: java-dev@lucene.apache.org; [EMAIL PROTECTED]
Subject: RE: [jira] Created: (LUCENE-529) TermInfosReader and other +
instance ThreadLocal => transient/odd memory leaks =>
OutOfMemoryException



For every IndexReader that is opened
- there is one SegmentReader for every segment in the index
   - with its thread local
   - for each of these there is a TermInfosReader + its thread local.

So I get 2 * (no of index segments) thread locals.

I am creating index readers for a main index and transactional updates
and layering the two. At the moment this is an issue: under stress
testing, using tomcat with thread pooling and a pretty big changing
index, left running for a few hours, it blows up.

Thread locals are also used in other areas of the app.

It would be better if threads were created and destroyed!

It is certainly not insignificant for me and gives a JVM that creeps up
in size pretty steadily over time.

I have fixed this issue locally in the code and it works.

Regards

Andy




-----Original Message-
From: Robert Engels [mailto:[EMAIL PROTECTED]
Sent: 22 March 2006 17:46
To: java-dev@lucene.apache.org
Subject: RE: [jira] Created: (LUCENE-529) TermInfosReader and other +
instance ThreadLocal => transient/odd memory leaks =>
OutOfMemoryException

There was a small mistake - there is a single TermInfoReader per
segment.

-Original Message-
From: Robert Engels [mailto:[EMAIL PROTECTED]
Sent: Wednesday, March 22, 2006 11:37 AM
To: java-dev@lucene.apache.org
Subject: RE: [jira] Created: (LUCENE-529) TermInfosReader and other +
instance ThreadLocal => transient/odd memory leaks =>
OutOfMemoryException


There is only a single TermInfoReader per index. In order to share this
instance with multiple threads, and avoid the overhead of creating new
enumerators for each request, the enumerator for the thread is stored in
a thread local. Normally, in a server application, threads are pooled,
so new threads are not constantly created and destroyed, so the memory
leak is insiginificant.

The same reasoning holds true for the SegmentReader class.


-Original Message-
From: Andy Hind (JIRA) [mailto:[EMAIL PROTECTED]
Sent: Wednesday, March 22, 2006 11:07 AM
To: java-dev@lucene.apache.org
Subject: [jira] Created: (LUCENE-529) TermInfosReader and other +
instance ThreadLocal => transient/odd memory leaks =>
OutOfMemoryException


TermInfosReader and other + instance ThreadLocal => transient/odd memory
leaks =>  OutOfMemoryException



 Key: LUCENE-529
 URL: http://issues.apache.org/jira/browse/LUCENE-529
 Project: Lucene - Java
Type: Bug
  Components: Index
Versions: 1.9
 Environment: Lucene 1.4.3 with 1.5.0_04 JVM or newer..will aplpy to
1.9 code
Reporter: Andy Hind


TermInfosReader uses an instance level ThreadLocal for enumerators.
This is a transient/odd memory leak in lucene 1.4.3-1.9 and applies to
current JVMs,
not just an old JVM issue as described in the finalizer of the 1.9 code.

There is also an instance level thread local in SegmentReaderwhich
will have the same issue.
There may be other uses which also need to be fixed.

I don't understand the intended use for these variables.however

Each ThreadLocal has its own hashcode used for look up, see the
ThreadLocal source code. Each instance of TermInfosReader will be
creating an instance of the thread local. All this does is create an
instance variable on each thread when it accesses the thread local.
Setting it to null in the finaliser will set it to null on one thread,
the finalizer thread, where it has never been created.  There is no
point to this :-(

I assume there is a good concurrency reason why an instance variable 

RE: query parsing

2006-03-22 Thread Robert Engels
Any suggestions on what to do then, as the following query exhibits the same 
behavior

(+cat) (-dog)

Due to the implied AND. Removing the parentheses allows it to work. It doesn't
seem that adding parentheses in this case should cause the query to fail???

Doesn't it suggest that there is a bug where the BooleanQuery scorer is not
handling the case of a REQUIRED clause that is a BooleanQuery consisting of
a single prohibited boolean clause?
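
One way to check that, separating the parser from the scorer, would be to
build the same shape programmatically. The field name "body" is arbitrary, and
this uses the 1.4.x-era add(query, required, prohibited) signature:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class NestedProhibitedQuery {
    // builds the same structure the parser produces for "(+cat) AND (-dog)",
    // i.e. +(+cat) +(-dog)
    public static BooleanQuery build() {
        BooleanQuery catGroup = new BooleanQuery();
        catGroup.add(new TermQuery(new Term("body", "cat")), true, false);   // +cat

        BooleanQuery dogGroup = new BooleanQuery();
        dogGroup.add(new TermQuery(new Term("body", "dog")), false, true);   // -dog

        BooleanQuery outer = new BooleanQuery();
        outer.add(catGroup, true, false);   // +(+cat)
        outer.add(dogGroup, true, false);   // +(-dog): the clause in question
        return outer;
    }
}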

-Original Message-
From: Daniel Naber [mailto:[EMAIL PROTECTED]
Sent: Wednesday, March 22, 2006 1:03 PM
To: java-dev@lucene.apache.org
Subject: Re: query parsing


On Wednesday 22 March 2006 18:49, Robert Engels wrote:

> If I use
>
> (+cat) AND (-dog)
>
> it parses to
>
> +(+cat) +(-dog)
>
> and returns no results.
>
> Is this a known issue?

Basically yes. QueryParser is known to exhibit strange behavior when 
combining +/- and AND/OR/NOT.

Regards
 Daniel

-- 
http://www.danielnaber.de

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: [jira] Updated: (LUCENE-529) TermInfosReader and other + instance ThreadLocal => transient/odd memory leaks => OutOfMemoryException

2006-03-23 Thread Robert Engels
Your testcase is invalid.

Reduce the size by 10 and increase the repeat by 10 (SAME amount of memory use),
and it works fine.

The reason it works in the one case is that you use new WeakReference(new
Array()) - since the array cannot be referenced, it is immediately GC'd. You
should have noticed since there were no finalization messages printed. You can 
demonstrate this clearly by adding an else to the if in the finalize() to print 
out that the object was indeed finalized.

ThreadLocals work and are GC'd correctly.

There is something else wrong in your system.

I ran the test using 1.5.0_06.


-Original Message-
From: Andy Hind (JIRA) [mailto:[EMAIL PROTECTED]
Sent: Thursday, March 23, 2006 7:07 AM
To: java-dev@lucene.apache.org
Subject: [jira] Updated: (LUCENE-529) TermInfosReader and other +
instance ThreadLocal => transient/odd memory leaks =>
OutOfMemoryException


 [ http://issues.apache.org/jira/browse/LUCENE-529?page=all ]

Andy Hind updated LUCENE-529:
-

Attachment: ThreadLocalTest.java

Attached is a test which you can use to see how ThreadLocals are left around.
Getting an out of memory exception depends on a number thingsit is set up 
to fail for 64M

Now I understand what is going on, there are a few alternatives:

1) set null on close
- fine for single thread use
- probably leaves (n-1)*segments*2 things hanging around for n-threaded use

2) Use a weak reference and leave it up to GC to get rid of the referent when 
it is not being used

3) Manage the things yourself by object id and thread id - and clean up on 
object close() 

I would go with option 1) and 2) although it may mean things get GCed before a 
call to close() when not used.

The fix I initially suggested is in production, and has been stress tested with 
a couple of hundred users continually pounding the app,
 but not for multithreaded use of IndexReaders. Each time does a couple of 
simple searches with no clever reuse of index readers (which is on the todo 
list)

I do not see how setting the thread local to null on close() has any negative 
impact. You are not going to use the cached information again??

Before the fix: 10-100 threads - 1G JVM - OOM in a few hours 
After: 10-100 threads 256M JVM -  days with a flat memory footprint

I am not sure why the thread local table is so big for us, but that is not 
really the issue.
It could just be building lots of IndexReaders (with thread locals hanging around - 
probably making 10 per instance) and GC not kicking in, so this table grows and can 
hold a lot of stale entries.  I may get time to investigate further.

> TermInfosReader and other + instance ThreadLocal => transient/odd memory 
> leaks =>  OutOfMemoryException
> ---
>
>  Key: LUCENE-529
>  URL: http://issues.apache.org/jira/browse/LUCENE-529
>  Project: Lucene - Java
> Type: Bug
>   Components: Index
> Versions: 1.9
>  Environment: Lucene 1.4.3 with 1.5.0_04 JVM or newer... will apply to 1.9 
> code 
> Reporter: Andy Hind
>  Attachments: ThreadLocalTest.java
>
> TermInfosReader uses an instance level ThreadLocal for enumerators.
> This is a transient/odd memory leak in lucene 1.4.3-1.9 and applies to 
> current JVMs, 
> not just an old JVM issue as described in the finalizer of the 1.9 code.
> There is also an instance level thread local in SegmentReader which will 
> have the same issue.
> There may be other uses which also need to be fixed.
> I don't understand the intended use for these variables... however
> Each ThreadLocal has its own hashcode used for look up, see the ThreadLocal 
> source code. Each instance of TermInfosReader will be creating an instance of 
> the thread local. All this does is create an instance variable on each thread 
> when it accesses the thread local. Setting it to null in the finaliser will 
> set it to null on one thread, the finalizer thread, where it has never been 
> created.  There is no point to this :-(
> I assume there is a good concurrency reason why an instance variable can not 
> be used...
> I have not used multi-threaded searching, but I have used a lot of threads 
> each making searchers and searching.
> 1.4.3 has a clear memory leak caused by this thread local. This use case 
> above is definitely solved by setting the thread local to null in the 
> close(). This at least has a chance of being on the correct thread :-) 
> I know reusing Searchers would help but that is my choice and I will get to 
> that later  
> Now you want to know why
> Thread locals are stored in a table of entries. Each entry is *weak 
> reference* to the key (Here the TermInfosReader instance)  and a *simple 
> reference* to the thread local value. When the instance is GCed its key 
> becomes null. 
> This is now a stale entry in the table.
> Stale entries are cleared up in an ad hoc way and until they a

RE: [jira] Updated: (LUCENE-529) TermInfosReader and other + instance ThreadLocal => transient/odd memory leaks => OutOfMemoryException

2006-03-23 Thread Robert Engels
The only other thing that may be causing your problem is the use of finalize(). 
This can interfere with the GC's ability to collect objects.

I am not sure why the finalize() is used in the Lucene ThreadLocal handling. It 
doesn't seem necessary to me.

-Original Message-
From: Robert Engels [mailto:[EMAIL PROTECTED]
Sent: Thursday, March 23, 2006 10:00 AM
To: java-dev@lucene.apache.org
Subject: RE: [jira] Updated: (LUCENE-529) TermInfosReader and other +
instance ThreadLocal => transient/odd memory leaks =>
OutOfMemoryException
Importance: High


Your testcase is invalid.

Reduce the size by 10, increase the repeat by 10, (SAME amount of memory use), 
and it works fine.

The reason it works in the one case is that you use new WeakReference(new 
Arrary()), - since the array cannot be referenced, it is immediately GC'd. You 
should have noticed since there were no finalization messages printed. You can 
demonstrate this clearly by adding an else to the if in the finalize() to print 
out that the object was indeed finalized.

ThreadLocal's work and are GC'd correctly.

There is something else wrong in your system.

I ran the test using 1.5.0._06.


-Original Message-
From: Andy Hind (JIRA) [mailto:[EMAIL PROTECTED]
Sent: Thursday, March 23, 2006 7:07 AM
To: java-dev@lucene.apache.org
Subject: [jira] Updated: (LUCENE-529) TermInfosReader and other +
instance ThreadLocal => transient/odd memory leaks =>
OutOfMemoryException


 [ http://issues.apache.org/jira/browse/LUCENE-529?page=all ]

Andy Hind updated LUCENE-529:
-

Attachment: ThreadLocalTest.java

Attached is a test which you can use to see how ThreadLocals are left around.
Getting an out of memory exception depends on a number thingsit is set up 
to fail for 64M

Now I understand what is going on, there are a few alternatives:

1) set null on close
- fine for single thread use
- probably leaves (n-1)*segments*2things hanging around for n threaded use

2) Use a weak reference and leave it up to GC to get rid of the referent when 
it is not being used

3) Manage the things youself by object id and thread id - and clean up on 
object close() 

I would go with option 1) and 2) although it may mean things get GCed before a 
call to close() when not used.

The fix I initially suggested is in production, and has been stress tested with 
a couple of hundred users continually pounding the app,
 but not for multithreaded use of IndexReaders. Each time does a couple of 
simple searches with no clever reuse of index readers (which is on the todo 
list)

I do not see how setting the thread local to null on close() has any negative 
impact. You are not going to use the cached information again??

Before the fix: 10-100 threads - 1G JVM - OOM in a few hours 
After: 10-100 threads 256M JVM -  days with a flat memory footprint

I am not sure why the thread local table is so big for us, but that is not 
really the issue.
It could just be building lots of IndexReaders (with thread locals hanging - 
probably making 10/instance ) and gc not kicking in so this table grows and can 
hold a lot of stale entries.  I may get time to investigate further

> TermInfosReader and other + instance ThreadLocal => transient/odd memory 
> leaks =>  OutOfMemoryException
> ---
>
>  Key: LUCENE-529
>  URL: http://issues.apache.org/jira/browse/LUCENE-529
>  Project: Lucene - Java
> Type: Bug
>   Components: Index
> Versions: 1.9
>  Environment: Lucene 1.4.3 with 1.5.0_04 JVM or newer..will aplpy to 1.9 
> code 
> Reporter: Andy Hind
>  Attachments: ThreadLocalTest.java
>
> TermInfosReader uses an instance level ThreadLocal for enumerators.
> This is a transient/odd memory leak in lucene 1.4.3-1.9 and applies to 
> current JVMs, 
> not just an old JVM issue as described in the finalizer of the 1.9 code.
> There is also an instance level thread local in SegmentReaderwhich will 
> have the same issue.
> There may be other uses which also need to be fixed.
> I don't understand the intended use for these variables.however
> Each ThreadLocal has its own hashcode used for look up, see the ThreadLocal 
> source code. Each instance of TermInfosReader will be creating an instance of 
> the thread local. All this does is create an instance variable on each thread 
> when it accesses the thread local. Setting it to null in the finaliser will 
> set it to null on one thread, the finalizer thread, where it has never been 
> created.  There is no point to this :-(
> I assume there is a good concurrency reason why an instance variable can not 
> be used...
> I have not used multi-threaded searching, but I have used a lot of threads 
> each making searchers and searching

RE: [jira] Updated: (LUCENE-529) TermInfosReader and other + instance ThreadLocal => transient/odd memory leaks => OutOfMemoryException

2006-03-23 Thread Robert Engels
The testcase is still not correct - at least with regards to Lucene.

Review the ThreadLocal and ThreadLocalMap code again; you will see that
references to the ThreadLocal are kept using weak references, in a slot in
an array, and entries are reclaimed (i.e. the slot cleared) PERIODICALLY
as new entries are ADDED or RETRIEVED. The array will never decrease in
size, but it is unlikely to grow to be very large.

See ThreadLocalMap.cleanSomeSlots() and where/when it is called.

 The reason that your test fails is that each entry in the table
maintains a HARD reference to the value, and since the entries are only
reclaimed PERIODICALLY (for performance reasons - see cleanSomeSlots()), you
do have the possibility to run out of memory (since the ThreadLocal values
you are storing are EXTREMELY LARGE).

In Lucene, the values used in the ThreadLocal are 1/10 the size, and
thus should cause no problems.
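
An illustrative toy (sizes made up, not the attached test) of the mechanism
being argued about: a stale entry keeps a hard reference to its value until
cleanSomeSlots() happens to clear that slot, so whether the loop below ever runs
out of memory depends mainly on the value size and on how often slots get expunged:

public class StaleEntryDemo {
    public static void main(String[] args) {
        for (int i = 0; i < 100000; i++) {
            // the ThreadLocal key becomes weakly reachable as soon as the
            // iteration ends, but the value stays hard-referenced in this
            // thread's map until its slot is expunged
            ThreadLocal local = new ThreadLocal();
            local.set(new byte[1024 * 1024]);   // "large" values are the risky case
            // with values ~1/10 the size the same loop is effectively harmless
        }
    }
}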

-Original Message-
From: Andy Hind [mailto:[EMAIL PROTECTED]
Sent: Thursday, March 23, 2006 12:17 PM
To: java-dev@lucene.apache.org; [EMAIL PROTECTED]
Subject: RE: [jira] Updated: (LUCENE-529) TermInfosReader and other +
instance ThreadLocal => transient/odd memory leaks =>
OutOfMemoryException



Well, unfortunately, it is your test case that is not equivalent.

OK you make 10 times as many objects that are 1/10 the size.
But the thread local map is the same size and likely to end up holding
the same number of stale entries, so yes the memory footprint is 10
times smaller and it works.

If you increase the size of the thread local table by x10 it will fail
again. This will not be an exact memory size match due to the underlying
algorithm for cleaning stuff out looking at log2(current.no.of thread
locals) + ?*log2(table.size) and being dependent on GC, so I have no
idea what fraction of stale entries will remain for any size.

I suggest you look at the source code for ThreadLocal and how stale
entries are removed - look for "GC" and you will see where the "value"
is set to null to help the GC.

I should have proposed using SoftReferences not WeakReferences.
Apologies. Soft References also work fine.

The finalize method is indeed pointless but makes no difference to the
result. The point I was making is that it is pointless in lucene.

Yes, creating threads is expensive - my point - somewhat tongue-in-cheek
is at least they would be clean :-)

You can add an assert to make sure the weak/soft reference exists after
the Object is created. It does. You *may* be unlucky and have GC take
place in between and go for the weak/soft references.

I used 1.5.0_04 and checked the source code for ThreadLocal is 1.33
which is the same version as in 1.5.0_06 and the issue was initially
found using 1.5.0_06.

The issue is for real.
You can blame ThreadLocal but it does what it says on the tin.

Regards

Andy


-Original Message-
From: Robert Engels [mailto:[EMAIL PROTECTED]
Sent: 23 March 2006 16:05
To: java-dev@lucene.apache.org; [EMAIL PROTECTED]
Subject: RE: [jira] Updated: (LUCENE-529) TermInfosReader and other +
instance ThreadLocal => transient/odd memory leaks =>
OutOfMemoryException

The only other thing that may be causing your problem is the use of
finalize(). This can interfere with the GC ability to GC objects.

I am not sure why the finalize() is used in the Lucene ThreadLocal
handling. It doesn't seem necessary to me.

-----Original Message-
From: Robert Engels [mailto:[EMAIL PROTECTED]
Sent: Thursday, March 23, 2006 10:00 AM
To: java-dev@lucene.apache.org
Subject: RE: [jira] Updated: (LUCENE-529) TermInfosReader and other +
instance ThreadLocal => transient/odd memory leaks =>
OutOfMemoryException
Importance: High


Your testcase is invalid.

Reduce the size by 10, increase the repeat by 10, (SAME amount of memory
use), and it works fine.

The reason it works in the one case is that you use new
WeakReference(new Arrary()), - since the array cannot be referenced, it
is immediately GC'd. You should have noticed since there were no
finalization messages printed. You can demonstrate this clearly by
adding an else to the if in the finalize() to print out that the object
was indeed finalized.

ThreadLocal's work and are GC'd correctly.

There is something else wrong in your system.

I ran the test using 1.5.0._06.


-Original Message-
From: Andy Hind (JIRA) [mailto:[EMAIL PROTECTED]
Sent: Thursday, March 23, 2006 7:07 AM
To: java-dev@lucene.apache.org
Subject: [jira] Updated: (LUCENE-529) TermInfosReader and other +
instance ThreadLocal => transient/odd memory leaks =>
OutOfMemoryException


 [ http://issues.apache.org/jira/browse/LUCENE-529?page=all ]

Andy Hind updated LUCENE-529:
-

Attachment: ThreadLocalTest.java

Attached is a test which you can use to see how ThreadLocals are left
around.
Getting an out of memory exception depends on a number thingsit is
set up to 

RE: [jira] Updated: (LUCENE-529) TermInfosReader and other + instance ThreadLocal => transient/odd memory leaks => OutOfMemoryException

2006-03-24 Thread Robert Engels
Seems like something else is wrong in your environment, since you will only
get 2 ThreadLocals per segment - having 7000 entries in the ThreadLocal of a
thread seems like a lot. Even so, with the current finalize() method, the
buffer used by the ThreadLocal is reclaimed, since the ThreadLocal value is
set to null.

I would also be very suspicious as to how you are determining this. If you are
using a profiler or debugger, the extra CPU processing required by the
environment, along with the CPU-intensive nature of Lucene, will often cause the
garbage collector to not run as frequently. I would try reducing the memory
(maybe 128m) to cause the JVM to run out of memory and force the GC to run,
and see how many "stale" entries are retained. If the GC runs more often, then
the call to cleanSomeSlots() will find more slots available to be reclaimed.

100 threads processing Lucene queries seems like WAY TOO many as well. Given
the CPU bound nature of MOST of Lucene, 100 threads is going to waste a lot
of CPU performing context switching. This is an IDEAL case to use a
ThreadPool of Lucene processing threads - no more than a few per logical
processor sounds reasonable.

Anyway, I suggest the JIRA issue be closed, there is no bug in Lucene, and
the common (and best) use of a ThreadPool in a high-use server environment
makes this a non-issue.
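
A rough sketch of the thread-pool suggestion above (JDK 5 java.util.concurrent;
the index path, field name and pool sizing are illustrative only): a small fixed
pool of worker threads shares one IndexSearcher instead of 100 request threads
each doing Lucene work directly.

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class SearchPool {
    public static void main(String[] args) throws Exception {
        final IndexSearcher searcher = new IndexSearcher("/path/to/index"); // shared, thread-safe
        int poolSize = 2 * Runtime.getRuntime().availableProcessors();      // a few per logical CPU
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);

        Future result = pool.submit(new Callable() {
            public Object call() throws Exception {
                Query q = new QueryParser("contents", new StandardAnalyzer()).parse("lucene");
                return new Integer(searcher.search(q).length());            // number of hits
            }
        });
        System.out.println("hits: " + result.get());
        pool.shutdown();
    }
}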


-Original Message-
From: Andy Hind [mailto:[EMAIL PROTECTED]
Sent: Friday, March 24, 2006 6:03 AM
To: java-dev@lucene.apache.org; [EMAIL PROTECTED]
Subject: RE: [jira] Updated: (LUCENE-529) TermInfosReader and other +
instance ThreadLocal => transient/odd memory leaks =>
OutOfMemoryException



Entries are only reclaimed after the holding object/class has been
garbage collected and you access thread locals on a given thread. Let's
say you give your app server 1G to play with. Then you create thread
locals like there is no tomorrow. GC has a bit of a sleep and before you
know it you do have a very large table to hold thread locals. It is the
interaction of garbage collection and thread local access that is used
to clear entries and determine a stable size.

Remember, there are lots of other things going on and it may not just be
lucene using thread locals. If something else pushes up the thread local
table size there is still going to be an issue.

I have had a quick play using a simple lucene example and got a thread
local table of 131072 by allowing the test 512m. This thread local table
size increases as you have more free memory and GC is less frequent. It
also tends to increase with the number of search iterations. I presume
you are more likely to hit a bad case. I will add the example to the bug
report.

Ok, there are ~7000 actual entries in this table at the end - pretty
sparse.
When I access it, it is going to look at the next 12 entries, which is a 64%
hit rate. So maybe this is a stable size. I have used it enough. Maybe
it is a freak of use and GC.

In the actual case the thread local tables seem to grow steadily over
time.
This is going to be down to the number of threads, typical amount of
free memory, rate of thread local creation, thread local table size, gc
and the rate at which stale entries are cleared out, ... It is not going
to be an issue for everyone.

So with 100 threads I can see how we could end up holding on to maybe
700,000 objects. These all hold a clone of an input stream, each with at
least a 1k byte buffer, from a quick look at a couple. Has someone got a
better size?

I will see if I can turn my example into a test case.

Regards

Andy


-Original Message-----
From: Robert Engels [mailto:[EMAIL PROTECTED]
Sent: 23 March 2006 18:50
To: java-dev@lucene.apache.org
Subject: RE: [jira] Updated: (LUCENE-529) TermInfosReader and other +
instance ThreadLocal => transient/odd memory leaks =>
OutOfMemoryException

The testcase is still not correct - at least with regards to Lucene.

Review the ThreadLocal and ThreadLocalMap code again, you will see that
references to the ThreadLocal are kept using weak references, in a slot
in
an array, and entries are reclaimed() (i.e. the slot cleared)
PERIODICALLY
as new entries are ADDED or RETRIEVED. The array will never decrease in
size, but it is unlikely to grow to be very large.

See TheadLocalMap.cleanSomeSlots() and where/when it is called.

 The reason that your test fails is that each entry in the table
maintains a HARD reference to the value, and since the entries are only
reclaimed PERIODICALLY (for performance reasons - see cleanSomeSlots()),
you
do have the possibility to run out of memory (since the ThreadLocal
values
you are storing are EXTREMELY LARGE).

In Lucene, the values used in the ThreadLocal are 1/10 the size, and
thus should cause no problems.

-Original Message-
From: Andy Hind [mailto:[EMAIL PROTECTED]
Sent: Thursday, March 23, 2006 12:17 PM
To: java-dev@lucene.apache.org; [EMAIL PROTECTED]
Subject: RE: [jira] Updated: (LUCENE-529) TermInfosReade

RE: Question about RemoteSearchable, RMI and queries in parallel

2006-04-12 Thread Robert Engels
I think you may need a much more advanced design - with change detection,
parallel query execution, and index modification.

A lot of it depends on your semantics of a search - does it mean that the
results are 'almost right' at a moment in time, or are pending index changes
applied first before any queries? How you define this will affect the
architecture greatly.


-Original Message-
From: wenjie zheng [mailto:[EMAIL PROTECTED]
Sent: Wednesday, April 12, 2006 8:45 PM
To: java-dev@lucene.apache.org
Subject: Question about RemoteSearchable, RMI and queries in parallel


My question is a little bit long. Thanks for your patience.

I am working on a project which requires remote searching ability, so I use
RMI and the RemoteSearchable class. Here is the system structure:
Server A has all the indices on it and the RemoteSearchable object running
on it. Server B accepts queries and uses ParallelMultiSearcher and RMI to
search the indices on Server A.
There are many indices on Server A; based on the query I receive, I choose a
different index to search.

For example, if the query is from X, we will search indexX; if the query
is from Y, we will search indexY. In order to do that, I changed the
Searchable interface a little bit by adding a function load(String folder) and
implementing it in all the subclasses. What load() does is close the
underlying reader of the first index and open the reader for the second
index.

So here is the problem I bumped into. When more than one query comes in
at the same time and I want to process them in parallel, the load()
function called by the second query may accidentally close the reader
of the first index, while searching on the first index has not
finished yet.

So, do I have to queue the queries up and execute them one after the other?
Or do I have other options?

Thanks,
Wenjie


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Question about RemoteSearchable, RMI and queries in parallel

2006-04-12 Thread Robert Engels
If using RMI, you need to register multiple RemoteSearchable at different
addresses - one for each index you want to search.

Some simple client code will allow you to select the proper one.

This is the simplest solution (from what I understand of your problem -
although I admit I am still not completely understanding the dilemma).

The other solution would be to just create your own higher level RMI
interface where the methods took an 'index name', and the server could
multiplex from there.
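
A minimal sketch of the first suggestion (one RemoteSearchable per index, each
bound under its own RMI name; the paths and names are illustrative only):

import java.rmi.Naming;
import java.rmi.registry.LocateRegistry;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.RemoteSearchable;

public class RegisterSearchables {
    public static void main(String[] args) throws Exception {
        LocateRegistry.createRegistry(1099);            // start an RMI registry in this JVM
        Naming.rebind("//localhost/indexX",
                new RemoteSearchable(new IndexSearcher("/indices/indexX")));
        Naming.rebind("//localhost/indexY",
                new RemoteSearchable(new IndexSearcher("/indices/indexY")));
        // the JVM stays alive to serve remote calls; a client on Server B
        // picks the right index by name, e.g.
        //   Searchable s = (Searchable) Naming.lookup("//serverA/indexX");
    }
}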

-Original Message-
From: wenjie zheng [mailto:[EMAIL PROTECTED]
Sent: Wednesday, April 12, 2006 10:09 PM
To: java-dev@lucene.apache.org
Subject: Re: Question about RemoteSearchable, RMI and queries in
parallel


Thanks for your reply.
I think I didn't address the problem very clearly. Let me rephrase it.

There is no such problem if everything is local; literally we can new as many
IndexSearchers as we need.
However, given the fact that there is only one RemoteSearchable instance
running on Server A, how can I run multiple queries on Server B (different
indices) at the same time without affecting each other?

On 4/12/06, Robert Engels <[EMAIL PROTECTED]> wrote:
>
> I think you may need a much more advanced design - with change detection,
> parallel query execution, and index modification.
>
> A lot of it depends on you semantics of a search - does it mean at the
> results are 'almost right' at a moment in time, or are pending index
> changes
> made first before any queries. How you define this will affect the
> architecture greatly.
>
>
> -Original Message-
> From: wenjie zheng [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, April 12, 2006 8:45 PM
> To: java-dev@lucene.apache.org
> Subject: Question about RemoteSearchable, RMI and queries in parallel
>
>
> My question is a little bit long. Thanks for your patience.
>
> I am working on project which requires remote searching ability. So I use
> RMI and RemoteSearchable class. Here is the system structure:
> Server A has all the indices on it and the RemoteSearchable object running
> on it. Server B accepts queries and use ParallelMultiSearcher  and RMI to
> search the indices on Server A.
> There are many indices on Server A, based on query I received, I choose
> different index to search from .
>
> For example, if I the query is from X, we will search indexX; if the query
> is from Y, we will search indexY. In order to do that, I changed the
> Searchable interface a little bit by adding function load(String folder)
> and
> implement it in all the subclasses. What load() does is to close the
> underlining reader of the first index and open the reader for the second
> index.
>
> So here is the problem I bumped into. When there are more than one queries
> coming at the same time and I wanted to process them in parallel,  the
> load() function called by the second query may accidently close the the
> reader of the first index, while searching on the first index is not
> finished yet.
>
> So, do I have to queue the queries up and execute them one after the
> other?
> Or do I have other options?
>
> Thanks,
> Wenjie
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: lucene search sentence

2006-04-27 Thread Robert Engels
Ask the question on the lucene users list, not the dev list.

And: read a book. Read the javadoc. Read the samples.

-Original Message-
From: Anton Feldmann [mailto:[EMAIL PROTECTED]
Sent: Thursday, April 27, 2006 10:05 AM
To: java-dev@lucene.apache.org; java-user@lucene.apache.org
Subject: lucene search sentence


Hi

I wrote an Indexer which indexes all the contents of a text, and the
sentences are separated into another Document.

"Document document = new Document();
document.add(new Field("contents", reader));

StringTokenizer token = new StringTokenizer(contents.replaceAll(". ", "\\.x\\"), "\\.x\\");
while (token.hasMoreTokens()) {
    Document doc = new Document();
    doc.add(new Field("sentence", token.nextToken(), Field.Store.YES, Field.Index.TOKENIZED));
}"

1) How do I write a Lucene Search and display all the hits in an
document?
2) How do I display the sentence the hit is in? and color the hit.
3) How do I display the sentence before and after the sentence the hit
is in?

Cherrs

anton


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: GData, updateable IndexSearcher

2006-04-27 Thread Robert Engels
Doug can you please elaborate on this.

I thought each segment maintained its own list of deleted documents (since
segments are WRITE ONCE, and when that segment is merged or optimized it
would "go away" anyway, as the deleted documents are removed).

In my reopen() implementation, I check to see if the existing segment name
is the same as an already open segment, and then just use the existing
SegmentInfo object (since it should still have reference to its deleted
documents).

For example,

Index has 3 (1-3) segments. A new document is written that causes a segment
to be created (4). A reopen would retain the SegmentInfo for 1-3, and create
a new one for 4.

It would be no different if segment 2 had deleted documents when the
creation of segment 4 occurs, segment 2 is not modified in this case.

If adding the new document, which creates a new segment, caused a merge,
segment 2 would be rewritten (and the deletions processed), so the segment
name for 2 would no longer be valid anyway, and the SegmentInfo would not
reused.

I've had this code in production for almost 2 years and have not seen any
problems - trying to get a handle on the possibility that our code may be
"unstable".

-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Wednesday, April 26, 2006 3:44 PM
To: java-dev@lucene.apache.org
Subject: Re: GData, updateable IndexSearcher


jason rutherglen wrote:
> I was thinking you implied that you knew of someone who had customized
their own, but it was a closed source solution.  And if so then you would
know how that project faired.

I don't recall the details, but I know folks have discussed this
previously, and probably even posted patches, but I don't think any of
the patches was ready to commit.

> Wouldn't there also need to be a hack on the IndexWriter to keep track of
new segments?

I think the 'public static IndexReader.reopen(IndexReader old)' method I
proposed can easily compare the current list of segments for the
directory of old to those that old already has open, and determine which
can be reused and which new segments must be opened.  Deletions would be
a little tricky to track.  If a segment has had deletions, then a new
SegmentReader could be cloned from the old, sharing everything but the
deletions, which could be re-read from disk.  This would invalidate
cached filters for segments that had deletions.

You could even try to figure out what documents have been deleted, then
update filters incrementally.  That would be fastest, but more complicated.

Doug

> - Original Message 
> From: Doug Cutting <[EMAIL PROTECTED]>
> To: solr-dev@lucene.apache.org
> Sent: Wednesday, April 26, 2006 11:27:44 AM
> Subject: Re: GData, updateable IndexSearcher
>
> jason rutherglen wrote:
>
>>Interesting, does this mean there is a plan for incrementally updateable
IndexSearchers to become part of Lucene?
>
>
> In general, there is no plan for Lucene.  If someone implements a
> generally useful, efficient, feature in a back-compatible, easy to use,
> manner, and submits it as a patch, then it becomes a part of Lucene.
> That's the way Lucene changes.  Since we don't pay anyone, we can't make
> plans and assign tasks.  So if you're particularly interested in this
> feature, you might search the archives to find past efforts, or simply
> try to implement it yourself.
>
> I think a good approach would be to create a new IndexSearcher instance
> based on an existing one, that shares IndexReaders.  Similarly, one
> should be able to create a new IndexReader based on an existing one.
> This would be a MultiReader that shares many of the same SegmentReaders.
>
> Things get a little tricky after this.
>
> Lucene caches filters based on the IndexReader.  So filters would need
> to be re-created.  Ideally these could be incrementally re-created, but
> that might be difficult.  What might be simpler would be to use a
> MultiSearcher constructed with an IndexSearcher per SegmentReader,
> avoiding the use of MultiReader.  Then the caches would still work.
> This would require making a few things public that are not at present.
> Perhaps adding a 'MultiReader.getSubReaders()' method, combined with an
> 'static IndexReader.reopen(IndexReader)' method.  The latter would
> return a new MultiReader that shared SegmentReaders with the old
> version.  Then one could use getSubReaders() on the new multi reader to
> extract the current set to use when constructing a MultiSearcher.
>
> Another tricky bit is figuring out when to close readers.
>
> Does this make sense?  This discussion should probably move to the
> lucene-dev list.
>
>
>>Are there any negatives to updateable IndexSearchers?
>
>
> Not if implemented well!
>
> Doug
>
>
>
>

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [E

RE: 2.0 release

2006-04-27 Thread Robert Engels
What about making IndexReader & IndexWriter interfaces? Or creating
interfaces for these (IReader & IWriter?), and making all of the classes use
the interfaces?
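
A hypothetical sketch of what that could look like (the names IReader/IWriter
come from the suggestion above; the method subset shown is purely illustrative):

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.Term;

// a read-side extract that IndexReader could implement
public interface IReader {
    int numDocs();
    int maxDoc();
    Document document(int n) throws IOException;
    int docFreq(Term t) throws IOException;
    void close() throws IOException;
}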

-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Thursday, April 27, 2006 5:20 PM
To: java-dev@lucene.apache.org
Subject: 2.0 release


Are there any changes folks think we need before we make the 2.0
release?  The major change from 1.9, removal of deprecated items, has
been made.  Anything else critical?

Doug

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



SegmentReader changes?

2006-04-29 Thread Robert Engels
I think one of two things need to happen to the SegmentReader class.

Either make the 'segment' variable protected, or make the initialize()
method protected.

Without this, subclassing SegmentReader is impossible, since there is no way
for the derived class to know what segment it is working with.

In implementing the 'reopen()' method SegmentReader needs to be subclassed
in order to support 'refreshing' the deleted documents.


RE: GData, updateable IndexSearcher

2006-05-01 Thread Robert Engels
fyi, using my reopen() implementation (which rereads the deletions)

on a 135mb index, with 5000 iterations

open & close time using new reader = 585609
open & close time using reopen = 27422

Almost 20x faster. Important in a highly interactive/incremental updating
index.

-Original Message-
From: jason rutherglen [mailto:[EMAIL PROTECTED]
Sent: Monday, May 01, 2006 1:24 PM
To: java-dev@lucene.apache.org
Subject: Re: GData, updateable IndexSearcher


I wanted to post a quick hack to see if it is along the correct lines.  A
few of the questions regard whether to reuse existing MultiReaders or
simply strip out only the SegmentReaders.  I do a compare on the segment
name and made it public.  Thanks!


public static IndexReader reopen(IndexReader indexReader) throws IOException {
    if (indexReader instanceof MultiReader) {
      MultiReader multiReader = (MultiReader) indexReader;

      SegmentInfos segmentInfos = new SegmentInfos();
      segmentInfos.read(indexReader.directory());
      if (segmentInfos.size() == 1) {  // index is optimized
        return SegmentReader.get(segmentInfos, segmentInfos.info(0), false);
      }

      IndexReader[] existingIndexReaders = multiReader.getSubReaders();
      // now go through and compare the segment readers
      Map existingSegmentMap = new HashMap();
      getSegmentReaders(existingIndexReaders, existingSegmentMap);

      Map newSegmentInfosMap = new HashMap();

      List newSegmentReaders = new ArrayList();

      Iterator segmentInfosIterator = segmentInfos.iterator();
      while (segmentInfosIterator.hasNext()) {
        SegmentInfo segmentInfo = (SegmentInfo) segmentInfosIterator.next();

        if (!existingSegmentMap.containsKey(segmentInfo.name)) {
          // it's new
          SegmentReader newSegmentReader = SegmentReader.get(segmentInfo);
          newSegmentReaders.add(newSegmentReader);
        }
      }
      List allSegmentReaders = new ArrayList();
      allSegmentReaders.add(multiReader);
      allSegmentReaders.addAll(newSegmentReaders);

      return new MultiReader(indexReader.directory(), segmentInfos, false,
          (IndexReader[]) allSegmentReaders.toArray(new IndexReader[0]));
    }
    throw new RuntimeException("indexReader not supported at this time");
  }

  public static void getSegmentReaders(IndexReader[] indexReaders, Map map) {
    for (int x = 0; x < indexReaders.length; x++) {
      if (indexReaders[x] instanceof MultiReader) {
        MultiReader multiReader = (MultiReader) indexReaders[x];
        IndexReader[] subReaders = multiReader.getSubReaders();
        getSegmentReaders(subReaders, map);
      } else if (indexReaders[x] instanceof SegmentReader) {
        SegmentReader segmentReader = (SegmentReader) indexReaders[x];
        map.put(segmentReader.segment, segmentReader);
      }
    }
  }



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: GData, updateable IndexSearcher

2006-05-01 Thread Robert Engels
Attached.

It uses subclasses and instanceof which is sort of "hackish" - to do it
correctly requires changes to the base classes.



-Original Message-
From: jason rutherglen [mailto:[EMAIL PROTECTED]
Sent: Monday, May 01, 2006 1:43 PM
To: java-dev@lucene.apache.org
Subject: Re: GData, updateable IndexSearcher


Can you post your code?

- Original Message ----
From: Robert Engels <[EMAIL PROTECTED]>
To: java-dev@lucene.apache.org; jason rutherglen <[EMAIL PROTECTED]>
Sent: Monday, May 1, 2006 11:33:06 AM
Subject: RE: GData, updateable IndexSearcher

fyi, using my reopen() implementation (which rereads the deletions)

on a 135mb index, with 5000 iterations

open & close time using new reader = 585609
open & close time using reopen = 27422

Almost 20x faster. Important in a highly interactive/incremental updating
index.

-Original Message-
From: jason rutherglen [mailto:[EMAIL PROTECTED]
Sent: Monday, May 01, 2006 1:24 PM
To: java-dev@lucene.apache.org
Subject: Re: GData, updateable IndexSearcher


I wanted to post a quick hack to see if it is along the correct lines.  A
few of the questions regard whether to resuse existing MultiReaders or
simply strip out only the SegmentReaders.  I do a compare on the segment
name and made it public.  Thanks!


public static IndexReader reopen(IndexReader indexReader) throws IOException
{
if (indexReader instanceof MultiReader) {
  MultiReader multiReader = (MultiReader)indexReader;

  SegmentInfos segmentInfos = new SegmentInfos();
  segmentInfos.read(indexReader.directory());
  if (segmentInfos.size() == 1) {  // index is optimized
return SegmentReader.get(segmentInfos, segmentInfos.info(0), false);
  }

  IndexReader[] existingIndexReaders = multiReader.getSubReaders();
  // now go through and compare the segment readers
  Map existingSegmentMap = new
HashMap();
  getSegmentReaders(existingIndexReaders, existingSegmentMap);

  Map newSegmentInfosMap = new
HashMap();

  List newSegmentReaders = new
ArrayList();

  Iterator segmentInfosIterator = segmentInfos.iterator();
  while (segmentInfosIterator.hasNext()) {
SegmentInfo segmentInfo = (SegmentInfo)segmentInfosIterator.next();

if (!existingSegmentMap.containsKey(segmentInfo.name)) {
  // it's new
  SegmentReader newSegmentReader = SegmentReader.get(segmentInfo);
  newSegmentReaders.add(newSegmentReader);
}
  }
  List allSegmentReaders = new ArrayList();
  allSegmentReaders.add(multiReader);
  allSegmentReaders.addAll(newSegmentReaders);

  return new MultiReader(indexReader.directory(), segmentInfos, false,
(IndexReader[])allSegmentReaders.toArray(new IndexReader[0]));
}
throw new RuntimeException("indexReader not supported at this time");
  }

  public static void getSegmentReaders(IndexReader[] indexReaders,
Map map) {
for (int x=0; x < indexReaders.length; x++) {
  if (indexReaders[x] instanceof MultiReader) {
MultiReader multiReader = (MultiReader)indexReaders[x];
IndexReader[] subReaders = multiReader.getSubReaders();
getSegmentReaders(subReaders, map);
  } else if (indexReaders[x] instanceof SegmentReader) {
SegmentReader segmentReader = (SegmentReader)indexReaders[x];
map.put(segmentReader.segment, segmentReader);
  }
}
  }





package org.apache.lucene.index;

import java.io.IOException;

import org.apache.lucene.store.Directory;

/**
 * overridden to allow retrieval of contained IndexReader's to enable IndexReaderUtils.reopen()
 */
public class MyMultiReader extends MultiReader {

private IndexReader[] readers;

public MyMultiReader(Directory directory,SegmentInfos infos,IndexReader[] subReaders) throws IOException {
super(directory,infos,true,subReaders);
readers = subReaders;
}

public IndexReader[] getReaders() {
return readers;
}

public void doCommit() throws IOException {
super.doCommit();
}
}
package org.apache.lucene.index;

import java.io.IOException;
import java.util.*;

import org.apache.lucene.store.*;

public class IndexReaderUtils {
private static Map segments = new WeakHashMap();
static {
// must use String class name, otherwise instantiation order will not allow the override to work
System.setProperty("org.apache.lucene.SegmentReader.class","org.apache.lucene.index.MySegmentReader");
}

/**
 * reopens the IndexReader, possibly reusing the segments for greater efficiency. The original IndexReader instance
 * is closed, and the reference is no longer valid
 * 
 * @return the new IndexReader
 */
public static synchronized IndexReader reopen(IndexReader ir) throws IOException {
final Directory directory = ir.directory();

if(!(ir insta

refresh segments for deleted documents?

2006-05-01 Thread Robert Engels
I implemented the IndexReader.reopen(). My original implementation did not
"refresh" the deleted documents, and it seemed to work. The latest impl does
re-read the deletions.

BUT, on inspecting the IndexReader code, I am not sure this is necessary???

When a document is deleted, IndexReader marks the bit as deleted in the
SegmentReader, and if the SegmentReader instance is "reused", the document
is still deleted. If the Segment was merged, it would not be "reused"
anyway.

Doug, can you comment on exactly why the 'deletions' need to be re-read?
Doesn't seem necessary to me.


RE: GData, updateable IndexSearcher

2006-05-01 Thread Robert Engels
I just sent an email covering that. The code I provided takes that into
account, but in re-reading the code, I do not think it is necessary.


-Original Message-
From: jason rutherglen [mailto:[EMAIL PROTECTED]
Sent: Monday, May 01, 2006 5:17 PM
To: java-dev@lucene.apache.org
Subject: Re: GData, updateable IndexSearcher


Thanks for the code and performance metric Robert.  Have you had any issues
with the deleted segments as Doug has been describing?

- Original Message 
From: Robert Engels <[EMAIL PROTECTED]>
To: java-dev@lucene.apache.org; jason rutherglen <[EMAIL PROTECTED]>
Sent: Monday, May 1, 2006 11:49:41 AM
Subject: RE: GData, updateable IndexSearcher

Attached.

It uses subclasses and instanceof which is sort of "hackish" - to do it
correctly requires changes to the base classes.



-Original Message-
From: jason rutherglen [mailto:[EMAIL PROTECTED]
Sent: Monday, May 01, 2006 1:43 PM
To: java-dev@lucene.apache.org
Subject: Re: GData, updateable IndexSearcher


Can you post your code?

- Original Message ----
From: Robert Engels <[EMAIL PROTECTED]>
To: java-dev@lucene.apache.org; jason rutherglen <[EMAIL PROTECTED]>
Sent: Monday, May 1, 2006 11:33:06 AM
Subject: RE: GData, updateable IndexSearcher

fyi, using my reopen() implementation (which rereads the deletions)

on a 135mb index, with 5000 iterations

open & close time using new reader = 585609
open & close time using reopen = 27422

Almost 20x faster. Important in a highly interactive/incremental updating
index.

-Original Message-
From: jason rutherglen [mailto:[EMAIL PROTECTED]
Sent: Monday, May 01, 2006 1:24 PM
To: java-dev@lucene.apache.org
Subject: Re: GData, updateable IndexSearcher


I wanted to post a quick hack to see if it is along the correct lines.  A
few of the questions regard whether to resuse existing MultiReaders or
simply strip out only the SegmentReaders.  I do a compare on the segment
name and made it public.  Thanks!


public static IndexReader reopen(IndexReader indexReader) throws IOException
{
if (indexReader instanceof MultiReader) {
  MultiReader multiReader = (MultiReader)indexReader;

  SegmentInfos segmentInfos = new SegmentInfos();
  segmentInfos.read(indexReader.directory());
  if (segmentInfos.size() == 1) {  // index is optimized
return SegmentReader.get(segmentInfos, segmentInfos.info(0), false);
  }

  IndexReader[] existingIndexReaders = multiReader.getSubReaders();
  // now go through and compare the segment readers
  Map existingSegmentMap = new
HashMap();
  getSegmentReaders(existingIndexReaders, existingSegmentMap);

  Map newSegmentInfosMap = new
HashMap();

  List newSegmentReaders = new
ArrayList();

  Iterator segmentInfosIterator = segmentInfos.iterator();
  while (segmentInfosIterator.hasNext()) {
SegmentInfo segmentInfo = (SegmentInfo)segmentInfosIterator.next();

if (!existingSegmentMap.containsKey(segmentInfo.name)) {
  // it's new
  SegmentReader newSegmentReader = SegmentReader.get(segmentInfo);
  newSegmentReaders.add(newSegmentReader);
}
  }
  List allSegmentReaders = new ArrayList();
  allSegmentReaders.add(multiReader);
  allSegmentReaders.addAll(newSegmentReaders);

  return new MultiReader(indexReader.directory(), segmentInfos, false,
(IndexReader[])allSegmentReaders.toArray(new IndexReader[0]));
}
throw new RuntimeException("indexReader not supported at this time");
  }

  public static void getSegmentReaders(IndexReader[] indexReaders,
Map map) {
for (int x=0; x < indexReaders.length; x++) {
  if (indexReaders[x] instanceof MultiReader) {
MultiReader multiReader = (MultiReader)indexReaders[x];
IndexReader[] subReaders = multiReader.getSubReaders();
getSegmentReaders(subReaders, map);
  } else if (indexReaders[x] instanceof SegmentReader) {
SegmentReader segmentReader = (SegmentReader)indexReaders[x];
map.put(segmentReader.segment, segmentReader);
  }
}
  }








-Inline Attachment Follows-

package org.apache.lucene.index;

import java.io.IOException;

import org.apache.lucene.store.Directory;

/**
 * overridden to allow retrieval of contained IndexReader's to enable
IndexReaderUtils.reopen()
 */
public class MyMultiReader extends MultiReader {

private IndexReader[] readers;

public MyMultiReader(Directory directory,SegmentInfos
infos,IndexReader[] subReaders) throws IOException {
super(directory,infos,true,subReaders);
readers = subReaders;
}

public IndexReader[] getReaders() {
return readers;
}

public void doCommit() throws IOException {
super.doCommit();
}
}



-Inline Attachment Follows-

package org.apache.lucene.index;

import java.io.IOException;
import java.util.*;

import org.apache.lucene

RE: refresh segments for deleted documents?

2006-05-01 Thread Robert Engels
Thanks. I understand now. In my usage pattern deletions are never out of
sync - that is why it works.



-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Monday, May 01, 2006 5:36 PM
To: java-dev@lucene.apache.org
Subject: Re: refresh segments for deleted documents?


Robert Engels wrote:
> Doug, can you comment on exactly why the 'deletions' need to be re-read?
> Doesn't seem necessary to me.

A common idiom is to use one IndexReader for searches, and a separate
for deletions.  For example, one might do something like:

1. Open IndexReader A.
2. Start serving queries against A.
3. Open IndexReader B.
4. Process queued deletions/updates against B.
5. Close B.
6. Open IndexWriter C.
7. Process queued additions/updates against C.
8. Close C.
9. Sleep until 1 minute has elapsed.
10. Go to step 1.

This would publish a new version of the index every minute, attempting
to batch insertions, updates and deletes, as is optimal.  In this case,
if you re-open A, its deletions could be out of sync, but if you re-open
B, its deletions would not be out of sync.

So perhaps in your usage pattern deletions are never out of sync at
re-open, but there are also reasonable usage patterns where deletions
can become out of sync on re-open.

Doug
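
A rough sketch of the loop Doug describes (the queue-handling helpers are
hypothetical stubs, and closing the previously published reader is elided):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;

public class IndexPublisher implements Runnable {
    private final String path;

    public IndexPublisher(String path) { this.path = path; }

    public void run() {
        try {
            while (true) {
                IndexReader a = IndexReader.open(path);      // 1. reader used for searching
                publishSearcher(a);                          // 2. serve queries against A
                IndexReader b = IndexReader.open(path);      // 3. separate reader for deletions
                applyQueuedDeletes(b);                       // 4.
                b.close();                                   // 5.
                IndexWriter c = new IndexWriter(path, new StandardAnalyzer(), false); // 6.
                applyQueuedAdds(c);                          // 7.
                c.close();                                   // 8.
                Thread.sleep(60 * 1000L);                    // 9. publish roughly once a minute
            }                                                // 10. and repeat
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    private void publishSearcher(IndexReader r) { /* swap in a new IndexSearcher(r) */ }
    private void applyQueuedDeletes(IndexReader r) { /* r.deleteDocuments(term) for queued updates */ }
    private void applyQueuedAdds(IndexWriter w) { /* w.addDocument(doc) for queued additions/updates */ }
}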

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: SegmentReader changes?

2006-05-01 Thread Robert Engels
Correct - changing SegmentReader would be best, but in the past, getting
proposed patches included has been slower than expected. So, by making the
SegmentReader more easily subclassed (which should hopefully get approved
quicker), I can still have a "build" of Lucene that does not require
patching any files. (just including classes in the appropriate package to
access package level vars/methods).

I can do everything needed (without subclassing) if there was a
package/public accessor to the segment "name".

-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Monday, May 01, 2006 5:44 PM
To: java-dev@lucene.apache.org
Subject: Re: SegmentReader changes?


Robert Engels wrote:
> In implementing the 'reopen()' method SegmentReader needs to be subclassed
> in order to support 'refreshing' the deleted documents.

Why subclass?  Why not simply change SegmentReader?  It's not a public
class at present, and making it a public class would be a bigger change
than should be required to implement reopen.

But perhaps I just don't yet understand how you intend to implement
re-open.  I think I'd implement it as something that inquired whether
the deletions have changed, and if they have, clone the SegmentReader,
re-opening all files, but only re-reading the deletions.

Doug

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: SegmentReader changes?

2006-05-01 Thread Robert Engels
I can submit a patch to add the IndexReader.reopen() method.

BUT, I think the requested change to SegmentReader is still valid, for the
reasons cited in the previous email.

There is already support for replacing the SegmentReader impl at runtime
with System properties, but without the SegmentReader changes I think it is
next to impossible to have any worthwhile subclass - except for "maybe"
method logging, so either the runtime replacement code should be removed, or
the changes made. Currently there just isn't a way for the subclass to know
ANYTHING, since all of the initialization methods called by the static
factory method are private.

-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Monday, May 01, 2006 6:03 PM
To: java-dev@lucene.apache.org
Subject: Re: SegmentReader changes?


Robert Engels wrote:
> Correct - changing SegmentReader would be best, but in the past, getting
> proposed patches included has been slower than expected.

I'm sorry if the process has been frustrating to you in the past.  I
hope your experiences are better in the future.

> So, by making the
> SegmentReader more easily subclassed (which should hopefully get approved
> quicker), I can still have a "build" of Lucene that does not require
> patching any files. (just including classes in the appropriate package to
> access package level vars/methods).

Aren't we discussing a change to IndexReader, adding a new method?  This
is not a contrib module, but a change to the core.  So proposing it as a
patch file that changes existing classes is the normal course.  I don't
think we ought to be in the practice of making changes in order to
support easier access to non-public APIs.

Doug

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: SegmentReader changes?

2006-05-01 Thread Robert Engels
No, not at all. I will put something together.

BUT, back to the subclassing comments... Why have the runtime replaceable
support then in the SegmentReader factory - there is nothing useful a
subclass can do at this time, and without API changes, it will never be able
to.



-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Monday, May 01, 2006 6:22 PM
To: java-dev@lucene.apache.org
Subject: Re: SegmentReader changes?


If the non-public core requires a subclassable SegmentReader then
SegmentReader should certainly be made subclassable.  But we shouldn't
make changes to improve the extensibility of the non-public API.  That's
a slippery slope.  The fact that you can access package-protected
members by writing code in the same package is a loophole, not a
supported extension mechanism.  We need to retain the freedom to change
non-public APIs without warning.

I'd love to see a good patch that adds an IndexReader.reopen() method
and I hope you are not discouraged from writing one.

Doug

Robert Engels wrote:
> I can submit a patch to add the IndexReader.reopen() method.
>
> BUT, I think the requested change to SegmentReader is still valid, for the
> reasons cited in the previous email.
>
> There is already support for replacing the SegmentReader impl at runtime
> with System properties, but without the SegmentReader changes I think it
is
> next to impossible to have any worthwhile subclass - except for "maybe"
> method logging, so either the runtime replacement code should be removed,
or
> the changes made. Currently there just isn't a way for the subclass to
know
> ANYTHING, since all of the initialization methods called by the static
> factory method are private.
>
> -Original Message-
> From: Doug Cutting [mailto:[EMAIL PROTECTED]
> Sent: Monday, May 01, 2006 6:03 PM
> To: java-dev@lucene.apache.org
> Subject: Re: SegmentReader changes?
>
>
> Robert Engels wrote:
>
>>Correct - changing SegmentReader would be best, but in the past, getting
>>proposed patches included has been slower than expected.
>
>
> I'm sorry if the process has been frustrating to you in the past.  I
> hope your experiences are better in the future.
>
>
>>So, by making the
>>SegmentReader more easily subclassed (which should hopefully get approved
>>quicker), I can still have a "build" of Lucene that does not require
>>patching any files. (just including classes in the appropriate package to
>>access package level vars/methods).
>
>
> Aren't we discussing a change to IndexReader, adding a new method?  This
> is not a contrib module, but a change to the core.  So proposing it as a
> patch file that changes existing classes is the normal course.  I don't
> think we ought to be in the pracice of making changes in order to
> support easier access to non-public APIs.
>
> Doug
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



MemoryIndex

2006-05-02 Thread Robert Engels
Along the lines of LUCENE-550, what about having a MemoryIndex that accepts
multiple documents, then writes the index once, at the end, in the Lucene file
format (so it could be merged) during close?

When adding documents using an IndexWriter, a new segment is created for
each document, and then the segments are periodically merged in memory
and/or with disk segments. It seems that when constructing an index or
updating a "lot" of documents in an existing index, the write/read/merge
cycle is inefficient, and if the document/field information were maintained
in order (e.g. in TreeMaps), greater efficiency would be realized.

With a memory index, the memory needed during update will increase
dramatically, but this could still be bounded, and a "disk based" index
segment written when too many documents are in the memory index (max
buffered documents).

Does this "sound" like an improvement? Has anyone else tried something like
this?
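
For comparison, a rough sketch of a similar effect with the existing API
(assuming Lucene 1.9/2.0, a StandardAnalyzer and an illustrative index path;
this is not the proposed MemoryIndex itself): build a bounded batch in a
RAMDirectory and merge it into the disk index once.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class BatchedAdd {
    public static void main(String[] args) throws Exception {
        Directory ramDir = new RAMDirectory();
        IndexWriter ramWriter = new IndexWriter(ramDir, new StandardAnalyzer(), true);
        for (int i = 0; i < 1000; i++) {                       // bounded in-memory batch
            Document doc = new Document();
            doc.add(new Field("id", String.valueOf(i), Field.Store.YES, Field.Index.UN_TOKENIZED));
            ramWriter.addDocument(doc);
        }
        ramWriter.close();

        IndexWriter diskWriter = new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
        diskWriter.addIndexes(new Directory[] { ramDir });     // one write/merge at the end
        diskWriter.close();
    }
}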



Why ThreadLocal?

2006-05-04 Thread Robert Engels
In reviewing the code for bug 436
(http://issues.apache.org/jira/browse/LUCENE-436)

Why are we using a ThreadLocal for the enumeration at all?

Since terms(), and terms(Term t) return new instances anyway, why not just
have them clone the needed data structures?

Seems like the code could be much simpler and perform just as well.

What am I missing?


RE: Multiple threads searching in Lucene and the synchronized issue. -- solution attached.

2006-05-09 Thread Robert Engels
I am interested in the exact performance difference, in ms per query, from
removing the synchronized block.

I can see that after a while when using your code, the JIT will probably
inline the 'non-reading' path.

Even then...

I would not think that 2 lines of synchronized code would contribute much
when even the simplest of queries needs to execute thousands of lines of
code (and probably at least a few calls to the OS, except when using a pure
memory index).

-Original Message-
From: yueyu lin [mailto:[EMAIL PROTECTED]
Sent: Tuesday, May 09, 2006 8:40 PM
To: java-dev@lucene.apache.org
Subject: Re: Multiple threads searching in Lucene and the synchronized
issue. -- solution attached.


My assumption is that every query is relatively quick. If most of the query
time is spent elsewhere, the ensureIndexIsRead() function will not
cause a lot of problems. If not, the ensureIndexIsRead() function will be a
bottleneck.
I can understand that a lot of systems' queries are quite complex, so the
problem may be gone. But for our system, with more than 150 queries per second
on a dual-CPU Linux box, that's a problem.

If the queries became more complicated, we would ignore it most of the time.

On 5/10/06, Doug Cutting <[EMAIL PROTECTED]> wrote:
>
> The best search performance is achieved using a single IndexSearcher
> shared by multiple threads.  Peter Keegan has demonstrated rates of up
> to 400 searches per second on eight-CPU machines using this approach:
>
> http://www.mail-archive.com/java-user@lucene.apache.org/msg05074.html
>
> So the synchronization is probably not hurting your performance.
>
> Doug
>
> yueyu lin wrote:
> > One IndexSearcher is one IndexSearcher instance. The instance has a lot of
> > functions. Unfortunately they will call another synchronized function in
> > another class's instance (TermInfosReader). That's why we need two
> > IndexSearchers. But two searchers will cost double the cache memory. It's
> > not worthwhile. So if the Lucene team can modify the code slightly, the
> > synchronization problem will be gone.
> >
> > On 5/9/06, Chris Hostetter <[EMAIL PROTECTED]> wrote:
> >
> >>
> >>
> >> :   We found if we were using 2 IndexSearcher, we would get 10%
> >> performance
> >> : benefit.
> >> :   But if we increased the number of IndexSearcher from 2, the
> >> performance
> >> : improvement became slight even worse.
> >>
> >> Why use more then 2 IndexSearchers?
> >>
> >> Typically 1 is all you need, except for when you want to open and "warm
> >> up" a new Searcher because you know your index has changed on disk and
> >> you're ready for those changes to be visible.
> >>
> >> (I'm not arguing against your change -- concurrancy isn't my forte so i
> >> have no opinion on wether your suggesting is good or not, i'm just
> >> questioning the goal)
> >>
> >> Acctually .. i don't know a lot about the internals of IndexSearcher
> and
> >> TermInfosReader, but according to your description of the problem...
> >>
> >> :   The class org.apache.lucene.index.TermInfosReader , as you know,
> >> every
> >> : IndexSearcher will have one TermInfosReader. Every query, one method
> in
> >> the
> >> : class must be called:
> >> : private synchronized void ensureIndexIsRead() throws IOException .
> >> Notice
> >>
> >> If the method isn't static, then how can two differnet instances of
> >> IndexSearcher, each with their own TermInfosReader, block one another?
> >>
> >>
> >>
> >>
> >> -Hoss
> >>
> >>
> >> -
> >> To unsubscribe, e-mail: [EMAIL PROTECTED]
> >> For additional commands, e-mail: [EMAIL PROTECTED]
> >>
> >>
> >
> >
> > --
> > --
> > Yueyu Lin
> >
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


--
--
Yueyu Lin


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Multiple threads searching in Lucene and the synchronized issue. -- solution attached.

2006-05-09 Thread Robert Engels
I think your basic problem is that you are using multiple IndexSearchers,
and creating new instances at runtime. If so, you will be reading the
index information far too often. This is not a good configuration.

-Original Message-
From: yueyu lin [mailto:[EMAIL PROTECTED]
Sent: Tuesday, May 09, 2006 8:46 PM
To: java-dev@lucene.apache.org; Otis Gospodnetic
Subject: Re: Multiple threads searching in Lucene and the synchronized
issue. -- solution attached.


Oh, please believe me: I've forced the JVM to print the thread dump, and it
was indeed waiting there.
I'll try to post the patch to JIRA.
I don't want to modify this code privately because that would mean diverging
from the Lucene code base, so I hope you can do me the favor of reviewing it
and making it available in the next release.
On 5/9/06, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
>
> Yueyu Lin,
>
> From what I can tell from a quick look at the method, that method needs to
> remain synchronized, so multiple threads don't accidentally re-read that
> 'indexTerms' (Term[] type).  Even though the method is synchronized, it
> looks like only the first invocation would enter that try/catch/finally
> block where term reading happens.  Subsequent calls to this method should
> exit quickly, because indexTerms != null.
>
> Are you sure this is causing the bottleneck for you?
> I believe the proper way to figure that out is to kill the JVM with a
> SIGnal that causes the JVM to dump thread information.  That would tell
you
> where the code is blocking.
>
> Also, if you have concrete suggestions for code changes, please post them
> to JIRA as diffs/patches.
>
> Otis
>
>
> - Original Message 
> From: yueyu lin <[EMAIL PROTECTED]>
> To: java-dev@lucene.apache.org
> Sent: Tuesday, May 9, 2006 3:53:55 AM
> Subject: Re: Multiple threads searching in Lucene and the synchronized
> issue. -- solution attached.
>
> Please trace the code in Lucene when searching.
> Here is a trace of how the invocations are made (step, class, method, notes):
>
>  1. org.apache.lucene.search.Searcher
>     public final Hits search(Query query)
>     It will call another search function.
>  2. org.apache.lucene.search.Searcher
>     public Hits search(Query query, Filter filter)
>     Only one line of code; it will new a Hits:
>     return new Hits(this, query, filter);
>  3. org.apache.lucene.search.Hits
>     Hits(Searcher s, Query q, Filter f)
>     Next, we trace into the constructor to see what is done there.
>  4. org.apache.lucene.search.Hits
>     Hits(Searcher s, Query q, Filter f), line 41: weight = q.weight(s)
>     This call will rewrite the Query if necessary; let us see what happens then.
>  5. org.apache.lucene.search.Query
>     public Weight weight(Searcher searcher), line 92: Query query = searcher.rewrite(this);
>     This call will begin to rewrite the Query.
>  6. org.apache.lucene.search.IndexSearcher
>     public Query rewrite(Query original)
>     NOTE: we only have one IndexSearcher, which has one IndexReader. If any
>     function here is synchronized, the query process will be queued.
>  7. org.apache.lucene.search.BooleanQuery
>     public Query rewrite(IndexReader reader), line 396: Query query = c.getQuery().rewrite(reader);
>     Here BooleanQuery gets its subqueries and calls their rewrite functions,
>     passing the *IndexReader*, of which we only have one instance. From the
>     code we notice that *TermQuery* is not rewritten and *PrefixQuery* is
>     rewritten into several *TermQuery*s, so we ignore *TermQuery* and look
>     into *PrefixQuery*.
>  8. org.apache.lucene.search.PrefixQuery
>     public Query rewrite(IndexReader reader), line 41: TermEnum enumerator = reader.terms(prefix);
>     Let's see what happens then.
>  9. org.apache.lucene.index.SegmentReader
>     public TermEnum terms(Term t), line 277: return tis.terms(t);
>     SegmentReader is in fact an IndexReader implementation.
> 10. org.apache.lucene.index.TermInfosReader
>     public SegmentTermEnum terms(Term term), line 211: get(term);
> 11. org.apache.lucene.index.TermInfosReader
>     TermInfo get(Term term), line 136: ensureIndexIsRead();
>     We finally find it!
> 12. org.apache.lucene.index.TermInfosReader
>     private synchronized void ensureIndexIsRead()
>     Let's analyze the function to see why it is synchronized and how to improve it.
>
> On 5/9/06, Chris Hostetter <[EMAIL PROTECTED]> wrote:
> >
> >
> > :   We found if we were using 2 IndexSearcher, we would get 10%
> > performance
> > : benefit.
> > :   But if we increased the number of IndexSearcher from 2, the
> > performance
> > : improvement became slight even worse.
> >
> > Why use more then 2 IndexSearchers?
> >
> > Typically 1 is all you need, except for when you want to open and "warm
> > up" a new Searcher because you know your index has changed on disk and
> > you're ready for those changes to be visible.
> >
> > (I'm not arguing against your change -- co

RE: Multiple threads searching in Lucene and the synchronized issue. -- solution attached.

2006-05-09 Thread Robert Engels
I am fairly certain his code is ok, since it rechecks the initialized state
in the synchronized block before initializing.

Worst case, during the initial checks when the initialization is occurring
there may be some unneeded checking, but after that, the code should perform
better since it will never enter a synchronized block.

I just doubt that this change makes any real difference if the IndexSearcher
is long-lived.

-Original Message-
From: Yonik Seeley [mailto:[EMAIL PROTECTED]
Sent: Tuesday, May 09, 2006 9:04 PM
To: java-dev@lucene.apache.org
Cc: Otis Gospodnetic
Subject: Re: Multiple threads searching in Lucene and the synchronized
issue. -- solution attached.


Yueyu Lin,

Your patch below looks suspiciously like the double-checked locking
anti-pattern, and is not guaranteed to work.
There really isn't a way to safely lazily initialize without using
synchronized or volatile.

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server

On 5/9/06, yueyu lin <[EMAIL PROTECTED]> wrote:
> Yes, the modification is still synchronized and the first thread will be
> responsible for reading first. And then other threads will not read and
the
> synchronization is unnecessary.
> private void ensureIndexIsRead() throws IOException {
> if (indexTerms != null)   // index already read
>   return; // do nothing
> synchronized(this){
> System.out.println("Read [EMAIL PROTECTED]@");
> if(indexTerms != null){
> System.out.println ("Someone read it.return-_-");
> return ;
> }
> readIndex ();
> }
>   }
>
>   private synchronized void readIndex() throws IOException{
>   Term[] m_indexTerms = null;
>   try {
>   int indexSize = (int)indexEnum.size;// otherwise read
> index
>   m_indexTerms = new Term[indexSize];
>   indexInfos = new TermInfo[indexSize];
>   indexPointers = new long[indexSize];
>
>   for (int i = 0; indexEnum.next(); i++) {
> m_indexTerms[i] = indexEnum.term();
> indexInfos[i] = indexEnum.termInfo();
> indexPointers[i] = indexEnum.indexPointer;
>   }
> } finally {
> indexEnum.close();
> indexEnum = null;
> indexTerms = m_indexTerms;
> }
>   }
>
> That's a small trick I learned when I was developing a busy stock system.
>
> If the method ensureIndexIsRead() is synchronized, it will be blocked for a
> while, even though it is only two lines of code.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Multiple threads searching in Lucene and the synchronized issue. -- solution attached.

2006-05-09 Thread Robert Engels
I wrote a test case to test the performance (assuming that it worked, but
based on reading the double-checked articles I understand the dilemma).

Using 30,000,000 simple iterations and 2 threads: (note this is on a single
processor machine).

sync time = 39532
unsync time = 2250
diff time = 37282
diff per iteration = 6.2136667E-4

So it saves .0006 ms per invocation of the method.
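
(That per-iteration figure is just the measured difference spread over every
call across both threads: 37282 ms / (30,000,000 iterations x 2 threads) is
about 6.2e-4 ms, i.e. roughly 0.6 microseconds per invocation.)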

I honestly doubt this is the cause of any performance bottleneck.

I attached the test code if you are interested.



-Original Message-
From: yueyu lin [mailto:[EMAIL PROTECTED]
Sent: Tuesday, May 09, 2006 11:32 PM
To: java-dev@lucene.apache.org
Subject: Re: Multiple threads searching in Lucene and the synchronized
issue. -- solution attached.


I have indeed met this problem before. The compiler did some optimization
that turned out badly for me, which I could see in the byte-codes.
When I use a function-local variable, m_indexTerms, under JDK 1.5.06 it
seems OK.
Whether it will break in other environments, I still don't know.
On 5/10/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:
>
> On 5/9/06, Robert Engels <[EMAIL PROTECTED]> wrote:
> > I am fairly certain his code is ok, since it rechecks the initialized
> state
> > in the synchronized block before initializing.
>
> That "recheck" is why the pattern (or anti-pattern) is called
> double-checked locking :-)
>
> -Yonik
> http://incubator.apache.org/solr Solr, the open-source Lucene search
> server
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


--
--
Yueyu Lin

public class TestSyncCosts {
static Object test = null;

static final int NTHREADS = 2;
static final int NITERATIONS = 3000;
static int creates = 0;

public static void main(String[] args) throws Exception {

Thread[] threads = new Thread[NTHREADS];

for(int i=0;i<NTHREADS;i++) {
// NOTE: the remainder of the attached test case was truncated by the mail
// archive. Presumably each thread ran NITERATIONS lazy-init calls through a
// synchronized accessor and an unsynchronized one, producing the "sync time"
// and "unsync time" figures quoted above.
}
}
}
-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

RE: Multiple threads searching in Lucene and the synchronized issue. -- solution attached.

2006-05-09 Thread Robert Engels
I think you could use a volatile primitive boolean to control whether or not
the index needs to be read, and also mark the index data volatile and it
SHOULD PROBABLY work.

But as stated, I don't think the performance difference is worth it.


-Original Message-
From: yueyu lin [mailto:[EMAIL PROTECTED]
Sent: Tuesday, May 09, 2006 11:32 PM
To: java-dev@lucene.apache.org
Subject: Re: Multiple threads searching in Lucene and the synchronized
issue. -- solution attached.


I have indeed met this problem before. The compiler did some optimization
that turned out badly for me, which I could see in the byte-codes.
When I use a function-local variable, m_indexTerms, under JDK 1.5.06 it
seems OK.
Whether it will break in other environments, I still don't know.
On 5/10/06, Yonik Seeley <[EMAIL PROTECTED]> wrote:
>
> On 5/9/06, Robert Engels <[EMAIL PROTECTED]> wrote:
> > I am fairly certain his code is ok, since it rechecks the initialized
> state
> > in the synchronized block before initializing.
>
> That "recheck" is why the pattern (or anti-pattern) is called
> double-checked locking :-)
>
> -Yonik
> http://incubator.apache.org/solr Solr, the open-source Lucene search
> server
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>


--
--
Yueyu Lin


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Multiple threads searching in Lucene and the synchronized issue. -- solution attached.

2006-05-10 Thread Robert Engels
For what its worth...

That is not my understanding. My understanding is that volatile just ensures
the JIT always accesses the variable in order - it prevents some compiler
optimizations - whereas synchronized needs to acquire the lock. (There were
discussions about having volatile create synchronized accessors behind the
scenes, but I don't think that semantic was ever agreed upon.)

That, coupled with using primitives (to avoid the early memory allocation,
since primitives are allocated on the stack), allows the double-checked
locking to work (at least that is my understanding).

-Original Message-
From: Chris Hostetter [mailto:[EMAIL PROTECTED]
Sent: Wednesday, May 10, 2006 12:51 AM
To: Lucene Dev
Subject: RE: Multiple threads searching in Lucene and the synchronized
issue. -- solution attached.



: I think you could use a volatile primitive boolean to control whether or
not
: the index needs to be read, and also mark the index data volatile and it
: SHOULD PROBABLY work.
:
: But as stated, I don't think the performance difference is worth it.

My understanding is:
  1) volatile will only help as of java 1.5 ... lucene targets 1.4
 compatibility.
  2) in 1.5, volatile is basically just as expensive as synchronized.

: I met these problem before indeed.The compiler did something optimized for
: me that was bad for me when I see the byte-codes.
:  When I'm using a function local variable, m_indexTerms and in JDK1.5.06,
it
: seems ok.
: Whether it will break in other environments, I still don't know about it.

The dangerous thing is that even if the byte code looks okay, and if it
works okay today, your app could run for a while and then all of a
sudden stop working because of the order in which the threads are run, or
because of an optimization the JVM applies on the fly.
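
For illustration, the simple pre-1.5-safe alternative is to keep the whole
accessor synchronized (or to read the index eagerly up front). A minimal
sketch, with illustrative names rather than the actual Lucene code:

import java.io.IOException;

// Sketch only: safe on a 1.4 JVM, at the cost of taking the lock on every call.
class LazyTermIndex {
    private Object[] indexTerms;                 // guarded by "this"

    public synchronized Object[] getIndexTerms() throws IOException {
        if (indexTerms == null) {
            indexTerms = readIndex();            // only the first caller pays this cost
        }
        return indexTerms;
    }

    private Object[] readIndex() throws IOException {
        return new Object[0];                    // placeholder for the real index read
    }
}

The first caller pays the read cost; every later call pays only an
uncontended lock, which is exactly the overhead being argued about in this
thread.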



-Hoss


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Taking a step back

2006-05-10 Thread Robert Engels
I agree with almost all of what you said.

The file format issue however is a non-issue. If you want interoperability
between systems, do it via remote invocation and IIOP, or some HTTP
interface. This is far easier to control, especially through version
change cycles - otherwise all platforms need to be updated together, which
is very hard to do (unless you are using Java with WORA!).

I also don't understand why Lucene doesn't focus on being THE JAVA search
engine. Anything that I think detracts from moving that forward should be
out of scope.


-Original Message-
From: Grant Ingersoll [mailto:[EMAIL PROTECTED]
Sent: Wednesday, May 10, 2006 6:06 AM
To: Lucene Developer's List
Subject: Taking a step back


Is it just me or do we have a whole bunch of people proposing a whole
bunch of fairly broad changes to Lucene? (I know, I know, they should
always be backward compatible)  Might this warrant some
coordination/planning?  I know things are mostly done in an ad-hoc way
(whoever submits a patch), but I think we may all be better served by
some coordination beyond what takes place on the mailing list.  I see
some pieces here and there that would benefit from common code, etc.

As I see it, we have several people proposing file format changes, Otis
and some others want scoring changes, I have discussed with a few people
the ability to make more pluggable how fields are indexed plus the
ability to add metadata at all levels of the index (field, document,
index, etc.), more to come on this soon.  Additionally, the lazy loading
field stuff is pending and would benefit from a few file format changes
as well

Additionally, we can't just think of the Java version anymore,
especially when it comes to file formats, I don't think.  Should we,
perhaps, setup a top-level wiki-style planning place?  Would this be
useful?  I don't think it replaces the good discussions on this list, I
just think it could give potential contributors a much easier way of
finding how to help, plus set a (albeit loose) plan for the future of
Lucene beyond what is captured in snippets of email here and there.

Just my two cents,
Grant

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Multiple threads searching in Lucene and the synchronized issue. -- solution attached.

2006-05-10 Thread Robert Engels
You are correct.

It would seem that modern processor architecture will force a better
solution to this. Since chip makers seem to be giving up on clock speed and
going with multiple parallel cores and multi-processors, a solution to this
problem becomes even more important. The upside should be that JVMs will be
able to implement synchronized more efficiently.


-Original Message-
From: Yonik Seeley [mailto:[EMAIL PROTECTED]
Sent: Wednesday, May 10, 2006 10:07 AM
To: java-dev@lucene.apache.org; [EMAIL PROTECTED]
Subject: Re: Multiple threads searching in Lucene and the synchronized
issue. -- solution attached.


On 5/10/06, Robert Engels <[EMAIL PROTECTED]> wrote:
> I think you could use a volatile primitive boolean to control whether or
not
> the index needs to be read, and also mark the index data volatile and it
> SHOULD PROBABLY work.

No, that still doesn't work.
volatile doesn't quite mean what you think it means with the Java
memory model < v1.5.

Reads and writes of volatile variables may not be reordered, *but*
non-volatile reads & writes may still be reordered w/ respect to the
volatile ones (making volatile not-that-useful).

The *reference* to the object is really the only thing that becomes
consistent across threads... the fields of objects the reference
points to can be inconsistent unless all the fields are volatile.

The memory model changed in Java1.5, and the meaning of volatile along with
it.
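
For completeness, under the Java 5 memory model the double-checked idiom does
become legal if the reference itself is volatile. A minimal sketch with
illustrative names (and only valid on 1.5+, so not directly usable while
Lucene targets 1.4):

// Sketch only: volatile is what makes this safe under the Java 5 memory model,
// because the volatile write that publishes the fully built array is visible
// to any thread that later performs the volatile read.
class LazyTermIndex5 {
    private volatile Object[] indexTerms;

    public Object[] getIndexTerms() {
        Object[] result = indexTerms;            // single volatile read on the fast path
        if (result == null) {
            synchronized (this) {
                result = indexTerms;             // re-check under the lock
                if (result == null) {
                    indexTerms = result = readIndex();
                }
            }
        }
        return result;
    }

    private Object[] readIndex() {
        return new Object[0];                    // placeholder for the real index read
    }
}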

-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Taking a step back

2006-05-10 Thread Robert Engels
What about the case where a "bug" is found that necessitates a file format
change?

Obviously this should be VERY rare given adequate testing, but it seems
difficult to make a hard and fast rule that X.0 should ALWAYS be able to
read X.N.


-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Wednesday, May 10, 2006 1:14 PM
To: java-dev@lucene.apache.org
Subject: Re: Taking a step back


Lucene version numbers are about compatibility.

Minor versions should always have complete API back-compatibility.
That's to say, any code developed against X.0 should continue to run
without alteration against all X.N releases.  A major release may
introduce incompatible API changes.  The transition strategy is to
introduce new APIs in release X.N, deprecating old APIs, then remove all
deprecated APIs in release X+1.0.

File formats are back-compatible between major versions.  Version X.N
should be able to read indexes generated by any version after and
including version X-1.0, but may-or-may-not be able to read indexes
generated by version X-2.N.

Note that older releases are never guaranteed to be able to read indexes
generated by newer releases.  When this is attempted, a predictable
error should be generated.

Does that sound reasonable?

Doug

karl wettin wrote:
> On Wed, 2006-05-10 at 13:29 -0400, Grant Ingersoll wrote:
>
>>Or even Lucene3Whiteboard (did I really write Lucene 3?!?)
>
>
> You know, I was just thinking that it would be nice if Lucene was
> developed like the Linux kernels. When 2.6 is stable, people are beta
> testing 2.7 and some hack 2.8.
>
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Taking a step back

2006-05-11 Thread Robert Engels
Exactly. If people don't get the REAL value of Java by now, they are
probably not going to ever get it. Weighing ALL of the pros/cons, developing
modern software in anything else is just silly. But, arguing this is akin to
discussing religion...

-Original Message-
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Thursday, May 11, 2006 12:20 PM
To: java-dev@lucene.apache.org
Subject: Re: Taking a step back


Marvin Humphrey wrote:
> The only question is whether there are Java-specific optimizations
> which are so advantageous that they outweigh the benefits of
> interchange.

It's not just optimizations.  If we, e.g., wrote, for each field, the
name of the codec class that it uses, then we could provide arbitrary
extensibility.  Anything that implemented the field codec API could be
used, permitting alternate posting compression algorithms, etc.  But
that would not be friendly to other implementations, which may not be
able to easily instantiate classes from class names, nor dynamically
download codec implementations from a public repository, etc.  The fact
that java bytecode is portable makes this more attractive.

Doug
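
To make that concrete, here is a rough sketch of what per-field codec dispatch
could look like; FieldCodec and the stored class name are illustrative
assumptions, not an existing Lucene API:

import org.apache.lucene.store.IndexInput;

// Hypothetical extension point: nothing like this exists in Lucene today.
interface FieldCodec {
    void readPostings(IndexInput in) throws Exception;
}

class FieldCodecDispatch {
    // A Java implementation can instantiate whatever codec the index names, as
    // long as the class is on the classpath; other languages cannot do this as easily.
    static void readField(IndexInput in) throws Exception {
        String codecClassName = in.readString();     // written when the field was indexed
        FieldCodec codec = (FieldCodec) Class.forName(codecClassName).newInstance();
        codec.readPostings(in);
    }
}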

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Taking a step back

2006-05-11 Thread Robert Engels
I disagree with that a bit. I have found that certain languages lend
themselves far better to certain file formats (that is, if an operation is
very efficient to perform in a particular language, using a file format that
allows the usage of that operation directly will often lead to much better
performance). This is often true with byte ordering on particular hardware
platforms. That is the whole reason this is an issue. Others can read the
modified UTF, it is just not as efficient for them !

But more importantly, I don't think Lucene (or others) should be "held back"
attempting to adhere to a standardized file format.

Take databases for example. Many available. All use different file formats,
but all can be accessed with (pretty much) standardized SQL (using different
drivers).

I think Lucene could offer a similar approach at the API level, maybe an
embedded TCP/IP interface / command processor (similar to an HTTP server).

You are always going to have interoperability issues (sometimes even when
using Java, but rarely), so I say dump the burden on the others, and just
make Lucene the best Java search engine possible.

Without starting some sort of flame war, I can't think of any advantages to
not running a Java version of Lucene, but, that is just my opinion. It would
be fairly straightforward to convert all of Lucene to C and provide a Java
binding, but why???



-Original Message-
From: Marvin Humphrey [mailto:[EMAIL PROTECTED]
Sent: Thursday, May 11, 2006 12:08 PM
To: java-dev@lucene.apache.org
Subject: Re: Taking a step back



On May 10, 2006, at 8:02 AM, Robert Engels wrote:

> The file format issue however is a non-issue. If you want
> interoperability
> between systems do it via remote invocation and IIOP, or some HTTP
> interface. This is far more easier to control, especially through
> version
> change cycles - otherwise all platforms need to be updated together
> - which
> is very hard to do (unless you are using Java with WORA !).
>
> I also don't understand why Lucene doesn't focus on being THE JAVA
> search
> engine. Anything I think that detracts that from moving forward
> should be
> out of scope.

I really don't relish the prospect that this might degenerate into a
language argument, but I think it falls to me to respond, since the
patch I submitted on Monday opens up a lot of possibilities for interop.

I don't necessarily disagree.

Abandoning all attempts at interop has its advantages.  One
unfortunate albeit unavoidable aspect of Lucene is that it is tightly
bound to its file format.  In a perfect world, the file reading/
writing apparatus would be modular: the index would be read into
memory using a plugin, manipulated, then saved using another plugin.
That doesn't work, obviously, because indexes are commonly too large
to be read into available RAM, and so the I/O stuff is scattered over
the entire library, which makes maintaining compatibility laborious.

However, Lucene has to make some effort to track its file format
definition, so that it may live up to the commitments for backwards-
compatibility codified earlier in this thread.  This is currently
done using the File Formats document (though that document is
incomplete and buggy).  There's not much difference between
supporting the files written by an earlier version of Lucene and
supporting the files written by another implementation of Lucene
which adhere to the same spec.

The only question is whether there are Java-specific optimizations
which are so advantageous that they outweigh the benefits of
interchange.  There is no inherent advantage in using Modified UTF-8
over standard UTF-8, and the UTF-8 code I supplied actually speeds up
Lucene by a couple percent because it simplifies some conditionals --
all of the performance hit comes from using a bytecount as the String
prefix.  I have good reasons to believe that this can go away, not
the least of which is I've actually written a working implementation
in Perl/C which uses bytecounts and I know where all the bottlenecks
are.

There are also advantages to keeping the file format public, both for
Java Lucene and for the larger Apache Lucene project.  Of course
there's the the raw usefulness of interchange.  For instance, it
might be nice to whip up a little script in Perl or Ruby which works
with your existing rig -- especially if there's a CPAN module that
offers functionality you need which isn't available yet in Java, or
you'd benefit from a near-instantaneous startup time.

But more important, I'd argue, is that having all implementations
share a common file format means that all the authors have an
amplified interest in coordinating, communicating, and contributing.
Just as learning new languages, programming or natural, broadens an
individual's horizons, so does working out an implementation based on
Lucene's data structures.

RE: Taking a step back

2006-05-11 Thread Robert Engels
1) That is my point. In this case, they are not copying the impl, they are
requesting changes to the format.

I just think there are better ways of doing interoperability than file
formats. In almost all cases where I've encountered (or built !) systems
that did integration based on a known file format, it bit me in the ass in
the end... and/or severely limited the ability of myself or others to
change...

-Original Message-
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Thursday, May 11, 2006 2:29 PM
To: java-dev@lucene.apache.org
Subject: Re: Taking a step back


I don't want to get into this (so I'm replying!?), but I just want to point
out 2 things:
1) So far we've never had a situation where Java Lucene was held back
because of interoperability.  Ports tend to copy the implementation and
adapt to Java Lucene.
2) Solr already does the HTTP server thing that you are describing, I
believe.

Otis

- Original Message 
From: Robert Engels <[EMAIL PROTECTED]>
To: java-dev@lucene.apache.org
Sent: Thursday, May 11, 2006 1:37:17 PM
Subject: RE: Taking a step back

I disagree with that a bit. I have found that certain languages lend
themselves far better to certain file formats (that is, if an operation is
very efficient to perform in a particular language, using a file format that
allows the usage of that operation directly will often lead to much better
performance). This is often true with byte ordering on particular hardware
platforms. That is the whole reason this is an issue. Others can read the
modified UTF, it is just not as efficient for them !

But more importantly, I don't think Lucene (or others) should be "held back"
attempting to adhere to a standardized file format.

Take databases for example. Many available. All use different file formats,
but all can be accessed with (pretty much) standardized SQL (using different
drivers).

I think Lucene could offer a similar approach at the API level, maybe an
embedded TCP/IP interface / command processor (similar to an HTTP server).

You are always going to have interoperability issues (sometimes even when
using Java, but rarely), so I say dump the burden on the others, and just
make Lucene the best Java search engine possible.

Without starting some sort of flame war, I can't think of any advantages to
not running a Java version of Lucene, but, that is just my opinion. It would
be fairly straight forward to convert all of Lucene to C, and provide a Java
binding, but why???



-Original Message-
From: Marvin Humphrey [mailto:[EMAIL PROTECTED]
Sent: Thursday, May 11, 2006 12:08 PM
To: java-dev@lucene.apache.org
Subject: Re: Taking a step back



On May 10, 2006, at 8:02 AM, Robert Engels wrote:

> The file format issue however is a non-issue. If you want
> interoperability
> between systems do it via remote invocation and IIOP, or some HTTP
> interface. This is far more easier to control, especially through
> version
> change cycles - otherwise all platforms need to be updated together
> - which
> is very hard to do (unless you are using Java with WORA !).
>
> I also don't understand why Lucene doesn't focus on being THE JAVA
> search
> engine. Anything I think that detracts that from moving forward
> should be
> out of scope.

I really don't relish the prospect that this might degenerate into a
language argument, but I think it falls to me to respond, since the
patch I submitted on Monday opens up a lot of possibilities for interop.

I don't necessarily disagree.

Abandoning all attempts at interop has its advantages.  One
unfortunate albeit unavoidable aspect of Lucene is that it is tightly
bound to its file format.  In a perfect world, the file reading/
writing apparatus would be modular: the index would be read into
memory using a plugin, manipulated, then saved using another plugin.
That doesn't work, obviously, because indexes are commonly too large
to be read into available RAM, and so the I/O stuff is scattered over
the entire library, which makes maintaining compatibility laborious.

However, Lucene has to make some effort to track its file format
definition, so that it may live up to the commitments for backwards-
compatibility codified earlier in this thread.  This is currently
done using the File Formats document (though that document is
incomplete and buggy).  There's not much difference between
supporting the files written by an earlier version of Lucene and
supporting the files written by another implementation of Lucene
which adhere to the same spec.

The only question is whether there are Java-specific optimizations
which are so advantageous that they outweigh the benefits of
interchange.  There is no inherent advantage in using Modified UTF-8
over standard UTF-8, and the UTF-8 code I supplied actually speeds up
Lucene by a couple percent because it simplifies some conditionals --
all of t

RE: LUCENE-436

2006-05-12 Thread Robert Engels
As stated many times, it is SIGNIFICANT if using RAMDirectories to hold
entire indexes. If not, then it is not such a big deal.

Rather than using FixedThreadLocal, a more involved solution using a runtime
property to determine which thread-local impl to use is possible. Failing
that, RAMDirectories are either broken, or everyone takes a performance hit.


-Original Message-
From: Fernando Padilla [mailto:[EMAIL PROTECTED]
Sent: Thursday, May 11, 2006 9:53 PM
To: java-dev@lucene.apache.org
Subject: Re: LUCENE-436


so... what do you think?

We just took the patch through QA: without it there was a noticeable memory
increase over time, and once we applied the patch, the memory didn't
increase.

So if you don't like the solution.. what are some alternatives?

fernando

ps - www.protrade.com


Otis Gospodnetic wrote:
> I'm not at home with some of the things mentioned in LUCENE-436, so I'm
not applying any of the various patches provided there, but it looks like
something that deserves attention.  I think it has been brought up a while
back, too.
> http://issues.apache.org/jira/browse/LUCENE-436
>
> Otis
>
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Lucene Index comparison..

2006-05-12 Thread Robert Engels
I think more detail is needed.

-Original Message-
From: Krishnan, Ananda [mailto:[EMAIL PROTECTED]
Sent: Thursday, May 11, 2006 10:46 PM
To: java-dev@lucene.apache.org
Cc: [EMAIL PROTECTED]
Subject: RE: Lucene Index comparison..


Hi

I will explain my problem in a bit more detail.
Every night at a particular time, indexing happens. I have to find out
whether any new URLs or files have been added in the new index, so that I
can send an alert to the customer.

I know how to send the alert to the customer, but I am not sure how to
compare the indexes.

Have you ever tried this? If so, please let me know your approach; it would
be very helpful for me.

Thanks and regards,
Anandha Krishnan.


-Original Message-
From: karl wettin [mailto:[EMAIL PROTECTED]
Sent: Thursday, May 11, 2006 5:48 PM
To: java-dev@lucene.apache.org
Subject: Re: Lucene Index comparison..


On Thu, 2006-05-11 at 10:32 +0530, Krishnan, Ananda wrote:
> Hi
>
> Can anyone please help me to know about how to compare two different
lucene indexes.

I think there have been four instances of this question lately, including
my own.

Mine is only compatible with the 1.9_20060505-karl1 branch, and is really
more of a test case. I will add it to my next update of
. I can send you the
broken snapshot off list if you want.

It is supplied with two index readers and
 * iterates over and compares all terms, documents and positions.
 * runs a couple of searches and compares the results.
 * does not compare the term frequency vectors.
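
For the archives, here is a rough sketch of that kind of term-level comparison
using the plain IndexReader API (an illustration only, not the actual test
case mentioned above):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

// Report terms present in the new index that are unknown to the old one.
class IndexDiff {
    public static void printNewTerms(IndexReader oldReader, IndexReader newReader)
            throws Exception {
        TermEnum terms = newReader.terms();
        try {
            while (terms.next()) {
                Term t = terms.term();
                if (oldReader.docFreq(t) == 0) {     // old index has never seen this term
                    System.out.println("new term: " + t);
                }
            }
        } finally {
            terms.close();
        }
    }
}

Restricting the same loop to a single field (for example a url field) is
essentially the "what was added overnight" check asked about earlier in this
thread.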



-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



This e-mail and any attachment is for authorised use by the intended
recipient(s) only. It may contain proprietary material, confidential
information and/or be subject to legal privilege. It should not be copied,
disclosed to, retained or used by, any other party. If you are not an
intended recipient then please promptly delete this e-mail and any
attachment and all copies and inform the sender. Thank you.

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: LUCENE-436

2006-05-12 Thread Robert Engels
There is no "memory leak" per se - just the propensity to use more memory
than would seem to be needed (based on index size).

Using the FixedThreadLocal along with the modified TermInfosReader (that
uses it), the memory problem is resolved. If you are not seeing that, then
you have some other memory leak in your application.

-Original Message-
From: Fernando Padilla [mailto:[EMAIL PROTECTED]
Sent: Friday, May 12, 2006 1:35 PM
To: java-dev@lucene.apache.org
Subject: Re: LUCENE-436


Hmm. So what you're saying is that there is a "memory leak", but it is only
very noticeable with large RAMDirectories (like what we have)...

With a 5M directory, it seems to be leaking at least 5M per hour,
depending on queries.. even on our 1500M VM we run out of memory over
24 hours.

So I guess we have no choice but to use FSDirectories?


The FixedThreadLocal patch doesn't seem to have helped after all..



Robert Engels wrote:
> As stated many times, it is SIGNIFICANT if using RAMdirectories to hold
> entire indexes. If not, then it is not such a big deal.
>
> Rather than using FixedThreadLocal, a more involved solution using a
runtime
> property to determine which thread local impl to use is possible. In lieu
of
> that, RAMDirectories are either broken, or everyone takes a performance
hit.
>
>
> -Original Message-
> From: Fernando Padilla [mailto:[EMAIL PROTECTED]
> Sent: Thursday, May 11, 2006 9:53 PM
> To: java-dev@lucene.apache.org
> Subject: Re: LUCENE-436
>
>
> so... what do you think?
>
> We just took the patch through QA and there was a noticeable memory
> increase through time, and once we applied the patch, the memory didn't
> increase..
>
> So if you don't like the solution.. what are some alternatives?
>
> fernando
>
> ps - www.protrade.com
>
>
> Otis Gospodnetic wrote:
>
>>I'm not at home with some of the things mentioned in LUCENE-436, so I'm
>
> not applying any of the various patches provided there, but it looks like
> something that deserves attention.  I think it has been brought up a while
> back, too.
>
>>http://issues.apache.org/jira/browse/LUCENE-436
>>
>>Otis
>>
>>
>>
>>-
>>To unsubscribe, e-mail: [EMAIL PROTECTED]
>>For additional commands, e-mail: [EMAIL PROTECTED]
>>
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Nio File Caching & Performance Test

2006-05-12 Thread Robert Engels
I finally got around to making the NioFSDirectory with caching 1.9
compliant. I also produced a performance test case.

Below are the results on my machine:

read random = 586391
read same = 68578
nio read random = 72766
nio max mem = 203292672
nio memory = 102453248
nio hits = 14974713
nio misses = 25039
nio hit rate = 99
nio read same = 22344

The most important statistic is that reading via the local cache, vs.
going to the OS (where the block is cached), is 3x faster (22344 vs. 68578).
With random reads, when the block may not be in the OS cache, it is 8x
faster (72766 vs. 586391).

Attached are all of the files needed to run the test (The NioFSDirectory is
not needed for the test).

The revised NioFile shares one cache for all Nio files. The revised
MemoryCache uses SoftReferences to allow the cache to grow unbounded and
lets the GC handle cache reductions (it seems that for most JVMs,
SoftReferences are reclaimed in LRU order, which helps).

This test only demonstrates improvements in the low-level IO layer, but one
could infer significant performance improvements for common searches and/or
document retrievals.

Is there a standard Lucene search performance test I could run both with and
without the NioFSDirectory to demonstrate real-world performance
improvements? I have some internal tests that I am collating, but I would
rather use a standard test if possible.
package org.apache.lucene.util;

import java.io.*;
import java.util.*;

import junit.framework.TestCase;

public class NioFilePerformanceTest extends TestCase {
static final int BLOCKSIZE = 1 * 2048 + 1; // try with 2k, 4k and 2k+1 (so not on nio boundary)
static final int NBLOCKS = 1000 * 100; // 400 mb file with 4k blocks
static final File file = new File("testfile");
static final int NREADS = (NBLOCKS)*100;
static final int PERCENTOFFILE = 50; // must be 1-100
static final byte[] block = new byte[BLOCKSIZE];

static {
System.setProperty("org.apache.lucene.CachePercent","90");
}

public void testCreateFile() throws Exception {
long stime = System.currentTimeMillis();
RandomAccessFile rf = new RandomAccessFile(file,"rw");
for(int i=0;i<NBLOCKS;i++)
rf.write(block);   // reconstructed: the loop body was lost where the archive dropped text after '<'
rf.close();
System.out.println("create time = "+(System.currentTimeMillis()-stime));
}

// NOTE: the remaining test methods (the plain and Nio cached reads that produced
// the figures quoted in the covering mail) were truncated by the mail archive.
}
package org.apache.lucene.util;

import java.io.*;
import java.lang.ref.SoftReference;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.util.*;

/**
 * wrapper for NIO FileChannel in order to circumvent problems with multiple threads reading the
 * same FileChannel, and to provide local cache. The current Windows implementation of FileChannel
 * has some synchronization even when performing positioned reads. See JDK bug #6265734.
 * 
 * The NioFile contains internal caching to reduce the number of physical disk reads.
 */
public final class NioFile {
static final private int BLOCKSIZE = Integer.getInteger("org.apache.lucene.BlockSize",4096).intValue();

static public int cachehits = 0; 
static public int cachemisses = 0; 

static public MemoryCache cache;
static {
cache = new MemoryCache();
}

private boolean open = true;
private int opencount = 0;
private FileChannel channel;

public NioFile(File path,String mode) throws IOException {
//System.out.println("new NioFile for "+path);
open(path,mode);
}

private synchronized void open(File path,String mode) throws IOException {
if(opencount++==0) {
RandomAccessFile raf = new RandomAccessFile(path,mode);
channel = raf.getChannel();
}
}

public synchronized void close() throws IOException {
if(--opencount==0)
channel.close();
}

public boolean isOpen() {
return opencount>0;
}

public void read(byte[] b, int offset, int len, long position) throws IOException {
do {
long blockno = (position/BLOCKSIZE);
BlockKey bk = new BlockKey(this,blockno);
byte[] block = cache.get(bk);

if(block==null) {
cachemisses++;
block = new byte[BLOCKSIZE];
channel.read(ByteBuffer.wrap(block),blockno*BLOCKSIZE);
cache.put(bk,block);
} else
cachehits++;

int blockoffset = (int) (position % BLOCKSIZE);
int i = Math.min(len,BLOCKSIZE-blockoffset);

System.arraycopy(block,blockoffset,b,offset,i);

offset += i;
len -= i;
position += i;

} while (len >0);
}

static final class BlockKey {
private NioFile file;
private long blockno;
private int hashCode;

public BlockKey(NioFile file, long blockno) {
this.file = file;
this.blockno = blockno;
hashCode = (int) (file.hashCode() ^ blockno);
}
public int hashCode() {
return hashCode;
}

public boolean equals(Object o) {
// NOTE: the remainder of this class was truncated by the mail archive; an equals()
// consistent with hashCode() is reconstructed here so BlockKey works as a map key.
if(!(o instanceof BlockKey)) return false;
BlockKey other = (BlockKey) o;
return other.file==file && other.blockno==blockno;
}
}
}
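
The MemoryCache class used above was not preserved by the archive. A minimal
sketch of a SoftReference-based cache matching the get/put calls in NioFile
(the real attachment may differ, e.g. in how it honours the
org.apache.lucene.CachePercent property):

import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

// Sketch only: entries may be reclaimed by the GC under memory pressure,
// which is the "let the GC handle cache reductions" behaviour described above.
public class MemoryCache {
    private final Map blocks = new HashMap();          // BlockKey -> SoftReference(byte[])

    public synchronized byte[] get(Object key) {
        SoftReference ref = (SoftReference) blocks.get(key);
        if (ref == null) return null;
        byte[] block = (byte[]) ref.get();
        if (block == null) blocks.remove(key);          // entry was cleared by the GC
        return block;
    }

    public synchronized void put(Object key, byte[] block) {
        blocks.put(key, new SoftReference(block));
    }
}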

RE: Nio File Caching & Performance Test

2006-05-15 Thread Robert Engels
As stated in the email, it is 3x faster reading from a Java local cache,
than having Java go to the OS (where it may or may not be cached). It avoids
the overhead/context switch into the OS.

-Original Message-
From: peter royal [mailto:[EMAIL PROTECTED]
Sent: Monday, May 15, 2006 4:11 PM
To: java-dev@lucene.apache.org
Subject: Re: Nio File Caching & Performance Test


On May 12, 2006, at 3:38 PM, Robert Engels wrote:
> I finally got around to making the NioFSDirectory with caching 1.9
> compliant. I also produced a performance test case.

How does this implementation compare to the MMapDirectory?

I've found that the MMapDirectory is far faster than the FSDirectory
on 64-bit machines with very large indexes. Is the explicit caching
of the NioFSDirectory expected to be a considerable win over allowing
the OS to do caching of data from the filesystem?
-pete

--
[EMAIL PROTECTED] - http://fotap.org/~osi




-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



RE: Nio File Caching & Performance Test

2006-05-16 Thread Robert Engels
My tests still hold that the NioFile I submitted is significantly faster
than the standard FSDirectory.

BUT, the memory mapped implementation is significantly faster than NioFile.
I attribute this to the overhead of managing the soft references, and
possible GC interaction.

SO, I would like to use a memory mapped reader, but I encounter OOM errors
when mapping large files, due to running out of address space.

Has anyone found a solution for this? (A 2 gig index is not all that
large...).

-Original Message-
From: Murat Yakici [mailto:[EMAIL PROTECTED]
Sent: Tuesday, May 16, 2006 1:55 AM
To: java-dev@lucene.apache.org
Subject: Re: Nio File Caching & Performance Test


Hi,

According to my humble tests, there is no significant improvement
either. NIO has buffer creation time costs compared to other Buffered
IOs. However, a testbed would be ideal for benchmarks.

Murat

Doug Cutting wrote:

> Robert Engels wrote:
>
>> The most important statistic is that the reading via the local cache, vs.
>> going to the OS (where the block is cached) is 3x faster (22344 vs.
>> 68578).
>> With random reads, when the block may not be in the OS cache, it is 8x
>> faster (72766 vs. 586391).
>
> [ ... ]
>
>> This test only demonstrates improvements in the low-level IO layer,
>> but one
>> could infer significant performance improvements for common searches
>> and/or
>> document retrievals.
>
>
> That is not an inference I would make.  There should be some
> improvement, but whether it is significant is not clear to me.
>
>> Is there a standard Lucene search performance I could run both with and
>> without the NioFSDirectory to demonstrate real world performance
>> improvements? I have some internal tests that I am collating, but I would
>> rather use a standard test if possible.
>
>
> No, we don't have a standard benchmark suite.  Folks have talked about
> developing one, but I don't think one yet exists.
>
> Report what you have.  Describe the collection, how it is indexed, how
> you've selected queries, and the improvement in average response time.
>
> Doug
>
> -
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



FieldsReader synchronized access vs. ThreadLocal ?

2006-05-16 Thread Robert Engels
In SegmentReader, currently the access to FieldsReader.doc(n) is
synchronized (which is must be).

Does it not make sense to use a ThreadLocal implementation similar to the
TermInfosReader?

It seems that in a highly multi-threaded server this synchronized method
could lead to significant blocking when the documents are being retrieved?
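
A minimal sketch of the pattern being suggested, modeled on the per-thread
enumerator that TermInfosReader already keeps; createFieldsReader() is a
stand-in for however a per-thread reader over the same stored-fields files
would be obtained, not an existing Lucene method:

// Sketch only: each thread lazily gets its own reader instance, so doc(n)
// no longer serializes all readers behind one lock.
abstract class PerThreadFields {
    private final ThreadLocal perThread = new ThreadLocal();

    protected abstract Object createFieldsReader();   // e.g. clone the underlying streams

    public Object getFieldsReader() {
        Object fields = perThread.get();
        if (fields == null) {
            fields = createFieldsReader();
            perThread.set(fields);
        }
        return fields;
    }
}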


RE: Nio File Caching & Performance Test

2006-05-16 Thread Robert Engels
The MMapDirectory works for really big indexes (larger than 2 gig), BUT if
the JVM does not have enough address space (32-bit JVM) it will not work.

-Original Message-
From: eks dev [mailto:[EMAIL PROTECTED]
Sent: Tuesday, May 16, 2006 2:20 PM
To: java-dev@lucene.apache.org
Subject: Re: Nio File Caching & Performance Test


Hi Robert,
I might easily be wrong, but I believe I saw something on JIRA (or was it
Bugzilla?) a long, long time ago, where somebody made an MMAP implementation
for really big indexes that works on 32 bit. I guess it is worth checking.


- Original Message 
From: Yonik Seeley <[EMAIL PROTECTED]>
To: java-dev@lucene.apache.org; [EMAIL PROTECTED]
Sent: Tuesday, 16 May, 2006 6:10:07 PM
Subject: Re: Nio File Caching & Performance Test


On 5/16/06, Robert Engels <[EMAIL PROTECTED]> wrote:
> SO, I would like to use a memory mapped reader, but I encounter OOM errors
> when mapping large files, due to running out of address space.

Pretty much all x86 servers sold are 64 bit capable now.
Run a 64 bit OS if you can :-)

> Has anyone found a solution for this? (A 2 gig index is not all that
> large...).

I guess one could try a hybrid approach... only mmap certain index
files that are critical for performance.


-Yonik
http://incubator.apache.org/solr Solr, the open-source Lucene search server
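
For reference, the usual trick for mapping files bigger than a single
MappedByteBuffer allows is to map the file in fixed-size chunks and pick the
right buffer per read. Note this only lifts the 2 GB per-buffer limit; it does
not create more address space on a 32-bit JVM. A rough sketch:

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Sketch only: map a large file as a series of CHUNK-sized MappedByteBuffers.
// A real directory implementation would also handle reads that span a chunk boundary.
class ChunkedMMapFile {
    private static final int CHUNK = 256 * 1024 * 1024;   // 256 MB per mapping
    private final MappedByteBuffer[] buffers;

    public ChunkedMMapFile(File f) throws IOException {
        RandomAccessFile raf = new RandomAccessFile(f, "r");
        FileChannel ch = raf.getChannel();
        long length = ch.size();
        int n = (int) ((length + CHUNK - 1) / CHUNK);
        buffers = new MappedByteBuffer[n];
        for (int i = 0; i < n; i++) {
            long start = (long) i * CHUNK;
            long size = Math.min(CHUNK, length - start);
            buffers[i] = ch.map(FileChannel.MapMode.READ_ONLY, start, size);
        }
    }

    public byte readByte(long pos) {
        return buffers[(int) (pos / CHUNK)].get((int) (pos % CHUNK));
    }
}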

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



non indexed field searching?

2006-05-16 Thread Robert Engels
I know I've brought this up before (and others have as well), but maybe now,
with lazy field loading (seemingly driven by larger documents being stored),
it is time to revisit.

It seems that maybe a query could be separated into Filter and Query clauses
(similar to how the query optimizer works in Nutch). Clauses that were based
on non-indexed fields would be converted to a Filter.

The problem is that something like

(indexed:somevalue OR nonindexed:somevalue)

would require a complete visit to every document.

But something like

(indexed:somevalue AND nonindexed:somevalue)

would be very efficient.

I understand that this is moving Lucene closer to a database, but it is just
very difficult to perform some complex queries efficiently without it.

*** As an aside, I still don't understand why Filter is not an interface

interface Filter {
    boolean include(IndexReader reader, int doc) throws IOException;
}

and then you would have

class NonIndexedFilter implements Filter {
    private final String fieldname;
    private final String expression;

    NonIndexedFilter(String fieldname, String expression) {
        this.fieldname = fieldname;
        this.expression = expression;
    }

    public boolean include(IndexReader reader, int doc) throws IOException {
        Document d = reader.document(doc);
        String val = d.get(fieldname);
        return evaluate(expression, val);   // evaluate the expression against the stored value
    }
}

Filter being an interface should incur very little overhead in the common
case where it is backed by a BitSet, as the modern JVM will inline it.
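
To make the AND case concrete, here is roughly how such a clause-level filter
would be applied once the indexed clause has produced its candidate documents;
a sketch against the interface above, not existing Lucene code:

import java.io.IOException;
import java.util.BitSet;
import org.apache.lucene.index.IndexReader;

// "candidates" holds the documents matched by the indexed clause; the
// non-indexed clause is only evaluated for those documents, as in
// (indexed:somevalue AND nonindexed:somevalue). Filter here is the interface
// sketched above, not org.apache.lucene.search.Filter.
class NonIndexedAnd {
    static BitSet apply(IndexReader reader, BitSet candidates, Filter nonIndexedClause)
            throws IOException {
        BitSet result = new BitSet(reader.maxDoc());
        for (int doc = candidates.nextSetBit(0); doc >= 0; doc = candidates.nextSetBit(doc + 1)) {
            if (nonIndexedClause.include(reader, doc)) {
                result.set(doc);
            }
        }
        return result;
    }
}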


RE: Hacking Luke for bytecount-based strings

2006-05-16 Thread Robert Engels
While you're at it, why not rewrite Luke in Perl as well...

Seems like a great use of your time.

-Original Message-
From: Marvin Humphrey [mailto:[EMAIL PROTECTED]
Sent: Tuesday, May 16, 2006 11:36 PM
To: java-dev@lucene.apache.org
Cc: Andrzej Bialecki
Subject: Hacking Luke for bytecount-based strings


Greets,

There does not seem to be a lot of demand for one implementation of  
Lucene to read indexes generated by another implementation of Lucene  
for the purposes of indexing or searching.  However, there is a  
demand for index browsing via Luke.

It occurred to me today that if Luke were powered by a version of  
Lucene with my bytecount-based-strings patch applied, it would be  
able to read indexes generated by Ferret.  Ironically, it wouldn't be  
able to read KinoSearch indexes unless I reverted the change which  
causes the term vectors to be stored in the .fdt file.  I'd probably  
do that.  Luke is great.

One possibility for distributing such a beast is to offer a patched  
jar for download from my website.  Before I start down that road,  
though, I thought I'd bring up the subject here.

Thoughts?

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



  1   2   3   4   5   6   >