mount of fields, the performance gain of using equals() on interned strings
is no match for the performance loss of interning the field name of each field.
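To make the trade-off concrete, here is a minimal sketch using plain JDK strings only. It does not touch Lucene's Field or Term classes, and InternDemo and sameInterned are made-up names for illustration. Interning maps equal strings to one shared instance, so a reference comparison then succeeds, but every intern() call pays for a lookup in the JVM's string pool.

```java
final class InternDemo {
    /** Reference comparison; only meaningful if both arguments were interned. */
    static boolean sameInterned(String a, String b) {
        return a == b;
    }

    public static void main(String[] args) {
        // Build both names at run time so they are distinct objects,
        // not compile-time constants that the compiler already shares.
        String x = new String("content");
        String y = new StringBuilder("con").append("tent").toString();

        System.out.println(sameInterned(x, y));                   // false: two objects
        System.out.println(x.equals(y));                          // true: same characters
        System.out.println(sameInterned(x.intern(), y.intern())); // true: one pooled instance
    }
}
```

Whether the cheap comparison ever pays back the cost of interning depends on how often each name is compared after it has been interned, which is the point made above.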
Wolfgang Hoschek-2 wrote:
I noticed that, too, but in my case the difference was often much
more extreme: it was one of the primary
I need to read the TokenStream at least twice
I used the horribly hacky but quick-for-me method of adding a
method to MemoryIndex that accepts a List of Tokens. Any ideas?
I'm not sure about modifying MemoryIndex. It should be easy enough
to create a subclass of TokenStream - ("CachedToke
[
https://issues.apache.org/jira/browse/LUCENE-129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12462579
]
wolfgang hoschek commented on LUCENE-129:
-
Just to clarify: The empty finalize() method body in MemoryIndex
[
http://issues.apache.org/jira/browse/LUCENE-550?page=comments#action_12451870 ]
wolfgang hoschek commented on LUCENE-550:
-
I've now checked in a version of MemoryIndexTest into contrib/memory that more
easily allows to switch be
[
http://issues.apache.org/jira/browse/LUCENE-550?page=comments#action_12451730 ]
wolfgang hoschek commented on LUCENE-550:
-
What's the benchmark configuration? For example, is throughput bounded by
indexing or querying? Measur
[
http://issues.apache.org/jira/browse/LUCENE-550?page=comments#action_12451731 ]
wolfgang hoschek commented on LUCENE-550:
-
Other question: when running the driver in test mode (checking for equality of
query results against
[
http://issues.apache.org/jira/browse/LUCENE-550?page=comments#action_12451768 ]
wolfgang hoschek commented on LUCENE-550:
-
Ok. That means a basic test passes. For some more exhaustive tests, run all the
queries in
src/test/org
[
http://issues.apache.org/jira/browse/LUCENE-550?page=comments#action_12451817 ]
wolfgang hoschek commented on LUCENE-550:
-
> All Lucene unit tests have been adapted to work with my alternate index.
> Everything but proximity q
A related prior discussion is at
http://issues.apache.org/bugzilla/show_bug.cgi?id=34930
Wolfgang.
On Oct 3, 2006, at 7:08 PM, Yonik Seeley wrote:
Yeah, I don't think it's easy to get rid of the exception because the
client of FastStreamChar is JavaCC generated code, which AFAIK uses
the exc
MemoryIndex was designed to maximize performance for a specific use
case: pure in-memory data structure, at most one document per
MemoryIndex instance, any number of fields, high frequency reads,
high frequency index writes, no thread-safety required, optional
support for storing offsets.
I noticed that, too, but in my case the difference was often much
more extreme: it was one of the primary bottlenecks on indexing. This
is the primary reason why MemoryIndex.addField(...) navigates around
the problem by taking a parameter of type "String fieldName" instead
of type "Field":
Initially it might, but probably eventually not. I was
thinking Lucene formats might also be a bit more compact
than vanilla hash maps, but I guess that depends on
many factors. But I will probably want to play with
actual queries later on, based on frequencies.
OK.
In the latter case, are yo
Hi Tatu,
I take it that simply maintaining the frequencies in a hashmap
similar to
org.apache.lucene.index.memory.AnalyzerUtil.getMostFrequentTerms()
isn't sufficient for your use cases?
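For reference, a rough sketch of what "maintaining the frequencies in a hashmap" could look like with the JDK alone. This does not reproduce AnalyzerUtil.getMostFrequentTerms(); the class and method names are hypothetical, and whitespace splitting stands in for a real Lucene Analyzer.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class TermFrequencies {
    /** Counts how often each lower-cased, whitespace-separated token occurs. */
    static Map<String, Integer> count(String text) {
        Map<String, Integer> freq = new HashMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty()) {
                freq.merge(token, 1, Integer::sum); // insert 1 or add 1 to the old count
            }
        }
        return freq;
    }

    /** Returns up to limit entries, most frequent first; ties are broken alphabetically. */
    static List<Map.Entry<String, Integer>> mostFrequent(Map<String, Integer> freq, int limit) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(freq.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return entries.subList(0, Math.min(limit, entries.size()));
    }

    public static void main(String[] args) {
        Map<String, Integer> freq = count("James is running round in the woods the woods");
        System.out.println(mostFrequent(freq, 2));
    }
}
```

A map like this answers "how often does term X occur" but nothing positional or query-like, which is where an actual index would start to matter.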
In the latter case, are you using
org.apache.lucene.store.RAMDirectory or
org.apache.lucene.index.mem
If you'd consider using a MemoryIndex for this, I'd recommend also
having a look at nux.xom.pool.FullTextUtil and
nux.xom.pool.FullTextPool, adding smart caching for indexes, queries
and results on top of a MemoryIndex. With some luck this (or some
variant of it) could help speed up your us
On Dec 17, 2005, at 2:36 PM, Paul Elschot wrote:
Gentlemen,
While maintaining my bookmarks I ran into this:
"Case Study: Enabling Low-Cost XML-Aware Searching
Capable of Complex Querying":
http://www.idealliance.org/papers/xmle02/dx_xmle02/papers/03-02-08/03-02-08.html
Some loose thoughts:
all the results for a single XML
document. This is not provided by default, but has been done with
extension to this code."
Regards,
Paul Elschot
On Friday 16 December 2005 03:45, Wolfgang Hoschek wrote:
I think implementing an XQuery Full-Text engine is far beyond the
scope of Lucene.
luded in Java 6, but that doesn't help too much given the
Java 1.4 req.
-Yonik
On 12/15/05, Wolfgang Hoschek <[EMAIL PROTECTED]> wrote:
STAX would probably make coding easier, but unfortunately complicates
the packaging side: one must ship at least two additional external
jars (sta
I think implementing an XQuery Full-Text engine is far beyond the
scope of Lucene.
Implementing a building block for the fulltext aspect of it would be
more manageable. Unfortunately the W3C fulltext drafts
indiscriminately mix and mingle two completely different languages
into a single l
STAX would probably make coding easier, but unfortunately complicates
the packaging side: one must ship at least two additional external
jars (stax interfaces and impl) for it to become usable. Plus, STAX
is quite underspecified (I wrote a STAX parser + serializer impl
lately), so there's r
That's basically what I'm implementing with Nux, except that the
syntax and calling conventions are a bit different, and that Lucene
analyzers can optionally be specified, which makes it a lot more
powerful (but also a bit more complicated).
Wolfgang.
On Dec 6, 2005, at 10:48 AM, Incze Laj
Hopefully that makes sense to someone besides just me. It's certainly a
lot more complexity than a simple one to one mapping, but it seems to me
like the flexibility is worth spending the extra time to design/build it.
Makes perfect sense to me, and it doesn't seem any more complex tha
Thanks.
Just trying to get by with the weird Eclipse SVN client.
Wolfgang.
On Dec 3, 2005, at 7:54 AM, Erik Hatcher wrote:
Wolfgang,
First - I've authorized your address to send commit e-mails, so
they'll pass through right away from now on.
Related to your change - please take advantage o
Yonik, I haven't been terribly active lately, but I've been voted in
as committer as well... :-)
http://marc.theaimsgroup.com/?l=lucene-dev&w=2&r=1&s=hoschek+committer&q=b
Cheers,
Wolfgang.
On Dec 2, 2005, at 2:53 PM, Yonik Seeley wrote:
~yonik/yourkit/
---
On Aug 30, 2005, at 12:47 PM, Doug Cutting wrote:
Yonik Seeley wrote:
I've been looking around... do you have a pointer to the source
where just the suffix is converted from UTF-8?
I understand the index format, but I'm not sure I understand the
problem that would be posed by the prefix len
The Nux-1.3 release has been uploaded to
http://dsd.lbl.gov/nux/
Nux is an open-source Java toolkit making efficient and powerful XML
processing easy.
Changelog:
• Upgraded to saxonb-8.5 (saxon-8.4 and 8.3 should continue to work as well).
• Upgraded to xom-1.1-rc1 (w
On Jul 19, 2005, at 12:58 PM, Daniel Naber wrote:
Hi,
currently Analyzer is an abstract class. Shouldn't we make it an
Interface?
Currently that's not possible, but it will be as soon as the deprecated
method is removed (i.e. after Lucene 1.9).
Regards
Daniel
Daniel, what's the use ca
> poor java startup time
For the ones really keen on reducing startup time the Jolt Java VM
daemon may perhaps be of some interest:
http://www.dystance.net/software/jolt/index.html
I played with it a year ago when I was curious to see what could be
done about startup time in the context of
As an aside, in my performance testing of Lucene using JProfiler, it seems
to me that the only way to improve Lucene's performance greatly can come
from 2 areas
1. optimizing the JVM array/looping/JIT constructs/capabilities to avoid
bounds checking/improve performance
2. improve function
Yep, if one would set -Xmx32m memory consumption would of course be
different. So it's really a "discovery" of the (default) Sun JVM gc
policy rather than anything Lucene specific.
It seems that benchmark results sometimes reflect more a person's
familiarity (or lack thereof) with a tool ra
Folding the surround syntax into the standard query parser would be
great indeed!
I'd very much encourage the increased power and expressiveness lucene
would gain through that.
Wolfgang.
On May 29, 2005, at 9:33 AM, Otis Gospodnetic wrote:
--- Erik Hatcher <[EMAIL PROTECTED]> wrote:
O
Cool stuff. Once this has stabilized and settled down I might start
exposing the surround language from XQuery/XPath as an experimental
match facility.
Wolfgang.
On May 28, 2005, at 10:07 AM, Paul Elschot wrote:
On Saturday 28 May 2005 17:06, Erik Hatcher wrote:
On May 28, 2005, a
The nux-1.2 release has been uploaded to
http://dsd.lbl.gov/nux/
Nux is an open-source Java XML toolset geared towards embedded use in
high-throughput XML messaging middleware such as large-scale Peer-to-
Peer infrastructures, message queues, publish-subscribe and
matchmaking systems fo
For the MemoryIndex, I'm seeing large performance overheads due to
repetitive temporary string interning of o.a.l.index.Term.
For example, consider a FuzzyTermQuery or similar, scanning all terms
via TermEnum in the index: 40% of the time is spent in String.intern()
of new Term(). [Allocating
Right. One doesn't need to run those benchmarks to immediately see
that most time is spent in VM startup, class loading, hotspot
compilation rather than anything Lucene related. Even a simple
System.out.println("hello") typically takes some 0.3 secs on a fast
box and JVM.
Wolfgang.
On May
I've uploaded a main memory based SynonymMap and SynonymTokenFilter
contrib to
http://issues.apache.org/bugzilla/show_bug.cgi?id=34882
This can be used at index time or query time. So far I'm mostly using
it by handing an analyzer with a SynonymTokenFilter to a QueryParser
and that seems
On May 4, 2005, at 4:44 PM, Daniel Naber wrote:
On Wednesday 04 May 2005 22:59, Wolfgang Hoschek wrote:
I was considering an efficient impl of TermEnum.skipTo(Term target) for
the MemoryIndex. But then I realized that nothing anywhere in Lucene
calls that method.
It's part of the API (p
I was considering an efficient impl of TermEnum.skipTo(Term target) for
the MemoryIndex. But then I realized that nothing anywhere in Lucene
calls that method. It's effectively dead code; a remainder of a
previous ice age - nothing would break if it would be removed. I'd
suggest doing so unless
On May 3, 2005, at 5:26 PM, Erik Hatcher wrote:
Wolfgang,
I've now added this.
Thanks :-)
I'm not seeing how this could be generally useful. I'm curious how
you are using it and why it is better suited for what you're doing
than any other analyzer.
"keyword tokenizer" is a bit overloaded termin
Here's a convenience add-on method to MemoryIndex. If it turns out that
this could be of wider use, it could be moved into the core analysis
package. For the moment the MemoryIndex might be a better home.
Opinions, anyone?
Wolfgang.
/**
* Convenience method; Creates and returns a token strea
Here's a performance patch for MemoryIndex.MemoryIndexReader that
caches the norms for a given field, avoiding repeated recomputation of
the norms. Recall that, depending on the query, norms() can be called
over and over again with mostly the same parameters. Thus, replace
public byte[] norms(S
Thanks!
Wolfgang.
I've committed this change after it successfully worked for me.
Thanks!
Erik
The version I sent returns in O(1), if performance was your concern. Or
did you mean something else?
Since 0 is the only document number in the index, a
return target == 0;
might be nice for skipTo(). It doesn't really help performance, though,
and the next() works just as well.
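As a rough illustration of the single-document case, here is a self-contained sketch. SingleDocCursor is a hypothetical stand-in, not Lucene's Scorer, and it assumes one reading of the skipTo contract: advance at least once, then stop at the first document number at or beyond the target. Unlike a bare "return target == 0;", it also remembers that document 0 has already been consumed.

```java
final class SingleDocCursor {
    // The index holds exactly one document, and its number is 0.
    private static final int ONLY_DOC = 0;
    // Becomes true once the single document has been handed out.
    private boolean consumed = false;

    /** Advances to the next document; succeeds exactly once. */
    boolean next() {
        if (consumed) {
            return false;
        }
        consumed = true;
        return true;
    }

    /** Current document number; only meaningful after next() or skipTo() returned true. */
    int doc() {
        return ONLY_DOC;
    }

    /** Advances, then reports whether the document reached is at or beyond target. */
    boolean skipTo(int target) {
        // Both steps are O(1): one advance, one comparison against the only doc number.
        return next() && target <= ONLY_DOC;
    }

    public static void main(String[] args) {
        SingleDocCursor cursor = new SingleDocCursor();
        System.out.println(cursor.skipTo(0)); // true: lands on doc 0
        System.out.println(cursor.next());    // false: nothing after doc 0
    }
}
```

For valid (non-negative) targets the comparison is the same as target == 0, so the only real difference from the one-liner is the consumed flag.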
Regards,
Paul Elsc
Yes, the svn trunk uses skipTo more often than 1.4.3.
However, your implementation of skipTo() needs some improvement.
See the javadoc of skipTo of class Scorer:
http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Scorer.html#skipTo(int)
What's wrong with the version I sent? Remember t
May 2, 2005, at 9:05 AM, Wolfgang Hoschek wrote:
I'm looking at it right now. The tests pass fine when you put
lucene-1.4.3.jar instead of the current lucene onto the classpath
which is what I've been doing so far. Something seems to have changed
in the scoring calculation. No idea
rong with it for lucene current SVN? Should my
calculation now be done differently? If so, how?
Thanks for any clues into the right direction.
Wolfgang.
On May 2, 2005, at 9:05 AM, Wolfgang Hoschek wrote:
I'm looking at it right now. The tests pass fine when you put
lucene-1.4.3.jar instead
I'm looking at it right now. The tests pass fine when you put
lucene-1.4.3.jar instead of the current lucene onto the classpath which
is what I've been doing so far. Something seems to have changed in the
scoring calculation. No idea what that might be. I'll see if I can find
out.
Wolfgang
I've uploaded code that now runs against the current SVN, plus junit
test cases, plus some minor internal updates to the functionality
itself. For details see
http://issues.apache.org/bugzilla/show_bug.cgi?id=34585
Be prepared for the testcases to take some minutes to complete - don't
hit CTRL
Here is the first and most high-priority patch I've settled on to get
Lucene to work efficiently for the typical usage scenarios of
MemoryIndex. More patches are forthcoming if this one is received
favourably...
There's large overhead involved in forcing all IndexReader impls to
have a fin
OK. I'll send an update as soon as I get round to it...
Wolfgang.
On Apr 27, 2005, at 12:22 PM, Doug Cutting wrote:
Erik Hatcher wrote:
I'm not quite sure where to put MemoryIndex - maybe it deserves to
stand on its own in a new contrib area?
That sounds good to me.
Ok... once Wolfgang gives me
Whichever place you settle on is fine with me.
[In case it might make a difference: Just note that MemoryIndex has a
small auxiliary dependency on PatternAnalyzer in addField() because the
Analyzer superclass doesn't have a tokenStream(String fieldName, String
text) method. And PatternAnalyzer r
tream("content", "James is running round in the
woods"),
* "English"));
*
On Apr 22, 2005, at 1:53 PM, Wolfgang Hoschek wrote:
I've now got the contrib code cleaned up, tested and documented into a
decent state, ready for your review and comments.
Co
API, or any other issues.
Cheers,
Wolfgang.
On Apr 20, 2005, at 11:26 AM, Wolfgang Hoschek wrote:
On Apr 20, 2005, at 9:22 AM, Erik Hatcher wrote:
On Apr 20, 2005, at 12:11 PM, Wolfgang Hoschek wrote:
By the way, by now I have a version against 1.4.3 that is 10-100
times faster (i.e. 3 - 200
On Apr 20, 2005, at 9:22 AM, Erik Hatcher wrote:
On Apr 20, 2005, at 12:11 PM, Wolfgang Hoschek wrote:
By the way, by now I have a version against 1.4.3 that is 10-100
times faster (i.e. 3 - 20 index+query steps/sec) than the
simplistic RAMDirectory approach, depending on the nature of
eveloper to debug and find the reason in
the first place!)
Luc
-----Original Message-----
From: Erik Hatcher [mailto:[EMAIL PROTECTED]
Sent: Saturday, April 16, 2005 2:09 AM
To: java-dev@lucene.apache.org
Subject: Re: [Performance] Streaming main memory indexing of single
strings
On Apr 15, 2005,
On Apr 16, 2005, at 1:17 PM, Wolfgang Hoschek wrote:
Note that "fish*~" is not a valid query expression :)
Perhaps the Lucene QueryParser should throw an exception then.
Currently 1.4.3 accepts the expression as is without grumbling...
Several minor QueryParser weirdnesses like this h
On Apr 16, 2005, at 2:58 AM, Erik Hatcher wrote:
On Apr 15, 2005, at 9:50 PM, Wolfgang Hoschek wrote:
So, all the text analyzed is in a given field... that means that
anything in the Query not associated with that field has no bearing
on whether the text matches or not, correct?
Right, it has no
On Apr 15, 2005, at 5:55 PM, Erik Hatcher wrote:
On Apr 15, 2005, at 8:18 PM, Wolfgang Hoschek wrote:
The main issue is to enable handling arbitrary queries (anything
derived from o.a.l.search.Query). Yes, there'd be an additional
method Analyzer parameter (support any analyzer). The use
for the default field name, which is the one to implicitly be
queried...
Wolfgang.
On Apr 15, 2005, at 5:08 PM, Erik Hatcher wrote:
On Apr 15, 2005, at 6:15 PM, Wolfgang Hoschek wrote:
Cool! For my use case it would need to be able to handle arbitrary
queries (previously parsed from a general luce
nd the core would be moved into AbstractIndexReader so
projects like this would be much easier).
Robert
-----Original Message-----
From: Wolfgang Hoschek [mailto:[EMAIL PROTECTED]
Sent: Friday, April 15, 2005 5:58 PM
To: java-dev@lucene.apache.org
Subject: Re: [Performance] Streaming main memory indexi
On Apr 15, 2005, at 4:15 PM, Doug Cutting wrote:
Wolfgang Hoschek wrote:
The classic fuzzy fulltext search and similarity matching that Lucene
is good for :-)
So you need a score that can be compared to other matches? This will
be based on nothing but term frequency, which a regex can compute
On Apr 15, 2005, at 4:00 PM, Doug Cutting wrote:
Erik Hatcher wrote:
I think something like this would make a handy addition to our
contrib area at least.
Perhaps.
What use cases cannot be met by regular expression matching?
Doug
The classic fuzzy fulltext search and similarity matching that Lucen
cking and all sort of unnecessary stuff with its internal
RAMDirectory.
- Even more extreme: Don't extend Searcher but implement the
functionality directly using low-level APIs. This avoids unnecessary
baggage for collecting hits, etc.
Wolfgang.
On Apr 15, 2005, at 3:15 PM, Wolfgang Hos
Original Message-
From: Wolfgang Hoschek [mailto:[EMAIL PROTECTED]
Sent: Thursday, April 14, 2005 4:04 PM
To: java-dev@lucene.apache.org
Subject: Re: [Performance] Streaming main memory indexing of single
strings
This seems to be a promising avenue worth exploring. My gut feeling is
that thi
earches.
start again with next document.
-----Original Message-----
From: Wolfgang Hoschek [mailto:[EMAIL PROTECTED]
Sent: Thursday, April 14, 2005 2:56 PM
To: java-dev@lucene.apache.org
Subject: Re: [Performance] Streaming main memory indexing of single
strings
Otis, this might be a misunderstandi
ed.
Keep your IndexWriter open, don't close it, and optimize the index only
once you are done adding documents to it.
See the highlights and the snippets in the first hit:
http://www.lucenebook.com/search?query=when+to+optimize
Otis
--- Wolfgang Hoschek <[EMAIL PROTECTED]> wrote:
Hi,
I
Hi,
I'm wondering if anyone could let me know how to improve Lucene
performance for "streaming main memory indexing of single strings".
This would help to effectively integrate Lucene with the Nux XQuery
engine.
Below is a small microbenchmark simulating STREAMING XQuery fulltext
search as typ
((end-start) / 1000.0f));
System.out.println("queries/sec=" + (nodes /
((end-start) / 1000.0f)));
System.out.println();
}
}
}
---
Wolfgang Hosc