On Feb 23, 2007, at 10:28 AM, James Kennedy wrote:
True. However, in the case where you are processing Documents one at a
time and discarding them (e.g. we use a HitCollector to process all
documents from a search), or where memory is not an issue, it would be
nice to have the ability to disable the interning for performance's sake.
I don't know how much it would increase overall throughput in a
variety of use cases, but one approach could be to add a
copy-like-this factory method such as Field.createField(Reader) to
Field.java, analogous to the method Term.createTerm(String text) that
was added to Term.java some time ago for a similar reason.

This would guarantee that the name continues to be interned, yet it
avoids the interning overhead in use cases where a field with the same
parametrization (yet a different content String/Reader) is constructed
many times, which is probably the most common case where intern()
overhead might matter.
For example, something like

    Field f1 = ...
    Field f2 = f1.createSimilarField(Reader);

by analogy with the existing Term.createTerm:

    /**
     * Optimized construction of new Terms by reusing same field as this Term
     * - avoids field.intern() overhead
     * @param text The text of the new term (field is implicitly same as this Term instance)
     * @return A new Term
     */
    public Term createTerm(String text)
    {
        return new Term(field,text,false);
    }

Wolfgang.
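
To make the proposal above concrete, here is a minimal sketch of the
copy-like-this idea. SimpleField is a hypothetical stand-in invented for
this illustration, not Lucene's Field class, and createSimilarField is
only the name suggested in the message; no such method is claimed to
exist in Lucene. The point is that the name is interned once, in the
ordinary constructor, and later copies reuse that reference.

```java
import java.io.Reader;
import java.io.StringReader;

final class SimilarFieldSketch {

    /** Hypothetical stand-in for a field; NOT Lucene's Field class. */
    static final class SimpleField {
        final String name;   // always holds an interned String
        final Reader value;

        /** Ordinary construction: pays for one intern() call. */
        SimpleField(String name, Reader value) {
            this(name, value, true);
        }

        /** Private so callers cannot bypass interning with an arbitrary name. */
        private SimpleField(String name, Reader value, boolean intern) {
            this.name = intern ? name.intern() : name;
            this.value = value;
        }

        /** Reuses this field's already-interned name, so no intern() call is made. */
        SimpleField createSimilarField(Reader newValue) {
            return new SimpleField(name, newValue, false);
        }
    }

    public static void main(String[] args) {
        SimpleField f1 = new SimpleField(new String("body"), new StringReader("first"));
        SimpleField f2 = f1.createSimilarField(new StringReader("second"));
        System.out.println(f1.name == f2.name); // true: same interned instance
        System.out.println(f2.name == "body");  // true: string literals are interned
    }
}
```

Keeping the non-interning constructor private is what preserves the
guarantee that every name is interned: the only way to skip intern() is
to copy from a field whose name already went through it.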
Robert Engels wrote:
I don't think it is just the performance gain of equals() where
intern() matters.
It also reduces memory consumption dramatically when working with
large collections of documents in memory. Although this could also
be done with constants, there is nothing in Java to enforce it (thus
the use of intern()).
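
The memory argument above rests on plain String.intern() semantics,
which the following small sketch illustrates. The class and method names
are made up for this example; new String(...) is used only to force
distinct objects, as would happen when field names are read from an
index rather than written as literals.

```java
final class InternSharingSketch {

    /** True if the two strings resolve to one shared instance after interning. */
    static boolean sameInstanceAfterIntern(String x, String y) {
        return x.intern() == y.intern();
    }

    public static void main(String[] args) {
        // Two equal but distinct String objects.
        String a = new String("title");
        String b = new String("title");

        System.out.println(a == b);                        // false: two objects
        System.out.println(a.equals(b));                   // true: same characters
        System.out.println(sameInstanceAfterIntern(a, b)); // true: one shared copy
    }
}
```

With interning, thousands of in-memory documents holding a "title" field
all reference one String for the name instead of one copy each.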
On Feb 23, 2007, at 12:02 PM, James Kennedy wrote:
In our case, we're trying to optimize document() retrieval, and we
found that disabling the String interning in the Field constructor
improved performance dramatically. I agree that interning should be an
option on the constructor.
For document retrieval, at least for a small number of fields, the
performance gain of using equals() on interned strings is no match for
the performance loss of interning the field name of each field.
Wolfgang Hoschek-2 wrote:
I noticed that, too, but in my case the difference was often much
more extreme: it was one of the primary bottlenecks on indexing. This
is the primary reason why MemoryIndex.addField(...) navigates around
the problem by taking a parameter of type "String fieldName" instead
of type "Field":

    public void addField(String fieldName, TokenStream stream) {
        /*
         * Note that this method signature avoids having a user call new
         * o.a.l.d.Field(...) which would be much too expensive due to the
         * String.intern() usage of that class.
         */

Wolfgang.
On Feb 14, 2006, at 1:42 PM, Tatu Saloranta wrote:
After profiling in-memory indexing, I noticed that calls to
String.intern() showed up surprisingly high, especially the one from
the Field() constructor. This is understandable given the overhead
String.intern() has (it is a native, synchronized method, and the
overhead is incurred even if the String is already interned), and the
fact that this essentially gets called once per document+field
combination.

Now, it would be quite easy to improve things a bit (in theory), such
that most intern() calls could be avoided, transparently to the calling
app; for example, for each IndexWriter() one could use a simple
HashMap() for caching interned Strings. This approach is more than
twice as fast as directly calling intern(). One could also use a
per-thread cache, or a global one; all of which would probably be
faster.

However, the Field constructor hard-codes the call to intern(), so it
would be necessary to add a new constructor that indicates that the
field name is known to be interned. And there would also need to be a
way to invoke the new optional functionality.

Has anyone tried this approach to see if the speedup is worth the
hassle (in my case it'd probably be something like 2 - 3%, assuming
the profiler's 5% for intern() is accurate)?
-+ Tatu +-
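
The HashMap caching idea described above might look roughly like the
sketch below. NameCache is a hypothetical helper invented for this
illustration; it is not part of Lucene, and nothing here is wired into
IndexWriter or Field. It is deliberately not thread-safe, on the
assumption that one instance is owned by a single writer or thread. No
timing is shown, since any speedup would have to be measured on the JVM
in question.

```java
import java.util.HashMap;
import java.util.Map;

final class InternCacheSketch {

    /** Hypothetical per-writer cache of interned names; not thread-safe. */
    static final class NameCache {
        private final Map<String, String> cache = new HashMap<>();

        /** Returns the interned instance, calling String.intern() only on a miss. */
        String intern(String name) {
            String cached = cache.get(name);
            if (cached == null) {
                cached = name.intern();    // slow path: once per distinct name
                cache.put(cached, cached); // key and value are the interned instance
            }
            return cached;
        }
    }

    public static void main(String[] args) {
        NameCache cache = new NameCache();
        String first = cache.intern(new String("title"));  // miss: calls intern()
        String second = cache.intern(new String("title")); // hit: map lookup only
        System.out.println(first == second);  // true
        System.out.println(first == "title"); // true: same as the interned literal
    }
}
```

Because the cache stores only Strings that came from intern(), its
results stay identity-comparable with names interned anywhere else in
the JVM; the map just short-circuits repeated lookups for names it has
already seen.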
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
--
View this message in context: http://www.nabble.com/Field-constructor%2C-avoiding-String.intern%28%29-tf1123597.html#a9123600
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.