patch -- almost 10K
lines. I sure hope we didn't break anything :)
Michael McCandless reassigned LUCENE-1333:
Assignee: Michael McCandless
new patch with tiny changes:
* Fixed javadoc warnings
* Removed extra unnecessary comparison in ArrayUtil.getShrinkSize
I plan to commit in a day or two!
{code}
final Token reusableToken = new Token();
for (Token nextToken = stream.next(reusableToken); nextToken != null; nextToken = stream.next(reusableToken)) {
  // Do something with nextToken
}
{code}
* Improved Payload.clone() to avoid new if possible.
* Removed last remaining calls to termText() in a *.jj file.
bq. I'll probably add copyFrom(Token) as a means to initialize one token to have the same content as another. There are a couple of places that this is appropriate.
I like this, but maybe name it reinit(Token)?
bq. Also, are the reinit methods used?
I think we could use it in further places (I didn't search exhaustively).
DM, if you can pull together a patch w/ these fixes ...
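For illustration, such a reinit(Token)/copyFrom(Token) helper might look roughly like the sketch below; the method name and the exact set of copied fields are assumptions, not the committed API.
{code}
// Hypothetical sketch: re-initialize this Token with the full state of another,
// reusing this token's termBuffer where possible instead of allocating a new one.
public void reinit(Token source) {
  setTermBuffer(source.termBuffer(), 0, source.termLength()); // copies the characters
  setStartOffset(source.startOffset());
  setEndOffset(source.endOffset());
  setType(source.type());
  setPositionIncrement(source.getPositionIncrement());
  setFlags(source.getFlags());
  setPayload(source.getPayload());
}
{code}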
(termBuffer, 0, termLength);
return code;
}
Also, are the reinit methods used? If not, I'd like to work up a patch that
uses them. (And I'll include the above in it.)
I'll probably add copyFrom(Token) as a means to initialize one token to have the same content as another. There are a couple of places that this is appropriate.
The issue with clone in the old Token was that it had to do the initTermBuffer before, even though clone already knows what size the buffer should be, etc. It looks like this is fixed now, so it may be a non-issue.
This could be specified in the constructor, or a special setter could be added for that, such as disableReset().
There would not be an enableReset().
When disabled, next() would not clone, and reset() would throw an exception.
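Roughly, that proposal could look like the hypothetical sketch below (this is not SinkTokenizer's actual code; the class name, use of IOException, and other details are assumptions):
{code}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;

// Hypothetical caching stream illustrating the proposed disableReset() behavior.
public class CachingTokenStream extends TokenStream {
  private final List tokens = new ArrayList();
  private boolean resetEnabled = true;
  private int index = 0;

  public void disableReset() { resetEnabled = false; } // note: no enableReset()

  public void add(Token t) { tokens.add(t.clone()); }  // the cache always owns a copy

  public Token next(final Token reusableToken) throws IOException {
    if (index >= tokens.size()) return null;
    Token cached = (Token) tokens.get(index++);
    // With reset disabled, each cached token is handed out exactly once and can be
    // returned directly; otherwise return a clone so consumers cannot corrupt the cache.
    return resetEnabled ? (Token) cached.clone() : cached;
  }

  public void reset() throws IOException {
    if (!resetEnabled) throw new IOException("reset() has been disabled for this stream");
    index = 0;
  }
}
{code}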
...retrieved more than once.
Two personal opinions:
* Caches that don't implement reset should return cache.remove(0) [or equivalent] so it is clear that the cache can only be used once (see the sketch below).
* Caches should not be used except when they give a clear performance advantage.
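As a tiny illustration of the first point (hypothetical code; cachedTokens is an assumed List field):
{code}
// Hypothetical single-use cache: handing tokens out with remove(0) makes it
// obvious that the cached stream can only be consumed once.
public Token next(final Token reusableToken) throws IOException {
  return cachedTokens.isEmpty() ? null : (Token) cachedTokens.remove(0);
}
{code}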
bad. Does look like we improved some of the
cloning costs, though.
...the consumer that invoked step 2 might have overwritten the tokens it got from step 2. Now in step 4 the sink "thinks" it returns clones of the original tokens, unaware that the consumer in step 2 modified them.
...the Analysis process, which I would argue does not violate the reusability process.
Cloning Tokens is not cheap, as I recall. In fact, I seem to recall testing that it was cheaper to allocate a new one. Now, maybe that is fixed in this issue, but I am not sure.
...a consumer can change it, and thus the internal list changes. This is only a problem when reset is implemented, because a single token can be returned more than once from the TokenStream.
If SinkTokenizer did not implement reset, then next() would not need to create a clone.
That doesn't seem right to me. The token is already cloned in the add() method; there is no need to clone again in the next() call.
At least, that's my understanding based on trying to piece together the changes being proposed here.
er clean, all contrib tests passed!
All tests passed. Let me check the recent patch.
Michael McCandless updated LUCENE-1333:
Attachment: LUCENE-1333.patch
Woops, you're right -- new patch attached.
...and all seems correct to me.
One tiny thing in Token - in two locations there's this code:
{code}
setTermBuffer(newTermBuffer, newTermOffset, newTermLength);
setTermLength(newTermLength);
{code}
The second call is redundant since the first call already sets termLength.
& cloning
* Added a partial clone method that clones everything except the termBuffer and start/end offsets, which are replaced (sketched below); fixed NGramTokenFilter and CompoundWordTokenFilterBase to use this method.
* Fixed a bug in PrecedenceQueryParser.jj that became obvious once I ran javacc.
I think we are close!
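The partial clone mentioned in the first bullet might look roughly like this sketch (the signature is an assumption about the idea, not necessarily the committed method):
{code}
// Hypothetical sketch: clone everything (type, flags, position increment, payload)
// except the term characters and offsets, which the caller supplies.
public Token clone(char[] newTermBuffer, int newTermOffset, int newTermLength,
                   int newStartOffset, int newEndOffset) {
  Token t = (Token) this.clone();                        // full clone first
  t.setTermBuffer(newTermBuffer, newTermOffset, newTermLength);
  t.setStartOffset(newStartOffset);
  t.setEndOffset(newEndOffset);
  return t;
}
{code}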
javacc to run cleanly on PrecedenceQueryParser.jj.
I figured this out: you have to use javacc 3.2 (not 4.0 or beyond) for
contrib/misc, then "ant javacc" runs fine (with the patch from LUCENE-1353).
{quote}
... changes in their generated java files. While it would be good to fix the generation problem, you can compare the jj and java pairs to see that they match.
{quote}
I ended up doing the same.
{quote}
There's a patch for this in LUCENE-1353 ...
{quote}
and is not
resilient to future fields.
Michael McCandless updated LUCENE-1333:
Attachment: LUCENE-1333.patch
Attached new patch w/ above changes.
...& start/end offsets you've passed in? Analogous to reinit, but first making a cloned token.
bq. Wouldn't it be good to finish the canonical implementation and provide hashcode?
OK, I'll add it.
bq. Perhaps an assert in a producer is a good idea. I find it easier to debug a failed assert than a null pointer exception.
Agreed, I'll add.
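A rough sketch of such a hashCode(), hashing the same fields the proposed equals() compares (the field names follow Token's internals, but the exact formula here is an assumption):
{code}
// Hypothetical sketch of a Token.hashCode() consistent with the proposed equals().
public int hashCode() {
  initTermBuffer();
  int code = termLength;
  code = code * 31 + startOffset;
  code = code * 31 + endOffset;
  code = code * 31 + flags;
  code = code * 31 + positionIncrement;
  code = code * 31 + type.hashCode();
  code = code * 31 + (payload == null ? 0 : payload.hashCode());
  for (int i = 0; i < termLength; i++) {
    code = code * 31 + termBuffer[i];
  }
  return code;
}
{code}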
assertEquals(tokPositionIncrement, newtok.getPositionIncrement());
assertEquals(tokFlags, newtok.getFlags());
assertEquals(tokPayload, newtok.getPayload());
}
{code}
... I thought it was a bug, but no, the patch is correct.
* I find the comment in TokenFilter about implementing next() vs. next(Token) confusing. Perhaps use the same comment here as in Tokenizer.
...they match.
I agree that Token next(Token) should assume a non-null argument. Perhaps an assert in a producer is a good idea. I find it easier to debug a failed assert than a null pointer exception.
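For example, a producer (here a hypothetical TokenFilter) could state that contract with an assert at the top of next(Token); this is only an illustration of the idea:
{code}
public Token next(final Token reusableToken) throws IOException {
  // Callers must pass a non-null reusable token; an assert makes a violation fail
  // fast instead of surfacing later as a NullPointerException somewhere downstream.
  assert reusableToken != null;
  return input.next(reusableToken); // delegate to the wrapped stream
}
{code}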
The patch applies cleanly and I'm running the tests now. So far all looks good, but my PC is not that fast and it will take some more time to complete. I hope to review later tonight or tomorrow morning.
Payloads.equals -- this fixed
TestSingleTokenTokenFilter.
* Switched to the "final reusableToken" pattern.
* Added a note to the TokenStream.next javadoc stating that the parameter should never be null.
* Fixed test failures (all tests should pass now)
to see if there are any other
bugs! These equals() methods are tricky!
I can't see that bug... :-)
For PrecedenceQueryParser, is there a way to use ant to run javacc on contrib/misc?
for Token.equals() in case you didn't write that part yet
Thanks Doron; I'll merge into my version :) Both your version and my version
had bugs, so it's great you posted yours! And I hope the merged result has no
bugs :)
{code}
... src/java/org/apache/lucene/queryParser/precedence/PrecedenceQueryParser.jj . . .
org.javacc.parser.ParseException: Encountered "<<" at line 658, column 3.
Was expecting one of:
...
"<" ...
Detected 1 errors and 0 warnings.
{code}
Does anyone else hit this?
      && subEqual(payload, other.payload);
  }

  private boolean subEqual(Object o1, Object o2) {
    if (o1 == null)
      return o2 == null;
    return o1.equals(o2);
  }
{code}
{quote}
This is technically a break in backwards compatibility, but I think it's OK?
{quote}
I think so.
I feel good about this whole change; it will make the reuse chain more clear, especially once all deprecations can be removed.
bq. So definitely, Token should implement equals().
I agree. This is technically a break in backwards compatibility, but I think it's OK?
I wonder how this test ever passed before... Oh, I see it now - trunk's version of SingleTokenTokenStream never called Token.clone(), while the patched version calls it twice. So definitely, Token should implement equals().
bq. ...just need to clarify this in TokenStream.nextToken(Token)'s javadocs.
I agree; I'll do this in the next patch iteration...
...not null. Not so nice.
So yes, I agree with you; we just need to clarify this in TokenStream.nextToken(Token)'s javadocs.
Hmm, indeed I do see this too -- but this is because Token has never overridden "equals", right?
Looking into the failure of TestSingleTokenTokenFilter, I saw that if these lines are executed:
{code}
Token t1 = new Token();
Token t2 = (Token) t1.clone();
boolean isEqual = t1.equals(t2);
{code}
Surprisingly, *isEqual = false*.
Do you see this too?
...tokens probably should check and create a token if it is so. But I don't think that is what they do now.)
Actually, I think one should never pass null to the next(Token) API -- i.e., a source token stream need not check for null.
bq. ...offsets were lost in ChineseTokenizer.
OK, I'll fold this into my changes to the patch. Thanks Doron!
the test failures.
From a code-smell standpoint I'd still like to use the "single re-use" pattern when applicable. DM, I'll make this change & post a new patch.
...the offsets are lost in ChineseTokenizer. Adding this in flush(Token) will fix it:
{code}
token.setStartOffset(start);
token.setEndOffset(start+length);
{code}
Possibly true for other places where a new Token() that takes start/end offsets was replaced by clear() and setTermBuffer().
Some tests are failing, though - TestChineseTokenizer for one. I'm looking into it; the problem might be in the test.
d the token it just returned?
{code}
final Token reusableToken = new Token();
for (Token token = stream.next(reusableToken); token != null; token = stream.next(reusableToken)) {
  ...
}
{code}
{code}
for (Token next = stream.next(result); next != null; next = stream.next(result)) {
  ...
}
{code}
Since it's all together under this issue, I think we should commit this one instead of LUCENE-1350. I'll review the [massive] patch -- thanks DM!
This patch includes functionality solving LUCENE-1350. If this patch is applied before LUCENE-1350, then that issue is resolved. If it is applied after, then that patch will need to be rebuilt.
I did not do the "reuse" API mentioned in LUCENE-1350.
appears to be general purpose array manipulation.
...all of core and contrib. This patch replaces the prior two.
DM Smith updated LUCENE-1333:
Attachment: LUCENE-1333-xml-query-parser.patch (for contrib xml-query-parser)

DM Smith updated LUCENE-1333:
Attachment: LUCENE-1333-wordnet.patch (for contrib wordnet)

DM Smith updated LUCENE-1333:
Attachment: LUCENE-1333-memory.patch (for contrib memory)

DM Smith updated LUCENE-1333:
Attachment: LUCENE-1333-queries.patch (for contrib queries)

DM Smith updated LUCENE-1333:
Attachment: LUCENE-1333-miscellaneous.patch (for contrib misc)

DM Smith updated LUCENE-1333:
Attachment: LUCENE-1333-lucli.patch (for contrib lucli)

DM Smith updated LUCENE-1333:
Attachment: LUCENE-1333-wikipedia.patch (for contrib wikipedia)

DM Smith updated LUCENE-1333:
Attachment: LUCENE-1333-instantiated.patch (for contrib instantiated)

DM Smith updated LUCENE-1333:
Attachment: LUCENE-1333-highlighter.patch (for contrib highlighter)

DM Smith updated LUCENE-1333:
Attachment: LUCENE-1333-snowball.patch (for contrib snowball)

DM Smith updated LUCENE-1333:
Attachment: LUCENE-1333-analyzers.patch (for contrib analyzers)
...o.a.l.analysis but not the files contained in LUCENE-1333.patch.
DM Smith updated LUCENE-1333:
Attachment: LUCENE-1333-core.patch
This patch covers the rest of core Lucene.
replaces the prior two.
* Reordered the statements in Token.growTermBuffer to first handle the [I think most frequent] case where termBuffer is already allocated.
* Javadoc/whitespace
On 15 Jul 2008, at 11:25, Hiroaki Kawai wrote:
DM Smith <[EMAIL PROTECTED]> wrote:
In core and contrib, there are times where more than one token is used at a time. In a few places they are put into collections. So a singleton wouldn't work.
I'd suggest a factory method to get a thread-local instance.
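A minimal sketch of that factory idea (the class and method names are made up purely for illustration):
{code}
import org.apache.lucene.analysis.Token;

// Hypothetical factory: one scratch Token per thread, so callers that need a
// temporary token don't share a single global instance across threads.
public final class ScratchToken {
  private static final ThreadLocal TOKEN = new ThreadLocal() {
    protected Object initialValue() {
      return new Token();
    }
  };

  private ScratchToken() {}

  public static Token get() {
    Token t = (Token) TOKEN.get();
    t.clear(); // wipe term length, payload, flags, etc. before handing it out
    return t;
  }
}
{code}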
DM Smith <[EMAIL PROTECTED]> wrote:
> Hiroaki Kawai wrote:
> > DM Smith <[EMAIL PROTECTED]> wrote:
> >
> >> On Jul 11, 2008, at 9:42 PM, Hiroaki Kawai wrote:
> >>
> >>
> >>> Another suggestion from me:
> >>> How about making token object as an singleton?
> >>>
> >> Would that work for a multi-threaded application?
(I'd like to get a nod on whether I need to make further changes to the supplied patch before doing #4.)
4. Changing the remainder of core and contrib to remove calls to deprecated Token and TokenStream methods, i.e. to use the reuse API.
Hiroaki Kawai wrote:
DM Smith <[EMAIL PROTECTED]> wrote:
On Jul 11, 2008, at 9:42 PM, Hiroaki Kawai wrote:
Another suggestion from me:
How about making token object as an singleton?
Would that work for a multi-threaded application?
Of course. We should make that thread local singleton.
DM Smith <[EMAIL PROTECTED]> wrote:
> On Jul 11, 2008, at 9:42 PM, Hiroaki Kawai wrote:
>
> > Another suggestion from me:
> > How about making token object as an singleton?
>
> Would that work for a multi-threaded application?
Of course. We should make that thread local singleton.
Michael McCandless wrote:
But, in TokenFilter, next() should be deprecated, IMHO.
I think this is a good idea. After all, if people don't want to bother using the passed-in Token, they are still allowed to return a new one.
I'm looking into the deprecation of TokenStream.next(). I ha
This would be better, especially when we get to Java 5, with covariant
returns. Then we could also rename termBuffer() to term() (or just leave
it :).
DM
Michael McCandless wrote:
Or we could leave termText() deprecated, add term() which does the
same thing sub-optimally (ie, always creates
Or we could leave termText() deprecated, add term() which does the same thing sub-optimally (ie, always creates a new String from the char[]), and in the javadocs for termText() state that you can migrate to either term() (if you really want a String and you understand the performance cost of ...
Another suggestion from me:
How about making token object as an singleton?
> Maybe we should un-deprecate the termText() method but add javadocs
> explaining that for better performance you should use the char[] reuse
> methods instead?
>
> Mike
>
> DM Smith wrote:
>
> > Michael McCandless
On Jul 11, 2008, at 9:42 PM, Hiroaki Kawai wrote:
Another suggestion from me:
How about making token object as an singleton?
Would that work for a multi-threaded application?
Maybe we should un-deprecate the termText() method but add javadocs explaining that for better performance you should use the char[] reuse methods instead?
Michael McCandless wrote:
Maybe we should un-deprecate the termText() method but add javadocs
explaining that for better performance you should use the char[] reuse
methods instead?
I think so, too. Should we leave it as deprecated until 3.0? With the performance note and the encouragement to use the char[] reuse methods?
Maybe we should un-deprecate the termText() method but add javadocs
explaining that for better performance you should use the char[] reuse
methods instead?
Mike
DM Smith wrote:
Michael McCandless wrote:
DM Smith wrote:
Shouldn't Term have constructors that take a Token?
I think tha
Michael McCandless wrote:
DM Smith wrote:
Shouldn't Term have constructors that take a Token?
I think that makes sense, though normally Token appears during
analysis and Term during searching (I think?) -- how often would you
need to make a Term from a Token?
The problem I'm addressing
DM Smith wrote:
Shouldn't Term have constructors that take a Token?
I think that makes sense, though normally Token appears during
analysis and Term during searching (I think?) -- how often would you
need to make a Term from a Token?
Mike
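Purely as an illustration of the constructor being discussed (Term does not actually have it; the sketch just wraps the existing Term(String, String) constructor):
{code}
// Hypothetical convenience constructor building a Term from an analysis Token.
public Term(String field, Token token) {
  this(field, new String(token.termBuffer(), 0, token.termLength()));
}
{code}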
--
Token implementation needs improvements
---
Key: LUCENE-1333
URL: https://issues.apache.org/jira/browse/LUCENE-1333
Project: Lucene - Java
Issue Type: Improvement
Components: Analysis
I didn't dig into it, but in the core, are reusable Tokens per document or per DocumentWriter? If it is the former, then the copy becomes more significant. If it is the latter, then it is almost moot, as you point out.
It's actually per DocumentsWriter per thread per field.
This is excellent.
...per document or per DocumentWriter. If it is the former, then the copy becomes more significant. If it is the latter, then it is almost moot, as you point out.
It's actually per DocumentsWriter per thread per field.
This is excellent.
More below
More responses below:
DM Smith <[EMAIL PROTECTED]> wrote:
On May 20, 2008, at 12:50 AM, Hiroaki Kawai wrote:
"Michael McCandless" <[EMAIL PROTECTED]> wrote:
More responses below:
DM Smith <[EMAIL PROTECTED]> wrote:
But, in TokenFilter, next() should be deprecated, IMHO.
I think this is a good idea. After all if people don't want to
Hiroaki Kawai wrote:
"Michael McCandless" <[EMAIL PROTECTED]> wrote:
I agree the situation is not ideal, and it's confusing.
This comes back to LUCENE-969.
At the time, we decided to keep both String & char[] only to avoid
performance cost for those analyzer chains that use String tokens exclusively.
...the copy becomes more significant. If it is the latter, then it is almost moot, as you point out.
It's actually per DocumentsWriter per thread per field.
More below
More responses below:
DM Smith <[EMAIL PROTECTED]> wrote:
I think the Token implementation as it stands can use some improvement ...
"Michael McCandless" <[EMAIL PROTECTED]> wrote:
> I agree the situation is not ideal, and it's confusing.
>
> This comes back to LUCENE-969.
>
> At the time, we decided to keep both String & char[] only to avoid
> performance cost for those analyzer chains that use String tokens
> exclusively.
"Michael McCandless" <[EMAIL PROTECTED]> wrote:
> I agree the situation is not ideal, and it's confusing.
>
> This comes back to LUCENE-969.
>
> At the time, we decided to keep both String & char[] only to avoid
> performance cost for those analyzer chains that use String tokens
> exclusively.
...significant. If it is the latter, then it is almost moot, as you point out.
More below....
More responses below:
DM Smith <[EMAIL PROTECTED]> wrote:
I think the Token implementation as it stands can use some improvement and I'd be willing to do it. I'd like some input, though.
> if (tbLength > 0 && newSize > tbLength) {
>   int size = tbLength;
>   while (size < newSize) {
>     size *= 2;
>   }
>   newSize = size;
> }
>
> // Check to see if the buffer needs to be resized
> if (newSize > tbLength) {
>   termBuffer = new char[newSize];
> }
if (tbLength > 0 && newSize > tbLength) {
  int size = tbLength;
  while (size < newSize) {
    size *= 2;
  }
  newSize = size;
}
// Check to see if the buffer needs to be resized
if (newSize > tbLength) {
  termBuffer = new char[newSize];
}
}
More below
More responses below:
DM Smith <[EMAIL PROTECTED]> wrote:
The idea (keeping only char[] today) is compelling, but I worry about the performance hit to legacy analyzer chains...
More responses below:
DM Smith <[EMAIL PROTECTED]> wrote:
> I think the Token implementation as it stands can use some improvement and
> I'd be willing to do it. I'd like some input
I think the Token implementation as it stands can use some improvement and I'd be willing to do it. I'd like some input, though. Especially because it is core to Lucene.
I've been working on eliminating deprecations from my user code and I ran across Token.getText() as being deprecated.