[
https://issues.apache.org/jira/browse/LUCENE-2958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13005259#comment-13005259
]
Shai Erera commented on LUCENE-2958:
------------------------------------
bq. though we'd still have to set up the Fields properly for indexing (ie some
are stored, some are not tokenized, etc.)
I don't think that matters? I.e., LineDocSource returns DocData, it's the
DocMaker which creates the actual Lucene Field and Document instances. So all
LDS needs to know is the name of the field.
bq. If we really want to stick w/ regex then we should at least pre-compile to
a Pattern then use the split method on that?
I agree. String.split compiles a Pattern inside, so we'd better pre-compile
that pattern once and then call pattern.split().
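A minimal sketch of the pre-compiled variant, assuming a tab separator. The
class and method names are made up for illustration and are not taken from the
patch. Note that newer JDKs give String.split a fast path for single-character
literal separators, so the gain depends on the separator and the JDK in use.

```java
import java.util.regex.Pattern;

class PrecompiledSplit {
    // Assumed separator; compiled once instead of on every call.
    private static final Pattern SEP = Pattern.compile("\t");

    // Splits one line into its fields. The -1 limit keeps trailing
    // empty fields, which the default limit of 0 would drop.
    static String[] splitLine(String line) {
        return SEP.split(line, -1);
    }
}
```

The -1 limit matters if a doc can have an empty last field; without it the
returned array would be shorter than the number of fields written.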
bq. Or, can we go back to just using a char sep? We also make a String[] when
we didn't before.
That's a good point, Mike. But the alternative (splitting on char 'the old way')
is not any better either, because you don't know in advance how many fields you
expect, so you'd have to create a List or something to store them.
I guess what we should be considering is super-duper optimized code vs. a nice
readable one. String.split() is easily understood, keeps the code compact and
clear. Searching for SEP (the old code) is more complicated, especially when
you want to handle a general case. We'll be searching for SEP both ways, so the
only difference is whether an array is allocated or not.
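For comparison, here is roughly what a general char-separator scan could look
like when the number of fields is not known in advance. This is a sketch only,
assuming a tab separator and hypothetical names. It trades the String[] for a
List, which is the allocation mentioned above.

```java
import java.util.ArrayList;
import java.util.List;

class CharSepSplit {
    // Assumed separator character.
    static final char SEP = '\t';

    // Scans for SEP with indexOf and collects the fields in a List,
    // since the field count is unknown up front.
    static List<String> splitOnChar(String line) {
        List<String> fields = new ArrayList<>();
        int start = 0;
        int end;
        while ((end = line.indexOf(SEP, start)) >= 0) {
            fields.add(line.substring(start, end));
            start = end + 1;
        }
        // Whatever follows the last separator is the final field (may be empty).
        fields.add(line.substring(start));
        return fields;
    }
}
```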
Maybe instead of doing the split ourselves, we can have a getDocData(String
line), which will be implemented by default to search for TITLE, DATE and BODY,
using the optimized code, and can be overridden by others to parse line
differently? That way we don't impose any specific splitting behavior on
everyone, but we lose the potential generality of LineDocSource.
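To make the idea concrete, a rough sketch follows. SimpleDocData is a
hypothetical stand-in for DocData, and LineParser / PipeLineParser are made-up
names, not existing benchmark classes. The default implementation finds the
three fixed fields with indexOf and allocates no array; a subclass is free to
parse the line some other way.

```java
// Hypothetical stand-in for DocData, holding only the three default fields.
class SimpleDocData {
    String title;
    String date;
    String body;
}

class LineParser {
    // Assumed separator character.
    static final char SEP = '\t';

    // Default: locate TITLE, DATE and BODY with indexOf; no array is created.
    SimpleDocData getDocData(String line) {
        int k1 = line.indexOf(SEP);
        int k2 = k1 < 0 ? -1 : line.indexOf(SEP, k1 + 1);
        if (k1 < 0 || k2 < 0) {
            throw new IllegalArgumentException("invalid line format: [" + line + "]");
        }
        SimpleDocData doc = new SimpleDocData();
        doc.title = line.substring(0, k1);
        doc.date = line.substring(k1 + 1, k2);
        doc.body = line.substring(k2 + 1); // rest of the line, separators included
        return doc;
    }
}

// Example override: a user-defined format separated by '|'.
class PipeLineParser extends LineParser {
    @Override
    SimpleDocData getDocData(String line) {
        // Limit 3 keeps any further '|' characters inside the body.
        String[] parts = line.split("\\|", 3);
        if (parts.length != 3) {
            throw new IllegalArgumentException("invalid line format: [" + line + "]");
        }
        SimpleDocData doc = new SimpleDocData();
        doc.title = parts[0];
        doc.date = parts[1];
        doc.body = parts[2];
        return doc;
    }
}
```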
Is that array allocation really critical?
About writing the field names in the file -- that's a nice idea, but
complicates DocData. We'd need to change it to store a Map<String,String> (or
Properties) of name-value pairs. That will affect the performance of creating
DocData, as well as constructing a Document out of it.
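As a rough illustration of that alternative, assuming the first line of the
file lists the field names and every later line holds the values in the same
order. NamedFieldLine and its methods are hypothetical and only show the extra
Map that would be built per document.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class NamedFieldLine {
    // Assumed separator, used for both the header and the doc lines.
    static final String SEP = "\t";

    // Reads the field names from the (assumed) header line.
    static String[] parseHeader(String headerLine) {
        return headerLine.split(SEP, -1);
    }

    // Pairs each value on a doc line with the field name at the same position.
    static Map<String, String> parseLine(String[] names, String line) {
        String[] values = line.split(SEP, -1);
        if (values.length != names.length) {
            throw new IllegalArgumentException(
                "expected " + names.length + " fields but found " + values.length);
        }
        // LinkedHashMap preserves the field order from the header.
        Map<String, String> fields = new LinkedHashMap<>();
        for (int i = 0; i < names.length; i++) {
            fields.put(names[i], values[i]);
        }
        return fields;
    }
}
```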
If we're willing to sacrifice some optimization here, we can do nice things.
But if we want to insist on having the most optimized code, I don't think we
can do much ... probably the best option is to have WLDT and LDS optimized for
what they are today, and let users extend to write/read more fields. We can
pass them the 'line' and let them split it however they want.
> WriteLineDocTask improvements
> -----------------------------
>
> Key: LUCENE-2958
> URL: https://issues.apache.org/jira/browse/LUCENE-2958
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/benchmark
> Reporter: Doron Cohen
> Assignee: Doron Cohen
> Priority: Minor
> Fix For: 3.2, 4.0
>
> Attachments: LUCENE-2958.patch, LUCENE-2958.patch
>
>
> Make WriteLineDocTask and LineDocSource more flexible/extendable:
> * allow to emit lines also for empty docs (keep current behavior as default)
> * allow more/less/other fields