[jira] Commented: (LUCENE-2958) WriteLineDocTask improvements

Shai Erera (JIRA) Thu, 10 Mar 2011 11:13:22 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-2958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13005259#comment-13005259
 ]


Shai Erera commented on LUCENE-2958:
------------------------------------

bq. though we'd still have to set up the Fields properly for indexing (ie some 
are stored, some are not tokenized, etc.)

I don't think that matters? I.e., LineDocSource returns DocData, it's the 
DocMaker which creates the actual Lucene Field and Document instances. So all 
LDS needs to know is the name of the field.

bq. If we really want to stick w/ regex then we should at least pre-compile to 
a Pattern then use the split method on that?

I agree. String.split compiles a Pattern internally, so we'd better pre-compile 
that pattern once, and then call pattern.split().
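
To illustrate the pre-compilation point, here is a minimal sketch. The class name, the method name and the tab separator are assumptions made for the example, not the code in the patch:

```java
import java.util.regex.Pattern;

public class PrecompiledSplit {
    // Assumed separator for this sketch: a single tab. Compiled once, reused per line.
    private static final Pattern SEP = Pattern.compile("\t");

    /** Splits a line into fields using the pre-compiled pattern. */
    static String[] splitLine(String line) {
        // A negative limit keeps trailing empty fields instead of dropping them.
        return SEP.split(line, -1);
    }

    public static void main(String[] args) {
        String[] fields = splitLine("some title\t2011-03-10\tsome body text");
        for (String f : fields) {
            System.out.println(f);
        }
    }
}
```

The String[] is still allocated on every call; only the repeated pattern compilation goes away.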

bq. Or, can we go back to just using a char sep? We also make a String[] when 
we didn't before.

That's a good point, Mike. But the alternative (splitting on char 'the old way') 
is not any better either, because you don't know in advance how many fields you 
expect, so you'd have to create a List or something to store them.

I guess what we should be considering is super-duper optimized code vs. nice, 
readable code. String.split() is easily understood and keeps the code compact 
and clear. Searching for SEP (the old code) is more complicated, especially 
when you want to handle the general case. We'll be searching for SEP either 
way, so the only difference is whether an array is allocated or not.
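
For comparison, a rough sketch of the char-based approach for an unknown number of fields. The names and the tab separator are again assumptions for illustration only. Note that it needs a List because the field count is not known in advance:

```java
import java.util.ArrayList;
import java.util.List;

public class CharSplit {
    // Assumed separator for this sketch.
    static final char SEP = '\t';

    /** Splits on SEP by scanning with indexOf; no regex involved. */
    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        int start = 0;
        int end;
        // Walk the line, cutting one field at each separator.
        while ((end = line.indexOf(SEP, start)) >= 0) {
            fields.add(line.substring(start, end));
            start = end + 1;
        }
        // Whatever follows the last separator is the final field (possibly empty).
        fields.add(line.substring(start));
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(split("title\tdate\tbody"));
    }
}
```

So the general case trades the String[] for a List, which is why it is not obviously cheaper than split().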

Maybe instead of doing the split ourselves, we can have a getDocData(String 
line) method, which by default searches for TITLE, DATE and BODY using the 
optimized code, and can be overridden by others to parse the line 
differently? That way we don't impose any specific splitting behavior on 
everyone, but we lose the potential generality of LineDocSource.
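
Roughly what I have in mind, as a self-contained sketch. SimpleDocData, DefaultLineParser and PipeLineParser are hypothetical stand-ins invented for this example; they are not the actual DocData or LineDocSource classes, and the tab separator is an assumption:

```java
public class LineParserSketch {
    /** Hypothetical stand-in for DocData, holding only the three default fields. */
    static class SimpleDocData {
        String title;
        String date;
        String body;
    }

    /** Default behavior: fixed TITLE, DATE, BODY layout, parsed without allocating an array. */
    static class DefaultLineParser {
        static final char SEP = '\t'; // assumed separator

        SimpleDocData getDocData(String line) {
            int k1 = line.indexOf(SEP);
            int k2 = line.indexOf(SEP, k1 + 1);
            if (k1 < 0 || k2 < 0) {
                throw new IllegalArgumentException("expected 3 fields: " + line);
            }
            SimpleDocData doc = new SimpleDocData();
            doc.title = line.substring(0, k1);
            doc.date = line.substring(k1 + 1, k2);
            doc.body = line.substring(k2 + 1); // the rest of the line is the body
            return doc;
        }
    }

    /** An override that parses a different layout: pipe-separated BODY|TITLE|DATE. */
    static class PipeLineParser extends DefaultLineParser {
        @Override
        SimpleDocData getDocData(String line) {
            String[] parts = line.split("\\|", -1);
            if (parts.length != 3) {
                throw new IllegalArgumentException("expected 3 fields: " + line);
            }
            SimpleDocData doc = new SimpleDocData();
            doc.body = parts[0];
            doc.title = parts[1];
            doc.date = parts[2];
            return doc;
        }
    }

    public static void main(String[] args) {
        SimpleDocData d = new DefaultLineParser().getDocData("a title\t2011-03-10\ta body");
        System.out.println(d.title + " / " + d.date + " / " + d.body);
    }
}
```

The default path stays allocation-light, and anyone who needs more or different fields pays for their own parsing in the override.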

Is that array allocation really critical?

About writing the field names in the file -- that's a nice idea, but 
complicates DocData. We'd need to change it to store a Map<String,String> (or 
Properties) of name-value pairs. That will affect the performance of creating 
DocData, as well as constructing a Document out of it.

If we're willing to sacrifice some optimization here, we can do nice things. 
But if we want to insist on having the most optimized code, I don't think we 
can do much ... probably the best option is to have WLDT and LDS optimized for 
what they are today, and let users extend to write/read more fields. We can 
pass them the 'line' and let them split it however they want.

> WriteLineDocTask improvements
> -----------------------------
>
>                 Key: LUCENE-2958
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2958
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/benchmark
>            Reporter: Doron Cohen
>            Assignee: Doron Cohen
>            Priority: Minor
>             Fix For: 3.2, 4.0
>
>         Attachments: LUCENE-2958.patch, LUCENE-2958.patch
>
>
> Make WriteLineDocTask and LineDocSource more flexible/extendable:
> * allow to emit lines also for empty docs (keep current behavior as default)
> * allow more/less/other fields

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

