[jira] [Updated] (LUCENE-2958) WriteLineDocTask improvements

2011-03-21 Thread Doron Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen updated LUCENE-2958:


Attachment: LUCENE-2958.patch

Hmmm, while reviewing again before committing I noticed that the 
HeaderLineParser constructor never assigns FieldName.PROP in posToF. I intended 
to do this but forgot. Indeed, emma shows that Properties handling in 
LineDocSource is not tested. So I enhanced LineDocSourceTest to also test for 
nonstandard fields and for properties. The test failed as expected, and the fix 
was trivial.

Attaching updated patch, planning to commit this shortly.

 WriteLineDocTask improvements
 -

 Key: LUCENE-2958
 URL: https://issues.apache.org/jira/browse/LUCENE-2958
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark
Reporter: Doron Cohen
Assignee: Doron Cohen
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-2958.patch, LUCENE-2958.patch, LUCENE-2958.patch, 
 LUCENE-2958.patch, LUCENE-2958.patch, LUCENE-2958.patch


 Make WriteLineDocTask and LineDocSource more flexible/extendable:
 * allow to emit lines also for empty docs (keep current behavior as default)
 * allow more/less/other fields

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-2958) WriteLineDocTask improvements

2011-03-20 Thread Doron Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen updated LUCENE-2958:


Attachment: LUCENE-2958.patch

Updated patch, tests added for better coverage, and added a Changes entry.

 WriteLineDocTask improvements
 -

 Key: LUCENE-2958
 URL: https://issues.apache.org/jira/browse/LUCENE-2958
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark
Reporter: Doron Cohen
Assignee: Doron Cohen
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-2958.patch, LUCENE-2958.patch, LUCENE-2958.patch, 
 LUCENE-2958.patch, LUCENE-2958.patch


 Make WriteLineDocTask and LineDocSource more flexible/extendable:
 * allow to emit lines also for empty docs (keep current behavior as default)
 * allow more/less/other fields




[jira] Updated: (LUCENE-2958) WriteLineDocTask improvements

2011-03-13 Thread Doron Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen updated LUCENE-2958:


Attachment: LUCENE-2958.patch

bq. Will that be simpler?
It will be simpler, I admit, but it will be harder to manage:
* when re-reading the input file (with repeat=true) special treatment of the 
header line is needed. And we cannot assume that the header line exists, because 
there are 1-line files out there without this line, which I would not like to 
force recreating if that can be avoided, and it can.
* the simple LDS as it is today handles no header line. As such, if there is 
one, it will wrongly treat it as a regular line. But I would like it to be able 
to handle both old files (with no header) and new files (with the header). Hmm, 
for that we could write the header only if it differs from the default header. 
Perhaps this will work.
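
To make the old-file/new-file concern concrete, here is a minimal sketch of how a reader might decide from the first line whether a header is present. Everything in it is an assumption for illustration only: the marker string, the default field list and the class name are hypothetical and are not taken from the patch.

```java
import java.util.Arrays;

/** Sketch only: decides whether the first line of a line file is a header. */
final class HeaderDetectSketch {
    // Hypothetical marker that a writer would put in front of the field names.
    static final String HEADER_MARKER = "#FIELDS#";
    static final char SEP = '\t';
    // Hypothetical default layout assumed for old files that have no header.
    static final String[] DEFAULT_FIELDS = {"title", "date", "body"};

    /** True if the line starts with the marker followed by the separator. */
    static boolean isHeader(String firstLine) {
        return firstLine.startsWith(HEADER_MARKER + SEP);
    }

    /** Field names from the header, or a copy of the defaults when there is none. */
    static String[] fieldsFor(String firstLine) {
        if (!isHeader(firstLine)) {
            return DEFAULT_FIELDS.clone();
        }
        String rest = firstLine.substring(HEADER_MARKER.length() + 1);
        // Limit -1 keeps trailing empty field names instead of dropping them.
        return rest.split(String.valueOf(SEP), -1);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(fieldsFor("#FIELDS#\tname\tbody")));
        System.out.println(Arrays.toString(fieldsFor("A title\t2011-03-13\tsome text")));
    }
}
```

With a check like this, a file without a header is read with the default layout and its first line is still treated as a document, while on a repeated pass over a file with a header the same check tells the reader to skip that line.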

I'll take a look at that again, meanwhile attaching updated patch with the two 
inner DocDataLineReaders.

 WriteLineDocTask improvements
 -

 Key: LUCENE-2958
 URL: https://issues.apache.org/jira/browse/LUCENE-2958
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark
Reporter: Doron Cohen
Assignee: Doron Cohen
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-2958.patch, LUCENE-2958.patch, LUCENE-2958.patch, 
 LUCENE-2958.patch


 Make WriteLineDocTask and LineDocSource more flexible/extendable:
 * allow to emit lines also for empty docs (keep current behavior as default)
 * allow more/less/other fields




[jira] Updated: (LUCENE-2958) WriteLineDocTask improvements

2011-03-11 Thread Doron Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen updated LUCENE-2958:


Attachment: LUCENE-2958.patch

Hi, thanks Mike and Shai for the review and great comments.

Attaching an updated patch.

Now WriteLineDocTask writes the fields as a header line to the result file. 

It always does this - perhaps a property to disable the header will be useful 
for allowing previous behavior (no header).

There are quite a few involved changes to LineDocSource:

- replaced line.split(SEP) with the original repeated search for SEP.
- Method fillDocData(doc,fields[]) was changed to take a line String instead of 
the array of fields.
- That method was wrapped in a new interface: DocDataFiller for which there are 
now two implementations: 
-- SimpleDocDataFiller is used when there is no header line in the input file. 
It implements the original logic from before this change. This allows existing 
line-doc-files which have no header line to remain usable.
-- HeaderDocDataFiller is used when there exists a header line in the input 
file. Its implementation populates both fixed fields and flexible properties of 
DocData:
--- At construction of the filler a mapping is created from the field position 
in the header line to a setter method of docData. That mapping is not by 
reflection, nor by a HashMap - simply an int[] posToM where if posToM[3] = 1, 
later, when handling the field no. 3 in the line, the method fillDate3() will 
be called, and it will, in turn, call docData.setDate(), through a switch 
statement. If there's no mapping to a DocData setter, its properties object 
will be populated. So, this is quite general, with some performance overhead, 
though less than reflection I think (I did not measure this).
- An extension point for overriding the filler creation is through two new 
methods:
-- createDocDataFiller() for the case of no header line
-- createDocDataFiller(String[] header) when a header line is found in the input
- Note that filler creation is done once, when reading the first line of the 
input file. 
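
The position-to-setter mapping described above could look roughly like the following sketch. It is not the code from the patch: SketchDocData is a hypothetical stand-in for DocData, the field names and integer codes are made up for illustration, and only the general idea (an int[] built once from the header, then a switch per field, with unknown fields going to the properties object) follows the description.

```java
import java.util.Properties;

/** Hypothetical stand-in for DocData with only what the sketch needs. */
final class SketchDocData {
    String name, title, date, body;
    final Properties props = new Properties();
}

/** Sketch of a header-driven filler: header position -> setter code -> switch. */
final class HeaderFillerSketch {
    static final char SEP = '\t';
    // Illustrative codes; PROP means "no dedicated setter, store as a property".
    private static final int PROP = 0, NAME = 1, TITLE = 2, DATE = 3, BODY = 4;

    private final String[] header;
    private final int[] posToM; // built once, when the header line is read

    HeaderFillerSketch(String[] header) {
        this.header = header.clone();
        this.posToM = new int[header.length];
        for (int i = 0; i < header.length; i++) {
            posToM[i] = codeFor(header[i]);
        }
    }

    // The field names here are assumptions, not the benchmark's constants.
    private static int codeFor(String field) {
        switch (field) {
            case "name":  return NAME;
            case "title": return TITLE;
            case "date":  return DATE;
            case "body":  return BODY;
            default:      return PROP;
        }
    }

    /** Walks the line with repeated indexOf(SEP) calls instead of split(). */
    void fill(SketchDocData doc, String line) {
        int start = 0;
        for (int pos = 0; pos < header.length && start <= line.length(); pos++) {
            int end = line.indexOf(SEP, start);
            if (end < 0) {
                end = line.length();
            }
            String value = line.substring(start, end);
            switch (posToM[pos]) {
                case NAME:  doc.name = value; break;
                case TITLE: doc.title = value; break;
                case DATE:  doc.date = value; break;
                case BODY:  doc.body = value; break;
                default:    doc.props.setProperty(header[pos], value); break;
            }
            start = end + 1;
        }
    }

    public static void main(String[] args) {
        HeaderFillerSketch filler = new HeaderFillerSketch(new String[] {"name", "body", "author"});
        SketchDocData doc = new SketchDocData();
        filler.fill(doc, "doc1\thello world\tDoron");
        System.out.println(doc.name + " / " + doc.body + " / " + doc.props);
    }
}
```

The per-line cost is one array lookup and one switch per field, which is the reason for expecting it to be cheaper than reflection, although, as said above, this was not measured.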

Some tests were fixed to account for the existence (or absence) of a header 
line.

I think more tests are required, but you can get the idea of how this code will 
work.

Bottom line, LineDocSource is more general now, but the code became more 
complex.

I have mixed feelings about this - preferring simple code, but the added 
functionality is appealing.

 WriteLineDocTask improvements
 -

 Key: LUCENE-2958
 URL: https://issues.apache.org/jira/browse/LUCENE-2958
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark
Reporter: Doron Cohen
Assignee: Doron Cohen
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-2958.patch, LUCENE-2958.patch, LUCENE-2958.patch


 Make WriteLineDocTask and LineDocSource more flexible/extendable:
 * allow to emit lines also for empty docs (keep current behavior as default)
 * allow more/less/other fields




[jira] Updated: (LUCENE-2958) WriteLineDocTask improvements

2011-03-10 Thread Doron Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-2958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doron Cohen updated LUCENE-2958:


Attachment: LUCENE-2958.patch

Updated patch (for 3x):
- from 3x root (previous patch was from benchmark by mistake)
- fixed typos in javadoc
- simplified loop over the fields in WriteLineDocTask
- removed *volatile* but added *final* for *fields/ToWrite*. 

Without volatile one test was failing: 
TestPerfTasksLogic.testParallelDocMaker() but then I was unable to fail it 
again even after removing volatile. 

Once these fields are marked *final*, volatile is definitely not required.

But I don't understand why it was needed in the first place - ParallelTask in 
TaskSequence clones the tasks, and since WriteLineDocTask does not implement 
clone() all (parallel) tasks will have a reference to the same array... which 
in fact can be copied into a local copy by the JVM for efficiency... but since 
the clone must take place only after the constructor is done, the array is 
initialized already... If I could fail this again I would investigate it but 
now it always passes even without final/volatile. 

So keeping the final, as this is safe, but I don't like the voodooism of it and 
if anyone has a better explanation it would be appreciated.
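
For reference, the sharing described above can be reproduced in isolation with a small sketch. The class below is hypothetical and is not WriteLineDocTask; it overrides clone() only to call super.clone(), which mimics a task that inherits a shallow clone from its parent. It shows that every clone ends up pointing at the one array created in the constructor, and it does not attempt to explain the original test failure.

```java
/** Sketch only: a cloneable task-like class holding a final array field. */
final class CloneSharingSketch implements Cloneable {
    // final: assigned exactly once, inside the constructor.
    private final String[] fieldsToWrite;

    CloneSharingSketch(String... fields) {
        this.fieldsToWrite = fields.clone();
    }

    String[] fields() {
        return fieldsToWrite;
    }

    /** Shallow copy: the clone gets the same array reference, not a new array. */
    @Override
    public CloneSharingSketch clone() {
        try {
            return (CloneSharingSketch) super.clone();
        } catch (CloneNotSupportedException e) {
            throw new AssertionError(e); // cannot happen, the class is Cloneable
        }
    }

    public static void main(String[] args) {
        CloneSharingSketch original = new CloneSharingSketch("title", "date", "body");
        CloneSharingSketch copy = original.clone();
        System.out.println("same array: " + (original.fields() == copy.fields()));
    }
}
```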

 WriteLineDocTask improvements
 -

 Key: LUCENE-2958
 URL: https://issues.apache.org/jira/browse/LUCENE-2958
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/benchmark
Reporter: Doron Cohen
Assignee: Doron Cohen
Priority: Minor
 Fix For: 3.2, 4.0

 Attachments: LUCENE-2958.patch, LUCENE-2958.patch


 Make WriteLineDocTask and LineDocSource more flexible/extendable:
 * allow to emit lines also for empty docs (keep current behavior as default)
 * allow more/less/other fields
