[jira] [Updated] (LUCENE-2958) WriteLineDocTask improvements
[ https://issues.apache.org/jira/browse/LUCENE-2958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doron Cohen updated LUCENE-2958:

Attachment: LUCENE-2958.patch

Hmmm, while reviewing again before committing I noticed that the HeaderLineParser constructor never assigns FieldName.PROP in posToF. I intended to do this but forgot. Indeed, emma shows that Properties handling in LineDocSource is not tested. So I enhanced LineDocSourceTest to also test for nonstandard fields and for properties. The test fails as expected, and the fix was trivial.

Attaching updated patch, planning to commit this shortly.

WriteLineDocTask improvements

Key: LUCENE-2958
URL: https://issues.apache.org/jira/browse/LUCENE-2958
Project: Lucene - Java
Issue Type: Improvement
Components: contrib/benchmark
Reporter: Doron Cohen
Assignee: Doron Cohen
Priority: Minor
Fix For: 3.2, 4.0
Attachments: LUCENE-2958.patch, LUCENE-2958.patch, LUCENE-2958.patch, LUCENE-2958.patch, LUCENE-2958.patch, LUCENE-2958.patch

Make WriteLineDocTask and LineDocSource more flexible/extendable:
* allow to emit lines also for empty docs (keep current behavior as default)
* allow more/less/other fields

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-2958) WriteLineDocTask improvements
Doron Cohen updated LUCENE-2958:

Attachment: LUCENE-2958.patch

Updated patch, tests added for better coverage, and added a Changes entry.
[jira] Updated: (LUCENE-2958) WriteLineDocTask improvements
Doron Cohen updated LUCENE-2958:

Attachment: LUCENE-2958.patch

bq. Will that be simpler?

It will be simpler, I admit, but it will be harder to manage:
* When re-reading the input file (with repeat=true), special treatment of the header line is needed. And we cannot assume that the header line exists, because there are 1-line files out there without this line, which, if possible, I would not like to force recreating, and it is possible.
* The simple LineDocSource (LDS) as it is today handles no header line. As such, if there is one, it will wrongly treat it as a regular line. But I would like it to be able to handle both old files (with no header) and new files (with the header).

Mmmm... for that we could write the header only if it differs from the default header. Perhaps this will work. I'll take a look at that again; meanwhile attaching an updated patch with the two inner DocDataLineReader's.
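To make the trade-off in this comment concrete, here is a minimal sketch of a reader that accepts both old files (no header) and new files (with a header), and of a writer-side helper that emits the header only when the fields differ from the default. This is not the code in the patch: the class name, the HEADER_MARKER value, and the default field names and order are all assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: reads a line file that may or may not start with a header line. */
final class HeaderAwareReaderSketch {
  // Hypothetical marker that identifies a header line; not taken from the patch.
  static final String HEADER_MARKER = "#FIELDS#";
  // Assumed default field order for files written without a header.
  static final String[] DEFAULT_FIELDS = {"title", "date", "body"};

  final String[] fields;
  final List<String> docLines = new ArrayList<>();

  HeaderAwareReaderSketch(BufferedReader in) throws IOException {
    String first = in.readLine();
    if (first != null && first.startsWith(HEADER_MARKER + "\t")) {
      // New-style file: the first line names the fields and is not a document.
      fields = first.substring(HEADER_MARKER.length() + 1).split("\t");
    } else {
      // Old-style file: fall back to the defaults and keep the first line as a document.
      fields = DEFAULT_FIELDS.clone();
      if (first != null) {
        docLines.add(first);
      }
    }
    String line;
    while ((line = in.readLine()) != null) {
      docLines.add(line);
    }
  }

  /** Convenience wrapper for in-memory content. */
  static HeaderAwareReaderSketch read(String content) {
    try (BufferedReader in = new BufferedReader(new StringReader(content))) {
      return new HeaderAwareReaderSketch(in);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Writer side: returns the header line, or null when the fields equal the default. */
  static String headerLineOrNull(String[] fieldsToWrite) {
    if (Arrays.equals(fieldsToWrite, DEFAULT_FIELDS)) {
      return null;
    }
    return HEADER_MARKER + "\t" + String.join("\t", fieldsToWrite);
  }

  public static void main(String[] args) {
    HeaderAwareReaderSketch r = read("#FIELDS#\tname\tbody\nn1\tb1\n");
    System.out.println(Arrays.toString(r.fields) + " docs=" + r.docLines.size());
  }
}
```

With repeat=true the same detection would have to run each time the file is re-opened, which is the "special treatment" the comment refers to.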
[jira] Updated: (LUCENE-2958) WriteLineDocTask improvements
Doron Cohen updated LUCENE-2958:

Attachment: LUCENE-2958.patch

Hi, thanks Mike and Shai for the review and great comments.

Attaching an updated patch. Now WriteLineDocTask writes the fields as a header line to the result file. It always does this - perhaps a property to disable the header would be useful for allowing the previous behavior (no header).

There are quite a few involved changes to LineDocSource:
- Replaced line.split(SEP) with the original recurring search for SEP.
- Method fillDocData(doc,fields[]) was changed to take a line String instead of the array of fields.
- That method was wrapped in a new interface, DocDataFiller, for which there are now two implementations:
-- SimpleDocDataFiller is used when there is no header line in the input file. It implements the original logic from before this change. This allows continued use of existing line-doc-files which have no header line.
-- HeaderDocDataFiller is used when a header line exists in the input file. Its implementation populates both fixed fields and flexible properties of DocData:
--- At construction of the filler, a mapping is created from the field position in the header line to a setter method of docData. That mapping is not by reflection, nor by a HashMap - simply an int[] posToM where, if posToM[3] = 1, then later, when handling field no. 3 in the line, the method fillDate3() will be called, and it will, in turn, call docData.setDate(), through a switch statement. If there's no mapping to a DocData setter, its properties object will be populated. So this is quite general, with some performance overhead, though less than reflection, I think (I did not measure this).
- An extension point for overriding the filler creation is through two new methods:
-- createDocDataFiller() for the case of no header line
-- createDocDataFiller(String[] header) when a header line is found in the input
- Note that filler creation is done once, when reading the first line of the input file.

Some tests were fixed to account for the existence (or absence) of a header line. I think more tests are required, but you can get the idea of how this code will work.

Bottom line, LineDocSource is more general now, but the code became more complex. I have mixed feelings about this - preferring simple code, but the added functionality is appealing.
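The position-to-setter mapping described in this comment can be sketched roughly as follows. This is an illustration under stated assumptions, not the patch itself: SketchDocData is a hypothetical stand-in for DocData, the header names ("name", "title", "date", "body") and the setter codes are invented for the example, and the real filler calls setter methods rather than assigning fields directly.

```java
import java.util.Properties;

/** Hypothetical stand-in for the benchmark's DocData; only what the sketch needs. */
final class SketchDocData {
  String name;
  String title;
  String date;
  String body;
  final Properties props = new Properties();
}

/** Illustrative header-driven filler: an int[] maps a column position to a setter code. */
final class HeaderFillerSketch {
  static final char SEP = '\t';
  // Setter codes; the values are arbitrary and chosen for this example only.
  private static final int PROP = 0;
  private static final int NAME = 1;
  private static final int TITLE = 2;
  private static final int DATE = 3;
  private static final int BODY = 4;

  private final String[] header;
  private final int[] posToSetter;

  /** Builds the mapping once, from the header line's field names. */
  HeaderFillerSketch(String[] header) {
    this.header = header.clone();
    this.posToSetter = new int[header.length];
    for (int i = 0; i < header.length; i++) {
      switch (header[i]) {
        case "name":  posToSetter[i] = NAME;  break;
        case "title": posToSetter[i] = TITLE; break;
        case "date":  posToSetter[i] = DATE;  break;
        case "body":  posToSetter[i] = BODY;  break;
        default:      posToSetter[i] = PROP;  break; // unknown fields go to properties
      }
    }
  }

  /** Walks the line with a recurring search for SEP instead of line.split(SEP). */
  void fill(String line, SketchDocData doc) {
    int start = 0;
    for (int pos = 0; pos < posToSetter.length && start <= line.length(); pos++) {
      int end = line.indexOf(SEP, start);
      if (end < 0) {
        end = line.length();
      }
      set(pos, line.substring(start, end), doc);
      start = end + 1;
    }
  }

  /** Dispatches through a switch: no reflection and no HashMap lookup per field. */
  private void set(int pos, String value, SketchDocData doc) {
    switch (posToSetter[pos]) {
      case NAME:  doc.name = value;  break;
      case TITLE: doc.title = value; break;
      case DATE:  doc.date = value;  break;
      case BODY:  doc.body = value;  break;
      default:    doc.props.setProperty(header[pos], value); break;
    }
  }

  public static void main(String[] args) {
    SketchDocData doc = new SketchDocData();
    new HeaderFillerSketch(new String[] {"name", "title", "color", "body"})
        .fill("doc1\tHello\tred\tsome text", doc);
    System.out.println(doc.title + " / " + doc.props.getProperty("color"));
  }
}
```

The per-line cost is one array read and one switch per field, which is the reason the comment expects it to be cheaper than reflection; that expectation is not measured here either.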
[jira] Updated: (LUCENE-2958) WriteLineDocTask improvements
Doron Cohen updated LUCENE-2958:

Attachment: LUCENE-2958.patch

Updated patch (for 3x):
- from 3x root (previous patch was from benchmark by mistake)
- fixed typos in javadoc
- simplified loop over the fields in WriteLineDocTask
- removed *volatile* but added *final* for *fields/ToWrite*.

Without volatile one test was failing: TestPerfTasksLogic.testParallelDocMaker(), but then I was unable to make it fail again even after removing volatile. Once these fields are marked *final*, volatile is definitely not required. But I don't understand why it was needed in the first place - ParallelTask in TaskSequence clones the tasks, and since WriteLineDocTask does not implement clone(), all (parallel) tasks will have a reference to the same array... which in fact can be copied into a local copy by the JVM for efficiency... but since the clone must take place only after the constructor is done, the array is already initialized...

If I could make this fail again I would investigate it, but now it always passes even without final/volatile. So I am keeping the final, as this is safe, but I don't like the voodooism of it, and if anyone has a better explanation it would be appreciated.
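The sharing described in this comment can be reproduced in a small, self-contained sketch. It assumes, as the comment says, that tasks are cloned shallowly and that the array is fully initialized in the constructor; SharedFieldsSketch and its methods are hypothetical names, not classes from the benchmark. It does not reproduce the original test failure; it only shows that shallow clones share one array and that a final field initialized in the constructor can be read from threads started afterwards.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Illustrative task whose clones all reference the same final array. */
final class SharedFieldsSketch implements Cloneable {
  // final: per the Java memory model, readers that obtain the object after
  // construction see the array reference and the contents written in the constructor.
  private final String[] fieldsToWrite;

  SharedFieldsSketch(String... fields) {
    this.fieldsToWrite = fields.clone();
  }

  /** Shallow clone, as inherited behavior would give: the array itself is not copied. */
  @Override
  public SharedFieldsSketch clone() {
    try {
      return (SharedFieldsSketch) super.clone();
    } catch (CloneNotSupportedException e) {
      throw new AssertionError(e); // cannot happen, the class is Cloneable
    }
  }

  boolean sharesArrayWith(SharedFieldsSketch other) {
    return fieldsToWrite == other.fieldsToWrite;
  }

  int fieldCount() {
    return fieldsToWrite.length;
  }

  /** Clones the task once per thread and sums the field counts each thread observes. */
  static int readFromClones(SharedFieldsSketch task, int threads) {
    AtomicInteger total = new AtomicInteger();
    Thread[] workers = new Thread[threads];
    for (int i = 0; i < threads; i++) {
      SharedFieldsSketch copy = task.clone();
      workers[i] = new Thread(() -> total.addAndGet(copy.fieldCount()));
      workers[i].start();
    }
    try {
      for (Thread t : workers) {
        t.join();
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return total.get();
  }

  public static void main(String[] args) {
    SharedFieldsSketch task = new SharedFieldsSketch("title", "date", "body");
    System.out.println(task.sharesArrayWith(task.clone()));
    System.out.println(readFromClones(task, 4));
  }
}
```

In this sketch the clones are also created before each Thread.start() call, which by itself orders the constructor's writes before the reads in the worker threads; whether the benchmark's ParallelTask gives the same ordering is not something this example establishes.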