[Bug 61841] Unnecessary long computation when evaluating VLOOKUP on all column reference
https://bz.apache.org/bugzilla/show_bug.cgi?id=61841 --- Comment #14 from Greg Woolsey --- This was fixed in trunk for bug #62993. Not marking this as a duplicate, as the above comment should have been filed as a new issue referencing this issue. -- You are receiving this mail because: You are the assignee for the bug. - To unsubscribe, e-mail: dev-unsubscr...@poi.apache.org For additional commands, e-mail: dev-h...@poi.apache.org
[Bug 61841] Unnecessary long computation when evaluating VLOOKUP on all column reference
https://bz.apache.org/bugzilla/show_bug.cgi?id=61841 --- Comment #13 from Dominik Stadler --- Thanks for the note, but commenting on a resolved issue has a high chance of being lost, better report a new issue with a mention of the previous issue and the information about how to rerproduce your problem. -- You are receiving this mail because: You are the assignee for the bug. - To unsubscribe, e-mail: dev-unsubscr...@poi.apache.org For additional commands, e-mail: dev-h...@poi.apache.org
[Bug 61841] Unnecessary long computation when evaluating VLOOKUP on all column reference
https://bz.apache.org/bugzilla/show_bug.cgi?id=61841 Travis Burtrum changed: What|Removed |Added CC||admin.apache@moparisthebest ||.com --- Comment #12 from Travis Burtrum --- Created attachment 36497 --> https://bz.apache.org/bugzilla/attachment.cgi?id=36497&action=edit PoiFormulaChecker class demonstrating problem Here is a minimal testcase showing the bug which seems to only affect XSSF. when ran with HSSFWorkbook this is printed: evaluatedCell: org.apache.poi.ss.usermodel.CellValue [39.57] impl: HSSFWorkbook, maxRow: 4, elapsed: 209ms, 0s, 0min when ran with XSSFWorkbook: evaluatedCell: org.apache.poi.ss.usermodel.CellValue [#N/A] impl: XSSFWorkbook, maxRow: 4, elapsed: 248ms, 0s, 0min or with my patch to throw the exception instead of return null: Exception in thread "main" java.lang.RuntimeException: XSSFBUG: would shortcircuit wrongly: rowIndex: 2, _lastDefinedRow: 1 at org.apache.poi.xssf.usermodel.XSSFEvaluationSheet.getCell(XSSFEvaluationSheet.java:71) at org.apache.poi.ss.formula.OperationEvaluatorFactory.evaluate(OperationEvaluatorFactory.java:139) at org.apache.poi.ss.formula.WorkbookEvaluator.evaluateFormula(WorkbookEvaluator.java:534) at org.apache.poi.ss.formula.WorkbookEvaluator.evaluateAny(WorkbookEvaluator.java:275) at org.apache.poi.ss.formula.WorkbookEvaluator.evaluate(WorkbookEvaluator.java:216) at org.apache.poi.xssf.usermodel.BaseXSSFFormulaEvaluator.evaluateFormulaCellValue(BaseXSSFFormulaEvaluator.java:56) at org.apache.poi.ss.formula.BaseFormulaEvaluator.evaluate(BaseFormulaEvaluator.java:110) at PoiFormulaChecker.main(PoiFormulaChecker.java:47) -- You are receiving this mail because: You are the assignee for the bug. - To unsubscribe, e-mail: dev-unsubscr...@poi.apache.org For additional commands, e-mail: dev-h...@poi.apache.org
[Bug 61841] Unnecessary long computation when evaluating VLOOKUP on all column reference
https://bz.apache.org/bugzilla/show_bug.cgi?id=61841 --- Comment #11 from Travis Burtrum --- Hello all, We are upgrading from poi 3.17 to 4.0.1 and formula evaluation was broken, I used git bisect on poi and tracked it down to this line specifically: https://github.com/apache/poi/commit/6390202491c3c77a77df3a1b4f2dc205ebc5b307#diff-e3bae629e723fe5bac4ba5d2e85451cbR69 It goes back to working if I comment that line out, or change _lastDefinedRow to _xs.getLastRowNum(), I changed it to instead in that if block do this to attempt to track the problem down: throw new RuntimeException(String.format("XSSFBUG: would shortcircuit wrongly: rowIndex: %d, _lastDefinedRow: %d%n", rowIndex, _lastDefinedRow)); And the stack trace was: Caused by: java.lang.RuntimeException: XSSFBUG: would shortcircuit wrongly: rowIndex: 40, _lastDefinedRow: 39 at org.apache.poi.xssf.usermodel.XSSFEvaluationSheet.getCell(XSSFEvaluationSheet.java:72) at org.apache.poi.ss.formula.WorkbookEvaluator.evaluateReference(WorkbookEvaluator.java:782) at org.apache.poi.ss.formula.SheetRefEvaluator.getEvalForCell(SheetRefEvaluator.java:48) at org.apache.poi.ss.formula.SheetRangeEvaluator.getEvalForCell(SheetRangeEvaluator.java:74) at org.apache.poi.ss.formula.LazyAreaEval.getRelativeValue(LazyAreaEval.java:51) at org.apache.poi.ss.formula.LazyAreaEval.getRelativeValue(LazyAreaEval.java:45) at org.apache.poi.ss.formula.eval.AreaEvalBase.getValue(AreaEvalBase.java:128) at org.apache.poi.ss.formula.functions.LookupUtils$ColumnVector.getItem(LookupUtils.java:100) at org.apache.poi.ss.formula.functions.LookupUtils.lookupIndexOfExactValue(LookupUtils.java:504) at org.apache.poi.ss.formula.functions.LookupUtils.lookupIndexOfValue(LookupUtils.java:483) at org.apache.poi.ss.formula.functions.Vlookup.evaluate(Vlookup.java:61) at org.apache.poi.ss.formula.functions.Var3or4ArgFunction.evaluate(Var3or4ArgFunction.java:36) at org.apache.poi.ss.formula.OperationEvaluatorFactory.evaluate(OperationEvaluatorFactory.java:144) at org.apache.poi.ss.formula.WorkbookEvaluator.evaluateFormula(WorkbookEvaluator.java:534) at org.apache.poi.ss.formula.WorkbookEvaluator.evaluateAny(WorkbookEvaluator.java:275) at org.apache.poi.ss.formula.WorkbookEvaluator.evaluate(WorkbookEvaluator.java:216) at org.apache.poi.xssf.usermodel.BaseXSSFFormulaEvaluator.evaluateFormulaCellValue(BaseXSSFFormulaEvaluator.java:56) at org.apache.poi.ss.formula.BaseFormulaEvaluator.evaluate(BaseFormulaEvaluator.java:110) Clearly the number of rows in the spreadsheet is increasing without _lastDefinedRow being updated (as it's only set once in the constructor, and only updated on a call to clearAllCachedResultValues() which never happens). I have tried and failed to reproduce this with a small standalone example though, I was hoping maybe someone else can spot the problem. Thanks much, Travis -- You are receiving this mail because: You are the assignee for the bug. - To unsubscribe, e-mail: dev-unsubscr...@poi.apache.org For additional commands, e-mail: dev-h...@poi.apache.org
[Bug 61841] Unnecessary long computation when evaluating VLOOKUP on all column reference
https://bz.apache.org/bugzilla/show_bug.cgi?id=61841 --- Comment #10 from Javen O'Neal --- Throw in some @Override's for good measure -- You are receiving this mail because: You are the assignee for the bug. - To unsubscribe, e-mail: dev-unsubscr...@poi.apache.org For additional commands, e-mail: dev-h...@poi.apache.org
[Bug 61841] Unnecessary long computation when evaluating VLOOKUP on all column reference
https://bz.apache.org/bugzilla/show_bug.cgi?id=61841 Greg Woolsey changed: What|Removed |Added Status|NEEDINFO|RESOLVED Resolution|--- |FIXED -- You are receiving this mail because: You are the assignee for the bug. - To unsubscribe, e-mail: dev-unsubscr...@poi.apache.org For additional commands, e-mail: dev-h...@poi.apache.org
[Bug 61841] Unnecessary long computation when evaluating VLOOKUP on all column reference
https://bz.apache.org/bugzilla/show_bug.cgi?id=61841 --- Comment #9 from Greg Woolsey --- (In reply to PJ Fanning from comment #7) > Can we call the new method getLastRowNum instead of getlastRowNum? > Also, a lot of public methods have had their signatures changed. Do we need > to keep the existing signatures too and deprecate the old versions of the > methods? Argh, hate it when I miss things like that. Yes, fixed it in r1817325. The "public" methods here are either in classes marked in docs as "POI internal" or called only from methods similarly documented, so I think changing their signatures is fine. I'm not aware of anyone trying to hook into formula evaluation down at the internal WorkbookEvaluator level. I'm not opposed to duplicating and deprecating, but in this case I don't think it's needed. -- You are receiving this mail because: You are the assignee for the bug. - To unsubscribe, e-mail: dev-unsubscr...@poi.apache.org For additional commands, e-mail: dev-h...@poi.apache.org
[Bug 61841] Unnecessary long computation when evaluating VLOOKUP on all column reference
https://bz.apache.org/bugzilla/show_bug.cgi?id=61841 --- Comment #8 from Luca Martini --- (In reply to Greg Woolsey from comment #5) > Changes in r1817252 > Thank you very much Greg. > Of this remaining time, about 2/3 is taken up in the formula evaluation > caching and tracking mechanism. Bypassing it for null cells causes test > failures, which shows it is necessary, but relatively expensive. It appears > to try to optimize and minimize the "empty cell" rectangular regions it > holds. but assumes processing by row then column. That may be a memory/time > optimization we want to consider allowing additional strategies for. > > Note that this shortcut logic doesn't change the result of any methods, only > avoids busywork that didn't apply to the "nonexistent cell" cases. > > This doesn't optimize VLOOKUP directly, but is about 70% improvement > sufficient? I think so. That's more or less the fix I had in mind. > > Changing the VLOOKUP code itself is actually significantly more complex, > because POI handles sheets by row internally, and columns are second-class > constructs. There is no easy way to determine the last row with data in a > column other than iterating over all defined rows. With these > optimizations, the extra iterations should fail fast. I know, and I think that with current state of the data structure, iterating over every defined row could be worse than your current solution. Here we are still on POI 3.x, but it should not be difficult to integrate your changes in our forked version. For me the bug is considered as resolved. I still do not change its status because others have still pending comments. Best regards, Luca -- You are receiving this mail because: You are the assignee for the bug. - To unsubscribe, e-mail: dev-unsubscr...@poi.apache.org For additional commands, e-mail: dev-h...@poi.apache.org
[Bug 61841] Unnecessary long computation when evaluating VLOOKUP on all column reference
https://bz.apache.org/bugzilla/show_bug.cgi?id=61841 --- Comment #7 from PJ Fanning --- Can we call the new method getLastRowNum instead of getlastRowNum? Also, a lot of public methods have had their signatures changed. Do we need to keep the existing signatures too and deprecate the old versions of the methods? -- You are receiving this mail because: You are the assignee for the bug. - To unsubscribe, e-mail: dev-unsubscr...@poi.apache.org For additional commands, e-mail: dev-h...@poi.apache.org
[Bug 61841] Unnecessary long computation when evaluating VLOOKUP on all column reference
https://bz.apache.org/bugzilla/show_bug.cgi?id=61841 --- Comment #6 from Javen O'Neal --- (In reply to Greg Woolsey from comment #5) > 14.4s > when XSSFEvaluationSheet caches the value of getLastRowNum(), since it comes > from a TreeMap.lastKey() which has to navigate the tree each time to find > the last key. Side-note: should XSSFSheet cache the last row rather than iterating over the TreeMap for every call? Hopefully we could drop in a better Sorted map implementation, one that keeps a pointer to the first and last keys. -- You are receiving this mail because: You are the assignee for the bug. - To unsubscribe, e-mail: dev-unsubscr...@poi.apache.org For additional commands, e-mail: dev-h...@poi.apache.org
[Bug 61841] Unnecessary long computation when evaluating VLOOKUP on all column reference
https://bz.apache.org/bugzilla/show_bug.cgi?id=61841 --- Comment #5 from Greg Woolsey --- Changes in r1817252 Interesting. In a local test with the attached sample file, I found these results: 45s (second run) with current codebase issuing FormulaEvaluator.evaluateAll() on the workbook. 19s By just changing XSSFEvaluationSheet.getCell(row, col) to immediately return null if the row index > sheet.getLastRowNum() 14.4s when XSSFEvaluationSheet caches the value of getLastRowNum(), since it comes from a TreeMap.lastKey() which has to navigate the tree each time to find the last key. 12.1s after optimizing the blank cell tracking a bit to know about the last row with data. That's all without changing anything int he VLOOKUP evaluation and still iterating over the max # of rows per column. Of this remaining time, about 2/3 is taken up in the formula evaluation caching and tracking mechanism. Bypassing it for null cells causes test failures, which shows it is necessary, but relatively expensive. It appears to try to optimize and minimize the "empty cell" rectangular regions it holds. but assumes processing by row then column. That may be a memory/time optimization we want to consider allowing additional strategies for. Note that this shortcut logic doesn't change the result of any methods, only avoids busywork that didn't apply to the "nonexistent cell" cases. This doesn't optimize VLOOKUP directly, but is about 70% improvement sufficient? Changing the VLOOKUP code itself is actually significantly more complex, because POI handles sheets by row internally, and columns are second-class constructs. There is no easy way to determine the last row with data in a column other than iterating over all defined rows. With these optimizations, the extra iterations should fail fast. -- You are receiving this mail because: You are the assignee for the bug. - To unsubscribe, e-mail: dev-unsubscr...@poi.apache.org For additional commands, e-mail: dev-h...@poi.apache.org
[Bug 61841] Unnecessary long computation when evaluating VLOOKUP on all column reference
https://bz.apache.org/bugzilla/show_bug.cgi?id=61841 --- Comment #4 from Greg Woolsey --- It looks to me like the sheet is available: Vlookup.evalutate() receives a ValueEval that is converted to a TwoDEval via LookupUtils.resolveTableArrayArg(). TwoDEval interface has only one base class implementing it, AreaEvalBase. AreaEvalBase abstract class has 2 non-test implementations, cached and lazy. In both cases actual cell ValueEval references are available, which means sheet/row/column eventually. Seems to me we could augment the TwoDEval interface to indicate the last physical values for rows and columns along with the existing getWidth() and getHeight() methods. These could be used when creating ValueVector objects for the lookup iterations to reduce the vector size to the maximum defined values, since no matches would exist outside those anyway. -- You are receiving this mail because: You are the assignee for the bug. - To unsubscribe, e-mail: dev-unsubscr...@poi.apache.org For additional commands, e-mail: dev-h...@poi.apache.org
[Bug 61841] Unnecessary long computation when evaluating VLOOKUP on all column reference
https://bz.apache.org/bugzilla/show_bug.cgi?id=61841 --- Comment #3 from Javen O'Neal --- Looking at the POI implementation of Vlookup.java, the class currently doesn't have a reference to the sheet that contains the lookup table. This will make a fix non-trivial. The function that creates the AreaEval that is passed to the evaluate function might have a reference to the lookup sheet. Caveats: lookup table may be contained in a different workbook or sheet than the cell that contains the formula that is being evaluated. -- You are receiving this mail because: You are the assignee for the bug. - To unsubscribe, e-mail: dev-unsubscr...@poi.apache.org For additional commands, e-mail: dev-h...@poi.apache.org
[Bug 61841] Unnecessary long computation when evaluating VLOOKUP on all column reference
https://bz.apache.org/bugzilla/show_bug.cgi?id=61841 --- Comment #2 from Luca Martini --- Nick, I have already contributed some patches in the past. However I am not sure to have the time to work to this fix at the moment. I opened the bug for future reference and to check if someone with more knowledge about evaluators could help. In fact, at a first glance, the current workbook is not easy recoverable from the WorkbookEvaluator without changing its interface and I am not sure that is safe and sound to do so. -- You are receiving this mail because: You are the assignee for the bug. - To unsubscribe, e-mail: dev-unsubscr...@poi.apache.org For additional commands, e-mail: dev-h...@poi.apache.org
[Bug 61841] Unnecessary long computation when evaluating VLOOKUP on all column reference
https://bz.apache.org/bugzilla/show_bug.cgi?id=61841 Nick Burch changed: What|Removed |Added Status|NEW |NEEDINFO --- Comment #1 from Nick Burch --- Would you be able to put together a patch with the improved logic in? See http://poi.apache.org/guidelines.html and http://poi.apache.org/howtobuild.html and http://poi.apache.org/subversion.html for more on building and contributing! -- You are receiving this mail because: You are the assignee for the bug. - To unsubscribe, e-mail: dev-unsubscr...@poi.apache.org For additional commands, e-mail: dev-h...@poi.apache.org
[Bug 61841] Unnecessary long computation when evaluating VLOOKUP on all column reference
https://bz.apache.org/bugzilla/show_bug.cgi?id=61841 Luca Martini changed: What|Removed |Added CC||chiaramarches...@tagetik.co ||m, guar...@hotmail.com, ||samsontes...@tagetik.com OS||All -- You are receiving this mail because: You are the assignee for the bug. - To unsubscribe, e-mail: dev-unsubscr...@poi.apache.org For additional commands, e-mail: dev-h...@poi.apache.org