https://bugs.documentfoundation.org/show_bug.cgi?id=171907

--- Comment #4 from [email protected] ---
(In reply to ady from comment #3)
[snip]
> 
> At any rate, this report could probably be set as a DUPE of many others,
> perhaps adding some info about the XML "dimension" details. Alternatively,
> someone with the relevant knowledge should rather take this specific report
> and actually improve the situation.
> 
> There is possibly some reason for the (XML) "dimension" to be "too big(?)"
> when saving in Calc.

While there may be other reports that complain about the size, this is about
function.  Size appears to be a useful clue.

Again, there may well be more to this than the size.  The size is annoying. 
The result in this case is catastrophic.  It may be just size.  It may be that
there's a structural problem hiding behind it.

Automation is trying to process the worksheet.  The result when fed a Calc
sheet is that it runs out of memory and is killed.  Either or both a structural
problem or simply including area that Excel does not.

Again, Calc is writing a file with (many) cells that Excel does not.  Calc is
clearly providing a 'dimension' tag that doesn't reflect the useful area.

Asking my users to go thru a 12 step procedure for each sheet in every workbook
that they submit - every time they edit - is not realistic.  Besides the time,
it's error-prone.

I can confirm that if I manually select & delete DB1-XFD16384 IN CALC, Calc
reduces the 37MB file to 1.3M.  Do the same for A3-XFD16384, and it drops to a
mere 14K.  But these are cells that Calc ADDED.  So, maybe the issue is in
Calc's READER (vs. writer).

Absent a fix, the answer is going to be "Buy an Office license and sue Excel -
just for this application".

It's not a good answer.  Excel defines the correct behavior.  (In this case,
it's even the right behavior.)

There's no good reason for Calc to write more than Excel does.  Nor for the
dimension tag to include unwritten rows (particularly when Excel does not).

I am NOT saying that Calc's output must be bit-for-bit identical to Excel's. 
But it should be functionally compatible and reasonable.  In this case, it is
neither.

I hope that someone with the necessary expertise in Calc's output can fix this.

I expect that person has the necessary tools to investigate. 

But below are one-line reproducers - each takes a filename argument.  You need
Perl, and Spreadsheet::ParseXLSX (cpan install Spreadsheet::ParseXLSX).

Read the Excel (trimmed) Sheet: 0.5 sec:

 time perl -MSpreadsheet::ParseXLSX
-e'Spreadsheet::ParseXLSX->new->parse($ARGV[0])'
Multi-Proposal_Sample_Workbook_trimmed\ duplicates.xlsx

real 0m0.549s
user 0m0.521s
sys  0m0.027s

Tried to read the Calc version on a bigger machine: 11 min interactive before
failure, but too big and too long for a web service:

 time perl -MSpreadsheet::ParseXLSX
-e'$wbs=Spreadsheet::ParseXLSX->new->parse($ARGV[0])->worksheet(1); printf( "R:
\%u - \%u\nC \%u - \%u\n", $ws->row_range, $ws->col_range)'
Multi-Proposal_Sample_Workbook_with\ duplicates.xlsx
Can't call method "row_range" on an undefined value at -e line 1. (did not find
sheet)

real 10m45.220s
user 10m39.042s
sys  0m4.926s

Print the dimensions reported by Perl from trimmed - the sheet number is coded
as worksheet(1):

time perl -MSpreadsheet::ParseXLSX
-e'$ws=Spreadsheet::ParseXLSX->new->parse($ARGV[0])->worksheet(1); printf( "R:
\%u - \%u\nC \%u - \%u\n", $ws->row_range, $ws->col_range)'
Multi-Proposal_Sample_Workbook_trimmed\ duplicates.xlsx
R: 0 - 1082
C 0 - 8

Read the dimension tag directly from the XML file:
 unzip the .xlsx.  Then:
 sed  -e's/^.*\(<dimension\)/\1/; s/>.*/>/' xl/worksheets/sheet1.xml


Results - note that sheet 1 is consistently A1:DA2, but sheet 2 correlates with
file size :
Excel's (original)
 xl/worksheets/sheet1.xml : <dimension ref="A1:DA2"/>
 xl/worksheets/sheet2.xml : <dimension ref="A1:I1083"/>

Calc's: Multi-Proposal_Sample_WorkbookLO.xlsx
 xl/worksheets/sheet1.xml : <dimension ref="A1:DA2"/>
 xl/worksheets/sheet2.xml : <dimension ref="A1:XFD1008"/>

Calc's: Multi-Proposal_Sample_Workbook_with_duplicates.xlsx
 xl/worksheets/sheet1.xml : <dimension ref="A1:DA2"/>
 xl/worksheets/sheet2.xml : <dimension ref="A1:XFD1010"/>

Calc's: Multi-Proposal_Sample_Workbook_with_duplicates LO delete DB and A3 -
xfd16384.xlsx
 xl/worksheets/sheet1.xml : <dimension ref="A1:DA2"/>
 xl/worksheets/sheet2.xml : <dimension ref="A1:I32"/>

Excel's: excel-resaved.xlsx
 xl/worksheets/sheet1.xml : <dimension ref="A1:DA2"/>
 xl/worksheets/sheet2.xml : <dimension ref="A1:I1065"/>

Excel's (trimmed):
 xl/worksheets/sheet1.xml : <dimension ref="A1:DA2"/>
 xl/worksheets/sheet2.xml : <dimension ref="A1:I1083"/>

The rows on sheet1 are constant at 2, with the columns constant at DA (104)
The rows on sheet2 extend to ~1,000, but the columns from I (9)  to XFD
(16,384)

In no case is the actual data using as many columns as declared.
On sheet2, there are only 2 rows used, not ~1,000.

Finding the used dimensions ought to be pretty simple: for each row starting
with the maximum, start at the maximum column, and scan left until you hit a
non-blank or initialized cell.  Keep track of the maximum column.  Stop when
you nit a non-blank row.
That's the max row and column.  The min is the same.  You can do this at input
and track, or just do it at input and output.  Printing (in the absence of
printareas) can use the same info...

-- 
You are receiving this mail because:
You are the assignee for the bug.

Reply via email to