Michael Zalewski, you are a holy cleric.
(I hope I'm not jumping the gun too much there :)
Michael Zalewski wrote:
>When I wrote this comment, I thought after I posted -- This is
ridiculous.
>No sane person is going to try to rewrite BinaryTree like I suggested.
Such
>a change must surely destroy every other dependant structure, which
includes
>the file structure itself. Only a raving lunatic would actually spend
time
>on this.
>
>So I tried :)
>
>Result was memory requirements decreased by about 1/3. (My file
consists of
>only Strings, and the strings are short. I suspect most use cases would
not
>see such an improvement in memory requirements).
>
>However, time to load my 65,000 unique string workbook decreased by a
factor
>of of almost 10 (from over 5 minutes to about 30 sec). The strange
>phenomenon with the CPU going idle happened briefly for less than 3
sec, and
>only one time.
>
>Here is part of the code (to make my idea more clear)
>
> private static final class Node
> implements Map.Entry
> {
> private Comparable _dataKey; // instead of Comparable[] _data
> private Comparable _dataData;
> private Node _leftKey; // instead of Node[] _left
> private Node _leftData;
> private Node _rightKey; // instead of Node[] _right
> private Node _rightData;
> private Node _parentKey; // instead of Node[] _parent
> private Node _parentData;
> private boolean _blackKey; // instead of Boolean[] _black
> private boolean _blackData;
> private int _hashcode;
> private boolean _calculated_hashcode;
>
> /**
> * Make a new cell with given key and value, and with null
> * links, and black (true) colors.
> *
> * @param key
> * @param value
> */
>
> Node(final Comparable key, final Comparable value)
> {
> _dataKey = key; // much shorter ctor
> _dataData = value; // does not create any arrays
> }
>
>I'll put this into Bugzilla soon to start the discussion. But I should
point
>out that I have run practically no tests. Gotta find the tests first.
>
>Are there any tests?
>
>-----Original Message-----
>From: Michael Zalewski [mailto:[EMAIL PROTECTED]
>Sent: Friday, March 25, 2005 12:45 PM
>To: 'POI Users List'
>Subject: RE: HSSF cannot open files that contain many strings
>
>My own thought is that there are just too gosh darn many objects. (Gosh
darn
>many objects => gosh darn long time to process).
>
>The SST table gets deserialized into a humongous double binary tree
>structure, (org.apache.poi.util.BinaryTree) which is actually indexed by
>both the index of the string and the value of the string. So this means
that
>there are at least 10 objects created per String
>
>1) The String structure (type org.apache.poi.hssf.record.UnicodeString)
>2) The String value itself (contained as a field in type UnicodeString)
>3) The Integer value (which indexes the String). It's an Integer object
>instead of a primitive, so it can implement Comparable and be one of the
>keys in the double indexed tree structure
>4) The Node object (of the tree, which has a reference to both the
String
>value and the Integer value)
>5) One or more LabelSST records which contain an index into the tree.
>
>If you look inside org.apache.poi.util.BinaryTree, you can see that each
>node of the binary tree (there is one node for each string) contains
five
>array objects in addition to the ones I listed above.
>
>This means that my file of 65,000 unique strings will end up creating
>650,000 objects to represent those strings when deserialized. I'm
probably
>missing some objects in this analysis, so my guess is that my 65,000
string
>spreadsheet required over a million java objects.
>
>You can get rid of 5 of these objects with a simple refactoring of
>BinaryTree -- replace each of the 5 arrays with 2 fields (replace the 5
>arrays with 10 primitive fields).
>
>
>
>-----Original Message-----
>From: Danny Mui [mailto:[EMAIL PROTECTED]
>Sent: Friday, March 25, 2005 11:50 AM
>To: POI Users List
>Subject: Re: HSSF cannot open files that contain many strings
>
>I'm curious about the CPU utilization issues and why it takes so gosh
>darn long! Wonder what a profiler will say about loading a file as
>you've described.
>
>It shouldn't be too difficult to adjust the way the SST's are
>written/loaded to validate/invalidate this problem/fix.
>
>Michael Zalewski wrote:
>> Ummm...
>>
>> Yes I think I might have identified an issue with POI and a large
number
>of
>> strings. And I was looking at it partly in response to Mike's problem.
>>
>> But I don't think the issue I found is the root problem. It might
explain
>> why large files generated in POI HSSF would not open correctly in
Excel.
>In
>> fact, I couldn't find any problem with the way POI handles things. At
this
>> point, I would say that what I have identified is just a difference in
the
>> way Excel writes a file with more than 1024 strings, and the way the
same
>> file is written from POI.
>>
>> I have tried reading a 3 MB Excel file which contains 65,000 unique
>strings,
>> 130,000 BIFF records. Everything worked fine (if slowly, but 5 minutes
>> instead of 5 hours). I have a 2 Ghz Pentium laptop, with 1 GB RAM. I
did
>not
>> increase the JVM heap size (so it was 128 MB).
>>
>> I did see one thing which I don't understand. I was debugging the
>> application in Eclipse, and many times during the load, the CPU
>utilization
>> went down to nearly zero for several seconds at a time. But after 15
to 30
>> seconds, it would pick up again and run for another 15 to 30 seconds at
>> 100%. Toward the end of the run (when HSSFSheet creation was nearly
>> complete), the idle periods got longer. I am certain that the idle
>intervals
>> I observed were when the JVM was garbage collecting. I don't understand
>why
>> Windows showed 0% CPU Utilization during this time.
>>
>> -----Original Message-----
>> From: Danny Mui [mailto:[EMAIL PROTECTED]
>> Sent: Thursday, March 24, 2005 2:27 PM
>> To: POI Users List
>> Subject: Re: HSSF cannot open files that contain many strings
>>
>> Mike Z has identitifed an issue with HSSF handling a bunch of unique
>> strings (dev list). Once that is taken care of, I have a suspicion
your
>> issue will be addressed as well.
>>
>> Can you go into bugzilla and provide your excel file as a validation
>> point as well? I can't find an existing bug with this issue so it
would
>> help facilitate testing once the coding is complete.
>>
>> As for timeframe, I'll dedicate sometime in April and May as I'll be
>> trekking around Europe and need something to do while sipping coffee ;D
>>
>>
>>
>> Mike Serra wrote:
>>
>>>Hello again to the POI world,
>>> I have been having an ongoing problem with HSSF's ability to load an
>>>.xls file containing
>>>only strings. A 500kb file filled only with strings will not load, but
>>>it doesn't throw an exception or run out ram either. The process sits
>>>there taking up CPU time and slowly nibbling at system ram, and the
file
>>>might take hours to load (I haven't bothered to wait that long).
>>>
>>>In the past, I thought that POI was simply not able to load large
files,
>>>but I have since discovered that it can load enormous files, as long as
>>>they contain only numeric data. The strings are the problem. I would
>>>be very grateful if anyone has an idea what causes this.
>>>
>>>Thank you,
>>>Mike S.
>>>
>
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: [EMAIL PROTECTED]
>Mailing List: http://jakarta.apache.org/site/mail2.html#poi
>The Apache Jakarta Poi Project: http://jakarta.apache.org/poi/
>
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: [EMAIL PROTECTED]
>Mailing List: http://jakarta.apache.org/site/mail2.html#poi
>The Apache Jakarta Poi Project: http://jakarta.apache.org/poi/
>
>
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
Mailing List: http://jakarta.apache.org/site/mail2.html#poi
The Apache Jakarta Poi Project: http://jakarta.apache.org/poi/