Re: Holy COW! It worked (I think)

Mike E. Serra Fri, 25 Mar 2005 12:14:11 -0800

Michael Zalewski, you are a holy cleric.
(I hope I'm not jumping the gun too much there :)


Michael Zalewski wrote:


>When I wrote this comment, I thought after I posted -- This is ridiculous.
>No sane person is going to try to rewrite BinaryTree like I suggested. Such
>a change must surely destroy every other dependent structure, which includes
>the file structure itself. Only a raving lunatic would actually spend time
>on this.
>
>So I tried :)
>
>Result was memory requirements decreased by about 1/3. (My file consists of
>only Strings, and the strings are short. I suspect most use cases would not
>see such an improvement in memory requirements).
>
>However, time to load my 65,000 unique string workbook decreased by a factor
>of almost 10 (from over 5 minutes to about 30 sec). The strange
>phenomenon with the CPU going idle happened briefly for less than 3 sec, and
>only one time.
>
>Here is part of the code (to make my idea more clear)
>
>    private static final class Node
>        implements Map.Entry
>    {
>        private Comparable   _dataKey; // instead of Comparable[] _data
>        private Comparable   _dataData;
>        private Node         _leftKey; // instead of Node[] _left
>        private Node         _leftData;
>        private Node         _rightKey; // instead of Node[] _right
>        private Node         _rightData;
>        private Node         _parentKey; // instead of Node[] _parent
>        private Node         _parentData;
>        private boolean      _blackKey; // instead of Boolean[] _black
>        private boolean      _blackData;
>        private int          _hashcode;
>        private boolean      _calculated_hashcode;
>
>        /**
>         * Make a new cell with given key and value, and with null
>         * links, and black (true) colors.
>         *
>         * @param key
>         * @param value
>         */
>
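>        Node(final Comparable key, final Comparable value)
>        {
>            _dataKey = key;     // much shorter ctor
>            _dataData = value;  // does not create any arrays
>            _blackKey = true;   // boolean fields default to false (red),
>            _blackData = true;  // so set black explicitly, as documented
>        }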
>
>I'll put this into Bugzilla soon to start the discussion. But I should point
>out that I have run practically no tests. Gotta find the tests first.
>
>Are there any tests?
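
The rest of BinaryTree addresses each node by an index (key ordering or value ordering), so one way the paired fields quoted above could stay reachable by index is sketched below. This is an illustration only, not the code from the patch: the class name DualIndexNodeSketch, the KEY and VALUE constants, and the accessor names are assumptions, and only some of the paired fields are shown.

```java
// Hypothetical sketch: index-based accessors over paired fields instead of
// two-element arrays. Names here are illustrative, not taken from POI.
final class DualIndexNodeSketch {
    static final int KEY = 0;   // assumed index for the key ordering
    static final int VALUE = 1; // assumed index for the value ordering

    static final class Node {
        private final Comparable<?> dataKey;
        private final Comparable<?> dataData;
        private Node leftKey;
        private Node leftData;
        private boolean blackKey = true;  // new nodes start black
        private boolean blackData = true;

        Node(Comparable<?> key, Comparable<?> value) {
            dataKey = key;
            dataData = value;
        }

        // Each accessor picks one of the two fields based on the index,
        // replacing what used to be an array lookup.
        Comparable<?> getData(int index) {
            return index == KEY ? dataKey : dataData;
        }

        Node getLeft(int index) {
            return index == KEY ? leftKey : leftData;
        }

        void setLeft(Node node, int index) {
            if (index == KEY) {
                leftKey = node;
            } else {
                leftData = node;
            }
        }

        boolean isBlack(int index) {
            return index == KEY ? blackKey : blackData;
        }

        void makeRed(int index) {
            if (index == KEY) {
                blackKey = false;
            } else {
                blackData = false;
            }
        }
    }
}
```

With accessors of this shape, callers that pass an index keep working unchanged, and the only cost is a branch where there used to be an array dereference.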
>
>-----Original Message-----
>From: Michael Zalewski [mailto:[EMAIL PROTECTED] 
>Sent: Friday, March 25, 2005 12:45 PM
>To: 'POI Users List'
>Subject: RE: HSSF cannot open files that contain many strings
>
>My own thought is that there are just too gosh darn many objects. (Gosh darn
>many objects => gosh darn long time to process).
>
>The SST table gets deserialized into a humongous double binary tree
>structure, (org.apache.poi.util.BinaryTree) which is actually indexed by
>both the index of the string and the value of the string. So this means that
>there are at least 10 objects created per String:
>
>1) The String structure (type org.apache.poi.hssf.record.UnicodeString)
>2) The String value itself (contained as a field in type UnicodeString)
>3) The Integer value (which indexes the String). It's an Integer object
>instead of a primitive, so it can implement Comparable and be one of the
>keys in the double indexed tree structure
>4) The Node object (of the tree, which has a reference to both the String
>value and the Integer value)
>5) One or more LabelSST records which contain an index into the tree.
>
>If you look inside org.apache.poi.util.BinaryTree, you can see that each
>node of the binary tree (there is one node for each string) contains five
>array objects in addition to the ones I listed above.
>
>This means that my file of 65,000 unique strings will end up creating
>650,000 objects to represent those strings when deserialized. I'm probably
>missing some objects in this analysis, so my guess is that my 65,000 string
>spreadsheet required over a million Java objects.
>
>You can get rid of 5 of these objects with a simple refactoring of
>BinaryTree -- replace each of the 5 arrays with 2 fields (that is, replace
>the 5 arrays with 10 plain fields held directly in the node).
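
The object-count estimate above can be written out as a small calculation. This is a rough sketch under the message's own assumptions: 5 listed objects per string (counting a single LabelSST record) plus 5 arrays per node (_data, _left, _right, _parent, _black). The class and method names are made up for illustration.

```java
// Back-of-the-envelope object counts for the SST analysis in this thread.
// The per-string figures come from the message; the names are hypothetical.
final class SstObjectEstimate {
    // UnicodeString, its String value, the Integer index, the Node,
    // and (at least) one LabelSST record.
    static final int LISTED_OBJECTS_PER_STRING = 5;
    // _data, _left, _right, _parent, _black
    static final int ARRAYS_PER_NODE = 5;

    /** Objects per workbook when each node owns five small arrays. */
    static long objectsWithArrays(int uniqueStrings) {
        return (long) uniqueStrings * (LISTED_OBJECTS_PER_STRING + ARRAYS_PER_NODE);
    }

    /** Objects per workbook once the arrays are replaced by plain fields. */
    static long objectsWithFields(int uniqueStrings) {
        return (long) uniqueStrings * LISTED_OBJECTS_PER_STRING;
    }

    public static void main(String[] args) {
        int uniqueStrings = 65_000;
        System.out.println(objectsWithArrays(uniqueStrings)); // prints 650000
        System.out.println(objectsWithFields(uniqueStrings)); // prints 325000
    }
}
```

The count halves on these figures, but the objects differ in size, so the memory saving need not be half.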
>
>
>
>-----Original Message-----
>From: Danny Mui [mailto:[EMAIL PROTECTED] 
>Sent: Friday, March 25, 2005 11:50 AM
>To: POI Users List
>Subject: Re: HSSF cannot open files that contain many strings
>
>I'm curious about the CPU utilization issues and why it takes so gosh 
>darn long!  Wonder what a profiler will say about loading a file as 
>you've described.
>
>It shouldn't be too difficult to adjust the way the SST's are 
>written/loaded to validate/invalidate this problem/fix.
>
>Michael Zalewski wrote:
>> Ummm...
>> 
>> Yes I think I might have identified an issue with POI and a large number of
>> strings. And I was looking at it partly in response to Mike's problem.
>> 
>> But I don't think the issue I found is the root problem. It might explain
>> why large files generated in POI HSSF would not open correctly in Excel. In
>> fact, I couldn't find any problem with the way POI handles things. At this
>> point, I would say that what I have identified is just a difference in the
>> way Excel writes a file with more than 1024 strings, and the way the same
>> file is written from POI.
>> 
>> I have tried reading a 3 MB Excel file which contains 65,000 unique strings,
>> 130,000 BIFF records. Everything worked fine (if slowly, but 5 minutes
>> instead of 5 hours). I have a 2 GHz Pentium laptop, with 1 GB RAM. I did not
>> increase the JVM heap size (so it was 128 MB).
>> 
>> I did see one thing which I don't understand. I was debugging the
>> application in Eclipse, and many times during the load, the CPU utilization
>> went down to nearly zero for several seconds at a time. But after 15 to 30
>> seconds, it would pick up again and run for another 15 to 30 seconds at
>> 100%. Toward the end of the run (when HSSFSheet creation was nearly
>> complete), the idle periods got longer. I am certain that the idle intervals
>> I observed were when the JVM was garbage collecting. I don't understand why
>> Windows showed 0% CPU Utilization during this time.
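
One way to check whether pauses like these line up with garbage collection is to compare wall-clock time against the collection time the JVM itself reports. The sketch below assumes a JVM that provides java.lang.management and uses modern Java syntax. GcPauseProbe and measure are hypothetical names, and the Runnable stands in for whatever loads the workbook.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper: reports how much of a task's wall-clock time the
// JVM attributes to garbage collection.
final class GcPauseProbe {
    /** Sum of accumulated collection time (ms) over all collectors. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 when the collector does not report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    /** Runs the task and returns { wall-clock ms, GC ms } for that run. */
    static long[] measure(Runnable task) {
        long gcBefore = totalGcMillis();
        long start = System.nanoTime();
        task.run();
        long wallMillis = (System.nanoTime() - start) / 1_000_000;
        long gcMillis = totalGcMillis() - gcBefore;
        return new long[] { wallMillis, gcMillis };
    }
}
```

If the reported GC time is small compared with the idle intervals, the pauses probably have another cause.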
>> 
>> -----Original Message-----
>> From: Danny Mui [mailto:[EMAIL PROTECTED] 
>> Sent: Thursday, March 24, 2005 2:27 PM
>> To: POI Users List
>> Subject: Re: HSSF cannot open files that contain many strings
>> 
>> Mike Z has identified an issue with HSSF handling a bunch of unique 
>> strings (dev list).  Once that is taken care of, I have a suspicion your 
>> issue will be addressed as well.
>> 
>> Can you go into Bugzilla and provide your Excel file as a validation 
>> point as well? I can't find an existing bug with this issue so it would 
>> help facilitate testing once the coding is complete.
>> 
>> As for timeframe, I'll dedicate some time in April and May as I'll be 
>> trekking around Europe and need something to do while sipping coffee ;D
>> 
>> 
>> 
>> Mike Serra wrote:
>> 
>>>Hello again to the POI world,
>>>I have been having an ongoing problem with HSSF's ability to load an 
>>>.xls file containing only strings.  A 500 KB file filled only with 
>>>strings will not load, but it doesn't throw an exception or run out of 
>>>RAM either.  The process sits there taking up CPU time and slowly 
>>>nibbling at system RAM, and the file might take hours to load (I 
>>>haven't bothered to wait that long).
>>>
>>>In the past, I thought that POI was simply not able to load large files, 
>>>but I have since discovered that it can load enormous files, as long as 
>>>they contain only numeric data.  The strings are the problem.  I would 
>>>be very grateful if anyone has an idea what causes this.
>>>
>>>Thank you,
>>>Mike S.
>>>
>
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: [EMAIL PROTECTED]
>Mailing List:     http://jakarta.apache.org/site/mail2.html#poi
>The Apache Jakarta Poi Project:  http://jakarta.apache.org/poi/
>
>

