On Wed, 23 Jan 2008, Daniel Noll wrote:
> I was wondering if anyone had experimented with doing lazy parsing via the eventusermodel interface. I've had an attempt at it myself but am running into various troubles.

I did a bit, the core of which is now in svn as MissingRecordAwareHSSFListener.

> The first one which is really problematic is that once I get a FormulaRecord, I can't find a way to convert that into the formula string. Thankfully getting the value result is relatively simple.

Formulas are surprisingly tricky. They're stored as a series of ptgs, and turning them back into strings is quite hard. Then you have the fun of shared formulas, so you'll have to track all the formulas to be able to resolve those. There comes a point where you're holding so many records that you might as well just give in and use usermodel :/


If you have fairly simple formulas, then you can probably turn them into strings without needing a hssf.model.Workbook, using hssf.model.FormulaParser. However, there are some ptgs that need the workbook to turn into strings, so you might have problems with those.
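
To illustrate why the simple cases are manageable: the ptgs are stored in reverse-Polish order, so a stack is enough to rebuild an infix string. The following is only a toy sketch. It assumes plain strings as tokens rather than POI's real Ptg classes, the class name RpnToInfix is made up for this mail, and it handles binary operators only (no functions, shared formulas or references that need the workbook).

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

public class RpnToInfix {
    // Toy assumption: a token is an operator if it is one of + - * /
    static boolean isOperator(String token) {
        return token.length() == 1 && "+-*/".contains(token);
    }

    // Rebuilds an infix string from tokens given in reverse-Polish order.
    static String toFormulaString(List<String> tokens) {
        Deque<String> stack = new ArrayDeque<>();
        for (String token : tokens) {
            if (!isOperator(token)) {
                stack.push(token);              // operand: cell reference or literal
            } else {
                if (stack.size() < 2) {
                    throw new IllegalArgumentException("operator without two operands: " + token);
                }
                String right = stack.pop();     // operands come off in reverse order
                String left = stack.pop();
                stack.push("(" + left + token + right + ")");
            }
        }
        if (stack.size() != 1) {
            throw new IllegalArgumentException("malformed token sequence");
        }
        return stack.pop();
    }

    public static void main(String[] args) {
        // A1 B1 2 * +  is the stored order for  A1+B1*2
        List<String> tokens = Arrays.asList("A1", "B1", "2", "*", "+");
        System.out.println(toFormulaString(tokens)); // prints (A1+(B1*2))
    }
}
```

The sketch over-parenthesises rather than tracking operator precedence, which keeps it short. Anything beyond this (named ranges, external references, shared formulas) is where the workbook starts to be needed.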

Is your formula-related eventusermodel code in a format suitable for contributing back? It'd be handy to be able to put something in svn that would make dealing with the formula stuff much simpler. I'd be happy to spend a bit of time tidying it up / writing tests for it, if you could contribute it.


> Have the HSSF developers considered making an API halfway between usermodel and eventusermodel, which can return HSSFCell instances one at a time without instantiating the entire spreadsheet? It would be a really nice thing for saving memory.

I think there was some talk a few years back, but nothing really came of it. The problem is that it'd take a large amount of programmer time, and memory seems to be fairly cheap.

(From my perspective, I can buy a staggering amount of memory for all my production servers for a couple of days' billable rate. I suspect that holds for many of the other poi developers, so in the absence of external sponsorship, I can't see it being a great priority for anyone. Alas, I think most of us have larger poi 'itches' than memory.)


> (Although an implementation of the records which doesn't create copies of everything in memory would probably solve the memory problems almost as well.)

I'm not sure how that'd work though. If we don't hold the contents of the records in memory, then how are we going to be able to do anything with them? (Maybe I'm missing something in your suggestion though)


My hunch is that we'll have a peak use of somewhere around 3-5 times the size of the excel file in memory, except for very small files. There'll be one copy of the file in poifs, another in hssf, then each record will take a copy as it parses itself.

Does anyone have a good memory profiling tool? While I can't see us re-architecting poi any time soon (unless someone wants to sponsor it...), if there are a few quick wins then I'm sure we can sort those. If someone could spot where most of the memory does go, or any points in processing when we use very large amounts of memory for a short spell, that'd be helpful to know.
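
Short of a proper profiler, a crude first check of the 3-5 times hunch could look something like the sketch below. It uses only the JDK's own memory bean. The class name HeapProbe and its methods are invented for this mail, the byte array in main is just a stand-in for loading a real workbook, and since System.gc() is only a hint the numbers will be approximate at best.

```java
import java.lang.management.ManagementFactory;
import java.util.function.Supplier;

public class HeapProbe {
    // Heap currently in use, in bytes, as reported by the JVM.
    static long usedHeapBytes() {
        return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed();
    }

    // Rough heap growth in bytes while the loaded object is still reachable.
    static <T> long heapGrowth(Supplier<T> loader) {
        System.gc();                        // a hint only, so treat results as approximate
        long before = usedHeapBytes();
        T loaded = loader.get();
        System.gc();
        long after = usedHeapBytes();
        // Use the object after measuring so it stays reachable until this point.
        if (loaded == null) {
            throw new IllegalStateException("loader returned null");
        }
        return after - before;
    }

    // Heap growth expressed as a multiple of the file size on disk.
    static double multipleOfFileSize(long heapGrowthBytes, long fileSizeBytes) {
        return (double) heapGrowthBytes / fileSizeBytes;
    }

    public static void main(String[] args) {
        int fileSize = 8 * 1024 * 1024;     // pretend the file is 8 MB
        // Stand-in for parsing a workbook: hold three times the file size.
        long growth = heapGrowth(() -> new byte[3 * fileSize]);
        System.out.printf("heap grew by about %.1f times the file size%n",
                multipleOfFileSize(growth, fileSize));
    }
}
```

This only shows what is retained once loading has finished. It won't catch a short-lived peak during parsing, which is the part a real profiler would be needed for.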

Nick
