Re: HSSF: Middle-ground API for reading an Excel spreadsheet

Nick Burch Thu, 24 Jan 2008 07:41:55 -0800

On Wed, 23 Jan 2008, Daniel Noll wrote:

I was wondering if anyone had experimented with doing lazy parsing via the eventusermodel interface. I've had an attempt at it myself but am running into various troubles.

I did a bit, the core of which is now in svn as MissingRecordAwareHSSFListener.

The first one which is really problematic is that once I get a FormulaRecord, I can't find a way to convert that into the formula string. Thankfully getting the value result is relatively simple.

Formulas are surprisingly tricky. They're stored as a series of ptgs, and turning them back into strings is quite hard. Then you have the fun of shared formulas, so you'll have to track all the formulas to be able to resolve those. There comes a point where you're holding so many records that you might as well just give in and use usermodel :/

If you have a fairly simple formula, then you can probably turn it into a string without needing a hssf.model.Workbook, using hssf.model.FormulaParser. However, there are some ptgs that need the workbook to turn into strings, so you might have problems with those.
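To give a feel for the simple case: the ptgs sit in the record in reverse Polish order, so the basic conversion is a stack walk. The sketch below is only an illustration of that idea and does not use POI at all. The `Token` class and the `toInfix` method are hypothetical stand-ins, not POI's Ptg classes or FormulaParser, and it assumes only operands and binary operators. Real formulas also carry function calls, explicit parenthesis tokens, and references that need the workbook to resolve, which is where it gets hard.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Sketch only: Token is a hypothetical stand-in, not one of POI's Ptg classes.
final class RpnFormulaSketch {

    /** Either an operand (literal or cell reference) or a binary operator. */
    static final class Token {
        final String text;
        final boolean isOperator;

        private Token(String text, boolean isOperator) {
            this.text = text;
            this.isOperator = isOperator;
        }

        static Token operand(String text) { return new Token(text, false); }
        static Token operator(String text) { return new Token(text, true); }
    }

    /** Rebuilds a fully parenthesised infix string from tokens in reverse Polish order. */
    static String toInfix(List<Token> tokens) {
        Deque<String> stack = new ArrayDeque<>();
        for (Token token : tokens) {
            if (!token.isOperator) {
                // Operands wait on the stack until an operator consumes them.
                stack.push(token.text);
                continue;
            }
            if (stack.size() < 2) {
                throw new IllegalArgumentException(
                        "Operator " + token.text + " is missing an operand");
            }
            // The most recently pushed operand is the right-hand side.
            String right = stack.pop();
            String left = stack.pop();
            stack.push("(" + left + token.text + right + ")");
        }
        if (stack.size() != 1) {
            throw new IllegalArgumentException("Token stream does not form one expression");
        }
        return stack.pop();
    }

    public static void main(String[] args) {
        // Token order for (A1+B1)*2 in reverse Polish form: A1 B1 + 2 *
        List<Token> tokens = List.of(
                Token.operand("A1"), Token.operand("B1"), Token.operator("+"),
                Token.operand("2"), Token.operator("*"));
        System.out.println("=" + toInfix(tokens));
    }
}
```

The sketch parenthesises every operation, so it never has to reason about operator precedence. Reproducing the formula exactly as the user typed it would need the explicit parenthesis tokens that the file stores.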

Is your formula-related eventusermodel code in a format suitable for contributing back? It'd be handy to be able to put something in svn that would make dealing with the formula stuff much simpler. I'd be happy to spend a bit of time tidying it up / writing tests for it, if you could contribute it?

Have the HSSF developers considered making an API half way between usermodel and eventusermodel, which can return HSSFCell instances one at a time without instantiating the entire spreadsheet? It would be a really nice thing for saving memory.

I think there was some talk a few years back, but nothing really came of it. The problem is that it'd take a large amount of programmer time, and memory seems to be fairly cheap.

(From my perspective, I can buy a staggering amount of memory for all my production servers for a couple of days' billable rate. I suspect the same holds for many of the other poi developers, so in the absence of external sponsorship, I can't see it being a great priority for anyone. Alas, I think most of us have larger poi 'itches' than memory.)

(Although an implementation of the records which doesn't create copies of everything in memory would probably solve the memory problems almost as well.)

I'm not sure how that'd work though. If we don't hold the contents of the records in memory, then how are we going to be able to do anything with them? (Maybe I'm missing something in your suggestion though)

My hunch is that we'll have a peak use of somewhere around 3-5 times the size of the excel file in memory, except for very small files. There'll be one copy of the file in poifs, another in hssf, then each record will take a copy as it parses itself.
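As a back-of-the-envelope illustration of that hunch, the arithmetic can be written out as below. The 3x and 5x multipliers are just the guess above, not measured figures, and the class and method names are made up for this example.

```java
// Sketch only: the multipliers encode an unmeasured hunch, not profiling data.
final class PeakMemoryEstimate {

    static final int LOW_MULTIPLIER = 3;
    static final int HIGH_MULTIPLIER = 5;

    /** Returns {low, high} estimates of peak memory in bytes for a file of the given size. */
    static long[] estimatePeakBytes(long fileSizeBytes) {
        if (fileSizeBytes < 0) {
            throw new IllegalArgumentException("File size must not be negative");
        }
        // multiplyExact throws on overflow instead of silently wrapping around.
        return new long[] {
            Math.multiplyExact(fileSizeBytes, LOW_MULTIPLIER),
            Math.multiplyExact(fileSizeBytes, HIGH_MULTIPLIER)
        };
    }

    public static void main(String[] args) {
        long oneMegabyte = 1024L * 1024L;
        long[] range = estimatePeakBytes(20 * oneMegabyte);
        System.out.println("Estimated peak: " + (range[0] / oneMegabyte)
                + " MB to " + (range[1] / oneMegabyte) + " MB");
    }
}
```

For a 20 MB workbook this gives a range of 60 MB to 100 MB, which is the sort of number a profiler would need to confirm or refute.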

Does anyone have a good memory profiling tool? While I can't see us re-architecting poi any time soon (unless someone wants to sponsor it...), if there are a few quick wins then I'm sure we can sort those. If someone could spot where most of the memory does go, or any points in processing when we use very large amounts of memory for a short spell, that'd be helpful to know.


Nick
