Thanks Nick!

On Jul 26, 2013, at 11:46 AM, Nick Burch <apa...@gagravarr.org> wrote:
> On Fri, 26 Jul 2013, Mike Hugo wrote:
>> I'm looking into basic support (text extraction) for MS OneNote. I found
>> this bug https://issues.apache.org/bugzilla/show_bug.cgi?id=50750 that has
>> some sample files attached. Does anyone have any pointers as to where I
>> should get started?
>
> Use POIFSLister to work out if they have a single POIFS/OLE2 stream or
> multiple. If loads, assume it's like Outlook (HSMF), use POIFSDump to look at
> the parts. If one, use POIFSViewer and docs and try to work out if it's
> streams of records (eg HSSF), nested records (HSLF, DDF), or streams (HWPF).
>
> Once you know that, try to do something to do a basic processing of the file
> structure. Then add some .dev. tools to print the structure (look at visio,
> outlook etc for an idea of how we've done that). Use your own dev tool to
> play with the structure more. Finally, flesh out the implementation to cover
> all the key bits, and write lots of unit tests!
>
> Nick
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@poi.apache.org
> For additional commands, e-mail: dev-h...@poi.apache.org
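
A note on the first step in the quoted advice: POIFSLister is only useful if the sample files are OLE2 containers to begin with. Below is a minimal sketch of that pre-check. It assumes nothing beyond the JDK and makes no POI calls; the class name Ole2Sniffer and its methods are made up for illustration. It simply compares the first eight bytes of a file against the OLE2 compound document signature.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, not part of POI: checks for the OLE2 signature.
final class Ole2Sniffer {
    // The 8-byte signature that opens every OLE2 compound document.
    private static final byte[] OLE2_SIGNATURE = {
        (byte) 0xD0, (byte) 0xCF, (byte) 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, (byte) 0x1A, (byte) 0xE1
    };

    // True if the given bytes start with the OLE2 signature.
    static boolean hasOle2Signature(byte[] header) {
        if (header == null || header.length < OLE2_SIGNATURE.length) {
            return false;
        }
        byte[] prefix = Arrays.copyOf(header, OLE2_SIGNATURE.length);
        return Arrays.equals(prefix, OLE2_SIGNATURE);
    }

    // Reads the first eight bytes of the file and checks them.
    static boolean isOle2(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] header = new byte[OLE2_SIGNATURE.length];
            int read = 0;
            while (read < header.length) {
                int n = in.read(header, read, header.length - read);
                if (n < 0) {
                    break; // file is shorter than the signature
                }
                read += n;
            }
            return read == header.length && hasOle2Signature(header);
        }
    }

    public static void main(String[] args) throws IOException {
        for (String arg : args) {
            Path p = Paths.get(arg);
            String verdict = isOle2(p) ? "OLE2, try POIFSLister next" : "not OLE2";
            System.out.println(p + ": " + verdict);
        }
    }
}
```

Running it over the sample files attached to the bug would show, per file, whether the POIFS tools are worth trying at all. A file that fails the check needs its container format identified some other way first.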