Re: Using StAX parser for .xlsx files
thanks a lot to all comments. I updated the hadoopoffice library to include also Stax parsing. In a later stage I will also add reading comments. On Fri, May 18, 2018 at 11:37 PM Jörn Franke wrote: > Hi, > > I wonder if someone has already used successfully the StAX parser with > .xlsx files (ie instead of an event driven push model a pull model, cf. > https://docs.oracle.com/javase/tutorial/jaxp/stax/why.html). > > Reason that I ask is that on Big Data platforms (for which we implemented > the HadoopOffice library powered by Apache POI, cf. > https://github.com/ZuInnoTe/hadoopoffice) the event driven model causes a > lot of memory overhead, because virtually all Big Data platforms implement > a pull model, which means if I use the push model provided by the event API > then I need to load the full content in memory to make it available as a > pull model, since those platforms are not event driven. > > I found the StaxHelper class, but I have little idea how it can be used > within Apache POI: > https://poi.apache.org/apidocs/org/apache/poi/util/StaxHelper.html > > > The main goal is to have a light weight approach as proposed by the > current POI event push model for reading .xlsx files ( > https://poi.apache.org/spreadsheet/how-to.html#xssf_sax_api) , but in > form of a pull model, e.g. as illustrated in the following pseudo code: > > XMLInputFactory xmlif; > > XSSFReader.SheetIterator iter; > InputStream currentInputStream; > > XMLStreamReader xmlr; > > > /* function called once at the start of processing */ > public void init() { > r = new XSSFReader( pkg ); > ReadOnlySharedStringsTable strings = new ReadOnlySharedStringsTable(pkg); > // inputstream sheet iterator > iter = (XSSFReader.SheetIterator)r.getSheetsData(); > // XML factory to create Stax Parser > > xmlif = XMLInputFactory.newInstance(); > > } > > > /** the following method is called by the Big Data platform (Flink, Spark, > Hadoop, Hive etc.) do get the next row without reading the full file in > memory as in the DOM or SAX (Push) model > returns null if no further record exist, otherwise sets the current cell > content as String > **/ > public String getNextRow() { > // check if we have data to read > if ((xmlr==null) || xmlr.hasNext()==false) { > if (!iter.hasNext()) { > return null; > } else { > // read sheet into stax parser > > xmlr=xmlif.createXMLStreamReader(iter.next()); > > } > // read the data from stax > // > return xmlr.getText(); // just an example, in fact i need to check for a c > tag, check its type, if it is string then check the string table, otherwise > convert the numeric to a data or indeed a number > } > > Any pitfalls with this approach? I assume I need to link the sharedstring > table somehow to the cell tag (I can derive probably from the source code > of the event API how to do this). > Formula evaluation is not of so much importance for this currently (the > HadoopOffice library offers for formula evaluation to load the full file in > memory). > Of course still the full sharedstringtable needs to be loaded in memory, > but I expect for the data to have a very small sharedstring table and/or > use of numerics. > > Has anyone used the StAX API together with .xlsx files with POI? > > > Thank you. > > best regards >
Re: Using StAX parser for .xlsx files
https://github.com/monitorjbl/excel-streaming-reader is probably a better bet -- Sent from: http://apache-poi.1045710.n5.nabble.com/POI-Dev-f2312866.html - To unsubscribe, e-mail: dev-unsubscr...@poi.apache.org For additional commands, e-mail: dev-h...@poi.apache.org
Re: Using StAX parser for .xlsx files
Maybe https://github.com/davidpelfree/sjxlsx is what you need if you don't want to use CSV as an intermediate format. -- Sent from: http://apache-poi.1045710.n5.nabble.com/POI-Dev-f2312866.html - To unsubscribe, e-mail: dev-unsubscr...@poi.apache.org For additional commands, e-mail: dev-h...@poi.apache.org
Re: Using StAX parser for .xlsx files
I would suggest converting the xlsx sheets to CSV and importing into Spark using its built-in CSV data source. -- Sent from: http://apache-poi.1045710.n5.nabble.com/POI-Dev-f2312866.html - To unsubscribe, e-mail: dev-unsubscr...@poi.apache.org For additional commands, e-mail: dev-h...@poi.apache.org
Using StAX parser for .xlsx files
Hi, I wonder if someone has already used successfully the StAX parser with .xlsx files (ie instead of an event driven push model a pull model, cf. https://docs.oracle.com/javase/tutorial/jaxp/stax/why.html). Reason that I ask is that on Big Data platforms (for which we implemented the HadoopOffice library powered by Apache POI, cf. https://github.com/ZuInnoTe/hadoopoffice) the event driven model causes a lot of memory overhead, because virtually all Big Data platforms implement a pull model, which means if I use the push model provided by the event API then I need to load the full content in memory to make it available as a pull model, since those platforms are not event driven. I found the StaxHelper class, but I have little idea how it can be used within Apache POI: https://poi.apache.org/apidocs/org/apache/poi/util/StaxHelper.html The main goal is to have a light weight approach as proposed by the current POI event push model for reading .xlsx files ( https://poi.apache.org/spreadsheet/how-to.html#xssf_sax_api) , but in form of a pull model, e.g. as illustrated in the following pseudo code: XMLInputFactory xmlif; XSSFReader.SheetIterator iter; InputStream currentInputStream; XMLStreamReader xmlr; /* function called once at the start of processing */ public void init() { r = new XSSFReader( pkg ); ReadOnlySharedStringsTable strings = new ReadOnlySharedStringsTable(pkg); // inputstream sheet iterator iter = (XSSFReader.SheetIterator)r.getSheetsData(); // XML factory to create Stax Parser xmlif = XMLInputFactory.newInstance(); } /** the following method is called by the Big Data platform (Flink, Spark, Hadoop, Hive etc.) do get the next row without reading the full file in memory as in the DOM or SAX (Push) model returns null if no further record exist, otherwise sets the current cell content as String **/ public String getNextRow() { // check if we have data to read if ((xmlr==null) || xmlr.hasNext()==false) { if (!iter.hasNext()) { return null; } else { // read sheet into stax parser xmlr=xmlif.createXMLStreamReader(iter.next()); } // read the data from stax // return xmlr.getText(); // just an example, in fact i need to check for a c tag, check its type, if it is string then check the string table, otherwise convert the numeric to a data or indeed a number } Any pitfalls with this approach? I assume I need to link the sharedstring table somehow to the cell tag (I can derive probably from the source code of the event API how to do this). Formula evaluation is not of so much importance for this currently (the HadoopOffice library offers for formula evaluation to load the full file in memory). Of course still the full sharedstringtable needs to be loaded in memory, but I expect for the data to have a very small sharedstring table and/or use of numerics. Has anyone used the StAX API together with .xlsx files with POI? Thank you. best regards