Re: Using StAX parser for .xlsx files

2018-10-19 Thread Jörn Franke
Thanks a lot for all the comments. I updated the HadoopOffice library to also
include StAX parsing. At a later stage I will also add reading of comments.



Re: Using StAX parser for .xlsx files

2018-05-18 Thread pj.fanning
https://github.com/monitorjbl/excel-streaming-reader is probably a better bet
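For reference, a hedged sketch of how that library is typically used. This is not runnable standalone: it assumes the `com.monitorjbl:xlsx-streaming-reader` dependency on the classpath and follows the builder API described in the project's README; the file name is hypothetical.

```java
import com.monitorjbl.xlsx.StreamingReader;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;

import java.io.FileInputStream;
import java.io.InputStream;

public class StreamingReadSketch {
    public static void main(String[] args) throws Exception {
        try (InputStream is = new FileInputStream("large.xlsx");  // hypothetical file
             Workbook workbook = StreamingReader.builder()
                     .rowCacheSize(100)   // number of rows kept in memory at a time
                     .bufferSize(4096)    // buffer size for reading the InputStream
                     .open(is)) {
            // Rows are streamed from disk instead of being held in memory at once
            for (Sheet sheet : workbook) {
                for (Row row : sheet) {
                    for (Cell cell : row) {
                        System.out.println(cell.getStringCellValue());
                    }
                }
            }
        }
    }
}
```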



--
Sent from: http://apache-poi.1045710.n5.nabble.com/POI-Dev-f2312866.html

-
To unsubscribe, e-mail: dev-unsubscr...@poi.apache.org
For additional commands, e-mail: dev-h...@poi.apache.org



Re: Using StAX parser for .xlsx files

2018-05-18 Thread pj.fanning
Maybe https://github.com/davidpelfree/sjxlsx is what you need if you don't
want to use CSV as an intermediate format.






Re: Using StAX parser for .xlsx files

2018-05-18 Thread pj.fanning
I would suggest converting the xlsx sheets to CSV and importing into Spark
using its built-in CSV data source.
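A hedged sketch of that second step, assuming the sheets have already been exported to CSV. The path is hypothetical and `spark-sql` must be on the classpath, so this is not runnable standalone.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CsvIntoSpark {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("xlsx-as-csv")
                .master("local[*]")   // local mode for illustration only
                .getOrCreate();
        // Spark's built-in CSV data source; header and schema inference are optional
        Dataset<Row> df = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("/path/to/exported-sheet.csv");  // hypothetical path
        df.show();
        spark.stop();
    }
}
```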






Using StAX parser for .xlsx files

2018-05-18 Thread Jörn Franke
Hi,

I wonder whether anyone has already used the StAX parser successfully with
.xlsx files (i.e. a pull model instead of an event-driven push model, cf.
https://docs.oracle.com/javase/tutorial/jaxp/stax/why.html).

The reason I ask is that on Big Data platforms (for which we implemented the
HadoopOffice library, powered by Apache POI, cf.
https://github.com/ZuInnoTe/hadoopoffice) the event-driven model causes a lot
of memory overhead. Virtually all Big Data platforms implement a pull model,
so if I use the push model provided by the event API, I need to load the full
content into memory to make it available as a pull model, since those
platforms are not event driven.
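The pull model those platforms expect can be demonstrated with plain JDK StAX (javax.xml.stream, no POI involved): the caller asks the parser for the next event instead of receiving callbacks. A minimal sketch, using a hypothetical row-like XML fragment in place of real sheet XML:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StaxPullDemo {
    // Collect all text nodes from an XML fragment; the caller drives the parser,
    // pulling one event at a time -- nothing is pushed via callbacks.
    static List<String> pullTextValues(String xml) throws Exception {
        XMLInputFactory xmlif = XMLInputFactory.newInstance();
        XMLStreamReader xmlr = xmlif.createXMLStreamReader(new StringReader(xml));
        List<String> values = new ArrayList<>();
        while (xmlr.hasNext()) {
            // the pull step: we decide when to advance to the next event
            if (xmlr.next() == XMLStreamConstants.CHARACTERS) {
                values.add(xmlr.getText());
            }
        }
        xmlr.close();
        return values;
    }

    public static void main(String[] args) throws Exception {
        // a tiny stand-in for a worksheet row (hypothetical content)
        System.out.println(pullTextValues("<row><c><v>1</v></c><c><v>2</v></c></row>"));
    }
}
```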

I found the StaxHelper class, but I have little idea how it can be used
within Apache POI:
https://poi.apache.org/apidocs/org/apache/poi/util/StaxHelper.html


The main goal is to have a lightweight approach, like the current POI
event-driven push model for reading .xlsx files
(https://poi.apache.org/spreadsheet/how-to.html#xssf_sax_api), but in the form
of a pull model, e.g. as illustrated in the following pseudo code:

XMLInputFactory xmlif;
XSSFReader.SheetIterator iter;
XMLStreamReader xmlr;
ReadOnlySharedStringsTable strings;

/* called once at the start of processing;
   pkg is an OPCPackage opened elsewhere */
public void init() throws Exception {
    XSSFReader r = new XSSFReader(pkg);
    strings = new ReadOnlySharedStringsTable(pkg);
    // iterator over the sheet input streams
    iter = (XSSFReader.SheetIterator) r.getSheetsData();
    // XML factory to create the StAX parser
    xmlif = XMLInputFactory.newInstance();
}

/**
 * Called by the Big Data platform (Flink, Spark, Hadoop, Hive etc.) to get
 * the next row without reading the full file into memory as in the DOM or
 * SAX (push) model. Returns null if no further record exists, otherwise
 * returns the current cell content as a String.
 */
public String getNextRow() throws Exception {
    // check whether we still have data to read in the current sheet
    if ((xmlr == null) || !xmlr.hasNext()) {
        if (!iter.hasNext()) {
            return null;
        }
        // open the next sheet in the StAX parser
        xmlr = xmlif.createXMLStreamReader(iter.next());
    }
    // pull the next event from StAX
    xmlr.next();
    return xmlr.getText(); // just an example; in fact I need to check for a
    // c tag, check its type, look up the shared strings table if it is a
    // string, and otherwise convert the numeric value to a date or a number
}

Are there any pitfalls with this approach? I assume I need to link the shared
strings table somehow to the c (cell) tag (I can probably derive how to do
this from the source code of the event API).
Formula evaluation is not very important for this at the moment (for formula
evaluation the HadoopOffice library offers loading the full file into memory).
Of course the full shared strings table still needs to be loaded into memory,
but I expect the data to have a very small shared strings table and/or to use
mostly numeric values.
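As a sketch of that shared-strings lookup: in SpreadsheetML, a c element with t="s" stores in its v child an index into the shared strings table rather than the value itself. The following is a minimal, POI-free illustration with plain JDK StAX; the hard-coded XML snippets and the in-memory string list are stand-ins for the real sheet XML and ReadOnlySharedStringsTable:

```java
import java.io.StringReader;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class SharedStringsDemo {
    // Resolve the value of a c element: t="s" means the v content is an
    // index into the shared strings table, otherwise it is a literal value.
    static String resolveCell(String cellXml, List<String> sharedStrings) throws Exception {
        XMLStreamReader xmlr = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(cellXml));
        boolean isShared = false;
        String value = null;
        while (xmlr.hasNext()) {
            int event = xmlr.next();
            if (event == XMLStreamConstants.START_ELEMENT
                    && "c".equals(xmlr.getLocalName())) {
                isShared = "s".equals(xmlr.getAttributeValue(null, "t"));
            } else if (event == XMLStreamConstants.CHARACTERS) {
                value = xmlr.getText();  // content of the v element
            }
        }
        xmlr.close();
        return isShared ? sharedStrings.get(Integer.parseInt(value)) : value;
    }

    public static void main(String[] args) throws Exception {
        List<String> sst = List.of("Hello", "World");  // stand-in shared strings table
        System.out.println(resolveCell("<c r=\"A1\" t=\"s\"><v>1</v></c>", sst)); // World
        System.out.println(resolveCell("<c r=\"B1\"><v>3.5</v></c>", sst));       // 3.5
    }
}
```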

Has anyone used the StAX API with .xlsx files in POI?


Thank you.

best regards