I have an incoming xlsx file, with many sheets. I am trying to use
ConvertExcelToCSVProcessor processor to extract the sheets. It is currently
erroring when it extracts the header.

Here is what the header is in one of the sheets (commas added by me for
clarity):

Order,FAO,ISO3,Country/territory,Region,Sub-Region,Inc I,Inc II,Notes
I,Notes
II,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020,,1961

This is what it should be:
Order,FAO,ISO3,Country/territory,Region,Sub-Region,Inc I,Inc II,Notes
I,Notes
II,1961,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,
*2020,,1961*
,1962,1963,1964,1965,1966,1967,1968,1969,1970,1971,1972,1973,1974,1975,1976,1977,1978,1979,1980,1981,1982,1983,1984,1985,1986,1987,1988,1989,1990,1991,1992,1993,1994,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020

Note that there is a gap column between the two sections - one section
provides data in one unit of measure, the second in hectares of arable land.

This is how I have my Convert processor configured:
Sheets to Extract                 No value
setNumber of Rows to Skip  0
Columns To Skip                   No value
setFormat Cell Values           true
CSV Format                          Microsoft Excel
Include Header Line              true

How must I tune this configuration to successfully extract all the header
fields? And why would it make it past that gap column, only to choke after
the first column of the second section? It seems that if it was going to
give up the ghost, it would do it when it encounters that gap column.

How does this processor identify the row that is the header? I'd like to
know what it does in case I need to write my own Groovy script to pull this
data. What characteristic does it use? What heuristic does it apply to skip
all the comment rows that sometimes precede the header row.

Thank you.

Reply via email to