[ https://issues.apache.org/jira/browse/NIFI-12491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17859726#comment-17859726 ]
Brendan Buhr commented on NIFI-12491:
-------------------------------------

[~dstiegli1] I expect that, just as we did before the ExcelReader controller service existed, anyone who wants to apply a schema will first use a split (excelToCSV, which is now replaced with the splitExcel processor) and then feed the result into a record reader with the ExcelReader defined and a schema per sheet.

# Setting the first row as the header was an option on the CSVReader, and it was the default. We are merely looking to get that same functionality on this reader. As mentioned by [~iiojj2], there was also an option on excelToCSV to skip columns and rows. Since the Excel split only splits the file, I think it would be best to bring those two features over to the reader and possibly make them dynamic: set as flow attributes and then passed to the reader, so that a single reader can be used on multiple sheets, with the sheet and the rows/columns to skip supplied as attributes. They would be optional, with the default being to maintain the existing result when they are not defined.

!image-2024-06-24-18-01-49-886.png!
!image-2024-06-24-18-02-36-592.png!

# I have never encountered a case where multiple sheets had the same structure and the same header gets applied, but I can picture a scenario where data is split into chunks for various reasons and would be treated as one dataset. We would usually split this and then do a join of some sort before querying it, but I can see the benefit of having the reader join the data. I would make this optional and not the default behavior, so that you can still get an error when no sheet name is specified and trigger alerts on that.

On a side note, one thing I have experienced is data with two header rows where the first one is a merged header across rows: the value was written to the first cell and the rest were blank.
Sometimes it's nice to have that value in all the merged cells so that you can combine it with the second-row headers. (This is a nice-to-have and not relevant to this ticket.)

> ExcelReader - new Schema Access strategy: Use String Fields From Header
> -----------------------------------------------------------------------
>
>                 Key: NIFI-12491
>                 URL: https://issues.apache.org/jira/browse/NIFI-12491
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Core Framework
>    Affects Versions: 1.23.2
>            Reporter: Philipp Korniets
>            Assignee: Daniel Stieglitz
>            Priority: Major
>         Attachments: image-2024-06-24-18-01-49-886.png, image-2024-06-24-18-02-36-592.png
>
>
> ExcelReader needs an ability similar to CSVReader to "Use String Fields From Header" as a Schema Access Strategy.
> Current implementation has:
> 1. Use Schema Name/Schema Text - this option relies on the order of the columns. Possible issue: the order of the columns changes but the types don't. This causes further calculations to be erroneous.
> 2. Infer Schema - replaces real column names with column_1, column_2, etc. This again loses the "context" of the column and forces us to rely on how columns are ordered.
> Any workarounds make the workflow more complicated.
>
>
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
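
For illustration only, the behaviour requested in the issue description (take field names from the header row, with optional leading rows and columns skipped) could be sketched roughly as below. This is plain Java over in-memory string rows, not NiFi or Apache POI code; the class name `HeaderSchemaSketch`, the method `readRecords`, and the `skipRows`/`skipColumns` parameters are hypothetical names chosen for the sketch and are not ExcelReader properties.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: not the NiFi ExcelReader implementation.
final class HeaderSchemaSketch {

    /**
     * Treats the first non-skipped row as the header and returns one
     * name-to-value map per remaining row. Every value stays a string.
     */
    static List<Map<String, String>> readRecords(
            List<List<String>> rows, int skipRows, int skipColumns) {
        List<Map<String, String>> records = new ArrayList<>();
        if (rows.size() <= skipRows) {
            return records; // nothing left once the leading rows are skipped
        }
        List<String> header = rows.get(skipRows);
        for (int r = skipRows + 1; r < rows.size(); r++) {
            List<String> row = rows.get(r);
            Map<String, String> record = new LinkedHashMap<>();
            for (int c = skipColumns; c < header.size(); c++) {
                // Rows shorter than the header yield null for the missing cells.
                String value = c < row.size() ? row.get(c) : null;
                record.put(header.get(c), value);
            }
            records.add(record);
        }
        return records;
    }

    public static void main(String[] args) {
        // One title row and one blank leading column to skip.
        List<List<String>> sheet = List.of(
                List.of("title", "", ""),
                List.of("", "name", "amount"),
                List.of("", "alice", "10"));
        // Same data with the two columns swapped.
        List<List<String>> reordered = List.of(
                List.of("title", "", ""),
                List.of("", "amount", "name"),
                List.of("", "10", "alice"));
        // Lookups by field name are unaffected by the column order.
        System.out.println(readRecords(sheet, 1, 1).get(0).get("amount"));
        System.out.println(readRecords(reordered, 1, 1).get(0).get("amount"));
    }
}
```

Because each record is keyed by the header text rather than by position, reordering the columns in the sheet does not change what a downstream lookup by name returns, which is the problem the description raises with the position-based strategies.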