[GitHub] [nifi] exceptionfactory commented on a diff in pull request #7194: NIFI-11167 - Add Excel Record Reader
exceptionfactory commented on code in PR #7194: URL: https://github.com/apache/nifi/pull/7194#discussion_r1231194122 ## nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/pom.xml: ## @@ -75,6 +75,11 @@ com.fasterxml.jackson.dataformat jackson-dataformat-csv + +com.github.pjfanning +excel-streaming-reader Review Comment: This dependency can be removed now that the ExcelReader is in a different module. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [nifi] exceptionfactory commented on a diff in pull request #7194: NIFI-11167 - Add Excel Record Reader
exceptionfactory commented on code in PR #7194: URL: https://github.com/apache/nifi/pull/7194#discussion_r1218567911 ## nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/excel/RowIterator.java: ## @@ -0,0 +1,147 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.nifi.excel; + +import com.github.pjfanning.xlsx.StreamingReader; +import org.apache.nifi.logging.ComponentLog; +import org.apache.nifi.processor.exception.ProcessException; +import org.apache.poi.ss.usermodel.Row; +import org.apache.poi.ss.usermodel.Sheet; +import org.apache.poi.ss.usermodel.Workbook; + +import java.io.Closeable; +import java.io.IOException; +import java.io.InputStream; +import java.util.HashMap; +import java.util.Iterator; +import java.util.List; +import java.util.Map; +import java.util.stream.Collectors; + +public class RowIterator implements Iterator, Closeable { +private final Workbook workbook; +private final Iterator sheets; +private final Map requiredSheets; +private final int firstRow; +private final ComponentLog logger; +private Sheet currentSheet; +private Iterator currentRows; +private Row currentRow; + +public RowIterator(InputStream in, List requiredSheets, int firstRow, ComponentLog logger) { +this.workbook = StreamingReader.builder() +.rowCacheSize(100) +.bufferSize(4096) +.open(in); +this.sheets = this.workbook.iterator(); +this.requiredSheets = requiredSheets != null ? requiredSheets.stream() +.collect(Collectors.toMap(key -> key, value -> Boolean.FALSE)) : new HashMap<>(); +this.firstRow = firstRow; +this.logger = logger; +} + +@Override +public boolean hasNext() { +setCurrent(); Review Comment: Although it may not be the most intuitive approach, having `hasNext()` throw an `UnsupportedOperationException` and changing callers to just use `next()` could work, although it raises a question of logic handling for the calling class. Another option could be folding the functionality into the calling class. I think it would be cleaner in the end to rework the RowIterator, in addition to removing the `public` visibility from the class. I could take a look at reworking RowIterator and pushing the changes to this branch if you think the changes are too involved. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [nifi] exceptionfactory commented on a diff in pull request #7194: NIFI-11167 - Add Excel Record Reader
exceptionfactory commented on code in PR #7194: URL: https://github.com/apache/nifi/pull/7194#discussion_r1218506840 ## nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/excel/RowIterator.java: ## @@ -0,0 +1,147 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.nifi.excel; + +import com.github.pjfanning.xlsx.StreamingReader; +import org.apache.nifi.logging.ComponentLog; +import org.apache.nifi.processor.exception.ProcessException; +import org.apache.poi.ss.usermodel.Row; +import org.apache.poi.ss.usermodel.Sheet; +import org.apache.poi.ss.usermodel.Workbook; + +import java.io.Closeable; +import java.io.IOException; +import java.io.InputStream; +import java.util.HashMap; +import java.util.Iterator; +import java.util.List; +import java.util.Map; +import java.util.stream.Collectors; + +public class RowIterator implements Iterator, Closeable { +private final Workbook workbook; +private final Iterator sheets; +private final Map requiredSheets; +private final int firstRow; +private final ComponentLog logger; +private Sheet currentSheet; +private Iterator currentRows; +private Row currentRow; + +public RowIterator(InputStream in, List requiredSheets, int firstRow, ComponentLog logger) { +this.workbook = StreamingReader.builder() +.rowCacheSize(100) +.bufferSize(4096) +.open(in); +this.sheets = this.workbook.iterator(); +this.requiredSheets = requiredSheets != null ? requiredSheets.stream() +.collect(Collectors.toMap(key -> key, value -> Boolean.FALSE)) : new HashMap<>(); +this.firstRow = firstRow; +this.logger = logger; +} + +@Override +public boolean hasNext() { +setCurrent(); Review Comment: It should be possible to call `hasNext()` in any order. Taking a closer look, it seems like the current implementation of `setCurrent` needs further refinement because it sets `currentSheet` to `null` on every invocation. One general approach could be to set the initial sheet and row in the constructor. In general, it should be possible to call `hasNext()` and unlimited number of times without advancing the iterator, and calling `next()` should advance the iterator forward. When all sheets and rows have been read, `next()` should throw an `NoSuchElementException` if called again. ## nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/excel/RowIterator.java: ## @@ -0,0 +1,147 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.nifi.excel; + +import com.github.pjfanning.xlsx.StreamingReader; +import org.apache.nifi.logging.ComponentLog; +import org.apache.nifi.processor.exception.ProcessException; +import org.apache.poi.ss.usermodel.Row; +import org.apache.poi.ss.usermodel.Sheet; +import org.apache.poi.ss.usermodel.Workbook; + +import java.io.Closeable; +import java.io.IOException; +import java.io.InputStream; +import java.util.HashMap; +import java.util.Iterator; +import java.util.List; +import java.util.Map; +import java.util.stream.Collectors; + +public class RowIterator implements Iterator, Closeable { +private final Workbook workbook; +private final Iterator
[GitHub] [nifi] exceptionfactory commented on a diff in pull request #7194: NIFI-11167 - Add Excel Record Reader
exceptionfactory commented on code in PR #7194: URL: https://github.com/apache/nifi/pull/7194#discussion_r1218506840 ## nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/excel/RowIterator.java: ## @@ -0,0 +1,147 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.nifi.excel; + +import com.github.pjfanning.xlsx.StreamingReader; +import org.apache.nifi.logging.ComponentLog; +import org.apache.nifi.processor.exception.ProcessException; +import org.apache.poi.ss.usermodel.Row; +import org.apache.poi.ss.usermodel.Sheet; +import org.apache.poi.ss.usermodel.Workbook; + +import java.io.Closeable; +import java.io.IOException; +import java.io.InputStream; +import java.util.HashMap; +import java.util.Iterator; +import java.util.List; +import java.util.Map; +import java.util.stream.Collectors; + +public class RowIterator implements Iterator, Closeable { +private final Workbook workbook; +private final Iterator sheets; +private final Map requiredSheets; +private final int firstRow; +private final ComponentLog logger; +private Sheet currentSheet; +private Iterator currentRows; +private Row currentRow; + +public RowIterator(InputStream in, List requiredSheets, int firstRow, ComponentLog logger) { +this.workbook = StreamingReader.builder() +.rowCacheSize(100) +.bufferSize(4096) +.open(in); +this.sheets = this.workbook.iterator(); +this.requiredSheets = requiredSheets != null ? requiredSheets.stream() +.collect(Collectors.toMap(key -> key, value -> Boolean.FALSE)) : new HashMap<>(); +this.firstRow = firstRow; +this.logger = logger; +} + +@Override +public boolean hasNext() { +setCurrent(); Review Comment: It should be possible to call `hasNext()` in any order. Taking a closer look, it seems like the current implementation of `setCurrent` needs further refinement because it sets `currentSheet` to `null` on every invocation. One general approach could be to set the initial sheet and row in the constructor. In general, it should be possible to call `hasNext()` and unlimited number of times without advancing the iterator, and it calling `next()` should advance the iterator forward. When all sheets and rows have been read, `next()` should throw an `NoSuchElementException` if called again. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [nifi] exceptionfactory commented on a diff in pull request #7194: NIFI-11167 - Add Excel Record Reader
exceptionfactory commented on code in PR #7194: URL: https://github.com/apache/nifi/pull/7194#discussion_r1218506840 ## nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/excel/RowIterator.java: ## @@ -0,0 +1,147 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.nifi.excel; + +import com.github.pjfanning.xlsx.StreamingReader; +import org.apache.nifi.logging.ComponentLog; +import org.apache.nifi.processor.exception.ProcessException; +import org.apache.poi.ss.usermodel.Row; +import org.apache.poi.ss.usermodel.Sheet; +import org.apache.poi.ss.usermodel.Workbook; + +import java.io.Closeable; +import java.io.IOException; +import java.io.InputStream; +import java.util.HashMap; +import java.util.Iterator; +import java.util.List; +import java.util.Map; +import java.util.stream.Collectors; + +public class RowIterator implements Iterator, Closeable { +private final Workbook workbook; +private final Iterator sheets; +private final Map requiredSheets; +private final int firstRow; +private final ComponentLog logger; +private Sheet currentSheet; +private Iterator currentRows; +private Row currentRow; + +public RowIterator(InputStream in, List requiredSheets, int firstRow, ComponentLog logger) { +this.workbook = StreamingReader.builder() +.rowCacheSize(100) +.bufferSize(4096) +.open(in); +this.sheets = this.workbook.iterator(); +this.requiredSheets = requiredSheets != null ? requiredSheets.stream() +.collect(Collectors.toMap(key -> key, value -> Boolean.FALSE)) : new HashMap<>(); +this.firstRow = firstRow; +this.logger = logger; +} + +@Override +public boolean hasNext() { +setCurrent(); Review Comment: It should be possible to call `hasNext()` in any order. Taking a closer look, it seems like the current implementation of `setCurrent` needs further refinement because it sets `currentSheet` to `null` on every invocation. One general approach could be to set the initial sheet and row in the constructor. In general, it should be possible to call `hasNext()` and unlimited number of times without advancing the iterator, and it calling `next()` should advance the iterator forward. When all sheets and rows have been read, `next()` should throw an `IllegalStateException` if called again. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [nifi] exceptionfactory commented on a diff in pull request #7194: NIFI-11167 - Add Excel Record Reader
exceptionfactory commented on code in PR #7194: URL: https://github.com/apache/nifi/pull/7194#discussion_r1218310754 ## nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/excel/ExcelSchemaInference.java: ## @@ -0,0 +1,83 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.nifi.excel; + +import org.apache.nifi.schema.inference.FieldTypeInference; +import org.apache.nifi.schema.inference.RecordSource; +import org.apache.nifi.schema.inference.SchemaInferenceEngine; +import org.apache.nifi.schema.inference.TimeValueInference; +import org.apache.nifi.serialization.SimpleRecordSchema; +import org.apache.nifi.serialization.record.DataType; +import org.apache.nifi.serialization.record.RecordField; +import org.apache.nifi.serialization.record.RecordSchema; +import org.apache.nifi.util.SchemaInferenceUtil; +import org.apache.poi.ss.usermodel.Cell; +import org.apache.poi.ss.usermodel.DataFormatter; +import org.apache.poi.ss.usermodel.Row; + +import java.io.IOException; +import java.util.LinkedHashMap; +import java.util.List; +import java.util.Locale; +import java.util.Map; +import java.util.stream.Collectors; +import java.util.stream.IntStream; + +public class ExcelSchemaInference implements SchemaInferenceEngine { +private final TimeValueInference timeValueInference; +private final DataFormatter dataFormatter; + +public ExcelSchemaInference(TimeValueInference timeValueInference) { +this(timeValueInference, null); +} + +public ExcelSchemaInference(TimeValueInference timeValueInference, Locale locale) { +this.timeValueInference = timeValueInference; +this.dataFormatter = locale == null ? new DataFormatter() : new DataFormatter(locale); +} + +@Override +public RecordSchema inferSchema(RecordSource recordSource) throws IOException { +final Map typeMap = new LinkedHashMap<>(); +Row row; +while((row = recordSource.next()) != null) { +inferSchema(row, typeMap); +} +return createSchema(typeMap); +} + +private void inferSchema(final Row row, final Map typeMap) { +if (ExcelUtils.hasCells(row)) { +IntStream.range(0, row.getLastCellNum()) +.forEach(index -> { +final Cell cell = row.getCell(index); +final String fieldName = Integer.toString(index); Review Comment: What do you think of prefixing this field name with `column_`? So the first field name would be `column_0`? That would be easier to work with in absence of a defined schema. ## nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/excel/RowIterator.java: ## @@ -0,0 +1,147 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.nifi.excel; + +import com.github.pjfanning.xlsx.StreamingReader; +import org.apache.nifi.logging.ComponentLog; +import org.apache.nifi.processor.exception.ProcessException; +import org.apache.poi.ss.usermodel.Row; +import org.apache.poi.ss.usermodel.Sheet; +import org.apache.poi.ss.usermodel.Workbook; + +import java.io.Closeable; +import java.io.IOException; +import java.io.InputStream; +import java.util.HashMap; +import java.util.Iterator; +import java.util.List;
[GitHub] [nifi] exceptionfactory commented on a diff in pull request #7194: NIFI-11167 - Add Excel Record Reader
exceptionfactory commented on code in PR #7194: URL: https://github.com/apache/nifi/pull/7194#discussion_r1212254476 ## nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/excel/ExcelHeaderSchemaStrategy.java: ## @@ -0,0 +1,116 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.nifi.excel; + +import org.apache.nifi.context.PropertyContext; +import org.apache.nifi.logging.ComponentLog; +import org.apache.nifi.schema.access.SchemaAccessStrategy; +import org.apache.nifi.schema.access.SchemaField; +import org.apache.nifi.schema.access.SchemaNotFoundException; +import org.apache.nifi.schema.inference.FieldTypeInference; +import org.apache.nifi.schema.inference.RecordSource; +import org.apache.nifi.schema.inference.TimeValueInference; +import org.apache.nifi.serialization.DateTimeUtils; +import org.apache.nifi.serialization.SimpleRecordSchema; +import org.apache.nifi.serialization.record.DataType; +import org.apache.nifi.serialization.record.RecordField; +import org.apache.nifi.serialization.record.RecordSchema; +import org.apache.nifi.util.SchemaInferenceUtil; +import org.apache.poi.ss.usermodel.Cell; +import org.apache.poi.ss.usermodel.DataFormatter; +import org.apache.poi.ss.usermodel.Row; + +import java.io.IOException; +import java.io.InputStream; +import java.util.EnumSet; +import java.util.LinkedHashMap; +import java.util.List; +import java.util.Locale; +import java.util.Map; +import java.util.Set; +import java.util.stream.Collectors; +import java.util.stream.IntStream; + +public class ExcelHeaderSchemaStrategy implements SchemaAccessStrategy { +private static final Set schemaFields = EnumSet.noneOf(SchemaField.class); + +private final PropertyContext context; + +private final DataFormatter dataFormatter; + +private final ComponentLog logger; + +public ExcelHeaderSchemaStrategy(PropertyContext context, ComponentLog logger) { +this(context, logger, null); +} + +public ExcelHeaderSchemaStrategy(PropertyContext context, ComponentLog logger, Locale locale) { +this.context = context; +this.logger = logger; +this.dataFormatter = locale == null ? new DataFormatter() : new DataFormatter(locale); +} + +@Override +public RecordSchema getSchema(Map variables, InputStream contentStream, RecordSchema readSchema) throws SchemaNotFoundException, IOException { +if (this.context == null) { +throw new SchemaNotFoundException("Schema Access Strategy intended only for validation purposes and cannot obtain schema"); +} + +String errMsg = "Failed to read Header line from Excel worksheet"; +RecordSource recordSource; +try { +recordSource = new ExcelRecordSource(contentStream, context, variables, logger); +} catch (Exception e) { +throw new SchemaNotFoundException(errMsg, e); +} + +Row headerRow = recordSource.next(); +if (!ExcelUtils.hasCells(headerRow)) { +throw new SchemaNotFoundException("The chosen header line in the Excel worksheet had no cells"); +} + +try { +String dateFormat = context.getProperty(DateTimeUtils.DATE_FORMAT).getValue(); +String timeFormat = context.getProperty(DateTimeUtils.TIME_FORMAT).getValue(); +String timestampFormat = context.getProperty(DateTimeUtils.TIMESTAMP_FORMAT).getValue(); +final TimeValueInference timeValueInference = new TimeValueInference(dateFormat, timeFormat, timestampFormat); +final Map typeMap = new LinkedHashMap<>(); +IntStream.range(0, headerRow.getLastCellNum()) +.forEach(index -> { +final Cell cell = headerRow.getCell(index); +final String fieldName = Integer.toString(index); Review Comment: The string typing is also a problem as you noted. Theoretically that could also be changed to follow an approach similar to the Infer Schema strategy, reading all rows to determine potential values. In the interest of completing a first
[GitHub] [nifi] exceptionfactory commented on a diff in pull request #7194: NIFI-11167 - Add Excel Record Reader
exceptionfactory commented on code in PR #7194: URL: https://github.com/apache/nifi/pull/7194#discussion_r1212240524 ## nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/excel/ExcelHeaderSchemaStrategy.java: ## @@ -0,0 +1,116 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.nifi.excel; + +import org.apache.nifi.context.PropertyContext; +import org.apache.nifi.logging.ComponentLog; +import org.apache.nifi.schema.access.SchemaAccessStrategy; +import org.apache.nifi.schema.access.SchemaField; +import org.apache.nifi.schema.access.SchemaNotFoundException; +import org.apache.nifi.schema.inference.FieldTypeInference; +import org.apache.nifi.schema.inference.RecordSource; +import org.apache.nifi.schema.inference.TimeValueInference; +import org.apache.nifi.serialization.DateTimeUtils; +import org.apache.nifi.serialization.SimpleRecordSchema; +import org.apache.nifi.serialization.record.DataType; +import org.apache.nifi.serialization.record.RecordField; +import org.apache.nifi.serialization.record.RecordSchema; +import org.apache.nifi.util.SchemaInferenceUtil; +import org.apache.poi.ss.usermodel.Cell; +import org.apache.poi.ss.usermodel.DataFormatter; +import org.apache.poi.ss.usermodel.Row; + +import java.io.IOException; +import java.io.InputStream; +import java.util.EnumSet; +import java.util.LinkedHashMap; +import java.util.List; +import java.util.Locale; +import java.util.Map; +import java.util.Set; +import java.util.stream.Collectors; +import java.util.stream.IntStream; + +public class ExcelHeaderSchemaStrategy implements SchemaAccessStrategy { +private static final Set schemaFields = EnumSet.noneOf(SchemaField.class); + +private final PropertyContext context; + +private final DataFormatter dataFormatter; + +private final ComponentLog logger; + +public ExcelHeaderSchemaStrategy(PropertyContext context, ComponentLog logger) { +this(context, logger, null); +} + +public ExcelHeaderSchemaStrategy(PropertyContext context, ComponentLog logger, Locale locale) { +this.context = context; +this.logger = logger; +this.dataFormatter = locale == null ? new DataFormatter() : new DataFormatter(locale); +} + +@Override +public RecordSchema getSchema(Map variables, InputStream contentStream, RecordSchema readSchema) throws SchemaNotFoundException, IOException { +if (this.context == null) { +throw new SchemaNotFoundException("Schema Access Strategy intended only for validation purposes and cannot obtain schema"); +} + +String errMsg = "Failed to read Header line from Excel worksheet"; +RecordSource recordSource; +try { +recordSource = new ExcelRecordSource(contentStream, context, variables, logger); +} catch (Exception e) { +throw new SchemaNotFoundException(errMsg, e); +} + +Row headerRow = recordSource.next(); +if (!ExcelUtils.hasCells(headerRow)) { +throw new SchemaNotFoundException("The chosen header line in the Excel worksheet had no cells"); +} + +try { +String dateFormat = context.getProperty(DateTimeUtils.DATE_FORMAT).getValue(); +String timeFormat = context.getProperty(DateTimeUtils.TIME_FORMAT).getValue(); +String timestampFormat = context.getProperty(DateTimeUtils.TIMESTAMP_FORMAT).getValue(); +final TimeValueInference timeValueInference = new TimeValueInference(dateFormat, timeFormat, timestampFormat); +final Map typeMap = new LinkedHashMap<>(); +IntStream.range(0, headerRow.getLastCellNum()) +.forEach(index -> { +final Cell cell = headerRow.getCell(index); +final String fieldName = Integer.toString(index); Review Comment: @dan-s1, although the header column does not have to be a string, the expectation for this strategy is that the first row cell values should be used as the field names, analogous to the CSV implementation. In its current state, I agree the
[GitHub] [nifi] exceptionfactory commented on a diff in pull request #7194: NIFI-11167 - Add Excel Record Reader
exceptionfactory commented on code in PR #7194: URL: https://github.com/apache/nifi/pull/7194#discussion_r1212152365 ## nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/excel/ExcelHeaderSchemaStrategy.java: ## @@ -0,0 +1,116 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.nifi.excel; + +import org.apache.nifi.context.PropertyContext; +import org.apache.nifi.logging.ComponentLog; +import org.apache.nifi.schema.access.SchemaAccessStrategy; +import org.apache.nifi.schema.access.SchemaField; +import org.apache.nifi.schema.access.SchemaNotFoundException; +import org.apache.nifi.schema.inference.FieldTypeInference; +import org.apache.nifi.schema.inference.RecordSource; +import org.apache.nifi.schema.inference.TimeValueInference; +import org.apache.nifi.serialization.DateTimeUtils; +import org.apache.nifi.serialization.SimpleRecordSchema; +import org.apache.nifi.serialization.record.DataType; +import org.apache.nifi.serialization.record.RecordField; +import org.apache.nifi.serialization.record.RecordSchema; +import org.apache.nifi.util.SchemaInferenceUtil; +import org.apache.poi.ss.usermodel.Cell; +import org.apache.poi.ss.usermodel.DataFormatter; +import org.apache.poi.ss.usermodel.Row; + +import java.io.IOException; +import java.io.InputStream; +import java.util.EnumSet; +import java.util.LinkedHashMap; +import java.util.List; +import java.util.Locale; +import java.util.Map; +import java.util.Set; +import java.util.stream.Collectors; +import java.util.stream.IntStream; + +public class ExcelHeaderSchemaStrategy implements SchemaAccessStrategy { +private static final Set schemaFields = EnumSet.noneOf(SchemaField.class); + +private final PropertyContext context; + +private final DataFormatter dataFormatter; + +private final ComponentLog logger; + +public ExcelHeaderSchemaStrategy(PropertyContext context, ComponentLog logger) { +this(context, logger, null); +} + +public ExcelHeaderSchemaStrategy(PropertyContext context, ComponentLog logger, Locale locale) { +this.context = context; +this.logger = logger; +this.dataFormatter = locale == null ? new DataFormatter() : new DataFormatter(locale); +} + +@Override +public RecordSchema getSchema(Map variables, InputStream contentStream, RecordSchema readSchema) throws SchemaNotFoundException, IOException { +if (this.context == null) { +throw new SchemaNotFoundException("Schema Access Strategy intended only for validation purposes and cannot obtain schema"); +} + +String errMsg = "Failed to read Header line from Excel worksheet"; +RecordSource recordSource; +try { +recordSource = new ExcelRecordSource(contentStream, context, variables, logger); +} catch (Exception e) { +throw new SchemaNotFoundException(errMsg, e); +} + +Row headerRow = recordSource.next(); +if (!ExcelUtils.hasCells(headerRow)) { +throw new SchemaNotFoundException("The chosen header line in the Excel worksheet had no cells"); +} + +try { +String dateFormat = context.getProperty(DateTimeUtils.DATE_FORMAT).getValue(); +String timeFormat = context.getProperty(DateTimeUtils.TIME_FORMAT).getValue(); +String timestampFormat = context.getProperty(DateTimeUtils.TIMESTAMP_FORMAT).getValue(); +final TimeValueInference timeValueInference = new TimeValueInference(dateFormat, timeFormat, timestampFormat); +final Map typeMap = new LinkedHashMap<>(); +IntStream.range(0, headerRow.getLastCellNum()) +.forEach(index -> { +final Cell cell = headerRow.getCell(index); +final String fieldName = Integer.toString(index); Review Comment: This approach uses the field index as the field name, instead of using the cell value as the field name, which would be the expected behavior. ##
[GitHub] [nifi] exceptionfactory commented on a diff in pull request #7194: NIFI-11167 - Add Excel Record Reader
exceptionfactory commented on code in PR #7194: URL: https://github.com/apache/nifi/pull/7194#discussion_r1199111371 ## nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/excel/RowIterator.java: ## @@ -0,0 +1,155 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */package org.apache.nifi.excel; + +import com.github.pjfanning.xlsx.StreamingReader; +import org.apache.nifi.logging.ComponentLog; +import org.apache.poi.ss.usermodel.Row; +import org.apache.poi.ss.usermodel.Sheet; +import org.apache.poi.ss.usermodel.Workbook; + +import java.io.Closeable; +import java.io.IOException; +import java.io.InputStream; +import java.util.HashMap; +import java.util.Iterator; +import java.util.List; +import java.util.Map; +import java.util.stream.Collectors; + +public class RowIterator implements Iterator, Closeable { +private final Workbook workbook; +private final Iterator sheets; +private Sheet currentSheet; +private Iterator currentRows; +private final Map desiredSheets; +private final int firstRow; +private ComponentLog logger; +private boolean log; +private Row currentRow; + +public RowIterator(InputStream in, List desiredSheets, int firstRow) { +this(in, desiredSheets, firstRow, null); +} + +public RowIterator(InputStream in, List desiredSheets, int firstRow, ComponentLog logger) { +this.workbook = StreamingReader.builder() +.rowCacheSize(100) +.bufferSize(4096) +.open(in); +this.sheets = this.workbook.iterator(); +this.desiredSheets = desiredSheets != null ? desiredSheets.stream() +.collect(Collectors.toMap(key -> key, value -> Boolean.FALSE)) : new HashMap<>(); +this.firstRow = firstRow; +this.logger = logger; +this.log = logger != null; +} + +@Override +public boolean hasNext() { +setCurrent(); +boolean next = currentRow != null; +if(!next) { +String sheetsNotFound = getSheetsNotFound(desiredSheets); +if (!sheetsNotFound.isEmpty() && log) { +logger.warn("Excel sheet(s) not found: {}", sheetsNotFound); +} Review Comment: It could be something else, probably `ProcessException` would be a good option as a more specific subclass of `RuntimeException`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [nifi] exceptionfactory commented on a diff in pull request #7194: NIFI-11167 - Add Excel Record Reader
exceptionfactory commented on code in PR #7194: URL: https://github.com/apache/nifi/pull/7194#discussion_r1199060548 ## nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/excel/RowIterator.java: ## @@ -0,0 +1,155 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */package org.apache.nifi.excel; + +import com.github.pjfanning.xlsx.StreamingReader; +import org.apache.nifi.logging.ComponentLog; +import org.apache.poi.ss.usermodel.Row; +import org.apache.poi.ss.usermodel.Sheet; +import org.apache.poi.ss.usermodel.Workbook; + +import java.io.Closeable; +import java.io.IOException; +import java.io.InputStream; +import java.util.HashMap; +import java.util.Iterator; +import java.util.List; +import java.util.Map; +import java.util.stream.Collectors; + +public class RowIterator implements Iterator, Closeable { +private final Workbook workbook; +private final Iterator sheets; +private Sheet currentSheet; +private Iterator currentRows; +private final Map desiredSheets; +private final int firstRow; +private ComponentLog logger; +private boolean log; +private Row currentRow; + +public RowIterator(InputStream in, List desiredSheets, int firstRow) { +this(in, desiredSheets, firstRow, null); +} + +public RowIterator(InputStream in, List desiredSheets, int firstRow, ComponentLog logger) { +this.workbook = StreamingReader.builder() +.rowCacheSize(100) +.bufferSize(4096) +.open(in); +this.sheets = this.workbook.iterator(); +this.desiredSheets = desiredSheets != null ? desiredSheets.stream() +.collect(Collectors.toMap(key -> key, value -> Boolean.FALSE)) : new HashMap<>(); +this.firstRow = firstRow; +this.logger = logger; +this.log = logger != null; +} + +@Override +public boolean hasNext() { +setCurrent(); +boolean next = currentRow != null; +if(!next) { +String sheetsNotFound = getSheetsNotFound(desiredSheets); +if (!sheetsNotFound.isEmpty() && log) { +logger.warn("Excel sheet(s) not found: {}", sheetsNotFound); +} Review Comment: You're right, I was thinking more of Record Writers and Processors that generate record-oriented files. With that in mind, it looks like there isn't a good way to signal a problem of this kind, other than throwing an exception. With that in mind, I think the property should be changed to `Required Sheets`, and the description should indicate that the absence of these Sheets will throw an exception. We could theoretically introduce a `Sheet Processing Strategy` of some kind, but I think having the single `Required Sheets` property is the best way to go forward for now. How does that sound? In that case, instead of logging a warning, this condition would throw an `IOException`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [nifi] exceptionfactory commented on a diff in pull request #7194: NIFI-11167 - Add Excel Record Reader
exceptionfactory commented on code in PR #7194: URL: https://github.com/apache/nifi/pull/7194#discussion_r1198284268 ## nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/excel/RowIterator.java: ## @@ -0,0 +1,155 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */package org.apache.nifi.excel; + +import com.github.pjfanning.xlsx.StreamingReader; +import org.apache.nifi.logging.ComponentLog; +import org.apache.poi.ss.usermodel.Row; +import org.apache.poi.ss.usermodel.Sheet; +import org.apache.poi.ss.usermodel.Workbook; + +import java.io.Closeable; +import java.io.IOException; +import java.io.InputStream; +import java.util.HashMap; +import java.util.Iterator; +import java.util.List; +import java.util.Map; +import java.util.stream.Collectors; + +public class RowIterator implements Iterator, Closeable { +private final Workbook workbook; +private final Iterator sheets; +private Sheet currentSheet; +private Iterator currentRows; +private final Map desiredSheets; +private final int firstRow; +private ComponentLog logger; +private boolean log; +private Row currentRow; + +public RowIterator(InputStream in, List desiredSheets, int firstRow) { +this(in, desiredSheets, firstRow, null); +} + +public RowIterator(InputStream in, List desiredSheets, int firstRow, ComponentLog logger) { +this.workbook = StreamingReader.builder() +.rowCacheSize(100) +.bufferSize(4096) +.open(in); +this.sheets = this.workbook.iterator(); +this.desiredSheets = desiredSheets != null ? desiredSheets.stream() +.collect(Collectors.toMap(key -> key, value -> Boolean.FALSE)) : new HashMap<>(); +this.firstRow = firstRow; +this.logger = logger; +this.log = logger != null; +} + +@Override +public boolean hasNext() { +setCurrent(); +boolean next = currentRow != null; +if(!next) { +String sheetsNotFound = getSheetsNotFound(desiredSheets); +if (!sheetsNotFound.isEmpty() && log) { +logger.warn("Excel sheet(s) not found: {}", sheetsNotFound); +} +} +return next; +} + +private void setCurrent() { +currentRow = getNextRow(); +if (currentRow != null) { +return; +} + +currentSheet = null; +currentRows = null; +while (sheets.hasNext()) { +currentSheet = sheets.next(); +if (isIterateOverAllSheets() || hasSheet(currentSheet.getSheetName())) { +currentRows = currentSheet.iterator(); +currentRow = getNextRow(); +if (currentRow != null) { +return; +} +} +} +} + +private Row getNextRow() { +while (currentRows != null && !hasExhaustedRows()) { +Row tempCurrentRow = currentRows.next(); +if (!isSkip(tempCurrentRow)) { +return tempCurrentRow; +} +} +return null; +} + +private boolean hasExhaustedRows() { +boolean exhausted = !currentRows.hasNext(); +if (log && exhausted) { +logger.info("Exhausted all rows from sheet {}", currentSheet.getSheetName()); +} +return exhausted; +} + +private boolean isSkip(Row row) { +return row.getRowNum() < firstRow; +} + +private boolean isIterateOverAllSheets() { +boolean iterateAllSheets = desiredSheets.isEmpty(); +if (iterateAllSheets && log) { +logger.info("Advanced to sheet {}", currentSheet.getSheetName()); +} +return iterateAllSheets; +} + +private boolean hasSheet(String name) { +boolean sheetByName = !desiredSheets.isEmpty() +&& desiredSheets.keySet().stream() +.anyMatch(desiredSheet -> desiredSheet.equalsIgnoreCase(name)); +if (sheetByName) { +desiredSheets.put(name, Boolean.TRUE); +} +return sheetByName; +} + +
[GitHub] [nifi] exceptionfactory commented on a diff in pull request #7194: NIFI-11167 - Add Excel Record Reader
exceptionfactory commented on code in PR #7194: URL: https://github.com/apache/nifi/pull/7194#discussion_r1198272978 ## nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/excel/RowIterator.java: ## @@ -0,0 +1,155 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */package org.apache.nifi.excel; + +import com.github.pjfanning.xlsx.StreamingReader; +import org.apache.nifi.logging.ComponentLog; +import org.apache.poi.ss.usermodel.Row; +import org.apache.poi.ss.usermodel.Sheet; +import org.apache.poi.ss.usermodel.Workbook; + +import java.io.Closeable; +import java.io.IOException; +import java.io.InputStream; +import java.util.HashMap; +import java.util.Iterator; +import java.util.List; +import java.util.Map; +import java.util.stream.Collectors; + +public class RowIterator implements Iterator, Closeable { +private final Workbook workbook; +private final Iterator sheets; +private Sheet currentSheet; +private Iterator currentRows; +private final Map desiredSheets; +private final int firstRow; +private ComponentLog logger; +private boolean log; +private Row currentRow; + +public RowIterator(InputStream in, List desiredSheets, int firstRow) { +this(in, desiredSheets, firstRow, null); +} + +public RowIterator(InputStream in, List desiredSheets, int firstRow, ComponentLog logger) { +this.workbook = StreamingReader.builder() +.rowCacheSize(100) +.bufferSize(4096) +.open(in); +this.sheets = this.workbook.iterator(); +this.desiredSheets = desiredSheets != null ? desiredSheets.stream() +.collect(Collectors.toMap(key -> key, value -> Boolean.FALSE)) : new HashMap<>(); +this.firstRow = firstRow; +this.logger = logger; +this.log = logger != null; +} + +@Override +public boolean hasNext() { +setCurrent(); +boolean next = currentRow != null; +if(!next) { +String sheetsNotFound = getSheetsNotFound(desiredSheets); +if (!sheetsNotFound.isEmpty() && log) { +logger.warn("Excel sheet(s) not found: {}", sheetsNotFound); +} +} +return next; +} + +private void setCurrent() { +currentRow = getNextRow(); +if (currentRow != null) { +return; +} + +currentSheet = null; +currentRows = null; +while (sheets.hasNext()) { +currentSheet = sheets.next(); +if (isIterateOverAllSheets() || hasSheet(currentSheet.getSheetName())) { +currentRows = currentSheet.iterator(); +currentRow = getNextRow(); +if (currentRow != null) { +return; +} +} +} +} + +private Row getNextRow() { +while (currentRows != null && !hasExhaustedRows()) { +Row tempCurrentRow = currentRows.next(); +if (!isSkip(tempCurrentRow)) { +return tempCurrentRow; +} +} +return null; +} + +private boolean hasExhaustedRows() { +boolean exhausted = !currentRows.hasNext(); +if (log && exhausted) { +logger.info("Exhausted all rows from sheet {}", currentSheet.getSheetName()); +} +return exhausted; +} + +private boolean isSkip(Row row) { +return row.getRowNum() < firstRow; +} + +private boolean isIterateOverAllSheets() { +boolean iterateAllSheets = desiredSheets.isEmpty(); +if (iterateAllSheets && log) { +logger.info("Advanced to sheet {}", currentSheet.getSheetName()); +} +return iterateAllSheets; +} + +private boolean hasSheet(String name) { +boolean sheetByName = !desiredSheets.isEmpty() +&& desiredSheets.keySet().stream() +.anyMatch(desiredSheet -> desiredSheet.equalsIgnoreCase(name)); +if (sheetByName) { +desiredSheets.put(name, Boolean.TRUE); +} +return sheetByName; +} + +
[GitHub] [nifi] exceptionfactory commented on a diff in pull request #7194: NIFI-11167 - Add Excel Record Reader
exceptionfactory commented on code in PR #7194: URL: https://github.com/apache/nifi/pull/7194#discussion_r1198269634 ## nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/excel/ExcelReader.java: ## @@ -0,0 +1,193 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.nifi.excel; + +import org.apache.commons.lang3.StringUtils; +import org.apache.nifi.annotation.documentation.CapabilityDescription; +import org.apache.nifi.annotation.documentation.Tags; +import org.apache.nifi.annotation.lifecycle.OnEnabled; +import org.apache.nifi.components.AllowableValue; +import org.apache.nifi.components.PropertyDescriptor; +import org.apache.nifi.context.PropertyContext; +import org.apache.nifi.controller.ConfigurationContext; +import org.apache.nifi.expression.ExpressionLanguageScope; +import org.apache.nifi.logging.ComponentLog; +import org.apache.nifi.processor.util.StandardValidators; +import org.apache.nifi.schema.access.SchemaAccessStrategy; +import org.apache.nifi.schema.access.SchemaNotFoundException; +import org.apache.nifi.schema.inference.InferSchemaAccessStrategy; +import org.apache.nifi.schema.inference.RecordSourceFactory; +import org.apache.nifi.schema.inference.SchemaInferenceEngine; +import org.apache.nifi.schema.inference.SchemaInferenceUtil; +import org.apache.nifi.schema.inference.TimeValueInference; +import org.apache.nifi.schemaregistry.services.SchemaRegistry; +import org.apache.nifi.serialization.DateTimeUtils; +import org.apache.nifi.serialization.MalformedRecordException; +import org.apache.nifi.serialization.RecordReader; +import org.apache.nifi.serialization.RecordReaderFactory; +import org.apache.nifi.serialization.SchemaRegistryService; +import org.apache.nifi.serialization.record.RecordSchema; +import org.apache.nifi.stream.io.NonCloseableInputStream; +import org.apache.poi.ss.usermodel.Row; + +import java.io.IOException; +import java.io.InputStream; +import java.util.ArrayList; +import java.util.List; +import java.util.Map; +import java.util.concurrent.atomic.AtomicReferenceArray; +import java.util.stream.IntStream; + +@Tags({"excel", "spreadsheet", "xlsx", "parse", "record", "row", "reader", "values", "cell"}) +@CapabilityDescription("Parses a Microsoft Excel document returning each row in each sheet as a separate record. " ++ "This reader allows for inferring a schema either based on the first line of an Excel sheet if a 'header line' is " ++ "present or from all the desired sheets, or providing an explicit schema " ++ "for interpreting the values. See Controller Service's Usage for further documentation. " ++ "This reader is currently only capable of processing .xlsx " ++ "(XSSF 2007 OOXML file format) Excel documents and not older .xls (HSSF '97(-2007) file format) documents.)") +public class ExcelReader extends SchemaRegistryService implements RecordReaderFactory { + +private static final AllowableValue HEADER_DERIVED = new AllowableValue("excel-header-derived", "Use fields From Header", +"The first chosen row of the Excel sheet is a header row that contains the columns representative of all the rows " + +"in the desired sheets. The schema will be derived by using those columns in the header."); +public static final PropertyDescriptor DESIRED_SHEETS = new PropertyDescriptor +.Builder().name("extract-sheets") +.displayName("Sheets to Extract") +.description("Comma separated list of Excel document sheet names whose rows should be extracted from the excel document. If this property" + +" is left blank then all the rows from all the sheets will be extracted from the Excel document. The list of names is case in-sensitive. Any sheets not" + +" specified in this value will be ignored. A bulletin will be generated if a specified sheet(s) are not found.") +.required(false) + .expressionLanguageSupported(ExpressionLanguageScope.FLOWFILE_ATTRIBUTES) +.addValidator(StandardValidators.NON_EMPTY_VALIDATOR) +
[GitHub] [nifi] exceptionfactory commented on a diff in pull request #7194: NIFI-11167 - Add Excel Record Reader
exceptionfactory commented on code in PR #7194: URL: https://github.com/apache/nifi/pull/7194#discussion_r1198210056 ## nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/excel/ExcelUtils.java: ## @@ -0,0 +1,28 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.nifi.excel; + +import org.apache.poi.ss.usermodel.Row; + +public class ExcelUtils { Review Comment: Thanks, it is fine for now, leaving it as is sounds good. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [nifi] exceptionfactory commented on a diff in pull request #7194: NIFI-11167 - Add Excel Record Reader
exceptionfactory commented on code in PR #7194: URL: https://github.com/apache/nifi/pull/7194#discussion_r1198205592 ## nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/excel/RowIterator.java: ## @@ -0,0 +1,155 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */package org.apache.nifi.excel; + +import com.github.pjfanning.xlsx.StreamingReader; +import org.apache.nifi.logging.ComponentLog; +import org.apache.poi.ss.usermodel.Row; +import org.apache.poi.ss.usermodel.Sheet; +import org.apache.poi.ss.usermodel.Workbook; + +import java.io.Closeable; +import java.io.IOException; +import java.io.InputStream; +import java.util.HashMap; +import java.util.Iterator; +import java.util.List; +import java.util.Map; +import java.util.stream.Collectors; + +public class RowIterator implements Iterator, Closeable { +private final Workbook workbook; +private final Iterator sheets; +private Sheet currentSheet; +private Iterator currentRows; +private final Map desiredSheets; +private final int firstRow; +private ComponentLog logger; +private boolean log; +private Row currentRow; + +public RowIterator(InputStream in, List desiredSheets, int firstRow) { +this(in, desiredSheets, firstRow, null); +} + +public RowIterator(InputStream in, List desiredSheets, int firstRow, ComponentLog logger) { +this.workbook = StreamingReader.builder() +.rowCacheSize(100) +.bufferSize(4096) +.open(in); +this.sheets = this.workbook.iterator(); +this.desiredSheets = desiredSheets != null ? desiredSheets.stream() +.collect(Collectors.toMap(key -> key, value -> Boolean.FALSE)) : new HashMap<>(); +this.firstRow = firstRow; +this.logger = logger; +this.log = logger != null; +} + +@Override +public boolean hasNext() { +setCurrent(); +boolean next = currentRow != null; +if(!next) { +String sheetsNotFound = getSheetsNotFound(desiredSheets); +if (!sheetsNotFound.isEmpty() && log) { +logger.warn("Excel sheet(s) not found: {}", sheetsNotFound); +} +} +return next; +} + +private void setCurrent() { +currentRow = getNextRow(); +if (currentRow != null) { +return; +} + +currentSheet = null; +currentRows = null; +while (sheets.hasNext()) { +currentSheet = sheets.next(); +if (isIterateOverAllSheets() || hasSheet(currentSheet.getSheetName())) { +currentRows = currentSheet.iterator(); +currentRow = getNextRow(); +if (currentRow != null) { +return; +} +} +} +} + +private Row getNextRow() { +while (currentRows != null && !hasExhaustedRows()) { +Row tempCurrentRow = currentRows.next(); +if (!isSkip(tempCurrentRow)) { +return tempCurrentRow; +} +} +return null; +} + +private boolean hasExhaustedRows() { +boolean exhausted = !currentRows.hasNext(); +if (log && exhausted) { +logger.info("Exhausted all rows from sheet {}", currentSheet.getSheetName()); +} +return exhausted; +} + +private boolean isSkip(Row row) { +return row.getRowNum() < firstRow; +} + +private boolean isIterateOverAllSheets() { +boolean iterateAllSheets = desiredSheets.isEmpty(); +if (iterateAllSheets && log) { +logger.info("Advanced to sheet {}", currentSheet.getSheetName()); +} +return iterateAllSheets; +} + +private boolean hasSheet(String name) { +boolean sheetByName = !desiredSheets.isEmpty() +&& desiredSheets.keySet().stream() +.anyMatch(desiredSheet -> desiredSheet.equalsIgnoreCase(name)); +if (sheetByName) { +desiredSheets.put(name, Boolean.TRUE); +} +return sheetByName; +} + +
[GitHub] [nifi] exceptionfactory commented on a diff in pull request #7194: NIFI-11167 - Add Excel Record Reader
exceptionfactory commented on code in PR #7194: URL: https://github.com/apache/nifi/pull/7194#discussion_r1198205061 ## nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/excel/RowIterator.java: ## @@ -0,0 +1,155 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */package org.apache.nifi.excel; + +import com.github.pjfanning.xlsx.StreamingReader; +import org.apache.nifi.logging.ComponentLog; +import org.apache.poi.ss.usermodel.Row; +import org.apache.poi.ss.usermodel.Sheet; +import org.apache.poi.ss.usermodel.Workbook; + +import java.io.Closeable; +import java.io.IOException; +import java.io.InputStream; +import java.util.HashMap; +import java.util.Iterator; +import java.util.List; +import java.util.Map; +import java.util.stream.Collectors; + +public class RowIterator implements Iterator, Closeable { +private final Workbook workbook; +private final Iterator sheets; +private Sheet currentSheet; +private Iterator currentRows; +private final Map desiredSheets; +private final int firstRow; +private ComponentLog logger; +private boolean log; +private Row currentRow; + +public RowIterator(InputStream in, List desiredSheets, int firstRow) { +this(in, desiredSheets, firstRow, null); +} + +public RowIterator(InputStream in, List desiredSheets, int firstRow, ComponentLog logger) { +this.workbook = StreamingReader.builder() +.rowCacheSize(100) +.bufferSize(4096) +.open(in); +this.sheets = this.workbook.iterator(); +this.desiredSheets = desiredSheets != null ? desiredSheets.stream() +.collect(Collectors.toMap(key -> key, value -> Boolean.FALSE)) : new HashMap<>(); +this.firstRow = firstRow; +this.logger = logger; +this.log = logger != null; +} + +@Override +public boolean hasNext() { +setCurrent(); +boolean next = currentRow != null; +if(!next) { +String sheetsNotFound = getSheetsNotFound(desiredSheets); +if (!sheetsNotFound.isEmpty() && log) { +logger.warn("Excel sheet(s) not found: {}", sheetsNotFound); +} Review Comment: The problem with the warning is that it is not actionable in terms of flow handling. Using the `record.count` attribute provides the opportunity to warn if a flow configuration expects to see records in all cases. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [nifi] exceptionfactory commented on a diff in pull request #7194: NIFI-11167 - Add Excel Record Reader
exceptionfactory commented on code in PR #7194: URL: https://github.com/apache/nifi/pull/7194#discussion_r1195809425 ## nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/excel/ExcelRecordReader.java: ## @@ -0,0 +1,221 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.nifi.excel; + +import org.apache.nifi.logging.ComponentLog; +import org.apache.nifi.serialization.MalformedRecordException; +import org.apache.nifi.serialization.RecordReader; +import org.apache.nifi.serialization.record.DataType; +import org.apache.nifi.serialization.record.MapRecord; +import org.apache.nifi.serialization.record.Record; +import org.apache.nifi.serialization.record.RecordField; +import org.apache.nifi.serialization.record.RecordSchema; +import org.apache.nifi.serialization.record.util.DataTypeUtils; +import org.apache.poi.ss.usermodel.Cell; +import org.apache.poi.ss.usermodel.DateUtil; +import org.apache.poi.ss.usermodel.Row; + +import java.io.IOException; +import java.text.DateFormat; +import java.util.ArrayList; +import java.util.LinkedHashMap; +import java.util.List; +import java.util.Map; +import java.util.function.Supplier; +import java.util.stream.IntStream; + +import static org.apache.commons.lang3.StringUtils.isEmpty; + +public class ExcelRecordReader implements RecordReader { +private final RowIterator rowIterator; +private ComponentLog logger; +private final RecordSchema schema; +private final List desiredSheets; +private final Supplier LAZY_DATE_FORMAT; +private final Supplier LAZY_TIME_FORMAT; +private final Supplier LAZY_TIMESTAMP_FORMAT; +private final String dateFormat; +private final String timeFormat; +private final String timestampFormat; + +public ExcelRecordReader(ExcelRecordReaderArgs args) throws MalformedRecordException { +this.logger = args.getLogger(); +this.schema = args.getSchema(); +this.desiredSheets = new ArrayList<>(); +if (args.getDesiredSheets() != null && args.getDesiredSheets().length() > 0) { +IntStream.range(0, args.getDesiredSheets().length()) +.forEach(index -> this.desiredSheets.add(args.getDesiredSheets().get(index))); +} + +if (isEmpty(args.getDateFormat())) { +this.dateFormat = null; +LAZY_DATE_FORMAT = null; +} else { +this.dateFormat = args.getDateFormat(); +LAZY_DATE_FORMAT = () -> DataTypeUtils.getDateFormat(dateFormat); +} + +if (isEmpty(args.getTimeFormat())) { +this.timeFormat = null; +LAZY_TIME_FORMAT = null; +} else { +this.timeFormat = args.getTimeFormat(); +LAZY_TIME_FORMAT = () -> DataTypeUtils.getDateFormat(timeFormat); +} + +if (isEmpty(args.getTimestampFormat())) { +this.timestampFormat = null; +LAZY_TIMESTAMP_FORMAT = null; +} else { +this.timestampFormat = args.getTimestampFormat(); +LAZY_TIMESTAMP_FORMAT = () -> DataTypeUtils.getDateFormat(timestampFormat); +} + +try { +this.rowIterator = new RowIterator(args.getInputStream(), desiredSheets, args.getFirstRow(), logger); +} catch (RuntimeException e) { +String msg = "Error occurred while processing record file"; +logger.error(msg, e); +throw new MalformedRecordException(msg, e); +} +} + +@Override +public Record nextRecord(boolean coerceTypes, boolean dropUnknownFields) throws IOException, MalformedRecordException { +try { +if (rowIterator.hasNext()) { +Row currentRow = rowIterator.next(); +Map currentRowValues = getCurrentRowValues(currentRow, coerceTypes, dropUnknownFields); +return new MapRecord(schema, currentRowValues); +} +} catch (Exception e) { +throw new MalformedRecordException("Error while getting next record", e); +} +return null; +} + +private Map getCurrentRowValues(Row
[GitHub] [nifi] exceptionfactory commented on a diff in pull request #7194: NIFI-11167 - Add Excel Record Reader
exceptionfactory commented on code in PR #7194: URL: https://github.com/apache/nifi/pull/7194#discussion_r1195712821 ## nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/excel/RowIterator.java: ## @@ -0,0 +1,155 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */package org.apache.nifi.excel; + +import com.github.pjfanning.xlsx.StreamingReader; +import org.apache.nifi.logging.ComponentLog; +import org.apache.poi.ss.usermodel.Row; +import org.apache.poi.ss.usermodel.Sheet; +import org.apache.poi.ss.usermodel.Workbook; + +import java.io.Closeable; +import java.io.IOException; +import java.io.InputStream; +import java.util.HashMap; +import java.util.Iterator; +import java.util.List; +import java.util.Map; +import java.util.stream.Collectors; + +public class RowIterator implements Iterator, Closeable { +private final Workbook workbook; +private final Iterator sheets; +private Sheet currentSheet; +private Iterator currentRows; +private final Map desiredSheets; +private final int firstRow; +private ComponentLog logger; +private boolean log; +private Row currentRow; + +public RowIterator(InputStream in, List desiredSheets, int firstRow) { +this(in, desiredSheets, firstRow, null); +} + +public RowIterator(InputStream in, List desiredSheets, int firstRow, ComponentLog logger) { +this.workbook = StreamingReader.builder() +.rowCacheSize(100) +.bufferSize(4096) +.open(in); +this.sheets = this.workbook.iterator(); +this.desiredSheets = desiredSheets != null ? desiredSheets.stream() +.collect(Collectors.toMap(key -> key, value -> Boolean.FALSE)) : new HashMap<>(); +this.firstRow = firstRow; +this.logger = logger; +this.log = logger != null; +} + +@Override +public boolean hasNext() { +setCurrent(); +boolean next = currentRow != null; +if(!next) { +String sheetsNotFound = getSheetsNotFound(desiredSheets); +if (!sheetsNotFound.isEmpty() && log) { +logger.warn("Excel sheet(s) not found: {}", sheetsNotFound); +} Review Comment: Some of the other Record Readers add a record.count attribute, which is useful for tracking. That's probably a better option than logging a warning. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@nifi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [nifi] exceptionfactory commented on a diff in pull request #7194: NIFI-11167 - Add Excel Record Reader
exceptionfactory commented on code in PR #7194: URL: https://github.com/apache/nifi/pull/7194#discussion_r1195705310 ## nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/excel/ExcelRecordReader.java: ## @@ -0,0 +1,221 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.nifi.excel; + +import org.apache.nifi.logging.ComponentLog; +import org.apache.nifi.serialization.MalformedRecordException; +import org.apache.nifi.serialization.RecordReader; +import org.apache.nifi.serialization.record.DataType; +import org.apache.nifi.serialization.record.MapRecord; +import org.apache.nifi.serialization.record.Record; +import org.apache.nifi.serialization.record.RecordField; +import org.apache.nifi.serialization.record.RecordSchema; +import org.apache.nifi.serialization.record.util.DataTypeUtils; +import org.apache.poi.ss.usermodel.Cell; +import org.apache.poi.ss.usermodel.DateUtil; +import org.apache.poi.ss.usermodel.Row; + +import java.io.IOException; +import java.text.DateFormat; +import java.util.ArrayList; +import java.util.LinkedHashMap; +import java.util.List; +import java.util.Map; +import java.util.function.Supplier; +import java.util.stream.IntStream; + +import static org.apache.commons.lang3.StringUtils.isEmpty; + +public class ExcelRecordReader implements RecordReader { +private final RowIterator rowIterator; +private ComponentLog logger; +private final RecordSchema schema; +private final List desiredSheets; +private final Supplier LAZY_DATE_FORMAT; +private final Supplier LAZY_TIME_FORMAT; +private final Supplier LAZY_TIMESTAMP_FORMAT; +private final String dateFormat; +private final String timeFormat; +private final String timestampFormat; + +public ExcelRecordReader(ExcelRecordReaderArgs args) throws MalformedRecordException { +this.logger = args.getLogger(); +this.schema = args.getSchema(); +this.desiredSheets = new ArrayList<>(); +if (args.getDesiredSheets() != null && args.getDesiredSheets().length() > 0) { +IntStream.range(0, args.getDesiredSheets().length()) +.forEach(index -> this.desiredSheets.add(args.getDesiredSheets().get(index))); +} + +if (isEmpty(args.getDateFormat())) { +this.dateFormat = null; +LAZY_DATE_FORMAT = null; +} else { +this.dateFormat = args.getDateFormat(); +LAZY_DATE_FORMAT = () -> DataTypeUtils.getDateFormat(dateFormat); +} + +if (isEmpty(args.getTimeFormat())) { +this.timeFormat = null; +LAZY_TIME_FORMAT = null; +} else { +this.timeFormat = args.getTimeFormat(); +LAZY_TIME_FORMAT = () -> DataTypeUtils.getDateFormat(timeFormat); +} + +if (isEmpty(args.getTimestampFormat())) { +this.timestampFormat = null; +LAZY_TIMESTAMP_FORMAT = null; +} else { +this.timestampFormat = args.getTimestampFormat(); +LAZY_TIMESTAMP_FORMAT = () -> DataTypeUtils.getDateFormat(timestampFormat); +} + +try { +this.rowIterator = new RowIterator(args.getInputStream(), desiredSheets, args.getFirstRow(), logger); +} catch (RuntimeException e) { +String msg = "Error occurred while processing record file"; +logger.error(msg, e); +throw new MalformedRecordException(msg, e); +} +} + +@Override +public Record nextRecord(boolean coerceTypes, boolean dropUnknownFields) throws IOException, MalformedRecordException { +try { +if (rowIterator.hasNext()) { +Row currentRow = rowIterator.next(); +Map currentRowValues = getCurrentRowValues(currentRow, coerceTypes, dropUnknownFields); +return new MapRecord(schema, currentRowValues); +} +} catch (Exception e) { +throw new MalformedRecordException("Error while getting next record", e); +} +return null; +} + +private Map getCurrentRowValues(Row
[GitHub] [nifi] exceptionfactory commented on a diff in pull request #7194: NIFI-11167 - Add Excel Record Reader
exceptionfactory commented on code in PR #7194: URL: https://github.com/apache/nifi/pull/7194#discussion_r1195649993 ## nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/excel/ExcelRecordReader.java: ## @@ -0,0 +1,221 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.nifi.excel; + +import org.apache.nifi.logging.ComponentLog; +import org.apache.nifi.serialization.MalformedRecordException; +import org.apache.nifi.serialization.RecordReader; +import org.apache.nifi.serialization.record.DataType; +import org.apache.nifi.serialization.record.MapRecord; +import org.apache.nifi.serialization.record.Record; +import org.apache.nifi.serialization.record.RecordField; +import org.apache.nifi.serialization.record.RecordSchema; +import org.apache.nifi.serialization.record.util.DataTypeUtils; +import org.apache.poi.ss.usermodel.Cell; +import org.apache.poi.ss.usermodel.DateUtil; +import org.apache.poi.ss.usermodel.Row; + +import java.io.IOException; +import java.text.DateFormat; +import java.util.ArrayList; +import java.util.LinkedHashMap; +import java.util.List; +import java.util.Map; +import java.util.function.Supplier; +import java.util.stream.IntStream; + +import static org.apache.commons.lang3.StringUtils.isEmpty; + +public class ExcelRecordReader implements RecordReader { +private final RowIterator rowIterator; +private ComponentLog logger; +private final RecordSchema schema; +private final List desiredSheets; +private final Supplier LAZY_DATE_FORMAT; +private final Supplier LAZY_TIME_FORMAT; +private final Supplier LAZY_TIMESTAMP_FORMAT; +private final String dateFormat; +private final String timeFormat; +private final String timestampFormat; + +public ExcelRecordReader(ExcelRecordReaderArgs args) throws MalformedRecordException { +this.logger = args.getLogger(); +this.schema = args.getSchema(); +this.desiredSheets = new ArrayList<>(); +if (args.getDesiredSheets() != null && args.getDesiredSheets().length() > 0) { +IntStream.range(0, args.getDesiredSheets().length()) +.forEach(index -> this.desiredSheets.add(args.getDesiredSheets().get(index))); +} + +if (isEmpty(args.getDateFormat())) { +this.dateFormat = null; +LAZY_DATE_FORMAT = null; +} else { +this.dateFormat = args.getDateFormat(); +LAZY_DATE_FORMAT = () -> DataTypeUtils.getDateFormat(dateFormat); +} + +if (isEmpty(args.getTimeFormat())) { +this.timeFormat = null; +LAZY_TIME_FORMAT = null; +} else { +this.timeFormat = args.getTimeFormat(); +LAZY_TIME_FORMAT = () -> DataTypeUtils.getDateFormat(timeFormat); +} + +if (isEmpty(args.getTimestampFormat())) { +this.timestampFormat = null; +LAZY_TIMESTAMP_FORMAT = null; +} else { +this.timestampFormat = args.getTimestampFormat(); +LAZY_TIMESTAMP_FORMAT = () -> DataTypeUtils.getDateFormat(timestampFormat); +} + +try { +this.rowIterator = new RowIterator(args.getInputStream(), desiredSheets, args.getFirstRow(), logger); +} catch (RuntimeException e) { +String msg = "Error occurred while processing record file"; +logger.error(msg, e); +throw new MalformedRecordException(msg, e); +} +} + +@Override +public Record nextRecord(boolean coerceTypes, boolean dropUnknownFields) throws IOException, MalformedRecordException { +try { +if (rowIterator.hasNext()) { +Row currentRow = rowIterator.next(); +Map currentRowValues = getCurrentRowValues(currentRow, coerceTypes, dropUnknownFields); +return new MapRecord(schema, currentRowValues); +} +} catch (Exception e) { +throw new MalformedRecordException("Error while getting next record", e); +} +return null; +} + +private Map getCurrentRowValues(Row
[GitHub] [nifi] exceptionfactory commented on a diff in pull request #7194: NIFI-11167 - Add Excel Record Reader
exceptionfactory commented on code in PR #7194: URL: https://github.com/apache/nifi/pull/7194#discussion_r1194517265 ## nifi-nar-bundles/nifi-standard-services/nifi-record-serialization-services-bundle/nifi-record-serialization-services/src/main/java/org/apache/nifi/excel/ExcelReader.java: ## @@ -0,0 +1,193 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.nifi.excel; + +import org.apache.commons.lang3.StringUtils; +import org.apache.nifi.annotation.documentation.CapabilityDescription; +import org.apache.nifi.annotation.documentation.Tags; +import org.apache.nifi.annotation.lifecycle.OnEnabled; +import org.apache.nifi.components.AllowableValue; +import org.apache.nifi.components.PropertyDescriptor; +import org.apache.nifi.context.PropertyContext; +import org.apache.nifi.controller.ConfigurationContext; +import org.apache.nifi.expression.ExpressionLanguageScope; +import org.apache.nifi.logging.ComponentLog; +import org.apache.nifi.processor.util.StandardValidators; +import org.apache.nifi.schema.access.SchemaAccessStrategy; +import org.apache.nifi.schema.access.SchemaNotFoundException; +import org.apache.nifi.schema.inference.InferSchemaAccessStrategy; +import org.apache.nifi.schema.inference.RecordSourceFactory; +import org.apache.nifi.schema.inference.SchemaInferenceEngine; +import org.apache.nifi.schema.inference.SchemaInferenceUtil; +import org.apache.nifi.schema.inference.TimeValueInference; +import org.apache.nifi.schemaregistry.services.SchemaRegistry; +import org.apache.nifi.serialization.DateTimeUtils; +import org.apache.nifi.serialization.MalformedRecordException; +import org.apache.nifi.serialization.RecordReader; +import org.apache.nifi.serialization.RecordReaderFactory; +import org.apache.nifi.serialization.SchemaRegistryService; +import org.apache.nifi.serialization.record.RecordSchema; +import org.apache.nifi.stream.io.NonCloseableInputStream; +import org.apache.poi.ss.usermodel.Row; + +import java.io.IOException; +import java.io.InputStream; +import java.util.ArrayList; +import java.util.List; +import java.util.Map; +import java.util.concurrent.atomic.AtomicReferenceArray; +import java.util.stream.IntStream; + +@Tags({"excel", "spreadsheet", "xlsx", "parse", "record", "row", "reader", "values", "cell"}) +@CapabilityDescription("Parses a Microsoft Excel document returning each row in each sheet as a separate record. " ++ "This reader allows for inferring a schema either based on the first line of an Excel sheet if a 'header line' is " ++ "present or from all the desired sheets, or providing an explicit schema " ++ "for interpreting the values. See Controller Service's Usage for further documentation. " ++ "This reader is currently only capable of processing .xlsx " ++ "(XSSF 2007 OOXML file format) Excel documents and not older .xls (HSSF '97(-2007) file format) documents.)") +public class ExcelReader extends SchemaRegistryService implements RecordReaderFactory { + +private static final AllowableValue HEADER_DERIVED = new AllowableValue("excel-header-derived", "Use fields From Header", +"The first chosen row of the Excel sheet is a header row that contains the columns representative of all the rows " + +"in the desired sheets. The schema will be derived by using those columns in the header."); +public static final PropertyDescriptor DESIRED_SHEETS = new PropertyDescriptor +.Builder().name("extract-sheets") +.displayName("Sheets to Extract") +.description("Comma separated list of Excel document sheet names whose rows should be extracted from the excel document. If this property" + +" is left blank then all the rows from all the sheets will be extracted from the Excel document. The list of names is case in-sensitive. Any sheets not" + +" specified in this value will be ignored. A bulletin will be generated if a specified sheet(s) are not found.") +.required(false) + .expressionLanguageSupported(ExpressionLanguageScope.FLOWFILE_ATTRIBUTES) +.addValidator(StandardValidators.NON_EMPTY_VALIDATOR) +