[GitHub] nifi issue #929: NIFI-2613 - Apache POI processor to convert Excel documents...
Github user jvwing commented on the issue: https://github.com/apache/nifi/pull/929 Thanks again for putting this together, @jdye64. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] nifi issue #929: NIFI-2613 - Apache POI processor to convert Excel documents...
Github user jvwing commented on the issue: https://github.com/apache/nifi/pull/929 @jdye64, it looks good! I was just about to merge, giving the files a last once-over before I push the button... and I noticed we have a json.org dependency in the nifi-poi-processors POM: org.json json 20160810 Are we using that? [json.org is Category X these days](https://issues.apache.org/jira/browse/NIFI-2991), so we can't go with it. But it doesn't appear to be used, and when I deleted it I was able to build, test, and run the processor without obvious issues. Do you know of any reason I can't just pull that out? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] nifi issue #929: NIFI-2613 - Apache POI processor to convert Excel documents...
Github user jvwing commented on the issue: https://github.com/apache/nifi/pull/929 Thanks for those improvements, @jdye64, I especially like the updated usage doc. Two things on the latest code: 1. Did you try an .xls file? There is a problem when the flowfile attribute is added in the catch block on line ~195. The NiFi framework throws an exception of it's own, because we can't do `session.putAttribute` inside an InputStreamCallback for the same flowfile: > java.lang.IllegalStateException: StandardFlowFileRecord[uuid=be192381-9475-4c6d-a6ca-43735e5df271,claim=StandardContentClaim [resourceClaim=StandardResourceClaim[id=1489713904793-1, container=default, section=1], offset=0, length=26112],offset=0,name=./conf/test-xls.xls,size=26112] already in use for an active callback or InputStream created by ProcessSession.read(FlowFile) has not been closed Something similar happens with the session.putAttribute on ~209. As a result of these exceptions, the session is rolled back and the flowfile is returned to the input queue. I think we can throw an exception, though. So if we caught and rethrew with a different error message, it should work out. 2. In the failure case, we're routing the flowfile to both 'failure' and 'original'. I didn't realize it earlier, but I now believe this to be unusual in NiFi. Most processors treat failure as an exclusive route, and 'original' as part of the successful happy path. SplitAvro, SplitJson, SplitText, and UnpackContent were some examples I looked at. I doubt that's written in stone. What do you think? I made a [sample code fork](https://github.com/jvwing/nifi/commit/2ccf5dec2dcd707c5963716dfb3fbf7813c460ea) with a unit test for .xls and a suggested approach to solving the IllegalStateExceptions, and the failure routing. I did not get the logging to cooperate the way I think it should, but we're not too far off. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] nifi issue #929: NIFI-2613 - Apache POI processor to convert Excel documents...
Github user jvwing commented on the issue: https://github.com/apache/nifi/pull/929 @jdye64 Thanks, I looked into some of these error cases: * Unsupported .xls - Throws exception, routes to 'original'. The error bulletin seems a bit terse: > ConvertExcelToCSVProcessor[id=43426b41-015a-1000-b06e-3a9be79162d1] Package should contain a content type part [M1.13] * Blank Sheet - Succeeds, one empty flowfile to 'success' and the original to 'original'. No Errors. Seems correct to me. * Empty Workbook - I wasn't able to create an empty workbook manually in Excel. * Unmatched Sheet Names - Covered by testNonExistantSpecifiedSheetName, seems correct to me. * Diverse Content - I tried a number of bizarre things Excel lets you put in spreadsheets -- images, tables, pivot tables, formulas, hidden sheets, etc. Where appropriate, the processor returned content (cells in tables and pivot tables, last computed formula value), but did not run into errors. Images were ignored. * CSV-Breaking Content - The processor does not escape text to enforce a CSV structure. Sheets containing multiline text in a cell, cells with commas, etc. resulted in improperly formed CSV files. I do not object, I understand that's a significant scope increase. I recommend that you do the following: 1. Clearly document that the processor only supports .xlsx and NOT .xls files. I know you've already answered questions about this on the dev list, so somebody's going to try it. 2. For the .xls case, would it be possible to catch the `org.apache.poi.openxml4j.exceptions.InvalidFormatException`, repackage that with a helpful error message suggesting that only well-formed XLSX files are accepted, and route to failure? 3. Document that the processor does not escape invalid CSV content. 4. Logging changes from the code review post above. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] nifi issue #929: NIFI-2613 - Apache POI processor to convert Excel documents...
Github user jdye64 commented on the issue: https://github.com/apache/nifi/pull/929 @jvwing In regards to error cases here are a list of things which could make this fail. - Input data is incorrect. AKA not .xlsx file - Empty document. I guess this isn't an error but I think you get what I'm saying there just wouldn't be anything to do. - User could specify a list of "desired-sheets" that are not present in the document. Once again not really failure but of course no success relationships would be triggered on the original relationship. - The processor will currently only process strings (this includes numerical values, just the POI naming convention) and will ignore things like graphs, images, binary data, etc. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] nifi issue #929: NIFI-2613 - Apache POI processor to convert Excel documents...
Github user jvwing commented on the issue: https://github.com/apache/nifi/pull/929 Thanks, @jdye64, that got it. I was able to build just fine with contrib-check and everything. On to new topics: * **License/Notice** - We need a separate LICENSE and NOTICE files for nifi-poi-nar. I was reminded by the recent LICENSE/NOTICE discussion on the mailing list about what to include in NAR bundles, and it applies here. Good news, you've already put together most of the text for the nifi-assembly LICENSE/NOTICE, and there are many examples of the file layout in other NARs. I think a reasonably comparable NAR referencing another Apache project is [nifi-ignite-nar](https://github.com/apache/nifi/tree/master/nifi-nar-bundles/nifi-ignite-bundle/nifi-ignite-nar/src/main/resources/META-INF). * **Sheet in Filename** - Output filenames all have the UNKNOWN sheet name, like `35200238025098_UNKNOWN.csv`. How about adding a sheet name check to one of the unit tests? * **Error Cases** - What will make this fail? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] nifi issue #929: NIFI-2613 - Apache POI processor to convert Excel documents...
Github user jvwing commented on the issue: https://github.com/apache/nifi/pull/929 @jdye64, I am still having trouble with the root `pom.xml` file. The version number is still `1.0.0-SNAPSHOT` [on line 1006](https://github.com/apache/nifi/pull/929/files/474a08eda8497d47468e511e3642766995dad62a#diff-600376dffeb79835ede4a0b285078036). The commit you referenced above did correct version numbers in the nifi-poi-bundle project, but not the root NiFi project POM. How are you testing the build and the build output? I have experienced two scenarios: 1.) **Clean System** - Build fails at nifi-assembly, because the root pom.xml specifies nifi-poi-nar 1.0.0-SNAPSHOT, which is not in the local Maven repository: > [ERROR] Failed to execute goal on project nifi-assembly: Could not resolve dependencies for project org.apache.nifi:nifi-assembly:pom:1.2.0-SNAPSHOT: Could not find artifact org.apache.nifi:nifi-poi-nar:nar:1.0.0-SNAPSHOT in apache.snapshots (http://repository.apache.org/snapshots) 2.) **Reused System** - Build succeeds because the older nifi-poi-nar-1.0.0-SNAPSHOT.nar still exists in the local Maven repository, but the output assembly files contain nifi-poi-nar-1.0.0-SNAPSHOT.nar. In neither case do I get the lastest NAR output. Would you please try deleting the contents of `~/.m2/repository/org/apache/nifi/nifi-poi*` and running a full `mvn clean install`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] nifi issue #929: NIFI-2613 - Apache POI processor to convert Excel documents...
Github user jvwing commented on the issue: https://github.com/apache/nifi/pull/929 @jdye64 Thanks for the update. I'll look into why I'm seeing the POM issue. Don't worry about squashing, I'll take care of that when we're ready to merge. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] nifi issue #929: NIFI-2613 - Apache POI processor to convert Excel documents...
Github user jdye64 commented on the issue: https://github.com/apache/nifi/pull/929 @jvwing the ```ConvertAvroToParquet``` had slipped my mind thanks for the reminder there. The POM references had been fixed in a previous fix at https://github.com/apache/nifi/pull/929/commits/474a08eda8497d47468e511e3642766995dad62a#diff-e7171c7f07ed58f51e4643dc340432f8R22 I'm unsure what your preference for reviewing is but let me know if this is getting to the point where you would like for me to squash the commits or not. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] nifi issue #929: NIFI-2613 - Apache POI processor to convert Excel documents...
Github user jvwing commented on the issue: https://github.com/apache/nifi/pull/929 Thanks, @jdye64, this is clearly a big improvement. I will continue reviewing the functionality. Some of the issues identified above are still in the latest commit, and should still be addressed: * Root POM references nifi-poi-bundle version `1.0-SNAPSHOT` * Entry for `ConvertAvroToParquet` in nifi-kite-bundle * Incomplete comment at ConvertExcelToCSVProcessorTest.java:91 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] nifi issue #929: NIFI-2613 - Apache POI processor to convert Excel documents...
Github user jdye64 commented on the issue: https://github.com/apache/nifi/pull/929 @jvwing I indeed had used a much earlier version of the processor. I couldn't find the other version so just ended up doing a rewrite for the most part. Thanks for hanging with me on this. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] nifi issue #929: NIFI-2613 - Apache POI processor to convert Excel documents...
Github user jdye64 commented on the issue: https://github.com/apache/nifi/pull/929 @jvwing thank you for pointing this out. Let me look it over right now and I'll make another commit shortly. I had this in another project and I think during the copy/paste I have made an error. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] nifi issue #929: NIFI-2613 - Apache POI processor to convert Excel documents...
Github user jvwing commented on the issue: https://github.com/apache/nifi/pull/929 @jdye64 I'm still confused by the disparity between the annotation docs and the behavior of the processor, especially with respect to multiple sheets in the incoming spreadsheet. From the annotations: > Each sheet from the incoming Excel document will generate a new Flowfile that will be output from this processor This was not my experience, and I believe the single call to `session.clone` on line 143, not performed in a loop, guarantees only a single flowfile to the success relationship. > @WritesAttribute(attribute="sheetname", description="The name of the Excel sheet that this particular row of data came from in the Excel document") But it has always been `UNKNOWN` for my tests. --- I created a unit test to help clarify how this is supposed to work. From the docs, I believe the test below should pass, although it currently fails. Am I misunderstanding? Also see [branch](https://github.com/jvwing/nifi/tree/NIFI-2613-poi-excel-3-tests) [TwoSheets.xlsx](https://github.com/apache/nifi/files/778759/TwoSheets.xlsx) @Test public void testMultipleSheetsGeneratesMultipleFlowFiles() throws Exception { testRunner.enqueue(new File("src/test/resources/TwoSheets.xlsx").toPath()); testRunner.run(); testRunner.assertTransferCount(ConvertExcelToCSVProcessor.SUCCESS, 2); testRunner.assertTransferCount(ConvertExcelToCSVProcessor.ORIGINAL, 1); testRunner.assertTransferCount(ConvertExcelToCSVProcessor.FAILURE, 0); MockFlowFile ffSheetA = testRunner.getFlowFilesForRelationship(ConvertExcelToCSVProcessor.SUCCESS).get(0); Long rowsSheetA = new Long(ffSheetA.getAttribute(ConvertExcelToCSVProcessor.ROW_NUM)); assertTrue(rowsSheetA == 4l); assertTrue(ffSheetA.getAttribute(ConvertExcelToCSVProcessor.SHEET_NAME).equalsIgnoreCase("TestSheetA")); MockFlowFile ffSheetB = testRunner.getFlowFilesForRelationship(ConvertExcelToCSVProcessor.SUCCESS).get(1); Long rowsSheetB = new Long(ffSheetB.getAttribute(ConvertExcelToCSVProcessor.ROW_NUM)); assertTrue(rowsSheetB == 3l); assertTrue(ffSheetB.getAttribute(ConvertExcelToCSVProcessor.SHEET_NAME).equalsIgnoreCase("TestSheetB")); } --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] nifi issue #929: NIFI-2613 - Apache POI processor to convert Excel documents...
Github user jvwing commented on the issue: https://github.com/apache/nifi/pull/929 @joewitt Yes, thanks for getting us moving, I'll continue reviewing. @jdye64 Thanks for the updates, I'll take another look shortly. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] nifi issue #929: NIFI-2613 - Apache POI processor to convert Excel documents...
Github user joewitt commented on the issue: https://github.com/apache/nifi/pull/929 @jvwing you ok continuing to run with this one for the review? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] nifi issue #929: NIFI-2613 - Apache POI processor to convert Excel documents...
Github user jdye64 commented on the issue: https://github.com/apache/nifi/pull/929 @jvwing wow sorry its been a long time for my response. I've tried to go through and make all of the cleanup suggestions you have made. Yes, the intention is a single output per sheet in the excel file. Keep in mind that if the sheet is not .xlsx and rather .xls format you would see the behavior you are experiencing. Can you attach the excel doc you were testing with? I made the change to case of the attribute Also adjusted the error handling to prevent the end user from being pummeled with a full stack trace. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] nifi issue #929: NIFI-2613 - Apache POI processor to convert Excel documents...
Github user joewitt commented on the issue: https://github.com/apache/nifi/pull/929 @jdye64 will you have time to address @jvwing findings? Would like to get this across the line. We're trying to get the list of stale PRs burned down. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] nifi issue #929: NIFI-2613 - Apache POI processor to convert Excel documents...
Github user jvwing commented on the issue: https://github.com/apache/nifi/pull/929 I have a few more comments from running the processor and reviewing the code: **Processor Annotations** * The CapabilityDescription says this processor transfers one flowfile for each sheet in the Excel document, but that was not my experience. It aggregated all content into one output flowfile. Is that the intended behavior? * Most NiFi attribute names are lowercase, "sheetname" or "sheet.name" instead of "SheetName". I don't believe there is a hard and fast rule there, but I recommend going lowercase if you do not have a strong preference. **Properties** * PropertyDescriptors should have `name` as a machine readable key like "extract-sheets" and `displayName` for the user "Sheets to Extract". **Exception Handling and Logging** Most NiFi components use `getLogger().error(...` as you do later in the code, rather than `printStackTrace()`. I haven't tested these error cases, but they appear to be informational rather than stopping or failing the flow. Would `getLogger().info(...` or `getLogger().debug(...` be a better fit here? Do users need the full stack trace, or would just the message be helpful enough? ``` } catch (InvalidFormatException e) { e.printStackTrace(); } catch (OpenXML4JException e) { e.printStackTrace(); } catch (SAXException e) { e.printStackTrace(); } ``` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] nifi issue #929: NIFI-2613 - Apache POI processor to convert Excel documents...
Github user jvwing commented on the issue: https://github.com/apache/nifi/pull/929 @jdye64 I still had trouble building and running NiFi with this PR. * root POM still references nifi-poi-bundle version "1.0-SNAPSHOT" * nifi-poi-nar still has a dependency on nifi-poi-processors version "1.0-SNAPSHOT" * NiFi fails to start processing the entry for org.apache.nifi.processors.kite.ConvertAvroToParquet in nifi-nar-bundles/nifi-kite-bundle/nifi-kite-processors/src/main/resources/META-INF/services/org.apache.nifi.processor.Processor (unintentionally included in commit?) Would you please try running `mvn clean install -Pcontrib-check`, verifying that the latest nifi-poi-nar is included in the lib output, and that NiFi starts OK? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] nifi issue #929: NIFI-2613 - Apache POI processor to convert Excel documents...
Github user jvwing commented on the issue: https://github.com/apache/nifi/pull/929 @jdye64 I have a couple of requests from a quick scan. Would it be possible to... * Rebase on the latest master? * Update the POM versions, using a consistent "1.1.0-SNAPSHOT" format (looks like there is a mix of "1.0-SNAPSHOT" and "1.0.0-SNAPSHOT")? * Reference the nifi-poi-nar in the nifi-assembly dependency artifacts? * Include Poi in the nifi-assembly NOTICE file? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] nifi issue #929: NIFI-2613 - Apache POI processor to convert Excel documents...
Github user jvwing commented on the issue: https://github.com/apache/nifi/pull/929 Reviewing --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---