[GitHub] nifi issue #929: NIFI-2613 - Apache POI processor to convert Excel documents...

2017-03-17 Thread jvwing
Github user jvwing commented on the issue:

https://github.com/apache/nifi/pull/929
  
Thanks again for putting this together, @jdye64.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] nifi issue #929: NIFI-2613 - Apache POI processor to convert Excel documents...

2017-03-17 Thread jvwing
Github user jvwing commented on the issue:

https://github.com/apache/nifi/pull/929
  
@jdye64, it looks good!  I was just about to merge, giving the files a last 
once-over before I push the button... and I noticed we have a json.org 
dependency in the nifi-poi-processors POM:


org.json
json
20160810


Are we using that?  [json.org is Category X these 
days](https://issues.apache.org/jira/browse/NIFI-2991), so we can't go with it. 
 But it doesn't appear to be used, and when I deleted it I was able to build, 
test, and run the processor without obvious issues.

Do you know of any reason I can't just pull that out?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] nifi issue #929: NIFI-2613 - Apache POI processor to convert Excel documents...

2017-03-16 Thread jvwing
Github user jvwing commented on the issue:

https://github.com/apache/nifi/pull/929
  
Thanks for those improvements, @jdye64, I especially like the updated usage 
doc.  Two things on the latest code:

1. Did you try an .xls file?  There is a problem when the flowfile 
attribute is added in the catch block on line ~195.  The NiFi framework throws 
an exception of it's own, because we can't do `session.putAttribute` inside an 
InputStreamCallback for the same flowfile:

> java.lang.IllegalStateException: 
StandardFlowFileRecord[uuid=be192381-9475-4c6d-a6ca-43735e5df271,claim=StandardContentClaim
 [resourceClaim=StandardResourceClaim[id=1489713904793-1, container=default, 
section=1], offset=0, 
length=26112],offset=0,name=./conf/test-xls.xls,size=26112] already in use for 
an active callback or InputStream created by ProcessSession.read(FlowFile) has 
not been closed

Something similar happens with the session.putAttribute on ~209.  As a 
result of these exceptions, the session is rolled back and the flowfile is 
returned to the input queue.  I think we can throw an exception, though.  So if 
we caught and rethrew with a different error message, it should work out.

2. In the failure case, we're routing the flowfile to both 'failure' and 
'original'.  I didn't realize it earlier, but I now believe this to be unusual 
in NiFi.  Most processors treat failure as an exclusive route, and 'original' 
as part of the successful happy path.  SplitAvro, SplitJson, SplitText, and 
UnpackContent were some examples I looked at.  I doubt that's written in stone. 
 What do you think?

I made a [sample code 
fork](https://github.com/jvwing/nifi/commit/2ccf5dec2dcd707c5963716dfb3fbf7813c460ea)
 with a unit test for .xls and a suggested approach to solving the 
IllegalStateExceptions, and the failure routing.  I did not get the logging to 
cooperate the way I think it should, but we're not too far off.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] nifi issue #929: NIFI-2613 - Apache POI processor to convert Excel documents...

2017-03-16 Thread jvwing
Github user jvwing commented on the issue:

https://github.com/apache/nifi/pull/929
  
@jdye64 Thanks, I looked into some of these error cases:

* Unsupported .xls - Throws exception, routes to 'original'.  The error 
bulletin seems a bit terse:
> ConvertExcelToCSVProcessor[id=43426b41-015a-1000-b06e-3a9be79162d1] 
Package should contain a content type part [M1.13]
* Blank Sheet - Succeeds, one empty flowfile to 'success' and the original 
to 'original'.  No Errors.  Seems correct to me.
* Empty Workbook - I wasn't able to create an empty workbook manually in 
Excel.  
* Unmatched Sheet Names - Covered by testNonExistantSpecifiedSheetName, 
seems correct to me.
* Diverse Content - I tried a number of bizarre things Excel lets you put 
in spreadsheets -- images, tables, pivot tables, formulas, hidden sheets, etc.  
Where appropriate, the processor returned content (cells in tables and pivot 
tables, last computed formula value), but did not run into errors.  Images were 
ignored.
* CSV-Breaking Content - The processor does not escape text to enforce a 
CSV structure.  Sheets containing multiline text in a cell, cells with commas, 
etc. resulted in improperly formed CSV files.  I do not object, I understand 
that's a significant scope increase.

I recommend that you do the following:

1. Clearly document that the processor only supports .xlsx and NOT .xls 
files.  I know you've already answered questions about this on the dev list, so 
somebody's going to try it.
2. For the .xls case, would it be possible to catch the 
`org.apache.poi.openxml4j.exceptions.InvalidFormatException`, repackage that 
with a helpful error message suggesting that only well-formed XLSX files are 
accepted, and route to failure?
3. Document that the processor does not escape invalid CSV content.
4. Logging changes from the code review post above.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] nifi issue #929: NIFI-2613 - Apache POI processor to convert Excel documents...

2017-03-15 Thread jdye64
Github user jdye64 commented on the issue:

https://github.com/apache/nifi/pull/929
  
@jvwing In regards to error cases here are a list of things which could 
make this fail.

- Input data is incorrect. AKA not .xlsx file
- Empty document. I guess this isn't an error but I think you get what I'm 
saying there just wouldn't be anything to do.
- User could specify a list of "desired-sheets" that are not present in the 
document. Once again not really failure but of course no success relationships 
would be triggered on the original relationship.
- The processor will currently only process strings (this includes 
numerical values, just the POI naming convention) and will ignore things like 
graphs, images, binary data, etc.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] nifi issue #929: NIFI-2613 - Apache POI processor to convert Excel documents...

2017-02-27 Thread jvwing
Github user jvwing commented on the issue:

https://github.com/apache/nifi/pull/929
  
Thanks, @jdye64,  that got it.  I was able to build just fine with 
contrib-check and everything.  On to new topics:

* **License/Notice** - We need a separate LICENSE and NOTICE files for 
nifi-poi-nar.  I was reminded by the recent LICENSE/NOTICE discussion on the 
mailing list about what to include in NAR bundles, and it applies here.  Good 
news, you've already put together most of the text for the nifi-assembly 
LICENSE/NOTICE, and there are many examples of the file layout in other NARs.  
I think a reasonably comparable NAR referencing another Apache project is 
[nifi-ignite-nar](https://github.com/apache/nifi/tree/master/nifi-nar-bundles/nifi-ignite-bundle/nifi-ignite-nar/src/main/resources/META-INF).
* **Sheet in Filename** - Output filenames all have the UNKNOWN sheet name, 
like `35200238025098_UNKNOWN.csv`.  How about adding a sheet name check to one 
of the unit tests?
* **Error Cases** - What will make this fail?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] nifi issue #929: NIFI-2613 - Apache POI processor to convert Excel documents...

2017-02-27 Thread jvwing
Github user jvwing commented on the issue:

https://github.com/apache/nifi/pull/929
  
@jdye64, I am still having trouble with the root `pom.xml` file.  The 
version number is still `1.0.0-SNAPSHOT` [on line 
1006](https://github.com/apache/nifi/pull/929/files/474a08eda8497d47468e511e3642766995dad62a#diff-600376dffeb79835ede4a0b285078036).
  The commit you referenced above did correct version numbers in the 
nifi-poi-bundle project, but not the root NiFi project POM.

How are you testing the build and the build output?  I have experienced two 
scenarios:

1.) **Clean System** - Build fails at nifi-assembly, because the root 
pom.xml specifies nifi-poi-nar 1.0.0-SNAPSHOT, which is not in the local Maven 
repository:

> [ERROR] Failed to execute goal on project nifi-assembly: Could not 
resolve dependencies for project 
org.apache.nifi:nifi-assembly:pom:1.2.0-SNAPSHOT: Could not find artifact 
org.apache.nifi:nifi-poi-nar:nar:1.0.0-SNAPSHOT in apache.snapshots 
(http://repository.apache.org/snapshots)

2.) **Reused System** - Build succeeds because the older 
nifi-poi-nar-1.0.0-SNAPSHOT.nar still exists in the local Maven repository, but 
the output assembly files contain nifi-poi-nar-1.0.0-SNAPSHOT.nar.

In neither case do I get the lastest NAR output.  Would you please try 
deleting the contents of `~/.m2/repository/org/apache/nifi/nifi-poi*` and 
running a full `mvn clean install`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] nifi issue #929: NIFI-2613 - Apache POI processor to convert Excel documents...

2017-02-27 Thread jvwing
Github user jvwing commented on the issue:

https://github.com/apache/nifi/pull/929
  
@jdye64 Thanks for the update.  I'll look into why I'm seeing the POM 
issue.  Don't worry about squashing, I'll take care of that when we're ready to 
merge.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] nifi issue #929: NIFI-2613 - Apache POI processor to convert Excel documents...

2017-02-27 Thread jdye64
Github user jdye64 commented on the issue:

https://github.com/apache/nifi/pull/929
  
@jvwing the ```ConvertAvroToParquet``` had slipped my mind thanks for the 
reminder there. The POM references had been fixed in a previous fix at 
https://github.com/apache/nifi/pull/929/commits/474a08eda8497d47468e511e3642766995dad62a#diff-e7171c7f07ed58f51e4643dc340432f8R22

I'm unsure what your preference for reviewing is but let me know if this is 
getting to the point where you would like for me to squash the commits or not.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] nifi issue #929: NIFI-2613 - Apache POI processor to convert Excel documents...

2017-02-25 Thread jvwing
Github user jvwing commented on the issue:

https://github.com/apache/nifi/pull/929
  
Thanks, @jdye64, this is clearly a big improvement.  I will continue 
reviewing the functionality.  Some of the issues identified above are still in 
the latest commit, and should still be addressed:

* Root POM references nifi-poi-bundle version `1.0-SNAPSHOT`
* Entry for `ConvertAvroToParquet` in nifi-kite-bundle
* Incomplete comment at ConvertExcelToCSVProcessorTest.java:91


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] nifi issue #929: NIFI-2613 - Apache POI processor to convert Excel documents...

2017-02-24 Thread jdye64
Github user jdye64 commented on the issue:

https://github.com/apache/nifi/pull/929
  
@jvwing I indeed had used a much earlier version of the processor. I 
couldn't find the other version so just ended up doing a rewrite for the most 
part. Thanks for hanging with me on this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] nifi issue #929: NIFI-2613 - Apache POI processor to convert Excel documents...

2017-02-16 Thread jdye64
Github user jdye64 commented on the issue:

https://github.com/apache/nifi/pull/929
  
@jvwing thank you for pointing this out. Let me look it over right now and 
I'll make another commit shortly. I had this in another project and I think 
during the copy/paste I have made an error.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] nifi issue #929: NIFI-2613 - Apache POI processor to convert Excel documents...

2017-02-15 Thread jvwing
Github user jvwing commented on the issue:

https://github.com/apache/nifi/pull/929
  
@jdye64 I'm still confused by the disparity between the annotation docs and 
the behavior of the processor, especially with respect to multiple sheets in 
the incoming spreadsheet.  From the annotations:

> Each sheet from the incoming Excel document will generate a new Flowfile 
that will be output from this processor

This was not my experience, and I believe the single call to 
`session.clone` on line 143, not performed in a loop, guarantees only a single 
flowfile to the success relationship.

> @WritesAttribute(attribute="sheetname", description="The name of the 
Excel sheet that this particular row of data came from in the Excel document")

But it has always been `UNKNOWN` for my tests.

---
I created a unit test to help clarify how this is supposed to work.  From 
the docs, I believe the test below should pass, although it currently fails.  
Am I misunderstanding?

Also see 
[branch](https://github.com/jvwing/nifi/tree/NIFI-2613-poi-excel-3-tests)
[TwoSheets.xlsx](https://github.com/apache/nifi/files/778759/TwoSheets.xlsx)

@Test
public void testMultipleSheetsGeneratesMultipleFlowFiles() throws 
Exception {

testRunner.enqueue(new 
File("src/test/resources/TwoSheets.xlsx").toPath());
testRunner.run();

testRunner.assertTransferCount(ConvertExcelToCSVProcessor.SUCCESS, 
2);
testRunner.assertTransferCount(ConvertExcelToCSVProcessor.ORIGINAL, 
1);
testRunner.assertTransferCount(ConvertExcelToCSVProcessor.FAILURE, 
0);

MockFlowFile ffSheetA = 
testRunner.getFlowFilesForRelationship(ConvertExcelToCSVProcessor.SUCCESS).get(0);
Long rowsSheetA = new 
Long(ffSheetA.getAttribute(ConvertExcelToCSVProcessor.ROW_NUM));
assertTrue(rowsSheetA == 4l);

assertTrue(ffSheetA.getAttribute(ConvertExcelToCSVProcessor.SHEET_NAME).equalsIgnoreCase("TestSheetA"));

MockFlowFile ffSheetB = 
testRunner.getFlowFilesForRelationship(ConvertExcelToCSVProcessor.SUCCESS).get(1);
Long rowsSheetB = new 
Long(ffSheetB.getAttribute(ConvertExcelToCSVProcessor.ROW_NUM));
assertTrue(rowsSheetB == 3l);

assertTrue(ffSheetB.getAttribute(ConvertExcelToCSVProcessor.SHEET_NAME).equalsIgnoreCase("TestSheetB"));
}



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] nifi issue #929: NIFI-2613 - Apache POI processor to convert Excel documents...

2017-02-14 Thread jvwing
Github user jvwing commented on the issue:

https://github.com/apache/nifi/pull/929
  
@joewitt Yes, thanks for getting us moving, I'll continue reviewing.

@jdye64 Thanks for the updates, I'll take another look shortly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] nifi issue #929: NIFI-2613 - Apache POI processor to convert Excel documents...

2017-02-14 Thread joewitt
Github user joewitt commented on the issue:

https://github.com/apache/nifi/pull/929
  
@jvwing you ok continuing to run with this one for the review?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] nifi issue #929: NIFI-2613 - Apache POI processor to convert Excel documents...

2017-02-14 Thread jdye64
Github user jdye64 commented on the issue:

https://github.com/apache/nifi/pull/929
  
@jvwing wow sorry its been a long time for my response. I've tried to go 
through and make all of the cleanup suggestions you have made.

Yes, the intention is a single output per sheet in the excel file. Keep in 
mind that if the sheet is not .xlsx and rather .xls format you would see the 
behavior you are experiencing. Can you attach the excel doc you were testing 
with?

I made the change to case of the attribute

Also adjusted the error handling to prevent the end user from being 
pummeled with a full stack trace.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] nifi issue #929: NIFI-2613 - Apache POI processor to convert Excel documents...

2017-02-14 Thread joewitt
Github user joewitt commented on the issue:

https://github.com/apache/nifi/pull/929
  
@jdye64 will you have time to address @jvwing findings?  Would like to get 
this across the line.  We're trying to get the list of stale PRs burned down.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] nifi issue #929: NIFI-2613 - Apache POI processor to convert Excel documents...

2016-10-18 Thread jvwing
Github user jvwing commented on the issue:

https://github.com/apache/nifi/pull/929
  
I have a few more comments from running the processor and reviewing the 
code:

**Processor Annotations**
* The CapabilityDescription says this processor transfers one flowfile for 
each sheet in the Excel document, but that was not my experience.  It 
aggregated all content into one output flowfile.  Is that the intended behavior?
* Most NiFi attribute names are lowercase, "sheetname" or "sheet.name" 
instead of "SheetName".  I don't believe there is a hard and fast rule there, 
but I recommend going lowercase if you do not have a strong preference.

**Properties**
* PropertyDescriptors should have `name` as a machine readable key like 
"extract-sheets" and `displayName` for the user "Sheets to Extract".

**Exception Handling and Logging**
Most NiFi components use `getLogger().error(...` as you do later in the 
code, rather than `printStackTrace()`.  I haven't tested these error cases, but 
they appear to be informational rather than stopping or failing the flow.  
Would `getLogger().info(...` or `getLogger().debug(...` be a better fit here?  
Do users need the full stack trace, or would just the message be helpful enough?
```
} catch (InvalidFormatException e) {
e.printStackTrace();
} catch (OpenXML4JException e) {
e.printStackTrace();
} catch (SAXException e) {
e.printStackTrace();
}
```



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] nifi issue #929: NIFI-2613 - Apache POI processor to convert Excel documents...

2016-10-18 Thread jvwing
Github user jvwing commented on the issue:

https://github.com/apache/nifi/pull/929
  
@jdye64 I still had trouble building and running NiFi with this PR.

* root POM still references nifi-poi-bundle version "1.0-SNAPSHOT"
* nifi-poi-nar still has a dependency on nifi-poi-processors version 
"1.0-SNAPSHOT"
* NiFi fails to start processing the entry for 
org.apache.nifi.processors.kite.ConvertAvroToParquet in 
nifi-nar-bundles/nifi-kite-bundle/nifi-kite-processors/src/main/resources/META-INF/services/org.apache.nifi.processor.Processor
 (unintentionally included in commit?)

Would you please try running `mvn clean install -Pcontrib-check`, verifying 
that the latest nifi-poi-nar is included in the lib output, and that NiFi 
starts OK?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] nifi issue #929: NIFI-2613 - Apache POI processor to convert Excel documents...

2016-10-13 Thread jvwing
Github user jvwing commented on the issue:

https://github.com/apache/nifi/pull/929
  
@jdye64 I have a couple of requests from a quick scan.  Would it be 
possible to...

* Rebase on the latest master?
* Update the POM versions, using a consistent "1.1.0-SNAPSHOT" format 
(looks like there is a mix of "1.0-SNAPSHOT" and "1.0.0-SNAPSHOT")?
* Reference the nifi-poi-nar in the nifi-assembly dependency artifacts?
* Include Poi in the nifi-assembly NOTICE file?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] nifi issue #929: NIFI-2613 - Apache POI processor to convert Excel documents...

2016-10-13 Thread jvwing
Github user jvwing commented on the issue:

https://github.com/apache/nifi/pull/929
  
Reviewing


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---