[GitHub] nifi issue #3057: NIFI-5667: Add nested record support for PutORC
Github user bbende commented on the issue: https://github.com/apache/nifi/pull/3057 Looks good to me, was able to verify the functionality, going to merge ---
Github user mattyb149 commented on the issue: https://github.com/apache/nifi/pull/3057 Updated the rest of the backticks, thanks @VikingK and @bbende for your reviews! ---
Github user VikingK commented on the issue: https://github.com/apache/nifi/pull/3057 @mattyb149 I checked out the new backtick DDL patch you applied. It seems it only backticks the complex data structures; the top-level ones are not ticked. For example, ProductId, Items, PropertyMap, Serial and Metadata should be backticked as well to protect against bad names.

```
hive.ddl
CREATE EXTERNAL TABLE IF NOT EXISTS Sales (ProductId INT, Items ARRAY>, PropertyMap MAP, Serial STRUCT<`Serial`:BIGINT, `Date`:TIMESTAMP, `SystemId`:BIGINT, `T`:BIGINT, `IncludeStartTx`:BOOLEAN>, Metadata STRUCT<`AvroTs`:TIMESTAMP>) STORED AS ORC
```

---
Github user VikingK commented on the issue: https://github.com/apache/nifi/pull/3057 @mattyb149 awesome work, I was going to suggest the backtick feature since it's a pain when downstream systems send all kinds of weird names, like Timestamp etc. My solution right now is an ugly Groovy script that eliminates them. I'll run some testing tomorrow on the fixes. ---
Github user mattyb149 commented on the issue: https://github.com/apache/nifi/pull/3057 I found the issue and pushed a new commit with the fix. Also your test data exercised another part of the code I hadn't thought of, your "as" field is a keyword in Hive so when I used the generated hive.ddl attribute to create the table on top of the ORC files, it didn't work. The other change in this commit is to backtick-quote the field names to protect against field names that are reserved words. ---
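The backtick-quoting change described above can be sketched roughly as follows. This is a hypothetical helper for illustration, not the actual NiFiOrcUtils code: it wraps each field name in backticks (doubling any literal backtick, Hive's escape convention for quoted identifiers) so reserved words such as `as` survive in the generated `hive.ddl` attribute.

```java
// Hypothetical sketch of backtick-quoting field names for a Hive DDL,
// so reserved words like "as" or "timestamp" are safe as column names.
// Not the actual NiFi implementation.
public class HiveDdlQuoting {

    static String backtick(String fieldName) {
        // Hive escapes a literal backtick inside a quoted identifier
        // by doubling it, then the whole name is wrapped in backticks.
        return "`" + fieldName.replace("`", "``") + "`";
    }

    public static void main(String[] args) {
        System.out.println(backtick("as") + " STRING");
        System.out.println(backtick("timestamp") + " BIGINT");
    }
}
```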
Github user bbende commented on the issue: https://github.com/apache/nifi/pull/3057 OK, using your test.avro going into PutORC with an AvroReader that uses the embedded schema, I can get the error you showed earlier. I also tested ConvertRecord using the AvroReader and JsonWriter, and that works and produces the following JSON:

```
[ {
  "OItems" : [ {
    "Od" : 9,
    "HS" : [ 47, 119, 61, 61 ],
    "AS" : [ 65, 65, 61, 61 ],
    "NS" : "0"
  } ]
} ]
```

---
Github user VikingK commented on the issue: https://github.com/apache/nifi/pull/3057 Maybe it's because I am missing a name: on the array level, which is causing the JSON to pick 'array'. I'll test later tonight. ---
Github user VikingK commented on the issue: https://github.com/apache/nifi/pull/3057 @bbende weird, I tried it again and I got the same error. Here is the output from avro-tools, and I also attached my test.avro message:

```
java -jar avro-tools-1.8.2.jar getschema test.avro
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
{
  "type" : "record",
  "name" : "IOlist",
  "namespace" : "analytics.models.its",
  "fields" : [ {
    "name" : "OItems",
    "type" : [ "null", {
      "type" : "array",
      "items" : {
        "type" : "record",
        "name" : "ISC",
        "namespace" : "analytics.models.its.iolist.oitems",
        "fields" : [
          { "name" : "Od", "type" : [ "null", "long" ] },
          { "name" : "HS", "type" : [ "null", "bytes" ] },
          { "name" : "AS", "type" : [ "null", "bytes" ] },
          { "name" : "NS", "type" : [ "null", "string" ] }
        ]
      }
    } ]
  } ]
}

java -jar avro-tools-1.8.2.jar tojson test.avro
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
{"OItems":{"array":[{"Od":{"long":9},"HS":{"bytes":"/w=="},"AS":{"bytes":"AA=="},"NS":{"string":"0"}}]}}
```

[test.zip](https://github.com/apache/nifi/files/2469204/test.zip) ---
Github user bbende commented on the issue: https://github.com/apache/nifi/pull/3057 @VikingK your schema has OItems defined as an array, but in the JSON OItems is not an array, it's an object with a field called array. So running with that schema and example JSON I get:

```
Caused by: java.lang.ClassCastException: org.codehaus.jackson.node.ObjectNode cannot be cast to org.codehaus.jackson.node.ArrayNode
    at org.apache.nifi.json.JsonTreeRowRecordReader.convertField(JsonTreeRowRecordReader.java:188)
    at org.apache.nifi.json.JsonTreeRowRecordReader.convertJsonNodeToRecord(JsonTreeRowRecordReader.java:118)
    at org.apache.nifi.json.JsonTreeRowRecordReader.convertJsonNodeToRecord(JsonTreeRowRecordReader.java:83)
    at org.apache.nifi.json.JsonTreeRowRecordReader.convertJsonNodeToRecord(JsonTreeRowRecordReader.java:74)
    at org.apache.nifi.json.AbstractJsonRowRecordReader.nextRecord(AbstractJsonRowRecordReader.java:92)
```

Which makes sense, because the OItems field is not an array but the schema says it is. I'm trying to figure out how to reproduce the other error you showed. ---
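The wrapped shape comes from the Avro JSON encoding: avro-tools `tojson` wraps each non-null union value in a single-key object naming the union branch, e.g. `{"long": 9}` or `{"array": [...]}`, which a plain-JSON record reader does not expect. A rough stdlib-only sketch of stripping those wrappers follows; it is a hypothetical helper, not NiFi code, models the parsed JSON as Maps and Lists, and only handles the unnamed-type branch labels (named types such as records use their full name in the encoding and are not covered).

```java
import java.util.*;

// Hypothetical sketch: recursively strip the single-key union wrappers
// produced by the Avro JSON encoding, turning e.g.
// {"OItems": {"array": [{"Od": {"long": 9}}]}} into {"OItems": [{"Od": 9}]}.
public class AvroJsonUnwrap {

    // Branch labels the Avro JSON encoding uses for unions of unnamed types.
    private static final Set<String> UNION_BRANCHES = Set.of(
            "boolean", "int", "long", "float", "double",
            "bytes", "string", "array", "map");

    @SuppressWarnings("unchecked")
    static Object unwrap(Object node) {
        if (node instanceof Map) {
            Map<String, Object> m = (Map<String, Object>) node;
            if (m.size() == 1) {
                String key = m.keySet().iterator().next();
                if (UNION_BRANCHES.contains(key)) {
                    // Single-key object whose key is a union branch label:
                    // replace the wrapper with its unwrapped value.
                    return unwrap(m.get(key));
                }
            }
            Map<String, Object> out = new LinkedHashMap<>();
            for (Map.Entry<String, Object> e : m.entrySet()) {
                out.put(e.getKey(), unwrap(e.getValue()));
            }
            return out;
        }
        if (node instanceof List) {
            List<Object> out = new ArrayList<>();
            for (Object child : (List<Object>) node) {
                out.add(unwrap(child));
            }
            return out;
        }
        return node; // primitive leaf
    }

    public static void main(String[] args) {
        // Parsed form of {"OItems": {"array": [{"Od": {"long": 9}}]}}
        Map<String, Object> encoded = Map.of(
                "OItems", Map.of("array",
                        List.of(Map.of("Od", Map.of("long", 9L)))));
        System.out.println(unwrap(encoded)); // {OItems=[{Od=9}]}
    }
}
```

Note the heuristic caveat: a genuine record field named `array` or `map` would be unwrapped incorrectly, so a real implementation would consult the schema rather than guess from the shape.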
Github user VikingK commented on the issue: https://github.com/apache/nifi/pull/3057 Sorry for the late answer, I got bogged down. Avro:

```
{
  "name": "IOlist",
  "namespace": "analytics.models.its",
  "type": "record",
  "fields": [ {
    "name": "OItems",
    "type": [ "null", {
      "type": "array",
      "items": {
        "name": "ISC",
        "namespace": "analytics.models.its.iolist.oitems",
        "type": "record",
        "fields": [
          { "name": "Od", "type": [ "null", "long" ] },
          { "name": "HS", "type": [ "null", "bytes" ] },
          { "name": "AS", "type": [ "null", "bytes" ] },
          { "name": "NS", "type": [ "null", "string" ] }
        ]
      }
    } ]
  } ]
}
```

JSON:

```
{
  "OItems" : {
    "array" : [ {
      "Od" : { "long" : 9 },
      "HS" : { "bytes" : "/w==" },
      "AS" : { "bytes" : "AA==" },
      "NS" : { "string" : "0" }
    } ]
  }
}
```

---
Github user mattyb149 commented on the issue: https://github.com/apache/nifi/pull/3057 Can you share the schema of the data, and possibly a JSON export of your Avro file? I couldn't reproduce this with an array of ints, and @bbende ran successfully with an array of records. ---
Github user VikingK commented on the issue: https://github.com/apache/nifi/pull/3057 Hi, I have been verifying the pull request and I came across this error for one of my Avro messages. I am currently working to figure out which part of the message caused it.

```
2018-10-10 14:14:42,489 ERROR [Timer-Driven Process Thread-6] org.apache.nifi.processors.orc.PutORC PutORC[id=0430e7ab-99f1-3e25-c491-935245567fa3] Failed to write due to java.lang.ClassCastException: org.apache.hadoop.hive.serde2.typeinfo.PrimitiveTypeInfo cannot be cast to org.apache.hadoop.hive.serde2.typeinfo.ListTypeInfo: java.lang.ClassCastException: org.apache.hadoop.hive.serde2.typeinfo.PrimitiveTypeInfo cannot be cast to org.apache.hadoop.hive.serde2.typeinfo.ListTypeInfo
java.lang.ClassCastException: org.apache.hadoop.hive.serde2.typeinfo.PrimitiveTypeInfo cannot be cast to org.apache.hadoop.hive.serde2.typeinfo.ListTypeInfo
    at org.apache.hadoop.hive.ql.io.orc.NiFiOrcUtils.convertToORCObject(NiFiOrcUtils.java:127)
    at org.apache.hadoop.hive.ql.io.orc.NiFiOrcUtils.convertToORCObject(NiFiOrcUtils.java:177)
    at org.apache.hadoop.hive.ql.io.orc.NiFiOrcUtils.lambda$convertToORCObject$0(NiFiOrcUtils.java:129)
    at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
    at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
    at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
    at org.apache.hadoop.hive.ql.io.orc.NiFiOrcUtils.convertToORCObject(NiFiOrcUtils.java:130)
    at org.apache.nifi.processors.orc.record.ORCHDFSRecordWriter.write(ORCHDFSRecordWriter.java:73)
    at org.apache.nifi.processors.orc.record.ORCHDFSRecordWriter.write(ORCHDFSRecordWriter.java:94)
    at org.apache.nifi.processors.hadoop.AbstractPutHDFSRecord.lambda$null$0(AbstractPutHDFSRecord.java:324)
    at org.apache.nifi.controller.repository.StandardProcessSession.read(StandardProcessSession.java:2235)
    at org.apache.nifi.controller.repository.StandardProcessSession.read(StandardProcessSession.java:2203)
    at org.apache.nifi.processors.hadoop.AbstractPutHDFSRecord.lambda$onTrigger$1(AbstractPutHDFSRecord.java:305)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:360)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1662)
    at org.apache.nifi.processors.hadoop.AbstractPutHDFSRecord.onTrigger(AbstractPutHDFSRecord.java:272)
    at org.apache.nifi.processor.AbstractProcessor.onTrigger(AbstractProcessor.java:27)
    at org.apache.nifi.controller.StandardProcessorNode.onTrigger(StandardProcessorNode.java:1165)
    at org.apache.nifi.controller.tasks.ConnectableTask.invoke(ConnectableTask.java:203)
    at org.apache.nifi.controller.scheduling.TimerDrivenSchedulingAgent$1.run(TimerDrivenSchedulingAgent.java:117)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
```

---
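The stack trace above boils down to an unchecked cast: `convertToORCObject` receives a `TypeInfo` that is actually primitive but casts it to `ListTypeInfo`. A minimal stand-in sketch of the failure mode and a defensive guard follows; the class names here are simplified stand-ins for illustration, not the real Hive serde2 classes.

```java
// Simplified stand-ins (not the real Hive classes) to illustrate the
// failure: a blind cast of a primitive TypeInfo to ListTypeInfo throws
// ClassCastException, while an instanceof guard does not.
abstract class TypeInfo {}

class PrimitiveTypeInfo extends TypeInfo {}

class ListTypeInfo extends TypeInfo {
    final TypeInfo elementType;
    ListTypeInfo(TypeInfo elementType) { this.elementType = elementType; }
}

public class CastDemo {

    // Defensive version: check the runtime type before casting instead
    // of assuming every nested field carries a ListTypeInfo.
    static TypeInfo elementTypeOf(TypeInfo info) {
        if (info instanceof ListTypeInfo) {
            return ((ListTypeInfo) info).elementType;
        }
        // Fall back to the value itself rather than throwing.
        return info;
    }

    public static void main(String[] args) {
        TypeInfo primitive = new PrimitiveTypeInfo();
        // ((ListTypeInfo) primitive) here would throw ClassCastException;
        // the guarded helper returns the input unchanged instead.
        System.out.println(elementTypeOf(primitive) == primitive); // true
    }
}
```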