[ https://issues.apache.org/jira/browse/FLINK-13292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16915620#comment-16915620 ]
Alejandro Sellero commented on FLINK-13292:
-------------------------------------------

Hello [~nithish], first of all, thank you very much for taking a look at this. I have also seen the test, so the problem could be my data, but I am having a hard time finding the issue. I have attached a simple example file that also fails, so you can try it yourself. This is exactly the code I used with that file:

{code:java}
val env = StreamExecutionEnvironment.getExecutionEnvironment

val orcSchema = "struct<" +
  "operation:int," +
  "originalTransaction:bigInt," +
  "bucket:int," +
  "rowId:bigInt," +
  "currentTransaction:bigInt," +
  "row:struct<" +
    "id:int," +
    "headline:string," +
    "user_id:int," +
    "company_id:int," +
    "created_at:timestamp," +
    "updated_at:timestamp," +
    "link:string," +
    "is_html:tinyint," +
    "source:string," +
    "company_feed_id:int," +
    "editable:tinyint," +
    "body_clean:string," +
    "activitystream_activity_id:bigint," +
    "uniqueness_checksum:string," +
    "rating:string," +
    "kununu_review_id:int," +
    "soft_deleted:tinyint," +
    "type:string," +
    "metadata:string," +
    "url:string," +
    "imagecache_uuid:string," +
    "video_id:int" +
  ">>"

val hconf = new HadoopConfiguration()
hconf.setBoolean("orc.skip.corrupt.data", true)

val companyArticlesFormat = new OrcRowInputFormat(
  "PATH_TO_FOLDER",
  orcSchema,
  hconf
)

env
  .readFile(
    companyArticlesFormat,
    "PATH_TO_FOLDER"
  )
  .writeAsText("file:///tmp/test/orcRead")

env.execute()
{code}

Again, thanks a lot for taking a look at this.

> NullPointerException when reading a string field in a nested struct from an
> Orc file.
> -------------------------------------------------------------------------------------
>
>                 Key: FLINK-13292
>                 URL: https://issues.apache.org/jira/browse/FLINK-13292
>             Project: Flink
>          Issue Type: Bug
>          Components: Connectors / ORC
>    Affects Versions: 1.8.0
>            Reporter: Alejandro Sellero
>            Priority: Major
>         Attachments: one_row.json, output.orc
>
>
> When I try to read an Orc file using flink-orc, a NullPointerException
> is thrown.
> I think this issue could be related to this closed issue:
> https://issues.apache.org/jira/browse/FLINK-8230
> It happens when trying to read the string fields in a nested struct. This
> is my schema:
> {code:java}
> "struct<" +
>   "operation:int," +
>   "originalTransaction:bigInt," +
>   "bucket:int," +
>   "rowId:bigInt," +
>   "currentTransaction:bigInt," +
>   "row:struct<" +
>     "id:int," +
>     "headline:string," +
>     "user_id:int," +
>     "company_id:int," +
>     "created_at:timestamp," +
>     "updated_at:timestamp," +
>     "link:string," +
>     "is_html:tinyint," +
>     "source:string," +
>     "company_feed_id:int," +
>     "editable:tinyint," +
>     "body_clean:string," +
>     "activitystream_activity_id:bigint," +
>     "uniqueness_checksum:string," +
>     "rating:string," +
>     "review_id:int," +
>     "soft_deleted:tinyint," +
>     "type:string," +
>     "metadata:string," +
>     "url:string," +
>     "imagecache_uuid:string," +
>     "video_id:int" +
>   ">>",{code}
> {code:java}
> [error] Caused by: java.lang.NullPointerException
> [error]   at java.lang.String.checkBounds(String.java:384)
> [error]   at java.lang.String.<init>(String.java:462)
> [error]   at org.apache.flink.orc.OrcBatchReader.readString(OrcBatchReader.java:1216)
> [error]   at org.apache.flink.orc.OrcBatchReader.readNonNullBytesColumnAsString(OrcBatchReader.java:328)
> [error]   at org.apache.flink.orc.OrcBatchReader.readField(OrcBatchReader.java:215)
> [error]   at org.apache.flink.orc.OrcBatchReader.readNonNullStructColumn(OrcBatchReader.java:453)
> [error]   at org.apache.flink.orc.OrcBatchReader.readField(OrcBatchReader.java:250)
> [error]   at org.apache.flink.orc.OrcBatchReader.fillRows(OrcBatchReader.java:143)
> [error]   at org.apache.flink.orc.OrcRowInputFormat.ensureBatch(OrcRowInputFormat.java:333)
> [error]   at org.apache.flink.orc.OrcRowInputFormat.reachedEnd(OrcRowInputFormat.java:313)
> [error]   at org.apache.flink.runtime.operators.DataSourceTask.invoke(DataSourceTask.java:190)
> [error]   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
> [error]   at java.lang.Thread.run(Thread.java:748){code}
> Instead of using the Table API, I am trying to read the Orc files in batch
> mode as follows:
> {code:java}
> env
>   .readFile(
>     new OrcRowInputFormat(
>       "",
>       "SCHEMA_GIVEN_BEFORE",
>       new HadoopConfiguration()
>     ),
>     "PATH_TO_FOLDER"
>   )
>   .writeAsText("file:///tmp/test/fromOrc")
> {code}
> Thanks for your support

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
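The top frames of the quoted stack trace reach {{java.lang.String.checkBounds}} from the {{String(byte[], int, int)}} constructor used by {{OrcBatchReader.readString}}; {{checkBounds}} dereferences the byte array it receives, so a null backing array is enough to produce exactly this NPE. A minimal sketch, independent of Flink, under my reading (not confirmed in the report) that the string column's byte buffer was null for some rows:

{code:java}
public class NullBytesNpe {

    // True when String(byte[], int, int) throws an NPE for a null array,
    // matching the String.checkBounds frame at the top of the reported trace.
    static boolean nullBufferThrowsNpe() {
        byte[] buffer = null; // hypothetical stand-in for a null ORC column buffer
        try {
            new String(buffer, 0, 5);
            return false;
        } catch (NullPointerException e) {
            return true; // checkBounds reads buffer.length on the null array
        }
    }

    public static void main(String[] args) {
        System.out.println("NPE from null buffer: " + nullBufferThrowsNpe());
    }
}
{code}

If that reading is right, the interesting question is why the vector's byte array is null for this file, e.g. whether the reader mishandles the nested-struct column rather than the data itself being corrupt.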