[GitHub] HuangZhenQiu commented on issue #6483: [FLINK-7243][flink-formats] Add parquet input format
HuangZhenQiu commented on issue #6483: [FLINK-7243][flink-formats] Add parquet input format URL: https://github.com/apache/flink/pull/6483#issuecomment-462017703 @fhueske Thanks for the review. I removed all of the unused fields, the main function, and the test cases. To improve code coverage, I also added test cases for projected selection for each subclass of ParquetInputFormat. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] HuangZhenQiu commented on issue #6483: [FLINK-7243][flink-formats] Add parquet input format
HuangZhenQiu commented on issue #6483: [FLINK-7243][flink-formats] Add parquet input format URL: https://github.com/apache/flink/pull/6483#issuecomment-464962219 @fhueske I just finished rebasing onto the latest upstream master. The compile error probably came from the generated Avro classes that were committed in the last several diffs; I have removed them as well. Thanks for your effort in reviewing this PR.
[GitHub] HuangZhenQiu commented on issue #6483: [FLINK-7243][flink-formats] Add parquet input format
HuangZhenQiu commented on issue #6483: [FLINK-7243][flink-formats] Add parquet input format URL: https://github.com/apache/flink/pull/6483#issuecomment-445726623 @fhueske Thanks for pointing out so many missing parts. As you suggested, I added the SqlTypeInfo conversion and enforced the List and Map schema conventions in the diff. Please review it when you have time.
[GitHub] HuangZhenQiu commented on issue #6483: [FLINK-7243][flink-formats] Add parquet input format
HuangZhenQiu commented on issue #6483: [FLINK-7243][flink-formats] Add parquet input format URL: https://github.com/apache/flink/pull/6483#issuecomment-451798255 @fhueske I refined the test cases. Would you please do a final round of review?
[GitHub] HuangZhenQiu commented on issue #6483: [FLINK-7243][flink-formats] Add parquet input format
HuangZhenQiu commented on issue #6483: [FLINK-7243][flink-formats] Add parquet input format URL: https://github.com/apache/flink/pull/6483#issuecomment-441301540 @fhueske I resolved all of the comments except the one about the timestamp rewrite. It is needed for the time field of window functions. Would you prefer to use a timestamp UDF in SQL directly in this case?
[GitHub] HuangZhenQiu commented on issue #6483: [FLINK-7243][flink-formats] Add parquet input format
HuangZhenQiu commented on issue #6483: [FLINK-7243][flink-formats] Add parquet input format URL: https://github.com/apache/flink/pull/6483#issuecomment-441920259 @fhueske I removed the timestamp override and also updated the failure recovery test case to test recovery while reading a file with 10 row groups. Please review it at your convenience.
[GitHub] HuangZhenQiu commented on issue #6483: [FLINK-7243][flink-formats] Add parquet input format
HuangZhenQiu commented on issue #6483: [FLINK-7243][flink-formats] Add parquet input format URL: https://github.com/apache/flink/pull/6483#issuecomment-431737413 @fhueske Thanks for your patient review. It has been very helpful in making the PR more readable and robust. I have resolved your comments. Please take another look at your convenience.
[GitHub] HuangZhenQiu commented on issue #6483: [FLINK-7243][flink-formats] Add parquet input format
HuangZhenQiu commented on issue #6483: [FLINK-7243][flink-formats] Add parquet input format URL: https://github.com/apache/flink/pull/6483#issuecomment-428274826 @fhueske Thanks for reviewing this PR. I couldn't agree more about offering a similar experience for both input formats (Parquet and ORC). I will resolve your comments in the code tonight. Best Regards Peter Huang
[GitHub] HuangZhenQiu commented on issue #6483: [FLINK-7243][flink-formats] Add parquet input format
HuangZhenQiu commented on issue #6483: [FLINK-7243][flink-formats] Add parquet input format URL: https://github.com/apache/flink/pull/6483#issuecomment-428462374 @fhueske I resolved most of your comments. The major blocker is probably the splittable processing of Parquet files. It will probably require a big change to the PR and more test cases. Given that this is already a big PR, how about I work on that improvement in another PR?
[GitHub] HuangZhenQiu commented on issue #6483: [FLINK-7243][flink-formats] Add parquet input format
HuangZhenQiu commented on issue #6483: [FLINK-7243][flink-formats] Add parquet input format URL: https://github.com/apache/flink/pull/6483#issuecomment-429234463 @fhueske Thanks for the review. I resolved all of the comments except the unit tests for the checkpointing logic. 1) Regarding the question "instead of always reading as Row and from there converting to the other types?": in Parquet's interface, a converter is needed for each type of result. A record can be converted to a Row by recursively putting its children at particular indexes, but a Map has to do it by key. To reduce code duplication, I use the Row as an intermediate representation, so the type conversion can live in the subclasses of ParquetInputFormat. 2) I will add unit tests for the checkpoint logic tomorrow night.
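The idea in point 1 of the message above can be illustrated with a minimal sketch. It is not the PR's code: the `Schema` class, the `Object[]` stand-in for a Row, and the `toMap` helper are hypothetical names introduced only for illustration. It assumes the reader has already produced a positional row and shows how a subclass could turn it into a key-based Map afterwards.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class RowToMapSketch {
    /** Hypothetical schema node: field names plus a nested schema per field (null for leaf fields). */
    static final class Schema {
        final String[] names;
        final Schema[] children;

        Schema(String[] names, Schema[] children) {
            this.names = names;
            this.children = children;
        }
    }

    /** Converts a positional row into a map keyed by field name, recursing into nested rows. */
    static Map<String, Object> toMap(Object[] row, Schema schema) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (int i = 0; i < schema.names.length; i++) {
            Object value = row[i];
            Schema child = schema.children[i];
            if (child != null && value != null) {
                // Nested record: the value is itself a positional row.
                value = toMap((Object[]) value, child);
            }
            out.put(schema.names[i], value);
        }
        return out;
    }
}
```

With this split, only the positional conversion has to deal with Parquet's converter interface; the Map (or POJO) view is derived from the row in a second step.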
[GitHub] HuangZhenQiu commented on issue #6483: [FLINK-7243][flink-formats] Add parquet input format
HuangZhenQiu commented on issue #6483: [FLINK-7243][flink-formats] Add parquet input format URL: https://github.com/apache/flink/pull/6483#issuecomment-429517418 @fhueske I added a unit test for the failure recovery logic. Please review it again once the Travis check turns green.
[GitHub] HuangZhenQiu commented on issue #6483: [Flink-7243][flink-formats] Add parquet input format
HuangZhenQiu commented on issue #6483: [Flink-7243][flink-formats] Add parquet input format URL: https://github.com/apache/flink/pull/6483#issuecomment-410152524 @suez1224 Would you please have a look at this PR?
[GitHub] HuangZhenQiu commented on issue #6483: [Flink-7243][flink-formats] Add parquet input format
HuangZhenQiu commented on issue #6483: [Flink-7243][flink-formats] Add parquet input format URL: https://github.com/apache/flink/pull/6483#issuecomment-414184561 @walterddr @suez1224 I addressed Rong's comments. Please continue with the review at your convenience.
[GitHub] HuangZhenQiu commented on issue #6483: [FLINK-7243][flink-formats] Add parquet input format
HuangZhenQiu commented on issue #6483: [FLINK-7243][flink-formats] Add parquet input format URL: https://github.com/apache/flink/pull/6483#issuecomment-416702085 @docete @fhueske To provide a clean interface for a filterable Parquet input format, I would need to add a lot of code to this PR. Considering the size of the PR, I would like to put all of the filter pushdown in the ParquetTableSource PR.
[GitHub] HuangZhenQiu commented on issue #6483: [FLINK-7243][flink-formats] Add parquet input format
HuangZhenQiu commented on issue #6483: [FLINK-7243][flink-formats] Add parquet input format URL: https://github.com/apache/flink/pull/6483#issuecomment-418951266 @lvhuyen Thanks for using the Parquet input format and giving me feedback. Timestamp is a logical type in Parquet. It is stored internally as the primitive type int64, so it should be read out as a long. From the error, it looks like the timestamp is read as a String and then set on a field of type BigInteger. Would you please paste the Parquet schema for the file? Thanks
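As a small illustration of reading such a value as a long, the sketch below converts a raw int64 into a `java.time.Instant`. The class name `TimestampSketch`, the `Unit` enum, and `toInstant` are hypothetical and not part of the PR; the sketch assumes the column is annotated as a millisecond or microsecond timestamp counted from the Unix epoch.

```java
import java.time.Instant;

final class TimestampSketch {
    /** Assumed units of the raw int64 value, counted from the Unix epoch. */
    enum Unit { MILLIS, MICROS }

    /** Interprets a raw int64 timestamp value according to its unit. */
    static Instant toInstant(long raw, Unit unit) {
        switch (unit) {
            case MILLIS:
                return Instant.ofEpochMilli(raw);
            case MICROS:
                // floorDiv/floorMod keep the nanosecond part non-negative for pre-epoch values.
                long seconds = Math.floorDiv(raw, 1_000_000L);
                long micros = Math.floorMod(raw, 1_000_000L);
                return Instant.ofEpochSecond(seconds, micros * 1_000L);
            default:
                throw new IllegalArgumentException("Unsupported unit: " + unit);
        }
    }
}
```

A reader that instead hands the value over as a String would fail as soon as the target field expects a numeric type, which matches the symptom described in the message.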
[GitHub] HuangZhenQiu commented on issue #6483: [FLINK-7243][flink-formats] Add parquet input format
HuangZhenQiu commented on issue #6483: [FLINK-7243][flink-formats] Add parquet input format URL: https://github.com/apache/flink/pull/6483#issuecomment-419613742 @lvhuyen Thanks for digging out the root cause. I guess I should pass the logical type into RowPrimitiveConverter so that different types of data stored as Binary can be handled differently. I am working on a fix for it with more test cases. Thanks.
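A rough sketch of what handling Binary data by logical type could look like follows. It is not the PR's RowPrimitiveConverter: the `Annotation` enum and `convert` method are hypothetical, and only two annotations are covered. It assumes a decimal stored as binary holds the unscaled value in big-endian two's complement form.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;

final class BinaryConversionSketch {
    /** Hypothetical subset of logical annotations that a binary column may carry. */
    enum Annotation { UTF8, DECIMAL, NONE }

    /** Decodes binary data according to its annotation; scale is only used for DECIMAL. */
    static Object convert(byte[] bytes, Annotation annotation, int scale) {
        switch (annotation) {
            case UTF8:
                return new String(bytes, StandardCharsets.UTF_8);
            case DECIMAL:
                // BigInteger(byte[]) reads big-endian two's complement bytes.
                return new BigDecimal(new BigInteger(bytes), scale);
            default:
                // No annotation: hand back the raw bytes unchanged.
                return bytes;
        }
    }
}
```

Without the annotation, every binary value would have to be decoded the same way, which is why a string and a decimal cannot both be read correctly.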
[GitHub] HuangZhenQiu commented on issue #6483: [FLINK-7243][flink-formats] Add parquet input format
HuangZhenQiu commented on issue #6483: [FLINK-7243][flink-formats] Add parquet input format URL: https://github.com/apache/flink/pull/6483#issuecomment-420505649 @lvhuyen Thanks for the quick patch. I think the data conversion should be handled in RowConverter. I will ship a fix tonight. As for the array type issue in ParquetMapInputFormat, I will take a look later.
[GitHub] HuangZhenQiu commented on issue #6483: [FLINK-7243][flink-formats] Add parquet input format
HuangZhenQiu commented on issue #6483: [FLINK-7243][flink-formats] Add parquet input format URL: https://github.com/apache/flink/pull/6483#issuecomment-420872269 @lvhuyen The last fix should resolve the problem you hit in PoJoInputFormat. Would you please try it out? As for the primitive array handling, why does the problem only happen in ParquetMapInputFormat?
[GitHub] HuangZhenQiu commented on issue #6483: [FLINK-7243][flink-formats] Add parquet input format
HuangZhenQiu commented on issue #6483: [FLINK-7243][flink-formats] Add parquet input format URL: https://github.com/apache/flink/pull/6483#issuecomment-421894204 @lvhuyen I figured out the array handling issue. It is a List backward-compatibility issue. When I did internal testing at my company, there was only one type of list schema that needed to be handled. Thanks for digging it out. https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists I created a fix. Please have a look.
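For readers unfamiliar with the issue, the sketch below paraphrases the backward-compatibility rules for lists from the linked LogicalTypes document: given the repeated field inside a LIST-annotated group, it decides whether that field is itself the element or only a wrapper around it. The `Field` class and `elementOf` method are hypothetical and are not the fix that went into the PR.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

final class ListElementRule {
    /** Hypothetical, minimal schema field: a name plus children (empty for primitives). */
    static final class Field {
        final String name;
        final boolean isGroup;
        final List<Field> children;

        private Field(String name, boolean isGroup, List<Field> children) {
            this.name = name;
            this.isGroup = isGroup;
            this.children = children;
        }

        static Field primitive(String name) {
            return new Field(name, false, Collections.<Field>emptyList());
        }

        static Field group(String name, Field... children) {
            return new Field(name, true, Arrays.asList(children));
        }
    }

    /** Returns the field that describes the list element, given the list's name and its repeated field. */
    static Field elementOf(String listName, Field repeated) {
        if (!repeated.isGroup) {
            return repeated; // legacy two-level list of primitives
        }
        if (repeated.children.size() != 1) {
            return repeated; // group with several fields is the element itself
        }
        if (repeated.name.equals("array") || repeated.name.equals(listName + "_tuple")) {
            return repeated; // legacy writer naming conventions
        }
        return repeated.children.get(0); // standard three-level list: unwrap the repeated group
    }
}
```

A reader that only implements the last branch works for files written with the standard three-level layout but misreads files produced by older writers, which is consistent with the problem only showing up on externally produced data.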