liangrui1988 opened a new issue, #1939: URL: https://github.com/apache/orc/issues/1939
Hi, I have a few questions.

Background: one file in the 2023-2-11 partition fails to read with the exception shown in the log below. It is currently read with Spark 3.2.1 and was written by Spark 2.4, and there are records of it being read successfully back then. The ORC version was changed several times in between. So far I have found files in 3 partitions with this kind of exception (one problematic file per partition).

How can I determine which ORC version is needed to read this file correctly? I tried the latest orc-tools uber jar of each release line (orc-tools-1.3, orc-tools-1.4, ... orc-tools-1.9), and every one of them fails on this file with the same exception.

I also changed the orc-tools source code to set `orc.compress.size=262144*20`, but that produces other exceptions, so the content of the file does seem to be damaged. However, this same file was read successfully before.

In this case, what would you suggest for debugging the problem? How can I read this ORC file and recover its data? Thank you.

```
java -jar orc-tools-1.5.6-uber.jar meta 000221_0
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Processing data file 000221_0 [length: 311578064]
Structure for 000221_0
File Version: 0.12 with ORC_517
Rows: 19072149
Compression: SNAPPY
Compression size: 262144
Type: struct<uid:bigint,object:bigint,attr:int,updatetime:string,appdata:string>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 7562 hasNull: false
    Column 1: count: 7562 hasNull: false min: 677081 max: 2933721627 sum: 20406373220886
    Column 2: count: 7562 hasNull: false min: 588 max: 2933714767 sum: 20128755480030
    Column 3: count: 7562 hasNull: false min: 1 max: 1 sum: 7562
    Column 4: count: 7562 hasNull: false min: 2023-02-11 00:00:38 max: 2023-02-11 23:59:47 sum: 143678
    Column 5: count: 7562 hasNull: false min: attentionInterface max: yypadapt sum: 97419
  Stripe 2:
    Column 0: count: 3031906 hasNull: false
    Column 1: count: 3031906 hasNull: false min: 0 max: 254227973 sum: 335994680295336
    Column 2: count: 3031906 hasNull: false min: 0 max: 4293656576 sum: 2715250187650210
    Column 3: count: 3031906 hasNull: false min: 0 max: 1 sum: 2848516
    Column 4: count: 3031906 hasNull: false min: 1970-01-01 08:00:00 max: 2023-02-10 23:33:18 sum: 57606214
    Column 5: count: 3031906 hasNull: false min: LittleArt max: zone sum: 26158889
  Stripe 3:
    Column 0: count: 3252488 hasNull: false
    Column 1: count: 3252488 hasNull: false min: 254227973 max: 804847112 sum: 1874185328076757
    Column 2: count: 3252488 hasNull: false min: 0 max: 3871325159 sum: 3002506141474388
    Column 3: count: 3252488 hasNull: false min: 0 max: 1 sum: 3085396
    Column 4: count: 3252488 hasNull: false min: 1970-01-01 08:00:00 max: 2023-02-10 23:54:27 sum: 61797272
    Column 5: count: 3252488 hasNull: false min: LittleArt max: zone sum: 28344969
  Stripe 4:
    Column 0: count: 3326815 hasNull: false
    Column 1: count: 3326815 hasNull: false min: 804847112 max: 1169323168 sum: 3247348759465597
    Column 2: count: 3326815 hasNull: false min: 0 max: 4294967295 sum: 3346008539078461
    Column 3: count: 3326815 hasNull: false min: 0 max: 1 sum: 3136008
    Column 4: count: 3326815 hasNull: false min: 1970-01-01 08:00:00 max: 2023-02-10 23:49:22 sum: 63209485
    Column 5: count: 3326815 hasNull: false min: LittleArt max: zone sum: 31812118
  Stripe 5:
    Column 0: count: 3449497 hasNull: false
    Column 1: count: 3449497 hasNull: false min: 1169323334 max: 1364554387 sum: 4350425743363666
    Column 2: count: 3449497 hasNull: false min: 0 max: 140960826654730 sum: 3590689504624478
    Column 3: count: 3449497 hasNull: false min: 0 max: 1 sum: 3332407
    Column 4: count: 3449497 hasNull: false min: 2015-02-14 20:58:43 max: 2023-02-10 22:16:37 sum: 65540443
    Column 5: count: 3449497 hasNull: false min: LittleArt max: zone sum: 33889705
  Stripe 6:
    Column 0: count: 3285861 hasNull: false
    Column 1: count: 3285861 hasNull: false min: 1364554387 max: 1859624746 sum: 5176693436169696
    Column 2: count: 3285861 hasNull: false min: 1 max: 2933312186 sum: 3719658121533510
    Column 3: count: 3285861 hasNull: false min: 0 max: 1 sum: 3105437
    Column 4: count: 3285861 hasNull: false min: 2014-12-03 12:22:33 max: 2023-02-10 23:54:39 sum: 62431359
    Column 5: count: 3285861 hasNull: false min: LittleArt max: zone sum: 46929366
  Stripe 7:
    Column 0: count: 2718020 hasNull: false
    Column 1: count: 2718020 hasNull: false min: 1859624760 max: 2843928969 sum: 6525427773935358
    Column 2: count: 2718020 hasNull: false min: 1 max: 4294967295 sum: 4562575441612432
    Column 3: count: 2718020 hasNull: false min: 0 max: 1 sum: 2717848
    Column 4: count: 2718020 hasNull: false min: 2013-08-17 15:56:00 max: 2023-02-10 23:59:34 sum: 51642380
    Column 5: count: 2718020 hasNull: false min: LittleArt max: zone sum: 41734941

File Statistics:
  Column 0: count: 19072149 hasNull: false
  Column 1: count: 19072149 hasNull: false min: 0 max: 2933721627 sum: 21530482094527296
  Column 2: count: 19072149 hasNull: false min: 0 max: 140960826654730 sum: 20956816691453509
  Column 3: count: 19072149 hasNull: false min: 0 max: 1 sum: 18233174
  Column 4: count: 19072149 hasNull: false min: 1970-01-01 08:00:00 max: 2023-02-11 23:59:47 sum: 362370831
  Column 5: count: 19072149 hasNull: false min: LittleArt max: zone sum: 208967407

Stripes:
  Stripe: offset: 3 data: 108812 rows: 7562 tail: 143 index: 224
    Stream: column 0 section ROW_INDEX start: 3 length 12
    Stream: column 1 section ROW_INDEX start: 15 length 37
    Stream: column 2 section ROW_INDEX start: 52 length 36
    Stream: column 3 section ROW_INDEX start: 88 length 26
    Stream: column 4 section ROW_INDEX start: 114 length 60
    Stream: column 5 section ROW_INDEX start: 174 length 53
    Stream: column 1 section DATA start: 227 length 23543
    Stream: column 2 section DATA start: 23770 length 33929
    Stream: column 3 section DATA start: 57699 length 101
    Stream: column 4 section DATA start: 57800 length 49194
    Stream: column 4 section LENGTH start: 106994 length 355
    Stream: column 5 section DATA start: 107349 length 1576
    Stream: column 5 section LENGTH start: 108925 length 15
    Stream: column 5 section DICTIONARY_DATA start: 108940 length 99
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2
    Encoding column 2: DIRECT_V2
    Encoding column 3: DIRECT_V2
    Encoding column 4: DIRECT_V2
    Encoding column 5: DICTIONARY_V2[10]
  Stripe: offset: 109182 data: 53900850 rows: 3031906 tail: 157 index: 33222
    Stream: column 0 section ROW_INDEX start: 109182 length 149
    Stream: column 1 section ROW_INDEX start: 109331 length 7942
    Stream: column 2 section ROW_INDEX start: 117273 length 7305
    Stream: column 3 section ROW_INDEX start: 124578 length 3516
    Stream: column 4 section ROW_INDEX start: 128094 length 10205
    Stream: column 5 section ROW_INDEX start: 138299 length 4105
    Stream: column 1 section DATA start: 142404 length 5793729
    Stream: column 2 section DATA start: 5936133 length 13558058
    Stream: column 3 section DATA start: 19494191 length 347433
    Stream: column 4 section DATA start: 19841624 length 31301800
    Stream: column 4 section LENGTH start: 51143424 length 143687
    Stream: column 5 section DATA start: 51287111 length 2755665
    Stream: column 5 section LENGTH start: 54042776 length 60
    Stream: column 5 section DICTIONARY_DATA start: 54042836 length 418
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2
    Encoding column 2: DIRECT_V2
    Encoding column 3: DIRECT_V2
    Encoding column 4: DIRECT_V2
    Encoding column 5: DICTIONARY_V2[55]
  Stripe: offset: 54043411 data: 52730936 rows: 3252488 tail: 157 index: 37205
    Stream: column 0 section ROW_INDEX start: 54043411 length 158
    Stream: column 1 section ROW_INDEX start: 54043569 length 9097
    Stream: column 2 section ROW_INDEX start: 54052666 length 8156
    Stream: column 3 section ROW_INDEX start: 54060822 length 3971
    Stream: column 4 section ROW_INDEX start: 54064793 length 11437
    Stream: column 5 section ROW_INDEX start: 54076230 length 4386
    Stream: column 1 section DATA start: 54080616 length 6630706
    Stream: column 2 section DATA start: 60711322 length 11987943
    Stream: column 3 section DATA start: 72699265 length 297188
    Stream: column 4 section DATA start: 72996453 length 31415283
    Stream: column 4 section LENGTH start: 104411736 length 154156
    Stream: column 5 section DATA start: 104565892 length 2245188
    Stream: column 5 section LENGTH start: 106811080 length 60
    Stream: column 5 section DICTIONARY_DATA start: 106811140 length 412
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2
    Encoding column 2: DIRECT_V2
    Encoding column 3: DIRECT_V2
    Encoding column 4: DIRECT_V2
    Encoding column 5: DICTIONARY_V2[55]
Exception in thread "main" java.lang.IllegalArgumentException: Buffer size too small. size = 262144 needed = 7133168
    at org.apache.orc.impl.InStream$CompressedStream.readHeader(InStream.java:212)
    at org.apache.orc.impl.InStream$CompressedStream.ensureUncompressed(InStream.java:263)
    at org.apache.orc.impl.InStream$CompressedStream.read(InStream.java:250)
    at java.io.InputStream.read(InputStream.java:101)
    at com.google.protobuf.CodedInputStream.refillBuffer(CodedInputStream.java:737)
    at com.google.protobuf.CodedInputStream.isAtEnd(CodedInputStream.java:701)
    at com.google.protobuf.CodedInputStream.readTag(CodedInputStream.java:99)
    at org.apache.orc.OrcProto$StripeFooter.<init>(OrcProto.java:11144)
    at org.apache.orc.OrcProto$StripeFooter.<init>(OrcProto.java:11108)
    at org.apache.orc.OrcProto$StripeFooter$1.parsePartialFrom(OrcProto.java:11213)
    at org.apache.orc.OrcProto$StripeFooter$1.parsePartialFrom(OrcProto.java:11208)
    at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:89)
    at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:95)
    at com.google.protobuf.AbstractParser.parseFrom(AbstractParser.java:49)
    at org.apache.orc.OrcProto$StripeFooter.parseFrom(OrcProto.java:11441)
    at org.apache.orc.impl.RecordReaderUtils$DefaultDataReader.readStripeFooter(RecordReaderUtils.java:275)
    at org.apache.orc.impl.RecordReaderImpl.readStripeFooter(RecordReaderImpl.java:311)
    at org.apache.orc.tools.FileDump.printMetaDataImpl(FileDump.java:343)
    at org.apache.orc.tools.FileDump.printMetaData(FileDump.java:274)
    at org.apache.orc.tools.FileDump.main(FileDump.java:135)
    at org.apache.orc.tools.Driver.main(Driver.java:108)
```
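
One way to narrow this down might be to look at the raw bytes that the reader rejects. The sketch below is not ORC code: `OrcChunkHeader` is a hypothetical helper that decodes the 3-byte compression chunk header the way the ORC specification describes it (a little-endian value equal to `chunkLength * 2 + isOriginal`). As far as I understand, the `needed` value in the exception is this decoded length, which the reader compares against the compression size recorded in the file. The file path and the byte offset are assumptions supplied by the caller, for example the start of a stripe footer computed from the stripe's offset, index length and data length.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

/** Hypothetical debugging helper; decodes an ORC compression chunk header. */
final class OrcChunkHeader {
    final int chunkLength;     // length of the chunk that follows the header
    final boolean isOriginal;  // true if the chunk was stored uncompressed

    OrcChunkHeader(int chunkLength, boolean isOriginal) {
        this.chunkLength = chunkLength;
        this.isOriginal = isOriginal;
    }

    /** Decodes the three header bytes (little-endian, low bit = isOriginal). */
    static OrcChunkHeader decode(int b0, int b1, int b2) {
        int raw = (b0 & 0xff) | ((b1 & 0xff) << 8) | ((b2 & 0xff) << 16);
        return new OrcChunkHeader(raw >>> 1, (raw & 1) == 1);
    }

    /** Reads and decodes the three bytes found at the given file offset. */
    static OrcChunkHeader readAt(String path, long offset) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(path, "r")) {
            byte[] header = new byte[3];
            file.seek(offset);
            file.readFully(header);
            return decode(header[0], header[1], header[2]);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: OrcChunkHeader <file> <offset>");
            return;
        }
        OrcChunkHeader h = readAt(args[0], Long.parseLong(args[1]));
        System.out.println("chunkLength=" + h.chunkLength
            + " isOriginal=" + h.isOriginal);
    }
}
```

A decoded length above the file's compression size (262144 here) at a position that should start a compressed chunk would suggest that the bytes at that position are not a valid chunk header, or that the offset itself is wrong.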
