[GitHub] [hudi] kazdy commented on issue #8259: [SUPPORT] Clustering created files with modified schema resulting in corrupted table
kazdy commented on issue #8259: URL: https://github.com/apache/hudi/issues/8259#issuecomment-1542821519

It was 0.12.2, not 0.13; sorry for the confusion. The issue does not occur with 0.13.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
kazdy commented on issue #8259: URL: https://github.com/apache/hudi/issues/8259#issuecomment-1530087223

@mansipp I was able to replicate this case using Hudi 0.13 as well. Files created by clustering have a changed schema. Interestingly, there is no issue if the order of types (not columns) does not change, e.g. if all columns are of type String.
kazdy commented on issue #8259: URL: https://github.com/apache/hudi/issues/8259#issuecomment-1489217716

Hi @nsivabalan, I see that Danny assigned you to this ticket. I was able to replicate this exact case; the previous one was not exactly exposing the issue. I'll update the repo soon. Here's what I found out, plus some additional info.

First, I'm wondering why I'm getting this in the stacktrace:
```
java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
    at org.apache.parquet.column.Dictionary.decodeToBinary(Dictionary.java:41)
```
I don't have any Long/Bigint in the table schema nor in the incoming schema; all numeric types are explicitly cast to INT or DECIMAL(10,0) (Spark's default).

The schema of the incoming batch should look like this:
```
col1 String
col2 String
col3 Int
col4 Int
col5 Timestamp
partitionCol1 int
partitionCol2 int
col6 String
col7 timestamp
col8 int
col9 string
col10 string
col11 string
col12 string
col13 decimal
col14 string
col15 string
col16 string
col17 string
col17 string
col18 string
col18 string
col19 int
col20 string
col21 string
col22 string
col23 int
col24 string
col25 string
col26 string
col27 string
col28 string
col29 string
col30 string
col31 int
col31 int
col32 string
col33 string
col34 string
col35 string
col36 string
col37 string
```
Schema of the clustered parquet files:
```
col1 String
col2 String
col3 Int
col4 Int
col5 Timestamp
col6 String   // earlier at this place was partitionCol1 int; tries to read Int but instead needs to read String? idk
partitionCol2 int
col7 timestamp
col8 int
col9 string
col10 string
col11 string
col12 string
col13 decimal
col14 string
col15 string
col16 string
col17 string
col17 string
col18 string
col18 string
col19 int
col20 string
col21 string
col22 string
col23 int
col24 string
col25 string
col26 string
col27 string
col28 string
col29 string
col30 string
col31 int
col31 int
col32 string
col33 string
col34 string
col35 string
col36 string
col37 string
partitionCol1 int
```
The schema in the replacecommit conforms to the incoming batch schema / table schema (it is correct).

I don't know whether Hudi resolves columns by position or by name, and whether it matters when reading a parquet file for merging. If it is by position, then at the position of col6 String (where partitionCol1 int used to be) Hudi would try to read the column as Int when it actually needs to read a String. It therefore can't, since PlainLongDictionary has no decodeToBinary implementation available. I don't know if this makes any sense at all, but it is my intuition.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
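The by-position vs by-name intuition above can be sketched in isolation. This is a minimal illustration with invented helper names (not Hudi's actual reader code), using an abridged version of the two schemas: positional pairing flags type mismatches exactly where col6 and partitionCol1 swapped slots, while name-based pairing tolerates the reorder.

```python
# Hypothetical sketch of by-position vs by-name column resolution.
# Helper names are invented for illustration; they are not Hudi APIs.

def resolve_by_position(file_schema, table_schema):
    """Pair columns positionally and report physical-type mismatches."""
    return [
        f"{t_name}: expected {t_type}, file has {f_name} ({f_type})"
        for (f_name, f_type), (t_name, t_type) in zip(file_schema, table_schema)
        if f_type != t_type
    ]

def resolve_by_name(file_schema, table_schema):
    """Pair columns by name; pure reordering produces no mismatch."""
    file_types = dict(file_schema)
    return [
        f"{name}: expected {t_type}, file has {file_types[name]}"
        for name, t_type in table_schema
        if name in file_types and file_types[name] != t_type
    ]

# Table schema as recorded in the replacecommit (abridged, correct order)...
table = [("col5", "timestamp"), ("partitionCol1", "int"),
         ("partitionCol2", "int"), ("col6", "string")]
# ...versus the clustered file, where col6 moved into partitionCol1's slot
# and partitionCol1 moved to the end.
clustered = [("col5", "timestamp"), ("col6", "string"),
             ("partitionCol2", "int"), ("partitionCol1", "int")]

print(resolve_by_position(clustered, table))
# two mismatches: the swapped slots of col6 and partitionCol1
print(resolve_by_name(clustered, table))
# [] -- by-name resolution is unaffected by the reorder
```

A reader that trusts positions would decode partitionCol1's int slot as a string column, which matches the flavor of the decodeToBinary failure above.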
kazdy commented on issue #8259: URL: https://github.com/apache/hudi/issues/8259#issuecomment-1482386262

https://hudi.apache.org/docs/metadata This page advises using an in-process lock when the metadata table and async services are enabled in a single-writer scenario. Is it incorrect? Or do you mean that I should not use async metadata indexing together with other async table services in single-writer mode with the in-process lock provider?

Nevertheless, I was able to reproduce the issue without async table services in a streaming job. It looks like the issue is in clustering itself.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
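For reference, a minimal sketch of the single-writer setup the docs page describes: metadata table plus an async table service guarded by the in-process lock provider. The option keys are standard Hudi write configs; the table name and the exact values are assumptions for illustration, not the reporter's actual settings.

```python
# Sketch of Hudi writer options for a single writer with the metadata
# table and async clustering enabled, using the in-process lock provider.
# Values are illustrative assumptions, not the exact settings from this report.
hudi_options = {
    "hoodie.table.name": "my_table",  # hypothetical table name
    "hoodie.metadata.enable": "true",
    "hoodie.clustering.async.enabled": "true",
    "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
    "hoodie.cleaner.policy.failed.writes": "LAZY",
    "hoodie.write.lock.provider":
        "org.apache.hudi.client.transaction.lock.InProcessLockProvider",
}

# Typical Spark usage (sketch):
# df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```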
kazdy commented on issue #8259: URL: https://github.com/apache/hudi/issues/8259#issuecomment-1480945493

@nsivabalan I would really appreciate any support or guidance here; I have two broken tables in prod now and need to come up with a fix or workaround. Thanks!