[GitHub] [hudi] kazdy commented on issue #8259: [SUPPORT] Clustering created files with modified schema resulting in corrupted table

2023-05-10 Thread via GitHub


kazdy commented on issue #8259:
URL: https://github.com/apache/hudi/issues/8259#issuecomment-1542821519

   It was 0.12.2, not 0.13; sorry for the confusion. The issue does not occur with 0.13.





[GitHub] [hudi] kazdy commented on issue #8259: [SUPPORT] Clustering created files with modified schema resulting in corrupted table

2023-05-01 Thread via GitHub


kazdy commented on issue #8259:
URL: https://github.com/apache/hudi/issues/8259#issuecomment-1530087223

   @mansipp I was able to replicate this case using Hudi 0.13 as well. Files created by clustering have a changed schema. Interestingly, there is no issue if the order of types (not columns) does not change, e.g. if all columns are of type String.
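
   For reference, here is roughly how I reproduce it. This is a cut-down sketch: the path, table name, and reduced column set are made up here (the real table uses the full schema from my earlier comment), but the Hudi options are the ones I actually set.
   ```
   # Cut-down reproduction sketch (hypothetical path/table/columns):
   # write a table with mixed column types and trigger inline clustering,
   # then compare the schema of the clustered files with the original ones.
   from pyspark.sql import SparkSession, functions as F

   spark = SparkSession.builder.getOrCreate()

   df = spark.range(100).select(
       F.col("id").cast("string").alias("col1"),
       F.col("id").cast("int").alias("col3"),
       F.current_timestamp().alias("col5"),
       (F.col("id") % 4).cast("int").alias("partitionCol1"),
   )

   hudi_opts = {
       "hoodie.table.name": "repro_tbl",
       "hoodie.datasource.write.recordkey.field": "col1",
       "hoodie.datasource.write.precombine.field": "col5",
       "hoodie.datasource.write.partitionpath.field": "partitionCol1",
       # inline clustering, so no async services are involved
       "hoodie.clustering.inline": "true",
       "hoodie.clustering.inline.max.commits": "2",
   }

   # A few commits so inline clustering actually kicks in
   for _ in range(3):
       df.write.format("hudi").options(**hudi_opts).mode("append").save("/tmp/repro_tbl")
   ```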





[GitHub] [hudi] kazdy commented on issue #8259: [SUPPORT] Clustering created files with modified schema resulting in corrupted table

2023-03-29 Thread via GitHub


kazdy commented on issue #8259:
URL: https://github.com/apache/hudi/issues/8259#issuecomment-1489217716

   Hi @nsivabalan, I see that Danny assigned you to this ticket.
   I was able to replicate this exact case; the previous one did not exactly expose the issue. I'll update the repo soon.
   
   Here's what I found out and some additional info.
   
   First, I'm wondering why I'm getting this in the stack trace:
   ```
   java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
   	at org.apache.parquet.column.Dictionary.decodeToBinary(Dictionary.java:41)
   ```

   I don't have any Long/BigInt in the table schema or in the incoming schema; all numeric types are explicitly cast to INT or DECIMAL(10,0) (Spark's default).
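
   The casts look roughly like this (a sketch; the column names are placeholders, and I use spark.range only to get a bigint source column):
   ```
   from pyspark.sql import SparkSession, functions as F

   spark = SparkSession.builder.getOrCreate()
   raw_df = spark.range(10).select(
       F.col("id").alias("col3"),   # spark.range yields bigint by default
       F.col("id").alias("col4"),
       F.col("id").alias("col13"),
   )

   # Every numeric column is cast explicitly, so no LongType should
   # survive into the write schema
   df = (raw_df
         .withColumn("col3", F.col("col3").cast("int"))
         .withColumn("col4", F.col("col4").cast("int"))
         .withColumn("col13", F.col("col13").cast("decimal(10,0)")))
   df.printSchema()  # col3: int, col4: int, col13: decimal(10,0)
   ```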
   
   The schema of the incoming batch should look like this:
   ```
   col1 string
   col2 string
   col3 int
   col4 int
   col5 timestamp
   partitionCol1 int
   partitionCol2 int
   col6 string
   col7 timestamp
   col8 int
   col9 string
   col10 string
   col11 string
   col12 string
   col13 decimal
   col14 string
   col15 string
   col16 string
   col17 string
   col17 string
   col18 string
   col18 string
   col19 int
   col20 string
   col21 string
   col22 string
   col23 int
   col24 string
   col25 string
   col26 string
   col27 string
   col28 string
   col29 string
   col30 string
   col31 int
   col31 int
   col32 string
   col33 string
   col34 string
   col35 string
   col36 string
   col37 string
   ```
   
   Schema of clustered parquet files:
   
   ```
   col1 string
   col2 string
   col3 int
   col4 int
   col5 timestamp
   col6 string // earlier this position held partitionCol1 (int); a positional reader would expect an int here but find a string?
   partitionCol2 int
   col7 timestamp
   col8 int
   col9 string
   col10 string
   col11 string
   col12 string
   col13 decimal
   col14 string
   col15 string
   col16 string
   col17 string
   col17 string
   col18 string
   col18 string
   col19 int
   col20 string
   col21 string
   col22 string
   col23 int
   col24 string
   col25 string
   col26 string
   col27 string
   col28 string
   col29 string
   col30 string
   col31 int
   col31 int
   col32 string
   col33 string
   col34 string
   col35 string
   col36 string
   col37 string
   partitionCol1 int
   ```
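
   This is roughly how I compared the two parquet footers (a sketch with pyarrow; both paths are placeholders for a pre-clustering base file and a file written by the replacecommit):
   ```
   import pyarrow.parquet as pq

   # Placeholder paths: one base file from a regular commit and one file
   # produced by clustering, for the same partition
   before = pq.read_schema("tbl/part=1/original-basefile.parquet")
   after = pq.read_schema("tbl/part=1/clustered-basefile.parquet")

   # Compare by position, since the types line up but the positions differ
   for i, (b, a) in enumerate(zip(before, after)):
       if b.name != a.name or b.type != a.type:
           print(f"pos {i}: {b.name}:{b.type} -> {a.name}:{a.type}")
   ```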
   
   The schema in the replacecommit conforms to the incoming batch schema/table schema (it is correct).
   I don't know whether Hudi resolves columns by position or by name, and whether that matters when reading a parquet file for merging.
   If resolution is by position, then for col6 (a string sitting where partitionCol1, an int, used to be) Hudi would try to read the column as one type while the file actually holds another.
   It would then fail, since PlainLongDictionary has no decodeToBinary implementation? (The trace does read as if a string/binary value was requested from a column whose dictionary holds longs, which would fit a positional mismatch, e.g. a timestamp column, stored as int64 in parquet, being read as a string.)
   I don't know if this makes sense at all, but that's my intuition.
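
   To double-check the metadata side, I pulled the write schema straight out of the replacecommit file (a sketch; assuming the replacecommit under .hoodie/ is plain JSON with the Avro schema under extraMetadata, which is what I see in my version; the instant time is a placeholder):
   ```
   import json

   # Placeholder instant time; replacecommit files live under .hoodie/
   with open("tbl/.hoodie/20230329123456789.replacecommit") as f:
       meta = json.load(f)

   # Avro write schema recorded by the clustering commit
   schema = json.loads(meta["extraMetadata"]["schema"])
   print([field["name"] for field in schema["fields"]])  # field order per metadata
   ```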





[GitHub] [hudi] kazdy commented on issue #8259: [SUPPORT] Clustering created files with modified schema resulting in corrupted table

2023-03-24 Thread via GitHub


kazdy commented on issue #8259:
URL: https://github.com/apache/hudi/issues/8259#issuecomment-1482386262

   https://hudi.apache.org/docs/metadata
   
   This page advises using an in-process lock when the metadata table and async table services are enabled in a single-writer scenario. Is the page incorrect?
   
   Or did you mean that I should not combine async metadata indexing with other async table services in single-writer mode with the in-process lock provider?
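
   For reference, the in-process lock setup I mean is just this (as I read the docs; only the relevant options shown):
   ```
   # Single writer with async/inline table services, per the metadata docs
   lock_opts = {
       "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
       "hoodie.write.lock.provider":
           "org.apache.hudi.client.transaction.lock.InProcessLockProvider",
   }
   ```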
   
   Nevertheless, I was able to reproduce the issue without async table services in a streaming job (sketch below), so it looks like the issue is in clustering itself.
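
   The streaming variant is essentially this (a sketch; the rate source, path, and checkpoint location are stand-ins for the real job, and I'm assuming inline clustering behaves the same when driven from a streaming write):
   ```
   from pyspark.sql import SparkSession, functions as F

   spark = SparkSession.builder.getOrCreate()

   # "rate" is a stand-in source; the real job reads from elsewhere
   stream_df = (spark.readStream.format("rate").load()
                .withColumn("part", (F.col("value") % 4).cast("int")))

   (stream_df.writeStream
       .format("hudi")
       .option("hoodie.table.name", "repro_tbl")
       .option("hoodie.datasource.write.recordkey.field", "value")
       .option("hoodie.datasource.write.precombine.field", "timestamp")
       .option("hoodie.datasource.write.partitionpath.field", "part")
       # inline clustering only, all async table services left disabled
       .option("hoodie.clustering.inline", "true")
       .option("hoodie.clustering.inline.max.commits", "2")
       .option("checkpointLocation", "/tmp/repro_ckpt")
       .outputMode("append")
       .start("/tmp/repro_tbl"))
   ```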





[GitHub] [hudi] kazdy commented on issue #8259: [SUPPORT] Clustering created files with modified schema resulting in corrupted table

2023-03-23 Thread via GitHub


kazdy commented on issue #8259:
URL: https://github.com/apache/hudi/issues/8259#issuecomment-1480945493

   @nsivabalan I would really appreciate any support or guidance here; I have two broken tables in prod now and need to come up with a fix or workaround. Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org