[GitHub] [hudi] pengzhiwei2018 commented on pull request #2334: [HUDI-1453] Throw Exception when input data schema is not equal to th…
pengzhiwei2018 commented on pull request #2334: URL: https://github.com/apache/hudi/pull/2334#issuecomment-812496550

> https://gist.github.com/nsivabalan/91f12109e0fe1ca9749ff5290c946778

Hi @nsivabalan, I have reviewed your test code. First you write an "int" to the table, so the table schema type is "int". Then you write a "double" to the table, and the table schema becomes "double": the table schema evolved from "int" to "double". I agree this is the more reasonable behavior. My original idea was that the first write schema (e.g. "int") would be the table schema forever, and all incoming records after that should be compatible with that original table schema (e.g. "int"); that is what this PR tries to enforce. I understand the intended behavior more clearly now: the table schema should evolve to the more generic type (e.g. from "int" to "double"), not always stay fixed at the first write schema. So I will close this PR now. Thanks @nsivabalan for correcting me.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
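The "int" to "double" evolution discussed above follows Avro-style numeric promotion rules. The following is a minimal standalone sketch of those rules, not Hudi's actual implementation; the class and method names (`SchemaPromotion`, `isPromotable`, `widerType`) are hypothetical and chosen only for illustration:

```java
import java.util.List;
import java.util.Map;

public class SchemaPromotion {
    // Avro-style promotion rules (Avro spec, "Schema Resolution"):
    // a value written with the key type can be read as any listed type.
    private static final Map<String, List<String>> PROMOTIONS = Map.of(
        "int",    List.of("int", "long", "float", "double"),
        "long",   List.of("long", "float", "double"),
        "float",  List.of("float", "double"),
        "double", List.of("double")
    );

    /** Returns true if a value written as 'from' can be read as 'to'. */
    public static boolean isPromotable(String from, String to) {
        return PROMOTIONS.getOrDefault(from, List.of(from)).contains(to);
    }

    /**
     * Picks the more generic of two numeric types, mirroring how the
     * table schema should evolve (e.g. int + double -> double).
     */
    public static String widerType(String a, String b) {
        if (isPromotable(a, b)) return b;
        if (isPromotable(b, a)) return a;
        throw new IllegalArgumentException("Incompatible types: " + a + ", " + b);
    }
}
```

Under these rules, a table that has seen both "int" and "double" writes should carry the "double" schema going forward, which matches the behavior @nsivabalan's gist demonstrates.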
pengzhiwei2018 commented on pull request #2334: URL: https://github.com/apache/hudi/pull/2334#issuecomment-799071055

> @pengzhiwei2018 first of all, thanks for these great contributions.
>
> Wrt inputSchema vs writeSchema, I actually feel writeSchema already stands for inputSchema, input is what is being written, right? We can probably just leave it as is. and introduce new `tableSchema` variables as you have in the `HoodieWriteHandle` class.?
>
> Like someone else pointed out as well, so far, we are using read and write schemas consistently. Love to not introduce a new input schema, unless its absolutely necessary .

Hi @vinothchandar, thanks for your reply on this issue. Yes, in most cases the `writeSchema` is the same as the `inputSchema`, so it can stand in for the `inputSchema`. But in the case this PR addresses, we write to the table twice:

First, we write "id: long" to the table. The input schema is "id: long", and the table schema is "id: long".

Second, we write "id: int" to the table. The input schema is "id: int", but the table schema is still "id: long" from the previous write. The write schema must match the table schema, or an exception is thrown, which is the problem we want to solve in this PR. So in this case we need to distinguish the `inputSchema` from the `writeSchema`: the `inputSchema` is the incoming records' schema, while the `writeSchema` is always the `tableSchema`.

**Here's the summary**
- The `inputSchema` is used to parse the records from the incoming data.
- The `tableSchema` is used to write records to and read records from the table. Whenever we write to or read from the table, we use the `tableSchema`.
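The two-schema write path summarized above can be sketched in a few lines. This is a conceptual illustration only, not Hudi's code; the class and method names (`WritePathSketch`, `parseWithInputSchema`, `writeWithTableSchema`) are hypothetical, and the example assumes a single numeric field "id" with a table schema of `long`:

```java
public class WritePathSketch {
    /**
     * The inputSchema governs how incoming data is parsed: an "id: int"
     * record yields an Integer, an "id: long" record yields a Long.
     */
    static Object parseWithInputSchema(String raw, String inputType) {
        return inputType.equals("int") ? Integer.valueOf(raw) : Long.valueOf(raw);
    }

    /**
     * The tableSchema (here: long) governs what is actually persisted.
     * A narrower int value is widened to long instead of failing the
     * write with a schema mismatch.
     */
    static long writeWithTableSchema(Object parsed) {
        if (parsed instanceof Integer) {
            return ((Integer) parsed).longValue();
        }
        return (Long) parsed;
    }
}
```

The point of the sketch: parsing uses the incoming schema, but the persisted representation always follows the table schema, so a second write with "id: int" no longer conflicts with a table whose schema is "id: long".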
pengzhiwei2018 commented on pull request #2334: URL: https://github.com/apache/hudi/pull/2334#issuecomment-754400800

> https://github.com/apache/hudi/blob/master/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/RealtimeCompactedRecordReader.java#L70

Hi @n3nash, thanks for your suggestion. This PR only covers the path where Hudi writes data, not the path where Hudi reads data. The introduced `inputSchema`/`tableSchema` pair is only used to parse the incoming records and to read data from the table, which is quite different from the `writer/reader schema` in RealtimeCompactedRecordReader. I will describe the changes in more detail in the **Brief change log** section of this PR.
pengzhiwei2018 commented on pull request #2334: URL: https://github.com/apache/hudi/pull/2334#issuecomment-751355407

> @pengzhiwei2018 : sorry I am not following the problem here. Can you throw some light.
> Schema evolution is throwing an exception w/ Hudi, i.e a field's data type evolved to a compatible data type and hudi fails to accommodate it? is that the problem.
> If yes, can you check #2350. there is a bug in schema compatibility check and the patch should fix it. If not, can you elaborate what exactly is the problem.
> For eg: use-case given at #2063 succeeds with #2350

Hi @nsivabalan, when I write a `long` value to a `double` field in the table, it passes `TableSchemaResolver.isSchemaCompatible`, but it then fails in the write stage. The root cause is that the schema used to write and read the table (which should be `double`) is the same as the input schema (e.g. `long`). I fixed the problem by distinguishing the two schemas.
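The failure mode described above, where the compatibility check passes but the write fails, can be illustrated with a small sketch. This is not Hudi's implementation; the class `LongToDoubleSketch` and its methods are hypothetical, standing in for a TableSchemaResolver-style check and the subsequent write:

```java
public class LongToDoubleSketch {
    /**
     * A simplified Avro-style compatibility check over a few numeric
     * types: long and int are both promotable to double, so an incoming
     * "long" against a "double" table passes, as isSchemaCompatible did.
     */
    static boolean isCompatible(String incomingType, String tableType) {
        return incomingType.equals(tableType)
            || (incomingType.equals("long") && tableType.equals("double"))
            || (incomingType.equals("int")
                && (tableType.equals("long") || tableType.equals("double")));
    }

    /**
     * The fix in spirit: the write stage must use the table schema
     * (double), widening the incoming long, rather than writing with
     * the input schema and failing on the type mismatch.
     */
    static double writeAsTableType(long incomingValue) {
        return (double) incomingValue;
    }
}
```

With the input schema and table schema kept distinct, the write stage converts values to the table's type instead of throwing after the compatibility check has already passed.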