[GitHub] [hudi] pengzhiwei2018 commented on pull request #2334: [HUDI-1453] Throw Exception when input data schema is not equal to th…

2021-04-02 Thread GitBox


pengzhiwei2018 commented on pull request #2334:
URL: https://github.com/apache/hudi/pull/2334#issuecomment-812496550


   > https://gist.github.com/nsivabalan/91f12109e0fe1ca9749ff5290c946778
   
   Hi @nsivabalan, I have reviewed your test code. First you write an "int" to the table, so the table schema type is "int". Then you write a "double" to the table, so the table schema becomes "double": the table schema evolved from "int" to "double". I think this is more reasonable.
   
   In my original idea, the first write schema (e.g. "int") stays the table schema forever, and all incoming records after that must be compatible with that original table schema (e.g. "int"). That is what this PR tried to solve. I understand more clearly now: the table schema should evolve to the more generic type (e.g. from "int" to "double"), rather than always remain the first write's schema.
   So I can close this PR now. Thanks @nsivabalan for correcting me.
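   The evolution rule described above can be sketched as follows. This is a hypothetical stdlib-only illustration, not Hudi's actual code; it assumes the standard Avro numeric promotion chain (int -> long -> float -> double), and the function names are made up:

   ```python
   # Hypothetical sketch of Avro-style numeric type promotion (not Hudi's code).
   # For each writer type, the set of reader types it can legally be read as.
   PROMOTIONS = {
       "int": {"int", "long", "float", "double"},
       "long": {"long", "float", "double"},
       "float": {"float", "double"},
       "double": {"double"},
   }

   def is_promotable(from_type: str, to_type: str) -> bool:
       """Return True if a value of from_type can be read as to_type."""
       return to_type in PROMOTIONS.get(from_type, {from_type})

   def evolved_type(table_type: str, incoming_type: str) -> str:
       """Pick the more generic of the two types, mirroring the idea that the
       table schema should widen (e.g. int -> double) rather than stay fixed
       at the first write's schema."""
       if is_promotable(incoming_type, table_type):
           return table_type      # the table schema already covers the input
       if is_promotable(table_type, incoming_type):
           return incoming_type   # widen the table schema to the input's type
       raise TypeError(f"incompatible types: {table_type} vs {incoming_type}")

   # First write uses "int"; a later write brings "double": the table widens.
   print(evolved_type("int", "double"))  # -> double
   ```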


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [hudi] pengzhiwei2018 commented on pull request #2334: [HUDI-1453] Throw Exception when input data schema is not equal to th…

2021-03-14 Thread GitBox


pengzhiwei2018 commented on pull request #2334:
URL: https://github.com/apache/hudi/pull/2334#issuecomment-799071055


   > @pengzhiwei2018 first of all, thanks for these great contributions.
   > 
   > Wrt inputSchema vs writeSchema, I actually feel writeSchema already stands 
for inputSchema, input is what is being written, right? We can probably just 
leave it as is. and introduce new `tableSchema` variables as you have in the 
`HoodieWriteHandle` class.?
   > 
   > Like someone else pointed out as well, so far, we are using read and write 
schemas consistently. Love to not introduce a new input schema, unless its 
absolutely necessary .
   
   Hi @vinothchandar, thanks for your reply on this issue.
   Yes, in most cases the `writeSchema` is the same as the `inputSchema`, so it can stand in for the `inputSchema`. But in the case in this PR, we write the table twice:
   First, we write an "id:long" to the table. The input schema is "id:long", and the table schema is "id:long".
   Second, we write an "id:int" to the table. The input schema is "id:int", but the table schema is still "id:long" from the previous write. The write schema must be the same as the table schema, or else an exception is thrown, which is the problem we want to solve in this PR.
   So in this case we need to distinguish between the `inputSchema` and the `writeSchema`: the `inputSchema` is the incoming records' schema, while the `writeSchema` is always the `tableSchema`.
   **Here's the summary**
   - The `inputSchema` is used to parse the records from the incoming data.
   - The `tableSchema` is used to write records to and read records from the table. Whenever we write to or read from the table, we use the `tableSchema`.
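   The two-schema split above can be sketched as a toy example. This is an illustration only, not Hudi's API; the `conform` helper and the dict-based schemas are made up for the sketch:

   ```python
   # Toy illustration of inputSchema vs tableSchema (not Hudi's classes).
   # The input schema parses incoming records; the table schema governs how
   # records are written to and read from the table.
   CASTS = {"int": int, "long": int, "float": float, "double": float}

   def conform(record: dict, table_schema: dict) -> dict:
       """Cast each field of a parsed record to the type the table schema
       expects, e.g. a record parsed with inputSchema {"id": "int"} is
       written under tableSchema {"id": "double"}, so 1 becomes 1.0."""
       return {name: CASTS[typ](record[name]) for name, typ in table_schema.items()}

   input_schema = {"id": "int"}     # schema of the incoming records
   table_schema = {"id": "double"}  # schema already committed to the table

   parsed = {"id": 1}               # parsed using input_schema
   written = conform(parsed, table_schema)
   print(written)                   # -> {'id': 1.0}
   ```

   The point of the design choice: parsing always uses the incoming data's own schema, but every write and read against the table goes through the table schema, so a narrower input type is widened instead of corrupting the table.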
   
   
   







[GitHub] [hudi] pengzhiwei2018 commented on pull request #2334: [HUDI-1453] Throw Exception when input data schema is not equal to th…

2021-01-04 Thread GitBox


pengzhiwei2018 commented on pull request #2334:
URL: https://github.com/apache/hudi/pull/2334#issuecomment-754400800


   > 
https://github.com/apache/hudi/blob/master/hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/realtime/RealtimeCompactedRecordReader.java#L70
   
   Hi @n3nash, thanks for your suggestion.
   This PR only covers the part where Hudi writes the data, not the part where Hudi reads the data. The introduced `inputSchema`/`tableSchema` pair is only used to parse the incoming records and to read data from the table, which is quite different from the writer/reader schema in `RealtimeCompactedRecordReader`.

   And I will describe the changes in more detail in the **Brief change log** part of this PR.
   
   







[GitHub] [hudi] pengzhiwei2018 commented on pull request #2334: [HUDI-1453] Throw Exception when input data schema is not equal to th…

2020-12-26 Thread GitBox


pengzhiwei2018 commented on pull request #2334:
URL: https://github.com/apache/hudi/pull/2334#issuecomment-751355407


   > @pengzhiwei2018 : sorry I am not following the problem here. Can you throw 
some light.
   > Schema evolution is throwing an exception w/ Hudi, i.e a field's data type 
evolved to a compatible data type and hudi fails to accommodate it? is that the 
problem.
   > If yes, can you check #2350. there is a bug in schema compatibility check 
and the patch should fix it. If not, can you elaborate what exactly is the 
problem.
   > For eg: use-case given at #2063 succeeds with #2350
   
   Hi @nsivabalan, when I write a `long` value to a `double` field in the table, it passes `TableSchemaResolver.isSchemaCompatible`, but it fails in the write stage.
   I found that the problem is that the schema used to read and write the table (which should be the table schema, e.g. `double`) was actually the input schema (e.g. `long`). I fixed this problem by distinguishing the two schemas.
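   A minimal repro of that mismatch can be sketched as follows. This is a hypothetical stdlib-only sketch, not Hudi's code (Hudi uses Avro schema resolution): a compatibility check that allows numeric promotion accepts long -> double, but a write path that insists on exact schema equality still rejects it.

   ```python
   # Hypothetical sketch of the bug (not Hudi's code): the compatibility
   # check permits numeric promotion, but a strict write stage compares
   # schemas for exact equality and throws.
   PROMOTABLE = {"int": {"long", "float", "double"},
                 "long": {"float", "double"},
                 "float": {"double"}}

   def is_schema_compatible(incoming: str, table: str) -> bool:
       # Mirrors the spirit of TableSchemaResolver.isSchemaCompatible:
       # identical types or a legal numeric promotion are accepted.
       return incoming == table or table in PROMOTABLE.get(incoming, set())

   def strict_write(incoming: str, table: str) -> None:
       # Buggy write stage: compares against the input schema instead of
       # resolving to the table schema, so any widening fails.
       if incoming != table:
           raise ValueError(f"schema mismatch: {incoming} != {table}")

   incoming, table = "long", "double"
   print(is_schema_compatible(incoming, table))  # -> True: the check passes
   try:
       strict_write(incoming, table)             # ...but the write still fails
   except ValueError as e:
       print("write failed:", e)
   ```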


