[GitHub] [hudi] yyh2954360585 commented on issue #9471: [SUPPORT] When using Deltasteamer JdbcSource to extract data, there are issues with data loss and slow query of source side data

via GitHub Mon, 21 Aug 2023 19:39:01 -0700


yyh2954360585 commented on issue #9471:
URL: https://github.com/apache/hudi/issues/9471#issuecomment-1687328500


   > > @yyh2954360585 JDBC is slow and puts a lot of load on the source system,
   > > so a full query on a large table can cause high load or even downtime on
   > > the database server. You can set the value of source-limit according to
   > > your dataset and requirements. You can even set it to a very high value.
   > 
   > If I set source-limit=1000, then I can only extract 1000 rows from the
   > source table, which is not reasonable, because the query has no offset.
   > 
   > 
https://github.com/apache/hudi/blob/ba5ab8ca46863a67023e7172fb16a9a36d3b5acb/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/JdbcSource.java#L239-L252
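
A row limit without an OFFSET can still walk a whole table if each batch resumes after the last value seen in an incremental column. The sketch below illustrates that general pattern against an in-memory SQLite table. It is not Hudi's code: `read_in_batches`, the table `src`, and the column `id` are hypothetical names, and whether JdbcSource behaves this way depends on how it is configured. The sketch also assumes the incremental column is unique and increasing; with duplicate values at a batch boundary, a strict `>` comparison would skip rows.

```python
import sqlite3


def read_in_batches(conn, table, incr_col, batch_size):
    """Yield batches of rows, each resuming after the last incr_col value seen.

    Hypothetical helper for illustration only. Table and column names are
    interpolated directly, so they must come from trusted input.
    """
    checkpoint = None
    while True:
        if checkpoint is None:
            sql = f"SELECT * FROM {table} ORDER BY {incr_col} LIMIT ?"
            params = (batch_size,)
        else:
            sql = (f"SELECT * FROM {table} WHERE {incr_col} > ? "
                   f"ORDER BY {incr_col} LIMIT ?")
            params = (checkpoint, batch_size)
        cur = conn.execute(sql, params)
        rows = cur.fetchall()
        if not rows:
            return
        yield rows
        # Locate the incremental column and remember its last value.
        col_index = [d[0] for d in cur.description].index(incr_col)
        checkpoint = rows[-1][col_index]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO src VALUES (?, ?)",
    [(i, f"row-{i}") for i in range(1, 2501)],
)

batches = list(read_in_batches(conn, "src", "id", 1000))
print([len(b) for b in batches])  # [1000, 1000, 500]
```

Each query touches at most `batch_size` rows past the checkpoint, so the source database never has to serve the full table in one statement.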
   
   So there is another issue. If I have a table with 10 million rows and do
not set source-limit, it will perform a full query on the source table. The
ppdQuery is a subquery, and according to the SQL execution plan the database
executes the subquery first and then the outer query. If I use jdbc.fetchsize,
the fetch size only applies at the outermost layer, so it does not reduce the
work done by the subquery.
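
To make the point about the subquery concrete, here is a minimal sketch of how a pushed-down query string changes with and without a row limit. The function `build_pushdown_query`, the alias `rdbms_table`, and the table and column names are assumptions for illustration, loosely modeled on the linked file rather than copied from it. A JDBC fetch size is a driver hint for how many rows travel per round trip; it adds no filter or limit to the SQL text.

```python
def build_pushdown_query(table, incr_col, checkpoint=None, source_limit=0):
    """Build a subquery string to hand to a JDBC reader (hypothetical helper).

    Any row restriction has to live inside the parentheses to reduce the
    work the source database does for the subquery.
    """
    inner = f"select * from {table}"
    if checkpoint is not None:
        inner += f" where {incr_col} > '{checkpoint}'"
    if source_limit > 0:
        inner += f" order by {incr_col} limit {source_limit}"
    return f"({inner}) rdbms_table"


# No checkpoint and no limit: the subquery selects the whole table.
full_scan = build_pushdown_query("orders", "updated_at")
print(full_scan)
# (select * from orders) rdbms_table

# Checkpoint plus limit: the restriction is inside the subquery.
bounded = build_pushdown_query(
    "orders", "updated_at", checkpoint="2023-08-01", source_limit=1000
)
print(bounded)
# (select * from orders where updated_at > '2023-08-01' order by updated_at limit 1000) rdbms_table
```

In the first case, lowering the fetch size would change how the result is streamed to the client, but the database would still be asked for every row in `orders`.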


