[GitHub] [arrow-datafusion] gobraves commented on issue #6983: [DataFrame] Parallel Load into dataframe

via GitHub Tue, 01 Aug 2023 10:40:54 -0700


gobraves commented on issue #6983:
URL: 
https://github.com/apache/arrow-datafusion/issues/6983#issuecomment-1660801140


   Hi @alamb, I apologize for the delayed response. Following your tips, I 
executed the commands below in the CLI and also ran the code you provided 
to reproduce the issue. Executing the commands in the CLI was almost 8 times 
faster than running that code, which matches my CPU core count.
   
   Here are the commands I executed in the CLI:
   
       create external table test stored as parquet location 'part-0.parquet';
       create table t as select * from test;
       explain create table t as select * from test;
   
   In the logical_plan of the explain output, I observed `CreateMemoryTable` 
and `TableScan`. I then compared the code that handles `CreateMemoryTable` in 
datafusion-cli with the `.cache()` function, hoping to identify the 
differences. target_partitions is passed in both cases, but I'm unsure why it 
is not used in `.cache()`. Judging from the commit mentioned in issue #6984, 
the problem is resolved by repartitioning. So the difference appears to be 
that one implementation uses `Partitioning` while the other does not. However, 
when browsing through the code myself, I couldn't find any relevant settings. 
If this is the case, could you please point me to the part of the code where 
this repartitioning happens?
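
   To make the repartitioning idea concrete, here is a minimal, std-only 
sketch of round-robin repartitioning. It is not DataFusion code: `Batch`, 
`round_robin_repartition`, and `sum_partitions_in_parallel` are hypothetical 
names chosen for illustration. It only shows why splitting one input 
partition into N output partitions lets N threads work at once.

```rust
use std::thread;

/// Hypothetical stand-in for a record batch: just a vector of values.
type Batch = Vec<u64>;

/// Distribute the batches of a single input partition across
/// `target_partitions` output partitions in round-robin order.
fn round_robin_repartition(input: Vec<Batch>, target_partitions: usize) -> Vec<Vec<Batch>> {
    assert!(target_partitions > 0, "need at least one output partition");
    let mut output: Vec<Vec<Batch>> = vec![Vec::new(); target_partitions];
    for (i, batch) in input.into_iter().enumerate() {
        output[i % target_partitions].push(batch);
    }
    output
}

/// Process each partition on its own thread and return one sum per partition.
fn sum_partitions_in_parallel(partitions: Vec<Vec<Batch>>) -> Vec<u64> {
    let handles: Vec<_> = partitions
        .into_iter()
        .map(|partition| thread::spawn(move || partition.iter().flatten().sum::<u64>()))
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().expect("worker thread panicked"))
        .collect()
}

fn main() {
    // One input partition (think: a single parquet file) holding 16 batches.
    let input: Vec<Batch> = (0..16).map(|i| vec![i; 4]).collect();
    // Assumed target_partitions = 8, matching the core count mentioned above.
    let partitions = round_robin_repartition(input, 8);
    let sums = sum_partitions_in_parallel(partitions);
    println!("partitions: {}, total: {}", sums.len(), sums.iter().sum::<u64>());
}
```

   Without the repartition step there is only one partition, so only one 
thread does any work, which would be consistent with the roughly 8x gap I 
measured.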
   
   I have one more question: Do we need to create a new DmlStatement to address 
this issue?
   > Perhaps this could be done by creating a LogicalPlan::DmlStatement for 
write and then letting the existing insert machinery work rather than doing a 
custom "collect".  
   
   I don't fully understand this suggestion, probably because I haven't yet 
fully grasped the problem described above.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
