jonkeane commented on issue #37816:
URL: https://github.com/apache/arrow/issues/37816#issuecomment-1729637921

   Could you explain more about what kind of cloud storage + ephemeral local /
syncing system you're using? Using a dataset backed by something like
Google Drive, OneDrive, Dropbox, iCloud Drive, etc. is not something we
recommend, since performance can vary widely depending on whether a file is
truly local or needs to be fetched first. It would still be good to know which
of those (or some other one) you're using in case we run into it elsewhere.
   
   When opening a dataset, there is a process that recursively scans the
directories + files to find the partitions + get a list of Parquet files that
make up the dataset. (There's a pretty good explanation of this for a different
dataset in https://github.com/apache/arrow/issues/34145#issuecomment-1432181304
). So it's possible that the listing itself triggers the cloud -> local syncing
process, hence downloading everything (even without trying to unify
schemas). There _might_ be a way around this in the listing code inside of
Arrow, but there are a lot of complexities with these kinds of filesystems.
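
   To make the listing step concrete, here is a rough sketch of that kind of
recursive scan using only the Python standard library. It is not Arrow's actual
discovery code; the function name `list_parquet_files` and the assumption of
Hive-style `key=value` partition directories are purely illustrative.

```python
import os


def list_parquet_files(root):
    """Recursively walk `root` and return (relative_path, partitions) pairs.

    Hypothetical helper for illustration only; Arrow's real dataset
    discovery is implemented differently.
    """
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".parquet"):
                continue  # skip anything that is not a Parquet file
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            # Assumed Hive-style layout: directories named "key=value".
            partitions = dict(
                segment.split("=", 1)
                for segment in rel.split(os.sep)[:-1]
                if "=" in segment
            )
            found.append((rel, partitions))
    # Sort by path so the result does not depend on directory walk order.
    return sorted(found, key=lambda item: item[0])
```

   The walk above only asks for directory entries and never opens a file, but
on a syncing filesystem even that may be enough to start fetching files,
depending on how the sync client behaves.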
   
   


