jonkeane commented on issue #37816: URL: https://github.com/apache/arrow/issues/37816#issuecomment-1729637921
Could you explain more about what kind of cloud storage + ephemeral local / syncing system you're using? Using a dataset backed by something like Google Drive, OneDrive, Dropbox, iCloud Drive, etc. is not something we recommend, since performance can vary widely depending on whether a file is truly local or needs to be fetched first. It would still be good to know which of those (or some other one) you're using in case we run into it elsewhere.

When opening a dataset, there is a process that recursively scans the directories + files to find the partitions + get a list of parquet files that make up the dataset. (There's a pretty good explanation of this for a different dataset in https://github.com/apache/arrow/issues/34145#issuecomment-1432181304 ). So it's possible that the listing itself triggers the cloud -> local syncing process, hence downloading everything (even without trying to unify schemas). There _might_ be a way around this in the listing code inside of arrow, but there are a lot of complexities with these kinds of filesystems.
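
To make the discovery step concrete, here is a rough sketch of what a recursive listing pass looks like, written with only the Python standard library. It is not arrow's actual implementation: the function names `discover_parquet_files` and `make_demo_tree`, the hive-style `key=value` directory layout, and the demo tree are all assumptions made for illustration.

```python
import os
import tempfile


def discover_parquet_files(root):
    """Walk `root` recursively and return (path, partitions) pairs.

    Hypothetical helper: partition values are parsed from hive-style
    `key=value` directory names. No file contents are read here, only
    directory listings.
    """
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        parts = [] if rel == os.curdir else rel.split(os.sep)
        # Keep only path segments that look like key=value partitions.
        partitions = dict(p.split("=", 1) for p in parts if "=" in p)
        for name in filenames:
            if name.endswith(".parquet"):
                found.append((os.path.join(dirpath, name), partitions))
    # Sort by path so the result does not depend on os.walk ordering.
    return sorted(found, key=lambda item: item[0])


def make_demo_tree(root):
    """Create a tiny partitioned layout with empty placeholder files."""
    for year, month in [("2022", "12"), ("2023", "01")]:
        directory = os.path.join(root, f"year={year}", f"month={month}")
        os.makedirs(directory)
        # Empty placeholder, not a valid parquet file.
        open(os.path.join(directory, "part-0.parquet"), "wb").close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        make_demo_tree(root)
        for path, partitions in discover_parquet_files(root):
            print(os.path.relpath(path, root), partitions)
```

On an ordinary local disk this pass only reads directory metadata. Whether a sync-backed folder treats such a listing as a request to fetch the files is up to the sync client, which is why knowing the specific service matters.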
