I have a dataset on S3 in partitioned folders, and I'm trying to create an external Hive table pointing at that data's location. The table schema defines partition columns that match the folder layout on S3.
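For context, the DDL is roughly of this shape (a sketch only; the table, column, partition names, bucket, and storage format below are placeholders, not our actual schema):

```sql
-- Hypothetical example of the kind of table creation involved.
-- All identifiers and the S3 path are placeholders.
CREATE EXTERNAL TABLE IF NOT EXISTS events (
  id      STRING,
  payload STRING
)
PARTITIONED BY (year STRING, month STRING, day STRING)
STORED AS PARQUET
LOCATION 's3://my-bucket/path/to/master-folder/';
```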
I've done this successfully quite a few times, but when the dataset is large the table creation query is either extremely slow or it hangs (we can't tell which). I've followed some of the tips at https://hortonworks.github.io/hdp-aws/s3-hive/index.html#general-performance-tips, configuring some of the parameters involving file permission and file size checks to adjust for S3, but still no luck. We're using EMR 5.12.1, which ships Hive 2.3.2. The table creation query does not show up in the Tez UI, but it does show up in the HiveServer2 UI as running; we're not sure whether it is actually running or has hung (most likely the latter).

Our (very roundabout) workaround so far is to copy all the files from the master folder to another directory, delete the originals, create the external table while the directory is empty, and then transfer the files back. We need to keep the original directory name because other processes depend on it, so we can't simply start in a fresh directory; this whole method is obviously not ideal.

Any tips or solutions to this problem would be greatly appreciated.

Dickson
