Disk spill is not used for CTAS operations directly; it only comes into play when the CTAS statement includes aggregations, sorts, or certain join operations.
I'm not familiar with the JDBC plugin, but when creating tables with Parquet, Drill's direct memory should be the primary memory consumer, not heap memory. The GC exception points to heap pressure, not direct memory. Perhaps try a test with a DFS/Hive source to create Parquet data of similar size and see whether the GC exception still occurs.

Also be careful with the assumption that the data set must fit in memory for this to work. In my experience you should look at the following factors when doing CTAS with large data sets.

When Drill creates the Parquet file(s) it needs to fit the whole file in memory, plus the source data used to produce the file. Since Parquet can give 3-4x storage efficiency (depending on the data and source format), you need to account for the Parquet file size plus 3-4x as much for the source data. Each minor fragment in the CTAS will consume this much memory. The basic formula I use is:

  Drillbit direct memory > #minor fragments * 5 * parquet file size

For example, with 8 minor fragments and the default 512 MB file size, that is 8 * 5 * 512 MB = 20 GB of direct memory. This can imply a large memory requirement in some cases. To reduce it you can do the following:

1. Reduce the number of minor fragments for the CTAS on the drillbits by setting planner.width.max_per_node to a lower number for the session before running the CTAS query. Fewer threads will execute, which makes the CTAS slower but reduces the memory requirement substantially.

2. Make the Parquet file size smaller; the default is 512 MB. Set store.parquet.block-size to a smaller value, 256 MB or 128 MB. The tradeoff is that it produces more files, but it reduces the memory requirement by 2-4x.

I prefer to lower the fragments and accept the slower creation once rather than live with the impact of smaller Parquet files, though 256 MB is not necessarily a bad file size. I have not tracked CTAS with partitioning closely enough to say how it affects memory requirements on large data sets.
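As a concrete sketch, the two session-level adjustments above would look like the following before running the CTAS. The workspace and table names are placeholders for illustration, and the block-size value is in bytes:

```sql
-- Lower per-node parallelism for this session: fewer minor fragments
-- means fewer concurrent Parquet writers and less direct memory in use.
ALTER SESSION SET `planner.width.max_per_node` = 2;

-- Shrink the target Parquet file/block size from the 512 MB default
-- to 256 MB (268435456 bytes); produces more, smaller files.
ALTER SESSION SET `store.parquet.block-size` = 268435456;

-- Hypothetical CTAS; dfs.tmp and the source table are placeholders.
CREATE TABLE dfs.tmp.`big_table_parquet` AS
SELECT * FROM mysql.schema.`big_table`;
```

Both settings apply only to the current session, so they will not affect other workloads on the cluster.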
Though keep in mind that with partitioning it pretty much creates a file for every partition key for every minor fragment on the cluster, so reducing the number of fragments may be needed depending on the number of keys, the data size, etc. Others may have different experiences and additional input.

--Andries

> On Dec 4, 2015, at 5:51 AM, John Omernik <j...@omernik.com> wrote:
>
> Try looking into the "sort.external.spill.directories" and
> "sort.external.spill.fs" settings under drill.exec. These "may" help with
> that memory pressure, but I am not an expert.
>
> John
>
> On Fri, Dec 4, 2015 at 7:22 AM, Daniel Garcia <daniel.gar...@eurotaxglass.com> wrote:
>
>> Hi,
>>
>> I was trying to dump into a Parquet file the data contained in a very big
>> MySQL table.
>>
>> After setting store.format to parquet and the compression to snappy, I
>> used a query like:
>>
>> Create table dfs.tmp.`file.parquet` as (select … from mysq.schema.table);
>>
>> The problem I found is that the data is too big to fit into memory,
>> so I get a GC overhead exception and the drillbit process crashes.
>>
>> I've been trying to find a configuration setting to spill to disk
>> when this happens, but I had no luck.
>>
>> Can this be done?
>>
>> Regards,
>>
>> Daniel García
>> Continuous Improvement Team