You can also try adding an ORDER BY at the end of the CTAS, e.g.:

create table dfs.tmp.`mytable_export` as
select ID, NAME, STATUS, GROUP, from_unixtime(etl_date/1000) as etl_date
from dfs.root.`/location/of/table/to/convert/mytable`
ORDER BY ID;
Only use this as needed, as the sort is expensive and mostly useful for tables that will be read frequently. First try Kunal's suggestion, as it should be much less expensive.

--Andries

On 11/16/17, 10:28 PM, "Kunal Khatua" <kkha...@mapr.com> wrote:

It might be that your parallelization is causing it to generate 4 files where only <= 3 files are sufficient. Try experimenting with planner.width.max_per_query, setting it to a value of 3 ... that might help.

https://drill.apache.org/docs/configuration-options-introduction/

-----Original Message-----
From: Reed Villanueva [mailto:rvillanu...@ucera.org]
Sent: Thursday, November 16, 2017 5:29 PM
To: user@drill.apache.org
Subject: sqlline parquet to tsv filesize imbalance causing slow sqoop export to MS sql server

I am new to using Drill and am trying to convert a table stored on the hadoop dfs as .parquet to .tsv format using the sqlline that came with the Drill package. The problem is that when doing this, the tsv files are poorly 'balanced'. When checking the sizes of the converted files, I see:

+-----------+----------------------------+
| Fragment  | Number of records written  |
+-----------+----------------------------+
| 1_3       | 1                          |
| 1_1       | 306955                     |
| 1_2       | 493009                     |
| 1_0       | 698228                     |
+-----------+----------------------------+

[mapr@mnode02 mytable_export]$ ls -l
total 486470
-rwxr-xr-x 1 mapr mapr 105581719 Oct 19 10:25 1_0_0.tsv
-rwxr-xr-x 1 mapr mapr 155385226 Oct 19 10:25 1_1_0.tsv
-rwxr-xr-x 1 mapr mapr 237176680 Oct 19 10:25 1_2_0.tsv
-rwxr-xr-x 1 mapr mapr       279 Oct 19 10:25 1_3_0.tsv

So when trying to export the files from hadoop using sqoop export, the imbalance is (I think) causing the export progress to go very slowly for some of the job mappers (when checking the hadoop applications web interface). My question, then, is whether there is a way to control the size and number of the tsv files being created. Ultimately I am trying to export the table in tsv format to a Microsoft SQL Server DB (and can't alleviate the slowdown with --batch or --direct, because sqoop apparently does not support those options for MS SQL Server, nor is there a way to sqoop export parquet to SQL Server).

The SQL I am using to convert the table is mytable.sql:

alter session set `store.format`='tsv';
create table dfs.tmp.`mytable_export` as
select ID, NAME, STATUS, GROUP, from_unixtime(etl_date/1000) as etl_date
from dfs.root.`/location/of/table/to/convert/mytable`;

which I run with this script (bash myscript.sh mytable):

#!/bin/bash
...
tablename=$1
sqldir="/path/to/sql/to/run/"$tablename".sql"
echo $sqldir
...
# write table to tsv
/opt/mapr/drill/drill-1.8.0/bin/sqlline \
    -u jdbc:drill:zk=mnode01:5181,mnode02:5181,mnode03:5181 \
    -n username \
    -p password \
    --run=$sqldir
...

Any suggestions or advice would be appreciated, thanks.
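For reference, a rough sketch of Kunal's suggestion applied to mytable.sql might look like the following. The option name planner.width.max_per_query comes from the Drill docs linked above and the value 3 matches Kunal's note; this is untested here, and GROUP is backquoted in the sketch only because it is a reserved word in Drill SQL.

-- cap this query's parallelization so at most 3 writer fragments
-- (and therefore at most 3 TSV output files) are created
alter session set `planner.width.max_per_query` = 3;
alter session set `store.format` = 'tsv';
create table dfs.tmp.`mytable_export` as
select ID, NAME, STATUS, `GROUP`, from_unixtime(etl_date/1000) as etl_date
from dfs.root.`/location/of/table/to/convert/mytable`;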