You can also try adding an ORDER BY at the end of the CTAS, i.e.:

create table dfs.tmp.`mytable_export`
    as select
        ID, NAME, STATUS, GROUP, from_unixtime(etl_date/1000) as etl_date
    from dfs.root.`/location/of/table/to/convert/mytable`
    ORDER BY ID;

Only use this as needed, as the sort is expensive; it is mostly worthwhile for
tables that will be read frequently. First try Kunal’s suggestion, as it should
be much less expensive.
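
For example, a minimal sketch of applying Kunal’s suggestion ahead of the CTAS
(the option is planner.width.max_per_query, documented at the link in his mail;
the columns and path below are just the ones from your existing query):

alter session set `planner.width.max_per_query` = 3;
create table dfs.tmp.`mytable_export`
    as select
        ID, NAME, STATUS, GROUP, from_unixtime(etl_date/1000) as etl_date
    from dfs.root.`/location/of/table/to/convert/mytable`;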

--Andries


On 11/16/17, 10:28 PM, "Kunal Khatua" <kkha...@mapr.com> wrote:

    It might be that your parallelization is causing it to generate 4 files,
where only <= 3 files are sufficient.
    
    Try experimenting with setting planner.width.max_per_query to a value of 3
... that might help.
    
    https://drill.apache.org/docs/configuration-options-introduction/
    
    
    -----Original Message-----
    From: Reed Villanueva [mailto:rvillanu...@ucera.org] 
    Sent: Thursday, November 16, 2017 5:29 PM
    To: user@drill.apache.org
    Subject: sqlline parquet to tsv filesize imbalance causing slow sqoop
export to MS SQL Server
    
    I am new to using Drill and am trying to convert a table stored on Hadoop
DFS as .parquet to .tsv format using the sqlline that comes with the Drill
package. The problem is that the resulting tsv files are poorly 'balanced'.
When checking the sizes of the converted files, I see:
    
    +-----------+----------------------------+
    | Fragment  | Number of records written  |
    +-----------+----------------------------+
    | 1_3       | 1                          |
    | 1_1       | 306955                     |
    | 1_2       | 493009                     |
    | 1_0       | 698228                     |
    +-----------+----------------------------+
    
    [mapr@mnode02 mytable_export]$ ls -l
    total 486470
    -rwxr-xr-x 1 mapr mapr 105581719 Oct 19 10:25 1_0_0.tsv
    -rwxr-xr-x 1 mapr mapr 155385226 Oct 19 10:25 1_1_0.tsv
    -rwxr-xr-x 1 mapr mapr 237176680 Oct 19 10:25 1_2_0.tsv
    -rwxr-xr-x 1 mapr mapr       279 Oct 19 10:25 1_3_0.tsv
    
    So when trying to export the files from Hadoop using sqoop export, the
imbalance (I think) is causing the export progress to go very slowly for some
of the job mappers (based on the Hadoop applications web interface).
    
    My question, then, is whether there is a way to control the size and number
of the tsv files being created. Ultimately I am trying to export the table in
tsv format to a Microsoft SQL Server DB (and I can't alleviate the slowdown
with --batch or --direct, because sqoop apparently does not support those
options for MS SQL Server, nor is there a way to sqoop export parquet to SQL
Server).
    
    The SQL I am using to convert the table is mytable.sql:
    
    alter session set `store.format`='tsv';
    create table dfs.tmp.`mytable_export`
    as select
        ID, NAME, STATUS, GROUP, from_unixtime(etl_date/1000) as etl_date
    from dfs.root.`/location/of/table/to/convert/mytable`;
    
    which I run with this script (bash myscript.sh mytable):
    
    #!/bin/bash
    ...
    tablename=$1
    sqldir="/path/to/sql/to/run/"$tablename".sql"
    echo $sqldir
    ...
    #write table to tsv
    /opt/mapr/drill/drill-1.8.0/bin/sqlline \
        -u jdbc:drill:zk=mnode01:5181,mnode02:5181,mnode03:5181 \
        -n username \
        -p password \
        --run=$sqldir
    ...
    
    Any suggestions or advice would be appreciated, thanks.
    
    
