Re: Retrieve Column Types from Parquet File?

Boaz Ben-Zvi Mon, 25 Sep 2017 17:13:33 -0700

 Hi Max,

       Not via Drill, but you can find the column data types of Parquet 
files with parquet-tools (see 
https://github.com/apache/parquet-mr/tree/master/parquet-tools ).
See the example below, which runs the schema command with the -d option.


       --   Boaz

$  cd ~/parquet-mr/parquet-tools
$  java -jar ./target/parquet-tools-1.9.1-SNAPSHOT.jar schema -d /data/tmp/ 
message test {
  required int64 row_count;
  required int32 gby_int32;
  required int32 gby_int32_rand;
  required binary gby_string (UTF8);
  required float gby_float;
  required int32 gby_date (DATE);
  required int64 gby_timestamp (TIMESTAMP_MILLIS);
  required binary gby_same (UTF8);
  optional binary gby_rand (UTF8);
  required int32 int32_field;
  required int64 int64_field;
  required int64 int64_rand;
  required boolean boolean_field;
  required float float_field;
  required float float_rand;
  required double double_field;
  required double double_rand;
}

creator: parquet-mr version 1.8.2-SNAPSHOT (build 0cfa025d6ffeee07cb0fa2125c977185b849e5c9)
extra: writer.model.name = example

file schema: test
--------------------------------------------------------------------------------
row_count: REQUIRED INT64 R:0 D:0
gby_int32: REQUIRED INT32 R:0 D:0
gby_int32_rand: REQUIRED INT32 R:0 D:0
gby_string: REQUIRED BINARY O:UTF8 R:0 D:0
gby_float: REQUIRED FLOAT R:0 D:0
gby_date: REQUIRED INT32 O:DATE R:0 D:0
gby_timestamp: REQUIRED INT64 O:TIMESTAMP_MILLIS R:0 D:0
gby_same: REQUIRED BINARY O:UTF8 R:0 D:0
gby_rand: OPTIONAL BINARY O:UTF8 R:0 D:1
int32_field: REQUIRED INT32 R:0 D:0
int64_field: REQUIRED INT64 R:0 D:0
int64_rand: REQUIRED INT64 R:0 D:0
boolean_field: REQUIRED BOOLEAN R:0 D:0
float_field: REQUIRED FLOAT R:0 D:0
float_rand: REQUIRED FLOAT R:0 D:0
double_field: REQUIRED DOUBLE R:0 D:0
double_rand: REQUIRED DOUBLE R:0 D:0

row group 1: RC:702512 TS:134255368 OFFSET:4
--------------------------------------------------------------------------------
row_count:  INT64 UNCOMPRESSED DO:0 FPO:4 SZ:5620388/5620388/1.00 VC:702512 ENC:PLAIN,BIT_PACKED
gby_int32:  INT32 UNCOMPRESSED DO:0 FPO:5620392 SZ:2810169/2810169/1.00 VC:702512 ENC:PLAIN,BIT_PACKED
gby_int32_rand:  INT32 UNCOMPRESSED DO:0 FPO:8430561 SZ:2810169/2810169/1.00 VC:702512 ENC:PLAIN,BIT_PACKED
gby_string:  BINARY UNCOMPRESSED DO:0 FPO:11240730 SZ:35481138/35481138/1.00 VC:702512 ENC:PLAIN,BIT_PACKED
gby_float:  FLOAT UNCOMPRESSED DO:0 FPO:46721868 SZ:2810169/2810169/1.00 VC:702512 ENC:PLAIN,BIT_PACKED
. . . . . . . . . 
 . . . . . . . . .
  . . . . . . . . . .
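
The field/type table asked about in the original question can be derived 
from the "message test { ... }" part of the output above. Here is a minimal 
sketch in Python, assuming a flat schema (no nested groups) and the line 
layout shown above. The names FIELD_RE, parse_schema and SAMPLE are 
hypothetical, made up for this illustration; they are not part of 
parquet-tools or Drill.

```python
import re

# Matches one primitive field line of the schema text, for example:
#   required binary gby_string (UTF8);
# Assumption: flat schema; lines that open or close groups are skipped.
FIELD_RE = re.compile(
    r"^\s*(required|optional|repeated)\s+(\S+)\s+(\w+)(?:\s+\((.+)\))?;\s*$"
)


def parse_schema(text):
    """Return (field, type) pairs, preferring the annotation when present."""
    rows = []
    for line in text.splitlines():
        m = FIELD_RE.match(line)
        if m:
            _repetition, physical, name, annotation = m.groups()
            # e.g. binary (UTF8) is reported as UTF8, plain int64 as int64
            rows.append((name, annotation or physical))
    return rows


# A few lines copied from the schema output shown above.
SAMPLE = """\
message test {
  required int64 row_count;
  required binary gby_string (UTF8);
  required int32 gby_date (DATE);
  optional binary gby_rand (UTF8);
}
"""

if __name__ == "__main__":
    for name, type_name in parse_schema(SAMPLE):
        print(f"| {name:<12} | {type_name:<6} |")
```

The sketch reports the annotation (UTF8, DATE, ...) when there is one, since 
that is closer to what a user thinks of as the column type than the physical 
type (binary, int32, ...). Mapping these to names like "string" or "date" 
would be one more lookup table on top.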

On 9/23/17, 10:10 AM, "Max Orelus" <[email protected]> wrote:

    Hi,
    
    I recently started learning about using Apache Drill and I've been
    trying to figure out something for a while now, but I can't seem to find
    any resources that document what I'm trying to do. Essentially, I'm
    trying to query the data types for each column that I have within a
    Parquet file using Drill. I've scanned over the documentation on the
    Drill website, but nothing I have tried has worked.
    
    I will admit that I'm not by any means a database administrator, so this
    is somewhat outside my realm of knowledge. I'm a web developer who is
    integrating Drill querying within my applications.
    
    So for instance, I have a parquet file that has the following columns:
    
    name | address | age | occupation | timestamp
    
    I would like to be able to query that parquet file and find out what
    each column type is in the following manner:
    
    | field      | type   |
    |------------|--------|
    | name       | string |
    | address    | string |
    | age        | int    |
    | occupation | string |
    | timestamp  | date   |
    
    If you can point me in the right direction or potentially give me an
    example of how I would write a query to output the above, I would really
    appreciate it.
    
    Thank you for your time.
    
    Best regards,
    
    --  
    Max Orelus
    +1 (202) 361-9946
    [email protected]
