Hi Max,
Not via Drill, but you can find the data types in Parquet file(s) by
using parquet-tools (see
https://github.com/apache/parquet-mr/tree/master/parquet-tools ).
See the example below.
-- Boaz
$ cd ~/parquet-mr/parquet-tools
$ java -jar ./target/parquet-tools-1.9.1-SNAPSHOT.jar schema -d /data/tmp/
message test {
required int64 row_count;
required int32 gby_int32;
required int32 gby_int32_rand;
required binary gby_string (UTF8);
required float gby_float;
required int32 gby_date (DATE);
required int64 gby_timestamp (TIMESTAMP_MILLIS);
required binary gby_same (UTF8);
optional binary gby_rand (UTF8);
required int32 int32_field;
required int64 int64_field;
required int64 int64_rand;
required boolean boolean_field;
required float float_field;
required float float_rand;
required double double_field;
required double double_rand;
}
creator: parquet-mr version 1.8.2-SNAPSHOT (build 0cfa025d6ffeee07cb0fa2125c977185b849e5c9)
extra: writer.model.name = example
file schema: test
--------------------------------------------------------------------------------
row_count: REQUIRED INT64 R:0 D:0
gby_int32: REQUIRED INT32 R:0 D:0
gby_int32_rand: REQUIRED INT32 R:0 D:0
gby_string: REQUIRED BINARY O:UTF8 R:0 D:0
gby_float: REQUIRED FLOAT R:0 D:0
gby_date: REQUIRED INT32 O:DATE R:0 D:0
gby_timestamp: REQUIRED INT64 O:TIMESTAMP_MILLIS R:0 D:0
gby_same: REQUIRED BINARY O:UTF8 R:0 D:0
gby_rand: OPTIONAL BINARY O:UTF8 R:0 D:1
int32_field: REQUIRED INT32 R:0 D:0
int64_field: REQUIRED INT64 R:0 D:0
int64_rand: REQUIRED INT64 R:0 D:0
boolean_field: REQUIRED BOOLEAN R:0 D:0
float_field: REQUIRED FLOAT R:0 D:0
float_rand: REQUIRED FLOAT R:0 D:0
double_field: REQUIRED DOUBLE R:0 D:0
double_rand: REQUIRED DOUBLE R:0 D:0
row group 1: RC:702512 TS:134255368 OFFSET:4
--------------------------------------------------------------------------------
row_count: INT64 UNCOMPRESSED DO:0 FPO:4 SZ:5620388/5620388/1.00 VC:702512 ENC:PLAIN,BIT_PACKED
gby_int32: INT32 UNCOMPRESSED DO:0 FPO:5620392 SZ:2810169/2810169/1.00 VC:702512 ENC:PLAIN,BIT_PACKED
gby_int32_rand: INT32 UNCOMPRESSED DO:0 FPO:8430561 SZ:2810169/2810169/1.00 VC:702512 ENC:PLAIN,BIT_PACKED
gby_string: BINARY UNCOMPRESSED DO:0 FPO:11240730 SZ:35481138/35481138/1.00 VC:702512 ENC:PLAIN,BIT_PACKED
gby_float: FLOAT UNCOMPRESSED DO:0 FPO:46721868 SZ:2810169/2810169/1.00 VC:702512 ENC:PLAIN,BIT_PACKED
. . . . . . . . .
. . . . . . . . .
. . . . . . . . . .
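To get from the schema output above to a simple field/type listing, the "message" block can be post-processed with a small script. The following is only a minimal sketch: it assumes the output has the flat "required/optional <type> <name> (<annotation>);" shape shown above, with no nested groups and no parameterized annotations. The function name parse_schema and the FRIENDLY mapping are made up for illustration and are not part of parquet-tools or Drill.

```python
import re

# Matches one primitive field line of a Parquet message schema, e.g.
#   "required binary gby_string (UTF8);"
# Groups: repetition, physical type, field name, optional annotation.
FIELD_RE = re.compile(
    r"^\s*(required|optional|repeated)\s+(\w+)\s+(\w+)(?:\s+\((\w+)\))?;\s*$"
)

# Hypothetical mapping from (physical type, annotation) to a friendlier
# name. This is an assumption for illustration; extend it as needed.
FRIENDLY = {
    ("binary", "UTF8"): "string",
    ("int32", "DATE"): "date",
    ("int64", "TIMESTAMP_MILLIS"): "timestamp",
}


def parse_schema(text):
    """Return a list of (field name, type name) pairs found in text."""
    fields = []
    for line in text.splitlines():
        match = FIELD_RE.match(line)
        if not match:
            # Skips "message test {", "}", and the detailed column lines.
            continue
        _repetition, physical, name, annotation = match.groups()
        # Fall back to the physical type when no mapping is defined.
        type_name = FRIENDLY.get((physical, annotation), physical)
        fields.append((name, type_name))
    return fields


# A shortened copy of the schema output shown above.
SAMPLE = """\
message test {
required int64 row_count;
required binary gby_string (UTF8);
required int32 gby_date (DATE);
required int64 gby_timestamp (TIMESTAMP_MILLIS);
optional binary gby_rand (UTF8);
required boolean boolean_field;
}
"""

if __name__ == "__main__":
    for name, type_name in parse_schema(SAMPLE):
        print(f"| {name} | {type_name} |")
```

In practice the text passed to parse_schema would be the captured output of the schema command rather than the inline SAMPLE string.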
On 9/23/17, 10:10 AM, "Max Orelus" wrote:
Hi,
I recently started learning Apache Drill, and I've been
trying to figure something out for a while now, but I can't find
any resources that document what I'm trying to do. Essentially, I'm
trying to query the data type of each column in a
Parquet file using Drill. I've gone through the documentation on the
Drill website, but nothing I have tried has worked.
I will admit that I'm not by any means a database administrator, so this
is somewhat outside my area of knowledge. I'm a web developer who is
integrating Drill queries into my applications.
So for instance, I have a Parquet file that has the following columns:
name | address | age | occupation | timestamp
I would like to be able to query that Parquet file and find out the
type of each column, in the following form:
| field      | type   |
|------------|--------|
| name       | string |
| address    | string |
| age        | int    |
| occupation | string |
| timestamp  | date   |
If you can point me in the right direction or potentially give me an
example of how I would write a query to output the above, I would really
appreciate it.
Thank you for your time.
Best regards,
--
Max Orelus
+1 (202) 361-9946