Hi!
I just got back from travelling. I ran a test and Drill did do what I
believe you're after.
Here's the test JSON file
➜ ~ cat /tmp/foo/bar.json
{"id":"T06125309","cust_id":"A20","num":"VAB6169028"}
And here's my Drill session.
apache drill> use dfs.tmp;
ok true
summary Default schema changed to [dfs.tmp]
apache drill (dfs.tmp)> select * from foo;
id T06125309
cust_id A20
num VAB6169028
apache drill (dfs.tmp)> create schema (id varchar, cust_id varchar, num
varchar, tranamt double not null default '1.11'
) for table foo;
ok true
summary Created schema for [foo]
1 row selected (0.257 seconds)
apache drill (dfs.tmp)> select * from foo;
id T06125309
cust_id A20
num VAB6169028
tranamt 1.11
On 2023/07/27 07:38, Prabhakar Bhosale wrote:
Hi James,
Any advice on the problem on schema for json file as mentioned in
my previous mail?
Regards
Prabhakar
On Mon, Jul 24, 2023 at 3:41 PM Prabhakar Bhosale <[email protected]>
<[email protected]>
wrote:
Hi James,
I tried the same on the JSON file and it is still not working.
Below is JSON file content
*{"id":"T06125309","cust_id":"A20","num":"VAB6169028"}*
The contents of the ".query.schema" file are as follows
*{ "table" : "mystore.`TEST_MOD`", "schema" : { "type" :
"tuple_schema", "columns" : [ { "name" : "ID", "type"
: "VARCHAR", "mode" : "OPTIONAL" }, { "name" :
"CUST_ID", "type" : "VARCHAR", "mode" : "OPTIONAL" },
{ "name" : "NUM", "type" : "VARCHAR", "mode" :
"OPTIONAL" }, { "name" : "TRANAMT", "type" :
"DOUBLE", "mode" : "REQUIRED", "properties" : {
"drill.default" : "1.11" } } ] }, "version" : 1}*
When I fire the query below, I expect the value of TRANAMT column to be
*1.11* but it gives out value as *NULL*
*select A.ID <http://A.ID> <http://A.ID>, A.CUST_ID, A.NUM, A.TRANAMT from
table(mystore.`TEST_MOD`(schema =>
'path=`/archived_files_nw/TEST_MOD/.drill.schema`')) A*
So essentially it is not considering the schema given at the query
execution time. Please let me know if I am doing anything incorrectly.
Thanks and Regards
Prabhakar
On Wed, Jul 12, 2023 at 4:33 PM James Turton <[email protected]>
<[email protected]> wrote:
Hi Prabhakar
From what I recall, Drill won't consider a provided schema when
querying Parquet because Parquet files bundle their own schema. You
might need to use a SQL function like COALESCE(TRAN_AMOUNT, 1.11) and
possibly put that in a SQL view for reuse.
Regards
James
On 2023/07/11 18:40, Prabhakar Bhosale wrote:
Hi Team,
I am using drill 1.20.1 with parquet files.
I have two parquet files in a directory with one column missing in one
file. When I query the directory it gives me NULL values for all those
rows
which are from the file where that column is missing.
But I want a specific value for that column instead of NULL. So I
have created the schema as given below. But even after creating it is
still
returning the NULL value. Please let me know what is going wrong.
I have also ensured that storage.table.user_schema_file=true at system
level.
The files are stored on linux mount point.
The name of the missing column is "TRAN_AMOUNT".
The schema is as below
{
"table" : "archive.default.`executions`",
"schema" : {
"type" : "tuple_schema",
"columns" : [
{
"name" : "EXEC_ID",
"type" : "VARCHAR",
"mode" : "OPTIONAL"
},
{
"name" : "CUST_ID",
"type" : "VARCHAR",
"mode" : "OPTIONAL"
},
{
"name" : "CELL_ID",
"type" : "VARCHAR",
"mode" : "OPTIONAL"
},
{
"name" : "TRAN_AMOUNT",
"type" : "FLOAT",
"mode" : "REQUIRED",
"properties" : {
"drill.default" : "1.11"
}
}
]
},
"version" : 1
}