Re: table schema for parquet file is not working

Prabhakar Bhosale Wed, 09 Aug 2023 20:41:23 -0700

THanks James for confirming. This helps a lot.  I can move to 1.21.1. But
before that I need to try the same with the parquet also. Will check that
too and make a decision.


Regards
Prabhakar

On Tue, Aug 8, 2023 at 12:51 PM James Turton <[email protected]> wrote:

> Okay I get the same result as you when I try with Drill 1.20.3 so I
> guess that there's a JSON reader bug that got fixed somewhere in between
> 1.20.3 and 1.21.1. Do you need to stay on Drill 1.20 or can you upgrade
> to 1.21?
>
> On 2023/08/07 08:32, Prabhakar Bhosale wrote:
> > hi James,
> > I re-tried the steps once again after looking at the output you shared
> with
> > me. But still I am not able to get the expected output on drill 1.20.1.
> > Then I downloaded drill 1.21.1 and tried the same steps. This time I got
> > the expected output.
> > So can you please try the same on drill 1.20.1 once if possible? thx
> >
> > REgards
> > Prabhakar
> >
> > On Wed, Aug 2, 2023 at 3:11 PM James Turton <[email protected]> wrote:
> >
> >> Hi!
> >>
> >> I just got back from travelling. I ran a test and Drill did do what I
> >> believe you're after.
> >>
> >> Here's the test JSON file
> >>
> >> ➜  ~ cat /tmp/foo/bar.json
> >> {"id":"T06125309","cust_id":"A20","num":"VAB6169028"}
> >>
> >>
> >> And here's my Drill session.
> >>
> >> apache drill> use dfs.tmp;
> >> ok       true
> >> summary  Default schema changed to [dfs.tmp]
> >>
> >> apache drill (dfs.tmp)> select * from foo;
> >> id       T06125309
> >> cust_id  A20
> >> num      VAB6169028
> >>
> >> apache drill (dfs.tmp)> create schema (id varchar, cust_id varchar, num
> >> varchar, tranamt double not null default '1.11'
> >> ) for table foo;
> >> ok       true
> >> summary  Created schema for [foo]
> >>
> >> 1 row selected (0.257 seconds)
> >> apache drill (dfs.tmp)> select * from foo;
> >> id       T06125309
> >> cust_id  A20
> >> num      VAB6169028
> >> tranamt  1.11
> >>
> >>
> >> On 2023/07/27 07:38, Prabhakar Bhosale wrote:
> >>
> >> Hi James,
> >> Any advice on the problem on schema for json file as mentioned in
> >> my previous mail?
> >>
> >> Regards
> >> Prabhakar
> >>
> >> On Mon, Jul 24, 2023 at 3:41 PM Prabhakar Bhosale <
> [email protected]> <[email protected]>
> >> wrote:
> >>
> >>
> >> Hi James,
> >>
> >> I tried the same on the JSON file and it is still not working.
> >>
> >> Below is JSON file content
> >>
> >> *{"id":"T06125309","cust_id":"A20","num":"VAB6169028"}*
> >>
> >> The contents of the ".query.schema" file are as follows
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >>
> >> *{  "table" : "mystore.`TEST_MOD`",  "schema" : {    "type" :
> >> "tuple_schema",    "columns" : [      {        "name" : "ID",
> "type"
> >> : "VARCHAR",        "mode" : "OPTIONAL"      },      {        "name" :
> >> "CUST_ID",        "type" : "VARCHAR",        "mode" : "OPTIONAL"      },
> >>    {        "name" : "NUM",        "type" : "VARCHAR",        "mode" :
> >> "OPTIONAL"      },      {        "name" : "TRANAMT",        "type" :
> >> "DOUBLE",        "mode" : "REQUIRED",        "properties" : {
> >> "drill.default" : "1.11"        }      }    ]  },  "version" : 1}*
> >>
> >> When I fire the query below, I expect the value of TRANAMT column to be
> >> *1.11* but it gives out value as *NULL*
> >>
> >> *select  A.ID <http://A.ID> <http://A.ID>, A.CUST_ID, A.NUM, A.TRANAMT
> from
> >> table(mystore.`TEST_MOD`(schema =>
> >> 'path=`/archived_files_nw/TEST_MOD/.drill.schema`')) A*
> >>
> >> So essentially it is not considering the schema given at the query
> >> execution time. Please let me know if I am doing anything incorrectly.
> >>
> >>
> >> Thanks and Regards
> >> Prabhakar
> >>
> >>
> >> On Wed, Jul 12, 2023 at 4:33 PM James Turton <[email protected]> <
> [email protected]> wrote:
> >>
> >>
> >> Hi Prabhakar
> >>
> >>   From what I recall, Drill won't consider a provided schema when
> >> querying Parquet because Parquet files bundle their own schema. You
> >> might need to use a SQL function like COALESCE(TRAN_AMOUNT, 1.11) and
> >> possibly put that in a SQL view for reuse.
> >>
> >> Regards
> >> James
> >>
> >> On 2023/07/11 18:40, Prabhakar Bhosale wrote:
> >>
> >> Hi Team,
> >> I am using drill 1.20.1 with parquet files.
> >>
> >> I have two parquet files in a directory with one column missing in one
> >> file. When I query the directory it gives me NULL values for all those
> >>
> >> rows
> >>
> >> which are from the file where that column is missing.
> >>
> >> But I want a specific value for that column instead of NULL. So I
> >> have created the schema as given below. But even after creating it is
> >>
> >> still
> >>
> >> returning the NULL value. Please let me know what is going wrong.
> >>
> >> I have also ensured that storage.table.user_schema_file=true at system
> >> level.
> >>
> >> The files are stored on linux mount point.
> >> The name of the missing column is "TRAN_AMOUNT".
> >>
> >>
> >>
> >> The schema is as below
> >>
> >> {
> >>     "table" : "archive.default.`executions`",
> >>     "schema" : {
> >>       "type" : "tuple_schema",
> >>       "columns" : [
> >>         {
> >>           "name" : "EXEC_ID",
> >>           "type" : "VARCHAR",
> >>           "mode" : "OPTIONAL"
> >>         },
> >>         {
> >>           "name" : "CUST_ID",
> >>           "type" : "VARCHAR",
> >>           "mode" : "OPTIONAL"
> >>         },
> >>         {
> >>           "name" : "CELL_ID",
> >>           "type" : "VARCHAR",
> >>           "mode" : "OPTIONAL"
> >>         },
> >>         {
> >>           "name" : "TRAN_AMOUNT",
> >>           "type" : "FLOAT",
> >>           "mode" : "REQUIRED",
> >>           "properties" : {
> >>             "drill.default" : "1.11"
> >>       }
> >>         }
> >>       ]
> >>     },
> >>     "version" : 1
> >> }
> >>
> >>
> >>
> >>
>
>

Re: table schema for parquet file is not working

Reply via email to