Re: table schema for parquet file is not working

Prabhakar Bhosale Wed, 26 Jul 2023 22:38:41 -0700

Hi James,
Any advice on the problem on schema for json file as mentioned in
my previous mail?


Regards
Prabhakar

On Mon, Jul 24, 2023 at 3:41 PM Prabhakar Bhosale <[email protected]>
wrote:

> Hi James,
>
> I tried the same on the JSON file and it is still not working.
>
> Below is JSON file content
>
> *{"id":"T06125309","cust_id":"A20","num":"VAB6169028"}*
>
> The contents of the ".query.schema" file are as follows
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> *{  "table" : "mystore.`TEST_MOD`",  "schema" : {    "type" :
> "tuple_schema",    "columns" : [      {        "name" : "ID",        "type"
> : "VARCHAR",        "mode" : "OPTIONAL"      },      {        "name" :
> "CUST_ID",        "type" : "VARCHAR",        "mode" : "OPTIONAL"      },
>   {        "name" : "NUM",        "type" : "VARCHAR",        "mode" :
> "OPTIONAL"      },      {        "name" : "TRANAMT",        "type" :
> "DOUBLE",        "mode" : "REQUIRED",        "properties" : {
> "drill.default" : "1.11"        }      }    ]  },  "version" : 1}*
>
> When I fire the query below, I expect the value of TRANAMT column to be
> *1.11* but it gives out value as *NULL*
>
> *select  A.ID <http://A.ID>, A.CUST_ID, A.NUM, A.TRANAMT from
> table(mystore.`TEST_MOD`(schema =>
> 'path=`/archived_files_nw/TEST_MOD/.drill.schema`')) A*
>
> So essentially it is not considering the schema given at the query
> execution time. Please let me know if I am doing anything incorrectly.
>
>
> Thanks and Regards
> Prabhakar
>
>
> On Wed, Jul 12, 2023 at 4:33 PM James Turton <[email protected]> wrote:
>
>> Hi Prabhakar
>>
>>  From what I recall, Drill won't consider a provided schema when
>> querying Parquet because Parquet files bundle their own schema. You
>> might need to use a SQL function like COALESCE(TRAN_AMOUNT, 1.11) and
>> possibly put that in a SQL view for reuse.
>>
>> Regards
>> James
>>
>> On 2023/07/11 18:40, Prabhakar Bhosale wrote:
>> > Hi Team,
>> > I am using drill 1.20.1 with parquet files.
>> >
>> > I have two parquet files in a directory with one column missing in one
>> > file. When I query the directory it gives me NULL values for all those
>> rows
>> > which are from the file where that column is missing.
>> >
>> > But I want a specific value for that column instead of NULL. So I
>> > have created the schema as given below. But even after creating it is
>> still
>> > returning the NULL value. Please let me know what is going wrong.
>> >
>> > I have also ensured that storage.table.user_schema_file=true at system
>> > level.
>> >
>> > The files are stored on linux mount point.
>> > The name of the missing column is "TRAN_AMOUNT".
>> >
>> >
>> >
>> > The schema is as below
>> >
>> > {
>> >    "table" : "archive.default.`executions`",
>> >    "schema" : {
>> >      "type" : "tuple_schema",
>> >      "columns" : [
>> >        {
>> >          "name" : "EXEC_ID",
>> >          "type" : "VARCHAR",
>> >          "mode" : "OPTIONAL"
>> >        },
>> >        {
>> >          "name" : "CUST_ID",
>> >          "type" : "VARCHAR",
>> >          "mode" : "OPTIONAL"
>> >        },
>> >        {
>> >          "name" : "CELL_ID",
>> >          "type" : "VARCHAR",
>> >          "mode" : "OPTIONAL"
>> >        },
>> >        {
>> >          "name" : "TRAN_AMOUNT",
>> >          "type" : "FLOAT",
>> >          "mode" : "REQUIRED",
>> >          "properties" : {
>> >            "drill.default" : "1.11"
>> >      }
>> >        }
>> >      ]
>> >    },
>> >    "version" : 1
>> > }
>> >
>>
>>

Re: table schema for parquet file is not working

Reply via email to