Re: CT from parquet to CSV seems to not properly encode to UTF8

2018-07-16 Thread Kunal Khatua
Hi Carlos

It looks similar to an issue reported previously:
https://lists.apache.org/thread.html/1f3d4c427690c06f1992bc5070f355689ccc5b1ed8cc3678ad8e9106@
 

Could you try setting the JVM's file encoding to UTF-8 and retry? If that does
not work, please file a JIRA at https://issues.apache.org
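
To illustrate why the default encoding can matter, here is a small Python
sketch (Python stands in for the JVM; this is not Drill's writer code). It
assumes, without confirmation from this thread, that the text writer encodes
strings with a default charset that cannot represent "é". Under that
assumption the character is replaced by "?", which matches the reported
Montr?al, while encoding as UTF-8 keeps it.

```python
# Illustration only: the charset names below are assumptions, not values
# read from a Drill installation.
city = "Montréal"

# A writer whose charset has no mapping for "é" substitutes "?".
lossy = city.encode("ascii", errors="replace")

# A writer that uses UTF-8 stores "é" as the two bytes 0xC3 0xA9.
utf8 = city.encode("utf-8")

print(lossy)  # b'Montr?al'
print(utf8)   # b'Montr\xc3\xa9al'
```

On the JVM the default charset is controlled by the standard file.encoding
system property, so one way to try the suggestion above would be to append
-Dfile.encoding=UTF-8 to DRILL_JAVA_OPTS in drill-env.sh and restart the
drillbits.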

Thanks
Kunal


Re: CT from parquet to CSV seems to not properly encode to UTF8

2018-07-16 Thread Carlos Derich
It seems to be an issue only with CSV/TSV files.

I tried writing the output as JSON, and that handles the encoding correctly.

alter session set `store.format`='json'
create table dfs.tmp.test3 as select `city` from dfs.parquets.`file`

Returns:

{"city": "Montréal"}


Additional info:

parquet-tools schema:

message root {
  optional binary city (UTF8);
}
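
One way to check whether the accent is lost when the CSV is written, rather
than when it is read back or displayed, is to look at the raw bytes of the
generated file. The helper below is a hypothetical diagnostic sketch, not part
of Drill; the function name and the labels it returns are my own.

```python
def accent_status(raw: bytes) -> str:
    """Classify how 'Montréal' appears in the raw bytes of a text file."""
    if b"Montr\xc3\xa9al" in raw:
        return "utf-8"      # é stored as the UTF-8 byte pair C3 A9
    if b"Montr\xe9al" in raw:
        return "latin-1"    # é stored as the single byte E9
    if b"Montr?al" in raw:
        return "replaced"   # é was already replaced by a literal "?"
    return "not found"
```

For example, accent_status(open(path, "rb").read()), where path points at one
of the CSV files produced by the CTAS (the location depends on your workspace
configuration). A result of "replaced" would mean the character was lost at
write time and no reader-side setting can recover it.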




CT from parquet to CSV seems to not properly encode to UTF8

2018-07-16 Thread Carlos Derich
Hello guys, hope everyone is well.

I am having an encoding issue when converting a table from Parquet into CSV
files, and I wonder if someone could shed some light on it.

One of my data sets has data in French with many accented characters, and it
is persisted in HDFS as Parquet.


When I query the Parquet table with *select `city` from
dfs.parquets.`file`*, it returns the data properly encoded:


*city*

*Montréal*


Then I convert this table into a CSV file with the following query:

*alter session set `store.format`='csv'*
*create table dfs.csvs.`converted` as select * from dfs.parquets.`file`*


When I then run a select query on it, the data comes back incorrectly encoded:

*select columns[0] from dfs.csvs.`converted`*

Returns:

*Montr?al*


My storage plugin is pretty standard:

"csv" : {
  "type" : "text",
  "extensions" : [ "csv" ],
  "delimiter" : ",",
  "skipFirstLine": true
},

Should I explicitly add a charset option somewhere? I couldn't find
anything helpful in the docs.

I tried adding *export DRILL_JAVA_OPTS="$DRILL_JAVA_OPTS
-Dsaffron.default.charset=UTF-8"* to the drill-env.sh file, but no luck.

Has anyone run into similar issues?

Thank you!


Re: Best Practice to check Drillbit status(Cluster mode)

2018-07-16 Thread Abhishek Girish
I think the logs may be the only way to figure it out at present. You
could set up a watch on your logs to be informed of such events. For
notifications, I would suggest filing an enhancement JIRA; if it gathers
enough attention, perhaps someone will volunteer to work on it or comment.
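
Until something like that exists, one possible stopgap is to compare the nodes
you expect against what sys.drillbits reports, since a Drillbit that has
dropped out of the cluster may no longer be listed there. The sketch below
assumes Drill's REST endpoint (POST /query.json) is reachable on the default
web port 8047 without authentication; the host names and function names are
hypothetical placeholders.

```python
import json
import urllib.request

# Hypothetical values: adjust to your own cluster.
DRILL_URL = "http://localhost:8047/query.json"
EXPECTED = {"node1.example.com", "node2.example.com", "node3.example.com"}


def fetch_registered(url=DRILL_URL):
    """Return the set of hostnames currently listed in sys.drillbits."""
    body = json.dumps({
        "queryType": "SQL",
        "query": "select hostname from sys.drillbits",
    }).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        rows = json.load(resp)["rows"]
    return {row["hostname"] for row in rows}


def missing_drillbits(expected, registered):
    """Hostnames that should be in the cluster but are not registered."""
    return sorted(set(expected) - set(registered))


def check_once():
    """Print an alert line if any expected Drillbit is not registered."""
    gone = missing_drillbits(EXPECTED, fetch_registered())
    if gone:
        print("ALERT: drillbits not registered:", ", ".join(gone))
    return gone
```

Calling check_once() from cron or an existing monitoring job, against a
Drillbit that is known to be healthy, would give a basic notification. It will
not catch a process that is hung but still holds its registration, so it
complements the log watch rather than replacing it.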



Re: Best Practice to check Drillbit status(Cluster mode)

2018-07-16 Thread Divya Gehlot
Hi,
Thanks Abhishek!
I would like to be notified when a drillbit process becomes orphaned, that is,
when it gets disconnected from the other running drillbits for some reason. In
my case it is definitely not because of an unclean shutdown, as those
drillbits have been running for months.
I know I can check the logs and kill the orphaned process, which is what I did
in my case, but I would like to have a notification when a drillbit is down.


Thanks,
Divya

On Fri, 13 Jul 2018 at 04:15, Abhishek Girish  wrote:

> Hey Divya,
>
> It would depend on the situation, afaik. The sys.drillbits table contains a
> list of all running Drillbits. If one of the Drillbits has issues and
> cannot stay connected to the cluster, I would assume it would be
> unregistered and may not show up in the output of sys.drillbits. If it's an
> intermittent issue and the Drillbit process maintains its heartbeat
> connection, it may show up in the output.
>
> If you take a look at the logs, you might be able to figure out what is
> causing the issue. There may be orphan Drillbit processes which may have
> been left behind due to a previous unclean shutdown. Can you clean up all
> Drillbit processes (using 'ps -ef | grep -i drillbit' and then a kill -9)
> on nodes where you suspect issues and restart Drillbits?
>
> -Abhishek
>
> On Tue, Jul 10, 2018 at 7:16 PM Divya Gehlot
> wrote:
>
> > Hi,
> > select * from sys.drillbits;
> > What does the above query show if a drillbit process hangs?
> >
> >
> > Thanks
> >
> > On Tue, 10 Jul 2018 at 15:36, Khurram Faraaz  wrote:
> >
> > > You can run the query below and look for the *state* column in the
> > > result of the query. Online drillbits will be marked as ONLINE.
> > >
> > > select * from sys.drillbits;
> > >
> > > - Khurram
> > >
> > > On Tue, Jul 10, 2018 at 12:24 AM, Divya Gehlot <divya.htco...@gmail.com>
> > > wrote:
> > >
> > > > Hi,
> > > > I would like to know the best practice for checking Drillbit status
> > > > in cluster mode.
> > > > I have encountered a scenario where the Drillbit processes appear to
> > > > be running fine, but the Drill Web UI shows some of the Drillbits as
> > > > down.
> > > > When I did an RCA (root cause analysis), I found that the Drillbit
> > > > processes had hung for some reason.
> > > > For now, the alert system I have implemented checks the output of:
> > > >
> > > >
> > > > > drill/bin/drillbit.sh status
> > > >
> > > >
> > > > Is there a better way to catch a hung Drillbit process?
> > > > I would appreciate advice from Drill community users.
> > > >
> > > > Thanks,
> > > > Divya
> > > >
> > >
> >
>