Re: CT from parquet to CSV seems to not properly encode to UTF8
Hi Carlos,

It looks similar to an issue reported previously:
https://lists.apache.org/thread.html/1f3d4c427690c06f1992bc5070f355689ccc5b1ed8cc3678ad8e9106@

Could you try setting the JVM's file encoding to UTF-8 and retry? If that does not work, please file a JIRA at https://issues.apache.org

Thanks,
Kunal

On 7/16/2018 1:25:45 PM, Carlos Derich wrote:
> [...]
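The `Montr?al` symptom is what a character encoder typically produces when the target charset cannot represent `é`. Below is a minimal sketch of that mechanism in Python, used here only for illustration since Drill itself runs on the JVM. It assumes the CSV writer falls back to a non-UTF-8 platform default such as US-ASCII, which the thread does not confirm, and `write_field` is a hypothetical helper, not a Drill function. On the JVM side, the setting suggested above is usually expressed as `-Dfile.encoding=UTF-8` appended to `DRILL_JAVA_OPTS` in drill-env.sh; that this is the intended flag is also an assumption.

```python
def write_field(value: str, charset: str) -> bytes:
    """Encode a field the way a lossy text writer might.

    Characters the charset cannot represent are replaced with '?'.
    """
    return value.encode(charset, errors="replace")


city = "Montréal"

# Assumed failure mode: the writer uses an ASCII-like default charset.
ascii_bytes = write_field(city, "ascii")

# Expected behaviour once the writer uses UTF-8.
utf8_bytes = write_field(city, "utf-8")

print(ascii_bytes.decode("ascii"))  # Montr?al
print(utf8_bytes.decode("utf-8"))   # Montréal
```

If the writer really is using the platform default charset, changing that default should be enough to make the second behaviour the one Drill exhibits.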
Re: CT from parquet to CSV seems to not properly encode to UTF8
It seems to be an issue only with CSV/TSV files. I tried writing the output as JSON and it handles the encoding properly:

    alter session set `store.format`='json';
    create table dfs.tmp.test3 as select `city` from dfs.parquets.`file`;

Returns:

    {"city": "Montréal"}

Additional info, parquet-tools schema:

    message root {
      optional binary city (UTF8);
    }

On Mon, Jul 16, 2018 at 2:49 PM, Carlos Derich wrote:
> [...]
CT from parquet to CSV seems to not properly encode to UTF8
Hello guys, hope everyone is well.

I am having an encoding issue when converting a table from parquet into CSV files. I wonder if someone could shed some light on it?

One of my data sets has data in French with lots of accented characters, and it is persisted in HDFS as parquet.

When I query the parquet table with

    select `city` from dfs.parquets.`file`;

it returns the data properly encoded:

    city
    Montréal

Then I convert this table into a CSV file with the following queries:

    alter session set `store.format`='csv';
    create table dfs.csvs.`converted` as select * from dfs.parquets.`file`;

Then, when I run a select query on it, the data comes back not properly encoded:

    select columns[0] from dfs.csvs.`converted`;

Returns:

    Montr?al

My storage plugin is pretty standard:

    "csv" : {
      "type" : "text",
      "extensions" : [ "csv" ],
      "delimiter" : ",",
      "skipFirstLine": true
    },

Should I explicitly add a charset option somewhere? I couldn't find anything helpful in the docs.

I tried adding

    export DRILL_JAVA_OPTS="$DRILL_JAVA_OPTS -Dsaffron.default.charset=UTF-8"

to the drill-env.sh file, but no luck.

Has anyone run into similar issues?

Thank you!
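Because the converted CSV is read back through Drill, the `?` could have been introduced either when the file was written or when it was read. One way to tell the two apart is to inspect the raw bytes of the generated file. The sketch below is a hedged illustration in Python; `classify_city_bytes` is a hypothetical helper, and it assumes you have copied one of the CSV part files out of HDFS so it can be opened in binary mode.

```python
def classify_city_bytes(raw: bytes) -> str:
    """Guess how 'Montréal' was stored, based on the byte(s) where 'é' belongs."""
    if b"Montr\xc3\xa9al" in raw:
        return "utf-8"      # é stored as C3 A9: the file is fine, suspect the reader
    if b"Montr\xe9al" in raw:
        return "latin-1"    # é stored as the single byte E9
    if b"Montr?al" in raw:
        return "replaced"   # é was already lost when the file was written
    return "unknown"


# In-memory examples; for a real file use open(path, "rb").read(),
# where path is wherever you copied the CSV part file.
print(classify_city_bytes("Montréal\n".encode("utf-8")))    # utf-8
print(classify_city_bytes("Montréal\n".encode("latin-1")))  # latin-1
print(classify_city_bytes(b"Montr?al\n"))                   # replaced
```

A result of `replaced` would point at the CSV writer, while `utf-8` would suggest the data is intact on disk and the problem lies in how it is read or displayed.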
Re: Best Practice to check Drillbit status(Cluster mode)
I think logs may be the only way to figure it out at present. You could set up a watch on your logs to be informed of such events. For notifications, I would suggest filing an enhancement JIRA; if it gathers enough attention, perhaps someone will volunteer to work on it or comment on it.

On Mon, Jul 16, 2018 at 2:08 AM Divya Gehlot wrote:
> [...]
Re: Best Practice to check Drillbit status(Cluster mode)
Hi,

Thanks, Abhishek!

I would like to be notified when a drillbit process becomes orphaned, that is, when it gets disconnected from the other running drillbits for some reason. This is definitely not caused by an unclean shutdown, as those drillbits have been running for months.

I know I can check the logs and kill the orphaned process, which is what I did in my case, but I would like to have a notification when a drillbit goes down.

Thanks,
Divya

On Fri, 13 Jul 2018 at 04:15, Abhishek Girish wrote:

> Hey Divya,
>
> It would depend on the situation, afaik. The sys.drillbits table contains a
> list of all running drillbits. If one of the Drillbits has issues and
> cannot stay connected to the cluster, I would assume it would be
> unregistered and may not show up in the output of sys.drillbits. If it's an
> intermittent issue and the Drillbit process maintains its heartbeat
> connection, it may show up in the output.
>
> If you take a look at the logs, you might be able to figure out what is
> causing the issue. There may be orphan Drillbit processes which may have
> been left behind due to a previous unclean shutdown. Can you clean up all
> Drillbit processes (using 'ps -ef | grep -i drillbit' and then a kill -9)
> on nodes where you suspect issues and restart the Drillbits?
>
> -Abhishek
>
> On Tue, Jul 10, 2018 at 7:16 PM Divya Gehlot wrote:
>
> > Hi,
> >
> >     select * from sys.drillbits;
> >
> > What does the above query show if a drillbit process hangs?
> >
> > Thanks
> >
> > On Tue, 10 Jul 2018 at 15:36, Khurram Faraaz wrote:
> >
> > > You can run the query below and look for the state column in the
> > > result. Online drillbits will be marked as ONLINE.
> > >
> > >     select * from sys.drillbits;
> > >
> > > - Khurram
> > >
> > > On Tue, Jul 10, 2018 at 12:24 AM, Divya Gehlot <divya.htco...@gmail.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > I would like to know the best practice for checking Drillbit status
> > > > in cluster mode.
> > > > I have encountered a scenario where the Drillbit processes appear to
> > > > be running fine, but the Drill Web UI shows some of the Drillbits as
> > > > down.
> > > > Root cause analysis (RCA) showed that the Drillbit processes had hung
> > > > for some reason.
> > > > For now, the alert system I have implemented checks:
> > > >
> > > >     drill/bin/drillbit.sh status
> > > >
> > > > Is there a better way to catch a hung Drillbit process?
> > > > I would appreciate advice from Drill community users.
> > > >
> > > > Thanks,
> > > > Divya
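One way to turn the `sys.drillbits` suggestion into a notification is to compare the hosts you expect against what the table reports. The following is a minimal sketch in Python, not an official Drill tool. It assumes the query is sent to Drill's REST endpoint (`POST /query.json` on the Web UI port), that the response carries a `rows` list, and that each row has `hostname` and `state` fields; the function names and the host list are hypothetical. The request should go to a Drillbit you believe is healthy, and, as noted in the thread, a hung process that still keeps its heartbeat may continue to show up as registered.

```python
import json
import urllib.request


def fetch_drillbit_rows(base_url: str) -> list:
    """Run `select * from sys.drillbits` via the REST API (endpoint assumed)."""
    payload = json.dumps(
        {"queryType": "SQL", "query": "select * from sys.drillbits"}
    ).encode("utf-8")
    request = urllib.request.Request(
        base_url + "/query.json",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response).get("rows", [])


def missing_drillbits(expected_hosts, rows):
    """Return expected hosts that are absent from rows or not marked ONLINE."""
    online = {
        row.get("hostname")
        for row in rows
        # If the state column is absent, treat a listed Drillbit as online.
        if str(row.get("state") or "ONLINE").upper() == "ONLINE"
    }
    return sorted(set(expected_hosts) - online)


# Example with in-memory rows; host names are made up for illustration.
sample_rows = [{"hostname": "node1", "state": "ONLINE"}]
print(missing_drillbits(["node1", "node2"], sample_rows))  # ['node2']
```

A scheduled job could call `fetch_drillbit_rows` against one node and send an alert whenever `missing_drillbits` returns a non-empty list, alongside the existing `drillbit.sh status` check.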