Hi Carlos,

It looks similar to an issue reported previously: https://lists.apache.org/thread.html/1f3d4c427690c06f1992bc5070f355689ccc5b1ed8cc3678ad8e9106@<user.drill.apache.org>
Could you try setting the JVM's file encoding to UTF-8 and retry? If that does not work, please file a JIRA at https://issues.apache.org

Thanks,
Kunal

On 7/16/2018 1:25:45 PM, Carlos Derich <carlosder...@gmail.com> wrote:

It seems to be an issue only with CSV/TSV files. I tried writing the output as JSON and it handles the encoding properly.

alter session set `store.format`='json'
create table dfs.tmp.test3 as select `city` from dfs.parquets.`file`

Returns:

{"city": "Montréal"}

Additional info, parquet-tools schema:

message root {
  optional binary city (UTF8);
}

On Mon, Jul 16, 2018 at 2:49 PM, Carlos Derich wrote:
> Hello guys, hope everyone is well.
>
> I am having an encoding issue when converting a table from parquet into
> csv files. I wonder if someone could shed some light on it?
>
> One of my data sets has data in French with lots of accented characters,
> and it is persisted in HDFS as parquet.
>
> When I query the parquet table with *select `city` from
> dfs.parquets.`file`*, it returns the data properly encoded:
>
> *city*
>
> *Montréal*
>
> Then I convert this table into a CSV file with the following queries:
>
> *alter session set `store.format`='csv'*
> *create table dfs.csvs.`converted` as select * from dfs.parquets.`file`*
>
> Then when I run a select query on it, it returns data that is not properly
> encoded:
>
> *select columns[0] from dfs.csvs.`converted`*
>
> Returns:
>
> *Montr?al*
>
> My storage plugin is pretty standard:
>
> "csv" : {
>   "type" : "text",
>   "extensions" : [ "csv" ],
>   "delimiter" : ",",
>   "skipFirstLine": true
> },
>
> Should I explicitly add a charset option somewhere? I couldn't find
> anything helpful in the docs.
>
> I tried adding *export DRILL_JAVA_OPTS="$DRILL_JAVA_OPTS
> -Dsaffron.default.charset=UTF-8"* to the drill-env.sh file, but no luck.
>
> Has anyone run into similar issues?
>
> Thank you!
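
For reference, the file-encoding suggestion above could look roughly like the following in drill-env.sh. This is only a sketch, not a confirmed fix for the CSV writer: `file.encoding` is the standard JVM system property for the default charset, and it is assumed here that Drill picks up DRILL_JAVA_OPTS from drill-env.sh in the same way as the `saffron.default.charset` attempt quoted above.

```shell
# Sketch for drill-env.sh (assumption: DRILL_JAVA_OPTS set here is passed
# to the drillbit JVM, as in the attempt quoted in the original message).
# file.encoding is read at JVM startup, so the drillbits need a restart
# after this change for it to take effect.
export DRILL_JAVA_OPTS="${DRILL_JAVA_OPTS:-} -Dfile.encoding=UTF-8"

# Print the resulting options to confirm the flag is present.
echo "$DRILL_JAVA_OPTS"
```

The change would presumably need to be made on every node running a drillbit, since each JVM has its own default charset. If the output still shows `Montr?al` after a restart, that is the point at which to file the JIRA.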