Hello, I am having an encoding problem when binding a Python script to a Pig script. I pass the text "Catégorie" as a parameter to the Pig script, but the binding does not use the right encoding and produces "Catgorie".
This is my Python script:

    # -*- coding: UTF-8 -*-
    from org.apache.pig.scripting import Pig

    Prefix = "Catégorie"
    params = {
        "prefix": Prefix,
        "output": "workspace/fr_30"
    }
    P1 = Pig.compileFromFile("topic-corpus/test.pig")
    bound1 = P1.bind(params)
    stats1 = bound1.run()

And the Pig script:

    items = LOAD '$output/items.tsv' AS (id: chararray, count: long);
    update_items = FOREACH items GENERATE id, REPLACE(id, '$prefix:', '') AS candidate_id;

When I run the script, the binding generates this query to run:

    2012-03-16 10:54:18,001 [main] INFO org.apache.pig.scripting.BoundScript - Query to run: items = LOAD 'workspace/fr_30/items.tsv' AS (id: chararray, count: long, childen:long, parents:long);items_replace = FOREACH items GENERATE id, REPLACE(id, 'Cat̩gorie:', '') AS candidateId;STORE items_replace INTO 'workspace/fr_30/replaced__items.tsv';

So the prefix comes out as Cat̩gorie, which is not decoded correctly, and as a result nothing is replaced.

What am I missing? Thanks!
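To illustrate the kind of mismatch that could produce this symptom, here is a minimal sketch in plain Python 3 (not the Jython embedded in Pig). It assumes, without confirmation, that the UTF-8 bytes of the literal get decoded with a single-byte charset somewhere in the binding step. The helper name `misdecode` and the choice of Latin-1 are hypothetical, and the exact garbled characters in the Pig log may differ.

```python
# Hypothetical helper, not part of Pig: shows what text looks like when its
# UTF-8 bytes are decoded with the wrong single-byte charset.
def misdecode(text, wrong_charset="latin-1"):
    return text.encode("utf-8").decode(wrong_charset)


prefix = "Cat\u00e9gorie"        # "Catégorie", written with an escape to stay ASCII-safe
garbled = misdecode(prefix)      # the two UTF-8 bytes of "é" become two separate characters

print(repr(prefix), len(prefix))
print(repr(garbled), len(garbled))

# Reversing the wrong decoding recovers the original text,
# which is one way to check that this is the mistake being made.
restored = garbled.encode("latin-1").decode("utf-8")
print(restored == prefix)
```

If the garbled prefix is one character longer than the original, as it is in this sketch, that would point to a bytes-versus-text mismatch rather than a problem in the REPLACE call itself.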