[jira] Commented: (PIG-774) Pig does not handle Chinese characters (in both the parameter subsitution using -param_file or embedded in the Pig script) correctly

2009-05-19 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12710628#action_12710628
 ] 

Daniel Dai commented on PIG-774:


You've got the point, Viraj. 

Actually, two different configurations can work:
1. LANG=UTF8: all data files, script files, and parameter files are UTF8
2. LANG=GB2312: data files are UTF8; script files and parameter files are GB2312
However, the RH-EL default setting is LANG=POSIX, which does not work well for 
Chinese characters. 

So for simplicity, we can keep everything in UTF8 (case 1). This is the default 
setting on Ubuntu. 
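
The two configurations above come down to which charset each file is decoded with. As a minimal illustration (plain JDK calls, not Pig code; the class and method names are made up for this sketch), the following shows what happens when UTF8 bytes are decoded with a different charset than they were written in. ISO-8859-1 stands in for a non-UTF8 locale only because it is always available in the JDK.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class LocaleCharsetDemo {
    // Decode bytes using the charset the reader *assumes* the file is in.
    public static String decodeAs(byte[] bytes, Charset assumed) {
        return new String(bytes, assumed);
    }

    public static void main(String[] args) {
        // First two characters of the test query, written as Unicode escapes
        // so the source file itself does not depend on the compiler's encoding.
        String query = "\u6b4c\u624b";
        byte[] onDisk = query.getBytes(StandardCharsets.UTF_8); // 3 bytes per character

        String right = decodeAs(onDisk, StandardCharsets.UTF_8);
        String wrong = decodeAs(onDisk, StandardCharsets.ISO_8859_1);

        System.out.println(right.equals(query)); // true: same charset on both sides
        System.out.println(wrong.length());      // 6: every byte became its own character
        System.out.println(wrong.equals(query)); // false: the filter would never match
    }
}
```

A comparison such as $0 == '$querystring' can only succeed when the data and the literal end up as the same character sequence, which is why keeping everything in UTF8 is the simplest setup.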

 Pig does not handle Chinese characters (in both the parameter subsitution 
 using -param_file or embedded in the Pig script) correctly
 

 Key: PIG-774
 URL: https://issues.apache.org/jira/browse/PIG-774
 Project: Pig
  Issue Type: Bug
  Components: grunt, impl
Affects Versions: 0.0.0
Reporter: Viraj Bhat
Assignee: Daniel Dai
Priority: Critical
 Fix For: 0.3.0

 Attachments: chinese.txt, chinese_data.pig, nextgen_paramfile, 
 pig_1240967860835.log, utf8.patch, utf8_parser-1.patch, utf8_parser-2.patch


 I created a very small test case in which I did the following.
 1) Created a UTF-8 file which contained a query string in Chinese and wrote 
 it to HDFS. I used this dfs file as an input for the tests.
 2) Created a parameter file which also contained the same query string as in 
 Step 1.
 3) Created a Pig script which takes in the parametrized query string and hard 
 coded Chinese character.
 
 Pig script: chinese_data.pig
 
 {code}
 rmf chineseoutput;
 I = load '/user/viraj/chinese.txt' using PigStorage('\u0001');
 J = filter I by $0 == '$querystring';
 --J = filter I by $0 == ' 歌手香港情牽女人心演唱會';
 store J into 'chineseoutput';
 dump J;
 {code}
 =
 Parameter file: nextgen_paramfile
 =
 queryid=20090311
 querystring='   歌手香港情牽女人心演唱會'
 =
 Input file: /user/viraj/chinese.txt
 =
 shell$ hadoop fs -cat /user/viraj/chinese.txt
 歌手香港情牽女人心演唱會
 =
 I ran the above set of inputs in the following ways:
 Run 1:
 =
 {code}
 java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' 
 org.apache.pig.Main -param_file nextgen_paramfile chinese_data.pig
 {code}
 =
 2009-04-22 01:31:35,703 [Thread-7] WARN  org.apache.hadoop.mapred.JobClient - 
 Use GenericOptionsParser for parsing the
 arguments. Applications should implement Tool for the same.
 2009-04-22 01:31:40,700 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  -
 0% complete
 2009-04-22 01:31:50,720 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  -
 100% complete
 2009-04-22 01:31:50,720 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  -
 Success!
 =
 Run 2: removed the parameter substitution in the Pig script instead used the 
 following statement.
 =
 {code}
 J = filter I by $0 == ' 歌手香港情牽女人心演唱會';
 {code}
 =
 java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' 
 org.apache.pig.Main chinese_data_withoutparam.pig
 =
 2009-04-22 01:35:22,402 [Thread-7] WARN  org.apache.hadoop.mapred.JobClient - 
 Use GenericOptionsParser for parsing the
 arguments. Applications should implement Tool for the same.
 2009-04-22 01:35:27,399 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  -
 0% complete
 2009-04-22 01:35:32,415 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  -
 100% complete
 2009-04-22 01:35:32,415 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  -
 Success!
 =
 In both cases:
 =
 {code}
 shell $ hadoop 

[jira] Commented: (PIG-774) Pig does not handle Chinese characters (in both the parameter subsitution using -param_file or embedded in the Pig script) correctly

2009-05-18 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12710619#action_12710619
 ] 

Viraj Bhat commented on PIG-774:


Hi Daniel,
For this patch to work, is it important to set the following?

LESSCHARSET to utf-8

LANG to en_US.utf8

I am observing that the dry run using pig -r does not yield the right parameter 
substitution if these variables are not set. 

They are not set by default on RH-EL 5.0.

You mentioned this in your earlier comments.

Thanks,
Viraj
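
One way to check what the JVM actually picked up from the environment is to print its default charset. This is a small sketch using only standard JDK calls; the class name is hypothetical. On JDKs of that era the default charset is typically derived from the locale (LANG and related variables), while JDK 18 and later default to UTF-8 regardless, so the output depends on the runtime.

```java
import java.nio.charset.Charset;

public class DefaultCharsetCheck {
    // Summarize the settings that influence how the JVM decodes text by default.
    public static String describe() {
        String lang = System.getenv("LANG"); // null when the variable is not set
        return "LANG=" + lang
                + ", file.encoding=" + System.getProperty("file.encoding")
                + ", default charset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

Running it once with LANG unset and once with LANG=en_US.utf8 shows whether the environment changes what the JVM uses.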

 Pig does not handle Chinese characters (in both the parameter subsitution 
 using -param_file or embedded in the Pig script) correctly
 

 Key: PIG-774
 URL: https://issues.apache.org/jira/browse/PIG-774
 Project: Pig
  Issue Type: Bug
  Components: grunt, impl
Affects Versions: 0.0.0
Reporter: Viraj Bhat
Assignee: Daniel Dai
Priority: Critical
 Fix For: 0.3.0

 Attachments: chinese.txt, chinese_data.pig, nextgen_paramfile, 
 pig_1240967860835.log, utf8.patch, utf8_parser-1.patch, utf8_parser-2.patch



[jira] Commented: (PIG-774) Pig does not handle Chinese characters (in both the parameter subsitution using -param_file or embedded in the Pig script) correctly

2009-05-04 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12705799#action_12705799
 ] 

Daniel Dai commented on PIG-774:


Hi, Olga, 
I actually assume that only textual data inside a bytearray is UTF8. The 
functions I changed are DataByteArray.DataByteArray(String s) and 
DataByteArray.toString(), which convert a String to and from a byte array. 
Images and other binary data never need to be converted to or from a String, 
so they will be fine.
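
As an illustration of the approach described above, here is a sketch of a byte-array wrapper whose String constructor and toString() always use UTF8. Utf8Bytes is a hypothetical stand-in written for this example; it is not Pig's DataByteArray and does not reproduce the actual patch.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8Bytes {
    private final byte[] data;

    // Binary payloads are stored untouched; no charset is involved.
    public Utf8Bytes(byte[] raw) {
        this.data = raw.clone();
    }

    // Text is always encoded as UTF-8, independent of the platform default.
    public Utf8Bytes(String s) {
        this.data = s.getBytes(StandardCharsets.UTF_8);
    }

    public byte[] get() {
        return data.clone();
    }

    // Text is always decoded as UTF-8, mirroring the String constructor.
    @Override
    public String toString() {
        return new String(data, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Utf8Bytes text = new Utf8Bytes("\u6b4c\u624b");
        System.out.println(text.toString().equals("\u6b4c\u624b")); // true

        // Bytes that are not valid UTF-8 survive as long as nobody calls toString().
        byte[] binary = {(byte) 0xFF, (byte) 0xD8, 0x00};
        System.out.println(Arrays.equals(binary, new Utf8Bytes(binary).get())); // true
    }
}
```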

 Pig does not handle Chinese characters (in both the parameter subsitution 
 using -param_file or embedded in the Pig script) correctly
 

 Key: PIG-774
 URL: https://issues.apache.org/jira/browse/PIG-774
 Project: Pig
  Issue Type: Bug
  Components: grunt, impl
Affects Versions: 0.0.0
Reporter: Viraj Bhat
Assignee: Daniel Dai
Priority: Critical
 Fix For: 0.0.0

 Attachments: chinese.txt, chinese_data.pig, nextgen_paramfile, 
 pig_1240967860835.log, utf8.patch, utf8_parser-1.patch, utf8_parser-2.patch



[jira] Commented: (PIG-774) Pig does not handle Chinese characters (in both the parameter subsitution using -param_file or embedded in the Pig script) correctly

2009-04-28 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703937#action_12703937
 ] 

Viraj Bhat commented on PIG-774:


Daniel, 
Thanks again for your patch. I worked with Pradeep and changed the parser code 
to invoke the behavior you suggested, and then filed Jira PIG-774. 
Here is one problem that I faced.
Suppose I have a script like this, called chinese_data.pig:
{code}
rmf chineseoutput;
%default querystring 'myquery';
I = load '/user/viraj/chinese.txt' using PigStorage('\u0001');

--dump I;

J = filter I by $0 == '$querystring';
--J = filter I by $0 == '   歌手香港情牽女人心演唱會';

--store J into 'chineseoutput';
dump J;
{code}

I have a parameter file called nextgen_paramfile, which contains the 
$querystring variable:

{code}
querystring=   歌手香港情牽女人心演唱會
{code}

I run the above script and parameter file as:
{code}
java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' 
org.apache.pig.Main -param_file nextgen_paramfile chinese_data.pig
{code}

I get the following error:

2009-04-29 01:05:14,979 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to 
hadoop file system at: hdfs://localhost:9000
2009-04-29 01:05:16,328 [main] INFO  
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to 
map-reduce job tracker at: localhost:9001
2009-04-29 01:05:16,907 [main] INFO  org.apache.pig.PigServer - Create a new 
graph.
2009-04-29 01:05:17,794 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
1000: Error during parsing. Lexical error at line 7, column 33.  Encountered: 
\u6b4c (27468), after : 

I realized that it had something to do with the commented line in the pig 
script:
{code}
--J = filter I by $0 == '   歌手香港情牽女人心演唱會';
{code}
Why is that so? I am attaching the pig_*log to this Jira.
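
The two values in the error message refer to the same character: \u6b4c is the first character of the Chinese literal, and 27468 is that code point in decimal. A small sketch using plain JDK calls (class and method names are made up for this note):

```java
public class CodePointDemo {
    // Format a code point as hexadecimal and decimal.
    public static String describe(int codePoint) {
        return String.format("U+%04X = %d", codePoint, codePoint);
    }

    public static void main(String[] args) {
        // First character of the query string, given as a Unicode escape.
        String first = "\u6b4c";
        int cp = first.codePointAt(0);

        System.out.println(describe(cp)); // U+6B4C = 27468
        System.out.println(cp > 0x7F);    // true: outside the ASCII range
    }
}
```

So the lexer stopped at the first non-ASCII character on that line, even though the line is a comment.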

Additionally, I found that the parameter substitution happens correctly 
when I run the script as:
{code}
java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' 
org.apache.pig.Main -param_file nextgen_paramfile -r chinese_data.pig
{code}
The substituted file, chinese_data.pig.substituted, is correct.

Viraj

 Pig does not handle Chinese characters (in both the parameter subsitution 
 using -param_file or embedded in the Pig script) correctly
 

 Key: PIG-774
 URL: https://issues.apache.org/jira/browse/PIG-774
 Project: Pig
  Issue Type: Bug
  Components: grunt, impl
Affects Versions: 0.0.0
Reporter: Viraj Bhat
Priority: Critical
 Fix For: 0.0.0

 Attachments: chinese.txt, chinese_data.pig, nextgen_paramfile, 
 utf8_parser-1.patch



[jira] Commented: (PIG-774) Pig does not handle Chinese characters (in both the parameter subsitution using -param_file or embedded in the Pig script) correctly

2009-04-28 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703977#action_12703977
 ] 

Viraj Bhat commented on PIG-774:


I modified the file PigScriptParser.jj, and it works.

 Pig does not handle Chinese characters (in both the parameter subsitution 
 using -param_file or embedded in the Pig script) correctly
 

 Key: PIG-774
 URL: https://issues.apache.org/jira/browse/PIG-774
 Project: Pig
  Issue Type: Bug
  Components: grunt, impl
Affects Versions: 0.0.0
Reporter: Viraj Bhat
Priority: Critical
 Fix For: 0.0.0

 Attachments: chinese.txt, chinese_data.pig, nextgen_paramfile, 
 pig_1240967860835.log, utf8_parser-1.patch, utf8_parser-2.patch


 I created a very small test case in which I did the following.
 1) Created a UTF-8 file which contained a query string in Chinese and wrote 
 it to HDFS. I used this dfs file as an input for the tests.
 2) Created a parameter file which also contained the same query string as in 
 Step 1.
 3) Created a Pig script which takes in the parametrized query string and hard 
 coded Chinese character.
 
 Pig script: chinese_data.pig
 
 {code}
 rmf chineseoutput;
 I = load '/user/viraj/chinese.txt' using PigStorage('\u0001');
 J = filter I by $0 == '$querystring';
 --J = filter I by $0 == ' 歌手香港情牽女人心演唱會';
 store J into 'chineseoutput';
 dump J;
 {code}
 =
 Parameter file: nextgen_paramfile
 =
 queryid=20090311
 querystring='   歌手香港情牽女人心演唱會'
 =
 Input file: /user/viraj/chinese.txt
 =
 shell$ hadoop fs -cat /user/viraj/chinese.txt
 歌手香港情牽女人心演唱會
 =
 I ran the above set of inputs in the following ways:
 Run 1:
 =
 {code}
 java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' 
 org.apache.pig.Main -param_file nextgen_paramfile chinese_data.pig
 {code}
 =
 2009-04-22 01:31:35,703 [Thread-7] WARN  org.apache.hadoop.mapred.JobClient - 
 Use GenericOptionsParser for parsing the
 arguments. Applications should implement Tool for the same.
 2009-04-22 01:31:40,700 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  -
 0% complete
 2009-04-22 01:31:50,720 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  -
 100% complete
 2009-04-22 01:31:50,720 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  -
 Success!
 =
 Run 2: removed the parameter substitution in the Pig script instead used the 
 following statement.
 =
 {code}
 J = filter I by $0 == ' 歌手香港情牽女人心演唱會';
 {code}
 =
 java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' 
 org.apache.pig.Main chinese_data_withoutparam.pig
 =
 2009-04-22 01:35:22,402 [Thread-7] WARN  org.apache.hadoop.mapred.JobClient - 
 Use GenericOptionsParser for parsing the
 arguments. Applications should implement Tool for the same.
 2009-04-22 01:35:27,399 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  -
 0% complete
 2009-04-22 01:35:32,415 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  -
 100% complete
 2009-04-22 01:35:32,415 [main] INFO  
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
  -
 Success!
 =
 In both cases:
 =
 {code}
 shell $ hadoop fs -ls /user/viraj/chineseoutput
 Found 2 items
 drwxr-xr-x   - viraj supergroup  0 2009-04-22 01:37 
 /user/viraj/chineseoutput/_logs
 -rw-r--r--   3 viraj supergroup  0 2009-04-22 01:37 
 /user/viraj/chineseoutput/part-0
 {code}
 =
 Additionally tried the dry-run option to figure out if the parameter 
 substitution was occurring 

[jira] Commented: (PIG-774) Pig does not handle Chinese characters (in both the parameter subsitution using -param_file or embedded in the Pig script) correctly

2009-04-27 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703007#action_12703007
 ] 

Daniel Dai commented on PIG-774:


Currently JLine does not handle backspace correctly for multibyte characters. 
When we hit backspace on an OS with UTF8 encoding, only part of the character 
is deleted. If the OS uses a native encoding, the situation is even worse: 
JLine throws an exception when a multibyte character is entered. This problem 
is inherent in JLine, and all applications that use JLine share it. I will try 
to fix it in JLine; however, fixing this problem is outside the scope of Pig. 
So for now, we will have to live with these problems:

# Multibyte character input is not supported if the OS encoding is native
# Backspace handling is incorrect if the line contains multibyte characters and 
the OS encoding is UTF8

Interestingly, JLine works fine under Cygwin. The above problems occur only on 
Unix.
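
To illustrate why deleting only part of a multibyte character is a problem, here is a sketch using standard JDK calls. It does not use or reproduce JLine; the class and method names are hypothetical. Dropping a single byte from a UTF8 sequence leaves malformed input, whereas dropping a whole code point leaves a valid string.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class PartialDeleteDemo {
    // Strict UTF-8 decoding: returns null when the bytes are not well-formed.
    public static String decodeStrict(byte[] bytes) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    // A byte-oriented "backspace": removes exactly one byte.
    public static byte[] dropLastByte(byte[] bytes) {
        return Arrays.copyOf(bytes, bytes.length - 1);
    }

    // A character-oriented "backspace": removes one whole code point.
    public static String dropLastCodePoint(String s) {
        if (s.isEmpty()) {
            return s;
        }
        return s.substring(0, s.offsetByCodePoints(s.length(), -1));
    }

    public static void main(String[] args) {
        String line = "a\u6b4c";                            // 'a' plus one 3-byte character
        byte[] utf8 = line.getBytes(StandardCharsets.UTF_8); // 4 bytes in total

        System.out.println(decodeStrict(dropLastByte(utf8))); // null: truncated sequence
        System.out.println(dropLastCodePoint(line));          // a
    }
}
```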

 Pig does not handle Chinese characters (in both the parameter subsitution 
 using -param_file or embedded in the Pig script) correctly
 

 Key: PIG-774
 URL: https://issues.apache.org/jira/browse/PIG-774
 Project: Pig
  Issue Type: Bug
  Components: grunt, impl
Affects Versions: 0.0.0
Reporter: Viraj Bhat
Priority: Critical
 Fix For: 0.0.0

 Attachments: chinese.txt, chinese_data.pig, nextgen_paramfile, 
 utf8_parser-1.patch



[jira] Commented: (PIG-774) Pig does not handle Chinese characters (in both the parameter subsitution using -param_file or embedded in the Pig script) correctly

2009-04-24 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702445#action_12702445
 ] 

Alan Gates commented on PIG-774:


Two lines of change are needed to fix this:
1. In QueryParser.jjt, introduce a new option for handling unicode
2. In the LogicalPlanBuilder, use the getBytes method with the UTF-8
charset

These changes also need to be propagated to the remaining
JavaCC parsers.  Then testing will need to be done.  Estimate 3-5 days of work.

Reference:
http://www.xrce.xerox.com/competencies/content-analysis/tools/publis/javacc_unicode.pdf
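
For the second point, the difference an explicit charset makes can be seen with the standard String.getBytes call. This is only an illustration of the JDK behavior, not the LogicalPlanBuilder change itself; the class name is made up, and US-ASCII stands in for a locale that cannot represent Chinese characters.

```java
import java.nio.charset.StandardCharsets;

public class GetBytesDemo {
    // Render bytes as lowercase hex for easy comparison.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String literal = "\u6b4c"; // first character of the query string

        // Explicit UTF-8: the character becomes its three-byte encoding.
        System.out.println(hex(literal.getBytes(StandardCharsets.UTF_8)));    // e6ad8c

        // A charset that cannot represent it substitutes '?', losing the data.
        System.out.println(hex(literal.getBytes(StandardCharsets.US_ASCII))); // 3f
    }
}
```

Passing the charset explicitly removes the dependence on the platform default that the no-argument getBytes() has.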


 Pig does not handle Chinese characters (in both the parameter subsitution 
 using -param_file or embedded in the Pig script) correctly
 

 Key: PIG-774
 URL: https://issues.apache.org/jira/browse/PIG-774
 Project: Pig
  Issue Type: Bug
  Components: grunt, impl
Affects Versions: 0.0.0
Reporter: Viraj Bhat
Priority: Critical
 Fix For: 0.0.0

 Attachments: chinese.txt, chinese_data.pig, nextgen_paramfile



[jira] Commented: (PIG-774) Pig does not handle Chinese characters (in both the parameter subsitution using -param_file or embedded in the Pig script) correctly

2009-04-24 Thread David Ciemiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702533#action_12702533
 ] 

David Ciemiewicz commented on PIG-774:

A somewhat related bug is JIRA PIG-755 - the difficulty of debugging issues 
related to passed parameters.

If Pig produced an output file of the code with parameter substitutions made, 
we could have more rapidly isolated the problem.

 Pig does not handle Chinese characters (in both the parameter subsitution 
 using -param_file or embedded in the Pig script) correctly
 

 Key: PIG-774
 URL: https://issues.apache.org/jira/browse/PIG-774
 Project: Pig
  Issue Type: Bug
  Components: grunt, impl
Affects Versions: 0.0.0
Reporter: Viraj Bhat
Priority: Critical
 Fix For: 0.0.0

 Attachments: chinese.txt, chinese_data.pig, nextgen_paramfile



[jira] Commented: (PIG-774) Pig does not handle Chinese characters (in both the parameter subsitution using -param_file or embedded in the Pig script) correctly

2009-04-24 Thread Santhosh Srinivasan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702538#action_12702538
 ] 

Santhosh Srinivasan commented on PIG-774:
-

To obtain the substituted Pig script, use the -dryrun switch. By default, the 
substituted Pig script is not stored on disk.
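
Once a substituted script has been produced, one way to check whether the Chinese query string survived substitution is to decode the file explicitly as UTF-8 and search it. The following is only a sketch under stated assumptions: the class name SubstitutedScriptCheck and the helper containsUtf8 are hypothetical, the path of the substituted script is passed in as an argument because its exact name is not given here, and only the first two characters of the query string are searched for.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of Pig.
public class SubstitutedScriptCheck {
    // Decode the raw file bytes as UTF-8 and report whether the expected text is present.
    static boolean containsUtf8(byte[] raw, String expected) {
        String content = new String(raw, Charset.forName("UTF-8"));
        return content.contains(expected);
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path of the substituted script (assumed to be supplied by the caller).
        byte[] raw = Files.readAllBytes(Paths.get(args[0]));
        // The first two characters of the query string, written as Unicode escapes
        // so that this source file itself does not depend on the compiler's encoding.
        String expected = "\u6b4c\u624b";
        System.out.println(containsUtf8(raw, expected));
    }
}
```

If this prints false for a script whose parameter file is known to be UTF-8, the characters were probably lost during substitution rather than during execution.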

 Pig does not handle Chinese characters (in both the parameter subsitution 
 using -param_file or embedded in the Pig script) correctly
 

 Key: PIG-774
 URL: https://issues.apache.org/jira/browse/PIG-774
 Project: Pig
  Issue Type: Bug
  Components: grunt, impl
Affects Versions: 0.0.0
Reporter: Viraj Bhat
Priority: Critical
 Fix For: 0.0.0

 Attachments: chinese.txt, chinese_data.pig, nextgen_paramfile


 I created a very small test case in which I did the following.
 1) Created a UTF-8 file which contained a query string in Chinese and wrote 
 it to HDFS. I used this dfs file as an input for the tests.
 2) Created a parameter file which also contained the same query string as in 
 Step 1.
 3) Created a Pig script which takes in the parameterized query string and a 
 hard-coded Chinese string.
 
 Pig script: chinese_data.pig
 
 {code}
 rmf chineseoutput;
 I = load '/user/viraj/chinese.txt' using PigStorage('\u0001');
 J = filter I by $0 == '$querystring';
 --J = filter I by $0 == ' 歌手香港情牽女人心演唱會';
 store J into 'chineseoutput';
 dump J;
 {code}
 =
 Parameter file: nextgen_paramfile
 =
 queryid=20090311
 querystring='   歌手香港情牽女人心演唱會'
 =
 Input file: /user/viraj/chinese.txt
 =
 shell$ hadoop fs -cat /user/viraj/chinese.txt
 歌手香港情牽女人心演唱會
 =
 I ran the above set of inputs in the following ways:
 Run 1:
 =
 java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' org.apache.pig.Main -param_file nextgen_paramfile chinese_data.pig
 =
 2009-04-22 01:31:35,703 [Thread-7] WARN  org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
 2009-04-22 01:31:40,700 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
 2009-04-22 01:31:50,720 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
 2009-04-22 01:31:50,720 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
 =
 Run 2: removed the parameter substitution from the Pig script and instead used 
 the following statement.
 =
 J = filter I by $0 == ' 歌手香港情牽女人心演唱會';
 =
 java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' org.apache.pig.Main chinese_data_withoutparam.pig
 =
 2009-04-22 01:35:22,402 [Thread-7] WARN  org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
 2009-04-22 01:35:27,399 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
 2009-04-22 01:35:32,415 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
 2009-04-22 01:35:32,415 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
 =
 In both cases:
 =
 ucdev6 01:39:22 ~/pig-svn/trunk $ hadoop fs -ls /user/viraj/chineseoutput
 Found 2 items
 drwxr-xr-x   - viraj supergroup  0 2009-04-22 01:37 /user/viraj/chineseoutput/_logs
 -rw-r--r--   3 viraj supergroup  0 2009-04-22 01:37 /user/viraj/chineseoutput/part-0
 =
 Additionally tried the dry-run option to figure out if the parameter 
 substitution was 

[jira] Commented: (PIG-774) Pig does not handle Chinese characters (in both the parameter subsitution using -param_file or embedded in the Pig script) correctly

2009-04-24 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702620#action_12702620
 ] 

Viraj Bhat commented on PIG-774:


One workaround for this issue is to use a FilterFunc UDF that reads its filter 
list from a file stored on HDFS.

Care has to be taken in the FilterFunc UDF to wrap the input stream in a 
BufferedReader that decodes the data explicitly as UTF-8.
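{code}
import java.io.*;
import java.util.*;
import org.apache.pig.FilterFunc;
import org.apache.pig.backend.hadoop.datastorage.ConfigurationUtil;
import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.io.FileLocalizer;

public class FILTERFROMFILE extends FilterFunc {
    private String filterListFileName = "";
    private Set<String> filterList = null;

    // The HDFS path of the filter list is passed in from the Pig "define" statement.
    public FILTERFROMFILE(String filterListFileName) {
        this.filterListFileName = filterListFileName;
    }

    // Load the filter list once, decoding the HDFS file explicitly as UTF-8.
    private void init() throws IOException {
        Properties props = ConfigurationUtil.toProperties(PigInputFormat.sJob);
        InputStream is = FileLocalizer.openDFSFile(filterListFileName, props);
        BufferedReader reader = new BufferedReader(new InputStreamReader(is, "UTF-8"));
        filterList = new HashSet<String>();
        for (String line = reader.readLine(); line != null; line = reader.readLine()) {
            filterList.add(line);
        }
        reader.close();
    }

    public Boolean exec(Tuple input) throws IOException {
        if (filterList == null) init();
        // Do the matching here; an exact match on the first field is assumed.
        Object field = input.get(0);
        return field != null && filterList.contains(field.toString());
    }
}
{code}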
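
The reason for naming the charset explicitly can be shown with a small standalone sketch. It is not taken from the Pig code base: the class CharsetRoundTrip and its decode helper are hypothetical, and US-ASCII is used only as an assumed stand-in for whatever default charset a JVM picks up in a non-UTF-8 locale. The two characters are the first two of the query string, written as Unicode escapes.

```java
import java.nio.charset.Charset;

// Hypothetical demonstration class, not part of Pig.
public class CharsetRoundTrip {
    // Decode raw bytes with the given charset, as an InputStreamReader would.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        Charset utf8 = Charset.forName("UTF-8");
        Charset ascii = Charset.forName("US-ASCII"); // assumed non-UTF-8 default

        String query = "\u6b4c\u624b";
        byte[] raw = query.getBytes(utf8); // the bytes as stored in the UTF-8 data file

        // Decoding with the matching charset recovers the original string.
        System.out.println(decode(raw, utf8).equals(query));

        // Decoding with a single-byte charset replaces every multi-byte sequence,
        // so an equality filter on the result can never match.
        System.out.println(decode(raw, ascii).equals(query));
    }
}
```

This prints true and then false, which is the same symptom as the empty part file in the report: the job succeeds but the comparison never matches.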

Pig code to instantiate the filter function UDF:
{code}
register pigudf/myfilterfunc.jar;

define MATCHQUERY FILTERFROMFILE('/user/viraj/chinesedata');

rmf chineseoutput;

I = load '/user/viraj/testchinese' using PigStorage('\u0001') as 
(teststring:chararray);

J = filter I by MATCHQUERY(teststring);

store J into 'chineseoutput';
{code}



[jira] Commented: (PIG-774) Pig does not handle Chinese characters (in both the parameter subsitution using -param_file or embedded in the Pig script) correctly

2009-04-24 Thread Viraj Bhat (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12702621#action_12702621
 ] 

Viraj Bhat commented on PIG-774:


Ciemo, as stated in the original problem description, there is a '-r' switch 
that achieves the same thing.
{code}
java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' org.apache.pig.Main -param_file nextgen_paramfile -r chinese_data.pig
{code}
