[jira] Updated: (PIG-861) POJoinPackage lose tuple in large dataset
[ https://issues.apache.org/jira/browse/PIG-861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-861: --- Status: In Progress (was: Patch Available) POJoinPackage lose tuple in large dataset - Key: PIG-861 URL: https://issues.apache.org/jira/browse/PIG-861 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.3.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.4.0 Attachments: PIG-861-1.patch Some script using POJoinPackage loses records when processing large amount of input data. We do not see this problem in smaller input. We can reproduce this problem, however, the dataset for the test case is too big to be included here. We suspect that POJoinPackage causes the problem. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-861) POJoinPackage lose tuple in large dataset
[ https://issues.apache.org/jira/browse/PIG-861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-861: --- Affects Version/s: (was: 0.2.0) 0.3.0 Status: Patch Available (was: Open) POJoinPackage lose tuple in large dataset - Key: PIG-861 URL: https://issues.apache.org/jira/browse/PIG-861 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.3.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.4.0 Attachments: PIG-861-1.patch Some script using POJoinPackage loses records when processing large amount of input data. We do not see this problem in smaller input. We can reproduce this problem, however, the dataset for the test case is too big to be included here. We suspect that POJoinPackage causes the problem. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-734) Non-string keys in maps
[ https://issues.apache.org/jira/browse/PIG-734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12723350#action_12723350 ] Daniel Dai commented on PIG-734: Patch looks good to me. Non-string keys in maps --- Key: PIG-734 URL: https://issues.apache.org/jira/browse/PIG-734 Project: Pig Issue Type: Bug Affects Versions: 0.2.0 Reporter: Alan Gates Assignee: Alan Gates Priority: Minor Fix For: 0.4.0 Attachments: PIG-734.patch, PIG-734_2.patch, PIG-734_3.patch With the addition of types to pig, maps were changed to allow any atomic type to be a key. However, in practice we do not see people using keys other than strings. And allowing multiple types is causing us issues in serializing data (we have to check what every key type is) and in the design for non-java UDFs (since many scripting languages include associative arrays such as Perl's hash). So I propose we scope back maps to only have string keys. This would be a non-compatible change. But I am not aware of anyone using non-string keys, so hopefully it would have little or no impact. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-832) Make import list configurable
[ https://issues.apache.org/jira/browse/PIG-832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-832: --- Resolution: Fixed Status: Resolved (was: Patch Available) Remove the Yahoo line Make import list configurable - Key: PIG-832 URL: https://issues.apache.org/jira/browse/PIG-832 Project: Pig Issue Type: Improvement Affects Versions: 0.2.0 Reporter: Olga Natkovich Assignee: Daniel Dai Fix For: 0.4.0 Attachments: PIG-832-1.patch, PIG-832-2.patch Currently, it is hardwired in PigContext. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-797) Limit with ORDER BY producing wrong results
[ https://issues.apache.org/jira/browse/PIG-797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-797: --- Resolution: Fixed Fix Version/s: 0.4.0 Status: Resolved (was: Patch Available) Patch submitted Limit with ORDER BY producing wrong results --- Key: PIG-797 URL: https://issues.apache.org/jira/browse/PIG-797 Project: Pig Issue Type: Bug Affects Versions: 0.2.0 Reporter: Olga Natkovich Assignee: Daniel Dai Fix For: 0.4.0 Attachments: PIG-797-2.patch, PIG-797-3.patch, PIG-797.patch Query: A = load 'studenttab10k' as (name, age, gpa); B = group A by name; C = foreach B generate group, SUM(A.gpa) as rev; D = order C by rev; E = limit D 10; dump E; Output: (alice king,31.7) (alice laertes,26.453) (alice thompson,25.867) (alice van buren,23.59) (bob allen,19.902) (bob ichabod,29.0) (bob king,28.454) (bob miller,10.28) (bob underhill,28.137) (bob van buren,25.992) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
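The output quoted in this report can be checked against the `order C by rev` step with a few lines of Java. This is only a sketch: the class and method names are hypothetical, and the values are the `rev` column copied from the output above. It shows that the ten rows are neither ascending nor descending by `rev`, which is the symptom the issue title describes.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Pig: checks whether a column is sorted.
final class OrderCheckSketch {

    // The rev column copied from the output quoted in the PIG-797 report.
    static final List<Double> REPORTED_REV = Arrays.asList(
            31.7, 26.453, 25.867, 23.59, 19.902,
            29.0, 28.454, 10.28, 28.137, 25.992);

    /** Returns true when the values never decrease from one row to the next. */
    static boolean isAscending(List<Double> values) {
        for (int i = 1; i < values.size(); i++) {
            if (values.get(i) < values.get(i - 1)) {
                return false;
            }
        }
        return true;
    }

    /** Returns true when the values never increase from one row to the next. */
    static boolean isDescending(List<Double> values) {
        for (int i = 1; i < values.size(); i++) {
            if (values.get(i) > values.get(i - 1)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("ascending:  " + isAscending(REPORTED_REV));  // prints "ascending:  false"
        System.out.println("descending: " + isDescending(REPORTED_REV)); // prints "descending: false"
    }
}
```

The rows do appear in alphabetical order of the name column, so the limit seems to have been applied to data that was not ordered by `rev`.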
[jira] Updated: (PIG-832) Make import list configurable
[ https://issues.apache.org/jira/browse/PIG-832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-832: --- Fix Version/s: 0.4.0 Status: Patch Available (was: Open) Make import list configurable - Key: PIG-832 URL: https://issues.apache.org/jira/browse/PIG-832 Project: Pig Issue Type: Improvement Affects Versions: 0.2.0 Reporter: Olga Natkovich Assignee: Daniel Dai Fix For: 0.4.0 Attachments: PIG-832-1.patch Currently, it is hardwired in PigContext. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-832) Make import list configurable
[ https://issues.apache.org/jira/browse/PIG-832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721456#action_12721456 ] Daniel Dai commented on PIG-832: Hi, Milind, in the use case you mentioned, the user can write his/her own PigStorage and put the jar in the import list. Pig will pick the user-supplied UDF first, thus overriding the builtin PigStorage. How does that sound? Make import list configurable - Key: PIG-832 URL: https://issues.apache.org/jira/browse/PIG-832 Project: Pig Issue Type: Improvement Affects Versions: 0.2.0 Reporter: Olga Natkovich Assignee: Daniel Dai Fix For: 0.3.0 Currently, it is hardwired in PigContext. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-832) Make import list configurable
[ https://issues.apache.org/jira/browse/PIG-832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721503#action_12721503 ] Daniel Dai commented on PIG-832: Hi, Milind, For your first comment: yes, the user's class has to be PigStorage. For your second comment: we do not put the user's jar before pig.jar; we put their UDF search path first. Say the user passes -Dudf.import.list=com.xxx.udf1:com.xxx.udf2. When we see an unknown UDF, we first search the package com.xxx.udf1, then com.xxx.udf2, then org.apache.pig.builtin. We build this policy into our code. It does not put user.jar in front of pig.jar. Make import list configurable - Key: PIG-832 URL: https://issues.apache.org/jira/browse/PIG-832 Project: Pig Issue Type: Improvement Affects Versions: 0.2.0 Reporter: Olga Natkovich Assignee: Daniel Dai Fix For: 0.3.0 Currently, it is hardwired in PigContext. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
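The search order described in this comment can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the code in the PIG-832 patch: the class `UdfImportListSketch` and its methods are hypothetical, and only the colon-separated list format and the fallback package `org.apache.pig.builtin` come from the comment above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of a package search path for UDF names; not Pig's code.
final class UdfImportListSketch {

    /** Package searched last, as named in the comment above. */
    static final String BUILTIN_PACKAGE = "org.apache.pig.builtin";

    /**
     * Turns a colon-separated import list (for example the value of
     * -Dudf.import.list) into the ordered list of packages to search.
     * User packages come first, the builtin package comes last.
     */
    static List<String> searchOrder(String importList) {
        List<String> packages = new ArrayList<>();
        if (importList != null) {
            for (String entry : importList.split(":")) {
                String trimmed = entry.trim();
                if (!trimmed.isEmpty()) {
                    packages.add(trimmed);
                }
            }
        }
        packages.add(BUILTIN_PACKAGE);
        return packages;
    }

    /**
     * Tries each package in order and returns the first class that loads.
     * Returns an empty Optional when no package contains the class.
     */
    static Optional<Class<?>> resolve(String simpleName, List<String> packages) {
        for (String pkg : packages) {
            try {
                Class<?> found = Class.forName(pkg + "." + simpleName);
                return Optional.of(found);
            } catch (ClassNotFoundException e) {
                // Not in this package; fall through to the next one.
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Prints the three packages in the order they would be searched.
        System.out.println(searchOrder("com.xxx.udf1:com.xxx.udf2"));
    }
}
```

Because the lookup stops at the first package that contains the class, a user class named PigStorage in `com.xxx.udf1` would shadow the builtin one without any change to the order of jars on the classpath.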
[jira] Commented: (PIG-832) Make import list configurable
[ https://issues.apache.org/jira/browse/PIG-832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721519#action_12721519 ] Daniel Dai commented on PIG-832: Hi, Milind, If a user wrote 10 UDFs, I guess he/she is not supposed to put 10 entries on the command line, right? Make import list configurable - Key: PIG-832 URL: https://issues.apache.org/jira/browse/PIG-832 Project: Pig Issue Type: Improvement Affects Versions: 0.2.0 Reporter: Olga Natkovich Assignee: Daniel Dai Fix For: 0.3.0 Currently, it is hardwired in PigContext. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-832) Make import list configurable
[ https://issues.apache.org/jira/browse/PIG-832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721528#action_12721528 ] Daniel Dai commented on PIG-832: Yes, `cat myudflist` is a workaround. However, in my humble opinion, this syntax is not very intuitive to the ordinary user. Many users may get the impression that they have to list their UDFs one by one. Make import list configurable - Key: PIG-832 URL: https://issues.apache.org/jira/browse/PIG-832 Project: Pig Issue Type: Improvement Affects Versions: 0.2.0 Reporter: Olga Natkovich Assignee: Daniel Dai Fix For: 0.3.0 Currently, it is hardwired in PigContext. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-797) Limit with ORDER BY producing wrong results
[ https://issues.apache.org/jira/browse/PIG-797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-797: --- Status: In Progress (was: Patch Available) Limit with ORDER BY producing wrong results --- Key: PIG-797 URL: https://issues.apache.org/jira/browse/PIG-797 Project: Pig Issue Type: Bug Affects Versions: 0.2.0 Reporter: Olga Natkovich Assignee: Daniel Dai Fix For: 0.3.0 Attachments: PIG-797-2.patch, PIG-797.patch Query: A = load 'studenttab10k' as (name, age, gpa); B = group A by name; C = foreach B generate group, SUM(A.gpa) as rev; D = order C by rev; E = limit D 10; dump E; Output: (alice king,31.7) (alice laertes,26.453) (alice thompson,25.867) (alice van buren,23.59) (bob allen,19.902) (bob ichabod,29.0) (bob king,28.454) (bob miller,10.28) (bob underhill,28.137) (bob van buren,25.992) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-797) Limit with ORDER BY producing wrong results
[ https://issues.apache.org/jira/browse/PIG-797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-797: --- Attachment: PIG-797-3.patch This patch includes Santhosh's comment. It is obviously a problem. Thanks! Limit with ORDER BY producing wrong results --- Key: PIG-797 URL: https://issues.apache.org/jira/browse/PIG-797 Project: Pig Issue Type: Bug Affects Versions: 0.2.0 Reporter: Olga Natkovich Assignee: Daniel Dai Fix For: 0.3.0 Attachments: PIG-797-2.patch, PIG-797-3.patch, PIG-797.patch Query: A = load 'studenttab10k' as (name, age, gpa); B = group A by name; C = foreach B generate group, SUM(A.gpa) as rev; D = order C by rev; E = limit D 10; dump E; Output: (alice king,31.7) (alice laertes,26.453) (alice thompson,25.867) (alice van buren,23.59) (bob allen,19.902) (bob ichabod,29.0) (bob king,28.454) (bob miller,10.28) (bob underhill,28.137) (bob van buren,25.992) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-797) Limit with ORDER BY producing wrong results
[ https://issues.apache.org/jira/browse/PIG-797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-797: --- Status: Open (was: Patch Available) Limit with ORDER BY producing wrong results --- Key: PIG-797 URL: https://issues.apache.org/jira/browse/PIG-797 Project: Pig Issue Type: Bug Affects Versions: 0.2.0 Reporter: Olga Natkovich Fix For: site Attachments: PIG-797.patch Query: A = load 'studenttab10k' as (name, age, gpa); B = group A by name; C = foreach B generate group, SUM(A.gpa) as rev; D = order C by rev; E = limit D 10; dump E; Output: (alice king,31.7) (alice laertes,26.453) (alice thompson,25.867) (alice van buren,23.59) (bob allen,19.902) (bob ichabod,29.0) (bob king,28.454) (bob miller,10.28) (bob underhill,28.137) (bob van buren,25.992) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-797) Limit with ORDER BY producing wrong results
[ https://issues.apache.org/jira/browse/PIG-797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-797: --- Fix Version/s: (was: site) 0.3.0 Assignee: Daniel Dai Status: Patch Available (was: Open) Limit with ORDER BY producing wrong results --- Key: PIG-797 URL: https://issues.apache.org/jira/browse/PIG-797 Project: Pig Issue Type: Bug Affects Versions: 0.2.0 Reporter: Olga Natkovich Assignee: Daniel Dai Fix For: 0.3.0 Attachments: PIG-797-2.patch, PIG-797.patch Query: A = load 'studenttab10k' as (name, age, gpa); B = group A by name; C = foreach B generate group, SUM(A.gpa) as rev; D = order C by rev; E = limit D 10; dump E; Output: (alice king,31.7) (alice laertes,26.453) (alice thompson,25.867) (alice van buren,23.59) (bob allen,19.902) (bob ichabod,29.0) (bob king,28.454) (bob miller,10.28) (bob underhill,28.137) (bob van buren,25.992) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-797) Limit with ORDER BY producing wrong results
[ https://issues.apache.org/jira/browse/PIG-797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-797: --- Attachment: PIG-797-2.patch The new patch solves the FindBugs issues and adds a test case. Limit with ORDER BY producing wrong results --- Key: PIG-797 URL: https://issues.apache.org/jira/browse/PIG-797 Project: Pig Issue Type: Bug Affects Versions: 0.2.0 Reporter: Olga Natkovich Fix For: 0.3.0 Attachments: PIG-797-2.patch, PIG-797.patch Query: A = load 'studenttab10k' as (name, age, gpa); B = group A by name; C = foreach B generate group, SUM(A.gpa) as rev; D = order C by rev; E = limit D 10; dump E; Output: (alice king,31.7) (alice laertes,26.453) (alice thompson,25.867) (alice van buren,23.59) (bob allen,19.902) (bob ichabod,29.0) (bob king,28.454) (bob miller,10.28) (bob underhill,28.137) (bob van buren,25.992) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-850) Dump produce wrong result while store into is ok
[ https://issues.apache.org/jira/browse/PIG-850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-850: --- Attachment: PIG-850.patch When we add the extra limit map-reduce operator (see [PIG-364|http://issues.apache.org/jira/browse/PIG-364]), we should mark the output file of the original map-reduce as temporary; otherwise, dump will pick the wrong output file. Dump produce wrong result while store into is ok -- Key: PIG-850 URL: https://issues.apache.org/jira/browse/PIG-850 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.2.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.3.0 Attachments: PIG-850.patch The following script will wrongly produce 20 output, however, if we change dump to store into, the result is correct. Not sure if the problem is only for limited sort case. A = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa); B = order A by gpa parallel 2; C = limit B 10; dump C; -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
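The symptom in this report (20 rows instead of 10 with `parallel 2`) is consistent with reading the output of the first job instead of the output of the extra limit job. The toy model below illustrates that reading. It is a sketch only: it assumes that each of the two reducers applies the limit on its own partition, which the report does not state explicitly, and every name in it is hypothetical rather than taken from Pig.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy model of a limited sort; an assumption-laden sketch, not Pig's execution plan.
final class LimitedSortSketch {

    /** Keeps at most n rows, as a LIMIT applied to one stream of rows would. */
    static List<Double> limit(List<Double> rows, int n) {
        return new ArrayList<>(rows.subList(0, Math.min(n, rows.size())));
    }

    /** Sorts the rows and splits them into contiguous partitions, one per reducer. */
    static List<List<Double>> sortIntoPartitions(List<Double> rows, int parallel) {
        List<Double> sorted = new ArrayList<>(rows);
        Collections.sort(sorted);
        List<List<Double>> partitions = new ArrayList<>();
        int chunk = (sorted.size() + parallel - 1) / parallel; // ceiling division
        for (int start = 0; start < sorted.size(); start += chunk) {
            int end = Math.min(start + chunk, sorted.size());
            partitions.add(sorted.subList(start, end));
        }
        return partitions;
    }

    /** First job: every partition applies the limit independently. */
    static List<Double> firstJobOutput(List<Double> rows, int parallel, int n) {
        List<Double> out = new ArrayList<>();
        for (List<Double> partition : sortIntoPartitions(rows, parallel)) {
            out.addAll(limit(partition, n));
        }
        return out;
    }

    /** Extra single-reducer pass: applies the limit once over the first job's output. */
    static List<Double> finalOutput(List<Double> rows, int parallel, int n) {
        return limit(firstJobOutput(rows, parallel, n), n);
    }

    public static void main(String[] args) {
        List<Double> rows = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            rows.add((double) i);
        }
        System.out.println(firstJobOutput(rows, 2, 10).size()); // prints 20
        System.out.println(finalOutput(rows, 2, 10).size());    // prints 10
    }
}
```

In this model the intermediate result has 20 rows and only the second pass has the intended 10, which is why the intermediate file must not be mistaken for the final result.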
[jira] Updated: (PIG-850) Dump produce wrong result while store into is ok
[ https://issues.apache.org/jira/browse/PIG-850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-850: --- Status: Patch Available (was: Open) Dump produce wrong result while store into is ok -- Key: PIG-850 URL: https://issues.apache.org/jira/browse/PIG-850 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.2.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.3.0 Attachments: PIG-850.patch The following script will wrongly produce 20 output, however, if we change dump to store into, the result is correct. Not sure if the problem is only for limited sort case. A = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa); B = order A by gpa parallel 2; C = limit B 10; dump C; -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-850) Dump produce wrong result while store into is ok
[ https://issues.apache.org/jira/browse/PIG-850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-850: --- Resolution: Fixed Status: Resolved (was: Patch Available) Patch submitted Dump produce wrong result while store into is ok -- Key: PIG-850 URL: https://issues.apache.org/jira/browse/PIG-850 Project: Pig Issue Type: Bug Components: impl Affects Versions: 0.2.0 Reporter: Daniel Dai Assignee: Daniel Dai Fix For: 0.3.0 Attachments: PIG-850.patch The following script will wrongly produce 20 output, however, if we change dump to store into, the result is correct. Not sure if the problem is only for limited sort case. A = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa); B = order A by gpa parallel 2; C = limit B 10; dump C; -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-797) Limit with ORDER BY producing wrong results
[ https://issues.apache.org/jira/browse/PIG-797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-797: --- Attachment: PIG-797.patch For the limited sort case, the extra limit map-reduce operator introduced in [PIG-364|http://issues.apache.org/jira/browse/PIG-364] should use the same output key as the previous sort map-reduce operator. Limit with ORDER BY producing wrong results --- Key: PIG-797 URL: https://issues.apache.org/jira/browse/PIG-797 Project: Pig Issue Type: Bug Affects Versions: 0.2.0 Reporter: Olga Natkovich Attachments: PIG-797.patch Query: A = load 'studenttab10k' as (name, age, gpa); B = group A by name; C = foreach B generate group, SUM(A.gpa) as rev; D = order C by rev; E = limit D 10; dump E; Output: (alice king,31.7) (alice laertes,26.453) (alice thompson,25.867) (alice van buren,23.59) (bob allen,19.902) (bob ichabod,29.0) (bob king,28.454) (bob miller,10.28) (bob underhill,28.137) (bob van buren,25.992) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-797) Limit with ORDER BY producing wrong results
[ https://issues.apache.org/jira/browse/PIG-797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-797: --- Fix Version/s: site Status: Patch Available (was: Open) Limit with ORDER BY producing wrong results --- Key: PIG-797 URL: https://issues.apache.org/jira/browse/PIG-797 Project: Pig Issue Type: Bug Affects Versions: 0.2.0 Reporter: Olga Natkovich Fix For: site Attachments: PIG-797.patch Query: A = load 'studenttab10k' as (name, age, gpa); B = group A by name; C = foreach B generate group, SUM(A.gpa) as rev; D = order C by rev; E = limit D 10; dump E; Output: (alice king,31.7) (alice laertes,26.453) (alice thompson,25.867) (alice van buren,23.59) (bob allen,19.902) (bob ichabod,29.0) (bob king,28.454) (bob miller,10.28) (bob underhill,28.137) (bob van buren,25.992) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-774) Pig does not handle Chinese characters (in both the parameter subsitution using -param_file or embedded in the Pig script) correctly
[ https://issues.apache.org/jira/browse/PIG-774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12710628#action_12710628 ] Daniel Dai commented on PIG-774: You get the point, Viraj. Actually we can have two different configurations: 1. LANG=UTF8: all data files, script files, and parameter files are UTF8. 2. LANG=GB2312: data files are UTF8; script files and parameter files are GB2312. However, the RHEL default setting is LANG=POSIX, which does not work well for Chinese characters. So for simplicity, we can have everything in UTF8 (case 1). This is the default setting for Ubuntu. Pig does not handle Chinese characters (in both the parameter subsitution using -param_file or embedded in the Pig script) correctly Key: PIG-774 URL: https://issues.apache.org/jira/browse/PIG-774 Project: Pig Issue Type: Bug Components: grunt, impl Affects Versions: 0.0.0 Reporter: Viraj Bhat Assignee: Daniel Dai Priority: Critical Fix For: 0.3.0 Attachments: chinese.txt, chinese_data.pig, nextgen_paramfile, pig_1240967860835.log, utf8.patch, utf8_parser-1.patch, utf8_parser-2.patch I created a very small test case in which I did the following. 1) Created a UTF-8 file which contained a query string in Chinese and wrote it to HDFS. I used this dfs file as an input for the tests. 2) Created a parameter file which also contained the same query string as in Step 1. 3) Created a Pig script which takes in the parametrized query string and hard coded Chinese character. 
Pig script: chinese_data.pig {code} rmf chineseoutput; I = load '/user/viraj/chinese.txt' using PigStorage('\u0001'); J = filter I by $0 == '$querystring'; --J = filter I by $0 == ' 歌手香港情牽女人心演唱會'; store J into 'chineseoutput'; dump J; {code} = Parameter file: nextgen_paramfile = queryid=20090311 querystring=' 歌手香港情牽女人心演唱會' = Input file: /user/viraj/chinese.txt = shell$ hadoop fs -cat /user/viraj/chinese.txt 歌手香港情牽女人心演唱會 = I ran the above set of inputs in the following ways: Run 1: = {code} java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' org.apache.pig.Main -param_file nextgen_paramfile chinese_data.pig {code} = 2009-04-22 01:31:35,703 [Thread-7] WARN org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. 2009-04-22 01:31:40,700 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete 2009-04-22 01:31:50,720 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete 2009-04-22 01:31:50,720 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success! = Run 2: removed the parameter substitution in the Pig script instead used the following statement. = {code} J = filter I by $0 == ' 歌手香港情牽女人心演唱會'; {code} = java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' org.apache.pig.Main chinese_data_withoutparam.pig = 2009-04-22 01:35:22,402 [Thread-7] WARN org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. 
2009-04-22 01:35:27,399 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete 2009-04-22 01:35:32,415 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete 2009-04-22 01:35:32,415 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success! = In both cases: = {code} shell $ hadoop
[jira] Resolved: (PIG-774) Pig does not handle Chinese characters (in both the parameter subsitution using -param_file or embedded in the Pig script) correctly
[ https://issues.apache.org/jira/browse/PIG-774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai resolved PIG-774. Resolution: Fixed Fix Version/s: (was: 0.0.0) 0.3.0 Yes, the patch is committed. Thanks Pig does not handle Chinese characters (in both the parameter subsitution using -param_file or embedded in the Pig script) correctly Key: PIG-774 URL: https://issues.apache.org/jira/browse/PIG-774 Project: Pig Issue Type: Bug Components: grunt, impl Affects Versions: 0.0.0 Reporter: Viraj Bhat Assignee: Daniel Dai Priority: Critical Fix For: 0.3.0 Attachments: chinese.txt, chinese_data.pig, nextgen_paramfile, pig_1240967860835.log, utf8.patch, utf8_parser-1.patch, utf8_parser-2.patch I created a very small test case in which I did the following. 1) Created a UTF-8 file which contained a query string in Chinese and wrote it to HDFS. I used this dfs file as an input for the tests. 2) Created a parameter file which also contained the same query string as in Step 1. 3) Created a Pig script which takes in the parametrized query string and hard coded Chinese character. Pig script: chinese_data.pig {code} rmf chineseoutput; I = load '/user/viraj/chinese.txt' using PigStorage('\u0001'); J = filter I by $0 == '$querystring'; --J = filter I by $0 == ' 歌手香港情牽女人心演唱會'; store J into 'chineseoutput'; dump J; {code} = Parameter file: nextgen_paramfile = queryid=20090311 querystring=' 歌手香港情牽女人心演唱會' = Input file: /user/viraj/chinese.txt = shell$ hadoop fs -cat /user/viraj/chinese.txt 歌手香港情牽女人心演唱會 = I ran the above set of inputs in the following ways: Run 1: = {code} java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' org.apache.pig.Main -param_file nextgen_paramfile chinese_data.pig {code} = 2009-04-22 01:31:35,703 [Thread-7] WARN org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. 
2009-04-22 01:31:40,700 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete 2009-04-22 01:31:50,720 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete 2009-04-22 01:31:50,720 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success! = Run 2: removed the parameter substitution in the Pig script instead used the following statement. = {code} J = filter I by $0 == ' 歌手香港情牽女人心演唱會'; {code} = java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' org.apache.pig.Main chinese_data_withoutparam.pig = 2009-04-22 01:35:22,402 [Thread-7] WARN org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. 2009-04-22 01:35:27,399 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete 2009-04-22 01:35:32,415 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete 2009-04-22 01:35:32,415 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success! = In both cases: = {code} shell $ hadoop fs -ls /user/viraj/chineseoutput Found 2 items drwxr-xr-x - viraj supergroup 0 2009-04-22 01:37 /user/viraj/chineseoutput/_logs -rw-r--r-- 3 viraj supergroup 0 2009-04-22 01:37 /user/viraj/chineseoutput/part-0 {code} = Additionally tried the dry-run option
[jira] Updated: (PIG-799) Unit tests on windows are failing after multiquery commit
[ https://issues.apache.org/jira/browse/PIG-799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-799: --- Resolution: Fixed Fix Version/s: 0.3.0 Status: Resolved (was: Patch Available) Unit tests on windows are failing after multiquery commit - Key: PIG-799 URL: https://issues.apache.org/jira/browse/PIG-799 Project: Pig Issue Type: Bug Reporter: Olga Natkovich Assignee: Daniel Dai Fix For: 0.3.0 Attachments: PIG-799.patch Daniel could you take a look. It should be reproducible with the latest trunk. Thanks -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-799) Unit tests on windows are failing after multiquery commit
[ https://issues.apache.org/jira/browse/PIG-799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai updated PIG-799: --- Attachment: PIG-799.patch The failure is caused by the changed logic of QueryParser.massageFilename in the multi-query patch. I have attached a patch; please review. Unit tests on windows are failing after multiquery commit - Key: PIG-799 URL: https://issues.apache.org/jira/browse/PIG-799 Project: Pig Issue Type: Bug Reporter: Olga Natkovich Assignee: Daniel Dai Attachments: PIG-799.patch Daniel could you take a look. It should be reproducible with the latest trunk. Thanks -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-774) Pig does not handle Chinese characters (in both the parameter subsitution using -param_file or embedded in the Pig script) correctly
[ https://issues.apache.org/jira/browse/PIG-774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12705799#action_12705799 ] Daniel Dai commented on PIG-774: Hi, Olga, I actually assume that only textual data inside a bytearray is UTF8. The functions I changed are DataByteArray.DataByteArray(String s) and DataByteArray.toString(), which convert a String to/from a byte array. Images or other binary data never need to be converted to/from a String, so they will be fine. Pig does not handle Chinese characters (in both the parameter subsitution using -param_file or embedded in the Pig script) correctly Key: PIG-774 URL: https://issues.apache.org/jira/browse/PIG-774 Project: Pig Issue Type: Bug Components: grunt, impl Affects Versions: 0.0.0 Reporter: Viraj Bhat Assignee: Daniel Dai Priority: Critical Fix For: 0.0.0 Attachments: chinese.txt, chinese_data.pig, nextgen_paramfile, pig_1240967860835.log, utf8.patch, utf8_parser-1.patch, utf8_parser-2.patch I created a very small test case in which I did the following. 1) Created a UTF-8 file which contained a query string in Chinese and wrote it to HDFS. I used this dfs file as an input for the tests. 2) Created a parameter file which also contained the same query string as in Step 1. 3) Created a Pig script which takes in the parametrized query string and hard coded Chinese character. 
Pig script: chinese_data.pig {code} rmf chineseoutput; I = load '/user/viraj/chinese.txt' using PigStorage('\u0001'); J = filter I by $0 == '$querystring'; --J = filter I by $0 == ' 歌手香港情牽女人心演唱會'; store J into 'chineseoutput'; dump J; {code} = Parameter file: nextgen_paramfile = queryid=20090311 querystring=' 歌手香港情牽女人心演唱會' = Input file: /user/viraj/chinese.txt = shell$ hadoop fs -cat /user/viraj/chinese.txt 歌手香港情牽女人心演唱會 = I ran the above set of inputs in the following ways: Run 1: = {code} java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' org.apache.pig.Main -param_file nextgen_paramfile chinese_data.pig {code} = 2009-04-22 01:31:35,703 [Thread-7] WARN org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. 2009-04-22 01:31:40,700 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete 2009-04-22 01:31:50,720 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete 2009-04-22 01:31:50,720 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success! = Run 2: removed the parameter substitution in the Pig script instead used the following statement. = {code} J = filter I by $0 == ' 歌手香港情牽女人心演唱會'; {code} = java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' org.apache.pig.Main chinese_data_withoutparam.pig = 2009-04-22 01:35:22,402 [Thread-7] WARN org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. 
2009-04-22 01:35:27,399 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete 2009-04-22 01:35:32,415 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete 2009-04-22 01:35:32,415 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success! = In both cases: = {code} shell $ hadoop fs -ls /user/viraj/chineseoutput Found 2 items drwxr-xr-x - viraj supergroup
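The comment at the top of this message says that the String to/from byte array conversions in DataByteArray treat text as UTF8. The general technique can be sketched with the standard Java charset API as below. This is not the code from utf8.patch: the class `Utf8RoundTripSketch` and its methods are hypothetical, and the sample string is the first two characters of the query string from the report, written as Unicode escapes so that the example does not depend on the encoding of the source file.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of an explicit UTF-8 round trip; not Pig's DataByteArray.
final class Utf8RoundTripSketch {

    /** Encodes text to bytes with an explicit charset instead of the platform default. */
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    /** Decodes bytes to text with the same explicit charset. */
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String query = "\u6B4C\u624B"; // first two characters of the query string in the report
        byte[] utf8 = encode(query);

        // Same charset on both sides: the text survives the round trip.
        System.out.println(decode(utf8).equals(query)); // prints "true"

        // Decoding the same bytes with a single-byte charset does not give the text back,
        // so an equality filter on the decoded value would match nothing.
        String misread = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(misread.equals(query)); // prints "false"
    }
}
```

Passing the charset explicitly keeps the result independent of the LANG setting of the machine that runs the code. Binary payloads that are never turned into a String are unaffected either way.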
[jira] Assigned: (PIG-774) Pig does not handle Chinese characters (in both the parameter subsitution using -param_file or embedded in the Pig script) correctly
[ https://issues.apache.org/jira/browse/PIG-774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Dai reassigned PIG-774: -- Assignee: Daniel Dai Pig does not handle Chinese characters (in both the parameter subsitution using -param_file or embedded in the Pig script) correctly Key: PIG-774 URL: https://issues.apache.org/jira/browse/PIG-774 Project: Pig Issue Type: Bug Components: grunt, impl Affects Versions: 0.0.0 Reporter: Viraj Bhat Assignee: Daniel Dai Priority: Critical Fix For: 0.0.0 Attachments: chinese.txt, chinese_data.pig, nextgen_paramfile, pig_1240967860835.log, utf8.patch, utf8_parser-1.patch, utf8_parser-2.patch I created a very small test case in which I did the following. 1) Created a UTF-8 file which contained a query string in Chinese and wrote it to HDFS. I used this dfs file as an input for the tests. 2) Created a parameter file which also contained the same query string as in Step 1. 3) Created a Pig script which takes in the parametrized query string and hard coded Chinese character. Pig script: chinese_data.pig {code} rmf chineseoutput; I = load '/user/viraj/chinese.txt' using PigStorage('\u0001'); J = filter I by $0 == '$querystring'; --J = filter I by $0 == ' 歌手香港情牽女人心演唱會'; store J into 'chineseoutput'; dump J; {code} = Parameter file: nextgen_paramfile = queryid=20090311 querystring=' 歌手香港情牽女人心演唱會' = Input file: /user/viraj/chinese.txt = shell$ hadoop fs -cat /user/viraj/chinese.txt 歌手香港情牽女人心演唱會 = I ran the above set of inputs in the following ways: Run 1: = {code} java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' org.apache.pig.Main -param_file nextgen_paramfile chinese_data.pig {code} = 2009-04-22 01:31:35,703 [Thread-7] WARN org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. 
2009-04-22 01:31:40,700 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete 2009-04-22 01:31:50,720 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete 2009-04-22 01:31:50,720 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success! = Run 2: removed the parameter substitution in the Pig script instead used the following statement. = {code} J = filter I by $0 == ' 歌手香港情牽女人心演唱會'; {code} = java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' org.apache.pig.Main chinese_data_withoutparam.pig = 2009-04-22 01:35:22,402 [Thread-7] WARN org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. 2009-04-22 01:35:27,399 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete 2009-04-22 01:35:32,415 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete 2009-04-22 01:35:32,415 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success! = In both cases: = {code} shell $ hadoop fs -ls /user/viraj/chineseoutput Found 2 items drwxr-xr-x - viraj supergroup 0 2009-04-22 01:37 /user/viraj/chineseoutput/_logs -rw-r--r-- 3 viraj supergroup 0 2009-04-22 01:37 /user/viraj/chineseoutput/part-0 {code} = Additionally tried the dry-run option to figure out if the parameter substitution was occurring properly.
[jira] Commented: (PIG-774) Pig does not handle Chinese characters (in both the parameter subsitution using -param_file or embedded in the Pig script) correctly
[ https://issues.apache.org/jira/browse/PIG-774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703007#action_12703007 ]
Daniel Dai commented on PIG-774:
Currently JLine does not handle backspace correctly for multibyte characters. When we hit backspace on an OS with UTF-8 encoding, only part of the character is deleted. If the OS uses a native encoding, the situation is even worse: JLine throws an exception when a multibyte character is entered. This problem is inherent in JLine, and all applications that use JLine share it. I will try to fix it in JLine; however, fixing it is out of the scope of Pig. So for now we will have to live with these problems:
# Multibyte character input is not supported if the OS encoding is native
# Backspace handling is incorrect if the line contains multibyte characters and the OS encoding is UTF-8
Interestingly, JLine works fine under Cygwin. The above problems occur only on Unix.
Pig does not handle Chinese characters (in both the parameter subsitution using -param_file or embedded in the Pig script) correctly
Key: PIG-774 URL: https://issues.apache.org/jira/browse/PIG-774 Project: Pig Issue Type: Bug Components: grunt, impl Affects Versions: 0.0.0 Reporter: Viraj Bhat Priority: Critical Fix For: 0.0.0 Attachments: chinese.txt, chinese_data.pig, nextgen_paramfile, utf8_parser-1.patch
I created a very small test case in which I did the following.
1) Created a UTF-8 file which contained a query string in Chinese and wrote it to HDFS. I used this dfs file as an input for the tests.
2) Created a parameter file which also contained the same query string as in Step 1.
3) Created a Pig script which takes in the parametrized query string and a hard-coded Chinese string.
Pig script: chinese_data.pig {code} rmf chineseoutput; I = load '/user/viraj/chinese.txt' using PigStorage('\u0001'); J = filter I by $0 == '$querystring'; --J = filter I by $0 == ' 歌手香港情牽女人心演唱會'; store J into 'chineseoutput'; dump J; {code} = Parameter file: nextgen_paramfile = queryid=20090311 querystring=' 歌手香港情牽女人心演唱會' = Input file: /user/viraj/chinese.txt = shell$ hadoop fs -cat /user/viraj/chinese.txt 歌手香港情牽女人心演唱會 = I ran the above set of inputs in the following ways: Run 1: = {code} java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' org.apache.pig.Main -param_file nextgen_paramfile chinese_data.pig {code} = 2009-04-22 01:31:35,703 [Thread-7] WARN org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. 2009-04-22 01:31:40,700 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete 2009-04-22 01:31:50,720 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete 2009-04-22 01:31:50,720 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success! = Run 2: removed the parameter substitution in the Pig script instead used the following statement. = {code} J = filter I by $0 == ' 歌手香港情牽女人心演唱會'; {code} = java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' org.apache.pig.Main chinese_data_withoutparam.pig = 2009-04-22 01:35:22,402 [Thread-7] WARN org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. 2009-04-22 01:35:27,399 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete 2009-04-22 01:35:32,415 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete 2009-04-22 01:35:32,415 [main] INFO
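The backspace problem described in the comment on this issue (only part of a multibyte character being deleted on a UTF-8 system) can be illustrated outside of JLine. The following is a minimal sketch, not JLine's actual code: the class and both helper methods are hypothetical names made up for this illustration. It compares dropping the last byte of a UTF-8 encoded line with dropping the last whole code point.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; it does not use or reproduce any JLine API.
class PartialDeleteDemo {
    // Byte-oriented "backspace": removes only the final byte of the UTF-8 form.
    // For a three-byte character this leaves a truncated sequence behind.
    static String dropLastByte(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        byte[] shorter = Arrays.copyOf(utf8, utf8.length - 1);
        return new String(shorter, StandardCharsets.UTF_8);
    }

    // Character-oriented "backspace": removes the whole last code point.
    // Assumes a non-empty string.
    static String dropLastCodePoint(String s) {
        int end = s.offsetByCodePoints(s.length(), -1);
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        // The last three characters of the query string, written as Unicode
        // escapes so the source file encoding does not matter.
        String line = "\u6F14\u5531\u6703";
        System.out.println("byte-wise delete:       " + dropLastByte(line));
        System.out.println("code-point-wise delete: " + dropLastCodePoint(line));
    }
}
```

The byte-wise variant leaves a dangling partial sequence, which the decoder turns into a replacement character, while the code-point variant removes the character cleanly.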
[jira] Commented: (PIG-771) PigDump does not properly output Chinese UTF8 characters - they are displayed as question marks ??
[ https://issues.apache.org/jira/browse/PIG-771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703010#action_12703010 ]
Daniel Dai commented on PIG-771:
Hi David, as far as I can tell PigDump should handle UTF-8 characters correctly. Can you describe the situation in which PigDump fails? What is your OS encoding?
PigDump does not properly output Chinese UTF8 characters - they are displayed as question marks ??
Key: PIG-771 URL: https://issues.apache.org/jira/browse/PIG-771 Project: Pig Issue Type: Bug Reporter: David Ciemiewicz
PigDump does not properly output Chinese UTF8 characters. The reason for this is that the function Tuple.toString() is called. DefaultTuple implements Tuple.toString() and it calls Object.toString() on the opaque object d. I think the code should be changed to call the new DataType.toString() function instead.
{code}
@Override
public String toString() {
    StringBuilder sb = new StringBuilder();
    sb.append('(');
    for (Iterator<Object> it = mFields.iterator(); it.hasNext();) {
        Object d = it.next();
        if(d != null) {
            if(d instanceof Map) {
                sb.append(DataType.mapToString((Map<Object, Object>)d));
            } else {
                sb.append(DataType.toString(d)); // Change this one line
                if(d instanceof Long) {
                    sb.append("L");
                } else if(d instanceof Float) {
                    sb.append("F");
                }
            }
        } else {
            sb.append("");
        }
        if (it.hasNext())
            sb.append(",");
    }
    sb.append(')');
    return sb.toString();
}
{code}
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
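The question about the OS encoding matters because Java replaces characters that the target charset cannot represent with that charset's replacement byte, which is '?' for US-ASCII. The sketch below is only an illustration of that general behaviour, not PigDump's code; the class and the display helper are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it is not part of Pig.
class QuestionMarkDemo {
    // Encodes the text with the given charset and decodes it again, roughly
    // what a reader sees when output is written using that charset.
    static String display(String text, Charset outputCharset) {
        byte[] written = text.getBytes(outputCharset);
        return new String(written, outputCharset);
    }

    public static void main(String[] args) {
        // First two characters of the Chinese sample, as Unicode escapes.
        String text = "\u6B4C\u624B";
        System.out.println("JVM default charset: " + Charset.defaultCharset());
        System.out.println("as UTF-8:    " + display(text, StandardCharsets.UTF_8));
        System.out.println("as US-ASCII: " + display(text, StandardCharsets.US_ASCII));
    }
}
```

If the JVM default charset reported on the affected machine cannot represent Chinese characters, question marks in the dump output would be consistent with this behaviour, although the sketch does not prove that this is the cause in PigDump.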
[jira] Commented: (PIG-771) PigDump does not properly output Chinese UTF8 characters - they are displayed as question marks ??
[ https://issues.apache.org/jira/browse/PIG-771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12703293#action_12703293 ]
Daniel Dai commented on PIG-771:
David, can you check the language setting on that computer? Use the command locale|grep LANG. Thanks.
PigDump does not properly output Chinese UTF8 characters - they are displayed as question marks ??
Key: PIG-771 URL: https://issues.apache.org/jira/browse/PIG-771 Project: Pig Issue Type: Bug Reporter: David Ciemiewicz
PigDump does not properly output Chinese UTF8 characters. The reason for this is that the function Tuple.toString() is called. DefaultTuple implements Tuple.toString() and it calls Object.toString() on the opaque object d. I think the code should be changed to call the new DataType.toString() function instead.
{code}
@Override
public String toString() {
    StringBuilder sb = new StringBuilder();
    sb.append('(');
    for (Iterator<Object> it = mFields.iterator(); it.hasNext();) {
        Object d = it.next();
        if(d != null) {
            if(d instanceof Map) {
                sb.append(DataType.mapToString((Map<Object, Object>)d));
            } else {
                sb.append(DataType.toString(d)); // Change this one line
                if(d instanceof Long) {
                    sb.append("L");
                } else if(d instanceof Float) {
                    sb.append("F");
                }
            }
        } else {
            sb.append("");
        }
        if (it.hasNext())
            sb.append(",");
    }
    sb.append(')');
    return sb.toString();
}
{code}
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-774) Pig does not handle Chinese characters (in both the parameter subsitution using -param_file or embedded in the Pig script) correctly
[ https://issues.apache.org/jira/browse/PIG-774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Daniel Dai updated PIG-774: --- Attachment: utf8_parser-1.patch
As Alan said, adding an option to QueryParser.jjt and ParamLoader.jj will do the trick. We probably do not need to hardcode UTF-8 into getBytes. If the OS encoding is UTF-8 (LANG=UTF-8), getBytes generates the byte array using the OS encoding, which is UTF-8. If the OS uses a native encoding (LANG=GB2312), getBytes generates the byte array in that native encoding, and SimpleCharStream then interprets the input stream in the same native encoding, so everything works.
One thing I want to point out: on a UTF-8 OS, everything is fine. On a legacy system with a native encoding, however, PigStorage treats all input/output files as UTF-8. That is reasonable, because all data files come from or go to the hadoop backend, for which UTF-8 is highly desirable. But these input/output files cannot be read with vi on an OS with a native encoding, since most applications (e.g. vi, cat) interpret input files using the OS encoding. In addition, if we do a Pig dump on such an OS, we will see a UTF-8 output stream, which looks garbled.
Script files and parameter files are local, and most users will edit them with vi, so we should interpret script files and parameter files using the OS encoding.
utf8_parser-1.patch is a preliminary patch. Viraj, can you give it a try? We also need to fix JLine, which does not currently handle multibyte characters well.
Pig does not handle Chinese characters (in both the parameter subsitution using -param_file or embedded in the Pig script) correctly
Key: PIG-774 URL: https://issues.apache.org/jira/browse/PIG-774 Project: Pig Issue Type: Bug Components: grunt, impl Affects Versions: 0.0.0 Reporter: Viraj Bhat Priority: Critical Fix For: 0.0.0 Attachments: chinese.txt, chinese_data.pig, nextgen_paramfile, utf8_parser-1.patch
I created a very small test case in which I did the following.
1) Created a UTF-8 file which contained a query string in Chinese and wrote it to HDFS. I used this dfs file as an input for the tests. 2) Created a parameter file which also contained the same query string as in Step 1. 3) Created a Pig script which takes in the parametrized query string and hard coded Chinese character. Pig script: chinese_data.pig {code} rmf chineseoutput; I = load '/user/viraj/chinese.txt' using PigStorage('\u0001'); J = filter I by $0 == '$querystring'; --J = filter I by $0 == ' 歌手香港情牽女人心演唱會'; store J into 'chineseoutput'; dump J; {code} = Parameter file: nextgen_paramfile = queryid=20090311 querystring=' 歌手香港情牽女人心演唱會' = Input file: /user/viraj/chinese.txt = shell$ hadoop fs -cat /user/viraj/chinese.txt 歌手香港情牽女人心演唱會 = I ran the above set of inputs in the following ways: Run 1: = {code} java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' org.apache.pig.Main -param_file nextgen_paramfile chinese_data.pig {code} = 2009-04-22 01:31:35,703 [Thread-7] WARN org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. 2009-04-22 01:31:40,700 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete 2009-04-22 01:31:50,720 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete 2009-04-22 01:31:50,720 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success! = Run 2: removed the parameter substitution in the Pig script instead used the following statement. = {code} J = filter I by $0 == ' 歌手香港情牽女人心演唱會'; {code} = java -cp pig.jar:/home/viraj/hadoop-0.18.0-dev/conf/ -Dhod.server='' org.apache.pig.Main chinese_data_withoutparam.pig = 2009-04-22
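The point in the comment on this update, that bytes must be read back with the same encoding they were written in, can be sketched in a few lines of Java. This is an illustration only and not the code in utf8_parser-1.patch; the class and the reinterpret helper are hypothetical names, and ISO-8859-1 merely stands in for "some other encoding" because it is available on every JVM.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it does not reproduce Pig's parser code.
class CharsetMismatchDemo {
    // Writes the text as bytes in one charset and reads the bytes back in another.
    static String reinterpret(String text, Charset writtenAs, Charset readAs) {
        byte[] bytes = text.getBytes(writtenAs);
        return new String(bytes, readAs);
    }

    public static void main(String[] args) {
        // Two characters of the sample string, as Unicode escapes.
        String text = "\u6B4C\u624B";
        Charset utf8 = StandardCharsets.UTF_8;
        Charset other = StandardCharsets.ISO_8859_1; // stand-in for a non-UTF-8 encoding

        // Same charset on both sides: the text survives.
        System.out.println(reinterpret(text, utf8, utf8).equals(text));
        // Written as UTF-8 but read with another charset: the text is garbled.
        System.out.println(reinterpret(text, utf8, other).equals(text));
        // With no explicit charset, getBytes() uses the JVM default, which
        // normally follows the OS setting.
        System.out.println(Charset.defaultCharset() + ": " + text.getBytes().length + " bytes");
    }
}
```

The same reasoning applies to a script file edited in a native encoding: as long as it is encoded and decoded with the same OS charset, the query string is preserved.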
[jira] Commented: (PIG-737) A few unit tests take a long timeto run on windows
[ https://issues.apache.org/jira/browse/PIG-737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12695980#action_12695980 ]
Daniel Dai commented on PIG-737:
I compared the unit test times on Unix and Cygwin. This is not a performance comparison between Unix and Cygwin, because I used different computers for the two runs; rather, I was trying to find out whether any particular unit test is significantly slower. All unit tests that use MiniCluster (marked in red in the table below) are consistently slower under Cygwin in my test. For the long unit tests (100s), the slowdown ranges from about 1.29 to 2.23. Here is the list:
||Test Case||Test Time in Unix||Test Time in Cygwin||
|TestAdd|0.116|0.25|
|{color:red}TestAlgebraicEval{color}|350.899|451.75|
|TestAlgebraicEvalLocal|23.428|15.531|
|{color:red}TestBZip{color}|35.987|46.391|
|{color:red}TestBestFitCast{color}|211.681|409|
|TestBinaryStorage|1.135|4.062|
|TestBoolean|0.212|0.141|
|TestBuiltin|1.111|0.672|
|TestCmdLineParser|0.064|0.047|
|{color:red}TestCombiner{color}|144.184|222.625|
|{color:red}TestCompressedFiles{color}|24.744|50.359|
|TestConstExpr|0.115|0.063|
|TestConversions|0.385|0.25|
|{color:red}TestCustomSlicer{color}|953.088|211.719|
|TestDataBag|1.329|1.625|
|{color:red}TestDataBagAccess{color}|138.525|261.281|
|TestDataModel|0.264|0.281|
|TestDeleteOnFail|0.072|0.172|
|TestDivide|0.093|0.031|
|TestEqualTo|0.261|0.141|
|{color:red}TestEvalPipeline{color}|608.986|1,089.17|
|{color:red}TestEvalPipeline2{color}|64.247|127.532|
|TestEvalPipelineLocal|5.351|4.422|
|{color:red}TestExampleGenerator{color}|3.169|5.578|
|{color:red}TestFRJoin{color}|413.982|667.406|
|TestFilter|0.323|0.156|
|{color:red}TestFilterOpNumeric{color}|66.538|145.922|
|{color:red}TestFilterOpString{color}|54.282|105.844|
|{color:red}TestFilterUDF{color}|17.943|40.14|
|TestFinish|16.55|25.015|
|TestForEach|0.28|0.172|
|TestForEachNestedPlan|20.252|23.312|
|TestForEachNestedPlanLocal|0.591|0.562|
|TestFuncSpec|0.225|0.141|
|TestGTOrEqual|0.264|0.156|
|TestGreaterThan|0.264|0.188|
|TestGrunt|3.083|2.343|
|TestImplicitSplit|1.427|3.453|
|{color:red}TestInfixArithmetic{color}|76.284|146.781|
|{color:red}TestInputOutputFileValidator{color}|0.717|1.984|
|TestInstantiateFunc|0.055|0.032|
|{color:red}TestJobSubmission{color}|6.646|4.641|
|TestKeyTypeDiscoveryVisitor|55.592|88.703|
|TestLTOrEqual|0.264|0.171|
|TestLessThan|0.263|0.171|
|TestLoad|0.286|0.219|
|TestLocal|1.698|2.172|
|TestLocal2|0.713|0.609|
|TestLocalJobSubmission|7.925|8.438|
|TestLocalPOSplit|0.72|1.5|
|TestLocalRearrange|0.27|0.156|
|TestLogToPhyCompiler|1.13|1.719|
|TestLogicalOptimizer|1.519|2.093|
|TestLogicalPlanBuilder|1.804|2.203|
|TestMRCompiler|0.79|0.625|
|{color:red}TestMapReduce{color}|238.474|405.843|
|TestMapReduce2|40.298|46.078|
|TestMod|0.062|0.047|
|TestMultiply|0.062|0.031|
|TestNotEqualTo|0.254|0.188|
|TestNull|0.274|0.172|
|{color:red}TestNullConstant{color}|85.932|172.828|
|TestOperatorPlan|0.236|0.187|
|TestPOBinCond|0.114|0.078|
|TestPOCast|0.302|0.235|
|TestPOCogroup|0.251|0.156|
|TestPOCross|0.227|0.125|
|TestPODistinct|0.079|0.046|
|TestPOGenerate|0.093|0.046|
|TestPOMapLookUp|0.204|0.141|
|{color:red}TestPONegative{color}|9.339|22.531|
|TestPOSort|0.312|0.235|
|TestPOUserFunc|0.291|0.265|
|TestPackage|107.555|86.578|
|TestParamSubPreproc|0.586|2.734|
|{color:red}TestParser{color}|56.527|18.078|
|TestPhyOp|0.261|0.141|
|TestPigContext|1.265|1.922|
|TestPigScriptParser|0.411|0.391|
|{color:red}TestPigServer{color}|3.56|1.391|
|TestPigSplit|0.988|1.672|
|TestProject|0.283|0.188|
|TestRegexp|0.189|0.11|
|TestSchema|0.219|0.156|
|TestSchemaParser|0.323|0.219|
|{color:red}TestSplitStore{color}|363.749|562.797|
|TestStore|0.459|0.313|
|{color:red}TestStoreOld{color}|115.725|212.735|
|TestStreaming|2.017|14.344|
|TestStreamingLocal|1.964|12.875|
|TestSubtract|0.062|0.031|
|TestTypeChecking|1.264|1.781|
|TestTypeCheckingValidator|6.584|5.625|
|TestTypeCheckingValidatorNoSchema|0.408|0.25|
|{color:red}TestUnion{color}|43.671|75.234|
Here is a breakdown for one of the test cases, TestDataBagAccess:
||Test||Test Time in Unix||Test Time in Cygwin||
|testSingleTupleBagAcess|0.261|0.188|
|testNonSpillableDataBag|0.14|0.093|
|testBagConstantAccess|17.773|23.375|
|testBagConstantAccessFailure|0.194|0.438|
|testBagConstantFlatten1|15.794|17.343|
|testBagConstantFlatten2|25.736|35.594|
|testBagStoreLoad|157.403|239.734|
Based on these data, no particular long unit test is significantly slower than the others. The slowdown factor is relatively stable considering the diversity of the code we are testing. So I think what we need to deal with is a general performance problem under Cygwin rather than a problem in a particular unit test. Does anyone see exceptions to this on other computers?
A few unit tests take a long timeto run on windows --- Key: PIG-737 URL:
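The slowdown factor quoted in the comment above can be read as the Cygwin time divided by the Unix time. That reading is an assumption, since the comment does not spell out the formula. A minimal sketch, using four MiniCluster rows copied from the table (the class and method names are hypothetical):

```java
// Hypothetical helper for illustration; it is not part of the Pig test suite.
class SlowdownDemo {
    // Slowdown factor, assumed here to be Cygwin time divided by Unix time.
    static double slowdown(double unixTime, double cygwinTime) {
        return cygwinTime / unixTime;
    }

    public static void main(String[] args) {
        // Rows copied from the table above: name, Unix time, Cygwin time.
        String[] names = {"TestAlgebraicEval", "TestBestFitCast", "TestEvalPipeline", "TestFRJoin"};
        double[] unix = {350.899, 211.681, 608.986, 413.982};
        double[] cygwin = {451.75, 409.0, 1089.17, 667.406};

        for (int i = 0; i < names.length; i++) {
            System.out.println(String.format("%-20s %.2f", names[i], slowdown(unix[i], cygwin[i])));
        }
    }
}
```

Under that assumption, TestAlgebraicEval gives the low end of the quoted range.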
[jira] Commented: (PIG-737) A few unit tests take a long timeto run on windows
[ https://issues.apache.org/jira/browse/PIG-737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12689779#action_12689779 ]
Daniel Dai commented on PIG-737:
I will take a look at it, maybe this weekend.
A few unit tests take a long timeto run on windows
Key: PIG-737 URL: https://issues.apache.org/jira/browse/PIG-737 Project: Pig Issue Type: Bug Components: build Affects Versions: 1.0.0 Environment: Windows Reporter: Santhosh Srinivasan Fix For: 1.1.0
A few unit tests take a long time to run on Windows. This problem has to be diagnosed.
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Resolved: (PIG-654) Optimize build.xml
[ https://issues.apache.org/jira/browse/PIG-654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Daniel Dai resolved PIG-654. Resolution: Fixed
Patch committed.
Optimize build.xml
Key: PIG-654 URL: https://issues.apache.org/jira/browse/PIG-654 Project: Pig Issue Type: Improvement Components: impl Affects Versions: types_branch Reporter: Daniel Dai Assignee: Daniel Dai Priority: Trivial Fix For: types_branch Attachments: PIG-654.patch
The build process can be faster. Here are the issues that slow it down:
1. test/org/apache/pig/test/utils/dotGraph/Dot.jjt should be renamed to DOTParser.jjt. [jjtree] assumes that the output file name is the same as the input file name; otherwise, it recompiles every time.
2. Delete src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPostCombinerPackage.java and src/org/apache/pig/impl/plan/SplitIntroducer.java. Both classes are empty and generate no class files, so Ant recompiles them every time because it does not see the output class files.
I do not know how to create a patch with file deletions and renames, so I describe the actions below:
1. mv test/org/apache/pig/test/utils/dotGraph/Dot.jjt test/org/apache/pig/test/utils/dotGraph/DOTParser.jjt
2. rm src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPostCombinerPackage.java
3. rm src/org/apache/pig/impl/plan/SplitIntroducer.java
4. apply patch to change build.xml (Dot.jjt -> DOTParser.jjt)
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-631) 4 Unit test failures on Windows
[ https://issues.apache.org/jira/browse/PIG-631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12669432#action_12669432 ]
Daniel Dai commented on PIG-631:
The problem is caused by [HBASE-1063|http://issues.apache.org/jira/browse/HBASE-1063]. It is fixed in hbase 0.19.0. However, hbase 0.19.0 works with hadoop 0.19.0 only, so we cannot upgrade hbase right now. I think we can disable TestHBaseStorage under Windows for now. Once we upgrade to hadoop 0.19, we can also upgrade hbase to 0.19. This issue is fixed automatically once we use hbase 0.19 (in my test, one additional action was needed: removing /tmp/hbase manually so that hbase recreates its file system). So once we move to hadoop 0.19 and upgrade to hbase 0.19, we should revert the temporary patch and close this issue.
4 Unit test failures on Windows
Key: PIG-631 URL: https://issues.apache.org/jira/browse/PIG-631 Project: Pig Issue Type: Bug Affects Versions: types_branch Environment: Windows Reporter: Lee Tucker Attachments: PIG-631.temp.patch
4 Windows unit test failures. All timeouts. Errors occur at tip of branch and have for several days.
[junit] Running org.apache.pig.test.TestAlgebraicEval
[junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0 sec
[junit] Test org.apache.pig.test.TestAlgebraicEval FAILED (timeout)
[junit] Running org.apache.pig.test.TestEvalPipeline
[junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0 sec
[junit] Test org.apache.pig.test.TestEvalPipeline FAILED (timeout)
[junit] Running org.apache.pig.test.TestHBaseStorage
[junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0 sec
[junit] Test org.apache.pig.test.TestHBaseStorage FAILED (timeout)
[junit] Running org.apache.pig.test.TestMapReduce
[junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0 sec
[junit] Test org.apache.pig.test.TestMapReduce FAILED (timeout)
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-654) Optimize build.xml
Optimize build.xml
Key: PIG-654 URL: https://issues.apache.org/jira/browse/PIG-654 Project: Pig Issue Type: Improvement Components: impl Affects Versions: types_branch Reporter: Daniel Dai Assignee: Daniel Dai Priority: Trivial Fix For: types_branch
The build process can be faster. Here are the issues that slow it down:
1. test/org/apache/pig/test/utils/dotGraph/Dot.jjt should be renamed to DOTParser.jjt. [jjtree] assumes that the output file name is the same as the input file name; otherwise, it recompiles every time.
2. Delete src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPostCombinerPackage.java and src/org/apache/pig/impl/plan/SplitIntroducer.java. Both classes are empty and generate no class files, so Ant recompiles them every time because it does not see the output class files.
I do not know how to create a patch with file deletions and renames, so I describe the actions below:
1. mv test/org/apache/pig/test/utils/dotGraph/Dot.jjt test/org/apache/pig/test/utils/dotGraph/DOTParser.jjt
2. rm src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPostCombinerPackage.java
3. rm src/org/apache/pig/impl/plan/SplitIntroducer.java
4. apply patch to change build.xml (Dot.jjt -> DOTParser.jjt)
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-631) 4 Unit test failures on Windows
[ https://issues.apache.org/jira/browse/PIG-631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12668641#action_12668641 ]
Daniel Dai commented on PIG-631:
Patch committed. Please leave the case open until I figure out the problem with TestHBaseStorage. Thanks!
4 Unit test failures on Windows
Key: PIG-631 URL: https://issues.apache.org/jira/browse/PIG-631 Project: Pig Issue Type: Bug Affects Versions: types_branch Environment: Windows Reporter: Lee Tucker Attachments: PIG-631.temp.patch
4 Windows unit test failures. All timeouts. Errors occur at tip of branch and have for several days.
[junit] Running org.apache.pig.test.TestAlgebraicEval
[junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0 sec
[junit] Test org.apache.pig.test.TestAlgebraicEval FAILED (timeout)
[junit] Running org.apache.pig.test.TestEvalPipeline
[junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0 sec
[junit] Test org.apache.pig.test.TestEvalPipeline FAILED (timeout)
[junit] Running org.apache.pig.test.TestHBaseStorage
[junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0 sec
[junit] Test org.apache.pig.test.TestHBaseStorage FAILED (timeout)
[junit] Running org.apache.pig.test.TestMapReduce
[junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 0 sec
[junit] Test org.apache.pig.test.TestMapReduce FAILED (timeout)
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-501) Make branches/types work under cygwin
[ https://issues.apache.org/jira/browse/PIG-501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Daniel Dai updated PIG-501: --- Attachment: javacc.jar cygwin.javacc42.patch
This makes Pig work under cygwin with the javacc 4.2 release. Unit tests pass on both cygwin and Linux. We will need to:
1. Apply cygwin.javacc42.patch
2. Replace javacc.jar with the one from the javacc 4.2 release. I have also attached the new javacc.jar.
Make branches/types work under cygwin
Key: PIG-501 URL: https://issues.apache.org/jira/browse/PIG-501 Project: Pig Issue Type: Bug Components: impl Affects Versions: types_branch Environment: cygwin Reporter: Daniel Dai Assignee: Daniel Dai Fix For: types_branch Attachments: cygwin.javacc42.patch, javacc.jar, PIG-501-new1.patch, PIG_cygwin.Patch, PIG_cygwin2.Patch, PIG_cygwin_original_javacc.Patch
We have already made all unit tests pass under cygwin for trunk (see [PIG-243|https://issues.apache.org/jira/browse/PIG-243]). We need to do the same for branches/types.
-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.