[jira] [Commented] (PIG-1998) Allow macro to return void
[ https://issues.apache.org/jira/browse/PIG-1998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025600#comment-13025600 ]

Xuefu Zhang commented on PIG-1998:
----------------------------------

For patch PIG-1998_2.patch:

1. In the following grammar, the second rule should be just ^(RETURN_VAL). With that, "return void" is equivalent to returning zero aliases, so the code no longer needs this kind of check: rets.size() == 1 && rets.get(0).equals("void")

+macro_return_clause
+: RETURNS alias (COMMA alias)*
+-> ^(RETURN_VAL alias+)
+| RETURNS VOID
+-> ^(RETURN_VAL VOID)

2. The bigger concern is the newly added method validate(). I don't think the StreamingTokenizer will meet our needs. For instance, it is not able to recognize Pig single-line comments such as:

-- this is a single line comment.

Even if this isn't a problem, the maintenance overhead could become a nightmare for us in the long run. I don't necessarily have a better idea, but I think we should at least give this more thought.

> Allow macro to return void
> --------------------------
>
> Key: PIG-1998
> URL: https://issues.apache.org/jira/browse/PIG-1998
> Project: Pig
> Issue Type: Improvement
> Components: impl
> Affects Versions: 0.9.0
> Reporter: Richard Ding
> Assignee: Richard Ding
> Fix For: 0.9.0
>
> Attachments: PIG-1998_1.patch, PIG-1998_2.patch
>
>
> A Pig macro is allowed to have no output alias. But this property isn't clear
> from the macro definition and macro invocation (macro inline). Here we propose
> to make it clear:
> 1. If a macro doesn't output any alias, it must specify void as its return value.
> For example:
> {code}
> define mymacro(...) returns void {
>    ... ...
> };
> {code}
> 2. If a macro doesn't output any alias, it must be invoked without a return
> value. For example, to invoke the above macro, just specify:
> {code}
> mymacro(...);
> {code}
> 3. Any non-void return alias in the macro definition must exist in the macro
> body and be prefixed with $. For example:
> {code}
> define mymacro(...) returns B {
>    ... ...
>    $B = filter ...;
> };
> {code}

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
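The first review point can be illustrated with a minimal Python sketch. It is not Pig's ANTLR grammar or its Java code; `parse_return_clause` and `has_output` are hypothetical helpers. The idea is that `returns void` yields an empty alias list, so later code only tests for emptiness instead of comparing against the string "void".

```python
from typing import List


def parse_return_clause(clause: str) -> List[str]:
    """Parse 'returns A, B' or 'returns void' into a list of aliases.

    Hypothetical helper for illustration only; in Pig this is handled by
    the grammar, where the resulting tree would carry zero or more aliases.
    """
    keyword, _, rest = clause.strip().partition(" ")
    if keyword.lower() != "returns":
        raise ValueError(f"expected 'returns', got {keyword!r}")
    names = [name.strip() for name in rest.split(",")]
    if [name.lower() for name in names] == ["void"]:
        # 'void' is consumed here and never stored as an alias.
        return []
    if any(not name for name in names):
        raise ValueError(f"malformed return clause: {clause!r}")
    return names


def has_output(aliases: List[str]) -> bool:
    # No special-casing of the string "void" is needed downstream.
    return len(aliases) > 0
```

With this shape, a void macro and a macro with zero return aliases are indistinguishable to the rest of the code, which is the equivalence the comment asks for.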
[jira] [Updated] (PIG-2017) consumeMap() fails with EmptyStackException
[ https://issues.apache.org/jira/browse/PIG-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jacob Perkins updated PIG-2017:
-------------------------------

Attachment: utf8storagepatch.txt

Uses a try...catch block to catch the EmptyStackException and throws an IOException about a 'malformed map'.

> consumeMap() fails with EmptyStackException
> -------------------------------------------
>
> Key: PIG-2017
> URL: https://issues.apache.org/jira/browse/PIG-2017
> Project: Pig
> Issue Type: Bug
> Reporter: Jacob Perkins
> Attachments: utf8storagepatch.txt
>
>
> If a map is read in its serialized form, e.g. [key#value], then the
> consumeMap() method of Utf8StorageConverter fails for the following maps:
> {code:none}
> [a#)]
> [a#}]
> [a#"take a look at my lovely curly brace, }"]
> [a#'oh look, a closed parenthesis! )']
> {code}
> There are a couple of options:
> 1. Define an escape sequence (i.e. quotes or a backslash)
> 2. Call it a bad record, go get a beer, and move on.
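The approach described in the patch note can be sketched in Python. This is not the code in utf8storagepatch.txt or in Utf8StorageConverter; `consume_value` is a hypothetical stack-based scanner with the same failure mode. A closing delimiter that arrives when the stack is empty raises a low-level error, which is caught and reported as a malformed map.

```python
def consume_value(value: str) -> str:
    """Scan a map value, matching delimiters with a stack.

    Illustrative only. Popping an empty Python list raises IndexError,
    the analogue of java.util.EmptyStackException in the report.
    """
    closers = {")": "(", "}": "{", "]": "["}
    stack = []
    try:
        for ch in value:
            if ch in closers.values():
                stack.append(ch)
            elif ch in closers:
                if stack.pop() != closers[ch]:
                    raise IOError(f"malformed map value: {value!r}")
    except IndexError as exc:
        # Translate the low-level failure into a descriptive I/O error.
        raise IOError(f"malformed map value: {value!r}") from exc
    if stack:
        raise IOError(f"malformed map value: {value!r}")
    return value
```

This corresponds to option 2 in the report: the record is rejected with a clear message rather than being parsed.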
[jira] [Updated] (PIG-2017) consumeMap() fails with EmptyStackException
[ https://issues.apache.org/jira/browse/PIG-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jacob Perkins updated PIG-2017:
-------------------------------

Status: Open (was: Patch Available)
[jira] [Updated] (PIG-2017) consumeMap() fails with EmptyStackException
[ https://issues.apache.org/jira/browse/PIG-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jacob Perkins updated PIG-2017:
-------------------------------

Status: Patch Available (was: Open)
[jira] [Created] (PIG-2017) consumeMap() fails with EmptyStackException
consumeMap() fails with EmptyStackException
-------------------------------------------

Key: PIG-2017
URL: https://issues.apache.org/jira/browse/PIG-2017
Project: Pig
Issue Type: Bug
Reporter: Jacob Perkins

If a map is read in its serialized form, e.g. [key#value], then the consumeMap() method of Utf8StorageConverter fails for the following maps:

{code:none}
[a#)]
[a#}]
[a#"take a look at my lovely curly brace, }"]
[a#'oh look, a closed parenthesis! )']
{code}

There are a couple of options:
1. Define an escape sequence (i.e. quotes or a backslash)
2. Call it a bad record, go get a beer, and move on.
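Option 1 could look roughly like the following Python sketch. The quoting convention here is an assumption made for illustration: text inside single or double quotes is literal, and a backslash escapes the next character. The report only proposes that some such convention be defined; `value_is_balanced` is a hypothetical name.

```python
def value_is_balanced(value: str) -> bool:
    """Return True if delimiters outside quoted text are balanced.

    Assumed convention (not defined by Pig in the report): quotes protect
    their contents and a backslash escapes the following character.
    """
    pairs = {")": "(", "}": "{", "]": "["}
    stack = []
    quote = None      # the quote character we are currently inside, if any
    escaped = False   # True when the previous character was a backslash
    for ch in value:
        if escaped:
            escaped = False
        elif ch == "\\":
            escaped = True
        elif quote:
            if ch == quote:
                quote = None
        elif ch in "'\"":
            quote = ch
        elif ch in pairs.values():
            stack.append(ch)
        elif ch in pairs:
            if not stack or stack.pop() != pairs[ch]:
                return False
    return not stack and quote is None
```

Under this convention the two quoted examples from the report are accepted, while the bare `)` and `}` values are still rejected unless escaped.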
[jira] [Updated] (PIG-1827) When passing a parameter to Pig, if the value contains $ it has to be escaped for no apparent reason
[ https://issues.apache.org/jira/browse/PIG-1827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Ding updated PIG-1827:
------------------------------

Attachment: PIG-1827_2.patch

The new patch adds a test case that supports $ as part of a parameter value:

{code}
separator = '$'
P = Pig.compile("""a = load 'input' using PigStorage('$separator');store a into 'output';""")
Q = P.bind()
{code}

On the other hand, Pig Latin doesn't support '\$' in its variables or string literals.

> When passing a parameter to Pig, if the value contains $ it has to be escaped
> for no apparent reason
> ------------------------------------------------------------------------------
>
> Key: PIG-1827
> URL: https://issues.apache.org/jira/browse/PIG-1827
> Project: Pig
> Issue Type: Bug
> Components: impl
> Affects Versions: 0.8.0
> Reporter: Julien Le Dem
> Assignee: Richard Ding
> Fix For: 0.9.0
>
> Attachments: PIG-1827-1.patch, PIG-1827_2.patch
>
>
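Why a `$` in a parameter value need not be escaped can be shown with a small Python sketch. This is not Pig's parameter substitution code; `substitute` is a hypothetical helper. Each `$name` is replaced in a single pass and the bound value is inserted literally, so a `$` inside the value is never re-interpreted.

```python
import re

# Assumed parameter syntax for this sketch: '$' followed by an identifier.
_PARAM = re.compile(r"\$([A-Za-z_][A-Za-z0-9_]*)")


def substitute(script: str, params: dict) -> str:
    """Replace each $name in `script` with params[name] in one pass."""
    def lookup(match):
        name = match.group(1)
        if name not in params:
            raise KeyError(f"undefined parameter: {name}")
        # A function replacement is inserted as-is: '$' and backslashes
        # in the value are not treated as substitution syntax.
        return params[name]
    return _PARAM.sub(lookup, script)
```

For the test case above, binding `separator` to `'$'` turns `PigStorage('$separator')` into `PigStorage('$')` without any escaping.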
[jira] [Updated] (PIG-1998) Allow macro to return void
[ https://issues.apache.org/jira/browse/PIG-1998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Ding updated PIG-1998:
------------------------------

Attachment: PIG-1998_2.patch

Attaching a new patch that addresses Xuefu's review comments.
[jira] [Updated] (PIG-1989) complex type casting should return null on casting failure
[ https://issues.apache.org/jira/browse/PIG-1989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Daniel Dai updated PIG-1989:
----------------------------

Attachment: PIG-1989-1.patch

This happens when the size of the tuple's inner schema does not match the data. Attaching PIG-1989-1.patch.

> complex type casting should return null on casting failure
> ----------------------------------------------------------
>
> Key: PIG-1989
> URL: https://issues.apache.org/jira/browse/PIG-1989
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.9.0
> Reporter: Thejas M Nair
> Assignee: Daniel Dai
> Fix For: 0.9.0
>
> Attachments: PIG-1989-1.patch
>
>
> When casting fails for complex objects, Pig currently returns the un-casted
> object.
> It should return null instead. That is consistent with the behavior when
> casting to other basic types.
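The requested behaviour can be sketched in Python as follows. This is not the code in PIG-1989-1.patch; `cast_tuple` is a hypothetical helper in which the inner schema is represented as a list of per-field cast functions. A size mismatch or a failing field cast produces None (null) for the whole tuple instead of the un-casted input.

```python
def cast_tuple(fields, casters):
    """Cast each field with the matching caster, or return None on failure.

    `casters` stands in for the tuple's inner schema (an assumption of
    this sketch): one callable per expected field.
    """
    if len(fields) != len(casters):
        # Schema size and data size disagree: the cast fails as a whole.
        return None
    try:
        return tuple(cast(value) for cast, value in zip(casters, fields))
    except (TypeError, ValueError):
        return None
```

Returning None for the whole tuple mirrors how a failed cast to a basic type yields null rather than the original value.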
[jira] [Commented] (PIG-1998) Allow macro to return void
[ https://issues.apache.org/jira/browse/PIG-1998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025433#comment-13025433 ]

Xuefu Zhang commented on PIG-1998:
----------------------------------

For patch PIG-1998_1.patch:

1. I don't see the VOID keyword defined or used in any of the grammar rules.
2. The grammar rules for inline_return_clause overlap.
3. Will re-review once the patch is updated.
[jira] [Created] (PIG-2016) -dot option does not work with explain and new logical plan
-dot option does not work with explain and new logical plan
------------------------------------------------------------

Key: PIG-2016
URL: https://issues.apache.org/jira/browse/PIG-2016
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.8.0, 0.8.1, 0.9.0
Reporter: Alan Gates
Priority: Minor
Fix For: 0.9.0

If you specify -dot in explain, it is supposed to produce a file with the graphs in .dot format. While the physical plan and map reduce plan are correctly output in .dot format, the new logical plan is still output in text format.
[jira] [Created] (PIG-2015) Explain writes out logical plan twice
Explain writes out logical plan twice
-------------------------------------

Key: PIG-2015
URL: https://issues.apache.org/jira/browse/PIG-2015
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: 0.9.0
Reporter: Alan Gates
Priority: Minor
Fix For: 0.9.0

Running explain on a script writes out the logical plan twice, the physical plan once, and the map reduce plan once.
[jira] [Resolved] (PIG-2004) Incorrect input types passed on to eval function
[ https://issues.apache.org/jira/browse/PIG-2004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair resolved PIG-2004.
--------------------------------

Resolution: Fixed

Patch committed to trunk and 0.9 branch.

> Incorrect input types passed on to eval function
> ------------------------------------------------
>
> Key: PIG-2004
> URL: https://issues.apache.org/jira/browse/PIG-2004
> Project: Pig
> Issue Type: Bug
> Components: impl
> Affects Versions: 0.9.0
> Reporter: Vivek Padmanabhan
> Assignee: Thejas M Nair
> Fix For: 0.9.0
>
> Attachments: PIG-2004-0.patch, PIG-2004.1.patch
>
>
> The script below fails by throwing a ClassCastException from the MAX udf. The
> udf expects the values in the supplied bag to be DataByteArray, but at run
> time the udf gets the actual type, i.e. Double in this case. This causes the
> script execution to fail with the exception:
> | Caused by: java.lang.ClassCastException: java.lang.Double cannot be cast to
> org.apache.pig.data.DataByteArray
> The same script runs properly with Pig 0.8.
> {code}
> A = LOAD 'myinput' as (f1,f2,f3);
> B = foreach A generate f1,f2+f3/1000.0 as doub;
> C = group B by f1;
> D = foreach C generate (long)(MAX(B.doub)) as f4;
> dump D;
> {code}
> myinput
> ---
> a 100012345
> b 200023456
> c 300034567
> a 150054321
> b 250065432
[jira] [Commented] (PIG-2004) Incorrect input types passed on to eval function
[ https://issues.apache.org/jira/browse/PIG-2004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025389#comment-13025389 ]

Daniel Dai commented on PIG-2004:
----------------------------------

+1
[jira] [Resolved] (PIG-2006) Regression: NPE when Pig processes an empty script file
[ https://issues.apache.org/jira/browse/PIG-2006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xuefu Zhang resolved PIG-2006.
------------------------------

Resolution: Fixed

> Regression: NPE when Pig processes an empty script file
> -------------------------------------------------------
>
> Key: PIG-2006
> URL: https://issues.apache.org/jira/browse/PIG-2006
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.9.0
> Reporter: Xuefu Zhang
> Assignee: Xuefu Zhang
> Fix For: 0.9.0
>
> Attachments: PIG-2006-1.patch, PIG-2006.patch
>
>
> If a pig script file is empty and supplied as input for Pig (using the -f
> option), an NPE is thrown. Stacktrace:
> java.lang.NullPointerException
>     at java.util.regex.Matcher.getTextLength(Matcher.java:1140)
>     at java.util.regex.Matcher.reset(Matcher.java:291)
>     at java.util.regex.Matcher.<init>(Matcher.java:211)
>     at java.util.regex.Pattern.matcher(Pattern.java:888)
>     at org.apache.pig.scripting.ScriptEngine$SupportedScriptLang.accepts(ScriptEngine.java:89)
>     at org.apache.pig.scripting.ScriptEngine.getSupportedScriptLang(ScriptEngine.java:163)
>     at org.apache.pig.Main.determineScriptType(Main.java:892)
>     at org.apache.pig.Main.run(Main.java:378)
>     at org.apache.pig.Main.main(Main.java:108)
> This seems related to Jython support in 0.9.
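The stack trace shows a null value reaching Pattern.matcher() from the script-language check. A guard of the following shape would avoid that, shown here as a Python sketch. The actual fix in the attached patches is not reproduced here; the shebang-style pattern and the reading of "empty file means no first line" are assumptions.

```python
import re
from typing import Optional

# Assumed pattern for illustration; the pattern Pig actually uses is not
# visible in the stack trace.
_JYTHON_FIRST_LINE = re.compile(r"^#!.*(jython|python)")


def accepts(first_line: Optional[str]) -> bool:
    """Return True if the script's first line selects the Jython engine."""
    if first_line is None:
        # An empty file has no first line. Without this guard the regex
        # call would fail (TypeError here, NullPointerException in Java).
        return False
    return _JYTHON_FIRST_LINE.search(first_line) is not None
```

An empty script then falls through to the default handling instead of failing during script-type detection.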
[jira] [Work started] (PIG-2006) Regression: NPE when Pig processes an empty script file
[ https://issues.apache.org/jira/browse/PIG-2006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on PIG-2006 started by Xuefu Zhang.
[jira] [Commented] (PIG-2014) SAMPLE shouldn't be pushed up
[ https://issues.apache.org/jira/browse/PIG-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025345#comment-13025345 ]

Dmitriy V. Ryaboy commented on PIG-2014:
-----------------------------------------

Proposal for a general-case fix: add a @Nondeterministic annotation for UDFs, and have PushUpFilter check whether the UDF is deterministic when considering whether pushing up is OK. Annotate the filter UDF that SAMPLE is rewritten to accordingly.

> SAMPLE shouldn't be pushed up
> -----------------------------
>
> Key: PIG-2014
> URL: https://issues.apache.org/jira/browse/PIG-2014
> Project: Pig
> Issue Type: Bug
> Reporter: Jacob Perkins
>
> Consider the following code:
> {code:none}
> tfidf_all = LOAD '$TFIDF' AS (doc_id:chararray, token:chararray,
> weight:double);
> grouped = GROUP tfidf_all BY doc_id;
> vectors = FOREACH grouped GENERATE group AS doc_id, tfidf_all.(token,
> weight) AS vector;
> DUMP vectors;
> {code}
> This, of course, runs just fine. In a real example, tfidf_all contains
> 1,428,280 records. The reduce output records should be exactly the number of
> documents, which turns out to be 18,863 in this case. All well and good.
> The strangeness comes when you add a SAMPLE command:
> {code:none}
> sampled = SAMPLE vectors 0.0012;
> DUMP sampled;
> {code}
> Running this results in 1,513 reduce output records. The reduce output
> should be much closer to 22 or 23 records (i.e. 0.0012*18863).
> Evidently, Pig rewrites SAMPLE into a filter, and then pushes that filter in
> front of the group. It shouldn't push that filter
> since the UDF is non-deterministic.
> Quick fix: If you add "-t PushUpFilter" to your command line when invoking
> pig this won't happen.
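The proposal can be sketched in Python, with a decorator standing in for the proposed @Nondeterministic annotation. All names below are hypothetical; this is not Pig's optimizer code. The rule is that a filter may be moved earlier in the plan only if none of the UDFs it uses is marked non-deterministic.

```python
import random


def nondeterministic(func):
    """Mark a UDF as non-deterministic (stand-in for @Nondeterministic)."""
    func.nondeterministic = True
    return func


@nondeterministic
def sample_filter(record, rate=0.0012):
    # Stand-in for the filter UDF that SAMPLE is rewritten to.
    return random.random() < rate


def is_even(record):
    # An ordinary deterministic filter function.
    return record % 2 == 0


def can_push_up(filter_udfs):
    """Allow push-up only if every UDF in the filter is deterministic."""
    return not any(getattr(udf, "nondeterministic", False) for udf in filter_udfs)
```

Unannotated UDFs are treated as deterministic here, which keeps existing filters eligible for the optimization; whether that default is right for Pig is a design question the comment leaves open.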
[jira] [Commented] (PIG-1622) DEFINE streaming options are ill defined and not properly documented
[ https://issues.apache.org/jira/browse/PIG-1622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025332#comment-13025332 ]

Xuefu Zhang commented on PIG-1622:
----------------------------------

Patch PIG-1622-2.patch is checked in for both trunk and 0.9.0.

With this change, a command no longer allows multiple occurrences of the same option. This is a backward-incompatible change.

> DEFINE streaming options are ill defined and not properly documented
> ---------------------------------------------------------------------
>
> Key: PIG-1622
> URL: https://issues.apache.org/jira/browse/PIG-1622
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.7.0
> Reporter: Alan Gates
> Assignee: Corinne Chandel
> Priority: Minor
> Fix For: 0.9.0
>
> Attachments: PIG-1622-1.patch, PIG-1622-2.patch, PIG-1622.patch
>
>
> According to the documentation
> (http://hadoop.apache.org/pig/docs/r0.7.0/piglatin_ref2.html#DEFINE) the
> syntax for DEFINE when used to define a streaming command is:
> DEFINE cmd INPUT(stdin|path) OUTPUT(stdout|stderr|path) SHIP(path [, path,
> ...]) CACHE (path [, path, ...])
> However, the actual parser accepts something pretty different. Consider the
> following script:
> {code}
> define strm `wc -l` INPUT(stdin)
> CACHE('/Users/gates/.vimrc#myvim')
> OUTPUT(stdin)
> INPUT('/tmp/fred')
> OUTPUT('/tmp/bob')
> SHIP('/Users/gates/.bashrc')
> SHIP('/Users/gates/.vimrc')
> CACHE('/Users/gates/.bashrc#mybash')
> stderr('/tmp/errors' limit 10);
> A = load '/Users/gates/test/data/studenttab10';
> B = stream A through strm;
> dump B;
> {code}
> The above actually parses. I see several issues here:
> # What do multiple INPUT and OUTPUT statements mean in the context of
> streaming? These should not be allowed.
> # The documentation implies an order (INPUT, OUTPUT, SHIP, CACHE) that is not
> enforced by the parser. We should either enforce the order in the parser or
> update the documentation. Most likely the latter to avoid breaking existing
> scripts.
> # Why are multiple SHIP and CACHE clauses allowed when each can take multiple
> paths? It seems we should only allow one of each.
> # The error clause is completely different than what is given in the
> documentation. I suspect this is a documentation error and the grammar
> supported by the parser here is what we want.
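The rule described in the comment, that each option may appear at most once in a DEFINE streaming command, can be sketched as a small Python check. This is not the parser change in PIG-1622-2.patch; `check_unique_options` is a hypothetical helper that receives the option names in script order.

```python
from collections import Counter


def check_unique_options(options):
    """Raise ValueError if any streaming option name occurs more than once.

    Names are compared case-insensitively, which is an assumption of
    this sketch.
    """
    counts = Counter(name.upper() for name in options)
    repeated = sorted(name for name, n in counts.items() if n > 1)
    if repeated:
        raise ValueError("option specified more than once: " + ", ".join(repeated))
```

Applied to the script quoted above, the option sequence INPUT, CACHE, OUTPUT, INPUT, OUTPUT, SHIP, SHIP, CACHE, stderr would be rejected, which is why the change is backward incompatible for scripts that relied on the old behaviour.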
[jira] [Updated] (PIG-1622) DEFINE streaming options are ill defined and not properly documented
[ https://issues.apache.org/jira/browse/PIG-1622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xuefu Zhang updated PIG-1622:
-----------------------------

Attachment: PIG-1622-2.patch

Update the patch with minor fix for a test case.
[jira] [Created] (PIG-2014) SAMPLE shouldn't be pushed up
SAMPLE shouldn't be pushed up
-----------------------------

Key: PIG-2014
URL: https://issues.apache.org/jira/browse/PIG-2014
Project: Pig
Issue Type: Bug
Reporter: Jacob Perkins

Consider the following code:

{code:none}
tfidf_all = LOAD '$TFIDF' AS (doc_id:chararray, token:chararray, weight:double);
grouped = GROUP tfidf_all BY doc_id;
vectors = FOREACH grouped GENERATE group AS doc_id, tfidf_all.(token, weight) AS vector;
DUMP vectors;
{code}

This, of course, runs just fine. In a real example, tfidf_all contains 1,428,280 records. The reduce output records should be exactly the number of documents, which turns out to be 18,863 in this case. All well and good.

The strangeness comes when you add a SAMPLE command:

{code:none}
sampled = SAMPLE vectors 0.0012;
DUMP sampled;
{code}

Running this results in 1,513 reduce output records. The reduce output should be much closer to 22 or 23 records (i.e. 0.0012*18863).

Evidently, Pig rewrites SAMPLE into a filter, and then pushes that filter in front of the group. It shouldn't push that filter since the UDF is non-deterministic.

Quick fix: If you add "-t PushUpFilter" to your command line when invoking pig this won't happen.
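The figures in the report are consistent with that explanation, as a quick check of the arithmetic in Python shows. The variable names are made up for this sketch, and the interpretation in the comments is an inference from the reported numbers, not something measured here.

```python
total_records = 1_428_280   # rows in tfidf_all
documents = 18_863          # groups produced by GROUP ... BY doc_id
rate = 0.0012               # SAMPLE fraction

# SAMPLE applied after the GROUP, as written in the script:
# about 22.6 groups are expected to survive.
expected_after_group = rate * documents

# SAMPLE pushed in front of the GROUP: it now samples raw rows
# (about 1,714 expected), and each surviving row contributes at most
# one group to the reduce output.
expected_rows_before_group = rate * total_records
```

The observed 1,513 reduce output records sit just below the roughly 1,714 sampled rows expected when sampling happens before grouping, which fits several sampled rows sharing a doc_id, and is far from the 22 or 23 expected when sampling happens after it.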