[jira] [Commented] (PIG-1998) Allow macro to return void

2011-04-26 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025600#comment-13025600
 ] 

Xuefu Zhang commented on PIG-1998:
--

For patch PIG-1998_2.patch:

1. In the following grammar, the second alternative should rewrite to just 
^(RETURN_VAL). With that, "returns void" is equivalent to returning zero 
aliases, so the code doesn't need this kind of check: rets.size() == 1 && 
rets.get(0).equals("void")

+macro_return_clause 
+: RETURNS alias (COMMA alias)*
+-> ^(RETURN_VAL alias+)
+| RETURNS VOID 
+-> ^(RETURN_VAL VOID)

2. The bigger concern is the newly added method validate(). I don't think the 
StreamingTokenizer will meet our needs. For instance, it isn't able to 
recognize Pig single-line comments such as:

 -- this is a single line comment.

Even if this isn't a problem, the maintenance overhead could turn into a 
nightmare for us in the long run. I don't necessarily have a better idea, but I 
think we should at least give this more thought.
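
To make the concern in point 2 concrete, here is a minimal sketch (in Python, 
not taken from the patch) of what any hand-rolled scan has to get right: "--" 
starts a comment only when it appears outside a quoted string. The function 
name strip_line_comment and the backslash-escape handling are assumptions made 
for illustration.

```python
def strip_line_comment(line: str) -> str:
    """Return `line` with a trailing Pig `--` comment removed.

    A `--` inside a single-quoted string literal is left alone, and a
    backslash inside the literal escapes the next character (assumed).
    """
    in_quote = False
    i = 0
    while i < len(line):
        ch = line[i]
        if in_quote:
            if ch == "\\":
                i += 2              # skip the escaped character
                continue
            if ch == "'":
                in_quote = False    # end of the string literal
        elif ch == "'":
            in_quote = True         # start of a string literal
        elif line.startswith("--", i):
            return line[:i].rstrip()
        i += 1
    return line.rstrip()
```

Every lexical rule like this would have to be kept in sync with the real Pig 
grammar by hand, which is the maintenance overhead mentioned above.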


> Allow macro to return void
> --
>
> Key: PIG-1998
> URL: https://issues.apache.org/jira/browse/PIG-1998
> Project: Pig
> Issue Type: Improvement
> Components: impl
> Affects Versions: 0.9.0
> Reporter: Richard Ding
> Assignee: Richard Ding
> Fix For: 0.9.0
>
> Attachments: PIG-1998_1.patch, PIG-1998_2.patch
>
>
> A Pig macro is allowed to have no output alias, but this isn't clear from 
> the macro definition or the macro invocation (macro inline). We propose to 
> make it explicit:
> 1. If a macro doesn't output any alias, it must specify void as its return 
> value. For example:
> {code}
> define mymacro(...) returns void {
>     ... ...
> };
> {code}
> 2. If a macro doesn't output any alias, it must be invoked without a return 
> value. For example, to invoke the macro above, just specify:
> {code}
> mymacro(...);
> {code}
> 3. Any non-void return alias in the macro definition must exist in the macro 
> body and be prefixed with $. For example:
> {code}
> define mymacro(...) returns B {
>     ... ...
>     $B = filter ...;
> };
> {code}
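
Rule 3 of the quoted description can be sketched as a simple check. This is a 
rough Python illustration, not the validation code from either patch; the 
function name missing_return_aliases is hypothetical, and a void macro is 
modelled as an empty list of return aliases, as suggested in the review.

```python
import re


def missing_return_aliases(returns, body):
    """Return the declared return aliases that never appear as $alias in body.

    `returns` is the list of aliases after RETURNS (empty for a void macro).
    """
    missing = []
    for alias in returns:
        # \b keeps $B from matching inside a longer name such as $BC
        if not re.search(r"\$" + re.escape(alias) + r"\b", body):
            missing.append(alias)
    return missing
```

An empty result means every declared alias is present in the macro body; a 
void macro trivially passes.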

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (PIG-2017) consumeMap() fails with EmptyStackException

2011-04-26 Thread Jacob Perkins (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacob Perkins updated PIG-2017:
---

Attachment: utf8storagepatch.txt

Uses a try...catch block to catch the EmptyStackException and throws an 
IOException about a 'malformed map'.
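
For readers without the patch at hand, the idea looks roughly like this. It is 
a simplified Python analogue, not the Utf8StorageConverter code: it assumes, 
for illustration only, that just "(" and "{" are pushed on a stack while 
scanning the map text. MalformedMapError and check_map_value are hypothetical 
names standing in for the IOException and the parsing method.

```python
class MalformedMapError(ValueError):
    """Stands in for the IOException about a 'malformed map'."""


PAIRS = {")": "(", "}": "{"}


def check_map_value(text: str) -> None:
    """Raise MalformedMapError if ( ) and { } are not balanced in text."""
    stack = []
    try:
        for ch in text:
            if ch in PAIRS.values():
                stack.append(ch)
            elif ch in PAIRS:
                # pop() on an empty list raises IndexError, the Python
                # counterpart of java.util.EmptyStackException.
                if stack.pop() != PAIRS[ch]:
                    raise MalformedMapError("malformed map: " + text)
    except IndexError:
        raise MalformedMapError("malformed map: " + text) from None
    if stack:
        raise MalformedMapError("malformed map: " + text)
```

The caller then sees one well-defined error for a bad record instead of an 
unchecked exception from the stack.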

> consumeMap() fails with EmptyStackException
> ---
>
> Key: PIG-2017
> URL: https://issues.apache.org/jira/browse/PIG-2017
> Project: Pig
> Issue Type: Bug
> Reporter: Jacob Perkins
> Attachments: utf8storagepatch.txt
>
>
> If a map is read in its serialized form, e.g. [key#value], then the 
> consumeMap() method of Utf8StorageConverter fails for the following maps:
> {code:none}
> [a#)]
> [a#}]
> [a#"take a look at my lovely curly brace, }"]
> [a#'oh look, a closed parenthesis! )']
> {code}
> There are a couple of options:
> 1. Define an escape sequence (e.g. quotes or a backslash).
> 2. Call it a bad record, go get a beer, and move on.



[jira] [Updated] (PIG-2017) consumeMap() fails with EmptyStackException

2011-04-26 Thread Jacob Perkins (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacob Perkins updated PIG-2017:
---

Status: Open  (was: Patch Available)




[jira] [Updated] (PIG-2017) consumeMap() fails with EmptyStackException

2011-04-26 Thread Jacob Perkins (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacob Perkins updated PIG-2017:
---

Status: Patch Available  (was: Open)




[jira] [Created] (PIG-2017) consumeMap() fails with EmptyStackException

2011-04-26 Thread Jacob Perkins (JIRA)
consumeMap() fails with EmptyStackException
---

 Key: PIG-2017
 URL: https://issues.apache.org/jira/browse/PIG-2017
 Project: Pig
  Issue Type: Bug
Reporter: Jacob Perkins


If a map is read in its serialized form, e.g. [key#value], then the consumeMap() 
method of Utf8StorageConverter fails for the following maps:

{code:none}
[a#)]
[a#}]
[a#"take a look at my lovely curly brace, }"]
[a#'oh look, a closed parenthesis! )']
{code}

There are a couple of options:

1. Define an escape sequence (e.g. quotes or a backslash).
2. Call it a bad record, go get a beer, and move on.
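
Option 1 could look something like the sketch below. It is written in Python 
for brevity and is not Pig's current behaviour: the escape rules (single or 
double quotes around a value, backslash escaping the next character) and the 
name structural_chars are assumptions made for illustration.

```python
def structural_chars(text: str) -> str:
    """Return only the characters of text that are outside quotes.

    Assumed rules: a value may be wrapped in single or double quotes, and
    a backslash escapes the next character inside or outside quotes.
    """
    out = []
    quote = None                    # the quote character we are inside, if any
    chars = iter(text)
    for ch in chars:
        if ch == "\\":
            next(chars, None)       # drop the escaped character as well
        elif quote:
            if ch == quote:
                quote = None        # closing quote
        elif ch in "'\"":
            quote = ch              # opening quote
        else:
            out.append(ch)
    return "".join(out)
```

With rules like these, the last two maps in the list reduce to "[a#]" as far 
as bracket matching is concerned, so the quoted ")" and "}" no longer reach 
the stack. The first two would still need a backslash or quotes to be read as 
data.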



[jira] [Updated] (PIG-1827) When passing a parameter to Pig, if the value contains $ it has to be escaped for no apparent reason

2011-04-26 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1827:
--

Attachment: PIG-1827_2.patch

The new patch adds a test case that uses $ as part of a parameter value:

{code}
separator = '$'
P = Pig.compile(\"\"\"a = load 'input' using PigStorage('$separator');store a 
into 'output';\"\"\")
Q = P.bind()
{code}

On the other hand, Pig Latin doesn't support '\$' in its variables or string 
literals. 
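
As a rough illustration of the behaviour the test case expects, the sketch 
below substitutes $name parameters so that a bound value is inserted 
literally, even when it contains $ itself. It is a Python sketch, not the 
patch; the names PARAM and substitute are hypothetical, and nothing here is 
meant to describe the root cause in Pig.

```python
import re

# $name where name starts with a letter or underscore; $0, $1 are left alone
PARAM = re.compile(r"\$([A-Za-z_]\w*)")


def substitute(script: str, params: dict) -> str:
    """Replace each $name in script with params[name], inserted literally."""
    def repl(match):
        # A callable replacement is used as-is, so "$" or "\" in the
        # value are never reinterpreted by the regex engine.
        return params.get(match.group(1), match.group(0))
    return PARAM.sub(repl, script)
```

Unbound names are left untouched, so positional references such as $0 in the 
script survive substitution.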

> When passing a parameter to Pig, if the value contains $ it has to be escaped 
> for no apparent reason
> 
>
> Key: PIG-1827
> URL: https://issues.apache.org/jira/browse/PIG-1827
> Project: Pig
> Issue Type: Bug
> Components: impl
> Affects Versions: 0.8.0
> Reporter: Julien Le Dem
> Assignee: Richard Ding
> Fix For: 0.9.0
>
> Attachments: PIG-1827-1.patch, PIG-1827_2.patch
>
>




[jira] [Updated] (PIG-1998) Allow macro to return void

2011-04-26 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1998:
--

Attachment: PIG-1998_2.patch

Attaching a new patch that addresses Xuefu's review comments.




[jira] [Updated] (PIG-1989) complex type casting should return null on casting failure

2011-04-26 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-1989:


Attachment: PIG-1989-1.patch

This happens when the size of the tuple's inner schema does not match the 
data. Attaching PIG-1989-1.patch.
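
The intended behaviour can be sketched as follows. This is a Python 
illustration, not the patch: cast_tuple is a hypothetical name, the schema is 
modelled as a list of per-field cast functions, and returning None for the 
whole tuple when a single field fails to cast is an assumption.

```python
def cast_tuple(value, schema):
    """Cast value to a tuple matching schema, or return None on failure.

    `schema` is a list of callables, one per field, e.g. [int, float].
    """
    # Size mismatch between data and inner schema: the cast fails.
    if not isinstance(value, tuple) or len(value) != len(schema):
        return None
    result = []
    for field, caster in zip(value, schema):
        try:
            result.append(caster(field))
        except (TypeError, ValueError):
            return None             # assumed: one bad field fails the cast
    return tuple(result)
```

The point is only that a failed cast yields null rather than handing back the 
original, un-cast object.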

> complex type casting should return null on casting failure 
> ---
>
> Key: PIG-1989
> URL: https://issues.apache.org/jira/browse/PIG-1989
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.9.0
> Reporter: Thejas M Nair
> Assignee: Daniel Dai
> Fix For: 0.9.0
>
> Attachments: PIG-1989-1.patch
>
>
> When casting fails for complex objects, Pig currently returns the un-cast 
> object. 
> It should return null instead, which is consistent with the behavior when 
> casting to basic types. 



[jira] [Commented] (PIG-1998) Allow macro to return void

2011-04-26 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025433#comment-13025433
 ] 

Xuefu Zhang commented on PIG-1998:
--

For patch PIG-1998_1.patch,

1. I don't see the VOID keyword defined or used in any of the grammar rules.

2. The grammar rules for inline_return_clause overlap.

3. Will re-review once the patch is updated. 

> Allow macro to return void
> --
>
> Key: PIG-1998
> URL: https://issues.apache.org/jira/browse/PIG-1998
> Project: Pig
> Issue Type: Improvement
> Components: impl
> Affects Versions: 0.9.0
> Reporter: Richard Ding
> Assignee: Richard Ding
> Fix For: 0.9.0
>
> Attachments: PIG-1998_1.patch
>
>



[jira] [Created] (PIG-2016) -dot option does not work with explain and new logical plan

2011-04-26 Thread Alan Gates (JIRA)
-dot option does not work with explain and new logical plan
---

 Key: PIG-2016
 URL: https://issues.apache.org/jira/browse/PIG-2016
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.8.0, 0.8.1, 0.9.0
Reporter: Alan Gates
Priority: Minor
 Fix For: 0.9.0


If you specify -dot in explain, it is supposed to produce a file with the 
graphs in .dot format.  While the physical plan and map reduce plan are 
correctly output in .dot format, the new logical plan is still output in text 
format.



[jira] [Created] (PIG-2015) Explain writes out logical plan twice

2011-04-26 Thread Alan Gates (JIRA)
Explain writes out logical plan twice
-

 Key: PIG-2015
 URL: https://issues.apache.org/jira/browse/PIG-2015
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.9.0
Reporter: Alan Gates
Priority: Minor
 Fix For: 0.9.0


Running explain on a script writes out the logical plan twice, the physical 
plan once, and the map reduce plan once.



[jira] [Resolved] (PIG-2004) Incorrect input types passed on to eval function

2011-04-26 Thread Thejas M Nair (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair resolved PIG-2004.


Resolution: Fixed

Patch committed to trunk and 0.9 branch.


> Incorrect input types passed on to eval function
> 
>
> Key: PIG-2004
> URL: https://issues.apache.org/jira/browse/PIG-2004
> Project: Pig
> Issue Type: Bug
> Components: impl
> Affects Versions: 0.9.0
> Reporter: Vivek Padmanabhan
> Assignee: Thejas M Nair
> Fix For: 0.9.0
>
> Attachments: PIG-2004-0.patch, PIG-2004.1.patch
>
>
> The script below fails with a ClassCastException thrown from the MAX UDF. The 
> UDF expects the values in the supplied bag to be DataByteArray, but at run 
> time it receives the actual type, i.e. Double in this case. This causes 
> script execution to fail with the exception:
> | Caused by: java.lang.ClassCastException: java.lang.Double cannot be cast to 
> org.apache.pig.data.DataByteArray
> The same script runs properly with Pig 0.8.
> {code}
> A = LOAD 'myinput' as (f1,f2,f3);
> B = foreach A generate f1,f2+f3/1000.0 as doub;
> C = group B by f1;
> D = foreach C generate (long)(MAX(B.doub)) as f4;
> dump D;
> {code}
> myinput
> ---
> a   100012345
> b   200023456
> c   300034567
> a   150054321
> b   250065432



[jira] [Commented] (PIG-2004) Incorrect input types passed on to eval function

2011-04-26 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025389#comment-13025389
 ] 

Daniel Dai commented on PIG-2004:
-

+1




[jira] [Resolved] (PIG-2006) Regression: NPE when Pig processes an empty script file

2011-04-26 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang resolved PIG-2006.
--

Resolution: Fixed

> Regression: NPE when Pig processes an empty script file
> ---
>
> Key: PIG-2006
> URL: https://issues.apache.org/jira/browse/PIG-2006
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.9.0
> Reporter: Xuefu Zhang
> Assignee: Xuefu Zhang
> Fix For: 0.9.0
>
> Attachments: PIG-2006-1.patch, PIG-2006.patch
>
>
> If a Pig script file is empty and supplied as input to Pig (using the -f 
> option), an NPE is thrown. Stack trace:
> java.lang.NullPointerException
>     at java.util.regex.Matcher.getTextLength(Matcher.java:1140)
>     at java.util.regex.Matcher.reset(Matcher.java:291)
>     at java.util.regex.Matcher.<init>(Matcher.java:211)
>     at java.util.regex.Pattern.matcher(Pattern.java:888)
>     at org.apache.pig.scripting.ScriptEngine$SupportedScriptLang.accepts(ScriptEngine.java:89)
>     at org.apache.pig.scripting.ScriptEngine.getSupportedScriptLang(ScriptEngine.java:163)
>     at org.apache.pig.Main.determineScriptType(Main.java:892)
>     at org.apache.pig.Main.run(Main.java:378)
>     at org.apache.pig.Main.main(Main.java:108)
> This seems related to Jython support in 0.9.
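
The stack trace shows a null reaching Pattern.matcher() from accepts(), which 
is what one would expect if the first line of an empty file is read as null. 
A guard of the following shape avoids that. This is a Python sketch of the 
idea, not the committed fix; the SHEBANG pattern and the function name accepts 
are illustrative assumptions.

```python
import re

# Hypothetical pattern: a "#!" first line that names python or jython
SHEBANG = re.compile(r"^#!.*\b(jython|python)\b")


def accepts(first_line):
    """Return True if first_line marks the script as a Python/Jython script."""
    if first_line is None:          # empty file: there is no first line
        return False
    return SHEBANG.search(first_line) is not None
```

An empty script then simply falls through to being treated as a plain Pig 
script.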



[jira] [Work started] (PIG-2006) Regression: NPE when Pig processes an empty script file

2011-04-26 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on PIG-2006 started by Xuefu Zhang.




[jira] [Commented] (PIG-2014) SAMPLE shouldn't be pushed up

2011-04-26 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025345#comment-13025345
 ] 

Dmitriy V. Ryaboy commented on PIG-2014:


Proposal for a general-case fix:

Add a @Nondeterministic annotation for UDFs, and have PushUpFilter check 
whether a UDF is deterministic when deciding whether pushing up is OK. 
Annotate the filter UDF that SAMPLE is rewritten to accordingly.
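
A minimal sketch of the proposal, using a Python decorator in place of a Java 
annotation. All names here (nondeterministic, sample_filter, is_positive, 
can_push_up) are hypothetical and only illustrate the shape of the check.

```python
import random


def nondeterministic(func):
    """Mark a UDF whose result may differ between calls on the same input."""
    func.nondeterministic = True
    return func


@nondeterministic
def sample_filter(record, fraction=0.0012):
    # Stand-in for the filter UDF that SAMPLE is rewritten to
    return random.random() < fraction


def is_positive(record):
    # An ordinary deterministic filter UDF
    return record > 0


def can_push_up(filter_udfs):
    """A filter may be pushed up only if every UDF in it is deterministic."""
    return not any(getattr(f, "nondeterministic", False) for f in filter_udfs)
```

The optimizer rule would consult can_push_up() before moving a filter in 
front of a GROUP; unmarked UDFs are treated as deterministic by default.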

> SAMPLE shouldn't be pushed up
> -
>
> Key: PIG-2014
> URL: https://issues.apache.org/jira/browse/PIG-2014
> Project: Pig
> Issue Type: Bug
> Reporter: Jacob Perkins
>
> Consider the following code:
> {code:none}
> tfidf_all = LOAD '$TFIDF' AS (doc_id:chararray, token:chararray, 
> weight:double);
> grouped   = GROUP tfidf_all BY doc_id;
> vectors   = FOREACH grouped GENERATE group AS doc_id, tfidf_all.(token, 
> weight) AS vector;
> DUMP vectors;
> {code}
> This, of course, runs just fine. In a real example, tfidf_all contains 
> 1,428,280 records. The number of reduce output records should be exactly the 
> number of documents, which turns out to be 18,863 in this case. All well and good.
> The strangeness comes when you add a SAMPLE command:
> {code:none}
> sampled = SAMPLE vectors 0.0012;
> DUMP sampled;
> {code}
> Running this results in 1,513 reduce output records. The number of reduce 
> output records should be much closer to 22 or 23 (i.e. 0.0012*18863).
> Evidently, Pig rewrites SAMPLE into a filter and then pushes that filter in 
> front of the group. It shouldn't push that filter up, since the UDF is 
> non-deterministic.
> Quick fix: If you add "-t PushUpFilter" to your command line when invoking 
> Pig, this won't happen.



[jira] [Commented] (PIG-1622) DEFINE streaming options are ill defined and not properly documented

2011-04-26 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13025332#comment-13025332
 ] 

Xuefu Zhang commented on PIG-1622:
--

Patch PIG-1622-2.patch is checked in for both trunk and 0.9.0. With this 
change, a command no longer allows multiple occurrences of the same option. 
This is a backward-incompatible change.
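
The new restriction amounts to a check of roughly this shape. It is a Python 
sketch, not the parser change itself; check_unique_options and the error 
message are made up for illustration.

```python
from collections import Counter


def check_unique_options(clauses):
    """Raise ValueError if any streaming option name occurs more than once.

    `clauses` is the sequence of option names in a DEFINE, in source order.
    """
    counts = Counter(name.upper() for name in clauses)
    dupes = sorted(name for name, n in counts.items() if n > 1)
    if dupes:
        raise ValueError("duplicate streaming option(s): " + ", ".join(dupes))
```

Applied to the script quoted below, the clause names INPUT, CACHE, OUTPUT, 
INPUT, OUTPUT, SHIP, SHIP, CACHE, stderr would now be rejected, which is why 
existing scripts that repeat an option will break.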

> DEFINE streaming options are ill defined and not properly documented
> 
>
> Key: PIG-1622
> URL: https://issues.apache.org/jira/browse/PIG-1622
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.7.0
> Reporter: Alan Gates
> Assignee: Corinne Chandel
> Priority: Minor
> Fix For: 0.9.0
>
> Attachments: PIG-1622-1.patch, PIG-1622-2.patch, PIG-1622.patch
>
>
> According to the documentation 
> (http://hadoop.apache.org/pig/docs/r0.7.0/piglatin_ref2.html#DEFINE) the 
> syntax for DEFINE when used to define a streaming command is:
> DEFINE cmd INPUT(stdin|path) OUTPUT(stdout|stderr|path) SHIP(path [, path, 
> ...]) CACHE (path [, path, ...])
> However, the actual parser accepts something pretty different.  Consider the 
> following script:
> {code}
> define strm `wc -l` INPUT(stdin) 
> CACHE('/Users/gates/.vimrc#myvim') 
> OUTPUT(stdin)
> INPUT('/tmp/fred') 
> OUTPUT('/tmp/bob')
> SHIP('/Users/gates/.bashrc') 
> SHIP('/Users/gates/.vimrc') 
> CACHE('/Users/gates/.bashrc#mybash')
> stderr('/tmp/errors' limit 10);
> A = load '/Users/gates/test/data/studenttab10';
> B = stream A through strm;
> dump B;
> {code}
> The above actually parses.  I see several issues here:
> # What do multiple INPUT and OUTPUT statements mean in the context of 
> streaming?  These should not be allowed.
> # The documentation implies an order (INPUT, OUTPUT, SHIP, CACHE) that is not 
> enforced by the parser.  We should either enforce the order in the parser or 
> update the documentation.  Most likely the latter to avoid breaking existing 
> scripts.
> # Why are multiple SHIP and CACHE clauses allowed when each can take multiple 
> paths?  It seems we should only allow one of each.
> # The error clause is completely different from what is given in the 
> documentation.  I suspect this is a documentation error and the grammar 
> supported by the parser here is what we want.



[jira] [Updated] (PIG-1622) DEFINE streaming options are ill defined and not properly documented

2011-04-26 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated PIG-1622:
-

Attachment: PIG-1622-2.patch

Updated the patch with a minor fix for a test case.




[jira] [Created] (PIG-2014) SAMPLE shouldn't be pushed up

2011-04-26 Thread Jacob Perkins (JIRA)
SAMPLE shouldn't be pushed up
-

 Key: PIG-2014
 URL: https://issues.apache.org/jira/browse/PIG-2014
 Project: Pig
  Issue Type: Bug
Reporter: Jacob Perkins


Consider the following code:

{code:none}
tfidf_all = LOAD '$TFIDF' AS (doc_id:chararray, token:chararray, weight:double);
grouped   = GROUP tfidf_all BY doc_id;
vectors   = FOREACH grouped GENERATE group AS doc_id, tfidf_all.(token, weight) 
AS vector;
DUMP vectors;
{code}

This, of course, runs just fine. In a real example, tfidf_all contains 
1,428,280 records. The number of reduce output records should be exactly the 
number of documents, which turns out to be 18,863 in this case. All well and good.

The strangeness comes when you add a SAMPLE command:

{code:none}
sampled = SAMPLE vectors 0.0012;
DUMP sampled;
{code}

Running this results in 1,513 reduce output records. The number of reduce 
output records should be much closer to 22 or 23 (i.e. 0.0012*18863).

Evidently, Pig rewrites SAMPLE into a filter and then pushes that filter in 
front of the group. It shouldn't push that filter up, since the UDF is 
non-deterministic.

Quick fix: If you add "-t PushUpFilter" to your command line when invoking Pig, 
this won't happen.
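
The arithmetic behind the two numbers, as a small Python sketch (the function 
and constant names are just for illustration). Sampling after the GROUP should 
keep about 22.6 records; sampling the ungrouped input keeps about 1,714 rows, 
which then group into at most that many documents, so the observed 1,513 is 
consistent with the filter running before the GROUP.

```python
def expected_output_records(fraction: float, input_records: int) -> float:
    """Expected number of records a SAMPLE keeps from input_records."""
    return fraction * input_records


FRACTION = 0.0012
DOCS = 18_863          # records after the GROUP (one per document)
ROWS = 1_428_280       # records before the GROUP

after_group = expected_output_records(FRACTION, DOCS)     # about 22.6
before_group = expected_output_records(FRACTION, ROWS)    # about 1,714
```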
