[jira] Commented: (PIG-1648) Split combination may return too many block locations to map/reduce framework

2010-09-28 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12915815#action_12915815
 ] 

Yan Zhou commented on PIG-1648:
---

The top 5 locations with the most data will be used. This has been agreed upon 
with the M/R developers.
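
For illustration only, a minimal sketch of the idea (not the committed PIG-1648 
patch; class and method names here are hypothetical): accumulate the bytes each 
host holds across the component splits and keep only the five hosts with the 
most data as the composite split's locations.

{code}
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical helper: pick the top-5 hosts by accumulated data size.
public final class TopHostsSketch {

    // bytesPerHost maps host name -> bytes of this composite split held on that host.
    public static String[] topFiveHosts(Map<String, Long> bytesPerHost) {
        List<Map.Entry<String, Long>> entries =
                new ArrayList<Map.Entry<String, Long>>(bytesPerHost.entrySet());
        // Sort hosts by descending byte count.
        Collections.sort(entries, new Comparator<Map.Entry<String, Long>>() {
            public int compare(Map.Entry<String, Long> a, Map.Entry<String, Long> b) {
                return b.getValue().compareTo(a.getValue());
            }
        });
        int n = Math.min(5, entries.size());
        String[] hosts = new String[n];
        for (int i = 0; i < n; i++) {
            hosts[i] = entries.get(i).getKey();
        }
        return hosts;
    }

    // Accumulate one component split's contribution: each of its hosts is
    // credited with that split's length.
    public static void accumulate(Map<String, Long> bytesPerHost,
                                  String[] splitHosts, long splitLength) {
        for (String h : splitHosts) {
            Long current = bytesPerHost.get(h);
            bytesPerHost.put(h, (current == null ? 0L : current) + splitLength);
        }
    }
}
{code}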

 Split combination may return too many block locations to map/reduce framework
 -

 Key: PIG-1648
 URL: https://issues.apache.org/jira/browse/PIG-1648
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Yan Zhou
Assignee: Yan Zhou
 Fix For: 0.8.0


 For instance, if one small split has block locations h1, h2 and h3, and another 
 small split has h1, h3 and h4, then after combination the composite split 
 contains 4 block locations. If the number of component splits is large, the 
 number of block locations can be large too. Since the block locations serve as 
 a hint to M/R about the best hosts to run this composite split on, the list 
 should be a short list, say 5, of the hosts that hold the most data in this 
 composite split.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1649) FRJoin fails to compute number of input files for replicated input

2010-09-28 Thread Thejas M Nair (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair updated PIG-1649:
---

Attachment: PIG-1649.1.patch

Patch passes unit tests and test-patch.


 FRJoin fails to compute number of input files for replicated input
 --

 Key: PIG-1649
 URL: https://issues.apache.org/jira/browse/PIG-1649
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0

 Attachments: PIG-1649.1.patch


 In FRJoin, if the input path has curly braces, it fails to compute the number 
 of input files and logs the following exception -
 10/09/27 14:31:13 WARN mapReduceLayer.MRCompiler: failed to get number of 
 input files
 java.net.URISyntaxException: Illegal character in path at index 12: 
 /user/tejas/{std*txt}
 at java.net.URI$Parser.fail(URI.java:2809)
 at java.net.URI$Parser.checkChars(URI.java:2982)
 at java.net.URI$Parser.parseHierarchical(URI.java:3066)
 at java.net.URI$Parser.parse(URI.java:3024)
 at java.net.URI.init(URI.java:578)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.hasTooManyInputFiles(MRCompiler.java:1283)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.visitFRJoin(MRCompiler.java:1203)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFRJoin.visit(POFRJoin.java:188)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.compile(MRCompiler.java:475)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.compile(MRCompiler.java:454)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.compile(MRCompiler.java:336)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.compile(MapReduceLauncher.java:468)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:116)
 at 
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:301)
 at 
 org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1197)
 at org.apache.pig.PigServer.storeEx(PigServer.java:873)
 at org.apache.pig.PigServer.store(PigServer.java:815)
 at org.apache.pig.PigServer.openIterator(PigServer.java:727)
 at 
 org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:612)
 at 
 org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:301)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:141)
 at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:76)
 at org.apache.pig.Main.run(Main.java:453)
 at org.apache.pig.Main.main(Main.java:107)
 This does not cause the query to fail, but since the number of input files 
 does not get calculated, the optimizations added in PIG-1458 to reduce load on 
 the name node will not be used.
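
 To make the failure mode concrete, here is a small standalone sketch (it only 
 assumes hadoop-core on the classpath and is not part of any patch): the 
 single-argument java.net.URI constructor rejects the raw '{' in a glob path, 
 which is exactly the URISyntaxException above, whereas Hadoop's Path quotes 
 such characters when it builds its internal URI.

{code}
import java.net.URI;
import java.net.URISyntaxException;

import org.apache.hadoop.fs.Path;

// Demonstrates why building a URI directly from a glob path fails.
public final class GlobPathUriDemo {
    public static void main(String[] args) {
        String glob = "/user/tejas/{std*txt}";
        try {
            new URI(glob);  // '{' is not a legal raw URI character -> throws
        } catch (URISyntaxException e) {
            System.out.println("URI rejects the glob: " + e.getMessage());
        }
        // Hadoop's Path quotes illegal characters when building its URI,
        // so the same string can still be handled as a file-system path/glob.
        Path p = new Path(glob);
        System.out.println("Path URI: " + p.toUri());
    }
}
{code}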

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1649) FRJoin fails to compute number of input files for replicated input

2010-09-28 Thread Thejas M Nair (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair updated PIG-1649:
---

Status: Patch Available  (was: Open)

 FRJoin fails to compute number of input files for replicated input
 --

 Key: PIG-1649
 URL: https://issues.apache.org/jira/browse/PIG-1649
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0

 Attachments: PIG-1649.1.patch


 In FRJoin, if the input path has curly braces, it fails to compute the number 
 of input files and logs the following exception -
 10/09/27 14:31:13 WARN mapReduceLayer.MRCompiler: failed to get number of 
 input files
 java.net.URISyntaxException: Illegal character in path at index 12: 
 /user/tejas/{std*txt}
 at java.net.URI$Parser.fail(URI.java:2809)
 at java.net.URI$Parser.checkChars(URI.java:2982)
 at java.net.URI$Parser.parseHierarchical(URI.java:3066)
 at java.net.URI$Parser.parse(URI.java:3024)
 at java.net.URI.init(URI.java:578)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.hasTooManyInputFiles(MRCompiler.java:1283)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.visitFRJoin(MRCompiler.java:1203)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFRJoin.visit(POFRJoin.java:188)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.compile(MRCompiler.java:475)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.compile(MRCompiler.java:454)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.compile(MRCompiler.java:336)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.compile(MapReduceLauncher.java:468)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:116)
 at 
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:301)
 at 
 org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1197)
 at org.apache.pig.PigServer.storeEx(PigServer.java:873)
 at org.apache.pig.PigServer.store(PigServer.java:815)
 at org.apache.pig.PigServer.openIterator(PigServer.java:727)
 at 
 org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:612)
 at 
 org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:301)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:141)
 at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:76)
 at org.apache.pig.Main.run(Main.java:453)
 at org.apache.pig.Main.main(Main.java:107)
 This does not cause the query to fail, but since the number of input files 
 does not get calculated, the optimizations added in PIG-1458 to reduce load on 
 the name node will not be used.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1648) Split combination may return too many block locations to map/reduce framework

2010-09-28 Thread Yan Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12915852#action_12915852
 ] 

Yan Zhou commented on PIG-1648:
---

test-patch results:

 [exec] +1 overall.
 [exec]
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec]
 [exec] +1 tests included.  The patch appears to include 3 new or 
modified tests.
 [exec]
 [exec] +1 javadoc.  The javadoc tool did not generate any warning 
messages.
 [exec]
 [exec] +1 javac.  The applied patch does not increase the total number 
of javac compiler warnings.
 [exec]
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs 
warnings.
 [exec]
 [exec] +1 release audit.  The applied patch does not increase the 
total number of release audit warnings.

test-core tests pass too.


 Split combination may return too many block locations to map/reduce framework
 -

 Key: PIG-1648
 URL: https://issues.apache.org/jira/browse/PIG-1648
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Yan Zhou
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1648.patch


 For instance, if one small split has block locations h1, h2 and h3, and another 
 small split has h1, h3 and h4, then after combination the composite split 
 contains 4 block locations. If the number of component splits is large, the 
 number of block locations can be large too. Since the block locations serve as 
 a hint to M/R about the best hosts to run this composite split on, the list 
 should be a short list, say 5, of the hosts that hold the most data in this 
 composite split.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1652) TestSortedTableUnion and TestSortedTableUnionMergeJoin fail on trunk due to estimateNumberOfReducers bug

2010-09-28 Thread Daniel Dai (JIRA)
TestSortedTableUnion and TestSortedTableUnionMergeJoin fail on trunk due to 
estimateNumberOfReducers bug


 Key: PIG-1652
 URL: https://issues.apache.org/jira/browse/PIG-1652
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.8.0
Reporter: Daniel Dai
 Fix For: 0.8.0


TestSortedTableUnion and TestSortedTableUnionMergeJoin fail on trunk due to a 
bug in the input size estimation. Here is the stack trace from 
TestSortedTableUnionMergeJoin:

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1002: Unable to store 
alias records3
at org.apache.pig.PigServer.storeEx(PigServer.java:877)
at org.apache.pig.PigServer.store(PigServer.java:815)
at org.apache.pig.PigServer.openIterator(PigServer.java:727)
at 
org.apache.hadoop.zebra.pig.TestSortedTableUnionMergeJoin.testStorer(TestSortedTableUnionMergeJoin.java:203)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2043: 
Unexpected error during execution.
at 
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:326)
at 
org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1197)
at org.apache.pig.PigServer.storeEx(PigServer.java:873)
Caused by: java.lang.IllegalArgumentException: java.net.URISyntaxException: 
Illegal character in scheme name at index 69: 
org.apache.hadoop.zebra.pig.TestSortedTableUnionMergeJoin.testStorer1,file:
at org.apache.hadoop.fs.Path.initialize(Path.java:140)
at org.apache.hadoop.fs.Path.init(Path.java:126)
at org.apache.hadoop.fs.Path.init(Path.java:50)
at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:963)
at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966)
at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966)
at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966)
at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966)
at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966)
at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966)
at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966)
at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966)
at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966)
at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966)
at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966)
at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966)
at 
org.apache.hadoop.fs.FileSystem.globStatusInternal(FileSystem.java:902)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:866)
at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:844)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getTotalInputFileSize(JobControlCompiler.java:715)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.estimateNumberOfReducers(JobControlCompiler.java:688)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.SampleOptimizer.visitMROp(SampleOptimizer.java:140)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceOper.visit(MapReduceOper.java:246)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceOper.visit(MapReduceOper.java:41)
at 
org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:69)
at 
org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:71)
at 
org.apache.pig.impl.plan.DepthFirstWalker.walk(DepthFirstWalker.java:52)
at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.SampleOptimizer.visit(SampleOptimizer.java:69)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.compile(MapReduceLauncher.java:491)
at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:116)
at 
org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:301)
Caused by: java.net.URISyntaxException: Illegal character in scheme name at 
index 69: 
org.apache.hadoop.zebra.pig.TestSortedTableUnionMergeJoin.testStorer1,file:
at java.net.URI$Parser.fail(URI.java:2809)
at java.net.URI$Parser.checkChars(URI.java:2982)
at java.net.URI$Parser.parse(URI.java:3009)
at java.net.URI.init(URI.java:736)
at org.apache.hadoop.fs.Path.initialize(Path.java:137)

The reason is we are trying to 

[jira] Commented: (PIG-1652) TestSortedTableUnion and TestSortedTableUnionMergeJoin fail on trunk due to estimateNumberOfReducers bug

2010-09-28 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12915860#action_12915860
 ] 

Olga Natkovich commented on PIG-1652:
-

I think the code needs to be modified to default to 1 if we can't perform the 
computation.
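
As a rough illustration of that suggestion (a hypothetical helper, not Pig's 
actual JobControlCompiler code), the size-based estimate could be wrapped so 
that any failure falls back to a single reducer:

{code}
// Hypothetical wrapper around a reducer estimate that may throw
// (e.g. an IllegalArgumentException caused by a URISyntaxException, as above).
public final class ReducerEstimateSketch {

    public interface Estimate {
        int compute() throws Exception;
    }

    public static int estimateOrDefault(Estimate estimate) {
        try {
            int n = estimate.compute();
            return n > 0 ? n : 1;
        } catch (Exception e) {
            // Could not perform the computation: default to 1 reducer.
            return 1;
        }
    }
}
{code}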

 TestSortedTableUnion and TestSortedTableUnionMergeJoin fail on trunk due to 
 estimateNumberOfReducers bug
 

 Key: PIG-1652
 URL: https://issues.apache.org/jira/browse/PIG-1652
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.8.0
Reporter: Daniel Dai
 Fix For: 0.8.0


 TestSortedTableUnion and TestSortedTableUnionMergeJoin fail on trunk due to a 
 bug in the input size estimation. Here is the stack trace from 
 TestSortedTableUnionMergeJoin:
 org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1002: Unable to 
 store alias records3
 at org.apache.pig.PigServer.storeEx(PigServer.java:877)
 at org.apache.pig.PigServer.store(PigServer.java:815)
 at org.apache.pig.PigServer.openIterator(PigServer.java:727)
 at 
 org.apache.hadoop.zebra.pig.TestSortedTableUnionMergeJoin.testStorer(TestSortedTableUnionMergeJoin.java:203)
 Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2043: 
 Unexpected error during execution.
 at 
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:326)
 at 
 org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1197)
 at org.apache.pig.PigServer.storeEx(PigServer.java:873)
 Caused by: java.lang.IllegalArgumentException: java.net.URISyntaxException: 
 Illegal character in scheme name at index 69: 
 org.apache.hadoop.zebra.pig.TestSortedTableUnionMergeJoin.testStorer1,file:
 at org.apache.hadoop.fs.Path.initialize(Path.java:140)
 at org.apache.hadoop.fs.Path.init(Path.java:126)
 at org.apache.hadoop.fs.Path.init(Path.java:50)
 at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:963)
 at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966)
 at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966)
 at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966)
 at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966)
 at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966)
 at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966)
 at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966)
 at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966)
 at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966)
 at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966)
 at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966)
 at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966)
 at 
 org.apache.hadoop.fs.FileSystem.globStatusInternal(FileSystem.java:902)
 at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:866)
 at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:844)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getTotalInputFileSize(JobControlCompiler.java:715)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.estimateNumberOfReducers(JobControlCompiler.java:688)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.SampleOptimizer.visitMROp(SampleOptimizer.java:140)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceOper.visit(MapReduceOper.java:246)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceOper.visit(MapReduceOper.java:41)
 at 
 org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:69)
 at 
 org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:71)
 at 
 org.apache.pig.impl.plan.DepthFirstWalker.walk(DepthFirstWalker.java:52)
 at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.SampleOptimizer.visit(SampleOptimizer.java:69)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.compile(MapReduceLauncher.java:491)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:116)
 at 
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:301)
 Caused by: java.net.URISyntaxException: Illegal 

[jira] Assigned: (PIG-1652) TestSortedTableUnion and TestSortedTableUnionMergeJoin fail on trunk due to estimateNumberOfReducers bug

2010-09-28 Thread Olga Natkovich (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olga Natkovich reassigned PIG-1652:
---

Assignee: Thejas M Nair

 TestSortedTableUnion and TestSortedTableUnionMergeJoin fail on trunk due to 
 estimateNumberOfReducers bug
 

 Key: PIG-1652
 URL: https://issues.apache.org/jira/browse/PIG-1652
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.8.0
Reporter: Daniel Dai
Assignee: Thejas M Nair
 Fix For: 0.8.0


 TestSortedTableUnion and TestSortedTableUnionMergeJoin fail on trunk due to a 
 bug in the input size estimation. Here is the stack trace from 
 TestSortedTableUnionMergeJoin:
 org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1002: Unable to 
 store alias records3
 at org.apache.pig.PigServer.storeEx(PigServer.java:877)
 at org.apache.pig.PigServer.store(PigServer.java:815)
 at org.apache.pig.PigServer.openIterator(PigServer.java:727)
 at 
 org.apache.hadoop.zebra.pig.TestSortedTableUnionMergeJoin.testStorer(TestSortedTableUnionMergeJoin.java:203)
 Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2043: 
 Unexpected error during execution.
 at 
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:326)
 at 
 org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1197)
 at org.apache.pig.PigServer.storeEx(PigServer.java:873)
 Caused by: java.lang.IllegalArgumentException: java.net.URISyntaxException: 
 Illegal character in scheme name at index 69: 
 org.apache.hadoop.zebra.pig.TestSortedTableUnionMergeJoin.testStorer1,file:
 at org.apache.hadoop.fs.Path.initialize(Path.java:140)
 at org.apache.hadoop.fs.Path.init(Path.java:126)
 at org.apache.hadoop.fs.Path.init(Path.java:50)
 at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:963)
 at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966)
 at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966)
 at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966)
 at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966)
 at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966)
 at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966)
 at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966)
 at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966)
 at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966)
 at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966)
 at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966)
 at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966)
 at 
 org.apache.hadoop.fs.FileSystem.globStatusInternal(FileSystem.java:902)
 at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:866)
 at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:844)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getTotalInputFileSize(JobControlCompiler.java:715)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.estimateNumberOfReducers(JobControlCompiler.java:688)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.SampleOptimizer.visitMROp(SampleOptimizer.java:140)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceOper.visit(MapReduceOper.java:246)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceOper.visit(MapReduceOper.java:41)
 at 
 org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:69)
 at 
 org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:71)
 at 
 org.apache.pig.impl.plan.DepthFirstWalker.walk(DepthFirstWalker.java:52)
 at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.SampleOptimizer.visit(SampleOptimizer.java:69)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.compile(MapReduceLauncher.java:491)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:116)
 at 
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:301)
 Caused by: java.net.URISyntaxException: Illegal character in scheme name at 
 index 69: 
 

[jira] Created: (PIG-1653) Scripting UDF fails if the path to script is an absolute path

2010-09-28 Thread Daniel Dai (JIRA)
Scripting UDF fails if the path to script is an absolute path
-

 Key: PIG-1653
 URL: https://issues.apache.org/jira/browse/PIG-1653
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Daniel Dai
 Fix For: 0.8.0


The following script fails:
{code}
register '/homes/jianyong/pig/aaa/scriptingudf.py' using jython as myfuncs;
a = load '/user/pig/tests/data/singlefile/studenttab10k' using PigStorage() as 
(name, age, gpa:double);
b = foreach a generate myfuncs.square(gpa);
dump b;
{code}

If we change the register statement to use a relative path (such as 
aaa/scriptingudf.py), it succeeds.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1637) Combiner not use because optimizor inserts a foreach between group and algebric function

2010-09-28 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12915880#action_12915880
 ] 

Daniel Dai commented on PIG-1637:
-

test-patch result for PIG-1637-2.patch:

 [exec] +1 overall.  
 [exec] 
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec] 
 [exec] +1 tests included.  The patch appears to include 3 new or 
modified tests.
 [exec] 
 [exec] +1 javadoc.  The javadoc tool did not generate any warning 
messages.
 [exec] 
 [exec] +1 javac.  The applied patch does not increase the total number 
of javac compiler warnings.
 [exec] 
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs 
warnings.
 [exec] 
 [exec] +1 release audit.  The applied patch does not increase the 
total number of release audit warnings.


 Combiner not use because optimizor inserts a foreach between group and 
 algebric function
 

 Key: PIG-1637
 URL: https://issues.apache.org/jira/browse/PIG-1637
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.8.0

 Attachments: PIG-1637-1.patch, PIG-1637-2.patch


 The following script does not use the combiner after the new optimization 
 change.
 {code}
 A = load ':INPATH:/pigmix/page_views' using 
 org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
 as (user, action, timespent, query_term, ip_addr, timestamp, 
 estimated_revenue, page_info, page_links);
 B = foreach A generate user, (int)timespent as timespent, 
 (double)estimated_revenue as estimated_revenue;
 C = group B all; 
 D = foreach C generate SUM(B.timespent), AVG(B.estimated_revenue);
 store D into ':OUTPATH:';
 {code}
 This is because, after the group, the optimizer detects that the group key is 
 not used afterward and adds a foreach statement after C. This is how it looks 
 after optimization:
 {code}
 A = load ':INPATH:/pigmix/page_views' using 
 org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
 as (user, action, timespent, query_term, ip_addr, timestamp, 
 estimated_revenue, page_info, page_links);
 B = foreach A generate user, (int)timespent as timespent, 
 (double)estimated_revenue as estimated_revenue;
 C = group B all; 
 C1 = foreach C generate B;
 D = foreach C1 generate SUM(B.timespent), AVG(B.estimated_revenue);
 store D into ':OUTPATH:';
 {code}
 That cancels the combiner optimization for D. 
 The way to solve the issue is to merge the inserted C1 with D. Currently, we 
 do not merge these two foreach statements. The reason is that one output of 
 the first foreach (B) is referred to twice in D, and the current rule assumes 
 that after the merge we would need to calculate B twice in D. Actually, C1 is 
 only doing a projection, with no calculation of B, so merging C1 and D will 
 not result in calculating B twice. Therefore C1 and D should be merged.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1648) Split combination may return too many block locations to map/reduce framework

2010-09-28 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12915889#action_12915889
 ] 

Richard Ding commented on PIG-1648:
---

+1

 Split combination may return too many block locations to map/reduce framework
 -

 Key: PIG-1648
 URL: https://issues.apache.org/jira/browse/PIG-1648
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Yan Zhou
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1648.patch


 For instance, if one small split has block locations h1, h2 and h3, and another 
 small split has h1, h3 and h4, then after combination the composite split 
 contains 4 block locations. If the number of component splits is large, the 
 number of block locations can be large too. Since the block locations serve as 
 a hint to M/R about the best hosts to run this composite split on, the list 
 should be a short list, say 5, of the hosts that hold the most data in this 
 composite split.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1648) Split combination may return too many block locations to map/reduce framework

2010-09-28 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1648:
--

Status: Resolved  (was: Patch Available)
Resolution: Fixed

Patch committed to both trunk and the 0.8 branch.

 Split combination may return too many block locations to map/reduce framework
 -

 Key: PIG-1648
 URL: https://issues.apache.org/jira/browse/PIG-1648
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Yan Zhou
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1648.patch


 For instance, if one small split has block locations h1, h2 and h3, and another 
 small split has h1, h3 and h4, then after combination the composite split 
 contains 4 block locations. If the number of component splits is large, the 
 number of block locations can be large too. Since the block locations serve as 
 a hint to M/R about the best hosts to run this composite split on, the list 
 should be a short list, say 5, of the hosts that hold the most data in this 
 composite split.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1648) Split combination may return too many block locations to map/reduce framework

2010-09-28 Thread Yan Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1648:
--

Status: Patch Available  (was: Open)

 Split combination may return too many block locations to map/reduce framework
 -

 Key: PIG-1648
 URL: https://issues.apache.org/jira/browse/PIG-1648
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Yan Zhou
Assignee: Yan Zhou
 Fix For: 0.8.0

 Attachments: PIG-1648.patch


 For instance, if one small split has block locations h1, h2 and h3, and another 
 small split has h1, h3 and h4, then after combination the composite split 
 contains 4 block locations. If the number of component splits is large, the 
 number of block locations can be large too. Since the block locations serve as 
 a hint to M/R about the best hosts to run this composite split on, the list 
 should be a short list, say 5, of the hosts that hold the most data in this 
 composite split.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1579) Intermittent unit test failure for TestScriptUDF.testPythonScriptUDFNullInputOutput

2010-09-28 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12915941#action_12915941
 ] 

Daniel Dai commented on PIG-1579:
-

Rolled back the change and ran the test many times; all tests pass. It seems 
some change between r990721 and now (r1002348) fixed this issue. Will roll 
back the change and close the Jira.

 Intermittent unit test failure for 
 TestScriptUDF.testPythonScriptUDFNullInputOutput
 ---

 Key: PIG-1579
 URL: https://issues.apache.org/jira/browse/PIG-1579
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.8.0

 Attachments: PIG-1579-1.patch


 Error message:
 org.apache.pig.backend.executionengine.ExecException: ERROR 0: Error 
 executing function: Traceback (most recent call last):
   File iostream, line 5, in multStr
 TypeError: can't multiply sequence by non-int of type 'NoneType'
 at 
 org.apache.pig.scripting.jython.JythonFunction.exec(JythonFunction.java:107)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:229)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:295)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:346)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:291)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:236)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:231)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:53)
 at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
 at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
 at org.apache.hadoop.mapred.Child.main(Child.java:170)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1651) PIG class loading mishandled

2010-09-28 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12915945#action_12915945
 ] 

Richard Ding commented on PIG-1651:
---

The problem here is that PigContext uses LogicalPlanBuilder.classloader to 
instantiate the LoadFuncs, but the thread's context ClassLoader is a different 
class loader, and hence a static variable set on the class loaded by one 
loader is not visible to the class loaded by the other loader. The solution is 
to use the same LogicalPlanBuilder.classloader as the context ClassLoader for 
the Thread.
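
A minimal sketch of that idea (the classloader field name comes from the 
comment above; this is not the actual patch):

{code}
// Align the thread's context ClassLoader with the loader Pig uses to
// instantiate LoadFuncs, so a static set via one loader is visible to
// instances later created through reflection.
public final class ContextLoaderSketch {
    public static void align(ClassLoader pigLoader /* e.g. LogicalPlanBuilder.classloader */) {
        Thread.currentThread().setContextClassLoader(pigLoader);
    }
}
{code}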

 PIG class loading mishandled
 

 Key: PIG-1651
 URL: https://issues.apache.org/jira/browse/PIG-1651
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Yan Zhou
Assignee: Richard Ding
 Fix For: 0.8.0


 If zebra.jar is only registered in a PIG script but is not in the CLASSPATH, 
 the query using zebra fails: there appear to be multiple copies of the class 
 loaded into the JVM, so a static variable set previously is not seen after an 
 instance of the class is created through reflection. (After zebra.jar is 
 specified in the CLASSPATH, it works fine.) The exception stack is as follows:
 Backend error message during job submission
 ---
 org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to 
 create input splits for: hdfs://hostname/pathto/zebra_dir :: null
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:284)
 at 
 org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:907)
 at 
 org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:801)
 at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:752)
 at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378)
 at 
 org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
 at 
 org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
 at java.lang.Thread.run(Thread.java:619)
 Caused by: java.lang.NullPointerException
 at 
 org.apache.hadoop.zebra.io.ColumnGroup.getNonDataFilePrefix(ColumnGroup.java:123)
 at 
 org.apache.hadoop.zebra.io.ColumnGroup$CGPathFilter.accept(ColumnGroup.java:2413)
 at 
 org.apache.hadoop.zebra.mapreduce.TableInputFormat$DummyFileInputFormat$MultiPathFilter.accept(TableInputFormat.java:718)
 at 
 org.apache.hadoop.fs.FileSystem$GlobFilter.accept(FileSystem.java:1084)
 at 
 org.apache.hadoop.fs.FileSystem.globStatusInternal(FileSystem.java:919)
 at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:866)
 at 
 org.apache.hadoop.zebra.mapreduce.TableInputFormat$DummyFileInputFormat.listStatus(TableInputFormat.java:780)
 at 
 org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:246)
 at 
 org.apache.hadoop.zebra.mapreduce.TableInputFormat.getRowSplits(TableInputFormat.java:863)
 at 
 org.apache.hadoop.zebra.mapreduce.TableInputFormat.getSplits(TableInputFormat.java:1017)
 at 
 org.apache.hadoop.zebra.mapreduce.TableInputFormat.getSplits(TableInputFormat.java:961)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:269)
 ... 7 more

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1637) Combiner not use because optimizor inserts a foreach between group and algebric function

2010-09-28 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12915950#action_12915950
 ] 

Daniel Dai commented on PIG-1637:
-

Yes, it could be improved as per Xuefu's suggestion. Anyway, the current patch 
solves the combiner-not-used issue, so I will commit this part first and open 
another Jira for the improvement. Also, MergeForEach is a good example for 
exercising the cloning framework 
[PIG-1587|https://issues.apache.org/jira/browse/PIG-1587], so it is better to 
improve it once PIG-1587 is available.

 Combiner not use because optimizor inserts a foreach between group and 
 algebric function
 

 Key: PIG-1637
 URL: https://issues.apache.org/jira/browse/PIG-1637
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.8.0

 Attachments: PIG-1637-1.patch, PIG-1637-2.patch


 The following script does not use the combiner after the new optimization 
 change.
 {code}
 A = load ':INPATH:/pigmix/page_views' using 
 org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
 as (user, action, timespent, query_term, ip_addr, timestamp, 
 estimated_revenue, page_info, page_links);
 B = foreach A generate user, (int)timespent as timespent, 
 (double)estimated_revenue as estimated_revenue;
 C = group B all; 
 D = foreach C generate SUM(B.timespent), AVG(B.estimated_revenue);
 store D into ':OUTPATH:';
 {code}
 This is because, after the group, the optimizer detects that the group key is 
 not used afterward and adds a foreach statement after C. This is how it looks 
 after optimization:
 {code}
 A = load ':INPATH:/pigmix/page_views' using 
 org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
 as (user, action, timespent, query_term, ip_addr, timestamp, 
 estimated_revenue, page_info, page_links);
 B = foreach A generate user, (int)timespent as timespent, 
 (double)estimated_revenue as estimated_revenue;
 C = group B all; 
 C1 = foreach C generate B;
 D = foreach C1 generate SUM(B.timespent), AVG(B.estimated_revenue);
 store D into ':OUTPATH:';
 {code}
 That cancels the combiner optimization for D. 
 The way to solve the issue is to merge the inserted C1 with D. Currently, we 
 do not merge these two foreach statements. The reason is that one output of 
 the first foreach (B) is referred to twice in D, and the current rule assumes 
 that after the merge we would need to calculate B twice in D. Actually, C1 is 
 only doing a projection, with no calculation of B, so merging C1 and D will 
 not result in calculating B twice. Therefore C1 and D should be merged.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1649) FRJoin fails to compute number of input files for replicated input

2010-09-28 Thread Thejas M Nair (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair updated PIG-1649:
---

Attachment: PIG-1649.2.patch

PIG-1649.2.patch addresses review comments from Richard:
- Richard pointed out that the hdfs Path class constructor can fail on a valid 
URI such as the format used for jdbc. So this patch checks whether the input 
location URI has an hdfs scheme before using the hdfs Path constructor.
- The code here can run into the same problem as the one in PIG-1652. The 
patch also includes changes to handle comma-separated file names (sketched 
below).

A better long-term solution would be to have support in LoadFunc or related 
interfaces to check the input size and to decide whether parts of the file 
should be consolidated.
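
For illustration, a standalone sketch of the two checks described above (a 
hypothetical helper, not the attached patch): split comma-separated locations 
and hand only entries with an hdfs scheme, or no scheme at all, to the hdfs 
Path constructor.

{code}
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper mirroring the checks described in the comment above.
public final class InputLocationSketch {
    public static List<String> hdfsCandidates(String location) {
        List<String> result = new ArrayList<String>();
        for (String part : location.split(",")) {
            String p = part.trim();
            if (p.length() == 0) {
                continue;
            }
            // Only entries with an hdfs scheme, or no scheme, are safe to pass
            // to the hdfs Path constructor; others (e.g. jdbc URIs) are skipped.
            if (p.startsWith("hdfs://") || !p.contains("://")) {
                result.add(p);
            }
        }
        return result;
    }
}
{code}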



 FRJoin fails to compute number of input files for replicated input
 --

 Key: PIG-1649
 URL: https://issues.apache.org/jira/browse/PIG-1649
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0

 Attachments: PIG-1649.1.patch, PIG-1649.2.patch


 In FRJoin, if the input path has curly braces, it fails to compute the number 
 of input files and logs the following exception -
 10/09/27 14:31:13 WARN mapReduceLayer.MRCompiler: failed to get number of 
 input files
 java.net.URISyntaxException: Illegal character in path at index 12: 
 /user/tejas/{std*txt}
 at java.net.URI$Parser.fail(URI.java:2809)
 at java.net.URI$Parser.checkChars(URI.java:2982)
 at java.net.URI$Parser.parseHierarchical(URI.java:3066)
 at java.net.URI$Parser.parse(URI.java:3024)
 at java.net.URI.init(URI.java:578)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.hasTooManyInputFiles(MRCompiler.java:1283)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.visitFRJoin(MRCompiler.java:1203)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFRJoin.visit(POFRJoin.java:188)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.compile(MRCompiler.java:475)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.compile(MRCompiler.java:454)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.compile(MRCompiler.java:336)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.compile(MapReduceLauncher.java:468)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:116)
 at 
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:301)
 at 
 org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1197)
 at org.apache.pig.PigServer.storeEx(PigServer.java:873)
 at org.apache.pig.PigServer.store(PigServer.java:815)
 at org.apache.pig.PigServer.openIterator(PigServer.java:727)
 at 
 org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:612)
 at 
 org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:301)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:141)
 at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:76)
 at org.apache.pig.Main.run(Main.java:453)
 at org.apache.pig.Main.main(Main.java:107)
 This does not cause the query to fail, but since the number of input files 
 does not get calculated, the optimizations added in PIG-1458 to reduce load on 
 the name node will not be used.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1649) FRJoin fails to compute number of input files for replicated input

2010-09-28 Thread Thejas M Nair (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair updated PIG-1649:
---

Status: Open  (was: Patch Available)

The patch also includes changes to fix the issue in PIG-1652, since the FRJoin 
code path faces a similar issue.



 FRJoin fails to compute number of input files for replicated input
 --

 Key: PIG-1649
 URL: https://issues.apache.org/jira/browse/PIG-1649
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0

 Attachments: PIG-1649.1.patch, PIG-1649.2.patch


 In FRJoin, if the input path has curly braces, it fails to compute the number 
 of input files and logs the following exception -
 10/09/27 14:31:13 WARN mapReduceLayer.MRCompiler: failed to get number of 
 input files
 java.net.URISyntaxException: Illegal character in path at index 12: 
 /user/tejas/{std*txt}
 at java.net.URI$Parser.fail(URI.java:2809)
 at java.net.URI$Parser.checkChars(URI.java:2982)
 at java.net.URI$Parser.parseHierarchical(URI.java:3066)
 at java.net.URI$Parser.parse(URI.java:3024)
 at java.net.URI.init(URI.java:578)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.hasTooManyInputFiles(MRCompiler.java:1283)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.visitFRJoin(MRCompiler.java:1203)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFRJoin.visit(POFRJoin.java:188)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.compile(MRCompiler.java:475)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.compile(MRCompiler.java:454)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.compile(MRCompiler.java:336)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.compile(MapReduceLauncher.java:468)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:116)
 at 
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:301)
 at 
 org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1197)
 at org.apache.pig.PigServer.storeEx(PigServer.java:873)
 at org.apache.pig.PigServer.store(PigServer.java:815)
 at org.apache.pig.PigServer.openIterator(PigServer.java:727)
 at 
 org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:612)
 at 
 org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:301)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:141)
 at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:76)
 at org.apache.pig.Main.run(Main.java:453)
 at org.apache.pig.Main.main(Main.java:107)
 This does not cause the query to fail, but since the number of input files 
 does not get calculated, the optimizations added in PIG-1458 to reduce load on 
 the name node will not be used.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (PIG-1652) TestSortedTableUnion and TestSortedTableUnionMergeJoin fail on trunk due to estimateNumberOfReducers bug

2010-09-28 Thread Thejas M Nair (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair resolved PIG-1652.


Resolution: Duplicate

Marking as duplicate of PIG-1649 because the code path to consolidate input 
files in FRJoin also has the same issue. 


 TestSortedTableUnion and TestSortedTableUnionMergeJoin fail on trunk due to 
 estimateNumberOfReducers bug
 

 Key: PIG-1652
 URL: https://issues.apache.org/jira/browse/PIG-1652
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: 0.8.0
Reporter: Daniel Dai
Assignee: Thejas M Nair
 Fix For: 0.8.0


 TestSortedTableUnion and TestSortedTableUnionMergeJoin fail on trunk due to a 
 bug in the input size estimation. Here is the stack trace from 
 TestSortedTableUnionMergeJoin:
 org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1002: Unable to 
 store alias records3
 at org.apache.pig.PigServer.storeEx(PigServer.java:877)
 at org.apache.pig.PigServer.store(PigServer.java:815)
 at org.apache.pig.PigServer.openIterator(PigServer.java:727)
 at 
 org.apache.hadoop.zebra.pig.TestSortedTableUnionMergeJoin.testStorer(TestSortedTableUnionMergeJoin.java:203)
 Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2043: 
 Unexpected error during execution.
 at 
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:326)
 at 
 org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1197)
 at org.apache.pig.PigServer.storeEx(PigServer.java:873)
 Caused by: java.lang.IllegalArgumentException: java.net.URISyntaxException: 
 Illegal character in scheme name at index 69: 
 org.apache.hadoop.zebra.pig.TestSortedTableUnionMergeJoin.testStorer1,file:
 at org.apache.hadoop.fs.Path.initialize(Path.java:140)
 at org.apache.hadoop.fs.Path.init(Path.java:126)
 at org.apache.hadoop.fs.Path.init(Path.java:50)
 at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:963)
 at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966)
 at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966)
 at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966)
 at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966)
 at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966)
 at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966)
 at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966)
 at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966)
 at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966)
 at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966)
 at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966)
 at org.apache.hadoop.fs.FileSystem.globPathsLevel(FileSystem.java:966)
 at 
 org.apache.hadoop.fs.FileSystem.globStatusInternal(FileSystem.java:902)
 at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:866)
 at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:844)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getTotalInputFileSize(JobControlCompiler.java:715)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.estimateNumberOfReducers(JobControlCompiler.java:688)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.SampleOptimizer.visitMROp(SampleOptimizer.java:140)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceOper.visit(MapReduceOper.java:246)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceOper.visit(MapReduceOper.java:41)
 at 
 org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:69)
 at 
 org.apache.pig.impl.plan.DepthFirstWalker.depthFirst(DepthFirstWalker.java:71)
 at 
 org.apache.pig.impl.plan.DepthFirstWalker.walk(DepthFirstWalker.java:52)
 at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:51)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.SampleOptimizer.visit(SampleOptimizer.java:69)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.compile(MapReduceLauncher.java:491)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:116)
 at 
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:301)
 Caused by: 

[jira] Updated: (PIG-1651) PIG class loading mishandled

2010-09-28 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1651:
--

Status: Patch Available  (was: Open)

 PIG class loading mishandled
 

 Key: PIG-1651
 URL: https://issues.apache.org/jira/browse/PIG-1651
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Yan Zhou
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1651.patch


 If zebra.jar is only registered in a PIG script but is not in the CLASSPATH, 
 the query using zebra fails: there appear to be multiple copies of the class 
 loaded into the JVM, so a static variable set previously is not seen after an 
 instance of the class is created through reflection. (After zebra.jar is 
 specified in the CLASSPATH, it works fine.) The exception stack is as follows:
 Backend error message during job submission
 ---
 org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to 
 create input splits for: hdfs://hostname/pathto/zebra_dir :: null
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:284)
 at 
 org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:907)
 at 
 org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:801)
 at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:752)
 at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378)
 at 
 org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
 at 
 org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
 at java.lang.Thread.run(Thread.java:619)
 Caused by: java.lang.NullPointerException
 at 
 org.apache.hadoop.zebra.io.ColumnGroup.getNonDataFilePrefix(ColumnGroup.java:123)
 at 
 org.apache.hadoop.zebra.io.ColumnGroup$CGPathFilter.accept(ColumnGroup.java:2413)
 at 
 org.apache.hadoop.zebra.mapreduce.TableInputFormat$DummyFileInputFormat$MultiPathFilter.accept(TableInputFormat.java:718)
 at 
 org.apache.hadoop.fs.FileSystem$GlobFilter.accept(FileSystem.java:1084)
 at 
 org.apache.hadoop.fs.FileSystem.globStatusInternal(FileSystem.java:919)
 at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:866)
 at 
 org.apache.hadoop.zebra.mapreduce.TableInputFormat$DummyFileInputFormat.listStatus(TableInputFormat.java:780)
 at 
 org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:246)
 at 
 org.apache.hadoop.zebra.mapreduce.TableInputFormat.getRowSplits(TableInputFormat.java:863)
 at 
 org.apache.hadoop.zebra.mapreduce.TableInputFormat.getSplits(TableInputFormat.java:1017)
 at 
 org.apache.hadoop.zebra.mapreduce.TableInputFormat.getSplits(TableInputFormat.java:961)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:269)
 ... 7 more

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1651) PIG class loading mishandled

2010-09-28 Thread Richard Ding (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Ding updated PIG-1651:
--

Attachment: PIG-1651.patch

 PIG class loading mishandled
 

 Key: PIG-1651
 URL: https://issues.apache.org/jira/browse/PIG-1651
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Yan Zhou
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1651.patch


 If zebra.jar is only registered in a PIG script but is not in the CLASSPATH, 
 the query using zebra fails: there appear to be multiple copies of the class 
 loaded into the JVM, so a static variable set previously is not seen after an 
 instance of the class is created through reflection. (After zebra.jar is 
 specified in the CLASSPATH, it works fine.) The exception stack is as follows:
 Backend error message during job submission
 ---
 org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to 
 create input splits for: hdfs://hostname/pathto/zebra_dir :: null
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:284)
 at 
 org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:907)
 at 
 org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:801)
 at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:752)
 at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378)
 at 
 org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
 at 
 org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
 at java.lang.Thread.run(Thread.java:619)
 Caused by: java.lang.NullPointerException
 at 
 org.apache.hadoop.zebra.io.ColumnGroup.getNonDataFilePrefix(ColumnGroup.java:123)
 at 
 org.apache.hadoop.zebra.io.ColumnGroup$CGPathFilter.accept(ColumnGroup.java:2413)
 at 
 org.apache.hadoop.zebra.mapreduce.TableInputFormat$DummyFileInputFormat$MultiPathFilter.accept(TableInputFormat.java:718)
 at 
 org.apache.hadoop.fs.FileSystem$GlobFilter.accept(FileSystem.java:1084)
 at 
 org.apache.hadoop.fs.FileSystem.globStatusInternal(FileSystem.java:919)
 at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:866)
 at 
 org.apache.hadoop.zebra.mapreduce.TableInputFormat$DummyFileInputFormat.listStatus(TableInputFormat.java:780)
 at 
 org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:246)
 at 
 org.apache.hadoop.zebra.mapreduce.TableInputFormat.getRowSplits(TableInputFormat.java:863)
 at 
 org.apache.hadoop.zebra.mapreduce.TableInputFormat.getSplits(TableInputFormat.java:1017)
 at 
 org.apache.hadoop.zebra.mapreduce.TableInputFormat.getSplits(TableInputFormat.java:961)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:269)
 ... 7 more
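 To make the failure mode concrete, here is a minimal sketch of the kind of script 
 that triggers it; the jar path is hypothetical and the loader class name 
 (org.apache.hadoop.zebra.pig.TableLoader) is assumed from zebra's Pig integration, 
 while the table path is taken from the error above:
 {code}
 -- zebra.jar is registered only inside the script and is not on the CLASSPATH,
 -- so the zebra classes are loaded through Pig's own class loader.
 register /path/to/zebra.jar;
 A = load 'hdfs://hostname/pathto/zebra_dir'
     using org.apache.hadoop.zebra.pig.TableLoader();
 dump A;
 {code}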

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1651) PIG class loading mishandled

2010-09-28 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12915959#action_12915959
 ] 

Daniel Dai commented on PIG-1651:
-

+1

 PIG class loading mishandled
 

 Key: PIG-1651
 URL: https://issues.apache.org/jira/browse/PIG-1651
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Yan Zhou
Assignee: Richard Ding
 Fix For: 0.8.0

 Attachments: PIG-1651.patch


 If zebra.jar is only registered in the Pig script and is not on the 
 CLASSPATH, a query that uses zebra fails: the class appears to be loaded 
 more than once into the JVM, so a static variable set earlier is not seen 
 after an instance of the class is created through reflection. (Once 
 zebra.jar is also specified in the CLASSPATH, it works fine.) The exception 
 stack is as follows:
 Backend error message during job submission
 ---
 org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to 
 create input splits for: hdfs://hostname/pathto/zebra_dir :: null
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:284)
 at 
 org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:907)
 at 
 org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:801)
 at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:752)
 at org.apache.hadoop.mapred.jobcontrol.Job.submit(Job.java:378)
 at 
 org.apache.hadoop.mapred.jobcontrol.JobControl.startReadyJobs(JobControl.java:247)
 at 
 org.apache.hadoop.mapred.jobcontrol.JobControl.run(JobControl.java:279)
 at java.lang.Thread.run(Thread.java:619)
 Caused by: java.lang.NullPointerException
 at 
 org.apache.hadoop.zebra.io.ColumnGroup.getNonDataFilePrefix(ColumnGroup.java:123)
 at 
 org.apache.hadoop.zebra.io.ColumnGroup$CGPathFilter.accept(ColumnGroup.java:2413)
 at 
 org.apache.hadoop.zebra.mapreduce.TableInputFormat$DummyFileInputFormat$MultiPathFilter.accept(TableInputFormat.java:718)
 at 
 org.apache.hadoop.fs.FileSystem$GlobFilter.accept(FileSystem.java:1084)
 at 
 org.apache.hadoop.fs.FileSystem.globStatusInternal(FileSystem.java:919)
 at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:866)
 at 
 org.apache.hadoop.zebra.mapreduce.TableInputFormat$DummyFileInputFormat.listStatus(TableInputFormat.java:780)
 at 
 org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:246)
 at 
 org.apache.hadoop.zebra.mapreduce.TableInputFormat.getRowSplits(TableInputFormat.java:863)
 at 
 org.apache.hadoop.zebra.mapreduce.TableInputFormat.getSplits(TableInputFormat.java:1017)
 at 
 org.apache.hadoop.zebra.mapreduce.TableInputFormat.getSplits(TableInputFormat.java:961)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat.getSplits(PigInputFormat.java:269)
 ... 7 more

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1654) Pig should check schema alias duplication at any levels.

2010-09-28 Thread Xuefu Zhang (JIRA)
Pig should check schema alias duplication at any levels.


 Key: PIG-1654
 URL: https://issues.apache.org/jira/browse/PIG-1654
 Project: Pig
  Issue Type: Bug
Reporter: Xuefu Zhang
Assignee: Xuefu Zhang
 Fix For: 0.9.0


The following script appears valid to Pig but it shouldn't:

A = load 'file' as (a:tuple( u:int, u:bytearray, w:long), b:int, c:chararray);
dump A;

Pig tries to launch map/reduce jobs for this.

However, for the following script, Pig correctly reports an error message:

A = load 'file' as (a:int, b:long, c:bytearray);
dump A;

Error message is:
2010-09-28 15:53:37,390 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
1108: Duplicate schema alias: b in A

Thus, Pig only checks alias duplication at the top level, which is confirmed by 
looking at the code. The right behavior is that the same check should be 
applied at all levels. 

This should be addressed in the new parser.
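
For contrast, here is the same load with unique aliases inside the nested tuple; 
the rename of the second inner field to v is purely illustrative. Under the 
proposed check, the duplicated u above should be rejected while this should pass:

A = load 'file' as (a:tuple( u:int, v:bytearray, w:long), b:int, c:chararray);
dump A;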



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1654) Pig should check schema alias duplication at any levels.

2010-09-28 Thread Xuefu Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xuefu Zhang updated PIG-1654:
-

Description: 
The following script appears valid to Pig but it shouldn't:

A = load 'file' as (a:tuple( u:int, u:bytearray, w:long), b:int, c:chararray);
dump A;

Pig tries to launch map/reduce jobs for this.

However, for the following script, Pig correctly reports an error message:

A = load 'file' as (a:int, a:long, c:bytearray);
dump A;

Error message is:
2010-09-28 15:53:37,390 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
1108: Duplicate schema alias: b in A

Thus, Pig only checks alias duplication at the top level, which is confirmed by 
looking at the code. The right behavior is that the same check should be 
applied at all levels. 

This should be addressed in the new parser.



  was:
The following script appears valid to Pig but it shouldn't:

A = load 'file' as (a:tuple( u:int, u:bytearray, w:long), b:int, c:chararray);
dump A;

Pig tries to launch map/reduce jobs for this.

However, for the following script, Pig correctly reports an error message:

A = load 'file' as (a:int, b:long, c:bytearray);
dump A;

Error message is:
2010-09-28 15:53:37,390 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
1108: Duplicate schema alias: b in A

Thus, Pig only checks alias duplication at the top level, which is confirmed by 
looking at the code. The right behavior is that the same check should be 
applied at all levels. 

This should be addressed in the new parser.




 Pig should check schema alias duplication at any levels.
 

 Key: PIG-1654
 URL: https://issues.apache.org/jira/browse/PIG-1654
 Project: Pig
  Issue Type: Bug
Reporter: Xuefu Zhang
Assignee: Xuefu Zhang
 Fix For: 0.9.0


 The following script appears valid to Pig but it shouldn't:
 A = load 'file' as (a:tuple( u:int, u:bytearray, w:long), b:int, c:chararray);
 dump A;
 Pig tries to launch map/reduce jobs for this.
 However, for the following script, Pig correctly reports an error message:
 A = load 'file' as (a:int, a:long, c:bytearray);
 dump A;
 Error message is:
 2010-09-28 15:53:37,390 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 
 1108: Duplicate schema alias: b in A
 Thus, Pig only checks alias duplication at the top level, which is confirmed 
 by looking at the code. The right behavior is that the same check should be 
 applied at all levels. 
 This should be addressed in the new parser.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1655) code duplicated for udfs that were moved from piggybank to builtin

2010-09-28 Thread Thejas M Nair (JIRA)
code duplicated for udfs that were moved from piggybank to builtin
--

 Key: PIG-1655
 URL: https://issues.apache.org/jira/browse/PIG-1655
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0


As part of PIG-1405, some udfs from piggybank were made standard udfs. But now 
the code is duplicated in piggybank and org.apache.pig.builtin, which can cause 
confusion.
I am planning to make these udfs in piggybank subclasses of those in 
org.apache.pig.builtin so that users don't have to change their scripts.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1656) TOBAG & TOTUPLE udfs ignore columns with null value; TOBAG does not use input type to determine output schema

2010-09-28 Thread Thejas M Nair (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair updated PIG-1656:
---

Fix Version/s: 0.8.0
Affects Version/s: 0.8.0

 TOBAG & TOTUPLE udfs ignore columns with null value; TOBAG does not use 
 input type to determine output schema
 ---

 Key: PIG-1656
 URL: https://issues.apache.org/jira/browse/PIG-1656
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0


 TOBAG & TOTUPLE udfs ignore columns with null value
 {code}
 R4= foreach B generate $0, TOTUPLE(null, id, null),  TOBAG( id, null, id,null 
 );
 grunt> dump R4;
 1000(,1,)   {(1),(1)}
 1000(,2,)   {(2),(2)}
 1000(,3,)   {(3),(3)}
 1000(,4,)   {(4),(4)}
 {code}
  TOBAG does not use input type to determine output schema
 {code}
 grunt> B1 = foreach B generate TOBAG( 1, 2, 3); 
 grunt> describe B1;
 B1: {{null}}
 {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1656) TOBAG & TOTUPLE udfs ignore columns with null value; TOBAG does not use input type to determine output schema

2010-09-28 Thread Thejas M Nair (JIRA)
TOBAG & TOTUPLE udfs ignore columns with null value; TOBAG does not use input 
type to determine output schema
---

 Key: PIG-1656
 URL: https://issues.apache.org/jira/browse/PIG-1656
 Project: Pig
  Issue Type: Bug
Reporter: Thejas M Nair
Assignee: Thejas M Nair


TOBAG & TOTUPLE udfs ignore columns with null value
{code}
R4= foreach B generate $0, TOTUPLE(null, id, null),  TOBAG( id, null, id,null );
grunt> dump R4;
1000(,1,)   {(1),(1)}
1000(,2,)   {(2),(2)}
1000(,3,)   {(3),(3)}
1000(,4,)   {(4),(4)}
{code}


 TOBAG does not use input type to determine output schema
{code}
grunt> B1 = foreach B generate TOBAG( 1, 2, 3); 
grunt> describe B1;
B1: {{null}}
{code}
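
For reference, a hedged sketch of what one would expect once TOBAG propagates the 
input type to the output schema; the alias name ids is an illustrative assumption, 
and the comments describe expected rather than current behavior:
{code}
-- id is an int column in B, so after the fix the bag's inner tuple schema
-- should carry int instead of null.
B1 = foreach B generate TOBAG(id, id) as ids;
describe B1;
{code}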


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1542) log level not propagated to MR task loggers

2010-09-28 Thread niraj rai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

niraj rai updated PIG-1542:
---

Attachment: PIG-1542_2.patch

 log level not propagated to MR task loggers
 ---

 Key: PIG-1542
 URL: https://issues.apache.org/jira/browse/PIG-1542
 Project: Pig
  Issue Type: Bug
Reporter: Thejas M Nair
Assignee: niraj rai
 Fix For: 0.8.0

 Attachments: PIG-1542.patch, PIG-1542_1.patch, PIG-1542_2.patch


 Specifying -d DEBUG does not affect the logging of the MR tasks.
 This was fixed earlier in PIG-882.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1542) log level not propagated to MR task loggers

2010-09-28 Thread niraj rai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

niraj rai updated PIG-1542:
---

Status: Open  (was: Patch Available)

 log level not propagated to MR task loggers
 ---

 Key: PIG-1542
 URL: https://issues.apache.org/jira/browse/PIG-1542
 Project: Pig
  Issue Type: Bug
Reporter: Thejas M Nair
Assignee: niraj rai
 Fix For: 0.8.0

 Attachments: PIG-1542.patch, PIG-1542_1.patch, PIG-1542_2.patch


 Specifying -d DEBUG does not affect the logging of the MR tasks.
 This was fixed earlier in PIG-882.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1542) log level not propagated to MR task loggers

2010-09-28 Thread niraj rai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

niraj rai updated PIG-1542:
---

Status: Patch Available  (was: Open)

 log level not propagated to MR task loggers
 ---

 Key: PIG-1542
 URL: https://issues.apache.org/jira/browse/PIG-1542
 Project: Pig
  Issue Type: Bug
Reporter: Thejas M Nair
Assignee: niraj rai
 Fix For: 0.8.0

 Attachments: PIG-1542.patch, PIG-1542_1.patch, PIG-1542_2.patch


 Specifying -d DEBUG does not affect the logging of the MR tasks.
 This was fixed earlier in PIG-882.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1649) FRJoin fails to compute number of input files for replicated input

2010-09-28 Thread Thejas M Nair (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair updated PIG-1649:
---

Attachment: PIG-1649.4.patch

New patch addressing comments from Richard:
- UriUtil.isHDFSFile(String uri) now returns false if uri is null.
- Modified a test in TestFRJoin2 to use comma-separated file names.

 FRJoin fails to compute number of input files for replicated input
 --

 Key: PIG-1649
 URL: https://issues.apache.org/jira/browse/PIG-1649
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0

 Attachments: PIG-1649.1.patch, PIG-1649.2.patch, PIG-1649.3.patch, 
 PIG-1649.4.patch


 In FRJoin, if input path has curly braces, it fails to compute number of 
 input files and logs the following exception in the log -
 10/09/27 14:31:13 WARN mapReduceLayer.MRCompiler: failed to get number of 
 input files
 java.net.URISyntaxException: Illegal character in path at index 12: 
 /user/tejas/{std*txt}
 at java.net.URI$Parser.fail(URI.java:2809)
 at java.net.URI$Parser.checkChars(URI.java:2982)
 at java.net.URI$Parser.parseHierarchical(URI.java:3066)
 at java.net.URI$Parser.parse(URI.java:3024)
 at java.net.URI.init(URI.java:578)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.hasTooManyInputFiles(MRCompiler.java:1283)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.visitFRJoin(MRCompiler.java:1203)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFRJoin.visit(POFRJoin.java:188)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.compile(MRCompiler.java:475)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.compile(MRCompiler.java:454)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.compile(MRCompiler.java:336)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.compile(MapReduceLauncher.java:468)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:116)
 at 
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:301)
 at 
 org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1197)
 at org.apache.pig.PigServer.storeEx(PigServer.java:873)
 at org.apache.pig.PigServer.store(PigServer.java:815)
 at org.apache.pig.PigServer.openIterator(PigServer.java:727)
 at 
 org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:612)
 at 
 org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:301)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:141)
 at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:76)
 at org.apache.pig.Main.run(Main.java:453)
 at org.apache.pig.Main.main(Main.java:107)
 This does not cause the query to fail. But since the number of input files 
 does not get calculated, the optimizations added in PIG-1458 to reduce load 
 on the name node will not be used.
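 For context, a minimal sketch of a fragment-replicate join whose replicated input 
 uses a curly-brace glob, the case that trips the URI parsing in 
 hasTooManyInputFiles; the aliases and the big-table path are hypothetical, while 
 the glob is the one from the log above:
 {code}
 big   = load '/user/tejas/big_table' as (k:chararray, w:int);
 -- the small, replicated side is loaded with a curly-brace glob
 small = load '/user/tejas/{std*txt}' as (k:chararray, v:int);
 J = join big by k, small by k using 'replicated';
 dump J;
 {code}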

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1649) FRJoin fails to compute number of input files for replicated input

2010-09-28 Thread Richard Ding (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12915985#action_12915985
 ] 

Richard Ding commented on PIG-1649:
---

+1. Looks good.

 FRJoin fails to compute number of input files for replicated input
 --

 Key: PIG-1649
 URL: https://issues.apache.org/jira/browse/PIG-1649
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Assignee: Thejas M Nair
 Fix For: 0.8.0

 Attachments: PIG-1649.1.patch, PIG-1649.2.patch, PIG-1649.3.patch, 
 PIG-1649.4.patch


 In FRJoin, if input path has curly braces, it fails to compute number of 
 input files and logs the following exception in the log -
 10/09/27 14:31:13 WARN mapReduceLayer.MRCompiler: failed to get number of 
 input files
 java.net.URISyntaxException: Illegal character in path at index 12: 
 /user/tejas/{std*txt}
 at java.net.URI$Parser.fail(URI.java:2809)
 at java.net.URI$Parser.checkChars(URI.java:2982)
 at java.net.URI$Parser.parseHierarchical(URI.java:3066)
 at java.net.URI$Parser.parse(URI.java:3024)
 at java.net.URI.init(URI.java:578)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.hasTooManyInputFiles(MRCompiler.java:1283)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.visitFRJoin(MRCompiler.java:1203)
 at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFRJoin.visit(POFRJoin.java:188)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.compile(MRCompiler.java:475)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.compile(MRCompiler.java:454)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.compile(MRCompiler.java:336)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.compile(MapReduceLauncher.java:468)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:116)
 at 
 org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:301)
 at 
 org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1197)
 at org.apache.pig.PigServer.storeEx(PigServer.java:873)
 at org.apache.pig.PigServer.store(PigServer.java:815)
 at org.apache.pig.PigServer.openIterator(PigServer.java:727)
 at 
 org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:612)
 at 
 org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:301)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:141)
 at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:76)
 at org.apache.pig.Main.run(Main.java:453)
 at org.apache.pig.Main.main(Main.java:107)
 This does not cause the query to fail. But since the number of input files 
 does not get calculated, the optimizations added in PIG-1458 to reduce load 
 on the name node will not be used.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Resolved: (PIG-1637) Combiner not used because optimizer inserts a foreach between group and algebraic function

2010-09-28 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai resolved PIG-1637.
-

Hadoop Flags: [Reviewed]
  Resolution: Fixed

All tests pass except for TestSortedTableUnion / TestSortedTableUnionMergeJoin 
for zebra, which already fail and will be addressed by 
[PIG-1649|https://issues.apache.org/jira/browse/PIG-1649].

Patch committed to both trunk and 0.8 branch.

 Combiner not used because optimizer inserts a foreach between group and 
 algebraic function
 

 Key: PIG-1637
 URL: https://issues.apache.org/jira/browse/PIG-1637
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.8.0
Reporter: Daniel Dai
Assignee: Daniel Dai
 Fix For: 0.8.0

 Attachments: PIG-1637-1.patch, PIG-1637-2.patch


 The following script does not use the combiner after the new optimization change.
 {code}
 A = load ':INPATH:/pigmix/page_views' using 
 org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
 as (user, action, timespent, query_term, ip_addr, timestamp, 
 estimated_revenue, page_info, page_links);
 B = foreach A generate user, (int)timespent as timespent, 
 (double)estimated_revenue as estimated_revenue;
 C = group B all; 
 D = foreach C generate SUM(B.timespent), AVG(B.estimated_revenue);
 store D into ':OUTPATH:';
 {code}
 This is because, after the group, the optimizer detects that the group key is 
 not used afterward and adds a foreach statement after C. This is how the 
 script looks after optimization:
 {code}
 A = load ':INPATH:/pigmix/page_views' using 
 org.apache.pig.test.udf.storefunc.PigPerformanceLoader()
 as (user, action, timespent, query_term, ip_addr, timestamp, 
 estimated_revenue, page_info, page_links);
 B = foreach A generate user, (int)timespent as timespent, 
 (double)estimated_revenue as estimated_revenue;
 C = group B all; 
 C1 = foreach C generate B;
 D = foreach C1 generate SUM(B.timespent), AVG(B.estimated_revenue);
 store D into ':OUTPATH:';
 {code}
 That cancels the combiner optimization for D. 
 The way to solve the issue is to merge the inserted C1 into D. Currently, 
 we do not merge these two foreach statements. The reason is that one output 
 of the first foreach (B) is referred to twice in D, and the current rule 
 assumes that after the merge we would need to calculate B twice in D. 
 Actually, C1 only does a projection, with no calculation of B, so merging C1 
 and D will not result in calculating B twice. Therefore C1 and D should be 
 merged.
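 To illustrate the proposed merge, a hedged sketch of the plan once C1 is folded 
 into D; no new operators are introduced, it simply restores the shape of the 
 original script, which is what allows the combiner to be used for SUM and AVG:
 {code}
 C = group B all; 
 -- C1's projection of B is absorbed into D, so D again consumes C directly
 D = foreach C generate SUM(B.timespent), AVG(B.estimated_revenue);
 store D into ':OUTPATH:';
 {code}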

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.