[jira] Commented: (PIG-845) PERFORMANCE: Merge Join
[ https://issues.apache.org/jira/browse/PIG-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12741621#action_12741621 ]

Pradeep Kamath commented on PIG-845:

Review comments:

1) In LogicalPlanTester.java, why is the following change required?
{noformat}
@@ -198,7 +198,7 @@ private LogicalPlan buildPlan(String query, ClassLoader cldr) {
     LogicalPlanBuilder.classloader = LogicalPlanTester.class.getClassLoader() ;
-    PigContext pigContext = new PigContext(ExecType.LOCAL, new Properties());
+    PigContext pigContext = new PigContext(ExecType.MAPREDUCE, new Properties());
     try {
         pigContext.connect();
     } catch (ExecException e1) {
{noformat}
Typically, when PigContext is constructed in map-reduce mode, the properties should correspond to the cluster configuration. The initialization above therefore seems odd, because the Properties object passed to the constructor is empty.

2) In PigMapBase.java:
{code}
public static final String END_OF_INP_IN_MAP = "pig.stream.in.map";
{code}
can change to
{code}
public static final String END_OF_INP_IN_MAP = "pig.blocking.operator.in.map";
{code}
and this should be made a public static member of JobControlCompiler. In JobControlCompiler.java, jobConf.set("pig.stream.in.map", "true"); should then use that public static String.

3) Remove the following comment in QueryParser.jjt (line 302):
{code}
 * Join parser. Currently can only handle skewed joins.
{code}

4) In QueryParser.jjt, the joinPlans passed to the LOJoin constructor is not a LinkedMultiMap, but in LogToPhyTranslationVisitor the join plans are put in a LinkedMultiMap. If order is important, shouldn't QueryParser.jjt also change?

5) Some comments in LogToPhyTranslationVisitor about the different lists and maps would help :)

6) In validateMergeJoin(), the code only considers the direct successors and predecessors of LOJoin. It should check the entire plan and ensure that the predecessors of LOJoin, all the way up to the LOLoad, are only LOForEach and LOFilter.
Strictly we should not allow LOForEach, since it could change the sort order or the position of the join keys and hence invalidate the index - but we need it so that the ForEach introduced by the TypeCastInserter (when there is a schema for either of the inputs) remains. The documentation should note that only ForEach and Filter operators that preserve sort order and join-key position are allowed as predecessors of a merge join, and validateMergeJoin() should check the same. It is better to use a whitelist of allowed operators than a blacklist of disallowed ones, since a blacklist would need to be updated every time a new operator comes along. The exception source here is not really a bug but a user input error, since merge join really does not support other operators.
For the successors, everything from the merge join down to the map leaf should be checked to ensure stream is absent. (Really there should be no restriction on stream being present after the join - if there is a current issue with this, it is fine to disallow stream for now, but eventually it would be good to have no restriction on what follows the merge join.) You can just use a visitor to check for the presence of stream in the plan; this should be done after the complete LogToPhyTranslation, in visit(), so that the whole plan can be examined.

7) Is MRStreamHandler.java now replaced by /org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/plans/EndOfAllInputSetter.java?

8) Some of the MRCompilerExceptions do not follow the error handling spec (errcode, errMsg, src).

9) Should the assert() statements in MRCompiler be replaced with exceptions, since assertions are disabled by default in Java?

10) In MRCompiler.java, I wonder if you should change
{code}
rightMapPlan.disconnect(rightLoader, loadSucc);
rightMapPlan.remove(loadSucc);
{code}
to
{code}
rightMapPlan.trimBelow(rightLoader);
{code}
We really want to remove all operators in rightMapPlan other than the loader.
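The whitelist check suggested in point 6 above could be sketched roughly as follows. This is an illustrative sketch only: the class and method names are hypothetical, not the actual Pig visitor API, and the real check would walk LogicalOperator instances rather than strings.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of a whitelist-based predecessor check for merge join:
// walk the chain from the LOJoin back to the LOLoad and reject anything not
// on the allowed list. A whitelist fails safe when new operators are added.
public class MergeJoinWhitelist {
    // Only operators assumed to preserve sort order and join-key position.
    static final List<String> ALLOWED =
            Arrays.asList("LOLoad", "LOForEach", "LOFilter");

    static boolean isValidPredecessorChain(List<String> opsFromJoinToLoad) {
        for (String op : opsFromJoinToLoad) {
            if (!ALLOWED.contains(op)) {
                return false; // unknown operator: merge join not applicable
            }
        }
        return true;
    }
}
```

The advantage over a blacklist is exactly the one the comment notes: an operator added to Pig later is rejected by default instead of silently slipping through.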
11) We should note in the documentation that merge join only works for data sorted in ascending order. (The MRCompiler code assumes this - we should have a sortedness check if possible; see the performance comment below.)

12) It would be good to add a couple of unit tests with a few operators after the merge join, to ensure merge join operates well with successors in the plan.

13) In POMergeJoin.java, comments about foreach should be cleaned up since foreach is no longer used. For example:
{code}
//variable which denotes whether we are returning tuples from the foreach operator
{code}
The following code can be factored out into a function, since it is repeated twice:
{code}
case POStatus.STATUS_EOP:
    // Current file has ended. Need to open next file by reading next index entry.
    String prevFile = rightLoader.getLFile(
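For context on point 13: the core step that a merge join performs per key, given two inputs sorted ascending on the join key, can be sketched as below. This is a simplified illustration over int arrays, not the actual POMergeJoin code (which deals in tuples, index entries, and the getNext() pull model).

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a sorted-merge join over two ascending-sorted inputs:
// advance whichever side has the smaller key; on a key match, emit the
// cross product of the equal-key runs from both sides.
public class SortedMergeSketch {
    static List<int[]> join(int[] left, int[] right) {
        List<int[]> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < left.length && j < right.length) {
            if (left[i] < right[j]) {
                i++;                       // left key is smaller: advance left
            } else if (left[i] > right[j]) {
                j++;                       // right key is smaller: advance right
            } else {
                int key = left[i];
                int jStart = j;            // remember start of right's run
                while (i < left.length && left[i] == key) {
                    for (j = jStart; j < right.length && right[j] == key; j++) {
                        out.add(new int[]{left[i], right[j]});
                    }
                    i++;
                }
            }
        }
        return out;
    }
}
```

This is also why the ascending-sort assumption in point 11 matters: the advance-the-smaller-side logic is only correct when both inputs are sorted the same way.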
[jira] Commented: (PIG-845) PERFORMANCE: Merge Join
[ https://issues.apache.org/jira/browse/PIG-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12741589#action_12741589 ]

Dmitriy V. Ryaboy commented on PIG-845:
---
Some comments below. It's a big patch, so a lot of comments...

1. EndOfAllInput flag -- could you add comments here about what the point of this flag is? You explain what EndOfAllInputSetter does (which is actually rather self-explanatory) but not what the flag means and how it's used. There is a bit of an explanation in PigMapBase, but it really belongs here.

2. Could you explain the relationship between EndOfAllInput and the (deleted) POStream?

3. Comments in MRCompiler alternate between referring to the left MROp as LeftMROper and curMROper. Choose one.

4. I am curious about the decision to throw compiler exceptions when the MergeJoin requirements (number of inputs, etc.) aren't satisfied. It seems like a better user experience would be to log a warning and fall back to a regular join.

5. Style notes for visitMergeJoin: it's a 200-line method. Is there any way you can break it up into smaller components? As is, it's hard to follow. The if statements should be broken up into multiple lines to agree with the style guides. Variable naming: you've got topPrj, prj, pkg, lr, ce, nig... one at a time they are fine, but together in a 200-line method they are unreadable. Please consider more descriptive names.

6. Kind of a global comment, since it applies to more than just MergeJoin: it seems to me that we need a Builder for operators to clean up some of the new, set, set, set stuff.
Having the setters return this, and a plan's add() method return the plan, would let us replace this:
{code}
POProject topPrj = new POProject(new OperatorKey(scope, nig.getNextNodeId(scope)));
topPrj.setColumn(1);
topPrj.setResultType(DataType.TUPLE);
topPrj.setOverloaded(true);
rightMROpr.reducePlan.add(topPrj);
rightMROpr.reducePlan.connect(pkg, topPrj);
{code}
with this:
{code}
POProject topPrj = new POProject(new OperatorKey(scope, nig.getNextNodeId(scope)))
    .setColumn(1)
    .setResultType(DataType.TUPLE)
    .setOverloaded(true);
rightMROpr.reducePlan.add(topPrj).connect(pkg, topPrj);
{code}

7. Is the change to the keyTypes list in POFRJoin related to MergeJoin, or just rolled in?

8. MergeJoin: break getNext() into components. I don't see you supporting left outer joins - plans for that? At least document the planned approach. Error codes being declared deep inside classes, and documented on the wiki, is a poor practice, imo. They should be pulled out into PigErrors (as lightweight final objects that have an error code, a name, and a description). I thought Santhosh made progress on this already, no? Also, could you explain the problem with splits and streams? Why can't this work for them?

9. Sampler/Indexer:
9a. It looks like you create the same number of map tasks for this as you do for the join; all a sampling map task does is read one record and emit a single tuple. That seems wasteful; there is a lot of overhead in setting up these tiny jobs, which might get stuck behind other jobs running on the cluster, etc. If the underlying file has sync points, a smaller number of MR tasks can be created. If we know the ratio of sample tasks to "full" tasks, we can figure out how many records we should emit per job ( ceil(full_tasks/sample_tasks) ). We can approximately achieve this by seeking ahead (end - offset)/num_to_emit and doing a sync() after that seek. It's approximate, but close enough for an index.
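The arithmetic in 9a can be sketched as a small helper that, given a split's byte range and the task ratio, computes how many index entries a sampler should emit and the approximate offsets to seek to (with a sync() after each seek snapping to the next record boundary). This is a hypothetical illustration, not code from the patch:

```java
// Sketch of the per-sampler-task stride computation from comment 9a:
// each sampler emits ceil(fullTasks / sampleTasks) entries, taken at
// evenly spaced byte offsets within its assigned split.
public class SampleStride {
    static long[] seekOffsets(long start, long end, int fullTasks, int sampleTasks) {
        int numToEmit = (fullTasks + sampleTasks - 1) / sampleTasks; // ceil division
        long stride = (end - start) / numToEmit;
        long[] offsets = new long[numToEmit];
        for (int k = 0; k < numToEmit; k++) {
            offsets[k] = start + k * stride; // seek here, then sync() to a record
        }
        return offsets;
    }
}
```

For example, with 10 full map tasks and 4 sampler tasks over a 100-byte split, each sampler would emit 3 entries at roughly even spacing, approximating the per-task index entries the full-task layout would have produced.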
9b. Consider renaming to something like SortedFileIndexer, since it's conceivable that this component could be reused in a context other than a merge join.

10. Would it make sense to expose this to users via a 'CREATE INDEX' (or similar) command? That way the index could be persisted, and the user could tell you to use an existing index instead of rescanning the data.

11. I am not sure about the approach of pushing sampling above filters. Have you guys benchmarked this? It seems like you'd wind up reading the whole file in the sample job if the filter is selective enough (and high filter selectivity would also make materialize->sample go much faster).

Testing:
12a. You should test for refusal to do a 3-way join and other error conditions (or a warning and successful failover to a regular join -- my preference).
12b. You should write a proper unit test for the MergeJoinIndexer (or whatever we are calling it).

> PERFORMANCE: Merge Join
> ---
>
> Key: PIG-845
> URL: https://issues.apache.org/jira/browse/PIG-845
> Project: Pig
> Issue Type: Improvement
> Reporter: Olga Natkovich
> Assignee: Ashutosh Chauhan
>
> Attachments: merge-join-1.patch, merge-join-for-review.patch
>
> This join would work if the data for both tables is sorted on the join key.

-- This message is
[jira] Commented: (PIG-893) support cast of chararray to other simple types
[ https://issues.apache.org/jira/browse/PIG-893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12741521#action_12741521 ]

Olga Natkovich commented on PIG-893:

The release audit issue is caused by one of the new files missing the Apache license header. Not sure what the issue with FindBugs is. These issues need to be resolved, or at least investigated, before the patch can be committed.

> support cast of chararray to other simple types
> ---
>
> Key: PIG-893
> URL: https://issues.apache.org/jira/browse/PIG-893
> Project: Pig
> Issue Type: New Feature
> Affects Versions: 0.4.0
> Reporter: Thejas M Nair
> Assignee: Jeff Zhang
> Fix For: 0.4.0
>
> Attachments: Pig_893.Patch
>
> Pig should support casting of chararray to integer, long, float, double, bytearray. If the conversion fails for reasons such as overflow, the cast should return null and log a warning.

-- This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
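The cast semantics requested in the issue (return null and log a warning on failure, rather than throw) might look roughly like the sketch below. This is an illustrative sketch only, not the actual Pig cast implementation; the class and method names are hypothetical.

```java
// Sketch of a null-on-failure chararray-to-integer cast: malformed input
// and values outside the int range both fall through to the null branch,
// since Integer.valueOf throws NumberFormatException in both cases.
public class CharArrayCast {
    static Integer castToInteger(String chararray) {
        if (chararray == null) {
            return null; // null input stays null
        }
        try {
            return Integer.valueOf(chararray.trim());
        } catch (NumberFormatException e) {
            // overflow or malformed input: warn and return null instead of failing
            System.err.println("Warning: could not cast '" + chararray + "' to int");
            return null;
        }
    }
}
```

The same pattern would apply to the long, float, and double variants, each catching its parser's NumberFormatException.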
[jira] Commented: (PIG-893) support cast of chararray to other simple types
[ https://issues.apache.org/jira/browse/PIG-893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12741500#action_12741500 ]

Hadoop QA commented on PIG-893:
---
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12415997/Pig_893.Patch
against trunk revision 801865.

+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 9 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
-1 findbugs. The patch appears to introduce 4 new Findbugs warnings.
-1 release audit. The applied patch generated 161 release audit warnings (more than the trunk's current 160 warnings).
+1 core tests. The patch passed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/155/testReport/
Release audit warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/155/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/155/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/155/console

This message is automatically generated.
Build failed in Hudson: Pig-Patch-minerva.apache.org #155
See http://hudson.zones.apache.org/hudson/job/Pig-Patch-minerva.apache.org/155/changes

Changes:
[daijy] PIG-905: TOKENIZE throws exception on null data
[daijy] PIG-697: Proposed improvements to pig's optimizer, Phase5

--
[...truncated 103185 lines...]
Build failed in Hudson: Pig-trunk #518
See http://hudson.zones.apache.org/hudson/job/Pig-trunk/518/

--
[...truncated 86842 lines...]
[jira] Updated: (PIG-893) support cast of chararray to other simple types
[ https://issues.apache.org/jira/browse/PIG-893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Zhang updated PIG-893:
---
Status: Patch Available (was: Open)