[jira] [Commented] (PIG-4920) Fail to use Javascript UDF in spark yarn client mode
[ https://issues.apache.org/jira/browse/PIG-4920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15607235#comment-15607235 ]

Mohit Sabharwal commented on PIG-4920:
--------------------------------------

LGTM, +1 (non-binding)

> Fail to use Javascript UDF in spark yarn client mode
> ----------------------------------------------------
>
>                 Key: PIG-4920
>                 URL: https://issues.apache.org/jira/browse/PIG-4920
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: liyunzhang_intel
>            Assignee: liyunzhang_intel
>             Fix For: spark-branch
>
>         Attachments: PIG-4920.patch, PIG-4920_2.patch, PIG-4920_3.patch, PIG-4920_4.patch, PIG-4920_5.patch, PIG-4920_6.patch
>
>
> udf.pig
> {code}
> register '/home/zly/prj/oss/merge.pig/pig/bin/udf.js' using javascript as myfuncs;
> A = load './passwd' as (a0:chararray, a1:chararray);
> B = foreach A generate myfuncs.helloworld();
> store B into './udf.out';
> {code}
> udf.js
> {code}
> helloworld.outputSchema = "word:chararray";
> function helloworld() {
>     return 'Hello, World';
> }
>
> complex.outputSchema = "word:chararray";
> function complex(word) {
>     return {word:word};
> }
> {code}
> Run udf.pig in spark local mode (export SPARK_MASTER="local"): it succeeds.
> Run udf.pig in spark yarn client mode (export SPARK_MASTER="yarn-client"): it fails with an error message like the following:
> {noformat}
> Caused by: java.lang.reflect.InvocationTargetException
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
>         at org.apache.pig.impl.PigContext.instantiateFuncFromSpec(PigContext.java:744)
>         ... 84 more
> Caused by: java.lang.ExceptionInInitializerError
>         at org.apache.pig.scripting.js.JsScriptEngine.getInstance(JsScriptEngine.java:87)
>         at org.apache.pig.scripting.js.JsFunction.<init>(JsFunction.java:173)
>         ... 89 more
> Caused by: java.lang.IllegalStateException: could not get script path from UDFContext
>         at org.apache.pig.scripting.js.JsScriptEngine$Holder.<clinit>(JsScriptEngine.java:69)
>         ... 91 more
> {noformat}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Updated] (PIG-4553) Implement secondary sort using one shuffle
[ https://issues.apache.org/jira/browse/PIG-4553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mohit Sabharwal updated PIG-4553:
---------------------------------
    Summary: Implement secondary sort using one shuffle  (was: Implement secondary sort using 1 shuffle not twice)

> Implement secondary sort using one shuffle
> ------------------------------------------
>
>                 Key: PIG-4553
>                 URL: https://issues.apache.org/jira/browse/PIG-4553
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: liyunzhang_intel
>            Assignee: liyunzhang_intel
>             Fix For: spark-branch
>
>         Attachments: PIG-4553_1.patch, PIG-4553_2.patch
>
>
> Currently we implement secondary key sort in GlobalRearrangeConverter#convert with two shuffles:
> the first shuffle in repartitionAndSortWithinPartitions, the second shuffle in groupBy.
> {code}
> public RDD<Tuple> convert(List<RDD<Tuple>> predecessors,
>                           POGlobalRearrangeSpark physicalOperator) throws IOException {
>     if (predecessors.size() == 1) {
>         // GROUP
>         JavaPairRDD
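The one-shuffle idea behind PIG-4553 can be illustrated outside Spark. The following Python sketch (hypothetical helper names, not Pig's actual converter code) mimics what repartitionAndSortWithinPartitions achieves: partition records by the grouping key only, sort each partition on the composite (key, value), and then recover every group's sorted values in one streaming pass, so no second groupBy shuffle is needed.

```python
from itertools import groupby
from operator import itemgetter

def secondary_sort(records, num_partitions=2):
    """Simulate repartitionAndSortWithinPartitions: one 'shuffle' routes each
    (key, value) record to a partition by key; an in-partition sort on the
    composite key then delivers each group's values already ordered."""
    # "Shuffle": assign records to partitions by grouping key only.
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[hash(key) % num_partitions].append((key, value))
    result = []
    for part in partitions:
        # Sort within the partition by (key, value) -- the secondary sort.
        part.sort()
        # A single streaming pass now yields each group with sorted values.
        for key, group in groupby(part, key=itemgetter(0)):
            result.append((key, [v for _, v in group]))
    return result

print(secondary_sort([("a", 3), ("b", 1), ("a", 1), ("a", 2), ("b", 2)]))
```

The group order depends on the partitioner, but within each group the values arrive sorted without any further shuffle.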
[jira] [Commented] (PIG-4941) TestRank3#testRankWithSplitInMap hangs after upgrade to spark 1.6.1
[ https://issues.apache.org/jira/browse/PIG-4941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15368050#comment-15368050 ]

Mohit Sabharwal commented on PIG-4941:
--------------------------------------

Following seems like the relevant thread:
{code}
"main" #1 prio=5 os_prio=0 tid=0x7f5e68019800 nid=0x1034 in Object.wait() [0x7f5e6fe07000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:502)
        at org.apache.spark.scheduler.JobWaiter.awaitResult(JobWaiter.scala:73)
        - locked <0xc39b09a8> (a org.apache.spark.scheduler.JobWaiter)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:612)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1922)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1146)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1074)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1074)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
        at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1074)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply$mcV$sp(PairRDDFunctions.scala:994)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply(PairRDDFunctions.scala:985)
        at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply(PairRDDFunctions.scala:985)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
        at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
        at org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopFile(PairRDDFunctions.scala:985)
        at org.apache.pig.backend.hadoop.executionengine.spark.converter.StoreConverter.convert(StoreConverter.java:103)
{code}

[~kellyzly], could you check http://localhost:4040/ to see if you see any additional info?
http://spark.apache.org/docs/latest/monitoring.html

> TestRank3#testRankWithSplitInMap hangs after upgrade to spark 1.6.1
> -------------------------------------------------------------------
>
>                 Key: PIG-4941
>                 URL: https://issues.apache.org/jira/browse/PIG-4941
>             Project: Pig
>          Issue Type: Sub-task
>            Reporter: liyunzhang_intel
>            Assignee: liyunzhang_intel
>         Attachments: rank.jstack
>
>
> After upgrading spark version to 1.6.1, TestRank3#testRankWithSplitInMap hangs and fails due to timeout exception.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (PIG-4919) Upgrade spark.version to 1.6.1
[ https://issues.apache.org/jira/browse/PIG-4919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15328855#comment-15328855 ] Mohit Sabharwal commented on PIG-4919: -- +1 (non-binding) > Upgrade spark.version to 1.6.1 > -- > > Key: PIG-4919 > URL: https://issues.apache.org/jira/browse/PIG-4919 > Project: Pig > Issue Type: Sub-task >Reporter: liyunzhang_intel >Assignee: liyunzhang_intel > Attachments: PIG-4919.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PIG-4898) Fix unit test failure after PIG-4771's patch was checked in
[ https://issues.apache.org/jira/browse/PIG-4898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15297587#comment-15297587 ] Mohit Sabharwal commented on PIG-4898: -- +1 (non-binding) > Fix unit test failure after PIG-4771's patch was checked in > --- > > Key: PIG-4898 > URL: https://issues.apache.org/jira/browse/PIG-4898 > Project: Pig > Issue Type: Sub-task > Components: spark >Reporter: liyunzhang_intel >Assignee: liyunzhang_intel > Fix For: spark-branch > > Attachments: PIG-4898.patch > > > Now in the [lastest jenkins|https://builds.apache.org/job/Pig-spark/#328], it > shows that following unit test cases fail: > org.apache.pig.test.TestFRJoin.testDistinctFRJoin > org.apache.pig.test.TestPigRunner.simpleMultiQueryTest3 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PIG-4771) Implement FR Join for spark engine
[ https://issues.apache.org/jira/browse/PIG-4771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15286004#comment-15286004 ]

Mohit Sabharwal commented on PIG-4771:
--------------------------------------

+1 (non-binding)

> Implement FR Join for spark engine
> ----------------------------------
>
>                 Key: PIG-4771
>                 URL: https://issues.apache.org/jira/browse/PIG-4771
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: liyunzhang_intel
>            Assignee: liyunzhang_intel
>             Fix For: spark-branch
>
>         Attachments: PIG-4771.patch, PIG-4771_2.patch, PIG-4771_3.patch
>
>
> We use regular join to replace FR join in the current code base (fd31fda). We need to implement FR join.
> Some info collected from https://pig.apache.org/docs/r0.11.0/perf.html#replicated-joins:
> *Replicated Joins*
> Fragment replicate join is a special type of join that works well if one or more relations are small enough to fit into main memory. In such cases, Pig can perform a very efficient join because all of the hadoop work is done on the map side. In this type of join the large relation is followed by one or more small relations. The small relations must be small enough to fit into main memory; if they don't, the process fails and an error is generated.
> *Usage*
> Perform a replicated join with the USING clause (see JOIN (inner) and JOIN (outer)). In this example, a large relation is joined with two smaller relations. Note that the large relation comes first followed by the smaller relations; and, all small relations together must fit into main memory, otherwise an error is generated.
> big = LOAD 'big_data' AS (b1,b2,b3);
> tiny = LOAD 'tiny_data' AS (t1,t2,t3);
> mini = LOAD 'mini_data' AS (m1,m2,m3);
> C = JOIN big BY b1, tiny BY t1, mini BY m1 USING 'replicated';
> *Conditions*
> Fragment replicate joins are experimental; we don't have a strong sense of how small the small relation must be to fit into memory. In our tests with a simple query that involves just a JOIN, a relation of up to 100 M can be used if the process overall gets 1 GB of memory.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
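As a rough illustration of the fragment-replicate strategy quoted above, here is a minimal Python sketch (a hypothetical helper, not Pig's implementation): each small relation is loaded into an in-memory hash table keyed on its join column, and the large relation is streamed past them, so the entire join happens "map side" with no shuffle of the big relation.

```python
def replicated_join(big, *small_relations):
    """Map-side (fragment-replicate) inner join on the first field of each
    tuple: small relations become in-memory hash tables, the big relation
    is streamed through and joined against all of them."""
    # Build a hash table per small relation (each must fit in memory).
    tables = []
    for rel in small_relations:
        table = {}
        for tup in rel:
            table.setdefault(tup[0], []).append(tup)
        tables.append(table)
    # Stream the big relation; emit the cross product of matches.
    out = []
    for tup in big:
        matches = [tup]
        for table in tables:
            matches = [m + s for m in matches for s in table.get(tup[0], [])]
        out.extend(matches)
    return out

# Mirrors the big/tiny/mini example from the Pig docs, with toy data.
big = [(1, "b1"), (2, "b2")]
tiny = [(1, "t1"), (1, "t1b")]
mini = [(1, "m1"), (2, "m2")]
print(replicated_join(big, tiny, mini))
```

Because the hash tables are rebuilt in every worker that streams a fragment of the big relation, the memory caveat from the docs applies: all small relations together must fit in each worker's memory.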
[jira] [Commented] (PIG-4886) Add PigSplit#getLocationInfo to fix the NPE found in log in spark mode
[ https://issues.apache.org/jira/browse/PIG-4886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15286003#comment-15286003 ]

Mohit Sabharwal commented on PIG-4886:
--------------------------------------

Thanks, [~kellyzly] - left a couple of comments on RB.

> Add PigSplit#getLocationInfo to fix the NPE found in log in spark mode
> ----------------------------------------------------------------------
>
>                 Key: PIG-4886
>                 URL: https://issues.apache.org/jira/browse/PIG-4886
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: liyunzhang_intel
>            Assignee: liyunzhang_intel
>             Fix For: spark-branch
>
>         Attachments: PIG-4886.patch
>
>
> Use branch code (119f313) to test the following pig script in spark mode:
> {code}
> A = load './SkewedJoinInput1.txt' as (id,name,n);
> B = load './SkewedJoinInput2.txt' as (id,name);
> D = join A by (id,name), B by (id,name);
> store D into './testFRJoin.out';
> {code}
> cat bin/SkewedJoinInput1.txt
> {noformat}
> 100 apple1 aaa
> 200 orange1 bbb
> 300 strawberry ccc
> {noformat}
> cat bin/SkewedJoinInput2.txt
> {noformat}
> 100 apple1
> 100 apple2
> 100 apple2
> 200 orange1
> 200 orange2
> 300 strawberry
> 400 pear
> {noformat}
> The following exception is found in the log:
> {noformat}
> [dag-scheduler-event-loop] 2016-05-05 14:21:01,046 DEBUG rdd.NewHadoopRDD (Logging.scala:logDebug(84)) - Failed to use InputSplit#getLocationInfo.
> java.lang.NullPointerException
>         at scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:114)
>         at scala.collection.mutable.ArrayOps$ofRef.length(ArrayOps.scala:114)
>         at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:32)
>         at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>         at org.apache.spark.rdd.HadoopRDD$.convertSplitLocationInfo(HadoopRDD.scala:406)
>         at org.apache.spark.rdd.NewHadoopRDD.getPreferredLocations(NewHadoopRDD.scala:202)
>         at org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:231)
>         at org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:231)
>         at scala.Option.getOrElse(Option.scala:120)
>         at org.apache.spark.rdd.RDD.preferredLocations(RDD.scala:230)
>         at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1387)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1397)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1396)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1396)
> {noformat}
> org.apache.spark.rdd.NewHadoopRDD.getPreferredLocations will call PigSplit#getLocationInfo, but currently PigSplit extends InputSplit and InputSplit#getLocationInfo returns null.
> {code}
> @Evolving
> public SplitLocationInfo[] getLocationInfo() throws IOException {
>     return null;
> }
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
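The contract being violated is easy to model: Spark's convertSplitLocationInfo iterates over whatever getLocationInfo() returns, so a null from the InputSplit base class blows up, while an override that always returns a (possibly empty) collection is safe. The Python below is a loose analogue of that fix (hypothetical class names, not the actual PigSplit code); the TypeError it provokes stands in for the Java NPE.

```python
class InputSplit:
    """Analogue of the Hadoop base class: the default returns None ("null")."""
    def get_location_info(self):
        return None

class PigSplit(InputSplit):
    """Analogue of the PIG-4886 fix: delegate to wrapped splits and never
    return None, so callers can iterate the result safely."""
    def __init__(self, wrapped_splits):
        self.wrapped_splits = wrapped_splits

    def get_location_info(self):
        infos = []
        for split in self.wrapped_splits:
            infos.extend(split.get_location_info() or [])
        return infos  # always a list, possibly empty

def preferred_locations(split):
    # Mirrors the scheduler side: iterates with no None check.
    return [loc for loc in split.get_location_info()]

# Iterating the base class's None is the analogue of the NPE in the log:
try:
    preferred_locations(InputSplit())
except TypeError:
    print("NoneType is not iterable -- the NPE analogue")

print(preferred_locations(PigSplit([])))
```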
[jira] [Commented] (PIG-4771) Implement FR Join for spark engine
[ https://issues.apache.org/jira/browse/PIG-4771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15285953#comment-15285953 ]

Mohit Sabharwal commented on PIG-4771:
--------------------------------------

Thanks, [~kellyzly]. Left a couple of comments on RB. We can commit this after those changes and work on PIG-4891 to use broadcast variables. Thanks!

> Implement FR Join for spark engine
> ----------------------------------
>
>                 Key: PIG-4771
>                 URL: https://issues.apache.org/jira/browse/PIG-4771
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: liyunzhang_intel
>            Assignee: liyunzhang_intel
>             Fix For: spark-branch
>
>         Attachments: PIG-4771.patch, PIG-4771_2.patch
>
>
> We use regular join to replace FR join in the current code base (fd31fda). We need to implement FR join.
> Some info collected from https://pig.apache.org/docs/r0.11.0/perf.html#replicated-joins:
> *Replicated Joins*
> Fragment replicate join is a special type of join that works well if one or more relations are small enough to fit into main memory. In such cases, Pig can perform a very efficient join because all of the hadoop work is done on the map side. In this type of join the large relation is followed by one or more small relations. The small relations must be small enough to fit into main memory; if they don't, the process fails and an error is generated.
> *Usage*
> Perform a replicated join with the USING clause (see JOIN (inner) and JOIN (outer)). In this example, a large relation is joined with two smaller relations. Note that the large relation comes first followed by the smaller relations; and, all small relations together must fit into main memory, otherwise an error is generated.
> big = LOAD 'big_data' AS (b1,b2,b3);
> tiny = LOAD 'tiny_data' AS (t1,t2,t3);
> mini = LOAD 'mini_data' AS (m1,m2,m3);
> C = JOIN big BY b1, tiny BY t1, mini BY m1 USING 'replicated';
> *Conditions*
> Fragment replicate joins are experimental; we don't have a strong sense of how small the small relation must be to fit into memory. In our tests with a simple query that involves just a JOIN, a relation of up to 100 M can be used if the process overall gets 1 GB of memory.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (PIG-4876) OutputConsumeIterator can't handle the last buffered tuples for some Operators
[ https://issues.apache.org/jira/browse/PIG-4876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15285832#comment-15285832 ] Mohit Sabharwal commented on PIG-4876: -- Thanks, [~kexianda], [~kellyzly], let's go with (b). Let's add a detailed comment about this, so it can be reviewed by the committers when we merge this to master. > OutputConsumeIterator can't handle the last buffered tuples for some Operators > -- > > Key: PIG-4876 > URL: https://issues.apache.org/jira/browse/PIG-4876 > Project: Pig > Issue Type: Sub-task > Components: spark >Reporter: Xianda Ke >Assignee: Xianda Ke > Fix For: spark-branch > > Attachments: PIG-4876.patch > > > Some Operators, such as MergeCogroup, Stream, CollectedGroup etc buffer some > input records to constitute the result tuples. The last result tuples are > buffered in the operator. These Operators need a flag to indicate the end of > input, so that they can flush and constitute their last tuples. > Currently, the flag 'parentPlan.endOfAllInput' is targeted for flushing the > buffered tuples in MR mode. But it does not work with OutputConsumeIterator > in Spark mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PIG-4876) OutputConsumeIterator can't handle the last buffered tuples for some Operators
[ https://issues.apache.org/jira/browse/PIG-4876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15263299#comment-15263299 ]

Mohit Sabharwal commented on PIG-4876:
--------------------------------------

It's not clear to me why adding beginOfInput() is complex or less readable.

In OutputConsumerIterator, add a *beginOfInput()* abstract method:
{code:title=OutputConsumerIterator.java|borderStyle=solid}
abstract protected void beginOfInput();
{code}

In OutputConsumerIterator.readNext(), insert *beginOfInput()* as shown below:
{code:title=OutputConsumerIterator.java|borderStyle=solid}
...
if (result == null) {
    beginOfInput();   // INSERT THIS CALL
    if (!input.hasNext()) {
        done = true;
        return;
    }
    Tuple v1 = input.next();
    attach(v1);
}
...
{code}

Now, in every operator where we have implemented endOfInput(), also implement beginOfInput(). For example, in CollectedGroupConverter we have implemented endOfInput(). We implement beginOfInput() as:
{code:title=CollectedGroupConverter.java|borderStyle=solid}
@Override
protected void beginOfInput() {
    poCollectedGroup.getParentPlan().endOfAllInput = false;
}
{code}

Maybe I'm misunderstanding this?

> OutputConsumeIterator can't handle the last buffered tuples for some Operators
> ------------------------------------------------------------------------------
>
>                 Key: PIG-4876
>                 URL: https://issues.apache.org/jira/browse/PIG-4876
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: Xianda Ke
>            Assignee: Xianda Ke
>             Fix For: spark-branch
>
>         Attachments: PIG-4876.patch
>
>
> Some Operators, such as MergeCogroup, Stream, CollectedGroup etc buffer some input records to constitute the result tuples. The last result tuples are buffered in the operator. These Operators need a flag to indicate the end of input, so that they can flush and constitute their last tuples.
> Currently, the flag 'parentPlan.endOfAllInput' is targeted for flushing the buffered tuples in MR mode. But it does not work with OutputConsumeIterator in Spark mode.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
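The begin/end hook pattern proposed in the comment above can be sketched in a few lines of Python (simplified, hypothetical stand-ins for the Java classes, not the actual patch): the iterator calls beginOfInput() once before the first record, so a buffering operator can reset shared state, and endOfInput() after the last record, so it can flush what it buffered.

```python
class CollectedGroupOperator:
    """Buffers consecutive records with the same key; the final group can
    only be emitted once the end of input is signalled."""
    def __init__(self):
        self.end_of_all_input = False  # shared per-plan flag, as in the plan
        self.key = None
        self.buffer = []

    def begin_of_input(self):
        # Reset the shared flag so a preceding operator's end-of-input
        # does not leak into this one.
        self.end_of_all_input = False

    def consume(self, record):
        key, value = record
        out = []
        if self.key is not None and key != self.key:
            out.append((self.key, self.buffer))  # group boundary: emit
            self.buffer = []
        self.key = key
        self.buffer.append(value)
        return out

    def end_of_input(self):
        self.end_of_all_input = True
        return [(self.key, self.buffer)] if self.buffer else []

def output_consumer_iterator(source, op):
    """Mirrors the readNext() flow: begin_of_input() before the first
    record, end_of_input() to flush after the last."""
    op.begin_of_input()
    for record in source:
        yield from op.consume(record)
    yield from op.end_of_input()

print(list(output_consumer_iterator([(1, 1), (1, 1), (2, 2)],
                                    CollectedGroupOperator())))
```

Without the final end_of_input() flush, the last buffered group would be silently dropped, which is exactly the symptom PIG-4876 describes.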
[jira] [Updated] (PIG-4854) Merge spark branch to trunk
[ https://issues.apache.org/jira/browse/PIG-4854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mohit Sabharwal updated PIG-4854:
---------------------------------
    Issue Type: Sub-task  (was: Task)
        Parent: PIG-4059

> Merge spark branch to trunk
> ---------------------------
>
>                 Key: PIG-4854
>                 URL: https://issues.apache.org/jira/browse/PIG-4854
>             Project: Pig
>          Issue Type: Sub-task
>            Reporter: Pallavi Rao
>         Attachments: PIG-On-Spark.patch
>
>
> Believe the spark branch will shortly be ready to be merged with the main branch (a couple of minor patches pending commit), given that we have addressed most functionality gaps and have ensured the UTs are clean. There are a few optimizations which we will take up once the branch is merged to trunk.
> [~xuefuz], [~rohini], [~daijy],
> Hopefully you agree that the spark branch is ready for merge. If yes, how would you like us to go about it? Do you want me to upload a huge patch that will be merged like any other patch, or do you prefer a branch merge?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (PIG-4876) OutputConsumeIterator can't handle the last buffered tuples for some Operators
[ https://issues.apache.org/jira/browse/PIG-4876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15253898#comment-15253898 ] Mohit Sabharwal commented on PIG-4876: -- Another question: Does it make sense to add another abstract method (similar to endOfInput()) like beginOfInput() that resets the flag at the beginning ? Would that work ? Just trying to minimize non-spark code change... > OutputConsumeIterator can't handle the last buffered tuples for some Operators > -- > > Key: PIG-4876 > URL: https://issues.apache.org/jira/browse/PIG-4876 > Project: Pig > Issue Type: Sub-task > Components: spark >Reporter: Xianda Ke >Assignee: Xianda Ke > Fix For: spark-branch > > Attachments: PIG-4876.patch > > > Some Operators, such as MergeCogroup, Stream, CollectedGroup etc buffer some > input records to constitute the result tuples. The last result tuples are > buffered in the operator. These Operators need a flag to indicate the end of > input, so that they can flush and constitute their last tuples. > Currently, the flag 'parentPlan.endOfAllInput' is targeted for flushing the > buffered tuples in MR mode. But it does not work with OutputConsumeIterator > in Spark mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PIG-4876) OutputConsumeIterator can't handle the last buffered tuples for some Operators
[ https://issues.apache.org/jira/browse/PIG-4876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15253377#comment-15253377 ] Mohit Sabharwal commented on PIG-4876: -- Thanks for the explanation [~kexianda]. Left a comment regarding naming on RB. To summarize your explanation, since endOfAllInput is shared amongst all operators in the plan, it may get set to true by a preceding operator, which may affect subsequent operators in the plan (which may not have finished processing all tuples). Is that correct ? One question: - After PIG-4542 patch (https://reviews.apache.org/r/34003), I see that TestCollectedGroup was passing. What is different about usage of CollectedGroup in PIG-4842 that it caused it to now fail ? > OutputConsumeIterator can't handle the last buffered tuples for some Operators > -- > > Key: PIG-4876 > URL: https://issues.apache.org/jira/browse/PIG-4876 > Project: Pig > Issue Type: Sub-task > Components: spark >Reporter: Xianda Ke >Assignee: Xianda Ke > Fix For: spark-branch > > Attachments: PIG-4876.patch > > > Some Operators, such as MergeCogroup, Stream, CollectedGroup etc buffer some > input records to constitute the result tuples. The last result tuples are > buffered in the operator. These Operators need a flag to indicate the end of > input, so that they can flush and constitute their last tuples. > Currently, the flag 'parentPlan.endOfAllInput' is targeted for flushing the > buffered tuples in MR mode. But it does not work with OutputConsumeIterator > in Spark mode. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PIG-4859) Need upgrade snappy-java.version to 1.1.1.3
[ https://issues.apache.org/jira/browse/PIG-4859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15225017#comment-15225017 ]

Mohit Sabharwal commented on PIG-4859:
--------------------------------------

+1 (non-binding)

> Need upgrade snappy-java.version to 1.1.1.3
> -------------------------------------------
>
>                 Key: PIG-4859
>                 URL: https://issues.apache.org/jira/browse/PIG-4859
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: liyunzhang_intel
>            Assignee: liyunzhang_intel
>             Fix For: spark-branch
>
>         Attachments: PIG-4859.patch
>
>
> run pig on spark on yarn-client env as following:
> export SPARK_MASTER="yarn-client"
> ./pig -x spark xxx.pig
> Throw error like following:
> {code}
> [main] 2016-03-30 16:52:26,115 INFO scheduler.DAGScheduler (Logging.scala:logInfo(59)) - Job 0 failed: saveAsNewAPIHadoopDataset at StoreConverter.java:101, took 73.980147 s
> 19895 [main] 2016-03-30 16:52:26,119 ERROR spark.JobGraphBuilder (JobGraphBuilder.java:sparkOperToRDD(166)) - throw exception in sparkOperToRDD:
> 19896 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, zly1.sh.intel.com): java.lang.UnsatisfiedLinkError: org.xerial.snappy.SnappyNative.uncompressedLength(Ljava/lang/Object;II)I
> 19897         at org.xerial.snappy.SnappyNative.uncompressedLength(Native Method)
> 19898         at org.xerial.snappy.Snappy.uncompressedLength(Snappy.java:541)
> 19899         at org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:350)
> 19900         at org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:158)
> 19901         at org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:142)
> 19902         at java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2313)
> 19903         at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2326)
> 19904         at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2797)
> 19905         at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:802)
> 19906         at java.io.ObjectInputStream.<init>(ObjectInputStream.java:299)
> 19907         at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.<init>(JavaSerializer.scala:64)
> 19908         at org.apache.spark.serializer.JavaDeserializationStream.<init>(JavaSerializer.scala:64)
> 19909         at org.apache.spark.serializer.JavaSerializerInstance.deserializeStream(JavaSerializer.scala:103)
> 19910         at org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:216)
> 19911         at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:178)
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (PIG-4857) Last record is missing in STREAM operator
[ https://issues.apache.org/jira/browse/PIG-4857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15218544#comment-15218544 ]

Mohit Sabharwal commented on PIG-4857:
--------------------------------------

[~kexianda], since the StreamOperator uses OutputConsumerIterator, isn't this just a matter of correctly implementing the endOfInput() method in the OutputConsumerIterator object (https://github.com/apache/pig/blob/spark/src/org/apache/pig/backend/hadoop/executionengine/spark/converter/StreamConverter.java#L106)?

IOW, endOfInput() is supposed to be implemented to flush the last record.

> Last record is missing in STREAM operator
> -----------------------------------------
>
>                 Key: PIG-4857
>                 URL: https://issues.apache.org/jira/browse/PIG-4857
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: Xianda Ke
>            Assignee: Xianda Ke
>             Fix For: spark-branch
>
>         Attachments: PIG-4857.patch
>
>
> This bug is similar to PIG-4842.
> Scenario:
> {code}
> cat input.txt
> 1
> 1
> 2
> {code}
> Pig script:
> {code}
> REGISTER myudfs.jar;
> A = LOAD 'input.txt' USING myudfs.DummyCollectableLoader() AS (id);
> B = GROUP A by $0 USING 'collected';   -- (1, {(1),(1)}), (2,{(2)})
> C = STREAM B THROUGH `awk '{
>     print $0;
> }'`;
> DUMP C;
> {code}
> Expected Result:
> {code}
> (1,{(1),(1)})
> (2,{(2)})
> {code}
> Actual Result:
> {code}
> (1,{(1),(1)})
> {code}
> The last record is missing...
> Root Cause:
> When the flag endOfAllInput is set to true by the predecessor, the predecessor is still buffering the last record, which is the input of Stream. POStream then sees endOfAllInput is true even though the last input has not yet been consumed.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
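To make the root cause concrete, here is a small Python model (hypothetical functions, not POStream's actual code) of a predecessor that delivers the last record together with the end-of-all-input flag. A consumer that treats the flag as "stop now" drops that record; one that consumes the record before honouring the flag does not.

```python
def with_end_flag(records):
    """Predecessor that looks one record ahead, so the last record arrives
    together with the end-of-all-input flag -- as in the MR-style plans."""
    it = iter(records)
    prev = next(it, None)
    while prev is not None:
        nxt = next(it, None)
        yield prev, nxt is None
        prev = nxt

def stream_buggy(records, fn):
    out = []
    for rec, end_of_all_input in with_end_flag(records):
        if end_of_all_input:
            break              # bug: the last record never reaches fn
        out.append(fn(rec))
    return out

def stream_fixed(records, fn):
    out = []
    for rec, end_of_all_input in with_end_flag(records):
        out.append(fn(rec))    # consume the record first...
        if end_of_all_input:
            break              # ...then honour the flag and shut down
    return out

data = ["(1,{(1),(1)})", "(2,{(2)})"]
fn = lambda r: "streamed:" + r
print(stream_buggy(data, fn))  # last record missing, as in the bug report
print(stream_fixed(data, fn))
```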
[jira] [Commented] (PIG-4837) TestNativeMapReduce test fix
[ https://issues.apache.org/jira/browse/PIG-4837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15215345#comment-15215345 ] Mohit Sabharwal commented on PIG-4837: -- +1 (non-binding) for PIG-4837_3.patch > TestNativeMapReduce test fix > > > Key: PIG-4837 > URL: https://issues.apache.org/jira/browse/PIG-4837 > Project: Pig > Issue Type: Sub-task > Components: spark >Reporter: liyunzhang_intel >Assignee: liyunzhang_intel > Fix For: spark-branch > > Attachments: PIG-4837.patch, PIG-4837_2.patch, PIG-4837_3.patch, > build23.PNG, build27.PNG > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PIG-4842) Collected group doesn't work in some cases
[ https://issues.apache.org/jira/browse/PIG-4842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15215327#comment-15215327 ]

Mohit Sabharwal commented on PIG-4842:
--------------------------------------

+1 (non-binding)

> Collected group doesn't work in some cases
> ------------------------------------------
>
>                 Key: PIG-4842
>                 URL: https://issues.apache.org/jira/browse/PIG-4842
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: Xianda Ke
>            Assignee: Xianda Ke
>             Fix For: spark-branch
>
>         Attachments: PIG-4842-2.patch, PIG-4842.patch
>
>
> Scenario:
> 1. input data:
> cat collectedgroup1
> {code}
> 1
> 1
> 2
> {code}
> 2. pig script:
> {code}
> A = LOAD 'collectedgroup1' USING myudfs.DummyCollectableLoader() AS (id);
> B = GROUP A by $0 USING 'collected';
> C = GROUP B by $0 USING 'collected';
> DUMP C;
> {code}
> The expected output:
> {code}
> (1,{(1,{(1),(1)})})
> (2,{(2,{(2)})})
> {code}
> The actual output:
> {code}
> (1,{(1,{(1),(1)})})
> (1,)
> (2,{(2,{(2)})})
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (PIG-4837) TestNativeMapReduce test fix
[ https://issues.apache.org/jira/browse/PIG-4837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15215102#comment-15215102 ]

Mohit Sabharwal commented on PIG-4837:
--------------------------------------

I agree with [~pallavi.rao]. Running an MR job in Spark mode should not be our priority. We may want to support such "mixed mode" in the future.

My vote would be to a) add it to test/excluded-tests-spark and b) add a comment there with a reference to this jira.

> TestNativeMapReduce test fix
> ----------------------------
>
>                 Key: PIG-4837
>                 URL: https://issues.apache.org/jira/browse/PIG-4837
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: liyunzhang_intel
>            Assignee: liyunzhang_intel
>             Fix For: spark-branch
>
>         Attachments: PIG-4837.patch, PIG-4837_2.patch, build23.PNG, build27.PNG
>
>

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (PIG-4836) Fix TestEvalPipeline test failure
[ https://issues.apache.org/jira/browse/PIG-4836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15189445#comment-15189445 ]

Mohit Sabharwal commented on PIG-4836:
--------------------------------------

[~xuefuz], please commit when you get a chance.

> Fix TestEvalPipeline test failure
> ---------------------------------
>
>                 Key: PIG-4836
>                 URL: https://issues.apache.org/jira/browse/PIG-4836
>             Project: Pig
>          Issue Type: Sub-task
>          Components: spark
>            Reporter: Pallavi Rao
>            Assignee: Pallavi Rao
>              Labels: spork
>             Fix For: spark-branch
>
>         Attachments: PIG-4836.patch
>
>
> There are two test failures:
> testMapUDF
> testLimit
> testLimit will get fixed by PIG-4832. This JIRA will only fix testMapUDF.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (PIG-4836) Fix TestEvalPipeline test failure
[ https://issues.apache.org/jira/browse/PIG-4836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15189224#comment-15189224 ] Mohit Sabharwal commented on PIG-4836: -- +1 Thanks, [~pallavi.rao]. > Fix TestEvalPipeline test failure > - > > Key: PIG-4836 > URL: https://issues.apache.org/jira/browse/PIG-4836 > Project: Pig > Issue Type: Sub-task > Components: spark >Reporter: Pallavi Rao >Assignee: Pallavi Rao > Labels: spork > Fix For: spark-branch > > Attachments: PIG-4836.patch > > > There are two test failures: > testMapUDF > testLimit > testLimit will get fixed by PIG-4832. This JIRA will only fix testMapUDF. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PIG-4835) Fix TestPigRunner test failure
[ https://issues.apache.org/jira/browse/PIG-4835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15188578#comment-15188578 ] Mohit Sabharwal commented on PIG-4835: -- yes, sorry, commented on wrong jira :) > Fix TestPigRunner test failure > -- > > Key: PIG-4835 > URL: https://issues.apache.org/jira/browse/PIG-4835 > Project: Pig > Issue Type: Sub-task > Components: spark >Reporter: Pallavi Rao >Assignee: Pallavi Rao > Labels: spork > Fix For: spark-branch > > Attachments: PIG-4835.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PIG-4836) Fix TestEvalPipeline test failure
[ https://issues.apache.org/jira/browse/PIG-4836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15188577#comment-15188577 ] Mohit Sabharwal commented on PIG-4836: -- [~pallavi.rao], quick question: looks like this sets an empty mr progress reporter in thread local. Is this needed just for POForEach ? If it affects other operators as well, should we set it earlier, like in JobGraphBuilder ? > Fix TestEvalPipeline test failure > - > > Key: PIG-4836 > URL: https://issues.apache.org/jira/browse/PIG-4836 > Project: Pig > Issue Type: Sub-task > Components: spark >Reporter: Pallavi Rao >Assignee: Pallavi Rao > Labels: spork > Fix For: spark-branch > > Attachments: PIG-4836.patch > > > There are two test failures: > testMapUDF > testLimit > testLimit will get fixed by PIG-4832. This JIRA will only fix testMapUDF. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PIG-4835) Fix TestPigRunner test failure
[ https://issues.apache.org/jira/browse/PIG-4835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15188562#comment-15188562 ] Mohit Sabharwal commented on PIG-4835: -- [~pallavi.rao], quick question: looks like this sets an empty mr progress reporter in thread local. Is this needed just for POForEach ? If it affects other operators as well, should we set it earlier, like in JobGraphBuilder ? > Fix TestPigRunner test failure > -- > > Key: PIG-4835 > URL: https://issues.apache.org/jira/browse/PIG-4835 > Project: Pig > Issue Type: Sub-task > Components: spark >Reporter: Pallavi Rao >Assignee: Pallavi Rao > Labels: spork > Fix For: spark-branch > > Attachments: PIG-4835.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
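The comment above asks about installing an empty MapReduce progress reporter in a thread local. The underlying pattern is worth sketching: operator code written for MapReduce periodically pings a per-thread reporter, and on Spark worker threads no reporter is ever installed, so lookups return null. A minimal Python sketch of the idea (names like `NoOpReporter` are illustrative, not Pig's actual API):

```python
import threading

# One slot per thread, mirroring a Java ThreadLocal.
_reporter = threading.local()

class NoOpReporter:
    def progress(self):
        # MapReduce would ping the task tracker here; Spark has no such channel,
        # so the Spark backend's reporter simply does nothing.
        pass

def set_reporter(reporter):
    _reporter.value = reporter

def get_reporter():
    # Operator code calls this; without a default installed on the worker
    # thread, callers would get None and fail on the first progress() call.
    return getattr(_reporter, "value", None)

# The Spark backend installs the no-op reporter before running operators.
set_reporter(NoOpReporter())
get_reporter().progress()  # safe on this thread
```

This also motivates the reviewer's question: if several operators (not just POForEach) consult the reporter, the install should happen once, early, on every thread that runs operator code.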
[jira] [Commented] (PIG-4827) Fix TestSample UT failure
[ https://issues.apache.org/jira/browse/PIG-4827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15185290#comment-15185290 ] Mohit Sabharwal commented on PIG-4827: -- +1 Thanks, [~pallavi.rao]. > Fix TestSample UT failure > - > > Key: PIG-4827 > URL: https://issues.apache.org/jira/browse/PIG-4827 > Project: Pig > Issue Type: Sub-task > Components: spark >Reporter: Pallavi Rao >Assignee: Pallavi Rao > Labels: spork > Fix For: spark-branch > > Attachments: PIG-4827-v1.patch, PIG-4827.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PIG-4825) Fix TestMultiQuery failure
[ https://issues.apache.org/jira/browse/PIG-4825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15184365#comment-15184365 ] Mohit Sabharwal commented on PIG-4825: -- Agreed. We saw this pattern of failures earlier and [~rohini] recommended {{Util.checkQueryOutputsAfterSort}} > Fix TestMultiQuery failure > -- > > Key: PIG-4825 > URL: https://issues.apache.org/jira/browse/PIG-4825 > Project: Pig > Issue Type: Sub-task > Components: spark >Reporter: Pallavi Rao >Assignee: Pallavi Rao > Labels: spork > Fix For: spark-branch > > Attachments: PIG-4825.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
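The recommendation of {{Util.checkQueryOutputsAfterSort}} above reflects a recurring cause of these test failures: Spark may emit result tuples in a different order than MapReduce, so tests must compare outputs order-insensitively. A minimal Python sketch of the idea behind that helper (the function name and shape are illustrative, not Pig's actual test utility):

```python
def check_outputs_after_sort(expected, actual):
    # Compare two relations as multisets of tuples: sort both sides first
    # so a backend that preserves content but not order still passes.
    return sorted(expected) == sorted(actual)

# Same tuples, different order -> equal after sorting.
assert check_outputs_after_sort([("a", 1), ("b", 2)], [("b", 2), ("a", 1)])
```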
[jira] [Commented] (PIG-4827) Fix TestSample UT failure
[ https://issues.apache.org/jira/browse/PIG-4827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15183327#comment-15183327 ] Mohit Sabharwal commented on PIG-4827: -- Needs minor error message update. Otherwise, LGTM. > Fix TestSample UT failure > - > > Key: PIG-4827 > URL: https://issues.apache.org/jira/browse/PIG-4827 > Project: Pig > Issue Type: Sub-task > Components: spark >Reporter: Pallavi Rao >Assignee: Pallavi Rao > Labels: spork > Fix For: spark-branch > > Attachments: PIG-4827.patch > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PIG-4788) the value BytesRead metric info always returns 0 even the length of input file is not 0 in spark engine
[ https://issues.apache.org/jira/browse/PIG-4788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15173189#comment-15173189 ] Mohit Sabharwal commented on PIG-4788: -- Ah, of course, sorry - FileSplit can't be replaced by PigSplit. My other concern was whether changing PigSplit to extend FileSplit will break PigSplit for inputformats that use non-File splits. Makes sense ? > the value BytesRead metric info always returns 0 even the length of input > file is not 0 in spark engine > --- > > Key: PIG-4788 > URL: https://issues.apache.org/jira/browse/PIG-4788 > Project: Pig > Issue Type: Sub-task > Components: spark >Reporter: liyunzhang_intel >Assignee: liyunzhang_intel > Fix For: spark-branch > > Attachments: PIG-4788.patch > > > In > [JobMetricsLinstener#onTaskEnd|https://github.com/apache/pig/blob/spark/src/org/apache/pig/tools/pigstats/spark/SparkJobStats.java#L140], > taskMetrics.inputMetrics().get().bytesRead() always returns 0 even the > length of input file is not zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PIG-4788) the value BytesRead metric info always returns 0 even the length of input file is not 0 in spark engine
[ https://issues.apache.org/jira/browse/PIG-4788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15173164#comment-15173164 ] Mohit Sabharwal commented on PIG-4788: -- [~kellyzly], if you change {{PigSplit}} to extend {{FileSplit}}, will {{PigInputFormat}} still work with non-file splits like CombineFileSplit, etc. ? Can we instead use {{FileSplit}} when we create the record reader in {{PigInputFormatSpark}}, instead of {{PigSplit}} ? That way we could isolate the change in Spark specific code. > the value BytesRead metric info always returns 0 even the length of input > file is not 0 in spark engine > --- > > Key: PIG-4788 > URL: https://issues.apache.org/jira/browse/PIG-4788 > Project: Pig > Issue Type: Sub-task > Components: spark >Reporter: liyunzhang_intel >Assignee: liyunzhang_intel > Fix For: spark-branch > > Attachments: PIG-4788.patch > > > In > [JobMetricsLinstener#onTaskEnd|https://github.com/apache/pig/blob/spark/src/org/apache/pig/tools/pigstats/spark/SparkJobStats.java#L140], > taskMetrics.inputMetrics().get().bytesRead() always returns 0 even the > length of input file is not zero. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
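The suggestion above is to hand Spark a {{FileSplit}} when creating the record reader in {{PigInputFormatSpark}}, rather than making the shared {{PigSplit}} extend {{FileSplit}} and risk breaking non-file splits. A Python sketch of that isolation idea (all class and function names here are illustrative stand-ins, not Pig's real classes):

```python
class FileSplit:
    # Stand-in for a file-backed split whose length Spark's input
    # metrics can observe (bytesRead).
    def __init__(self, path, length):
        self.path, self.length = path, length

class PigSplit:
    # Stand-in for Pig's wrapper split; the wrapped split may or may
    # not be file-backed (e.g. CombineFileSplit, non-file sources).
    def __init__(self, wrapped):
        self.wrapped = wrapped

def split_for_record_reader(split, spark_mode):
    # Only the Spark backend unwraps, and only when the underlying
    # split really is file-backed; everything else is untouched, so
    # the change stays isolated in Spark-specific code.
    if spark_mode and isinstance(split.wrapped, FileSplit):
        return split.wrapped
    return split
```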
[jira] [Commented] (PIG-4601) Implement Merge CoGroup for Spark engine
[ https://issues.apache.org/jira/browse/PIG-4601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15149022#comment-15149022 ] Mohit Sabharwal commented on PIG-4601: -- +1 (non-binding), sorry about the delay reviewing the updated patch. > Implement Merge CoGroup for Spark engine > > > Key: PIG-4601 > URL: https://issues.apache.org/jira/browse/PIG-4601 > Project: Pig > Issue Type: Sub-task > Components: spark >Affects Versions: spark-branch >Reporter: Mohit Sabharwal >Assignee: liyunzhang_intel > Fix For: spark-branch > > Attachments: PIG-4601_1.patch, PIG-4601_2.patch > > > When doing a cogroup operation, we need do a map-reduce. The target of merge > cogroup is implementing cogroup only by a single stage(map). But we need to > guarantee the input data are sorted. > There is performance improvement for cases when A(big dataset) merge cogroup > B( small dataset) because we first generate an index file of A then loading A > according to the index file and B into memory to do cogroup. The performance > improves because there is no cost of reduce period comparing cogroup. > How to use > {code} > C = cogroup A by c1, B by c1 using 'merge'; > {code} > Here A and B is sorted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PIG-4601) Implement Merge CoGroup for Spark engine
[ https://issues.apache.org/jira/browse/PIG-4601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15102751#comment-15102751 ] Mohit Sabharwal commented on PIG-4601: -- Thanks, [~kellyzly]! Have couple of questions on RB. > Implement Merge CoGroup for Spark engine > > > Key: PIG-4601 > URL: https://issues.apache.org/jira/browse/PIG-4601 > Project: Pig > Issue Type: Sub-task > Components: spark >Affects Versions: spark-branch >Reporter: Mohit Sabharwal >Assignee: liyunzhang_intel > Fix For: spark-branch > > Attachments: PIG-4601_1.patch > > > When doing a cogroup operation, we need do a map-reduce. The target of merge > cogroup is implementing cogroup only by a single stage(map). But we need to > guarantee the input data are sorted. > There is performance improvement for cases when A(big dataset) merge cogroup > B( small dataset) because we first generate an index file of A then loading A > according to the index file and B into memory to do cogroup. The performance > improves because there is no cost of reduce period comparing cogroup. > How to use > {code} > C = cogroup A by c1, B by c1 using 'merge'; > {code} > Here A and B is sorted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PIG-4611) Fix remaining unit test failures about "TestHBaseStorage" in spark mode
[ https://issues.apache.org/jira/browse/PIG-4611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15101131#comment-15101131 ] Mohit Sabharwal commented on PIG-4611: -- +1 (non-binding) > Fix remaining unit test failures about "TestHBaseStorage" in spark mode > --- > > Key: PIG-4611 > URL: https://issues.apache.org/jira/browse/PIG-4611 > Project: Pig > Issue Type: Sub-task > Components: spark >Reporter: liyunzhang_intel >Assignee: liyunzhang_intel > Fix For: spark-branch > > Attachments: PIG-4611.patch, PIG-4611_2.patch, PIG-4611_3.patch > > > In https://builds.apache.org/job/Pig-spark/lastCompletedBuild/testReport/, it > shows following unit test failures about TestHBaseStorage: > org.apache.pig.test.TestHBaseStorage.testStoreToHBase_1_with_delete > org.apache.pig.test.TestHBaseStorage.testLoadWithProjection_1 > org.apache.pig.test.TestHBaseStorage.testLoadWithProjection_2 > org.apache.pig.test.TestHBaseStorage.testStoreToHBase_2_with_projection > org.apache.pig.test.TestHBaseStorage.testCollectedGroup > org.apache.pig.test.TestHBaseStorage.testHeterogeneousScans -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PIG-4675) Operators with multiple predecessors fail under multiquery optimization
[ https://issues.apache.org/jira/browse/PIG-4675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15063531#comment-15063531 ] Mohit Sabharwal commented on PIG-4675: -- +1(non-binding) > Operators with multiple predecessors fail under multiquery optimization > --- > > Key: PIG-4675 > URL: https://issues.apache.org/jira/browse/PIG-4675 > Project: Pig > Issue Type: Sub-task > Components: spark >Reporter: Peter Lin >Assignee: liyunzhang_intel > Fix For: spark-branch > > Attachments: PIG-4675_1.patch, PIG-4675_2.patch, PIG-4675_3.patch, > name.txt, ssn.txt, test.pig > > > We are testing the spark branch pig recently with mapr3 and spark 1.5. It > turns out if we use more than 1 store command in the pig script will have > exception from the second store command. > SSN = load '/test/ssn.txt' using PigStorage() as (ssn:long); > SSN_NAME = load '/test/name.txt' using PigStorage() as (ssn:long, > name:chararray); > X = JOIN SSN by ssn LEFT OUTER, SSN_NAME by ssn USING 'replicated'; > R1 = limit SSN_NAME 10; > store R1 into '/tmp/test1_r1'; > store X into '/tmp/test1_x'; > Exception Details: > 15/09/11 13:37:00 INFO storage.MemoryStore: ensureFreeSpace(114448) called > with curMem=359237, maxMem=503379394 > 15/09/11 13:37:00 INFO storage.MemoryStore: Block broadcast_2 stored as > values in memory (estimated size 111.8 KB, free 479.6 MB) > 15/09/11 13:37:00 INFO storage.MemoryStore: ensureFreeSpace(32569) called > with curMem=473685, maxMem=503379394 > 15/09/11 13:37:00 INFO storage.MemoryStore: Block broadcast_2_piece0 stored > as bytes in memory (estimated size 31.8 KB, free 479.6 MB) > 15/09/11 13:37:00 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in > memory on 10.51.2.82:55960 (size: 31.8 KB, free: 479.9 MB) > 15/09/11 13:37:00 INFO spark.SparkContext: Created broadcast 2 from > newAPIHadoopRDD at LoadConverter.java:88 > 15/09/11 13:37:00 WARN util.ClosureCleaner: Expected a closure; got > 
org.apache.pig.backend.hadoop.executionengine.spark.converter.LoadConverter$ToTupleFunction > 15/09/11 13:37:00 INFO spark.SparkLauncher: Converting operator POForEach > (Name: SSN: New For Each(false)[bag] - scope-17 Operator Key: scope-17) > 15/09/11 13:37:00 INFO spark.SparkLauncher: Converting operator POFRJoin > (Name: X: FRJoin[tuple] - scope-22 Operator Key: scope-22) > 15/09/11 13:37:00 ERROR spark.SparkLauncher: throw exception in > sparkOperToRDD: > java.lang.RuntimeException: Should have greater than1 predecessors for class > org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFRJoin. > Got : 1 > at > org.apache.pig.backend.hadoop.executionengine.spark.SparkUtil.assertPredecessorSizeGreaterThan(SparkUtil.java:93) > at > org.apache.pig.backend.hadoop.executionengine.spark.converter.FRJoinConverter.convert(FRJoinConverter.java:55) > at > org.apache.pig.backend.hadoop.executionengine.spark.converter.FRJoinConverter.convert(FRJoinConverter.java:46) > at > org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.physicalToRDD(SparkLauncher.java:633) > at > org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.physicalToRDD(SparkLauncher.java:600) > at > org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.physicalToRDD(SparkLauncher.java:621) > at > org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.sparkOperToRDD(SparkLauncher.java:552) > at > org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.sparkPlanToRDD(SparkLauncher.java:501) > at > org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.launchPig(SparkLauncher.java:204) > at > org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:301) > at org.apache.pig.PigServer.launchPlan(PigServer.java:1390) > at > org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1375) > at org.apache.pig.PigServer.execute(PigServer.java:1364) > at 
org.apache.pig.PigServer.executeBatch(PigServer.java:415) > at org.apache.pig.PigServer.executeBatch(PigServer.java:398) > at > org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:171) > at > org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:234) > at > org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205) > at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81) > at org.apache.pig.Main.run(Main.java:624) > at org.apache.pig.Main.main(Main.java:170) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PIG-4293) Enable unit test "TestNativeMapReduce" for spark
[ https://issues.apache.org/jira/browse/PIG-4293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15063344#comment-15063344 ] Mohit Sabharwal commented on PIG-4293: -- +1 (non-binding) > Enable unit test "TestNativeMapReduce" for spark > > > Key: PIG-4293 > URL: https://issues.apache.org/jira/browse/PIG-4293 > Project: Pig > Issue Type: Sub-task > Components: spark >Reporter: liyunzhang_intel >Assignee: liyunzhang_intel > Fix For: spark-branch > > Attachments: PIG-4293.patch, PIG-4293_1.patch, > TEST-org.apache.pig.test.TestNativeMapReduce.txt > > > error log is attached -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PIG-4675) Operators with multiple predecessors fail under multiquery optimization
[ https://issues.apache.org/jira/browse/PIG-4675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15063353#comment-15063353 ] Mohit Sabharwal commented on PIG-4675: -- Thanks, [~kellyzly]. I had one minor comment. Otherwise LGTM. > Operators with multiple predecessors fail under multiquery optimization > --- > > Key: PIG-4675 > URL: https://issues.apache.org/jira/browse/PIG-4675 > Project: Pig > Issue Type: Sub-task > Components: spark >Reporter: Peter Lin >Assignee: liyunzhang_intel > Fix For: spark-branch > > Attachments: PIG-4675_1.patch, PIG-4675_2.patch, name.txt, ssn.txt, > test.pig > > > We are testing the spark branch pig recently with mapr3 and spark 1.5. It > turns out if we use more than 1 store command in the pig script will have > exception from the second store command. > SSN = load '/test/ssn.txt' using PigStorage() as (ssn:long); > SSN_NAME = load '/test/name.txt' using PigStorage() as (ssn:long, > name:chararray); > X = JOIN SSN by ssn LEFT OUTER, SSN_NAME by ssn USING 'replicated'; > R1 = limit SSN_NAME 10; > store R1 into '/tmp/test1_r1'; > store X into '/tmp/test1_x'; > Exception Details: > 15/09/11 13:37:00 INFO storage.MemoryStore: ensureFreeSpace(114448) called > with curMem=359237, maxMem=503379394 > 15/09/11 13:37:00 INFO storage.MemoryStore: Block broadcast_2 stored as > values in memory (estimated size 111.8 KB, free 479.6 MB) > 15/09/11 13:37:00 INFO storage.MemoryStore: ensureFreeSpace(32569) called > with curMem=473685, maxMem=503379394 > 15/09/11 13:37:00 INFO storage.MemoryStore: Block broadcast_2_piece0 stored > as bytes in memory (estimated size 31.8 KB, free 479.6 MB) > 15/09/11 13:37:00 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in > memory on 10.51.2.82:55960 (size: 31.8 KB, free: 479.9 MB) > 15/09/11 13:37:00 INFO spark.SparkContext: Created broadcast 2 from > newAPIHadoopRDD at LoadConverter.java:88 > 15/09/11 13:37:00 WARN util.ClosureCleaner: Expected a closure; got > 
org.apache.pig.backend.hadoop.executionengine.spark.converter.LoadConverter$ToTupleFunction > 15/09/11 13:37:00 INFO spark.SparkLauncher: Converting operator POForEach > (Name: SSN: New For Each(false)[bag] - scope-17 Operator Key: scope-17) > 15/09/11 13:37:00 INFO spark.SparkLauncher: Converting operator POFRJoin > (Name: X: FRJoin[tuple] - scope-22 Operator Key: scope-22) > 15/09/11 13:37:00 ERROR spark.SparkLauncher: throw exception in > sparkOperToRDD: > java.lang.RuntimeException: Should have greater than1 predecessors for class > org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFRJoin. > Got : 1 > at > org.apache.pig.backend.hadoop.executionengine.spark.SparkUtil.assertPredecessorSizeGreaterThan(SparkUtil.java:93) > at > org.apache.pig.backend.hadoop.executionengine.spark.converter.FRJoinConverter.convert(FRJoinConverter.java:55) > at > org.apache.pig.backend.hadoop.executionengine.spark.converter.FRJoinConverter.convert(FRJoinConverter.java:46) > at > org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.physicalToRDD(SparkLauncher.java:633) > at > org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.physicalToRDD(SparkLauncher.java:600) > at > org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.physicalToRDD(SparkLauncher.java:621) > at > org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.sparkOperToRDD(SparkLauncher.java:552) > at > org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.sparkPlanToRDD(SparkLauncher.java:501) > at > org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.launchPig(SparkLauncher.java:204) > at > org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:301) > at org.apache.pig.PigServer.launchPlan(PigServer.java:1390) > at > org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1375) > at org.apache.pig.PigServer.execute(PigServer.java:1364) > at 
org.apache.pig.PigServer.executeBatch(PigServer.java:415) > at org.apache.pig.PigServer.executeBatch(PigServer.java:398) > at > org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:171) > at > org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:234) > at > org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205) > at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81) > at org.apache.pig.Main.run(Main.java:624) > at org.apache.pig.Main.main(Main.java:170) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
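The RuntimeException in the stack traces above comes from an invariant check: a replicated (FR) join converter expects one RDD per input relation, i.e. strictly more than one predecessor, and multiquery merging was producing a plan in which the lookup saw only one. A minimal Python sketch of that failing check (the function name mirrors the stack trace, but the code is illustrative, not Pig's implementation):

```python
def assert_predecessor_size_greater_than(predecessors, n, op_name):
    # An FRJoin needs its main input plus at least one replicated input;
    # fewer predecessors means the merged plan lost an edge.
    if len(predecessors) <= n:
        raise RuntimeError(
            f"Should have greater than {n} predecessors for {op_name}. "
            f"Got: {len(predecessors)}")

# Two inputs (as in the SSN join above): the check passes.
assert_predecessor_size_greater_than(["SSN", "SSN_NAME"], 1, "POFRJoin")
```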
[jira] [Commented] (PIG-4754) Fix UT failures in TestScriptLanguage
[ https://issues.apache.org/jira/browse/PIG-4754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15063427#comment-15063427 ] Mohit Sabharwal commented on PIG-4754: -- +1(non-binding). LGTM. could you please add a comment explaining why that block is protected & update the patch? > Fix UT failures in TestScriptLanguage > - > > Key: PIG-4754 > URL: https://issues.apache.org/jira/browse/PIG-4754 > Project: Pig > Issue Type: Sub-task > Components: spark >Reporter: Xianda Ke >Assignee: Xianda Ke > Fix For: spark-branch > > Attachments: PIG-4754.patch > > > org.apache.pig.test.TestScriptLanguage.runParallelTest2 > Error Message > job should succeed > Stacktrace > junit.framework.AssertionFailedError: job should succeed > at > org.apache.pig.test.TestScriptLanguage.runPigRunner(TestScriptLanguage.java:96) > at > org.apache.pig.test.TestScriptLanguage.runPigRunner(TestScriptLanguage.java:105) > at > org.apache.pig.test.TestScriptLanguage.runParallelTest2(TestScriptLanguage.java:311) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PIG-4675) Operators with multiple predecessors fail under
[ https://issues.apache.org/jira/browse/PIG-4675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mohit Sabharwal updated PIG-4675: - Summary: Operators with multiple predecessors fail under (was: FR+Limit case fails when enable MultiQuery because the predecessor information is wrongly calculated in current code.) > Operators with multiple predecessors fail under > > > Key: PIG-4675 > URL: https://issues.apache.org/jira/browse/PIG-4675 > Project: Pig > Issue Type: Sub-task > Components: spark >Reporter: Peter Lin >Assignee: liyunzhang_intel > Fix For: spark-branch > > Attachments: PIG-4675_1.patch, name.txt, ssn.txt, test.pig > > > We are testing the spark branch pig recently with mapr3 and spark 1.5. It > turns out if we use more than 1 store command in the pig script will have > exception from the second store command. > SSN = load '/test/ssn.txt' using PigStorage() as (ssn:long); > SSN_NAME = load '/test/name.txt' using PigStorage() as (ssn:long, > name:chararray); > X = JOIN SSN by ssn LEFT OUTER, SSN_NAME by ssn USING 'replicated'; > R1 = limit SSN_NAME 10; > store R1 into '/tmp/test1_r1'; > store X into '/tmp/test1_x'; > Exception Details: > 15/09/11 13:37:00 INFO storage.MemoryStore: ensureFreeSpace(114448) called > with curMem=359237, maxMem=503379394 > 15/09/11 13:37:00 INFO storage.MemoryStore: Block broadcast_2 stored as > values in memory (estimated size 111.8 KB, free 479.6 MB) > 15/09/11 13:37:00 INFO storage.MemoryStore: ensureFreeSpace(32569) called > with curMem=473685, maxMem=503379394 > 15/09/11 13:37:00 INFO storage.MemoryStore: Block broadcast_2_piece0 stored > as bytes in memory (estimated size 31.8 KB, free 479.6 MB) > 15/09/11 13:37:00 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in > memory on 10.51.2.82:55960 (size: 31.8 KB, free: 479.9 MB) > 15/09/11 13:37:00 INFO spark.SparkContext: Created broadcast 2 from > newAPIHadoopRDD at LoadConverter.java:88 > 15/09/11 13:37:00 WARN util.ClosureCleaner: Expected a 
closure; got > org.apache.pig.backend.hadoop.executionengine.spark.converter.LoadConverter$ToTupleFunction > 15/09/11 13:37:00 INFO spark.SparkLauncher: Converting operator POForEach > (Name: SSN: New For Each(false)[bag] - scope-17 Operator Key: scope-17) > 15/09/11 13:37:00 INFO spark.SparkLauncher: Converting operator POFRJoin > (Name: X: FRJoin[tuple] - scope-22 Operator Key: scope-22) > 15/09/11 13:37:00 ERROR spark.SparkLauncher: throw exception in > sparkOperToRDD: > java.lang.RuntimeException: Should have greater than1 predecessors for class > org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFRJoin. > Got : 1 > at > org.apache.pig.backend.hadoop.executionengine.spark.SparkUtil.assertPredecessorSizeGreaterThan(SparkUtil.java:93) > at > org.apache.pig.backend.hadoop.executionengine.spark.converter.FRJoinConverter.convert(FRJoinConverter.java:55) > at > org.apache.pig.backend.hadoop.executionengine.spark.converter.FRJoinConverter.convert(FRJoinConverter.java:46) > at > org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.physicalToRDD(SparkLauncher.java:633) > at > org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.physicalToRDD(SparkLauncher.java:600) > at > org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.physicalToRDD(SparkLauncher.java:621) > at > org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.sparkOperToRDD(SparkLauncher.java:552) > at > org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.sparkPlanToRDD(SparkLauncher.java:501) > at > org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.launchPig(SparkLauncher.java:204) > at > org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:301) > at org.apache.pig.PigServer.launchPlan(PigServer.java:1390) > at > org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1375) > at org.apache.pig.PigServer.execute(PigServer.java:1364) > at 
org.apache.pig.PigServer.executeBatch(PigServer.java:415) > at org.apache.pig.PigServer.executeBatch(PigServer.java:398) > at > org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:171) > at > org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:234) > at > org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205) > at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81) > at org.apache.pig.Main.run(Main.java:624) > at org.apache.pig.Main.main(Main.java:170) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PIG-4675) Operators with multiple predecessors fail under multiquery optimization
[ https://issues.apache.org/jira/browse/PIG-4675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mohit Sabharwal updated PIG-4675: - Summary: Operators with multiple predecessors fail under multiquery optimization (was: Operators with multiple predecessors fail under ) > Operators with multiple predecessors fail under multiquery optimization > --- > > Key: PIG-4675 > URL: https://issues.apache.org/jira/browse/PIG-4675 > Project: Pig > Issue Type: Sub-task > Components: spark >Reporter: Peter Lin >Assignee: liyunzhang_intel > Fix For: spark-branch > > Attachments: PIG-4675_1.patch, name.txt, ssn.txt, test.pig > > > We are testing the spark branch pig recently with mapr3 and spark 1.5. It > turns out if we use more than 1 store command in the pig script will have > exception from the second store command. > SSN = load '/test/ssn.txt' using PigStorage() as (ssn:long); > SSN_NAME = load '/test/name.txt' using PigStorage() as (ssn:long, > name:chararray); > X = JOIN SSN by ssn LEFT OUTER, SSN_NAME by ssn USING 'replicated'; > R1 = limit SSN_NAME 10; > store R1 into '/tmp/test1_r1'; > store X into '/tmp/test1_x'; > Exception Details: > 15/09/11 13:37:00 INFO storage.MemoryStore: ensureFreeSpace(114448) called > with curMem=359237, maxMem=503379394 > 15/09/11 13:37:00 INFO storage.MemoryStore: Block broadcast_2 stored as > values in memory (estimated size 111.8 KB, free 479.6 MB) > 15/09/11 13:37:00 INFO storage.MemoryStore: ensureFreeSpace(32569) called > with curMem=473685, maxMem=503379394 > 15/09/11 13:37:00 INFO storage.MemoryStore: Block broadcast_2_piece0 stored > as bytes in memory (estimated size 31.8 KB, free 479.6 MB) > 15/09/11 13:37:00 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in > memory on 10.51.2.82:55960 (size: 31.8 KB, free: 479.9 MB) > 15/09/11 13:37:00 INFO spark.SparkContext: Created broadcast 2 from > newAPIHadoopRDD at LoadConverter.java:88 > 15/09/11 13:37:00 WARN util.ClosureCleaner: Expected a closure; got > 
org.apache.pig.backend.hadoop.executionengine.spark.converter.LoadConverter$ToTupleFunction > 15/09/11 13:37:00 INFO spark.SparkLauncher: Converting operator POForEach > (Name: SSN: New For Each(false)[bag] - scope-17 Operator Key: scope-17) > 15/09/11 13:37:00 INFO spark.SparkLauncher: Converting operator POFRJoin > (Name: X: FRJoin[tuple] - scope-22 Operator Key: scope-22) > 15/09/11 13:37:00 ERROR spark.SparkLauncher: throw exception in > sparkOperToRDD: > java.lang.RuntimeException: Should have greater than1 predecessors for class > org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFRJoin. > Got : 1 > at > org.apache.pig.backend.hadoop.executionengine.spark.SparkUtil.assertPredecessorSizeGreaterThan(SparkUtil.java:93) > at > org.apache.pig.backend.hadoop.executionengine.spark.converter.FRJoinConverter.convert(FRJoinConverter.java:55) > at > org.apache.pig.backend.hadoop.executionengine.spark.converter.FRJoinConverter.convert(FRJoinConverter.java:46) > at > org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.physicalToRDD(SparkLauncher.java:633) > at > org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.physicalToRDD(SparkLauncher.java:600) > at > org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.physicalToRDD(SparkLauncher.java:621) > at > org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.sparkOperToRDD(SparkLauncher.java:552) > at > org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.sparkPlanToRDD(SparkLauncher.java:501) > at > org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.launchPig(SparkLauncher.java:204) > at > org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:301) > at org.apache.pig.PigServer.launchPlan(PigServer.java:1390) > at > org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1375) > at org.apache.pig.PigServer.execute(PigServer.java:1364) > at 
org.apache.pig.PigServer.executeBatch(PigServer.java:415) > at org.apache.pig.PigServer.executeBatch(PigServer.java:398) > at > org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:171) > at > org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:234) > at > org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205) > at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81) > at org.apache.pig.Main.run(Main.java:624) > at org.apache.pig.Main.main(Main.java:170) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PIG-4293) Enable unit test "TestNativeMapReduce" for spark
[ https://issues.apache.org/jira/browse/PIG-4293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15039693#comment-15039693 ] Mohit Sabharwal commented on PIG-4293: -- Thanks, [~kellyzly]! Left couple of comments on RB. > Enable unit test "TestNativeMapReduce" for spark > > > Key: PIG-4293 > URL: https://issues.apache.org/jira/browse/PIG-4293 > Project: Pig > Issue Type: Sub-task > Components: spark >Reporter: liyunzhang_intel >Assignee: liyunzhang_intel > Fix For: spark-branch > > Attachments: PIG-4293.patch, PIG-4293_1.patch, > TEST-org.apache.pig.test.TestNativeMapReduce.txt > > > error log is attached -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PIG-4621) Enable Illustrate in spark
[ https://issues.apache.org/jira/browse/PIG-4621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15039706#comment-15039706 ] Mohit Sabharwal commented on PIG-4621: -- Thanks, left some comments on RB. cc [~kellyzly] > Enable Illustrate in spark > -- > > Key: PIG-4621 > URL: https://issues.apache.org/jira/browse/PIG-4621 > Project: Pig > Issue Type: Sub-task > Components: spark >Reporter: liyunzhang_intel >Assignee: Syed Zulfiqar Ali > Fix For: spark-branch > > > Current we don't support illustrate in spark mode. > How illustrate works > see:http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#ILLUSTRATE -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PIG-4675) FR+Limit case fails when enable MultiQuery because the predecessor information is wrongly calculated in current code.
[ https://issues.apache.org/jira/browse/PIG-4675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15039755#comment-15039755 ] Mohit Sabharwal commented on PIG-4675: -- Thanks, [~kellyzly], this looks like a pretty critical issue. It is potentially affecting many other query plans, not just FRJoin with Limit, right? Could you summarize why the predecessor information was getting wrongly calculated? Could you also explain the approach you took to fix it in more detail? > FR+Limit case fails when enable MultiQuery because the predecessor > information is wrongly calculated in current code. > - > > Key: PIG-4675 > URL: https://issues.apache.org/jira/browse/PIG-4675 > Project: Pig > Issue Type: Sub-task > Components: spark >Reporter: Peter Lin >Assignee: liyunzhang_intel > Fix For: spark-branch > > Attachments: PIG-4675_1.patch, name.txt, ssn.txt, test.pig > > > We have recently been testing the spark branch of Pig with mapr3 and Spark 1.5. It > turns out that if we use more than one store command in the pig script, we get an > exception from the second store command.
> SSN = load '/test/ssn.txt' using PigStorage() as (ssn:long); > SSN_NAME = load '/test/name.txt' using PigStorage() as (ssn:long, > name:chararray); > X = JOIN SSN by ssn LEFT OUTER, SSN_NAME by ssn USING 'replicated'; > R1 = limit SSN_NAME 10; > store R1 into '/tmp/test1_r1'; > store X into '/tmp/test1_x'; > Exception Details: > 15/09/11 13:37:00 INFO storage.MemoryStore: ensureFreeSpace(114448) called > with curMem=359237, maxMem=503379394 > 15/09/11 13:37:00 INFO storage.MemoryStore: Block broadcast_2 stored as > values in memory (estimated size 111.8 KB, free 479.6 MB) > 15/09/11 13:37:00 INFO storage.MemoryStore: ensureFreeSpace(32569) called > with curMem=473685, maxMem=503379394 > 15/09/11 13:37:00 INFO storage.MemoryStore: Block broadcast_2_piece0 stored > as bytes in memory (estimated size 31.8 KB, free 479.6 MB) > 15/09/11 13:37:00 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in > memory on 10.51.2.82:55960 (size: 31.8 KB, free: 479.9 MB) > 15/09/11 13:37:00 INFO spark.SparkContext: Created broadcast 2 from > newAPIHadoopRDD at LoadConverter.java:88 > 15/09/11 13:37:00 WARN util.ClosureCleaner: Expected a closure; got > org.apache.pig.backend.hadoop.executionengine.spark.converter.LoadConverter$ToTupleFunction > 15/09/11 13:37:00 INFO spark.SparkLauncher: Converting operator POForEach > (Name: SSN: New For Each(false)[bag] - scope-17 Operator Key: scope-17) > 15/09/11 13:37:00 INFO spark.SparkLauncher: Converting operator POFRJoin > (Name: X: FRJoin[tuple] - scope-22 Operator Key: scope-22) > 15/09/11 13:37:00 ERROR spark.SparkLauncher: throw exception in > sparkOperToRDD: > java.lang.RuntimeException: Should have greater than1 predecessors for class > org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFRJoin. 
> Got : 1 > at > org.apache.pig.backend.hadoop.executionengine.spark.SparkUtil.assertPredecessorSizeGreaterThan(SparkUtil.java:93) > at > org.apache.pig.backend.hadoop.executionengine.spark.converter.FRJoinConverter.convert(FRJoinConverter.java:55) > at > org.apache.pig.backend.hadoop.executionengine.spark.converter.FRJoinConverter.convert(FRJoinConverter.java:46) > at > org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.physicalToRDD(SparkLauncher.java:633) > at > org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.physicalToRDD(SparkLauncher.java:600) > at > org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.physicalToRDD(SparkLauncher.java:621) > at > org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.sparkOperToRDD(SparkLauncher.java:552) > at > org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.sparkPlanToRDD(SparkLauncher.java:501) > at > org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.launchPig(SparkLauncher.java:204) > at > org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:301) > at org.apache.pig.PigServer.launchPlan(PigServer.java:1390) > at > org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1375) > at org.apache.pig.PigServer.execute(PigServer.java:1364) > at org.apache.pig.PigServer.executeBatch(PigServer.java:415) > at org.apache.pig.PigServer.executeBatch(PigServer.java:398) > at > org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:171) > at > org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:234) > at >
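The invariant the converter enforces can be modeled in a few lines of plain Python (a hypothetical sketch, not Pig's actual plan classes): an FRJoin should see one predecessor per joined input, so a correctly wired plan gives it two here, while the reported failure is a merged multi-query plan in which it sees only one.

```python
# Hypothetical Python model (not Pig's actual plan classes) of the check in
# SparkUtil.assertPredecessorSizeGreaterThan: an FRJoin needs one predecessor
# per joined input, i.e. strictly more than 1 in total for this query.
plan = {
    "SSN": [],
    "SSN_NAME": [],
    "X_FRJoin": ["SSN", "SSN_NAME"],  # fragmented input + replicated input
    "R1_Limit": ["SSN_NAME"],
}

def predecessors(plan, op):
    return plan[op]

def assert_predecessor_size_greater_than(plan, op, n):
    got = len(predecessors(plan, op))
    if not got > n:
        # mirrors the reported error: "Should have greater than1 ... Got : 1"
        raise RuntimeError(f"Should have greater than {n} predecessors. Got: {got}")

assert_predecessor_size_greater_than(plan, "X_FRJoin", 1)  # passes on a correct plan
```

In the failing case, multi-query merging leaves `X_FRJoin` with only the fragmented input as a predecessor, so the same check raises.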
[jira] [Commented] (PIG-4709) Improve performance of GROUPBY operator on Spark
[ https://issues.apache.org/jira/browse/PIG-4709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15032245#comment-15032245 ] Mohit Sabharwal commented on PIG-4709: -- Thanks, [~pallavi.rao], will take a look. + [~kellyzly] as well. > Improve performance of GROUPBY operator on Spark > > > Key: PIG-4709 > URL: https://issues.apache.org/jira/browse/PIG-4709 > Project: Pig > Issue Type: Sub-task > Components: spark >Reporter: Pallavi Rao >Assignee: Pallavi Rao > Labels: spork > Fix For: spark-branch > > Attachments: PIG-4709.patch > > > Currently, the GROUPBY operator of Pig is mapped to Spark's CoGroup. When the > grouped data is consumed by subsequent operations to perform algebraic > operations, this is sub-optimal as there is a lot of shuffle traffic. > The Spark plan must be optimized to use reduceBy, where possible, so that a > combiner is used. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
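Why a reduceBy-style combiner cuts shuffle traffic can be seen in a minimal plain-Python sketch (hypothetical data, not Spark's API): grouping first ships every record across the shuffle, while combining map-side ships only one partial sum per partition and key.

```python
# Plain-Python model (hypothetical, not Spark's API) contrasting the shuffle
# cost of grouping first vs. combining map-side, for a word-count-style sum.
partitions = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("a", 1), ("b", 1), ("b", 1)],
]

# CoGroup/groupByKey style: every record crosses the shuffle boundary.
shuffled_group = sum(len(p) for p in partitions)  # 6 records shuffled

# reduceBy style: combine within each partition first, then shuffle only
# one partial sum per (partition, key).
partials = []
for p in partitions:
    combined = {}
    for key, value in p:
        combined[key] = combined.get(key, 0) + value
    partials.append(combined)
shuffled_reduce = sum(len(c) for c in partials)  # 4 partial sums shuffled

# The reduce side merges the partials; the final result is identical either way.
totals = {}
for c in partials:
    for key, value in c.items():
        totals[key] = totals.get(key, 0) + value  # totals == {"a": 3, "b": 3}
```

The saving grows with the number of duplicate keys per partition, which is why it only applies when the consuming operation is algebraic (associative and commutative).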
[jira] [Commented] (PIG-4634) Fix records count issues in output statistics
[ https://issues.apache.org/jira/browse/PIG-4634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14982908#comment-14982908 ] Mohit Sabharwal commented on PIG-4634: -- Thanks, [~kexianda]! +1 (non-binding) > Fix records count issues in output statistics > - > > Key: PIG-4634 > URL: https://issues.apache.org/jira/browse/PIG-4634 > Project: Pig > Issue Type: Sub-task > Components: spark >Reporter: Xianda Ke >Assignee: Xianda Ke > Fix For: spark-branch > > Attachments: PIG-4634-3.patch, PIG-4634-4.patch, PIG-4634-5.patch, > PIG-4634-6.patch, PIG-4634.patch, PIG-4634_2.patch > > > Test cases simpleTest() and simpleTest2() in TestPigRunner failed, caused by the > following issues: > 1. pig context in SparkPigStats isn't initialized. > 2. the records count logic hasn't been implemented. > 3. getOutputAlias(), getPigProperties(), getBytesWritten() and > getRecordWritten() have not been implemented. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PIG-4634) Fix records count issues in output statistics
[ https://issues.apache.org/jira/browse/PIG-4634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14975606#comment-14975606 ] Mohit Sabharwal commented on PIG-4634: -- Thanks, [~xianda]. I had a couple of code readability nits on RB. Otherwise LGTM. > Fix records count issues in output statistics > - > > Key: PIG-4634 > URL: https://issues.apache.org/jira/browse/PIG-4634 > Project: Pig > Issue Type: Sub-task > Components: spark >Reporter: Xianda Ke >Assignee: Xianda Ke > Fix For: spark-branch > > Attachments: PIG-4634-3.patch, PIG-4634-4.patch, PIG-4634-5.patch, > PIG-4634.patch, PIG-4634_2.patch > > > Test cases simpleTest() and simpleTest2() in TestPigRunner failed, caused by the > following issues: > 1. pig context in SparkPigStats isn't initialized. > 2. the records count logic hasn't been implemented. > 3. getOutputAlias(), getPigProperties(), getBytesWritten() and > getRecordWritten() have not been implemented. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PIG-4655) Support InputStats in spark mode
[ https://issues.apache.org/jira/browse/PIG-4655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731809#comment-14731809 ] Mohit Sabharwal commented on PIG-4655: -- That's right, depends on PIG-4634 > Support InputStats in spark mode > > > Key: PIG-4655 > URL: https://issues.apache.org/jira/browse/PIG-4655 > Project: Pig > Issue Type: Sub-task > Components: spark >Reporter: Xianda Ke >Assignee: Xianda Ke > Fix For: spark-branch > > Attachments: PIG-4655-2.patch, PIG-4655-3.patch, PIG-4655.patch > > > Currently, InputStats is not implemented in spark mode. > The JUnit case TestPigRunner.testEmptyFileCounter() will fail. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PIG-4655) Support InputStats in spark mode
[ https://issues.apache.org/jira/browse/PIG-4655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731605#comment-14731605 ] Mohit Sabharwal commented on PIG-4655: -- +1 (non-binding) > Support InputStats in spark mode > > > Key: PIG-4655 > URL: https://issues.apache.org/jira/browse/PIG-4655 > Project: Pig > Issue Type: Sub-task > Components: spark >Reporter: Xianda Ke >Assignee: Xianda Ke > Fix For: spark-branch > > Attachments: PIG-4655-2.patch, PIG-4655-3.patch, PIG-4655.patch > > > Currently, InputStats is not implemented in spark mode. > The JUnit case TestPigRunner.testEmptyFileCounter() will fail. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PIG-4634) Fix records count issues in output statistics
[ https://issues.apache.org/jira/browse/PIG-4634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731603#comment-14731603 ] Mohit Sabharwal commented on PIG-4634: -- Thanks, [~kexianda], left some comments on RB. > Fix records count issues in output statistics > - > > Key: PIG-4634 > URL: https://issues.apache.org/jira/browse/PIG-4634 > Project: Pig > Issue Type: Sub-task > Components: spark >Reporter: Xianda Ke >Assignee: Xianda Ke > Fix For: spark-branch > > Attachments: PIG-4634-3.patch, PIG-4634.patch, PIG-4634_2.patch > > > Test cases simpleTest() and simpleTest2() in TestPigRunner failed, caused by the > following issues: > 1. pig context in SparkPigStats isn't initialized. > 2. the records count logic hasn't been implemented. > 3. getOutputAlias(), getPigProperties(), getBytesWritten() and > getRecordWritten() have not been implemented. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PIG-4661) Fix UT failures in TestPigServerLocal
[ https://issues.apache.org/jira/browse/PIG-4661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14720688#comment-14720688 ] Mohit Sabharwal commented on PIG-4661: -- +1 (non-binding) Thanks, [~kexianda] Fix UT failures in TestPigServerLocal - Key: PIG-4661 URL: https://issues.apache.org/jira/browse/PIG-4661 Project: Pig Issue Type: Sub-task Components: spark Reporter: Xianda Ke Assignee: Xianda Ke Fix For: spark-branch Attachments: PIG-4661.patch testcase org.apache.pig.test.TestPigServerLocal.testSkipParseInRegisterForBatch failed in spark mode -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PIG-4634) Fix records count issues in output statistics
[ https://issues.apache.org/jira/browse/PIG-4634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14720863#comment-14720863 ] Mohit Sabharwal commented on PIG-4634: -- [~kexianda] could you create an RB request for this, please? Fix records count issues in output statistics - Key: PIG-4634 URL: https://issues.apache.org/jira/browse/PIG-4634 Project: Pig Issue Type: Sub-task Components: spark Reporter: Xianda Ke Assignee: Xianda Ke Fix For: spark-branch Attachments: PIG-4634-3.patch, PIG-4634.patch, PIG-4634_2.patch Test cases simpleTest() and simpleTest2() in TestPigRunner failed, caused by the following issues: 1. pig context in SparkPigStats isn't initialized. 2. the records count logic hasn't been implemented. 3. getOutputAlias(), getPigProperties(), getBytesWritten() and getRecordWritten() have not been implemented. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PIG-4655) Support InputStats in spark mode
[ https://issues.apache.org/jira/browse/PIG-4655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14720861#comment-14720861 ] Mohit Sabharwal commented on PIG-4655: -- [~kexianda] could you please create an RB request for this? Support InputStats in spark mode Key: PIG-4655 URL: https://issues.apache.org/jira/browse/PIG-4655 Project: Pig Issue Type: Sub-task Components: spark Reporter: Xianda Ke Assignee: Xianda Ke Fix For: spark-branch Attachments: PIG-4655.patch Currently, InputStats is not implemented in spark mode. The JUnit case TestPigRunner.testEmptyFileCounter() will fail. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PIG-4655) Support InputStats in spark mode
[ https://issues.apache.org/jira/browse/PIG-4655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14720869#comment-14720869 ] Mohit Sabharwal commented on PIG-4655: -- Please move this to the top of the class for consistency: {code} +private String counterGroupName; +private String counterName; +private SparkCounters sparkCounters; {code} Also, shouldn't addInputInfoForSparkOper be in SparkJobStats for consistency ? Support InputStats in spark mode Key: PIG-4655 URL: https://issues.apache.org/jira/browse/PIG-4655 Project: Pig Issue Type: Sub-task Components: spark Reporter: Xianda Ke Assignee: Xianda Ke Fix For: spark-branch Attachments: PIG-4655.patch Currently, InputStats is not implemented in spark mode. The JUnit case TestPigRunner.testEmptyFileCounter() will fail. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PIG-4659) Fix unit test failures in org.apache.pig.test.TestScriptLanguageJavaScript
[ https://issues.apache.org/jira/browse/PIG-4659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14702402#comment-14702402 ] Mohit Sabharwal commented on PIG-4659: -- Thanks, [~kexianda]. +1 (non-binding) Fix unit test failures in org.apache.pig.test.TestScriptLanguageJavaScript -- Key: PIG-4659 URL: https://issues.apache.org/jira/browse/PIG-4659 Project: Pig Issue Type: Sub-task Components: spark Reporter: kexianda Assignee: kexianda Fix For: spark-branch Attachments: PIG-4659.patch Failed testcase: org.apache.pig.test.TestScriptLanguageJavaScript.testTC Error Message: can't evaluate main: main(); Stacktrace java.lang.RuntimeException: can't evaluate main: main(); at org.apache.pig.scripting.js.JsScriptEngine.jsEval(JsScriptEngine.java:135) at org.apache.pig.scripting.js.JsScriptEngine.main(JsScriptEngine.java:223) at org.apache.pig.scripting.ScriptEngine.run(ScriptEngine.java:300) at org.apache.pig.test.TestScriptLanguageJavaScript.testTC(TestScriptLanguageJavaScript.java:149) Caused by: org.mozilla.javascript.EcmaError: TypeError: Cannot call method getNumberRecords of null -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PIG-4645) Support hadoop-like Counter using spark accumulator
[ https://issues.apache.org/jira/browse/PIG-4645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681119#comment-14681119 ] Mohit Sabharwal commented on PIG-4645: -- Thanks, [~kexianda]. LGTM. +1 (non-binding) Support hadoop-like Counter using spark accumulator --- Key: PIG-4645 URL: https://issues.apache.org/jira/browse/PIG-4645 Project: Pig Issue Type: Sub-task Components: spark Reporter: kexianda Assignee: kexianda Fix For: spark-branch Attachments: PIG-4645.patch Pig collects input/output statistic info via Counters in MR/Tez mode; we need to support this using Spark accumulators. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PIG-4645) Support hadoop-like Counter using spark accumulator
[ https://issues.apache.org/jira/browse/PIG-4645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662738#comment-14662738 ] Mohit Sabharwal commented on PIG-4645: -- Thanks, [~kexianda], I was wondering if we could use the built-in [LongAccumulatorParam|http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.AccumulatorParam$$LongAccumulatorParam$]? But it looks like there are issues with using it, according to [this|http://apache-spark-user-list.1001560.n3.nabble.com/How-in-Java-do-I-create-an-Accumulator-of-type-Long-td18779.html] thread. I assume that is why you implemented LongAccumulatorParam? Support hadoop-like Counter using spark accumulator --- Key: PIG-4645 URL: https://issues.apache.org/jira/browse/PIG-4645 Project: Pig Issue Type: Sub-task Components: spark Reporter: kexianda Assignee: kexianda Fix For: spark-branch Attachments: PIG-4645.patch Pig collects input/output statistic info via Counters in MR/Tez mode; we need to support this using Spark accumulators. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
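For readers unfamiliar with the accumulator contract under discussion, here is a hypothetical plain-Python sketch of the zero/addInPlace shape that a custom LongAccumulatorParam implements; the names are illustrative, and Spark's real interface lives in Scala/Java.

```python
# Hypothetical sketch of the zero/addInPlace contract that a custom
# LongAccumulatorParam implements; names are illustrative, not Spark's API.
class LongAccumulatorParam:
    def zero(self, initial_value):
        # identity element for merging partial counts
        return 0

    def add_in_place(self, v1, v2):
        # must be associative and commutative so partial results from tasks
        # can be merged in any order on the driver
        return v1 + v2

param = LongAccumulatorParam()
acc = param.zero(0)
for task_count in [3, 5, 2]:  # partial record counts reported by tasks
    acc = param.add_in_place(acc, task_count)
# acc == 10
```

This is the same shape Hadoop Counters rely on: each task contributes a partial value, and the driver folds them with an associative merge.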
[jira] [Commented] (PIG-4594) Enable TestMultiQuery in spark mode
[ https://issues.apache.org/jira/browse/PIG-4594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14625707#comment-14625707 ] Mohit Sabharwal commented on PIG-4594: -- The general approach here seems reasonable to me and is in line with what is being done for Tez and MR. I'm not sure about the need for forceConnect and connect methods though... [~kellyzly], why don't we see the "This operator does not support multiple outputs" exception with Tez or MR (when we merge operators for those engines)? That wasn't clear to me. +1 (non-binding) on this patch. We can address any changes in future patches -- since those don't seem like blockers in making progress on this feature. Enable TestMultiQuery in spark mode - Key: PIG-4594 URL: https://issues.apache.org/jira/browse/PIG-4594 Project: Pig Issue Type: Sub-task Components: spark Reporter: liyunzhang_intel Assignee: liyunzhang_intel Fix For: spark-branch Attachments: PIG-4594.patch, PIG-4594_1.patch, PIG-4594_2.patch In https://builds.apache.org/job/Pig-spark/211/#showFailuresLink, it shows that the following unit tests fail: org.apache.pig.test.TestMultiQuery.testMultiQueryJiraPig1068 org.apache.pig.test.TestMultiQuery.testMultiQueryJiraPig1157 org.apache.pig.test.TestMultiQuery.testMultiQueryJiraPig1252 org.apache.pig.test.TestMultiQuery.testMultiQueryJiraPig1438 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PIG-4633) Update hadoop version to enable Spark output statistics
[ https://issues.apache.org/jira/browse/PIG-4633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14625712#comment-14625712 ] Mohit Sabharwal commented on PIG-4633: -- Thanks, [~kexianda], +1 (non-binding). Could you please paste the exception you saw on this jira? Thanks! Update hadoop version to enable Spark output statistics --- Key: PIG-4633 URL: https://issues.apache.org/jira/browse/PIG-4633 Project: Pig Issue Type: Sub-task Components: spark Reporter: kexianda Assignee: kexianda Fix For: spark-branch Attachments: PIG-4633.patch Spark has supported output statistics since 1.3.0 ([SPARK-3179. Add task OutputMetrics|https://issues.apache.org/jira/browse/SPARK-3179]) {code:title=SparkHadoopUtil.scala|borderStyle=solid} stats.map(Utils.invoke(classOf[Statistics], _, getThreadStatistics)) {code} Spark invokes Hadoop's getThreadStatistics method. But this method was only added to Hadoop in version 2.5.0 ([HADOOP-10688|https://issues.apache.org/jira/browse/HADOOP-10688]). The version of hadoop in ivy/libraries.properties should be 2.5.0+ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
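The version dependency above can be illustrated with a plain-Python sketch of the reflective pattern involved: look the method up at runtime and degrade gracefully when it is absent. The classes below are hypothetical stand-ins, not the real Hadoop or Spark types.

```python
# Hypothetical Python stand-ins (not the real Hadoop/Spark classes) for the
# reflective pattern behind the version requirement: look the method up at
# runtime and degrade gracefully when it does not exist.
class OldStatistics:  # models FileSystem.Statistics before Hadoop 2.5.0
    pass

class NewStatistics:  # models Hadoop >= 2.5.0, after HADOOP-10688
    def getThreadStatistics(self):
        return {"bytesWritten": 1024}

def try_get_thread_statistics(stats):
    method = getattr(stats, "getThreadStatistics", None)
    return method() if callable(method) else None  # None: no output metrics

old_result = try_get_thread_statistics(OldStatistics())  # None on old Hadoop
new_result = try_get_thread_statistics(NewStatistics())  # per-thread metrics
```

With a pre-2.5.0 Hadoop on the classpath, the lookup yields nothing and output statistics silently stay empty, which is why the ivy version bump is needed.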
[jira] [Commented] (PIG-4633) Update hadoop version to enable Spark output statistics
[ https://issues.apache.org/jira/browse/PIG-4633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14625669#comment-14625669 ] Mohit Sabharwal commented on PIG-4633: -- Thanks, [~kexianda]. Just curious - how did you discover this ? Was there an exception in the log ... or was some unit test failing ? Update hadoop version to enable Spark output statistics --- Key: PIG-4633 URL: https://issues.apache.org/jira/browse/PIG-4633 Project: Pig Issue Type: Sub-task Components: spark Reporter: kexianda Assignee: kexianda Fix For: spark-branch Attachments: PIG-4633.patch Spark support output statistics from 1.3.0 ([SPARK-3179. Add task OutputMetrics|https://issues.apache.org/jira/browse/SPARK-3179]) {code:title=SparkHadoopUtil.scala|borderStyle=solid} stats.map(Utils.invoke(classOf[Statistics], _, getThreadStatistics)) {code} Spark invoke hadoop's function getThreadStatistics. But, this method was added into hadoop from version 2.5.0 ([HADOOP-10688|https://issues.apache.org/jira/browse/HADOOP-10688]) The version of hadoop in ivy/libraries.properties should be 2.5.0 + -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PIG-4633) Update hadoop version to enable Spark output statistics
[ https://issues.apache.org/jira/browse/PIG-4633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mohit Sabharwal updated PIG-4633: - Summary: Update hadoop version to enable Spark output statistics (was: fix libaray version to enable output statistics for Pig on spark) Update hadoop version to enable Spark output statistics --- Key: PIG-4633 URL: https://issues.apache.org/jira/browse/PIG-4633 Project: Pig Issue Type: Sub-task Components: spark Reporter: kexianda Assignee: kexianda Fix For: spark-branch Attachments: PIG-4633.patch Spark support output statistics from 1.3.0 ([SPARK-3179. Add task OutputMetrics|https://issues.apache.org/jira/browse/SPARK-3179]) {code:title=SparkHadoopUtil.scala|borderStyle=solid} stats.map(Utils.invoke(classOf[Statistics], _, getThreadStatistics)) {code} Spark invoke hadoop's function getThreadStatistics. But, this method was added into hadoop from version 2.5.0 ([HADOOP-10688|https://issues.apache.org/jira/browse/HADOOP-10688]) The version of hadoop in ivy/libraries.properties should be 2.5.0 + -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PIG-4611) Fix remaining unit test failures about TestHBaseStorage
[ https://issues.apache.org/jira/browse/PIG-4611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14616247#comment-14616247 ] Mohit Sabharwal commented on PIG-4611: -- Thanks, [~kellyzly]. One more suggestion: Should we make your HBaseStorage change conditional on execution engine ? i.e. do the null check only for Spark engine. That way, we are not altering current MR engine behavior in any way. Fix remaining unit test failures about TestHBaseStorage - Key: PIG-4611 URL: https://issues.apache.org/jira/browse/PIG-4611 Project: Pig Issue Type: Sub-task Components: spark Reporter: liyunzhang_intel Assignee: liyunzhang_intel Fix For: spark-branch Attachments: PIG-4611.patch, PIG-4611_2.patch In https://builds.apache.org/job/Pig-spark/lastCompletedBuild/testReport/, it shows following unit test failures about TestHBaseStorage: org.apache.pig.test.TestHBaseStorage.testStoreToHBase_1_with_delete org.apache.pig.test.TestHBaseStorage.testLoadWithProjection_1 org.apache.pig.test.TestHBaseStorage.testLoadWithProjection_2 org.apache.pig.test.TestHBaseStorage.testStoreToHBase_2_with_projection org.apache.pig.test.TestHBaseStorage.testCollectedGroup org.apache.pig.test.TestHBaseStorage.testHeterogeneousScans -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PIG-4611) Fix remaining unit test failures about TestHBaseStorage
[ https://issues.apache.org/jira/browse/PIG-4611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615915#comment-14615915 ] Mohit Sabharwal commented on PIG-4611: -- Thanks for the explanation and addressing this issue, [~kellyzly]!!! Let me know if I understand this correctly: 1) Spark Executor will serialize all objects referenced in supplied closures. Since UDFContext.getUDFContext() is not initialized (because Spark does not expose a setup() interface like MR), we always default defaultCaster to STRING_CASTER. 2) However later on, in the *same* Executor thread, the record reader creation will correctly deserialize the UDFContext from JobConf (PigInputFormatSpark.createRecordReader -> PigInputFormat.createRecordReader -> MapRedUtil.setupUDFContext -> UDFContext.deserialize) 3) Next, in the same Executor thread, when HBaseStorage is initialized by the load function, it will find a correctly populated UDFContext. This sounds reasonable to me. Since this is a core change, could you please add comments to HBaseStorage.java explaining why we handle this as a special case for Spark ? 
I assume it is a typo, but you need -Dexectype argument to be {{spark}}, not {{TestHBaseStorage}} when running TestHBaseStorage: {code} ant test -Dhadoopversion=23 -Dtestcase=TestHBaseStorage -Dexectype=spark -DdebugPort= {code} Fix remaining unit test failures about TestHBaseStorage - Key: PIG-4611 URL: https://issues.apache.org/jira/browse/PIG-4611 Project: Pig Issue Type: Sub-task Components: spark Reporter: liyunzhang_intel Assignee: liyunzhang_intel Fix For: spark-branch Attachments: PIG-4611.patch In https://builds.apache.org/job/Pig-spark/lastCompletedBuild/testReport/, it shows following unit test failures about TestHBaseStorage: org.apache.pig.test.TestHBaseStorage.testStoreToHBase_1_with_delete org.apache.pig.test.TestHBaseStorage.testLoadWithProjection_1 org.apache.pig.test.TestHBaseStorage.testLoadWithProjection_2 org.apache.pig.test.TestHBaseStorage.testStoreToHBase_2_with_projection org.apache.pig.test.TestHBaseStorage.testCollectedGroup org.apache.pig.test.TestHBaseStorage.testHeterogeneousScans -- This message was sent by Atlassian JIRA (v6.3.4#6332)
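The three-step flow described in the comment above can be sketched as follows: a hypothetical plain-Python model of per-thread UDFContext state, not Pig's actual implementation. The key names are illustrative of `pig.hbase.caster`-style properties.

```python
import threading

# Hypothetical Python model of the flow above (not Pig's real classes): the
# per-thread UDFContext starts empty in a Spark executor because there is no
# MR-style setup() hook, so lookups fall back to the default caster until the
# record-reader path deserializes the context from the job conf.
_udf_context = threading.local()

def get_udf_context():
    if not hasattr(_udf_context, "props"):
        _udf_context.props = {}  # step 1: nothing deserialized yet
    return _udf_context.props

def setup_udf_context(job_conf):
    # step 2: stand-in for MapRedUtil.setupUDFContext / UDFContext.deserialize
    get_udf_context().update(job_conf.get("pig.udf.context", {}))

job_conf = {"pig.udf.context": {"pig.hbase.caster": "HBaseBinaryConverter"}}

# Before deserialization: the default caster wins (the bug's symptom).
caster_before = get_udf_context().get("pig.hbase.caster", "Utf8StorageConverter")
setup_udf_context(job_conf)
# Step 3: the load function now sees the correctly populated context.
caster_after = get_udf_context().get("pig.hbase.caster", "Utf8StorageConverter")
```

The workaround discussed in this thread amounts to tolerating the first lookup returning the default and relying on the later, correctly populated lookup in the same thread.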
[jira] [Commented] (PIG-4622) Skip TestCubeOperator.testIllustrate and TestMultiQueryLocal.testMultiQueryWithIllustrate
[ https://issues.apache.org/jira/browse/PIG-4622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615874#comment-14615874 ] Mohit Sabharwal commented on PIG-4622: -- Thanks, [~kellyzly]. +1 (non-binding) Skip TestCubeOperator.testIllustrate and TestMultiQueryLocal.testMultiQueryWithIllustrate - Key: PIG-4622 URL: https://issues.apache.org/jira/browse/PIG-4622 Project: Pig Issue Type: Sub-task Components: spark Reporter: liyunzhang_intel Assignee: liyunzhang_intel Fix For: spark-branch Attachments: PIG-4622.patch In https://builds.apache.org/job/Pig-spark/236/#showFailuresLink, the following two unit tests fail: TestCubeOperator.testIllustrate and TestMultiQueryLocal.testMultiQueryWithIllustrate This is because we currently don't support illustrate in spark mode (see PIG-4621). Why do these two unit tests fail after PIG-4614_1.patch was merged to the branch? In PIG-4614_1.patch, we edited [SparkExecutionEngine #instantiateScriptState|https://github.com/apache/pig/blob/a0bea12c3d5600a4c3137a8d05c054d10430b1ce/src/org/apache/pig/backend/hadoop/executionengine/spark/SparkExecutionEngine.java#L37]. When running the following script (illustrate.pig) with illustrate: {code} a = load 'test/org/apache/pig/test/data/passwd' using PigStorage(':') as (uname:chararray, passwd:chararray, uid:int,gid:int); b = filter a by uid > 5; illustrate b; store b into './testMultiQueryWithIllustrate.out'; {code} the exception is thrown at [MRScriptState.get|https://github.com/apache/pig/blob/spark/src/org/apache/pig/tools/pigstats/mapreduce/MRScriptState.java#L67]: java.lang.ClassCastException: org.apache.pig.tools.pigstats.spark.SparkScriptState cannot be cast to org.apache.pig.tools.pigstats.mapreduce.MRScriptState. 
stacktrace: {code} java.lang.ClassCastException: org.apache.pig.tools.pigstats.spark.SparkScriptState cannot be cast to org.apache.pig.tools.pigstats.mapreduce.MRScriptState at org.apache.pig.tools.pigstats.mapreduce.MRScriptState.get(MRScriptState.java:67) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getJob(JobControlCompiler.java:512) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.compile(JobControlCompiler.java:327) at org.apache.pig.pen.LocalMapReduceSimulator.launchPig(LocalMapReduceSimulator.java:110) at org.apache.pig.pen.ExampleGenerator.getData(ExampleGenerator.java:259) at org.apache.pig.pen.ExampleGenerator.readBaseData(ExampleGenerator.java:223) at org.apache.pig.pen.ExampleGenerator.getExamples(ExampleGenerator.java:155) at org.apache.pig.PigServer.getExamples(PigServer.java:1305) at org.apache.pig.tools.grunt.GruntParser.processIllustrate(GruntParser.java:812) at org.apache.pig.tools.pigscript.parser.PigScriptParser.Illustrate(PigScriptParser.java:818) at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:385) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:230) at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205) at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81) at org.apache.pig.Main.run(Main.java:624) at org.apache.pig.Main.main(Main.java:170) at sun.reflect.NativeMethodAccessorImpl.invoke0(NativeMethodAccessorImpl.java:-1) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.util.RunJar.main(RunJar.java:212) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PIG-4619) Cleanup: change the indent size of some files of pig on spark project from 2 to 4 space
[ https://issues.apache.org/jira/browse/PIG-4619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14613115#comment-14613115 ] Mohit Sabharwal commented on PIG-4619: -- Thanks, [~kellyzly] +1 (non-binding). Cleanup: change the indent size of some files of pig on spark project from 2 to 4 space --- Key: PIG-4619 URL: https://issues.apache.org/jira/browse/PIG-4619 Project: Pig Issue Type: Sub-task Components: spark Reporter: liyunzhang_intel Assignee: liyunzhang_intel Fix For: spark-branch Attachments: PIG-4619.patch, indentSize.png The following files under the pig on spark project use 2-space indent: org.apache.pig.backend.hadoop.executionengine.spark.converter.CollectedGroupConverter org.apache.pig.backend.hadoop.executionengine.spark.JobMetricsListener org.apache.pig.backend.hadoop.executionengine.spark.SparkLocalExecType All the files under this project should use 4-space indent. Besides, SparkLauncher.java uses tabs instead of spaces. We don't use tabs in any file in this project, so this file needs to be changed as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PIG-4613) Fix unit test failures about TestAssert
[ https://issues.apache.org/jira/browse/PIG-4613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14613113#comment-14613113 ] Mohit Sabharwal commented on PIG-4613: -- Thanks, [~kexianda], [~kellyzly], LGTM +1 (non-binding) Fix unit test failures about TestAssert --- Key: PIG-4613 URL: https://issues.apache.org/jira/browse/PIG-4613 Project: Pig Issue Type: Sub-task Components: spark Reporter: kexianda Assignee: kexianda Fix For: spark-branch Attachments: PIG-4613.patch UT failed at following cases: org.apache.pig.test.TestAssert.testNegativeWithoutFetch org.apache.pig.test.TestAssert.testNegative -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PIG-4613) Fix unit test failures about TestAssert
[ https://issues.apache.org/jira/browse/PIG-4613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611392#comment-14611392 ] Mohit Sabharwal commented on PIG-4613: -- Thanks, [~kexianda]. LGTM. Just to be safe, do you think we should check for the different error message conditioned on Spark engine ? i.e. expect Job terminated with anomalous status FAILED for non-Spark and expect i should be greater than 1 for Spark. That way, we're not changing the testcase for MR and Tez... Fix unit test failures about TestAssert --- Key: PIG-4613 URL: https://issues.apache.org/jira/browse/PIG-4613 Project: Pig Issue Type: Sub-task Components: spark Reporter: kexianda Assignee: kexianda Fix For: spark-branch Attachments: PIG-4613.patch UT failed at following cases: org.apache.pig.test.TestAssert.testNegativeWithoutFetch org.apache.pig.test.TestAssert.testNegative -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PIG-4613) Fix unit test failures about TestAssert
[ https://issues.apache.org/jira/browse/PIG-4613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611428#comment-14611428 ] Mohit Sabharwal commented on PIG-4613: -- My vote is for 2) since Spark engine gives more info about the underlying problem. Fix unit test failures about TestAssert --- Key: PIG-4613 URL: https://issues.apache.org/jira/browse/PIG-4613 Project: Pig Issue Type: Sub-task Components: spark Reporter: kexianda Assignee: kexianda Fix For: spark-branch Attachments: PIG-4613.patch UT failed at following cases: org.apache.pig.test.TestAssert.testNegativeWithoutFetch org.apache.pig.test.TestAssert.testNegative -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PIG-4594) Enable TestMultiQuery in spark mode
[ https://issues.apache.org/jira/browse/PIG-4594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611416#comment-14611416 ] Mohit Sabharwal commented on PIG-4594: -- Thanks, [~kellyzly]. Could you give more details about why you need to add the forceConnect method to PhysicalPlan and OperatorPlan? Enable TestMultiQuery in spark mode - Key: PIG-4594 URL: https://issues.apache.org/jira/browse/PIG-4594 Project: Pig Issue Type: Sub-task Components: spark Reporter: liyunzhang_intel Assignee: liyunzhang_intel Fix For: spark-branch Attachments: PIG-4594.patch, PIG-4594_1.patch in https://builds.apache.org/job/Pig-spark/211/#showFailuresLink, it shows that the following unit tests fail: org.apache.pig.test.TestMultiQuery.testMultiQueryJiraPig1068 org.apache.pig.test.TestMultiQuery.testMultiQueryJiraPig1157 org.apache.pig.test.TestMultiQuery.testMultiQueryJiraPig1252 org.apache.pig.test.TestMultiQuery.testMultiQueryJiraPig1438 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PIG-4611) Fix remaining unit test failures about TestHBaseStorage
[ https://issues.apache.org/jira/browse/PIG-4611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14611367#comment-14611367 ] Mohit Sabharwal commented on PIG-4611: -- Thanks, [~kellyzly], this looks like a reasonable workaround to the UDFContext issue, where it is not initialized in Spark executor threads. However, I'm not sure whether it is the right thing to do in the case where pig.hbase.caster is set by the user. i.e. for the Spark engine, with your workaround, HBaseStorage will always use the default caster (i.e. Utf8StorageConverter). It will never use HBaseBinaryConverter or any other option. Fix remaining unit test failures about TestHBaseStorage - Key: PIG-4611 URL: https://issues.apache.org/jira/browse/PIG-4611 Project: Pig Issue Type: Sub-task Components: spark Reporter: liyunzhang_intel Assignee: liyunzhang_intel Fix For: spark-branch Attachments: PIG-4611.patch In https://builds.apache.org/job/Pig-spark/lastCompletedBuild/testReport/, it shows the following unit test failures about TestHBaseStorage: org.apache.pig.test.TestHBaseStorage.testStoreToHBase_1_with_delete org.apache.pig.test.TestHBaseStorage.testLoadWithProjection_1 org.apache.pig.test.TestHBaseStorage.testLoadWithProjection_2 org.apache.pig.test.TestHBaseStorage.testStoreToHBase_2_with_projection org.apache.pig.test.TestHBaseStorage.testCollectedGroup org.apache.pig.test.TestHBaseStorage.testHeterogeneousScans -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PIG-4614) Enable TestLocationInPhysicalPlan in spark mode
[ https://issues.apache.org/jira/browse/PIG-4614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14609567#comment-14609567 ] Mohit Sabharwal commented on PIG-4614: -- Thanks, [~kellyzly]! +1 (non-binding) Enable TestLocationInPhysicalPlan in spark mode - Key: PIG-4614 URL: https://issues.apache.org/jira/browse/PIG-4614 Project: Pig Issue Type: Sub-task Components: spark Reporter: liyunzhang_intel Assignee: liyunzhang_intel Fix For: spark-branch Attachments: PIG-4614.patch, PIG-4614_1.patch in https://builds.apache.org/job/Pig-spark/228/#showFailuresLink, it shows following unit test fails: org.apache.pig.newplan.logical.relational.TestLocationInPhysicalPlan.test expected:M: A[1,4],A[3,4],B[2,4] C: A[3,4],B[2,4] R: A[3,4] but was:null -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PIG-4059) Pig on Spark
[ https://issues.apache.org/jira/browse/PIG-4059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mohit Sabharwal updated PIG-4059: - Attachment: Pig-on-Spark-Scope.pdf Pig on Spark Key: PIG-4059 URL: https://issues.apache.org/jira/browse/PIG-4059 Project: Pig Issue Type: New Feature Components: spark Reporter: Rohini Palaniswamy Assignee: Praveen Rachabattuni Labels: spork Fix For: spark-branch Attachments: Pig-on-Spark-Design-Doc.pdf, Pig-on-Spark-Scope.pdf Setting up your development environment: 1. Check out Pig Spark branch. 2. Build Pig by running ant jar and ant -Dhadoopversion=23 jar for hadoop-2.x versions 3. Configure these environmental variables: export HADOOP_USER_CLASSPATH_FIRST=true export SPARK_MASTER=local 4. Run Pig with -x spark option. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PIG-4615) Fix null keys join in SkewedJoin in spark mode
[ https://issues.apache.org/jira/browse/PIG-4615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14606893#comment-14606893 ] Mohit Sabharwal commented on PIG-4615: -- Thanks, [~kellyzly]! LGTM. +1 (non-binding) Fix null keys join in SkewedJoin in spark mode -- Key: PIG-4615 URL: https://issues.apache.org/jira/browse/PIG-4615 Project: Pig Issue Type: Sub-task Components: spark Reporter: liyunzhang_intel Assignee: liyunzhang_intel Fix For: spark-branch Attachments: PIG-4615.patch Let's use an example to explain the problem: testSkewedJoinNullKeys.pig: {code} A = LOAD './SkewedJoinInput5.txt' as (id,name); B = LOAD './SkewedJoinInput5.txt' as (id,name); C = join A by id, B by id using 'skewed'; store C into './testSkewedJoinNullKeys.out'; {code} cat SkewedJoinInput5.txt {code} apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 100 apple2 orange1 orange1 orange1 orange1 orange1 orange1 orange1 orange1 orange1 orange1 100 {code} the result of mr: {code} 100 apple2 100 apple2 100 apple2 100 100 100 apple2 100 100 {code} The result of spark: {code} cat testSkewedJoinNullKeys.out.spark/part-r-0 100 apple2 100 apple2 100 apple2 100 100 100 apple2 100 100 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 orange1 apple1 orange1 apple1 orange1 apple1 orange1 apple1 orange1 apple1 orange1 apple1 orange1 apple1 orange1 apple1 orange1 apple1 orange1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 orange1 apple1 orange1 apple1 orange1 apple1 orange1 apple1 orange1 apple1 orange1 apple1 orange1 apple1 orange1 apple1 orange1 apple1 orange1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 orange1 apple1 orange1 apple1 orange1 
apple1 orange1 apple1 orange1 apple1 orange1 apple1 orange1 apple1 orange1 apple1 orange1 apple1 orange1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 orange1 apple1 orange1 apple1 orange1 apple1 orange1 apple1 orange1 apple1 orange1 apple1 orange1 apple1 orange1 apple1 orange1 apple1 orange1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 orange1 apple1 orange1 apple1 orange1 apple1 orange1 apple1 orange1 apple1 orange1 apple1 orange1 apple1 orange1 apple1 orange1 apple1 orange1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1 apple1
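The null-key semantics behind PIG-4615 can be sketched in plain Java (class and method names below are illustrative, not Pig's actual implementation): in SQL-style join semantics, a null key never matches anything, not even another null key, so rows with a null join key must be dropped from the output. This would explain why the MR result above contains only the rows keyed by 100, while the broken Spark result also joins the remaining rows against each other.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of null-safe inner-join semantics. Each row is a
// String[2] of {key, value}; rows with a null key never join.
public class NullKeyJoinSketch {
    public static List<String[]> join(List<String[]> left, List<String[]> right) {
        List<String[]> out = new ArrayList<>();
        for (String[] l : left) {
            if (l[0] == null) continue;          // a null left key matches nothing
            for (String[] r : right) {
                if (l[0].equals(r[0])) {         // a null right key fails equals()
                    out.add(new String[]{l[0], l[1], r[1]});
                }
            }
        }
        return out;
    }
}
```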
[jira] [Commented] (PIG-4607) Enable TestRank1,TestRank3 unit tests in spark mode
[ https://issues.apache.org/jira/browse/PIG-4607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14606887#comment-14606887 ] Mohit Sabharwal commented on PIG-4607: -- Thanks, [~kexianda]! +1 (non-binding) Enable TestRank1,TestRank3 unit tests in spark mode --- Key: PIG-4607 URL: https://issues.apache.org/jira/browse/PIG-4607 Project: Pig Issue Type: Sub-task Components: spark Reporter: liyunzhang_intel Assignee: kexianda Fix For: spark-branch Attachments: PIG-4607.patch In https://builds.apache.org/job/Pig-spark/216/#showFailuresLink, unit tests about TestRank1, TestRank3: org.apache.pig.test.TestRank1.testRank02RowNumber org.apache.pig.test.TestRank1.testRank01RowNumber org.apache.pig.test.TestRank3.testRankWithSplitInMap org.apache.pig.test.TestRank3.testRankWithSplitInReduce org.apache.pig.test.TestRank3.testRankCascade -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PIG-4614) Enable TestLocationInPhysicalPlan in spark mode
[ https://issues.apache.org/jira/browse/PIG-4614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mohit Sabharwal updated PIG-4614: - Description: in https://builds.apache.org/job/Pig-spark/228/#showFailuresLink, it shows following unit test fails: org.apache.pig.newplan.logical.relational.TestLocationInPhysicalPlan.test expected:M: A[1,4],A[3,4],B[2,4] C: A[3,4],B[2,4] R: A[3,4] but was:null was: in https://builds.apache.org/job/Pig-spark/228/#showFailuresLink, it shows following unit test fails: org.apache.pig.newplan.logical.relational.TestLocationInPhysicalPlan.test Enable TestLocationInPhysicalPlan in spark mode - Key: PIG-4614 URL: https://issues.apache.org/jira/browse/PIG-4614 Project: Pig Issue Type: Sub-task Components: spark Reporter: liyunzhang_intel Assignee: liyunzhang_intel Fix For: spark-branch Attachments: PIG-4614.patch in https://builds.apache.org/job/Pig-spark/228/#showFailuresLink, it shows following unit test fails: org.apache.pig.newplan.logical.relational.TestLocationInPhysicalPlan.test expected:M: A[1,4],A[3,4],B[2,4] C: A[3,4],B[2,4] R: A[3,4] but was:null -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PIG-4614) Enable TestLocationInPhysicalPlan in spark mode
[ https://issues.apache.org/jira/browse/PIG-4614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14606955#comment-14606955 ] Mohit Sabharwal commented on PIG-4614: -- Thanks, [~kellyzly], I had a question on review board. Enable TestLocationInPhysicalPlan in spark mode - Key: PIG-4614 URL: https://issues.apache.org/jira/browse/PIG-4614 Project: Pig Issue Type: Sub-task Components: spark Reporter: liyunzhang_intel Assignee: liyunzhang_intel Fix For: spark-branch Attachments: PIG-4614.patch in https://builds.apache.org/job/Pig-spark/228/#showFailuresLink, it shows following unit test fails: org.apache.pig.newplan.logical.relational.TestLocationInPhysicalPlan.test expected:M: A[1,4],A[3,4],B[2,4] C: A[3,4],B[2,4] R: A[3,4] but was:null -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PIG-4594) Enable TestMultiQuery in spark mode
[ https://issues.apache.org/jira/browse/PIG-4594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604546#comment-14604546 ] Mohit Sabharwal commented on PIG-4594: -- Thanks, [~kellyzly]! In case 3 above (multiple splitees), looks like we could use {{RDD.cache()}} to cache the output of {{b}} in your example. Otherwise, since each Store corresponds to a Spark action, the entire RDD lineage will be computed twice, once for each Store. Enable TestMultiQuery in spark mode - Key: PIG-4594 URL: https://issues.apache.org/jira/browse/PIG-4594 Project: Pig Issue Type: Sub-task Components: spark Reporter: liyunzhang_intel Assignee: liyunzhang_intel Fix For: spark-branch Attachments: PIG-4594.patch, PIG-4594_1.patch in https://builds.apache.org/job/Pig-spark/211/#showFailuresLink, it shows that the following unit tests fail: org.apache.pig.test.TestMultiQuery.testMultiQueryJiraPig1068 org.apache.pig.test.TestMultiQuery.testMultiQueryJiraPig1157 org.apache.pig.test.TestMultiQuery.testMultiQueryJiraPig1252 org.apache.pig.test.TestMultiQuery.testMultiQueryJiraPig1438 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
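The recomputation issue described above can be illustrated without a Spark dependency: in the plain-Java sketch below (all names illustrative), get() plays the role of a Spark action and the memoizing wrapper plays the role of {{RDD.cache()}}. Without the wrapper, each of the two Store actions would re-run the whole lineage; with it, the split output is computed once and reused.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Analogy for RDD.cache(): two "actions" (get() calls) against an uncached
// lineage recompute it each time; a cached one computes it exactly once.
public class CacheSketch {
    public static final AtomicInteger computations = new AtomicInteger();

    // The "lineage": an expensive computation we want to avoid repeating.
    public static final Supplier<String> lineage = () -> {
        computations.incrementAndGet();
        return "split output of b";
    };

    // Memoizing wrapper, analogous to calling cache() on the splitter's RDD.
    public static Supplier<String> cached(Supplier<String> s) {
        return new Supplier<String>() {
            private String value;
            public synchronized String get() {
                if (value == null) value = s.get();
                return value;
            }
        };
    }
}
```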
[jira] [Commented] (PIG-4607) Enable TestRank1,TestRank3 unit tests in spark mode
[ https://issues.apache.org/jira/browse/PIG-4607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14603803#comment-14603803 ] Mohit Sabharwal commented on PIG-4607: -- Thanks for the explanation, [~kexianda]! And thanks for fixing the verifyExpected bug! Code LGTM. I have a minor comment to preserve consistency, since we are changing non-Spark-related code: other Pig testcases that use {{checkQueryOutputsAfterSort}} follow this pattern: {code} List<Tuple> expectedResults = Util.getTuplesFromConstantTupleStrings( new String[] { "((1,'a'),(1,'b'))", "((2,'aa'),(2,'bb'))" }); Util.checkQueryOutputsAfterSort(it, expectedResults); {code} For consistency, we should use {{Util.getTuplesFromConstantTupleStrings}} instead of creating a Tuple[] and then converting it to a List. Enable TestRank1,TestRank3 unit tests in spark mode --- Key: PIG-4607 URL: https://issues.apache.org/jira/browse/PIG-4607 Project: Pig Issue Type: Sub-task Components: spark Reporter: liyunzhang_intel Assignee: kexianda Fix For: spark-branch Attachments: PIG-4607.patch In https://builds.apache.org/job/Pig-spark/216/#showFailuresLink, unit tests about TestRank1, TestRank3: org.apache.pig.test.TestRank1.testRank02RowNumber org.apache.pig.test.TestRank1.testRank01RowNumber org.apache.pig.test.TestRank3.testRankWithSplitInMap org.apache.pig.test.TestRank3.testRankWithSplitInReduce org.apache.pig.test.TestRank3.testRankCascade -- This message was sent by Atlassian JIRA (v6.3.4#6332)
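The idea behind comparing outputs after sorting — needed because the Spark engine does not guarantee row order out of a group — can be sketched as follows. This is an illustrative helper only, not Pig's actual Util.checkQueryOutputsAfterSort: compare the actual and expected results after sorting both, rather than positionally.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Order-insensitive result comparison: sort copies of both lists, then
// compare element by element.
public class SortedCompareSketch {
    public static boolean sameAfterSort(List<String> actual, List<String> expected) {
        List<String> a = new ArrayList<>(actual);
        List<String> e = new ArrayList<>(expected);
        Collections.sort(a);
        Collections.sort(e);
        return a.equals(e);
    }
}
```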
[jira] [Commented] (PIG-4610) Enable "TestOrcStorage" unit test in spark mode
[ https://issues.apache.org/jira/browse/PIG-4610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597045#comment-14597045 ] Mohit Sabharwal commented on PIG-4610: -- +1 (non-binding) Enable "TestOrcStorage" unit test in spark mode --- Key: PIG-4610 URL: https://issues.apache.org/jira/browse/PIG-4610 Project: Pig Issue Type: Sub-task Components: spark Reporter: liyunzhang_intel Assignee: liyunzhang_intel Fix For: spark-branch Attachments: PIG-4610.patch In https://builds.apache.org/job/Pig-spark/222/#showFailuresLink, it shows the following unit test failures about TestOrcStorage: org.apache.pig.builtin.TestOrcStorage.testJoinWithPruning org.apache.pig.builtin.TestOrcStorage.testLoadStoreMoreDataType org.apache.pig.builtin.TestOrcStorage.testMultiStore -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PIG-4607) Enable TestRank1,TestRank3 unit tests in spark mode
[ https://issues.apache.org/jira/browse/PIG-4607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14594189#comment-14594189 ] Mohit Sabharwal commented on PIG-4607: -- Looks like TestRank2 was not failing even without Rank/Counter in the Spark plan, which is strange: https://builds.apache.org/job/Pig-spark/lastCompletedBuild/testReport/ I was also looking at CounterConverter and didn't quite understand the purpose of maintaining two counters for every tuple (localCount and sparkCount) - one should work, right ? Enable TestRank1,TestRank3 unit tests in spark mode --- Key: PIG-4607 URL: https://issues.apache.org/jira/browse/PIG-4607 Project: Pig Issue Type: Sub-task Components: spark Reporter: liyunzhang_intel Assignee: kexianda Fix For: spark-branch Attachments: PIG-4607.patch In https://builds.apache.org/job/Pig-spark/216/#showFailuresLink, unit tests about TestRank1, TestRank3: org.apache.pig.test.TestRank1.testRank02RowNumber org.apache.pig.test.TestRank1.testRank01RowNumber org.apache.pig.test.TestRank3.testRankWithSplitInMap org.apache.pig.test.TestRank3.testRankWithSplitInReduce org.apache.pig.test.TestRank3.testRankCascade -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PIG-4607) Enable TestRank1,TestRank3 unit tests in spark mode
[ https://issues.apache.org/jira/browse/PIG-4607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14593762#comment-14593762 ] Mohit Sabharwal commented on PIG-4607: -- Thanks, [~kexianda] I discovered these missing operators in SparkPlan today as well :) Any idea why TestRank2 is failing ? Enable TestRank1,TestRank3 unit tests in spark mode --- Key: PIG-4607 URL: https://issues.apache.org/jira/browse/PIG-4607 Project: Pig Issue Type: Sub-task Components: spark Reporter: liyunzhang_intel Assignee: kexianda Fix For: spark-branch Attachments: PIG-4607.patch In https://builds.apache.org/job/Pig-spark/216/#showFailuresLink, unit tests about TestRank1, TestRank3: org.apache.pig.test.TestRank1.testRank02RowNumber org.apache.pig.test.TestRank1.testRank01RowNumber org.apache.pig.test.TestRank3.testRankWithSplitInMap org.apache.pig.test.TestRank3.testRankWithSplitInReduce org.apache.pig.test.TestRank3.testRankCascade -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PIG-4606) Enable TestDefaultDateTimeZone unit tests in spark mode
[ https://issues.apache.org/jira/browse/PIG-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14590386#comment-14590386 ] Mohit Sabharwal commented on PIG-4606: -- Thanks, [~kellyzly], the fix LGTM. While we're here, it might be good to refactor some code, because the launchPig logic is getting a bit crowded. For example, the startSparkJob() name is confusing: the job actually gets started inside the sparkPlanToRDD() method. It might be cleaner to create a new initialize() method and put all the initialization steps inside it: - saveUdfImporList - create and populate job conf - SchemaTupleBackend.initialize - read time zone from conf and set it. Then rename startSparkJob() to something like addFilesToSparkJob(SparkContext sc). What do you think? Enable TestDefaultDateTimeZone unit tests in spark mode - Key: PIG-4606 URL: https://issues.apache.org/jira/browse/PIG-4606 Project: Pig Issue Type: Sub-task Components: spark Reporter: liyunzhang_intel Assignee: liyunzhang_intel Fix For: spark-branch Attachments: PIG-4606.patch In https://builds.apache.org/job/Pig-spark/216/#showFailuresLink, unit tests about TestDefaultDateTimeZone fail: org.apache.pig.test.TestDefaultDateTimeZone.testDST org.apache.pig.test.TestDefaultDateTimeZone.testLocalExecution -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PIG-4606) Enable TestDefaultDateTimeZone unit tests in spark mode
[ https://issues.apache.org/jira/browse/PIG-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14591201#comment-14591201 ] Mohit Sabharwal commented on PIG-4606: -- Thank you so much, [~kellyzly]! +1 (non-binding) Enable TestDefaultDateTimeZone unit tests in spark mode - Key: PIG-4606 URL: https://issues.apache.org/jira/browse/PIG-4606 Project: Pig Issue Type: Sub-task Components: spark Reporter: liyunzhang_intel Assignee: liyunzhang_intel Fix For: spark-branch Attachments: PIG-4606.patch, PIG-4606_1.patch In https://builds.apache.org/job/Pig-spark/216/#showFailuresLink, unit tests about TestDefaultDateTimeZone fails: org.apache.pig.test.TestDefaultDateTimeZone.testDST org.apache.pig.test.TestDefaultDateTimeZone.testLocalExecution -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PIG-4604) Clean up: refactor the package import order in the files under pig/src/org/apache/pig/backend/hadoop/executionengine/spark according to certain rule
[ https://issues.apache.org/jira/browse/PIG-4604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14589075#comment-14589075 ] Mohit Sabharwal commented on PIG-4604: -- LGTM. +1 (non-binding) Clean up: refactor the package import order in the files under pig/src/org/apache/pig/backend/hadoop/executionengine/spark according to certain rule Key: PIG-4604 URL: https://issues.apache.org/jira/browse/PIG-4604 Project: Pig Issue Type: Sub-task Components: spark Reporter: liyunzhang_intel Assignee: liyunzhang_intel Fix For: spark-branch Attachments: IntelliJ_Java_codeStyle_Imports1.png, IntelliJ_Java_codeStyle_Imports2.png, PIG-4604.patch after discussion with [~mohitsabharwal],[~xuefuz],[~praveenr019], [~kexianda]: now we use following rule about the package import order in files under pig/src/org/apache/pig/backend/hadoop/executionengine/spark: 1. java.* and javax.* 2. blank line 3. scala.* 4. blank line 5. Project classes (org.apache.*) 6. blank line 7. Third party libraries (org.*, com.*, etc.) If you use IntelliJ as your IDE, you can reference the attachment to configure your import layout of your java code style: 1. Use IntelliJ 2. Select “File”-”Settings”-”Code Style”-”Java”-”Imports”-”Import Layout” Now the files under pig/src/org/apache/pig/backend/hadoop/executionengine/spark has different package import order. They should be in the same order. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
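Under the rule agreed above, a file's import section would look roughly like the following sketch. The package names other than java.util.List are shown as comments only and are illustrative examples, not imports taken from any particular Pig source file.

```java
// Group 1: java.* and javax.*
import java.util.List;

// (blank line)
// Group 3: scala.*
//   e.g. import scala.Tuple2;

// (blank line)
// Group 5: project classes (org.apache.*)
//   e.g. import org.apache.pig.impl.PigContext;

// (blank line)
// Group 7: third-party libraries (org.*, com.*, etc.)
//   e.g. import com.google.common.collect.Lists;

public class ImportOrderExample {
    List<String> names;   // uses the java.util import above
}
```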
[jira] [Created] (PIG-4601) Implement Merge CoGroup for Spark engine
Mohit Sabharwal created PIG-4601: Summary: Implement Merge CoGroup for Spark engine Key: PIG-4601 URL: https://issues.apache.org/jira/browse/PIG-4601 Project: Pig Issue Type: Sub-task Components: spark Affects Versions: spark-branch Reporter: Mohit Sabharwal Fix For: spark-branch Implement single-stage (map-side) co-group where all the input data sets are sorted by key: {code} C = cogroup A by c1, B by c1 using 'merge'; {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PIG-4597) Enable TestNullConstant unit test in spark mode
[ https://issues.apache.org/jira/browse/PIG-4597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14582647#comment-14582647 ] Mohit Sabharwal commented on PIG-4597: -- Thanks, [~kexianda]! LGTM. +1 (non-binding) Enable TestNullConstant unit test in spark mode -- Key: PIG-4597 URL: https://issues.apache.org/jira/browse/PIG-4597 Project: Pig Issue Type: Sub-task Components: spark Reporter: kexianda Assignee: kexianda Fix For: spark-branch Attachments: PIG-4597.patch ant -Dtestcase=TestNullConstant -Dexectype=spark -DdebugPort= -Dhadoopversion=23 test You will find following unit test failure: Error Message expected:4 but was:3 Stacktrace junit.framework.AssertionFailedError: expected:4 but was:3 at org.apache.pig.test.TestNullConstant.testOuterJoin(TestNullConstant.java:117) It failed because the actual result of the group operator is not in the same order as expected result. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PIG-4595) Fix unit test failures about TestFRJoinNullValue in spark mode
[ https://issues.apache.org/jira/browse/PIG-4595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14580953#comment-14580953 ] Mohit Sabharwal commented on PIG-4595: -- +1 (non-binding) Fix unit test failures about TestFRJoinNullValue in spark mode -- Key: PIG-4595 URL: https://issues.apache.org/jira/browse/PIG-4595 Project: Pig Issue Type: Sub-task Components: spark Reporter: liyunzhang_intel Assignee: liyunzhang_intel Fix For: spark-branch Attachments: PIG-4595.patch, PIG-4595_1.patch based on f9a50f3, using the following command to test TestFRJoinNullValue: ant -Dtestcase=TestFRJoinNullValue -Dexectype=spark -Dhadoopversion=23 test the following unit tests fail: • org.apache.pig.test.TestFRJoinNullValue.testTupleLeftNullMatch • org.apache.pig.test.TestFRJoinNullValue.testLeftNullMatch • org.apache.pig.test.TestFRJoinNullValue.testTupleNullMatch • org.apache.pig.test.TestFRJoinNullValue.testNullMatch These unit tests fail because null values from table a and table b are considered equal when table a FR joins table b. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PIG-4585) Use newAPIHadoopRDD instead of newAPIHadoopFile
[ https://issues.apache.org/jira/browse/PIG-4585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mohit Sabharwal updated PIG-4585: - Attachment: PIG-4585.2.patch Use newAPIHadoopRDD instead of newAPIHadoopFile --- Key: PIG-4585 URL: https://issues.apache.org/jira/browse/PIG-4585 Project: Pig Issue Type: Sub-task Components: spark Affects Versions: spark-branch Reporter: Mohit Sabharwal Assignee: Mohit Sabharwal Fix For: spark-branch Attachments: PIG-4585.1.patch, PIG-4585.2.patch, PIG-4585.patch LoadConverter currently uses SparkContext.newAPIHadoopFile which won't work for non-filesystem based input sources, like HBase. newAPIHadoopFile assumes a FileInputFormat and attempts to [verify|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L1065] this in the constructor, which fails for HBaseTableInputFormat (which is not a FileInputFormat) {code} NewFileInputFormat.setInputPaths(job, path) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PIG-4593) Enable TestMultiQueryLocal in spark mode
[ https://issues.apache.org/jira/browse/PIG-4593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14580954#comment-14580954 ] Mohit Sabharwal commented on PIG-4593: -- +1 (non-binding) Enable TestMultiQueryLocal in spark mode -- Key: PIG-4593 URL: https://issues.apache.org/jira/browse/PIG-4593 Project: Pig Issue Type: Sub-task Components: spark Reporter: liyunzhang_intel Assignee: liyunzhang_intel Fix For: spark-branch Attachments: PIG-4593.patch, PIG-4593_1.patch in https://builds.apache.org/job/Pig-spark/211/#showFailuresLink, it shows that following unit tests fail: org.apache.pig.test.TestMultiQueryLocal.testMultiQueryWithTwoStores org.apache.pig.test.TestMultiQueryLocal.testMultiQueryWithThreeStores org.apache.pig.test.TestMultiQueryLocal.testMultiQueryWithTwoLoads -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (PIG-4189) Make cross join work with Spark
[ https://issues.apache.org/jira/browse/PIG-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mohit Sabharwal resolved PIG-4189. -- Resolution: Duplicate Assignee: Mohit Sabharwal CROSS operation is implemented in two flavors in Pig: 1) Regular CROSS using GFCross UDF 2) Nested CROSS using POCross Both work with Spark due to patches in linked jiras. Make cross join work with Spark --- Key: PIG-4189 URL: https://issues.apache.org/jira/browse/PIG-4189 Project: Pig Issue Type: Sub-task Components: spark Reporter: Praveen Rachabattuni Assignee: Mohit Sabharwal Fix For: spark-branch Related e2e tests: Cross_1 - Cross_5 Sample script: a = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa); b = load '/user/pig/tests/data/singlefile/votertab10k' as (name, age, registration, contributions); c = filter a by age 19 and gpa 1.0; d = filter b by age 19; e = cross c, d; store e into '/user/pig/out/praveenr-1411378727-nightly.conf/Cross_1.out'; Log: [Executor task launch worker-1] ERROR org.apache.spark.executor.Executor - Exception in task ID 2 java.lang.RuntimeException: org.apache.pig.backend.executionengine.ExecException: ERROR 2078: Caught error from UDF: org.apache.pig.impl.builtin.GFCross [Unable to get parallelism hint from job conf] at org.apache.pig.backend.hadoop.executionengine.spark.converter.POOutputConsumerIterator.readNext(POOutputConsumerIterator.java:57) at org.apache.pig.backend.hadoop.executionengine.spark.converter.POOutputConsumerIterator.hasNext(POOutputConsumerIterator.java:63) at scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161) at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102) at org.apache.spark.scheduler.Task.run(Task.scala:53) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213) at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:744) Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2078: Caught error from UDF: org.apache.pig.impl.builtin.GFCross [Unable to get parallelism hint from job conf] at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:372) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNextDataBag(POUserFunc.java:388) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:331) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:378) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNextTuple(POForEach.java:298) at org.apache.pig.backend.hadoop.executionengine.spark.converter.ForEachConverter$ForEachFunction$1.getNextResult(ForEachConverter.java:53) at org.apache.pig.backend.hadoop.executionengine.spark.converter.POOutputConsumerIterator.readNext(POOutputConsumerIterator.java:36) ... 15 more Caused by: java.io.IOException: Unable to get parallelism hint from job conf at org.apache.pig.impl.builtin.GFCross.exec(GFCross.java:61) at org.apache.pig.impl.builtin.GFCross.exec(GFCross.java:1) at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:344) ... 
21 more -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PIG-4588) Move tests under 'test-spark' target
[ https://issues.apache.org/jira/browse/PIG-4588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mohit Sabharwal updated PIG-4588: - Attachment: PIG-4588.1.patch Move tests under 'test-spark' target Key: PIG-4588 URL: https://issues.apache.org/jira/browse/PIG-4588 Project: Pig Issue Type: Sub-task Components: spark Affects Versions: spark-branch Reporter: Mohit Sabharwal Assignee: Mohit Sabharwal Fix For: spark-branch Attachments: PIG-4588.1.patch, PIG-4588.patch Run test-spark and test-spark-local tests in the same ant target. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (PIG-4585) Use newAPIHadoopRDD instead of newAPIHadoopFile
[ https://issues.apache.org/jira/browse/PIG-4585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mohit Sabharwal updated PIG-4585: - Attachment: PIG-4585.1.patch Use newAPIHadoopRDD instead of newAPIHadoopFile --- Key: PIG-4585 URL: https://issues.apache.org/jira/browse/PIG-4585 Project: Pig Issue Type: Sub-task Components: spark Affects Versions: spark-branch Reporter: Mohit Sabharwal Assignee: Mohit Sabharwal Fix For: spark-branch Attachments: PIG-4585.1.patch, PIG-4585.patch LoadConverter currently uses SparkContext.newAPIHadoopFile which won't work for non-filesystem based input sources, like HBase. newAPIHadoopFile assumes a FileInputFormat and attempts to [verify|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L1065] this in the constructor, which fails for HBaseTableInputFormat (which is not a FileInputFormat) {code} NewFileInputFormat.setInputPaths(job, path) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (PIG-4593) Enable TestMultiQueryLocal in spark mode
[ https://issues.apache.org/jira/browse/PIG-4593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14579417#comment-14579417 ]

Mohit Sabharwal commented on PIG-4593:
--------------------------------------
+1 (non-binding)

> Enable TestMultiQueryLocal in spark mode
> ----------------------------------------
>
>          Key: PIG-4593
>          URL: https://issues.apache.org/jira/browse/PIG-4593
>      Project: Pig
>   Issue Type: Sub-task
>   Components: spark
>     Reporter: liyunzhang_intel
>     Assignee: liyunzhang_intel
>      Fix For: spark-branch
>  Attachments: PIG-4593.patch
>
> In https://builds.apache.org/job/Pig-spark/211/#showFailuresLink, the following unit tests fail:
> org.apache.pig.test.TestMultiQueryLocal.testMultiQueryWithTwoStores
> org.apache.pig.test.TestMultiQueryLocal.testMultiQueryWithThreeStores
> org.apache.pig.test.TestMultiQueryLocal.testMultiQueryWithTwoLoads
[jira] [Commented] (PIG-4596) Fix unit test failures about MergeJoinConverter in spark mode
[ https://issues.apache.org/jira/browse/PIG-4596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14579431#comment-14579431 ]

Mohit Sabharwal commented on PIG-4596:
--------------------------------------
+1 (non-binding)

> Fix unit test failures about MergeJoinConverter in spark mode
> -------------------------------------------------------------
>
>          Key: PIG-4596
>          URL: https://issues.apache.org/jira/browse/PIG-4596
>      Project: Pig
>   Issue Type: Sub-task
>   Components: spark
>     Reporter: liyunzhang_intel
>     Assignee: liyunzhang_intel
>      Fix For: spark-branch
>  Attachments: PIG-4596.patch
>
> Use the following command to run TestMergeJoin:
> {code}
> ant -Dtestcase=TestMergeJoin -Dexectype=spark -Dhadoopversion=23 test
> {code}
> The following unit test fails:
> org.apache.pig.test.TestMergeJoin.testMergeJoinWithNulls
> This test fails because null values from table a and table b are treated as equal when table a is merge-joined with table b.
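The fix described here hinges on SQL-style join semantics, which Pig follows: a null key matches nothing, not even another null. A minimal sketch of a null-safe key comparison (the class and method names are illustrative, not the actual MergeJoinConverter code):

```java
public class NullSafeJoinSketch {
    // A join that compares keys with equals()/compareTo() alone (or that
    // groups nulls together when sorting) would wrongly pair null keys from
    // the two inputs. Correct join semantics reject nulls up front.
    static boolean keysMatch(String leftKey, String rightKey) {
        if (leftKey == null || rightKey == null) {
            return false; // a null key joins nothing, not even another null
        }
        return leftKey.equals(rightKey);
    }

    public static void main(String[] args) {
        System.out.println(keysMatch("a", "a"));   // true
        System.out.println(keysMatch(null, null)); // false: nulls must not match
        System.out.println(keysMatch("a", null));  // false
    }
}
```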
[jira] [Commented] (PIG-4589) Fix unit test failure in TestCase
[ https://issues.apache.org/jira/browse/PIG-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578125#comment-14578125 ]

Mohit Sabharwal commented on PIG-4589:
--------------------------------------
+1 (non-binding)

> Fix unit test failure in TestCase
> ---------------------------------
>
>          Key: PIG-4589
>          URL: https://issues.apache.org/jira/browse/PIG-4589
>      Project: Pig
>   Issue Type: Sub-task
>   Components: spark
>     Reporter: kexianda
>     Assignee: kexianda
>      Fix For: spark-branch
>  Attachments: PIG-4589.patch
>
> Run:
> {code}
> ant -Dtestcase=TestCase -Dexectype=spark -DdebugPort= -Dhadoopversion=23 test
> {code}
> The following unit test fails:
> * org.apache.pig.test.TestCase.testWithDereferenceOperator
> It fails because the actual result of the group operator is not in the same order as the expected result.
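A common fix for this kind of failure is to compare results order-insensitively, since the group operator makes no ordering guarantee across execution engines. A small illustrative sketch (not the actual TestCase code) that sorts both sides before checking equality:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class OrderInsensitiveCompare {
    // Compare expected and actual tuples as multisets: copy both lists,
    // sort the copies, then check element-wise equality. The inputs are
    // left untouched.
    static boolean sameResults(List<String> expected, List<String> actual) {
        List<String> e = new ArrayList<>(expected);
        List<String> a = new ArrayList<>(actual);
        Collections.sort(e);
        Collections.sort(a);
        return e.equals(a);
    }

    public static void main(String[] args) {
        List<String> expected = Arrays.asList("(1,2)", "(3,4)");
        List<String> actual   = Arrays.asList("(3,4)", "(1,2)"); // same rows, different order
        System.out.println(sameResults(expected, actual)); // prints true
    }
}
```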
[jira] [Commented] (PIG-4586) Cleanup: Rename POConverter to RDDConverter
[ https://issues.apache.org/jira/browse/PIG-4586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14573206#comment-14573206 ]

Mohit Sabharwal commented on PIG-4586:
--------------------------------------
[~kellyzly], the PO prefix is used by operators, but POConverter is not an operator, so I think it will confuse someone looking at the code for the first time. RDDConverter is an alternative name (a class that converts physical operators to RDDs). Let me know if you have any other suggestions for the name.

> Cleanup: Rename POConverter to RDDConverter
> -------------------------------------------
>
>              Key: PIG-4586
>              URL: https://issues.apache.org/jira/browse/PIG-4586
>          Project: Pig
>       Issue Type: Sub-task
>       Components: spark
> Affects Versions: spark-branch
>         Reporter: Mohit Sabharwal
>         Assignee: Mohit Sabharwal
>          Fix For: spark-branch
>      Attachments: PIG-4586.1.patch, PIG-4586.patch
>
> PO prefix should apply to operators
[jira] [Updated] (PIG-4585) Use newAPIHadoopRDD instead of newAPIHadoopFile
[ https://issues.apache.org/jira/browse/PIG-4585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mohit Sabharwal updated PIG-4585:
---------------------------------
    Attachment: (was: PIG-4585.1.patch)

> Use newAPIHadoopRDD instead of newAPIHadoopFile
> -----------------------------------------------
>
>              Key: PIG-4585
>              URL: https://issues.apache.org/jira/browse/PIG-4585
>          Project: Pig
>       Issue Type: Sub-task
>       Components: spark
> Affects Versions: spark-branch
>         Reporter: Mohit Sabharwal
>         Assignee: Mohit Sabharwal
>          Fix For: spark-branch
>      Attachments: PIG-4585.patch
>
> LoadConverter currently uses SparkContext.newAPIHadoopFile, which won't work for non-filesystem-based input sources, like HBase. newAPIHadoopFile assumes a FileInputFormat and attempts to [verify|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L1065] this in the constructor, which fails for HBaseTableInputFormat (which is not a FileInputFormat):
> {code}
> NewFileInputFormat.setInputPaths(job, path)
> {code}
[jira] [Updated] (PIG-4585) Use newAPIHadoopRDD instead of newAPIHadoopFile
[ https://issues.apache.org/jira/browse/PIG-4585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mohit Sabharwal updated PIG-4585:
---------------------------------
    Attachment: PIG-4585.1.patch

> Use newAPIHadoopRDD instead of newAPIHadoopFile
> -----------------------------------------------
>
>              Key: PIG-4585
>              URL: https://issues.apache.org/jira/browse/PIG-4585
>          Project: Pig
>       Issue Type: Sub-task
>       Components: spark
> Affects Versions: spark-branch
>         Reporter: Mohit Sabharwal
>         Assignee: Mohit Sabharwal
>          Fix For: spark-branch
>      Attachments: PIG-4585.1.patch, PIG-4585.patch
>
> LoadConverter currently uses SparkContext.newAPIHadoopFile, which won't work for non-filesystem-based input sources, like HBase. newAPIHadoopFile assumes a FileInputFormat and attempts to [verify|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L1065] this in the constructor, which fails for HBaseTableInputFormat (which is not a FileInputFormat):
> {code}
> NewFileInputFormat.setInputPaths(job, path)
> {code}
[jira] [Created] (PIG-4586) Cleanup: Rename POConverter to RDDConverter
Mohit Sabharwal created PIG-4586:
------------------------------------

             Summary: Cleanup: Rename POConverter to RDDConverter
                 Key: PIG-4586
                 URL: https://issues.apache.org/jira/browse/PIG-4586
             Project: Pig
          Issue Type: Sub-task
          Components: spark
    Affects Versions: spark-branch
            Reporter: Mohit Sabharwal
            Assignee: Mohit Sabharwal
             Fix For: spark-branch
         Attachments: PIG-4586.patch

PO prefix should apply to operators
[jira] [Updated] (PIG-4586) Cleanup: Rename POConverter to RDDConverter
[ https://issues.apache.org/jira/browse/PIG-4586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mohit Sabharwal updated PIG-4586:
---------------------------------
    Attachment: PIG-4586.patch

> Cleanup: Rename POConverter to RDDConverter
> -------------------------------------------
>
>              Key: PIG-4586
>              URL: https://issues.apache.org/jira/browse/PIG-4586
>          Project: Pig
>       Issue Type: Sub-task
>       Components: spark
> Affects Versions: spark-branch
>         Reporter: Mohit Sabharwal
>         Assignee: Mohit Sabharwal
>          Fix For: spark-branch
>      Attachments: PIG-4586.patch
>
> PO prefix should apply to operators
[jira] [Updated] (PIG-4585) Use newAPIHadoopRDD instead of newAPIHadoopFile
[ https://issues.apache.org/jira/browse/PIG-4585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mohit Sabharwal updated PIG-4585:
---------------------------------
    Status: Patch Available  (was: Open)

> Use newAPIHadoopRDD instead of newAPIHadoopFile
> -----------------------------------------------
>
>              Key: PIG-4585
>              URL: https://issues.apache.org/jira/browse/PIG-4585
>          Project: Pig
>       Issue Type: Sub-task
>       Components: spark
> Affects Versions: spark-branch
>         Reporter: Mohit Sabharwal
>         Assignee: Mohit Sabharwal
>          Fix For: spark-branch
>      Attachments: PIG-4585.patch
>
> LoadConverter currently uses SparkContext.newAPIHadoopFile, which won't work for non-filesystem-based input sources, like HBase. newAPIHadoopFile assumes a FileInputFormat and attempts to [verify|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L1065] this in the constructor, which fails for HBaseTableInputFormat (which is not a FileInputFormat):
> {code}
> NewFileInputFormat.setInputPaths(job, path)
> {code}
[jira] [Updated] (PIG-4586) Cleanup: Rename POConverter to RDDConverter
[ https://issues.apache.org/jira/browse/PIG-4586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mohit Sabharwal updated PIG-4586:
---------------------------------
    Attachment: PIG-4586.1.patch

> Cleanup: Rename POConverter to RDDConverter
> -------------------------------------------
>
>              Key: PIG-4586
>              URL: https://issues.apache.org/jira/browse/PIG-4586
>          Project: Pig
>       Issue Type: Sub-task
>       Components: spark
> Affects Versions: spark-branch
>         Reporter: Mohit Sabharwal
>         Assignee: Mohit Sabharwal
>          Fix For: spark-branch
>      Attachments: PIG-4586.1.patch, PIG-4586.patch
>
> PO prefix should apply to operators
[jira] [Updated] (PIG-4585) Use newAPIHadoopRDD instead of newAPIHadoopFile
[ https://issues.apache.org/jira/browse/PIG-4585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mohit Sabharwal updated PIG-4585:
---------------------------------
    Description: 
LoadConverter currently uses SparkContext.newAPIHadoopFile, which won't work for non-filesystem-based input sources, like HBase. newAPIHadoopFile assumes a FileInputFormat and attempts to [verify|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L1065] this in the constructor, which fails for HBaseTableInputFormat (which is not a FileInputFormat):
{code}
NewFileInputFormat.setInputPaths(job, path)
{code}

  was:
LoadConverter currently uses SparkContext.newAPIHadoopFile, which won't work for non-filesystem-based input sources, like HBase. newAPIHadoopFile assumes a FileInputFormat and attempts to [verify|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L1065] this in the constructor, which fails for HBaseTableInputFormat (which is not a FileInputFormat)

> Use newAPIHadoopRDD instead of newAPIHadoopFile
> -----------------------------------------------
>
>              Key: PIG-4585
>              URL: https://issues.apache.org/jira/browse/PIG-4585
>          Project: Pig
>       Issue Type: Sub-task
>       Components: spark
> Affects Versions: spark-branch
>         Reporter: Mohit Sabharwal
>         Assignee: Mohit Sabharwal
>          Fix For: spark-branch
>
> LoadConverter currently uses SparkContext.newAPIHadoopFile, which won't work for non-filesystem-based input sources, like HBase. newAPIHadoopFile assumes a FileInputFormat and attempts to [verify|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L1065] this in the constructor, which fails for HBaseTableInputFormat (which is not a FileInputFormat):
> {code}
> NewFileInputFormat.setInputPaths(job, path)
> {code}
[jira] [Created] (PIG-4585) Use newAPIHadoopRDD instead of newAPIHadoopFile
Mohit Sabharwal created PIG-4585:
------------------------------------

             Summary: Use newAPIHadoopRDD instead of newAPIHadoopFile
                 Key: PIG-4585
                 URL: https://issues.apache.org/jira/browse/PIG-4585
             Project: Pig
          Issue Type: Sub-task
          Components: spark
    Affects Versions: spark-branch
            Reporter: Mohit Sabharwal
            Assignee: Mohit Sabharwal
             Fix For: spark-branch

LoadConverter currently uses SparkContext.newAPIHadoopFile, which won't work for non-filesystem-based input sources, like HBase. newAPIHadoopFile assumes a FileInputFormat and attempts to [verify|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L1065] this in the constructor, which fails for HBaseTableInputFormat (which is not a FileInputFormat).