[jira] [Commented] (PIG-4920) Fail to use Javascript UDF in spark yarn client mode

2016-10-25 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15607235#comment-15607235
 ] 

Mohit Sabharwal commented on PIG-4920:
--

LGTM, +1 (non-binding)

> Fail to use Javascript UDF in spark yarn client mode
> 
>
> Key: PIG-4920
> URL: https://issues.apache.org/jira/browse/PIG-4920
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4920.patch, PIG-4920_2.patch, PIG-4920_3.patch, 
> PIG-4920_4.patch, PIG-4920_5.patch, PIG-4920_6.patch
>
>
> udf.pig 
> {code}
> register '/home/zly/prj/oss/merge.pig/pig/bin/udf.js' using javascript as 
> myfuncs;
> A = load './passwd' as (a0:chararray, a1:chararray);
> B = foreach A generate myfuncs.helloworld();
> store B into './udf.out';
> {code}
> udf.js
> {code}
> helloworld.outputSchema = "word:chararray";
> function helloworld() {
> return 'Hello, World';
> }
> 
> complex.outputSchema = "word:chararray";
> function complex(word){
> return {word:word};
> }
> {code}
> Run udf.pig in spark local mode (export SPARK_MASTER="local"): it succeeds.
> Run udf.pig in spark yarn client mode (export SPARK_MASTER="yarn-client"): it 
> fails with an error message like the following:
> {noformat}
> Caused by: java.lang.reflect.InvocationTargetException
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native 
> Method)
> at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:408)
> at 
> org.apache.pig.impl.PigContext.instantiateFuncFromSpec(PigContext.java:744)
> ... 84 more
> Caused by: java.lang.ExceptionInInitializerError
> at 
> org.apache.pig.scripting.js.JsScriptEngine.getInstance(JsScriptEngine.java:87)
> at org.apache.pig.scripting.js.JsFunction.<init>(JsFunction.java:173)
> ... 89 more
> Caused by: java.lang.IllegalStateException: could not get script path from 
> UDFContext
> at 
> org.apache.pig.scripting.js.JsScriptEngine$Holder.<clinit>(JsScriptEngine.java:69)
> ... 91 more
> {noformat}
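For context, here is a minimal sketch of the initialization pattern that fails (an approximation, not the actual JsScriptEngine source; the property key and constructor shape are illustrative): the singleton lives in a lazily-initialized holder whose static initializer reads the script path from UDFContext, and on a yarn-client executor that lookup comes back empty, so the IllegalStateException above surfaces as ExceptionInInitializerError.

{code}
// Hedged sketch of the holder idiom behind JsScriptEngine.getInstance().
// "js.script.path" is an illustrative key, not necessarily the real one.
private static class Holder {
    private static String scriptPath() {
        String path = UDFContext.getUDFContext()
                .getUDFProperties(JsScriptEngine.class)
                .getProperty("js.script.path");
        if (path == null) {
            // the IllegalStateException seen in the stack trace above
            throw new IllegalStateException("could not get script path from UDFContext");
        }
        return path;
    }
    static final JsScriptEngine INSTANCE = new JsScriptEngine(scriptPath());
}

public static JsScriptEngine getInstance() {
    // the first call triggers Holder's static initializer; on a yarn-client
    // executor the UDFContext properties were never shipped, hence the
    // ExceptionInInitializerError
    return Holder.INSTANCE;
}
{code}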



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4553) Implement secondary sort using one shuffle

2016-07-15 Thread Mohit Sabharwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mohit Sabharwal updated PIG-4553:
-
Summary: Implement secondary sort using one shuffle  (was: Implement 
secondary sort using 1 shuffle not twice)

> Implement secondary sort using one shuffle
> --
>
> Key: PIG-4553
> URL: https://issues.apache.org/jira/browse/PIG-4553
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4553_1.patch, PIG-4553_2.patch
>
>
> Now we implement secondary key sort in GlobalRearrangeConverter#convert 
> with two shuffles: the first shuffle in repartitionAndSortWithinPartitions and 
> the second shuffle in groupBy:
> {code}
> public RDD<Tuple> convert(List<RDD<Tuple>> predecessors,
>         POGlobalRearrangeSpark physicalOperator) throws IOException {
> 
>     if (predecessors.size() == 1) {
>         // GROUP
>         JavaPairRDD<Object, Iterable<Tuple>> prdd = null;
>         if (physicalOperator.isUseSecondaryKey()) {
>             RDD<Tuple> rdd = predecessors.get(0);
>             RDD<Tuple2<Tuple, Object>> rddPair = rdd.map(new ToKeyNullValueFunction(),
>                     SparkUtil.<Tuple, Object>getTuple2Manifest());
>             JavaPairRDD<Tuple, Object> pairRDD = new JavaPairRDD<Tuple, Object>(rddPair,
>                     SparkUtil.getManifest(Tuple.class),
>                     SparkUtil.getManifest(Object.class));
>             // first sort the tuple by secondary key if enable useSecondaryKey sort
>             JavaPairRDD<Tuple, Object> sorted = pairRDD.repartitionAndSortWithinPartitions(
>                     new HashPartitioner(parallelism),
>                     new PigSecondaryKeyComparatorSpark(physicalOperator.getSecondarySortOrder())); // first shuffle
>             JavaRDD<Tuple> mapped = sorted.mapPartitions(new ToValueFunction());
>             prdd = mapped.groupBy(new GetKeyFunction(physicalOperator), parallelism); // second shuffle
>         } else {
>             JavaRDD<Tuple> jrdd = predecessors.get(0).toJavaRDD();
>             prdd = jrdd.groupBy(new GetKeyFunction(physicalOperator), parallelism);
>         }
>         JavaRDD<Tuple> jrdd2 = prdd.map(new GroupTupleFunction(physicalOperator));
>         return jrdd2.rdd();
>     }
> 
> }
> {code}
> We can optimize it according to the code from 
> https://github.com/tresata/spark-sorted.
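For illustration, a hedged sketch of the one-shuffle approach in the spirit of spark-sorted: partition by the primary key only, sort within partitions by the composite (primary, secondary) key in a single repartitionAndSortWithinPartitions, then assemble groups with a streaming pass instead of a second groupBy shuffle. All class names below (OneShuffleSecondarySort, CompositeComparator, PrimaryKeyPartitioner) are illustrative, not the eventual patch; String/Integer stand in for Pig's key types, and the FlatMapFunction returns an Iterable per the Spark 1.x Java API used on this branch.

{code}
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;

import org.apache.spark.Partitioner;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.FlatMapFunction;

import scala.Tuple2;

public class OneShuffleSecondarySort {

    // Sort by primary key first, then by secondary key.
    static class CompositeComparator
            implements Comparator<Tuple2<String, Integer>>, Serializable {
        public int compare(Tuple2<String, Integer> a, Tuple2<String, Integer> b) {
            int c = a._1().compareTo(b._1());
            return c != 0 ? c : a._2().compareTo(b._2());
        }
    }

    // Route records by primary key only, so one group never spans partitions
    // even though the sort order uses the composite key.
    static class PrimaryKeyPartitioner extends Partitioner {
        private final int partitions;
        PrimaryKeyPartitioner(int partitions) { this.partitions = partitions; }
        public int numPartitions() { return partitions; }
        public int getPartition(Object key) {
            String primary = ((Tuple2<String, Integer>) key)._1();
            return (primary.hashCode() & Integer.MAX_VALUE) % partitions;
        }
    }

    // One shuffle total: repartitionAndSortWithinPartitions does the work of
    // both the sort and the grouping; grouping becomes a streaming pass.
    static JavaRDD<Tuple2<String, List<Integer>>> groupWithSecondarySort(
            JavaPairRDD<Tuple2<String, Integer>, Integer> input, int parallelism) {
        return input
            .repartitionAndSortWithinPartitions(
                new PrimaryKeyPartitioner(parallelism), new CompositeComparator())
            .mapPartitions(
                new FlatMapFunction<Iterator<Tuple2<Tuple2<String, Integer>, Integer>>,
                                    Tuple2<String, List<Integer>>>() {
                    public Iterable<Tuple2<String, List<Integer>>> call(
                            Iterator<Tuple2<Tuple2<String, Integer>, Integer>> it) {
                        List<Tuple2<String, List<Integer>>> out =
                                new ArrayList<Tuple2<String, List<Integer>>>();
                        String current = null;
                        List<Integer> values = null;
                        while (it.hasNext()) {
                            Tuple2<Tuple2<String, Integer>, Integer> rec = it.next();
                            if (!rec._1()._1().equals(current)) { // a new group begins
                                current = rec._1()._1();
                                values = new ArrayList<Integer>();
                                out.add(new Tuple2<String, List<Integer>>(current, values));
                            }
                            values.add(rec._2()); // values arrive secondary-sorted
                        }
                        return out;
                    }
                });
    }
}
{code}

Buffering one group's values in a list mirrors what groupBy does per key anyway; spark-sorted goes further and streams the values lazily so a single group need not fit in memory.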



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4941) TestRank3#testRankWithSplitInMap hangs after upgrade to spark 1.6.1

2016-07-08 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15368050#comment-15368050
 ] 

Mohit Sabharwal commented on PIG-4941:
--

The following seems like the relevant thread:

{code}
"main" #1 prio=5 os_prio=0 tid=0x7f5e68019800 nid=0x1034 in Object.wait() 
[0x7f5e6fe07000]
   java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:502)
at org.apache.spark.scheduler.JobWaiter.awaitResult(JobWaiter.scala:73)
- locked <0xc39b09a8> (a org.apache.spark.scheduler.JobWaiter)
at 
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:612)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1922)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply$mcV$sp(PairRDDFunctions.scala:1146)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1074)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1.apply(PairRDDFunctions.scala:1074)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at 
org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopDataset(PairRDDFunctions.scala:1074)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply$mcV$sp(PairRDDFunctions.scala:994)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply(PairRDDFunctions.scala:985)
at 
org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopFile$2.apply(PairRDDFunctions.scala:985)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at 
org.apache.spark.rdd.PairRDDFunctions.saveAsNewAPIHadoopFile(PairRDDFunctions.scala:985)
at 
org.apache.pig.backend.hadoop.executionengine.spark.converter.StoreConverter.convert(StoreConverter.java:103)
{code}

[~kellyzly], could you check http://localhost:4040/ to see if there is any 
additional info?
http://spark.apache.org/docs/latest/monitoring.html

> TestRank3#testRankWithSplitInMap hangs after upgrade to spark 1.6.1
> ---
>
> Key: PIG-4941
> URL: https://issues.apache.org/jira/browse/PIG-4941
> Project: Pig
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: rank.jstack
>
>
> After upgrading spark version to 1.6.1, TestRank3#testRankWithSplitInMap 
> hangs and fails due to timeout exception.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4919) Upgrade spark.version to 1.6.1

2016-06-13 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15328855#comment-15328855
 ] 

Mohit Sabharwal commented on PIG-4919:
--

+1 (non-binding)

> Upgrade spark.version to 1.6.1
> --
>
> Key: PIG-4919
> URL: https://issues.apache.org/jira/browse/PIG-4919
> Project: Pig
>  Issue Type: Sub-task
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Attachments: PIG-4919.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4898) Fix unit test failure after PIG-4771's patch was checked in

2016-05-23 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15297587#comment-15297587
 ] 

Mohit Sabharwal commented on PIG-4898:
--

+1 (non-binding)

> Fix unit test failure after PIG-4771's patch was checked in
> ---
>
> Key: PIG-4898
> URL: https://issues.apache.org/jira/browse/PIG-4898
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4898.patch
>
>
> Now in the [latest jenkins|https://builds.apache.org/job/Pig-spark/#328], it 
> shows that the following unit test cases fail:
>  org.apache.pig.test.TestFRJoin.testDistinctFRJoin
>  org.apache.pig.test.TestPigRunner.simpleMultiQueryTest3



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4771) Implement FR Join for spark engine

2016-05-16 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15286004#comment-15286004
 ] 

Mohit Sabharwal commented on PIG-4771:
--

+1 (non-binding)

> Implement FR Join for spark engine
> --
>
> Key: PIG-4771
> URL: https://issues.apache.org/jira/browse/PIG-4771
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4771.patch, PIG-4771_2.patch, PIG-4771_3.patch
>
>
> We use a regular join in place of FR join in the current code base (fd31fda). We need 
> to implement FR join.
> Some info collected from 
> https://pig.apache.org/docs/r0.11.0/perf.html#replicated-joins:
> *Replicated Joins*
> Fragment replicate join is a special type of join that works well if one or 
> more relations are small enough to fit into main memory. In such cases, Pig 
> can perform a very efficient join because all of the hadoop work is done on 
> the map side. In this type of join the large relation is followed by one or 
> more small relations. The small relations must be small enough to fit into 
> main memory; if they don't, the process fails and an error is generated.
> *Usage*
> Perform a replicated join with the USING clause (see JOIN (inner) and JOIN 
> (outer)). In this example, a large relation is joined with two smaller 
> relations. Note that the large relation comes first followed by the smaller 
> relations; and, all small relations together must fit into main memory, 
> otherwise an error is generated.
> big = LOAD 'big_data' AS (b1,b2,b3);
> tiny = LOAD 'tiny_data' AS (t1,t2,t3);
> mini = LOAD 'mini_data' AS (m1,m2,m3);
> C = JOIN big BY b1, tiny BY t1, mini BY m1 USING 'replicated';
> *Conditions*
> Fragment replicate joins are experimental; we don't have a strong sense of 
> how small the small relation must be to fit into memory. In our tests with a 
> simple query that involves just a JOIN, a relation of up to 100 M can be used 
> if the process overall gets 1 GB of memory. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4886) Add PigSplit#getLocationInfo to fix the NPE found in log in spark mode

2016-05-16 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15286003#comment-15286003
 ] 

Mohit Sabharwal commented on PIG-4886:
--

Thanks, [~kellyzly] - left a couple of comments on RB.

> Add PigSplit#getLocationInfo to fix the NPE found in log in spark mode
> --
>
> Key: PIG-4886
> URL: https://issues.apache.org/jira/browse/PIG-4886
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4886.patch
>
>
> Use branch code (119f313) to test the following pig script in spark mode:
> {code}
> A = load './SkewedJoinInput1.txt' as (id,name,n);
> B = load './SkewedJoinInput2.txt' as (id,name);
> D = join A by (id,name), B by (id,name);
> store D into './testFRJoin.out';
> {code}
> cat bin/SkewedJoinInput1.txt 
> {noformat}
> 100   apple1  aaa
> 200   orange1 bbb
> 300   strawberry  ccc
> {noformat}
> cat bin/SkewedJoinInput2.txt 
> {noformat}
> 100   apple1
> 100   apple2
> 100   apple2
> 200   orange1
> 200   orange2
> 300   strawberry
> 400   pear
> {noformat}
> The following exception is found in the log:
> {noformat}
> [dag-scheduler-event-loop] 2016-05-05 14:21:01,046 DEBUG rdd.NewHadoopRDD 
> (Logging.scala:logDebug(84)) - Failed to use InputSplit#getLocationInfo.
> java.lang.NullPointerException
> at 
> scala.collection.mutable.ArrayOps$ofRef$.length$extension(ArrayOps.scala:114)
> at scala.collection.mutable.ArrayOps$ofRef.length(ArrayOps.scala:114)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:32)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> at 
> org.apache.spark.rdd.HadoopRDD$.convertSplitLocationInfo(HadoopRDD.scala:406)
> at 
> org.apache.spark.rdd.NewHadoopRDD.getPreferredLocations(NewHadoopRDD.scala:202)
> at 
> org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:231)
> at 
> org.apache.spark.rdd.RDD$$anonfun$preferredLocations$2.apply(RDD.scala:231)
> at scala.Option.getOrElse(Option.scala:120)
> at org.apache.spark.rdd.RDD.preferredLocations(RDD.scala:230)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal(DAGScheduler.scala:1387)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply$mcVI$sp(DAGScheduler.scala:1397)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1396)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$getPreferredLocsInternal$2$$anonfun$apply$2.apply(DAGScheduler.scala:1396)
> {noformat}
> org.apache.spark.rdd.NewHadoopRDD.getPreferredLocations will call 
> PigSplit#getLocationInfo, but currently PigSplit extends InputSplit and 
> InputSplit#getLocationInfo returns null.
> {code}
>   @Evolving
>   public SplitLocationInfo[] getLocationInfo() throws IOException {
> return null;
>   }
> {code}
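A minimal sketch of the fix the title suggests (an assumption about its shape, not necessarily the committed patch): override getLocationInfo() in PigSplit so it returns host info derived from the locations PigSplit already aggregates, instead of inheriting InputSplit's null.

{code}
// Hedged sketch: PigSplit already aggregates host names from its wrapped
// splits via getLocations(); expose them as SplitLocationInfo, marking the
// data as on-disk (inMemory = false) for every host as a simplification.
@Override
public SplitLocationInfo[] getLocationInfo() throws IOException {
    String[] locations;
    try {
        locations = getLocations();
    } catch (InterruptedException e) {
        throw new IOException(e);
    }
    if (locations == null) {
        return null;
    }
    SplitLocationInfo[] locationInfo = new SplitLocationInfo[locations.length];
    for (int i = 0; i < locations.length; i++) {
        locationInfo[i] = new SplitLocationInfo(locations[i], false);
    }
    return locationInfo;
}
{code}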



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4771) Implement FR Join for spark engine

2016-05-16 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15285953#comment-15285953
 ] 

Mohit Sabharwal commented on PIG-4771:
--

Thanks, [~kellyzly]. Left a couple of comments on RB. We can commit this after those 
changes and work on PIG-4891 to use broadcast variables. Thanks!

> Implement FR Join for spark engine
> --
>
> Key: PIG-4771
> URL: https://issues.apache.org/jira/browse/PIG-4771
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4771.patch, PIG-4771_2.patch
>
>
> We use a regular join in place of FR join in the current code base (fd31fda). We need 
> to implement FR join.
> Some info collected from 
> https://pig.apache.org/docs/r0.11.0/perf.html#replicated-joins:
> *Replicated Joins*
> Fragment replicate join is a special type of join that works well if one or 
> more relations are small enough to fit into main memory. In such cases, Pig 
> can perform a very efficient join because all of the hadoop work is done on 
> the map side. In this type of join the large relation is followed by one or 
> more small relations. The small relations must be small enough to fit into 
> main memory; if they don't, the process fails and an error is generated.
> *Usage*
> Perform a replicated join with the USING clause (see JOIN (inner) and JOIN 
> (outer)). In this example, a large relation is joined with two smaller 
> relations. Note that the large relation comes first followed by the smaller 
> relations; and, all small relations together must fit into main memory, 
> otherwise an error is generated.
> big = LOAD 'big_data' AS (b1,b2,b3);
> tiny = LOAD 'tiny_data' AS (t1,t2,t3);
> mini = LOAD 'mini_data' AS (m1,m2,m3);
> C = JOIN big BY b1, tiny BY t1, mini BY m1 USING 'replicated';
> *Conditions*
> Fragment replicate joins are experimental; we don't have a strong sense of 
> how small the small relation must be to fit into memory. In our tests with a 
> simple query that involves just a JOIN, a relation of up to 100 M can be used 
> if the process overall gets 1 GB of memory. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4876) OutputConsumeIterator can't handle the last buffered tuples for some Operators

2016-05-16 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15285832#comment-15285832
 ] 

Mohit Sabharwal commented on PIG-4876:
--

Thanks, [~kexianda], [~kellyzly], let's go with (b).  Let's add a detailed 
comment about this, so it can be reviewed by the committers when we merge this 
to master.

> OutputConsumeIterator can't handle the last buffered tuples for some Operators
> --
>
> Key: PIG-4876
> URL: https://issues.apache.org/jira/browse/PIG-4876
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Xianda Ke
>Assignee: Xianda Ke
> Fix For: spark-branch
>
> Attachments: PIG-4876.patch
>
>
> Some Operators, such as MergeCogroup, Stream, CollectedGroup, etc., buffer some 
> input records in order to assemble the result tuples. The last result tuples are 
> buffered in the operator. These Operators need a flag to indicate the end of 
> input, so that they can flush and assemble their last tuples.
> Currently, the flag 'parentPlan.endOfAllInput' is targeted at flushing the 
> buffered tuples in MR mode. But it does not work with OutputConsumeIterator 
> in Spark mode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4876) OutputConsumeIterator can't handle the last buffered tuples for some Operators

2016-04-28 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15263299#comment-15263299
 ] 

Mohit Sabharwal commented on PIG-4876:
--

It's not clear to me why adding beginOfInput() is complex or less readable.

In OutputConsumerIterator, add a *beginOfInput()* abstract method:
{code:title=OutputConsumerIterator.java|borderStyle=solid}
abstract protected void beginOfInput();
{code}

In OutputConsumerIterator.readNext(), insert *beginOfInput()* as shown below:
{code:title=OutputConsumerIterator.java|borderStyle=solid}
   ...
   if (result == null) {
       beginOfInput();   // INSERT THIS CALL
       if (!input.hasNext()) {
           done = true;
           return;
       }
       Tuple v1 = input.next();
       attach(v1);
   }
   ...
{code}

Now, in every operator, where we have implemented endOfInput(), also implement 
beginOfInput().

For example, in CollectedGroupConverter we have implemented endOfInput(). We 
implemented beginOfInput() as:
{code:title=CollectedGroupConverter.java|borderStyle=solid}
@Override
protected void beginOfInput() {
    poCollectedGroup.getParentPlan().endOfAllInput = false;
}
{code}


Maybe I'm misunderstanding this ?

> OutputConsumeIterator can't handle the last buffered tuples for some Operators
> --
>
> Key: PIG-4876
> URL: https://issues.apache.org/jira/browse/PIG-4876
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Xianda Ke
>Assignee: Xianda Ke
> Fix For: spark-branch
>
> Attachments: PIG-4876.patch
>
>
> Some Operators, such as MergeCogroup, Stream, CollectedGroup, etc., buffer some 
> input records in order to assemble the result tuples. The last result tuples are 
> buffered in the operator. These Operators need a flag to indicate the end of 
> input, so that they can flush and assemble their last tuples.
> Currently, the flag 'parentPlan.endOfAllInput' is targeted at flushing the 
> buffered tuples in MR mode. But it does not work with OutputConsumeIterator 
> in Spark mode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4854) Merge spark branch to trunk

2016-04-26 Thread Mohit Sabharwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mohit Sabharwal updated PIG-4854:
-
Issue Type: Sub-task  (was: Task)
Parent: PIG-4059

> Merge spark branch to trunk
> ---
>
> Key: PIG-4854
> URL: https://issues.apache.org/jira/browse/PIG-4854
> Project: Pig
>  Issue Type: Sub-task
>Reporter: Pallavi Rao
> Attachments: PIG-On-Spark.patch
>
>
> Believe the spark branch will be shortly ready to be merged with the main 
> branch (couple of minor patches pending commit), given that we have addressed 
> most functionality gaps and have ensured the UTs are clean. There are a few 
> optimizations which we will take up once the branch is merged to trunk.
> [~xuefuz], [~rohini], [~daijy],
> Hopefully, you agree that the spark branch is ready for merge. If yes, how 
> would you like us to go about it? Do you want me to upload a huge patch that will 
> be merged like any other patch, or do you prefer a branch merge?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4876) OutputConsumeIterator can't handle the last buffered tuples for some Operators

2016-04-22 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15253898#comment-15253898
 ] 

Mohit Sabharwal commented on PIG-4876:
--

Another question: Does it make sense to add another abstract method (similar to 
endOfInput()) like beginOfInput() that resets the flag at the beginning ? Would 
that work ?

Just trying to minimize non-spark code change...

> OutputConsumeIterator can't handle the last buffered tuples for some Operators
> --
>
> Key: PIG-4876
> URL: https://issues.apache.org/jira/browse/PIG-4876
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Xianda Ke
>Assignee: Xianda Ke
> Fix For: spark-branch
>
> Attachments: PIG-4876.patch
>
>
> Some Operators, such as MergeCogroup, Stream, CollectedGroup, etc., buffer some 
> input records in order to assemble the result tuples. The last result tuples are 
> buffered in the operator. These Operators need a flag to indicate the end of 
> input, so that they can flush and assemble their last tuples.
> Currently, the flag 'parentPlan.endOfAllInput' is targeted at flushing the 
> buffered tuples in MR mode. But it does not work with OutputConsumeIterator 
> in Spark mode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4876) OutputConsumeIterator can't handle the last buffered tuples for some Operators

2016-04-21 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15253377#comment-15253377
 ] 

Mohit Sabharwal commented on PIG-4876:
--

Thanks for the explanation [~kexianda]. Left a comment regarding naming on RB. 

To summarize your explanation, since endOfAllInput is shared amongst all 
operators in the plan, it may get set to true by a preceding operator, which 
may affect subsequent operators in the plan (which may not have finished 
processing all tuples). Is that correct ?

One question:
  - After PIG-4542 patch (https://reviews.apache.org/r/34003), I see that 
TestCollectedGroup was passing. What is different about usage of CollectedGroup 
in PIG-4842  that it caused it to now fail ?
 

> OutputConsumeIterator can't handle the last buffered tuples for some Operators
> --
>
> Key: PIG-4876
> URL: https://issues.apache.org/jira/browse/PIG-4876
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Xianda Ke
>Assignee: Xianda Ke
> Fix For: spark-branch
>
> Attachments: PIG-4876.patch
>
>
> Some Operators, such as MergeCogroup, Stream, CollectedGroup, etc., buffer some 
> input records in order to assemble the result tuples. The last result tuples are 
> buffered in the operator. These Operators need a flag to indicate the end of 
> input, so that they can flush and assemble their last tuples.
> Currently, the flag 'parentPlan.endOfAllInput' is targeted at flushing the 
> buffered tuples in MR mode. But it does not work with OutputConsumeIterator 
> in Spark mode.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4859) Need upgrade snappy-java.version to 1.1.1.3

2016-04-04 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15225017#comment-15225017
 ] 

Mohit Sabharwal commented on PIG-4859:
--

+1 (non-binding)

> Need upgrade snappy-java.version to 1.1.1.3
> ---
>
> Key: PIG-4859
> URL: https://issues.apache.org/jira/browse/PIG-4859
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4859.patch
>
>
> Run pig on spark in a yarn-client env as follows:
> export SPARK_MASTER="yarn-client"
> ./pig -x spark xxx.pig
> It throws an error like the following:
> {code}
> [main] 2016-03-30 16:52:26,115 INFO  scheduler.DAGScheduler 
> (Logging.scala:logInfo(59)) - Job 0 failed: saveAsNewAPIHadoopDataset at 
> StoreConverter.java:101, took 73.980147 s
> 19895 [main] 2016-03-30 16:52:26,119 ERROR spark.JobGraphBuilder 
> (JobGraphBuilder.java:sparkOperToRDD(166)) - throw exception in 
> sparkOperToRDD:
> 19896 org.apache.spark.SparkException: Job aborted due to stage failure: Task 
> 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 
> 0.0 (TID 3, zly1.sh.intel.com): java.lang.UnsatisfiedLinkError: 
> org.xerial.snappy.SnappyNative.uncompressedLength(Ljava/lang/Object;II)I
> 19897 at org.xerial.snappy.SnappyNative.uncompressedLength(Native 
> Method)
> 19898 at org.xerial.snappy.Snappy.uncompressedLength(Snappy.java:541)
> 19899 at 
> org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:350)
> 19900 at 
> org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:158)
> 19901 at 
> org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:142)
> 19902 at 
> java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2313)
> 19903 at 
> java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2326)
> 19904 at 
> java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2797)
> 19905 at 
> java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:802)
> 19906 at java.io.ObjectInputStream.<init>(ObjectInputStream.java:299)
> 19907 at 
> org.apache.spark.serializer.JavaDeserializationStream$$anon$1.<init>(JavaSerializer.scala:64)
> 19908 at 
> org.apache.spark.serializer.JavaDeserializationStream.<init>(JavaSerializer.scala:64)
> 19909 at 
> org.apache.spark.serializer.JavaSerializerInstance.deserializeStream(JavaSerializer.scala:103)
> 19910 at 
> org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:216)
> 19911 at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:178)
> {code}
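For reference, the fix amounts to bumping the dependency version property named in the title; a hedged sketch, assuming the property lives in ivy/libraries.properties like Pig's other dependency versions:

{code}
# assumed location: ivy/libraries.properties
snappy-java.version=1.1.1.3
{code}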



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4857) Last record is missing in STREAM operator

2016-03-30 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15218544#comment-15218544
 ] 

Mohit Sabharwal commented on PIG-4857:
--

[~kexianda], since the StreamOperator uses OutputConsumerIterator, isn't this 
just a matter of correctly implementing the endOfInput() method in the 
OutputConsumerIterator object  
(https://github.com/apache/pig/blob/spark/src/org/apache/pig/backend/hadoop/executionengine/spark/converter/StreamConverter.java#L106)
 ?

IOW, endOfInput() is supposed to be implemented to flush the last record.
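
Concretely, that would look roughly like the following inside StreamConverter.convert() (a hedged sketch: the attach/getNextResult shapes follow the converter pattern on the spark branch, but details may differ from the actual patch):

{code}
Iterator<Tuple> outputIterator = new OutputConsumerIterator(input) {
    @Override
    protected void attach(Tuple tuple) {
        poStream.setInputs(null);
        poStream.attachInput(tuple);
    }

    @Override
    protected Result getNextResult() throws ExecException {
        return poStream.getNextTuple();
    }

    @Override
    protected void endOfInput() {
        // signal POStream that all input has arrived, so it flushes its
        // buffered last record instead of dropping it
        poStream.getParentPlan().endOfAllInput = true;
    }
};
{code}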

> Last record is missing in STREAM operator
> -
>
> Key: PIG-4857
> URL: https://issues.apache.org/jira/browse/PIG-4857
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Xianda Ke
>Assignee: Xianda Ke
> Fix For: spark-branch
>
> Attachments: PIG-4857.patch
>
>
> This bug is similar to PIG-4842.
> Scenario:
> {code}
> cat input.txt
> 1
> 1
> 2
> {code}
> Pig script:
> {code}
> REGISTER myudfs.jar;
> A = LOAD 'input.txt' USING myudfs.DummyCollectableLoader() AS (id); 
> B = GROUP A by $0 USING 'collected';-- (1, {(1),(1)}), (2,{(2)})
> C = STREAM B THROUGH ` awk '{
>  print $0;
> }'`;
> DUMP C;
> {code}
> Expected Result:
> {code}
> (1,{(1),(1)})
> (2,{(2)})
> {code}
> Actual Result:
> {code}
> (1,{(1),(1)})
> {code}
> The last record is missing...
> Root Cause:
> When the flag endOfAllInput is set to true by the predecessor, the 
> predecessor still buffers the last record, which is the input of Stream. Then 
> POStream finds endOfAllInput is true while, in fact, the last input has not been 
> consumed yet.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4837) TestNativeMapReduce test fix

2016-03-28 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15215345#comment-15215345
 ] 

Mohit Sabharwal commented on PIG-4837:
--

+1 (non-binding) for PIG-4837_3.patch

> TestNativeMapReduce test fix
> 
>
> Key: PIG-4837
> URL: https://issues.apache.org/jira/browse/PIG-4837
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4837.patch, PIG-4837_2.patch, PIG-4837_3.patch, 
> build23.PNG, build27.PNG
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4842) Collected group doesn't work in some cases

2016-03-28 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15215327#comment-15215327
 ] 

Mohit Sabharwal commented on PIG-4842:
--

+1 (non-binding)

> Collected group doesn't work in some cases
> --
>
> Key: PIG-4842
> URL: https://issues.apache.org/jira/browse/PIG-4842
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Xianda Ke
>Assignee: Xianda Ke
> Fix For: spark-branch
>
> Attachments: PIG-4842-2.patch, PIG-4842.patch
>
>
> Scenario:
> 1. input data:
> cat collectedgroup1
> {code}
> 1
> 1
> 2
> {code}
> 2. pig script:
> {code}
> A = LOAD 'collectedgroup1' USING myudfs.DummyCollectableLoader() AS (id);
> B = GROUP A by $0 USING 'collected';
> C = GROUP B by $0 USING 'collected';
> DUMP C;
> {code}
> The expected output:
> {code}
> (1,{(1,{(1),(1)})})
> (2,{(2,{(2)})})
> {code}
> The actual output:
> {code}
> (1,{(1,{(1),(1)})})
> (1,)
> (2,{(2,{(2)})})
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4837) TestNativeMapReduce test fix

2016-03-28 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15215102#comment-15215102
 ] 

Mohit Sabharwal commented on PIG-4837:
--

I agree with [~pallavi.rao]. Running an MR job in Spark mode should not be our 
priority. We may want to support such "mixed mode" in the future. My vote would 
be to a) add it to test/excluded-tests-spark and b) add a comment there with a 
reference to this jira.

> TestNativeMapReduce test fix
> 
>
> Key: PIG-4837
> URL: https://issues.apache.org/jira/browse/PIG-4837
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4837.patch, PIG-4837_2.patch, build23.PNG, 
> build27.PNG
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4836) Fix TestEvalPipeline test failure

2016-03-10 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189445#comment-15189445
 ] 

Mohit Sabharwal commented on PIG-4836:
--

[~xuefuz], please commit when you get a chance.

> Fix TestEvalPipeline test failure
> -
>
> Key: PIG-4836
> URL: https://issues.apache.org/jira/browse/PIG-4836
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Pallavi Rao
>Assignee: Pallavi Rao
>  Labels: spork
> Fix For: spark-branch
>
> Attachments: PIG-4836.patch
>
>
> There are two test failures:
> testMapUDF
> testLimit 
> testLimit will get fixed by PIG-4832. This JIRA will only fix testMapUDF.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4836) Fix TestEvalPipeline test failure

2016-03-10 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189224#comment-15189224
 ] 

Mohit Sabharwal commented on PIG-4836:
--

+1 
Thanks, [~pallavi.rao]. 


> Fix TestEvalPipeline test failure
> -
>
> Key: PIG-4836
> URL: https://issues.apache.org/jira/browse/PIG-4836
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Pallavi Rao
>Assignee: Pallavi Rao
>  Labels: spork
> Fix For: spark-branch
>
> Attachments: PIG-4836.patch
>
>
> There are two test failures:
> testMapUDF
> testLimit 
> testLimit will get fixed by PIG-4832. This JIRA will only fix testMapUDF.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4835) Fix TestPigRunner test failure

2016-03-09 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15188578#comment-15188578
 ] 

Mohit Sabharwal commented on PIG-4835:
--

yes, sorry, commented on wrong jira :)

> Fix TestPigRunner test failure
> --
>
> Key: PIG-4835
> URL: https://issues.apache.org/jira/browse/PIG-4835
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Pallavi Rao
>Assignee: Pallavi Rao
>  Labels: spork
> Fix For: spark-branch
>
> Attachments: PIG-4835.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4836) Fix TestEvalPipeline test failure

2016-03-09 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15188577#comment-15188577
 ] 

Mohit Sabharwal commented on PIG-4836:
--

[~pallavi.rao], quick question: looks like this sets an empty mr progress 
reporter in thread local.  Is this needed just for POForEach ? If it affects 
other operators as well, should we set it earlier, like in JobGraphBuilder ?
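
For context, the pattern being asked about is roughly this one-time setup on the executor thread (a hedged sketch; the exact reporter class and call site in the patch may differ):

{code}
// Install a no-op MR progress reporter into the thread-local that physical
// operators (e.g. POForEach's UDFs) consult, so they don't hit a null
// reporter on Spark executor threads where no MR task context exists.
PhysicalOperator.setReporter(new ProgressableReporter());
{code}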


> Fix TestEvalPipeline test failure
> -
>
> Key: PIG-4836
> URL: https://issues.apache.org/jira/browse/PIG-4836
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Pallavi Rao
>Assignee: Pallavi Rao
>  Labels: spork
> Fix For: spark-branch
>
> Attachments: PIG-4836.patch
>
>
> There are two test failures:
> testMapUDF
> testLimit 
> testLimit will get fixed by PIG-4832. This JIRA will only fix testMapUDF.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4835) Fix TestPigRunner test failure

2016-03-09 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15188562#comment-15188562
 ] 

Mohit Sabharwal commented on PIG-4835:
--

[~pallavi.rao], quick question: looks like this sets an empty mr progress 
reporter in thread local.  Is this needed just for POForEach ? If it affects 
other operators as well, should we set it earlier, like in JobGraphBuilder ?

> Fix TestPigRunner test failure
> --
>
> Key: PIG-4835
> URL: https://issues.apache.org/jira/browse/PIG-4835
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Pallavi Rao
>Assignee: Pallavi Rao
>  Labels: spork
> Fix For: spark-branch
>
> Attachments: PIG-4835.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4827) Fix TestSample UT failure

2016-03-08 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15185290#comment-15185290
 ] 

Mohit Sabharwal commented on PIG-4827:
--

+1  
Thanks, [~pallavi.rao].

> Fix TestSample UT failure
> -
>
> Key: PIG-4827
> URL: https://issues.apache.org/jira/browse/PIG-4827
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Pallavi Rao
>Assignee: Pallavi Rao
>  Labels: spork
> Fix For: spark-branch
>
> Attachments: PIG-4827-v1.patch, PIG-4827.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4825) Fix TestMultiQuery failure

2016-03-07 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184365#comment-15184365
 ] 

Mohit Sabharwal commented on PIG-4825:
--

Agreed. We saw this pattern of failures earlier and [~rohini] recommended  
{{Util.checkQueryOutputsAfterSort}}

> Fix TestMultiQuery failure
> --
>
> Key: PIG-4825
> URL: https://issues.apache.org/jira/browse/PIG-4825
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Pallavi Rao
>Assignee: Pallavi Rao
>  Labels: spork
> Fix For: spark-branch
>
> Attachments: PIG-4825.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4827) Fix TestSample UT failure

2016-03-07 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15183327#comment-15183327
 ] 

Mohit Sabharwal commented on PIG-4827:
--

Needs minor error message update. Otherwise, LGTM.

> Fix TestSample UT failure
> -
>
> Key: PIG-4827
> URL: https://issues.apache.org/jira/browse/PIG-4827
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Pallavi Rao
>Assignee: Pallavi Rao
>  Labels: spork
> Fix For: spark-branch
>
> Attachments: PIG-4827.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4788) the value BytesRead metric info always returns 0 even the length of input file is not 0 in spark engine

2016-02-29 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173189#comment-15173189
 ] 

Mohit Sabharwal commented on PIG-4788:
--

Ah, of course, sorry - FileSplit can't be replaced by PigSplit.

My other concern was whether changing PigSplit to extend FileSplit will break 
PigSplit for inputformats that use non-File splits. Makes sense ?   

> the value BytesRead metric info always returns 0 even the length of input 
> file is not 0 in spark engine
> ---
>
> Key: PIG-4788
> URL: https://issues.apache.org/jira/browse/PIG-4788
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4788.patch
>
>
> In 
> [JobMetricsListener#onTaskEnd|https://github.com/apache/pig/blob/spark/src/org/apache/pig/tools/pigstats/spark/SparkJobStats.java#L140],
>  taskMetrics.inputMetrics().get().bytesRead() always returns 0 even when the 
> length of the input file is not zero.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4788) the value BytesRead metric info always returns 0 even the length of input file is not 0 in spark engine

2016-02-29 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173164#comment-15173164
 ] 

Mohit Sabharwal commented on PIG-4788:
--

[~kellyzly], if you change {{PigSplit}} to extend {{FileSplit}}, will 
{{PigInputFormat}} still work with non-file splits like CombineFileSplit, etc. ?

Can we instead use {{FileSplit}} when we create the record reader in 
{{PigInputFormatSpark}}, instead of {{PigSplit}} ? That way we could isolate 
the change in Spark specific code.  

> the value BytesRead metric info always returns 0 even the length of input 
> file is not 0 in spark engine
> ---
>
> Key: PIG-4788
> URL: https://issues.apache.org/jira/browse/PIG-4788
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4788.patch
>
>
> In 
> [JobMetricsListener#onTaskEnd|https://github.com/apache/pig/blob/spark/src/org/apache/pig/tools/pigstats/spark/SparkJobStats.java#L140],
>  taskMetrics.inputMetrics().get().bytesRead() always returns 0 even when the 
> length of the input file is not zero.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4601) Implement Merge CoGroup for Spark engine

2016-02-16 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15149022#comment-15149022
 ] 

Mohit Sabharwal commented on PIG-4601:
--

+1 (non-binding), sorry about the delay reviewing the updated patch.

> Implement Merge CoGroup for Spark engine
> 
>
> Key: PIG-4601
> URL: https://issues.apache.org/jira/browse/PIG-4601
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Affects Versions: spark-branch
>Reporter: Mohit Sabharwal
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4601_1.patch, PIG-4601_2.patch
>
>
> When doing a cogroup operation, we need to do a map-reduce. The goal of merge 
> cogroup is to implement cogroup in only a single stage (map). But we need to 
> guarantee that the input data are sorted.
> There is a performance improvement for cases where A (a big dataset) is 
> merge-cogrouped with B (a small dataset), because we first generate an index 
> file of A and then load A according to the index file and B into memory to do 
> the cogroup. The performance improves because there is no cost of a reduce 
> phase, compared with a regular cogroup.
> How to use:
> {code}
> C = cogroup A by c1, B by c1 using 'merge';
> {code}
> Here A and B are sorted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4601) Implement Merge CoGroup for Spark engine

2016-01-15 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4601?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15102751#comment-15102751
 ] 

Mohit Sabharwal commented on PIG-4601:
--

Thanks, [~kellyzly]! I have a couple of questions on RB.

> Implement Merge CoGroup for Spark engine
> 
>
> Key: PIG-4601
> URL: https://issues.apache.org/jira/browse/PIG-4601
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Affects Versions: spark-branch
>Reporter: Mohit Sabharwal
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4601_1.patch
>
>
> When doing a cogroup operation, we need to do a map-reduce. The goal of merge 
> cogroup is to implement cogroup in only a single stage (map). But we need to 
> guarantee that the input data are sorted.
> There is a performance improvement for cases where A (a big dataset) is 
> merge-cogrouped with B (a small dataset), because we first generate an index 
> file of A and then load A according to the index file and B into memory to do 
> the cogroup. The performance improves because there is no cost of a reduce 
> phase, compared with a regular cogroup.
> How to use:
> {code}
> C = cogroup A by c1, B by c1 using 'merge';
> {code}
> Here A and B are sorted.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4611) Fix remaining unit test failures about "TestHBaseStorage" in spark mode

2016-01-14 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15101131#comment-15101131
 ] 

Mohit Sabharwal commented on PIG-4611:
--

+1 (non-binding)

> Fix remaining unit test failures about "TestHBaseStorage" in spark mode
> ---
>
> Key: PIG-4611
> URL: https://issues.apache.org/jira/browse/PIG-4611
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4611.patch, PIG-4611_2.patch, PIG-4611_3.patch
>
>
> In https://builds.apache.org/job/Pig-spark/lastCompletedBuild/testReport/, it 
> shows the following unit test failures related to TestHBaseStorage:
>  org.apache.pig.test.TestHBaseStorage.testStoreToHBase_1_with_delete  
>  org.apache.pig.test.TestHBaseStorage.testLoadWithProjection_1
>  org.apache.pig.test.TestHBaseStorage.testLoadWithProjection_2
>  org.apache.pig.test.TestHBaseStorage.testStoreToHBase_2_with_projection
>  org.apache.pig.test.TestHBaseStorage.testCollectedGroup  
>  org.apache.pig.test.TestHBaseStorage.testHeterogeneousScans



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4675) Operators with multiple predecessors fail under multiquery optimization

2015-12-17 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15063531#comment-15063531
 ] 

Mohit Sabharwal commented on PIG-4675:
--

+1(non-binding)

> Operators with multiple predecessors fail under multiquery optimization
> ---
>
> Key: PIG-4675
> URL: https://issues.apache.org/jira/browse/PIG-4675
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Peter Lin
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4675_1.patch, PIG-4675_2.patch, PIG-4675_3.patch, 
> name.txt, ssn.txt, test.pig
>
>
> We have recently been testing the spark branch of Pig with MapR3 and Spark 1.5. It 
> turns out that if we use more than one store command in the pig script, we get an 
> exception from the second store command. 
>  SSN = load '/test/ssn.txt' using PigStorage() as (ssn:long);
>  SSN_NAME = load '/test/name.txt' using PigStorage() as (ssn:long, 
> name:chararray);
>  X = JOIN SSN by ssn LEFT OUTER, SSN_NAME by ssn USING 'replicated';
>  R1 = limit SSN_NAME 10;
>  store R1 into '/tmp/test1_r1'; 
>  store X into '/tmp/test1_x';
> Exception Details:
> 15/09/11 13:37:00 INFO storage.MemoryStore: ensureFreeSpace(114448) called 
> with curMem=359237, maxMem=503379394
> 15/09/11 13:37:00 INFO storage.MemoryStore: Block broadcast_2 stored as 
> values in memory (estimated size 111.8 KB, free 479.6 MB)
> 15/09/11 13:37:00 INFO storage.MemoryStore: ensureFreeSpace(32569) called 
> with curMem=473685, maxMem=503379394
> 15/09/11 13:37:00 INFO storage.MemoryStore: Block broadcast_2_piece0 stored 
> as bytes in memory (estimated size 31.8 KB, free 479.6 MB)
> 15/09/11 13:37:00 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in 
> memory on 10.51.2.82:55960 (size: 31.8 KB, free: 479.9 MB)
> 15/09/11 13:37:00 INFO spark.SparkContext: Created broadcast 2 from 
> newAPIHadoopRDD at LoadConverter.java:88
> 15/09/11 13:37:00 WARN util.ClosureCleaner: Expected a closure; got 
> org.apache.pig.backend.hadoop.executionengine.spark.converter.LoadConverter$ToTupleFunction
> 15/09/11 13:37:00 INFO spark.SparkLauncher: Converting operator POForEach 
> (Name: SSN: New For Each(false)[bag] - scope-17 Operator Key: scope-17)
> 15/09/11 13:37:00 INFO spark.SparkLauncher: Converting operator POFRJoin 
> (Name: X: FRJoin[tuple] - scope-22 Operator Key: scope-22)
> 15/09/11 13:37:00 ERROR spark.SparkLauncher: throw exception in 
> sparkOperToRDD:
> java.lang.RuntimeException: Should have greater than1 predecessors for class 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFRJoin.
>  Got : 1
> at 
> org.apache.pig.backend.hadoop.executionengine.spark.SparkUtil.assertPredecessorSizeGreaterThan(SparkUtil.java:93)
> at 
> org.apache.pig.backend.hadoop.executionengine.spark.converter.FRJoinConverter.convert(FRJoinConverter.java:55)
> at 
> org.apache.pig.backend.hadoop.executionengine.spark.converter.FRJoinConverter.convert(FRJoinConverter.java:46)
> at 
> org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.physicalToRDD(SparkLauncher.java:633)
> at 
> org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.physicalToRDD(SparkLauncher.java:600)
> at 
> org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.physicalToRDD(SparkLauncher.java:621)
> at 
> org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.sparkOperToRDD(SparkLauncher.java:552)
> at 
> org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.sparkPlanToRDD(SparkLauncher.java:501)
> at 
> org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.launchPig(SparkLauncher.java:204)
> at 
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:301)
> at org.apache.pig.PigServer.launchPlan(PigServer.java:1390)
> at 
> org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1375)
> at org.apache.pig.PigServer.execute(PigServer.java:1364)
> at org.apache.pig.PigServer.executeBatch(PigServer.java:415)
> at org.apache.pig.PigServer.executeBatch(PigServer.java:398)
> at 
> org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:171)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:234)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
> at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
> at org.apache.pig.Main.run(Main.java:624)
> at org.apache.pig.Main.main(Main.java:170)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4293) Enable unit test "TestNativeMapReduce" for spark

2015-12-17 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15063344#comment-15063344
 ] 

Mohit Sabharwal commented on PIG-4293:
--

+1 (non-binding)

> Enable unit test "TestNativeMapReduce" for spark
> 
>
> Key: PIG-4293
> URL: https://issues.apache.org/jira/browse/PIG-4293
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4293.patch, PIG-4293_1.patch, 
> TEST-org.apache.pig.test.TestNativeMapReduce.txt
>
>
> error log is attached



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4675) Operators with multiple predecessors fail under multiquery optimization

2015-12-17 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15063353#comment-15063353
 ] 

Mohit Sabharwal commented on PIG-4675:
--

Thanks, [~kellyzly]. I had one minor comment. Otherwise LGTM.

> Operators with multiple predecessors fail under multiquery optimization
> ---
>
> Key: PIG-4675
> URL: https://issues.apache.org/jira/browse/PIG-4675
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Peter Lin
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4675_1.patch, PIG-4675_2.patch, name.txt, ssn.txt, 
> test.pig
>
>
> We have recently been testing the spark branch of Pig with MapR3 and Spark 1.5. It 
> turns out that if we use more than one store command in the pig script, we get an 
> exception from the second store command. 
>  SSN = load '/test/ssn.txt' using PigStorage() as (ssn:long);
>  SSN_NAME = load '/test/name.txt' using PigStorage() as (ssn:long, 
> name:chararray);
>  X = JOIN SSN by ssn LEFT OUTER, SSN_NAME by ssn USING 'replicated';
>  R1 = limit SSN_NAME 10;
>  store R1 into '/tmp/test1_r1'; 
>  store X into '/tmp/test1_x';
> Exception Details:
> 15/09/11 13:37:00 INFO storage.MemoryStore: ensureFreeSpace(114448) called 
> with curMem=359237, maxMem=503379394
> 15/09/11 13:37:00 INFO storage.MemoryStore: Block broadcast_2 stored as 
> values in memory (estimated size 111.8 KB, free 479.6 MB)
> 15/09/11 13:37:00 INFO storage.MemoryStore: ensureFreeSpace(32569) called 
> with curMem=473685, maxMem=503379394
> 15/09/11 13:37:00 INFO storage.MemoryStore: Block broadcast_2_piece0 stored 
> as bytes in memory (estimated size 31.8 KB, free 479.6 MB)
> 15/09/11 13:37:00 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in 
> memory on 10.51.2.82:55960 (size: 31.8 KB, free: 479.9 MB)
> 15/09/11 13:37:00 INFO spark.SparkContext: Created broadcast 2 from 
> newAPIHadoopRDD at LoadConverter.java:88
> 15/09/11 13:37:00 WARN util.ClosureCleaner: Expected a closure; got 
> org.apache.pig.backend.hadoop.executionengine.spark.converter.LoadConverter$ToTupleFunction
> 15/09/11 13:37:00 INFO spark.SparkLauncher: Converting operator POForEach 
> (Name: SSN: New For Each(false)[bag] - scope-17 Operator Key: scope-17)
> 15/09/11 13:37:00 INFO spark.SparkLauncher: Converting operator POFRJoin 
> (Name: X: FRJoin[tuple] - scope-22 Operator Key: scope-22)
> 15/09/11 13:37:00 ERROR spark.SparkLauncher: throw exception in 
> sparkOperToRDD:
> java.lang.RuntimeException: Should have greater than1 predecessors for class 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFRJoin.
>  Got : 1
> at 
> org.apache.pig.backend.hadoop.executionengine.spark.SparkUtil.assertPredecessorSizeGreaterThan(SparkUtil.java:93)
> at 
> org.apache.pig.backend.hadoop.executionengine.spark.converter.FRJoinConverter.convert(FRJoinConverter.java:55)
> at 
> org.apache.pig.backend.hadoop.executionengine.spark.converter.FRJoinConverter.convert(FRJoinConverter.java:46)
> at 
> org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.physicalToRDD(SparkLauncher.java:633)
> at 
> org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.physicalToRDD(SparkLauncher.java:600)
> at 
> org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.physicalToRDD(SparkLauncher.java:621)
> at 
> org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.sparkOperToRDD(SparkLauncher.java:552)
> at 
> org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.sparkPlanToRDD(SparkLauncher.java:501)
> at 
> org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.launchPig(SparkLauncher.java:204)
> at 
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:301)
> at org.apache.pig.PigServer.launchPlan(PigServer.java:1390)
> at 
> org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1375)
> at org.apache.pig.PigServer.execute(PigServer.java:1364)
> at org.apache.pig.PigServer.executeBatch(PigServer.java:415)
> at org.apache.pig.PigServer.executeBatch(PigServer.java:398)
> at 
> org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:171)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:234)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
> at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
> at org.apache.pig.Main.run(Main.java:624)
> at org.apache.pig.Main.main(Main.java:170)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4754) Fix UT failures in TestScriptLanguage

2015-12-17 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15063427#comment-15063427
 ] 

Mohit Sabharwal commented on PIG-4754:
--

+1(non-binding). LGTM.

Could you please add a comment explaining why that block is protected and update 
the patch?

> Fix UT failures in TestScriptLanguage
> -
>
> Key: PIG-4754
> URL: https://issues.apache.org/jira/browse/PIG-4754
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Xianda Ke
>Assignee: Xianda Ke
> Fix For: spark-branch
>
> Attachments: PIG-4754.patch
>
>
> org.apache.pig.test.TestScriptLanguage.runParallelTest2
> Error Message
> job should succeed
> Stacktrace
> junit.framework.AssertionFailedError: job should succeed
>   at 
> org.apache.pig.test.TestScriptLanguage.runPigRunner(TestScriptLanguage.java:96)
>   at 
> org.apache.pig.test.TestScriptLanguage.runPigRunner(TestScriptLanguage.java:105)
>   at 
> org.apache.pig.test.TestScriptLanguage.runParallelTest2(TestScriptLanguage.java:311)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4675) Operators with multiple predecessors fail under

2015-12-04 Thread Mohit Sabharwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mohit Sabharwal updated PIG-4675:
-
Summary: Operators with multiple predecessors fail under   (was: FR+Limit 
case fails when enable MultiQuery because the predecessor information is 
wrongly calculated in current code.)

> Operators with multiple predecessors fail under 
> 
>
> Key: PIG-4675
> URL: https://issues.apache.org/jira/browse/PIG-4675
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Peter Lin
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4675_1.patch, name.txt, ssn.txt, test.pig
>
>
> We are testing the spark branch pig recently with mapr3 and spark 1.5. It 
> turns out if we use more than 1 store command in the pig script will have 
> exception from the second store command. 
>  SSN = load '/test/ssn.txt' using PigStorage() as (ssn:long);
>  SSN_NAME = load '/test/name.txt' using PigStorage() as (ssn:long, 
> name:chararray);
>  X = JOIN SSN by ssn LEFT OUTER, SSN_NAME by ssn USING 'replicated';
>  R1 = limit SSN_NAME 10;
>  store R1 into '/tmp/test1_r1'; 
>  store X into '/tmp/test1_x';
> Exception Details:
> 15/09/11 13:37:00 INFO storage.MemoryStore: ensureFreeSpace(114448) called 
> with curMem=359237, maxMem=503379394
> 15/09/11 13:37:00 INFO storage.MemoryStore: Block broadcast_2 stored as 
> values in memory (estimated size 111.8 KB, free 479.6 MB)
> 15/09/11 13:37:00 INFO storage.MemoryStore: ensureFreeSpace(32569) called 
> with curMem=473685, maxMem=503379394
> 15/09/11 13:37:00 INFO storage.MemoryStore: Block broadcast_2_piece0 stored 
> as bytes in memory (estimated size 31.8 KB, free 479.6 MB)
> 15/09/11 13:37:00 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in 
> memory on 10.51.2.82:55960 (size: 31.8 KB, free: 479.9 MB)
> 15/09/11 13:37:00 INFO spark.SparkContext: Created broadcast 2 from 
> newAPIHadoopRDD at LoadConverter.java:88
> 15/09/11 13:37:00 WARN util.ClosureCleaner: Expected a closure; got 
> org.apache.pig.backend.hadoop.executionengine.spark.converter.LoadConverter$ToTupleFunction
> 15/09/11 13:37:00 INFO spark.SparkLauncher: Converting operator POForEach 
> (Name: SSN: New For Each(false)[bag] - scope-17 Operator Key: scope-17)
> 15/09/11 13:37:00 INFO spark.SparkLauncher: Converting operator POFRJoin 
> (Name: X: FRJoin[tuple] - scope-22 Operator Key: scope-22)
> 15/09/11 13:37:00 ERROR spark.SparkLauncher: throw exception in 
> sparkOperToRDD:
> java.lang.RuntimeException: Should have greater than1 predecessors for class 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFRJoin.
>  Got : 1
> at 
> org.apache.pig.backend.hadoop.executionengine.spark.SparkUtil.assertPredecessorSizeGreaterThan(SparkUtil.java:93)
> at 
> org.apache.pig.backend.hadoop.executionengine.spark.converter.FRJoinConverter.convert(FRJoinConverter.java:55)
> at 
> org.apache.pig.backend.hadoop.executionengine.spark.converter.FRJoinConverter.convert(FRJoinConverter.java:46)
> at 
> org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.physicalToRDD(SparkLauncher.java:633)
> at 
> org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.physicalToRDD(SparkLauncher.java:600)
> at 
> org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.physicalToRDD(SparkLauncher.java:621)
> at 
> org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.sparkOperToRDD(SparkLauncher.java:552)
> at 
> org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.sparkPlanToRDD(SparkLauncher.java:501)
> at 
> org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.launchPig(SparkLauncher.java:204)
> at 
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:301)
> at org.apache.pig.PigServer.launchPlan(PigServer.java:1390)
> at 
> org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1375)
> at org.apache.pig.PigServer.execute(PigServer.java:1364)
> at org.apache.pig.PigServer.executeBatch(PigServer.java:415)
> at org.apache.pig.PigServer.executeBatch(PigServer.java:398)
> at 
> org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:171)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:234)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
> at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
> at org.apache.pig.Main.run(Main.java:624)
> at org.apache.pig.Main.main(Main.java:170)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4675) Operators with multiple predecessors fail under multiquery optimization

2015-12-04 Thread Mohit Sabharwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mohit Sabharwal updated PIG-4675:
-
Summary: Operators with multiple predecessors fail under multiquery 
optimization  (was: Operators with multiple predecessors fail under )

> Operators with multiple predecessors fail under multiquery optimization
> ---
>
> Key: PIG-4675
> URL: https://issues.apache.org/jira/browse/PIG-4675
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Peter Lin
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4675_1.patch, name.txt, ssn.txt, test.pig
>
>
> We are testing the spark branch pig recently with mapr3 and spark 1.5. It 
> turns out if we use more than 1 store command in the pig script will have 
> exception from the second store command. 
>  SSN = load '/test/ssn.txt' using PigStorage() as (ssn:long);
>  SSN_NAME = load '/test/name.txt' using PigStorage() as (ssn:long, 
> name:chararray);
>  X = JOIN SSN by ssn LEFT OUTER, SSN_NAME by ssn USING 'replicated';
>  R1 = limit SSN_NAME 10;
>  store R1 into '/tmp/test1_r1'; 
>  store X into '/tmp/test1_x';
> Exception Details:
> 15/09/11 13:37:00 INFO storage.MemoryStore: ensureFreeSpace(114448) called 
> with curMem=359237, maxMem=503379394
> 15/09/11 13:37:00 INFO storage.MemoryStore: Block broadcast_2 stored as 
> values in memory (estimated size 111.8 KB, free 479.6 MB)
> 15/09/11 13:37:00 INFO storage.MemoryStore: ensureFreeSpace(32569) called 
> with curMem=473685, maxMem=503379394
> 15/09/11 13:37:00 INFO storage.MemoryStore: Block broadcast_2_piece0 stored 
> as bytes in memory (estimated size 31.8 KB, free 479.6 MB)
> 15/09/11 13:37:00 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in 
> memory on 10.51.2.82:55960 (size: 31.8 KB, free: 479.9 MB)
> 15/09/11 13:37:00 INFO spark.SparkContext: Created broadcast 2 from 
> newAPIHadoopRDD at LoadConverter.java:88
> 15/09/11 13:37:00 WARN util.ClosureCleaner: Expected a closure; got 
> org.apache.pig.backend.hadoop.executionengine.spark.converter.LoadConverter$ToTupleFunction
> 15/09/11 13:37:00 INFO spark.SparkLauncher: Converting operator POForEach 
> (Name: SSN: New For Each(false)[bag] - scope-17 Operator Key: scope-17)
> 15/09/11 13:37:00 INFO spark.SparkLauncher: Converting operator POFRJoin 
> (Name: X: FRJoin[tuple] - scope-22 Operator Key: scope-22)
> 15/09/11 13:37:00 ERROR spark.SparkLauncher: throw exception in 
> sparkOperToRDD:
> java.lang.RuntimeException: Should have greater than1 predecessors for class 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFRJoin.
>  Got : 1
> at 
> org.apache.pig.backend.hadoop.executionengine.spark.SparkUtil.assertPredecessorSizeGreaterThan(SparkUtil.java:93)
> at 
> org.apache.pig.backend.hadoop.executionengine.spark.converter.FRJoinConverter.convert(FRJoinConverter.java:55)
> at 
> org.apache.pig.backend.hadoop.executionengine.spark.converter.FRJoinConverter.convert(FRJoinConverter.java:46)
> at 
> org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.physicalToRDD(SparkLauncher.java:633)
> at 
> org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.physicalToRDD(SparkLauncher.java:600)
> at 
> org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.physicalToRDD(SparkLauncher.java:621)
> at 
> org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.sparkOperToRDD(SparkLauncher.java:552)
> at 
> org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.sparkPlanToRDD(SparkLauncher.java:501)
> at 
> org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.launchPig(SparkLauncher.java:204)
> at 
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:301)
> at org.apache.pig.PigServer.launchPlan(PigServer.java:1390)
> at 
> org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1375)
> at org.apache.pig.PigServer.execute(PigServer.java:1364)
> at org.apache.pig.PigServer.executeBatch(PigServer.java:415)
> at org.apache.pig.PigServer.executeBatch(PigServer.java:398)
> at 
> org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:171)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:234)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
> at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
> at org.apache.pig.Main.run(Main.java:624)
> at org.apache.pig.Main.main(Main.java:170)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4293) Enable unit test "TestNativeMapReduce" for spark

2015-12-03 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15039693#comment-15039693
 ] 

Mohit Sabharwal commented on PIG-4293:
--

Thanks, [~kellyzly]! Left a couple of comments on RB.

> Enable unit test "TestNativeMapReduce" for spark
> 
>
> Key: PIG-4293
> URL: https://issues.apache.org/jira/browse/PIG-4293
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4293.patch, PIG-4293_1.patch, 
> TEST-org.apache.pig.test.TestNativeMapReduce.txt
>
>
> error log is attached



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4621) Enable Illustrate in spark

2015-12-03 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4621?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15039706#comment-15039706
 ] 

Mohit Sabharwal commented on PIG-4621:
--

Thanks, left some comments on RB.  cc [~kellyzly]

> Enable Illustrate in spark
> --
>
> Key: PIG-4621
> URL: https://issues.apache.org/jira/browse/PIG-4621
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: Syed Zulfiqar Ali
> Fix For: spark-branch
>
>
> Current we don't support illustrate in spark mode.
> How illustrate works 
> see:http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#ILLUSTRATE



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4675) FR+Limit case fails when enable MultiQuery because the predecessor information is wrongly calculated in current code.

2015-12-03 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15039755#comment-15039755
 ] 

Mohit Sabharwal commented on PIG-4675:
--

Thanks, [~kellyzly], this looks like a pretty critical issue. It is 
potentially affecting many other query plans, not just FRJoin with Limit, right?

Could you summarize why the predecessor information was getting wrongly 
calculated?

Could you also explain the approach you took to fix it in more detail?

> FR+Limit case fails when enable MultiQuery because the predecessor 
> information is wrongly calculated in current code.
> -
>
> Key: PIG-4675
> URL: https://issues.apache.org/jira/browse/PIG-4675
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Peter Lin
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4675_1.patch, name.txt, ssn.txt, test.pig
>
>
> We are testing the spark branch pig recently with mapr3 and spark 1.5. It 
> turns out if we use more than 1 store command in the pig script will have 
> exception from the second store command. 
>  SSN = load '/test/ssn.txt' using PigStorage() as (ssn:long);
>  SSN_NAME = load '/test/name.txt' using PigStorage() as (ssn:long, 
> name:chararray);
>  X = JOIN SSN by ssn LEFT OUTER, SSN_NAME by ssn USING 'replicated';
>  R1 = limit SSN_NAME 10;
>  store R1 into '/tmp/test1_r1'; 
>  store X into '/tmp/test1_x';
> Exception Details:
> 15/09/11 13:37:00 INFO storage.MemoryStore: ensureFreeSpace(114448) called 
> with curMem=359237, maxMem=503379394
> 15/09/11 13:37:00 INFO storage.MemoryStore: Block broadcast_2 stored as 
> values in memory (estimated size 111.8 KB, free 479.6 MB)
> 15/09/11 13:37:00 INFO storage.MemoryStore: ensureFreeSpace(32569) called 
> with curMem=473685, maxMem=503379394
> 15/09/11 13:37:00 INFO storage.MemoryStore: Block broadcast_2_piece0 stored 
> as bytes in memory (estimated size 31.8 KB, free 479.6 MB)
> 15/09/11 13:37:00 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in 
> memory on 10.51.2.82:55960 (size: 31.8 KB, free: 479.9 MB)
> 15/09/11 13:37:00 INFO spark.SparkContext: Created broadcast 2 from 
> newAPIHadoopRDD at LoadConverter.java:88
> 15/09/11 13:37:00 WARN util.ClosureCleaner: Expected a closure; got 
> org.apache.pig.backend.hadoop.executionengine.spark.converter.LoadConverter$ToTupleFunction
> 15/09/11 13:37:00 INFO spark.SparkLauncher: Converting operator POForEach 
> (Name: SSN: New For Each(false)[bag] - scope-17 Operator Key: scope-17)
> 15/09/11 13:37:00 INFO spark.SparkLauncher: Converting operator POFRJoin 
> (Name: X: FRJoin[tuple] - scope-22 Operator Key: scope-22)
> 15/09/11 13:37:00 ERROR spark.SparkLauncher: throw exception in 
> sparkOperToRDD:
> java.lang.RuntimeException: Should have greater than1 predecessors for class 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFRJoin.
>  Got : 1
> at 
> org.apache.pig.backend.hadoop.executionengine.spark.SparkUtil.assertPredecessorSizeGreaterThan(SparkUtil.java:93)
> at 
> org.apache.pig.backend.hadoop.executionengine.spark.converter.FRJoinConverter.convert(FRJoinConverter.java:55)
> at 
> org.apache.pig.backend.hadoop.executionengine.spark.converter.FRJoinConverter.convert(FRJoinConverter.java:46)
> at 
> org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.physicalToRDD(SparkLauncher.java:633)
> at 
> org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.physicalToRDD(SparkLauncher.java:600)
> at 
> org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.physicalToRDD(SparkLauncher.java:621)
> at 
> org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.sparkOperToRDD(SparkLauncher.java:552)
> at 
> org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.sparkPlanToRDD(SparkLauncher.java:501)
> at 
> org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.launchPig(SparkLauncher.java:204)
> at 
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:301)
> at org.apache.pig.PigServer.launchPlan(PigServer.java:1390)
> at 
> org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1375)
> at org.apache.pig.PigServer.execute(PigServer.java:1364)
> at org.apache.pig.PigServer.executeBatch(PigServer.java:415)
> at org.apache.pig.PigServer.executeBatch(PigServer.java:398)
> at 
> org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:171)
> at 
> org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:234)
> at 
> 

[jira] [Commented] (PIG-4709) Improve performance of GROUPBY operator on Spark

2015-11-30 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15032245#comment-15032245
 ] 

Mohit Sabharwal commented on PIG-4709:
--

Thanks, [~pallavi.rao], will take a look. + [~kellyzly] as well.

> Improve performance of GROUPBY operator on Spark
> 
>
> Key: PIG-4709
> URL: https://issues.apache.org/jira/browse/PIG-4709
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Pallavi Rao
>Assignee: Pallavi Rao
>  Labels: spork
> Fix For: spark-branch
>
> Attachments: PIG-4709.patch
>
>
> Currently, the GROUPBY operator of Pig is mapped to Spark's CoGroup. When the 
> grouped data is consumed by subsequent operations to perform algebraic 
> operations, this is sub-optimal, as there is a lot of shuffle traffic. 
> The Spark plan must be optimized to use reduceBy, where possible, so that a 
> combiner is used.
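
For reference, the difference is easy to demonstrate with the plain Spark Java 
API. The following is a minimal, self-contained sketch in local mode with toy 
data, not the Pig patch itself; note that Spark's actual operator is named 
reduceByKey:
{code}
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class ReduceByKeyDemo {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("reduceByKey-demo").setMaster("local"));
        JavaPairRDD<String, Long> pairs = sc
                .parallelize(Arrays.asList("a", "b", "a", "a"))
                .mapToPair(w -> new Tuple2<>(w, 1L));
        // reduceByKey merges values map-side before the shuffle (the moral
        // equivalent of an MR combiner), so only one partial aggregate per
        // key and partition crosses the network. A CoGroup/groupByKey-based
        // plan ships every record across the network instead.
        JavaPairRDD<String, Long> sums = pairs.reduceByKey((x, y) -> x + y);
        System.out.println(sums.collectAsMap()); // {a=3, b=1}
        sc.stop();
    }
}
{code}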



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4634) Fix records count issues in output statistics

2015-10-30 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14982908#comment-14982908
 ] 

Mohit Sabharwal commented on PIG-4634:
--

Thanks, [~kexianda]! 

+1 (non-binding)

> Fix records count issues in output statistics
> -
>
> Key: PIG-4634
> URL: https://issues.apache.org/jira/browse/PIG-4634
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Xianda Ke
>Assignee: Xianda Ke
> Fix For: spark-branch
>
> Attachments: PIG-4634-3.patch, PIG-4634-4.patch, PIG-4634-5.patch, 
> PIG-4634-6.patch, PIG-4634.patch, PIG-4634_2.patch
>
>
> Test cases simpleTest() and simpleTest2()  in TestPigRunner failed, caused by 
> following issues:
> 1. pig context in SparkPigStats isn't initialized.
> 2. the records count logic hasn't been implemented.
> 3. getOutpugAlias(), getPigProperties(), getBytesWritten() and 
> getRecordWritten() have not been implemented.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4634) Fix records count issues in output statistics

2015-10-26 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14975606#comment-14975606
 ] 

Mohit Sabharwal commented on PIG-4634:
--

Thanks, [~xianda]. I had a couple of code readability nits on RB. Otherwise LGTM.

> Fix records count issues in output statistics
> -
>
> Key: PIG-4634
> URL: https://issues.apache.org/jira/browse/PIG-4634
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Xianda Ke
>Assignee: Xianda Ke
> Fix For: spark-branch
>
> Attachments: PIG-4634-3.patch, PIG-4634-4.patch, PIG-4634-5.patch, 
> PIG-4634.patch, PIG-4634_2.patch
>
>
> Test cases simpleTest() and simpleTest2()  in TestPigRunner failed, caused by 
> following issues:
> 1. pig context in SparkPigStats isn't initialized.
> 2. the records count logic hasn't been implemented.
> 3. getOutpugAlias(), getPigProperties(), getBytesWritten() and 
> getRecordWritten() have not been implemented.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4655) Support InputStats in spark mode

2015-09-04 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731809#comment-14731809
 ] 

Mohit Sabharwal commented on PIG-4655:
--

That's right, it depends on PIG-4634.

> Support InputStats in spark mode
> 
>
> Key: PIG-4655
> URL: https://issues.apache.org/jira/browse/PIG-4655
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Xianda Ke
>Assignee: Xianda Ke
> Fix For: spark-branch
>
> Attachments: PIG-4655-2.patch, PIG-4655-3.patch, PIG-4655.patch
>
>
> Currently, InputStats is not implemented in spark mode. 
> The JUnit case TestPigRunner.testEmptyFileCounter() will fail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4655) Support InputStats in spark mode

2015-09-04 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731605#comment-14731605
 ] 

Mohit Sabharwal commented on PIG-4655:
--

+1 (non-binding)

> Support InputStats in spark mode
> 
>
> Key: PIG-4655
> URL: https://issues.apache.org/jira/browse/PIG-4655
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Xianda Ke
>Assignee: Xianda Ke
> Fix For: spark-branch
>
> Attachments: PIG-4655-2.patch, PIG-4655-3.patch, PIG-4655.patch
>
>
> Currently, InputStats is not implemented in spark mode. 
> The JUnit case TestPigRunner.testEmptyFileCounter() will fail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4634) Fix records count issues in output statistics

2015-09-04 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731603#comment-14731603
 ] 

Mohit Sabharwal commented on PIG-4634:
--

Thanks, [~kexianda], left some comments on RB.

> Fix records count issues in output statistics
> -
>
> Key: PIG-4634
> URL: https://issues.apache.org/jira/browse/PIG-4634
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Xianda Ke
>Assignee: Xianda Ke
> Fix For: spark-branch
>
> Attachments: PIG-4634-3.patch, PIG-4634.patch, PIG-4634_2.patch
>
>
> Test cases simpleTest() and simpleTest2()  in TestPigRunner failed, caused by 
> following issues:
> 1. pig context in SparkPigStats isn't initialized.
> 2. the records count logic hasn't been implemented.
> 3. getOutpugAlias(), getPigProperties(), getBytesWritten() and 
> getRecordWritten() have not been implemented.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4661) Fix UT failures in TestPigServerLocal

2015-08-28 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14720688#comment-14720688
 ] 

Mohit Sabharwal commented on PIG-4661:
--

+1 (non-binding)

Thanks, [~kexianda]

 Fix UT failures in TestPigServerLocal
 -

 Key: PIG-4661
 URL: https://issues.apache.org/jira/browse/PIG-4661
 Project: Pig
  Issue Type: Sub-task
  Components: spark
Reporter: Xianda Ke
Assignee: Xianda Ke
 Fix For: spark-branch

 Attachments: PIG-4661.patch


 testcase 
 org.apache.pig.test.TestPigServerLocal.testSkipParseInRegisterForBatch failed 
 in spark mode



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4634) Fix records count issues in output statistics

2015-08-28 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14720863#comment-14720863
 ] 

Mohit Sabharwal commented on PIG-4634:
--

[~kexianda], could you create an RB request for this, please?

 Fix records count issues in output statistics
 -

 Key: PIG-4634
 URL: https://issues.apache.org/jira/browse/PIG-4634
 Project: Pig
  Issue Type: Sub-task
  Components: spark
Reporter: Xianda Ke
Assignee: Xianda Ke
 Fix For: spark-branch

 Attachments: PIG-4634-3.patch, PIG-4634.patch, PIG-4634_2.patch


 Test cases simpleTest() and simpleTest2()  in TestPigRunner failed, caused by 
 following issues:
 1. pig context in SparkPigStats isn't initialized.
 2. the records count logic hasn't been implemented.
 3. getOutpugAlias(), getPigProperties(), getBytesWritten() and 
 getRecordWritten() have not been implemented.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4655) Support InputStats in spark mode

2015-08-28 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14720861#comment-14720861
 ] 

Mohit Sabharwal commented on PIG-4655:
--

[~kexianda], could you please create an RB request for this?



 Support InputStats in spark mode
 

 Key: PIG-4655
 URL: https://issues.apache.org/jira/browse/PIG-4655
 Project: Pig
  Issue Type: Sub-task
  Components: spark
Reporter: Xianda Ke
Assignee: Xianda Ke
 Fix For: spark-branch

 Attachments: PIG-4655.patch


 Currently, InputStats is not implemented in spark mode. 
 The JUnit case TestPigRunner.testEmptyFileCounter() will fail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4655) Support InputStats in spark mode

2015-08-28 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14720869#comment-14720869
 ] 

Mohit Sabharwal commented on PIG-4655:
--

Please move this to the top of the class for consistency:
{code}
+private String counterGroupName;
+private String counterName;
+private SparkCounters sparkCounters;
{code}

Also, shouldn't addInputInfoForSparkOper be in SparkJobStats for consistency?

 Support InputStats in spark mode
 

 Key: PIG-4655
 URL: https://issues.apache.org/jira/browse/PIG-4655
 Project: Pig
  Issue Type: Sub-task
  Components: spark
Reporter: Xianda Ke
Assignee: Xianda Ke
 Fix For: spark-branch

 Attachments: PIG-4655.patch


 Currently, InputStats is not implemented in spark mode. 
 The JUnit case TestPigRunner.testEmptyFileCounter() will fail.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4659) Fix unit test failures in org.apache.pig.test.TestScriptLanguageJavaScript

2015-08-18 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14702402#comment-14702402
 ] 

Mohit Sabharwal commented on PIG-4659:
--

Thanks, [~kexianda]. 

+1 (non-binding)

 Fix unit test failures in org.apache.pig.test.TestScriptLanguageJavaScript
 --

 Key: PIG-4659
 URL: https://issues.apache.org/jira/browse/PIG-4659
 Project: Pig
  Issue Type: Sub-task
  Components: spark
Reporter: kexianda
Assignee: kexianda
 Fix For: spark-branch

 Attachments: PIG-4659.patch


 Failed testcase: org.apache.pig.test.TestScriptLanguageJavaScript.testTC
 Error Message:
 can't evaluate main: main();
 Stacktrace
 java.lang.RuntimeException: can't evaluate main: main();
   at 
 org.apache.pig.scripting.js.JsScriptEngine.jsEval(JsScriptEngine.java:135)
   at 
 org.apache.pig.scripting.js.JsScriptEngine.main(JsScriptEngine.java:223)
   at org.apache.pig.scripting.ScriptEngine.run(ScriptEngine.java:300)
   at 
 org.apache.pig.test.TestScriptLanguageJavaScript.testTC(TestScriptLanguageJavaScript.java:149)
 Caused by: org.mozilla.javascript.EcmaError: TypeError: Cannot call method 
 getNumberRecords of null



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4645) Support hadoop-like Counter using spark accumulator

2015-08-10 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14681119#comment-14681119
 ] 

Mohit Sabharwal commented on PIG-4645:
--

Thanks, [~kexianda]. LGTM. 

+1 (non-binding)

 Support hadoop-like Counter using spark accumulator
 ---

 Key: PIG-4645
 URL: https://issues.apache.org/jira/browse/PIG-4645
 Project: Pig
  Issue Type: Sub-task
  Components: spark
Reporter: kexianda
Assignee: kexianda
 Fix For: spark-branch

 Attachments: PIG-4645.patch


 Pig collects Input/Output statistics via Counters in MR/Tez mode; we need 
 to support this using a Spark accumulator. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4645) Support hadoop-like Counter using spark accumulator

2015-08-07 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14662738#comment-14662738
 ] 

Mohit Sabharwal commented on PIG-4645:
--

Thanks, [~kexianda], I was wondering if we could use the built-in 
[LongAccumulatorParam|http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.AccumulatorParam$$LongAccumulatorParam$]?

But it looks like there are issues with using it, according to 
[this|http://apache-spark-user-list.1001560.n3.nabble.com/How-in-Java-do-I-create-an-Accumulator-of-type-Long-td18779.html]
 thread. I assume that is why you implemented LongAccumulatorParam?
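
For reference, such a param is only a few lines of Java. A minimal sketch 
against Spark 1.x's AccumulatorParam interface -- illustrative, not necessarily 
the code in the patch:
{code}
import org.apache.spark.AccumulatorParam;

public class LongAccumulatorParamSketch implements AccumulatorParam<Long> {
    @Override
    public Long addAccumulator(Long t1, Long t2) {
        return t1 + t2; // fold one new value into the running total
    }

    @Override
    public Long addInPlace(Long r1, Long r2) {
        return r1 + r2; // merge partial totals coming from different tasks
    }

    @Override
    public Long zero(Long initialValue) {
        return 0L; // identity element for the sum
    }
}
{code}
It would then be registered with something like 
{{sc.accumulator(0L, new LongAccumulatorParamSketch())}}.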



 Support hadoop-like Counter using spark accumulator
 ---

 Key: PIG-4645
 URL: https://issues.apache.org/jira/browse/PIG-4645
 Project: Pig
  Issue Type: Sub-task
  Components: spark
Reporter: kexianda
Assignee: kexianda
 Fix For: spark-branch

 Attachments: PIG-4645.patch


 Pig collects Input/Output statistics via Counters in MR/Tez mode; we need 
 to support this using a Spark accumulator. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4594) Enable TestMultiQuery in spark mode

2015-07-13 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14625707#comment-14625707
 ] 

Mohit Sabharwal commented on PIG-4594:
--

The general approach here seems reasonable to me and is in line with what is 
being done for Tez and MR.
I'm not sure about the need for the forceConnect and connect methods, though... 
[~kellyzly], why don't we see the "This operator does not support multiple 
outputs" exception with Tez or MR (when we merge operators for those engines)? 
That wasn't clear to me.

+1 (non-binding) on this patch. We can address any changes in future patches 
-- since those don't seem like blockers in making progress on this feature.


 Enable TestMultiQuery in spark mode
 -

 Key: PIG-4594
 URL: https://issues.apache.org/jira/browse/PIG-4594
 Project: Pig
  Issue Type: Sub-task
  Components: spark
Reporter: liyunzhang_intel
Assignee: liyunzhang_intel
 Fix For: spark-branch

 Attachments: PIG-4594.patch, PIG-4594_1.patch, PIG-4594_2.patch


 in https://builds.apache.org/job/Pig-spark/211/#showFailuresLink, it shows that 
 the following unit tests fail:
 org.apache.pig.test.TestMultiQuery.testMultiQueryJiraPig1068
 org.apache.pig.test.TestMultiQuery.testMultiQueryJiraPig1157
 org.apache.pig.test.TestMultiQuery.testMultiQueryJiraPig1252
 org.apache.pig.test.TestMultiQuery.testMultiQueryJiraPig1438



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4633) Update hadoop version to enable Spark output statistics

2015-07-13 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14625712#comment-14625712
 ] 

Mohit Sabharwal commented on PIG-4633:
--

Thanks, [~kexianda], +1 (non-binding).

Could you please paste the exception you saw on this JIRA? Thanks!

 Update hadoop version to enable Spark output statistics
 ---

 Key: PIG-4633
 URL: https://issues.apache.org/jira/browse/PIG-4633
 Project: Pig
  Issue Type: Sub-task
  Components: spark
Reporter: kexianda
Assignee: kexianda
 Fix For: spark-branch

 Attachments: PIG-4633.patch


 Spark support output statistics from 1.3.0 ([SPARK-3179. Add task 
 OutputMetrics|https://issues.apache.org/jira/browse/SPARK-3179])
 {code:title=SparkHadoopUtil.scala|borderStyle=solid}
 stats.map(Utils.invoke(classOf[Statistics], _, getThreadStatistics))
 {code}
 Spark invoke hadoop's function getThreadStatistics. But, this method was 
 added into hadoop from version 2.5.0 
 ([HADOOP-10688|https://issues.apache.org/jira/browse/HADOOP-10688])
 The version of hadoop in ivy/libraries.properties should be 2.5.0 +
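
A quick way to confirm which side of that version boundary a given Hadoop jar 
falls on is to probe for the method Spark invokes reflectively. This is a 
sketch, assuming the getThreadStatistics signature added by HADOOP-10688:
{code}
import java.lang.reflect.Method;

import org.apache.hadoop.fs.FileSystem;

public class ThreadStatsProbe {
    public static void main(String[] args) throws Exception {
        // Throws NoSuchMethodException on Hadoop versions older than 2.5.0,
        // which is exactly the case where Spark's output metrics stay empty.
        Method m = FileSystem.Statistics.class.getMethod("getThreadStatistics");
        System.out.println("found: " + m);
    }
}
{code}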



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4633) Update hadoop version to enable Spark output statistics

2015-07-13 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14625669#comment-14625669
 ] 

Mohit Sabharwal commented on PIG-4633:
--

Thanks, [~kexianda]. Just curious - how did you discover this? Was there an 
exception in the log ... or was some unit test failing?

 Update hadoop version to enable Spark output statistics
 ---

 Key: PIG-4633
 URL: https://issues.apache.org/jira/browse/PIG-4633
 Project: Pig
  Issue Type: Sub-task
  Components: spark
Reporter: kexianda
Assignee: kexianda
 Fix For: spark-branch

 Attachments: PIG-4633.patch


 Spark support output statistics from 1.3.0 ([SPARK-3179. Add task 
 OutputMetrics|https://issues.apache.org/jira/browse/SPARK-3179])
 {code:title=SparkHadoopUtil.scala|borderStyle=solid}
 stats.map(Utils.invoke(classOf[Statistics], _, getThreadStatistics))
 {code}
 Spark invoke hadoop's function getThreadStatistics. But, this method was 
 added into hadoop from version 2.5.0 
 ([HADOOP-10688|https://issues.apache.org/jira/browse/HADOOP-10688])
 The version of hadoop in ivy/libraries.properties should be 2.5.0 +



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4633) Update hadoop version to enable Spark output statistics

2015-07-13 Thread Mohit Sabharwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mohit Sabharwal updated PIG-4633:
-
Summary: Update hadoop version to enable Spark output statistics  (was: fix 
libaray version to enable output statistics for Pig on spark)

 Update hadoop version to enable Spark output statistics
 ---

 Key: PIG-4633
 URL: https://issues.apache.org/jira/browse/PIG-4633
 Project: Pig
  Issue Type: Sub-task
  Components: spark
Reporter: kexianda
Assignee: kexianda
 Fix For: spark-branch

 Attachments: PIG-4633.patch


 Spark support output statistics from 1.3.0 ([SPARK-3179. Add task 
 OutputMetrics|https://issues.apache.org/jira/browse/SPARK-3179])
 {code:title=SparkHadoopUtil.scala|borderStyle=solid}
 stats.map(Utils.invoke(classOf[Statistics], _, getThreadStatistics))
 {code}
 Spark invoke hadoop's function getThreadStatistics. But, this method was 
 added into hadoop from version 2.5.0 
 ([HADOOP-10688|https://issues.apache.org/jira/browse/HADOOP-10688])
 The version of hadoop in ivy/libraries.properties should be 2.5.0 +



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4611) Fix remaining unit test failures about TestHBaseStorage

2015-07-07 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14616247#comment-14616247
 ] 

Mohit Sabharwal commented on PIG-4611:
--

Thanks, [~kellyzly]. 

One more suggestion: should we make your HBaseStorage change conditional on the 
execution engine? I.e., do the null check only for the Spark engine. That way, 
we are not altering current MR engine behavior in any way.
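
A rough sketch of that engine-conditional check -- the class, method, and 
engine-detection flag below are illustrative assumptions, not the actual patch:
{code}
import java.util.Properties;

import org.apache.pig.impl.util.UDFContext;

public class CasterFallbackSketch {
    private static final String STRING_CASTER = "Utf8StorageConverter";

    // Resolve the caster spec, guarding against an unpopulated UDFContext.
    static String resolveCaster(boolean onSparkEngine) {
        Properties props = UDFContext.getUDFContext().getClientSystemProps();
        if (onSparkEngine && props == null) {
            // Spark executors have no MR-style setup() phase, so the
            // UDFContext may not be deserialized yet at this point; fall
            // back to the default caster rather than dereferencing null.
            return STRING_CASTER;
        }
        // MR and Tez always populate the UDFContext before reaching here.
        return props.getProperty("pig.hbase.caster", STRING_CASTER);
    }
}
{code}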

 Fix remaining unit test failures about TestHBaseStorage
 -

 Key: PIG-4611
 URL: https://issues.apache.org/jira/browse/PIG-4611
 Project: Pig
  Issue Type: Sub-task
  Components: spark
Reporter: liyunzhang_intel
Assignee: liyunzhang_intel
 Fix For: spark-branch

 Attachments: PIG-4611.patch, PIG-4611_2.patch


 In https://builds.apache.org/job/Pig-spark/lastCompletedBuild/testReport/, it 
 shows following unit test failures about TestHBaseStorage:
  org.apache.pig.test.TestHBaseStorage.testStoreToHBase_1_with_delete  
  org.apache.pig.test.TestHBaseStorage.testLoadWithProjection_1
  org.apache.pig.test.TestHBaseStorage.testLoadWithProjection_2
  org.apache.pig.test.TestHBaseStorage.testStoreToHBase_2_with_projection
  org.apache.pig.test.TestHBaseStorage.testCollectedGroup  
  org.apache.pig.test.TestHBaseStorage.testHeterogeneousScans



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4611) Fix remaining unit test failures about TestHBaseStorage

2015-07-06 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615915#comment-14615915
 ] 

Mohit Sabharwal commented on PIG-4611:
--

Thanks for the explanation and addressing this issue, [~kellyzly]!!!

Let me know if I understand this correctly:

1) Spark Executor will serialize all objects referenced in supplied closures. 
Since UDFContext.getUDFContext() is not initialized (because Spark does not 
expose a setup() interface like MR), we always default defaultCaster to 
STRING_CASTER.

2) However, later on, in the *same* Executor thread, the record reader creation 
will correctly deserialize the UDFContext from the JobConf 
(PigInputFormatSpark.createRecordReader -> PigInputFormat.createRecordReader -> 
MapRedUtil.setupUDFContext -> UDFContext.deserialize).

3) Next, in the same Executor thread, when HBaseStorage is initialized by the 
load function, it will find a correctly populated UDFContext.

This sounds reasonable to me. Since this is a core change, could you please add 
comments to HBaseStorage.java explaining why we are handling this as a special 
case for Spark?


I assume it is a typo, but the -Dexectype argument needs to be {{spark}}, not 
{{TestHBaseStorage}}, when running TestHBaseStorage:
{code}
ant test -Dhadoopversion=23 -Dtestcase=TestHBaseStorage -Dexectype=spark 
-DdebugPort=
{code}

 Fix remaining unit test failures about TestHBaseStorage
 -

 Key: PIG-4611
 URL: https://issues.apache.org/jira/browse/PIG-4611
 Project: Pig
  Issue Type: Sub-task
  Components: spark
Reporter: liyunzhang_intel
Assignee: liyunzhang_intel
 Fix For: spark-branch

 Attachments: PIG-4611.patch


 In https://builds.apache.org/job/Pig-spark/lastCompletedBuild/testReport/, it 
 shows following unit test failures about TestHBaseStorage:
  org.apache.pig.test.TestHBaseStorage.testStoreToHBase_1_with_delete  
  org.apache.pig.test.TestHBaseStorage.testLoadWithProjection_1
  org.apache.pig.test.TestHBaseStorage.testLoadWithProjection_2
  org.apache.pig.test.TestHBaseStorage.testStoreToHBase_2_with_projection
  org.apache.pig.test.TestHBaseStorage.testCollectedGroup  
  org.apache.pig.test.TestHBaseStorage.testHeterogeneousScans



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4622) Skip TestCubeOperator.testIllustrate and TestMultiQueryLocal.testMultiQueryWithIllustrate

2015-07-06 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14615874#comment-14615874
 ] 

Mohit Sabharwal commented on PIG-4622:
--

Thanks, [~kellyzly].

+1 (non-binding)

 Skip TestCubeOperator.testIllustrate and 
 TestMultiQueryLocal.testMultiQueryWithIllustrate
 -

 Key: PIG-4622
 URL: https://issues.apache.org/jira/browse/PIG-4622
 Project: Pig
  Issue Type: Sub-task
  Components: spark
Reporter: liyunzhang_intel
Assignee: liyunzhang_intel
 Fix For: spark-branch

 Attachments: PIG-4622.patch


 In https://builds.apache.org/job/Pig-spark/236/#showFailuresLink, it shows that 
 the following two unit tests fail:
 TestCubeOperator.testIllustrate and 
 TestMultiQueryLocal.testMultiQueryWithIllustrate
 This is because we currently don't support illustrate in spark mode (see PIG-4621).
 Why do these two unit tests fail after PIG-4614_1.patch was merged to the branch?
 in PIG-4614_1.patch, we edit [SparkExecutionEngine 
 #instantiateScriptState|https://github.com/apache/pig/blob/a0bea12c3d5600a4c3137a8d05c054d10430b1ce/src/org/apache/pig/backend/hadoop/executionengine/spark/SparkExecutionEngine.java#L37].
  When running the following script with illustrate:
 illustrate.pig
 {code}
 a = load 'test/org/apache/pig/test/data/passwd' using PigStorage(':') as 
 (uname:chararray, passwd:chararray, uid:int,gid:int);
 b = filter a by uid > 5;
 illustrate b;
 store b into './testMultiQueryWithIllustrate.out';
 {code}
 the exception is thrown out at 
 [MRScriptState.get|https://github.com/apache/pig/blob/spark/src/org/apache/pig/tools/pigstats/mapreduce/MRScriptState.java#L67]:java.lang.ClassCastException:
  org.apache.pig.tools.pigstats.spark.SparkScriptState cannot be cast to 
 org.apache.pig.tools.pigstats.mapreduce.MRScriptState.
 stacktrace:
 {code}
   java.lang.ClassCastException: 
 org.apache.pig.tools.pigstats.spark.SparkScriptState cannot be cast to 
 org.apache.pig.tools.pigstats.mapreduce.MRScriptState
 at 
 org.apache.pig.tools.pigstats.mapreduce.MRScriptState.get(MRScriptState.java:67)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.getJob(JobControlCompiler.java:512)
 at 
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler.compile(JobControlCompiler.java:327)
 at 
 org.apache.pig.pen.LocalMapReduceSimulator.launchPig(LocalMapReduceSimulator.java:110)
 at 
 org.apache.pig.pen.ExampleGenerator.getData(ExampleGenerator.java:259)
 at 
 org.apache.pig.pen.ExampleGenerator.readBaseData(ExampleGenerator.java:223)
 at 
 org.apache.pig.pen.ExampleGenerator.getExamples(ExampleGenerator.java:155)
 at org.apache.pig.PigServer.getExamples(PigServer.java:1305)
 at 
 org.apache.pig.tools.grunt.GruntParser.processIllustrate(GruntParser.java:812)
 at 
 org.apache.pig.tools.pigscript.parser.PigScriptParser.Illustrate(PigScriptParser.java:818)
 at 
 org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:385)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:230)
 at 
 org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:205)
 at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
 at org.apache.pig.Main.run(Main.java:624)
 at org.apache.pig.Main.main(Main.java:170)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke0(NativeMethodAccessorImpl.java:-1)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:606)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4619) Cleanup: change the indent size of some files of pig on spark project from 2 to 4 space

2015-07-03 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14613115#comment-14613115
 ] 

Mohit Sabharwal commented on PIG-4619:
--

Thanks, [~kellyzly]

+1 (non-binding).

 Cleanup: change the indent size of some files of pig on spark project from 2 
 to 4 space
 ---

 Key: PIG-4619
 URL: https://issues.apache.org/jira/browse/PIG-4619
 Project: Pig
  Issue Type: Sub-task
  Components: spark
Reporter: liyunzhang_intel
Assignee: liyunzhang_intel
 Fix For: spark-branch

 Attachments: PIG-4619.patch, indentSize.png


 The following files under the pig on spark project use 2-space indent:
 org.apache.pig.backend.hadoop.executionengine.spark.converter.CollectedGroupConverter
 org.apache.pig.backend.hadoop.executionengine.spark.JobMetricsListener
 org.apache.pig.backend.hadoop.executionengine.spark.SparkLocalExecType
 Now all the files under this project should use 4-space indent.
 Besides, SparkLauncher.java uses tabs instead of spaces. We don't use tabs in 
 any file in this project, so this file needs to be changed as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4613) Fix unit test failures about TestAssert

2015-07-03 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14613113#comment-14613113
 ] 

Mohit Sabharwal commented on PIG-4613:
--

Thanks, [~kexianda], [~kellyzly], LGTM

+1 (non-binding)

 Fix unit test failures about TestAssert
 ---

 Key: PIG-4613
 URL: https://issues.apache.org/jira/browse/PIG-4613
 Project: Pig
  Issue Type: Sub-task
  Components: spark
Reporter: kexianda
Assignee: kexianda
 Fix For: spark-branch

 Attachments: PIG-4613.patch


 UT failed at following cases:
 org.apache.pig.test.TestAssert.testNegativeWithoutFetch
 org.apache.pig.test.TestAssert.testNegative



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4613) Fix unit test failures about TestAssert

2015-07-01 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611392#comment-14611392
 ] 

Mohit Sabharwal commented on PIG-4613:
--

Thanks, [~kexianda]. LGTM.

Just to be safe, do you think we should check for the different error message 
conditioned on the Spark engine?

I.e., expect "Job terminated with anomalous status FAILED" for non-Spark, 
and expect "i should be greater than 1" for Spark.

That way, we're not changing the testcase for MR and Tez...
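
A minimal sketch of that branching (the {{test.exec.type}} system property used 
here to detect the Spark engine is an assumption, not necessarily what the 
testcase actually reads):
{code}
import static org.junit.Assert.assertTrue;

public class AssertMessageSketch {
    // Check the failure message expected from ASSERT, per execution engine.
    static void checkFailureMessage(String actualMessage) {
        boolean onSpark = "spark".equalsIgnoreCase(
                System.getProperty("test.exec.type"));
        String expected = onSpark
                ? "i should be greater than 1"                   // Spark
                : "Job terminated with anomalous status FAILED"; // MR / Tez
        assertTrue(actualMessage.contains(expected));
    }
}
{code}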

 Fix unit test failures about TestAssert
 ---

 Key: PIG-4613
 URL: https://issues.apache.org/jira/browse/PIG-4613
 Project: Pig
  Issue Type: Sub-task
  Components: spark
Reporter: kexianda
Assignee: kexianda
 Fix For: spark-branch

 Attachments: PIG-4613.patch


 UT failed at following cases:
 org.apache.pig.test.TestAssert.testNegativeWithoutFetch
 org.apache.pig.test.TestAssert.testNegative



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4613) Fix unit test failures about TestAssert

2015-07-01 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611428#comment-14611428
 ] 

Mohit Sabharwal commented on PIG-4613:
--

My vote is for 2), since the Spark engine gives more info about the underlying 
problem.

 Fix unit test failures about TestAssert
 ---

 Key: PIG-4613
 URL: https://issues.apache.org/jira/browse/PIG-4613
 Project: Pig
  Issue Type: Sub-task
  Components: spark
Reporter: kexianda
Assignee: kexianda
 Fix For: spark-branch

 Attachments: PIG-4613.patch


 UT failed at following cases:
 org.apache.pig.test.TestAssert.testNegativeWithoutFetch
 org.apache.pig.test.TestAssert.testNegative



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4594) Enable TestMultiQuery in spark mode

2015-07-01 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611416#comment-14611416
 ] 

Mohit Sabharwal commented on PIG-4594:
--

Thanks, [~kellyzly]. Could you give more details about why you need to add the 
forceConnect method to PhysicalPlan and OperatorPlan?

 Enable TestMultiQuery in spark mode
 -

 Key: PIG-4594
 URL: https://issues.apache.org/jira/browse/PIG-4594
 Project: Pig
  Issue Type: Sub-task
  Components: spark
Reporter: liyunzhang_intel
Assignee: liyunzhang_intel
 Fix For: spark-branch

 Attachments: PIG-4594.patch, PIG-4594_1.patch


 in https://builds.apache.org/job/Pig-spark/211/#showFailuresLink, it shows that 
 the following unit tests fail:
 org.apache.pig.test.TestMultiQuery.testMultiQueryJiraPig1068
 org.apache.pig.test.TestMultiQuery.testMultiQueryJiraPig1157
 org.apache.pig.test.TestMultiQuery.testMultiQueryJiraPig1252
 org.apache.pig.test.TestMultiQuery.testMultiQueryJiraPig1438



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4611) Fix remaining unit test failures about TestHBaseStorage

2015-07-01 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14611367#comment-14611367
 ] 

Mohit Sabharwal commented on PIG-4611:
--

Thanks, [~kellyzly], this looks like a reasonable workaround to the UDFContext 
issue, where it is not initialized in Spark executor threads.

However, I'm not sure whether it is the right thing to do in the case where 
pig.hbase.caster is set by the user.

I.e., for the Spark engine, with your workaround, HBaseStorage will always use 
the default caster (i.e. Utf8StorageConverter). It will never use 
HBaseBinaryConverter or any other option.



 Fix remaining unit test failures about TestHBaseStorage
 -

 Key: PIG-4611
 URL: https://issues.apache.org/jira/browse/PIG-4611
 Project: Pig
  Issue Type: Sub-task
  Components: spark
Reporter: liyunzhang_intel
Assignee: liyunzhang_intel
 Fix For: spark-branch

 Attachments: PIG-4611.patch


 In https://builds.apache.org/job/Pig-spark/lastCompletedBuild/testReport/, it 
 shows following unit test failures about TestHBaseStorage:
  org.apache.pig.test.TestHBaseStorage.testStoreToHBase_1_with_delete  
  org.apache.pig.test.TestHBaseStorage.testLoadWithProjection_1
  org.apache.pig.test.TestHBaseStorage.testLoadWithProjection_2
  org.apache.pig.test.TestHBaseStorage.testStoreToHBase_2_with_projection
  org.apache.pig.test.TestHBaseStorage.testCollectedGroup  
  org.apache.pig.test.TestHBaseStorage.testHeterogeneousScans



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4614) Enable TestLocationInPhysicalPlan in spark mode

2015-06-30 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14609567#comment-14609567
 ] 

Mohit Sabharwal commented on PIG-4614:
--

Thanks, [~kellyzly]!

+1 (non-binding)

 Enable TestLocationInPhysicalPlan in spark mode
 -

 Key: PIG-4614
 URL: https://issues.apache.org/jira/browse/PIG-4614
 Project: Pig
  Issue Type: Sub-task
  Components: spark
Reporter: liyunzhang_intel
Assignee: liyunzhang_intel
 Fix For: spark-branch

 Attachments: PIG-4614.patch, PIG-4614_1.patch


 in https://builds.apache.org/job/Pig-spark/228/#showFailuresLink, it shows 
 following unit test fails:
 org.apache.pig.newplan.logical.relational.TestLocationInPhysicalPlan.test
 expected:M: A[1,4],A[3,4],B[2,4] C: A[3,4],B[2,4] R: A[3,4] but was:null



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4059) Pig on Spark

2015-06-30 Thread Mohit Sabharwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mohit Sabharwal updated PIG-4059:
-
Attachment: Pig-on-Spark-Scope.pdf

 Pig on Spark
 

 Key: PIG-4059
 URL: https://issues.apache.org/jira/browse/PIG-4059
 Project: Pig
  Issue Type: New Feature
  Components: spark
Reporter: Rohini Palaniswamy
Assignee: Praveen Rachabattuni
  Labels: spork
 Fix For: spark-branch

 Attachments: Pig-on-Spark-Design-Doc.pdf, Pig-on-Spark-Scope.pdf


 Setting up your development environment:
 1. Check out Pig Spark branch.
 2. Build Pig by running ant jar and ant -Dhadoopversion=23 jar for 
 hadoop-2.x versions
 3. Configure these environmental variables:
 export HADOOP_USER_CLASSPATH_FIRST=true
 export SPARK_MASTER=local
 4. Run Pig with -x spark option.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4615) Fix null keys join in SkewedJoin in spark mode

2015-06-29 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14606893#comment-14606893
 ] 

Mohit Sabharwal commented on PIG-4615:
--

Thanks, [~kellyzly]! LGTM.

+1 (non-binding)

 Fix null keys join in SkewedJoin in spark mode
 --

 Key: PIG-4615
 URL: https://issues.apache.org/jira/browse/PIG-4615
 Project: Pig
  Issue Type: Sub-task
  Components: spark
Reporter: liyunzhang_intel
Assignee: liyunzhang_intel
 Fix For: spark-branch

 Attachments: PIG-4615.patch


 Let's use an example to explain the problem:
 testSkewedJoinNullKeys.pig:
 {code}
 A = LOAD './SkewedJoinInput5.txt' as (id,name);
 B = LOAD './SkewedJoinInput5.txt' as (id,name);
 C = join A by id, B by id using 'skewed';
 store C into './testSkewedJoinNullKeys.out';
 {code}
 cat SkewedJoinInput5.txt 
 {code}
   apple1
   apple1
   apple1
   apple1
   apple1
   apple1
   apple1
   apple1
   apple1
   apple1
 100   apple2
   orange1
   orange1
   orange1
   orange1
   orange1
   orange1
   orange1
   orange1
   orange1
   orange1
 100
 {code}
 the result of mr:
 {code}
 100   apple2  100 apple2
 100   apple2  100 
 100   100 apple2
 100   100 
 {code}
 The result of spark:
 {code}
 cat testSkewedJoinNullKeys.out.spark/part-r-0 
 100   apple2  100 apple2
 100   apple2  100 
 100   100 apple2
 100   100 
   apple1  apple1
   apple1  apple1
   apple1  apple1
   apple1  apple1
   apple1  apple1
   apple1  apple1
   apple1  apple1
   apple1  apple1
   apple1  apple1
   apple1  apple1
   apple1  orange1
   apple1  orange1
   apple1  orange1
   apple1  orange1
   apple1  orange1
   apple1  orange1
   apple1  orange1
   apple1  orange1
   apple1  orange1
   apple1  orange1
   apple1  
   apple1  apple1
   apple1  apple1
   apple1  apple1
   apple1  apple1
   apple1  apple1
   apple1  apple1
   apple1  apple1
   apple1  apple1
   apple1  apple1
   apple1  apple1
   apple1  orange1
   apple1  orange1
   apple1  orange1
   apple1  orange1
   apple1  orange1
   apple1  orange1
   apple1  orange1
   apple1  orange1
   apple1  orange1
   apple1  orange1
   apple1  
   apple1  apple1
   apple1  apple1
   apple1  apple1
   apple1  apple1
   apple1  apple1
   apple1  apple1
   apple1  apple1
   apple1  apple1
   apple1  apple1
   apple1  apple1
   apple1  orange1
   apple1  orange1
   apple1  orange1
   apple1  orange1
   apple1  orange1
   apple1  orange1
   apple1  orange1
   apple1  orange1
   apple1  orange1
   apple1  orange1
   apple1  
   apple1  apple1
   apple1  apple1
   apple1  apple1
   apple1  apple1
   apple1  apple1
   apple1  apple1
   apple1  apple1
   apple1  apple1
   apple1  apple1
   apple1  apple1
   apple1  orange1
   apple1  orange1
   apple1  orange1
   apple1  orange1
   apple1  orange1
   apple1  orange1
   apple1  orange1
   apple1  orange1
   apple1  orange1
   apple1  orange1
   apple1  
   apple1  apple1
   apple1  apple1
   apple1  apple1
   apple1  apple1
   apple1  apple1
   apple1  apple1
   apple1  apple1
   apple1  apple1
   apple1  apple1
   apple1  apple1
   apple1  orange1
   apple1  orange1
   apple1  orange1
   apple1  orange1
   apple1  orange1
   apple1  orange1
   apple1  orange1
   apple1  orange1
   apple1  orange1
   apple1  orange1
   apple1  
   apple1  apple1
   apple1  apple1
   apple1  apple1
   apple1  apple1
   apple1 

[jira] [Commented] (PIG-4607) Enable TestRank1,TestRank3 unit tests in spark mode

2015-06-29 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14606887#comment-14606887
 ] 

Mohit Sabharwal commented on PIG-4607:
--

Thanks, [~kexianda]!

+1 (non-binding)

 Enable TestRank1,TestRank3 unit tests in spark mode
 ---

 Key: PIG-4607
 URL: https://issues.apache.org/jira/browse/PIG-4607
 Project: Pig
  Issue Type: Sub-task
  Components: spark
Reporter: liyunzhang_intel
Assignee: kexianda
 Fix For: spark-branch

 Attachments: PIG-4607.patch


  In https://builds.apache.org/job/Pig-spark/216/#showFailuresLink, unit tests 
 about TestRank1, TestRank3:
 org.apache.pig.test.TestRank1.testRank02RowNumber
 org.apache.pig.test.TestRank1.testRank01RowNumber
 org.apache.pig.test.TestRank3.testRankWithSplitInMap
 org.apache.pig.test.TestRank3.testRankWithSplitInReduce
 org.apache.pig.test.TestRank3.testRankCascade



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4614) Enable TestLocationInPhysicalPlan in spark mode

2015-06-29 Thread Mohit Sabharwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mohit Sabharwal updated PIG-4614:
-
Description: 
in https://builds.apache.org/job/Pig-spark/228/#showFailuresLink, it shows 
following unit test fails:
org.apache.pig.newplan.logical.relational.TestLocationInPhysicalPlan.test

expected:M: A[1,4],A[3,4],B[2,4] C: A[3,4],B[2,4] R: A[3,4] but was:null


  was:
in https://builds.apache.org/job/Pig-spark/228/#showFailuresLink, it shows 
following unit test fails:
org.apache.pig.newplan.logical.relational.TestLocationInPhysicalPlan.test


 Enable TestLocationInPhysicalPlan in spark mode
 -

 Key: PIG-4614
 URL: https://issues.apache.org/jira/browse/PIG-4614
 Project: Pig
  Issue Type: Sub-task
  Components: spark
Reporter: liyunzhang_intel
Assignee: liyunzhang_intel
 Fix For: spark-branch

 Attachments: PIG-4614.patch


 in https://builds.apache.org/job/Pig-spark/228/#showFailuresLink, the following unit test fails:
 org.apache.pig.newplan.logical.relational.TestLocationInPhysicalPlan.test
 expected:<M: A[1,4],A[3,4],B[2,4] C: A[3,4],B[2,4] R: A[3,4]> but was:<null>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4614) Enable TestLocationInPhysicalPlan in spark mode

2015-06-29 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14606955#comment-14606955
 ] 

Mohit Sabharwal commented on PIG-4614:
--

Thanks, [~kellyzly], I had a question on review board. 

 Enable TestLocationInPhysicalPlan in spark mode
 -

 Key: PIG-4614
 URL: https://issues.apache.org/jira/browse/PIG-4614
 Project: Pig
  Issue Type: Sub-task
  Components: spark
Reporter: liyunzhang_intel
Assignee: liyunzhang_intel
 Fix For: spark-branch

 Attachments: PIG-4614.patch


 in https://builds.apache.org/job/Pig-spark/228/#showFailuresLink, the following unit test fails:
 org.apache.pig.newplan.logical.relational.TestLocationInPhysicalPlan.test
 expected:<M: A[1,4],A[3,4],B[2,4] C: A[3,4],B[2,4] R: A[3,4]> but was:<null>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4594) Enable TestMultiQuery in spark mode

2015-06-27 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604546#comment-14604546
 ] 

Mohit Sabharwal commented on PIG-4594:
--

Thanks, [~kellyzly]!  

In case 3 above (multiple splitees), looks like we could use {{RDD.cache()}} to 
cache the output of {{b}} in your example.

Because, otherwise, since each Store corresponds to a Spark action, the entire 
RDD lineage will be computed twice, once for each Store.
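
A minimal, hypothetical sketch of the idea (the names and the doubled saveAsTextFile() calls are illustrative only, not Pig's converter code):

{code}
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class CacheSpliteeSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("cache-splitee").setMaster("local"));
        // "b" stands in for the splitee that both Store branches consume.
        JavaRDD<String> b = sc.parallelize(Arrays.asList("x", "y", "z"))
                .map(String::toUpperCase);
        b.cache(); // without this, b's lineage is recomputed once per action
        b.saveAsTextFile("/tmp/store1.out"); // first Store -> first action
        b.saveAsTextFile("/tmp/store2.out"); // second Store -> second action
        sc.stop();
    }
}
{code}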

 Enable TestMultiQuery in spark mode
 -

 Key: PIG-4594
 URL: https://issues.apache.org/jira/browse/PIG-4594
 Project: Pig
  Issue Type: Sub-task
  Components: spark
Reporter: liyunzhang_intel
Assignee: liyunzhang_intel
 Fix For: spark-branch

 Attachments: PIG-4594.patch, PIG-4594_1.patch


 in https://builds.apache.org/job/Pig-spark/211/#showFailuresLink, it shows that the following unit tests fail:
 org.apache.pig.test.TestMultiQuery.testMultiQueryJiraPig1068
 org.apache.pig.test.TestMultiQuery.testMultiQueryJiraPig1157
 org.apache.pig.test.TestMultiQuery.testMultiQueryJiraPig1252
 org.apache.pig.test.TestMultiQuery.testMultiQueryJiraPig1438



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4607) Enable TestRank1,TestRank3 unit tests in spark mode

2015-06-26 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14603803#comment-14603803
 ] 

Mohit Sabharwal commented on PIG-4607:
--

Thanks for the explanation, [~kexianda]!  And thanks for fixing the 
verifyExpected bug!

Code LGTM. I have a minor comment to preserve consistency, since we are changing 
non-Spark-related code: other Pig test cases that use 
{{checkQueryOutputsAfterSort}} follow this pattern:
{code}
List<Tuple> expectedResults = Util.getTuplesFromConstantTupleStrings(
        new String[] {
                "((1,'a'),(1,'b'))",
                "((2,'aa'),(2,'bb'))"
        });
Util.checkQueryOutputsAfterSort(it, expectedResults);
{code}

For consistency, we should use {{Util.getTuplesFromConstantTupleStrings}} 
instead of creating a Tuple[] and then converting it to a List.

 Enable TestRank1,TestRank3 unit tests in spark mode
 ---

 Key: PIG-4607
 URL: https://issues.apache.org/jira/browse/PIG-4607
 Project: Pig
  Issue Type: Sub-task
  Components: spark
Reporter: liyunzhang_intel
Assignee: kexianda
 Fix For: spark-branch

 Attachments: PIG-4607.patch


  In https://builds.apache.org/job/Pig-spark/216/#showFailuresLink, the following unit tests in TestRank1 and TestRank3 fail:
 org.apache.pig.test.TestRank1.testRank02RowNumber
 org.apache.pig.test.TestRank1.testRank01RowNumber
 org.apache.pig.test.TestRank3.testRankWithSplitInMap
 org.apache.pig.test.TestRank3.testRankWithSplitInReduce
 org.apache.pig.test.TestRank3.testRankCascade



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4610) Enable "TestOrcStorage" unit test in spark mode

2015-06-22 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14597045#comment-14597045
 ] 

Mohit Sabharwal commented on PIG-4610:
--

+1 (non-binding)

 Enable "TestOrcStorage" unit test in spark mode
 ---

 Key: PIG-4610
 URL: https://issues.apache.org/jira/browse/PIG-4610
 Project: Pig
  Issue Type: Sub-task
  Components: spark
Reporter: liyunzhang_intel
Assignee: liyunzhang_intel
 Fix For: spark-branch

 Attachments: PIG-4610.patch


 In https://builds.apache.org/job/Pig-spark/222/#showFailuresLink, it shows the following unit test failures in TestOrcStorage:
 org.apache.pig.builtin.TestOrcStorage.testJoinWithPruning
 org.apache.pig.builtin.TestOrcStorage.testLoadStoreMoreDataType
 org.apache.pig.builtin.TestOrcStorage.testMultiStore



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4607) Enable TestRank1,TestRank3 unit tests in spark mode

2015-06-19 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14594189#comment-14594189
 ] 

Mohit Sabharwal commented on PIG-4607:
--

Looks like TestRank2 was not failing even without Rank/Counter in the Spark 
plan, which is strange: 
https://builds.apache.org/job/Pig-spark/lastCompletedBuild/testReport/

I was also looking at CounterConverter and didn't quite understand the purpose 
of maintaining two counters for every tuple (localCount and sparkCount) - one 
should work, right?

 Enable TestRank1,TestRank3 unit tests in spark mode
 ---

 Key: PIG-4607
 URL: https://issues.apache.org/jira/browse/PIG-4607
 Project: Pig
  Issue Type: Sub-task
  Components: spark
Reporter: liyunzhang_intel
Assignee: kexianda
 Fix For: spark-branch

 Attachments: PIG-4607.patch


  In https://builds.apache.org/job/Pig-spark/216/#showFailuresLink, the following unit tests in TestRank1 and TestRank3 fail:
 org.apache.pig.test.TestRank1.testRank02RowNumber
 org.apache.pig.test.TestRank1.testRank01RowNumber
 org.apache.pig.test.TestRank3.testRankWithSplitInMap
 org.apache.pig.test.TestRank3.testRankWithSplitInReduce
 org.apache.pig.test.TestRank3.testRankCascade



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4607) Enable TestRank1,TestRank3 unit tests in spark mode

2015-06-19 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14593762#comment-14593762
 ] 

Mohit Sabharwal commented on PIG-4607:
--

Thanks, [~kexianda]

I discovered these missing operators in SparkPlan today as well :)

Any idea why TestRank2 is failing?

 Enable TestRank1,TestRank3 unit tests in spark mode
 ---

 Key: PIG-4607
 URL: https://issues.apache.org/jira/browse/PIG-4607
 Project: Pig
  Issue Type: Sub-task
  Components: spark
Reporter: liyunzhang_intel
Assignee: kexianda
 Fix For: spark-branch

 Attachments: PIG-4607.patch


  In https://builds.apache.org/job/Pig-spark/216/#showFailuresLink, the following unit tests in TestRank1 and TestRank3 fail:
 org.apache.pig.test.TestRank1.testRank02RowNumber
 org.apache.pig.test.TestRank1.testRank01RowNumber
 org.apache.pig.test.TestRank3.testRankWithSplitInMap
 org.apache.pig.test.TestRank3.testRankWithSplitInReduce
 org.apache.pig.test.TestRank3.testRankCascade



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4606) Enable TestDefaultDateTimeZone unit tests in spark mode

2015-06-17 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14590386#comment-14590386
 ] 

Mohit Sabharwal commented on PIG-4606:
--

Thanks, [~kellyzly], the fix LGTM.

While we're here, it might be good to refactor some of this code, because the 
launchPig logic is getting a bit crowded.

For example, the startSparkJob() name is confusing. The job actually gets 
started inside the sparkPlanToRDD() method.

It might be cleaner to create a new initialize() method and put all the 
initialization steps inside that method:
 - saveUdfImporList
 - create and populate job conf
 - SchemaTupleBackend.initialize
 - read time zone from conf and set it

And rename startSparkJob() to something like addFilesToSparkJob(SparkContext sc).

What do you think?
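
A rough sketch of the shape I mean (only SchemaTupleBackend.initialize() is an existing call; createJobConf() and setDefaultTimeZone() are hypothetical helper names):

{code}
// Hypothetical grouping of the initialization steps into one method;
// helper names below are suggestions, not methods that exist today.
private JobConf initialize(PigContext pigContext) throws IOException {
    saveUdfImporList(pigContext);                        // step 1
    JobConf jobConf = createJobConf(pigContext);         // step 2: create/populate job conf
    SchemaTupleBackend.initialize(jobConf, pigContext);  // step 3
    setDefaultTimeZone(jobConf);                         // step 4: read tz from conf, set it
    return jobConf;
}
{code}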
 

 Enable TestDefaultDateTimeZone unit tests in spark mode
 -

 Key: PIG-4606
 URL: https://issues.apache.org/jira/browse/PIG-4606
 Project: Pig
  Issue Type: Sub-task
  Components: spark
Reporter: liyunzhang_intel
Assignee: liyunzhang_intel
 Fix For: spark-branch

 Attachments: PIG-4606.patch


 In https://builds.apache.org/job/Pig-spark/216/#showFailuresLink, the unit tests in TestDefaultDateTimeZone fail:
 org.apache.pig.test.TestDefaultDateTimeZone.testDST
 org.apache.pig.test.TestDefaultDateTimeZone.testLocalExecution



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4606) Enable TestDefaultDateTimeZone unit tests in spark mode

2015-06-17 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14591201#comment-14591201
 ] 

Mohit Sabharwal commented on PIG-4606:
--

Thank you so much, [~kellyzly]!

+1 (non-binding)

 Enable TestDefaultDateTimeZone unit tests in spark mode
 -

 Key: PIG-4606
 URL: https://issues.apache.org/jira/browse/PIG-4606
 Project: Pig
  Issue Type: Sub-task
  Components: spark
Reporter: liyunzhang_intel
Assignee: liyunzhang_intel
 Fix For: spark-branch

 Attachments: PIG-4606.patch, PIG-4606_1.patch


 In https://builds.apache.org/job/Pig-spark/216/#showFailuresLink, the unit tests in TestDefaultDateTimeZone fail:
 org.apache.pig.test.TestDefaultDateTimeZone.testDST
 org.apache.pig.test.TestDefaultDateTimeZone.testLocalExecution



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4604) Clean up: refactor the package import order in the files under pig/src/org/apache/pig/backend/hadoop/executionengine/spark according to certain rule

2015-06-16 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14589075#comment-14589075
 ] 

Mohit Sabharwal commented on PIG-4604:
--

LGTM. +1 (non-binding)

 Clean up: refactor the package import order in the files under 
 pig/src/org/apache/pig/backend/hadoop/executionengine/spark according to 
 certain rule
 

 Key: PIG-4604
 URL: https://issues.apache.org/jira/browse/PIG-4604
 Project: Pig
  Issue Type: Sub-task
  Components: spark
Reporter: liyunzhang_intel
Assignee: liyunzhang_intel
 Fix For: spark-branch

 Attachments: IntelliJ_Java_codeStyle_Imports1.png, 
 IntelliJ_Java_codeStyle_Imports2.png, PIG-4604.patch


 after discussion with [~mohitsabharwal],[~xuefuz],[~praveenr019], [~kexianda]:
 now we use following rule about the package import order in files under 
 pig/src/org/apache/pig/backend/hadoop/executionengine/spark:
 1.  java.* and javax.*
 2.  blank line
 3.  scala.*
 4. blank line
 5.  Project classes (org.apache.*)
 6.  blank line
 7.  Third party libraries (org.*, com.*, etc.)
 If you use IntelliJ as your IDE, you can refer to the attachments to configure the import layout of your Java code style:
  1. Use IntelliJ
  2. Select “File”-“Settings”-“Code Style”-“Java”-“Imports”-“Import Layout”
 The files under pig/src/org/apache/pig/backend/hadoop/executionengine/spark currently have different package import orders; they should all use the same order.
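
 For illustration, an import block following rules 1-7 above would look like this (the particular classes are arbitrary examples):
 {code}
 import java.io.IOException;
 import java.util.List;

 import scala.Tuple2;

 import org.apache.pig.data.Tuple;
 import org.apache.pig.impl.PigContext;

 import com.google.common.collect.Lists;
 import org.joda.time.DateTimeZone;
 {code}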



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (PIG-4601) Implement Merge CoGroup for Spark engine

2015-06-12 Thread Mohit Sabharwal (JIRA)
Mohit Sabharwal created PIG-4601:


 Summary: Implement Merge CoGroup for Spark engine
 Key: PIG-4601
 URL: https://issues.apache.org/jira/browse/PIG-4601
 Project: Pig
  Issue Type: Sub-task
  Components: spark
Affects Versions: spark-branch
Reporter: Mohit Sabharwal
 Fix For: spark-branch


Implement single-stage (map-side) co-group where all the input data sets are 
sorted by key:

{code}
C = cogroup A by c1, B by c1 using 'merge';
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4597) Enable TestNullConstant unit test in spark mode

2015-06-11 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14582647#comment-14582647
 ] 

Mohit Sabharwal commented on PIG-4597:
--

Thanks, [~kexianda]! LGTM.

+1 (non-binding)

 Enable TestNullConstant unit test in spark mode 
 --

 Key: PIG-4597
 URL: https://issues.apache.org/jira/browse/PIG-4597
 Project: Pig
  Issue Type: Sub-task
  Components: spark
Reporter: kexianda
Assignee: kexianda
 Fix For: spark-branch

 Attachments: PIG-4597.patch


 ant -Dtestcase=TestNullConstant -Dexectype=spark -DdebugPort= 
 -Dhadoopversion=23 test
  You will find the following unit test failure:
 Error Message
 expected:<4> but was:<3>
 Stacktrace
 junit.framework.AssertionFailedError: expected:<4> but was:<3>
   at 
 org.apache.pig.test.TestNullConstant.testOuterJoin(TestNullConstant.java:117)
 It failed because the actual result of the group operator is not in the same 
 order as the expected result.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4595) Fix unit test failures about TestFRJoinNullValue in spark mode

2015-06-10 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14580953#comment-14580953
 ] 

Mohit Sabharwal commented on PIG-4595:
--

+1 (non-binding)

 Fix unit test failures about TestFRJoinNullValue in spark mode
 --

 Key: PIG-4595
 URL: https://issues.apache.org/jira/browse/PIG-4595
 Project: Pig
  Issue Type: Sub-task
  Components: spark
Reporter: liyunzhang_intel
Assignee: liyunzhang_intel
 Fix For: spark-branch

 Attachments: PIG-4595.patch, PIG-4595_1.patch


 based on f9a50f3, using the following command to test TestFRJoinNullValue:
 ant -Dtestcase=TestFRJoinNullValue -Dexectype=spark  -Dhadoopversion=23  test 
 The following unit tests fail:
 • org.apache.pig.test.TestFRJoinNullValue.testTupleLeftNullMatch
 • org.apache.pig.test.TestFRJoinNullValue.testLeftNullMatch
 • org.apache.pig.test.TestFRJoinNullValue.testTupleNullMatch
 • org.apache.pig.test.TestFRJoinNullValue.testNullMatch
 These unit tests fail because null values from table a and table b are 
 considered equal when table a FR-joins table b.
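
 A tiny illustration of the intended semantics (a hypothetical helper, not Pig's actual join code):
 {code}
 public class NullJoinKeySemantics {
     // A null key on either side must never match, not even another null.
     static boolean joinKeysMatch(Object leftKey, Object rightKey) {
         if (leftKey == null || rightKey == null) {
             return false;
         }
         return leftKey.equals(rightKey);
     }

     public static void main(String[] args) {
         System.out.println(joinKeysMatch(null, null)); // false, not a match
         System.out.println(joinKeysMatch("k1", "k1")); // true
     }
 }
 {code}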



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4585) Use newAPIHadoopRDD instead of newAPIHadoopFile

2015-06-10 Thread Mohit Sabharwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mohit Sabharwal updated PIG-4585:
-
Attachment: PIG-4585.2.patch

 Use newAPIHadoopRDD instead of newAPIHadoopFile
 ---

 Key: PIG-4585
 URL: https://issues.apache.org/jira/browse/PIG-4585
 Project: Pig
  Issue Type: Sub-task
  Components: spark
Affects Versions: spark-branch
Reporter: Mohit Sabharwal
Assignee: Mohit Sabharwal
 Fix For: spark-branch

 Attachments: PIG-4585.1.patch, PIG-4585.2.patch, PIG-4585.patch


 LoadConverter currently uses SparkContext.newAPIHadoopFile which won't work 
 for non-filesystem based input sources, like HBase.
 newAPIHadoopFile assumes a FileInputFormat and attempts to  
 [verify|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L1065]
  this in the constructor, which fails for HBaseTableInputFormat (which is not 
 a FileInputFormat)
 {code}
   NewFileInputFormat.setInputPaths(job, path)
 {code}
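
 A self-contained sketch of the proposed call, assuming the job conf has already been populated (PigInputFormat here is just a stand-in for whatever InputFormat the loader needs):
 {code}
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.io.Text;
 import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigInputFormat;
 import org.apache.pig.data.Tuple;
 import org.apache.spark.SparkConf;
 import org.apache.spark.api.java.JavaPairRDD;
 import org.apache.spark.api.java.JavaSparkContext;

 public class NewApiHadoopRddSketch {
     public static void main(String[] args) {
         JavaSparkContext sc = new JavaSparkContext(
                 new SparkConf().setAppName("load-sketch").setMaster("local"));
         Configuration conf = new Configuration(); // would carry the Pig plan etc.
         // newAPIHadoopRDD accepts any InputFormat; newAPIHadoopFile would
         // additionally call FileInputFormat.setInputPaths() and so rejects
         // non-file formats such as HBase's TableInputFormat.
         JavaPairRDD<Text, Tuple> rdd = sc.newAPIHadoopRDD(
                 conf, PigInputFormat.class, Text.class, Tuple.class);
         System.out.println("RDD defined lazily: " + rdd.toString());
         sc.stop();
     }
 }
 {code}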



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4593) Enable TestMultiQueryLocal in spark mode

2015-06-10 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14580954#comment-14580954
 ] 

Mohit Sabharwal commented on PIG-4593:
--

+1 (non-binding)

 Enable TestMultiQueryLocal in spark mode
 --

 Key: PIG-4593
 URL: https://issues.apache.org/jira/browse/PIG-4593
 Project: Pig
  Issue Type: Sub-task
  Components: spark
Reporter: liyunzhang_intel
Assignee: liyunzhang_intel
 Fix For: spark-branch

 Attachments: PIG-4593.patch, PIG-4593_1.patch


 in https://builds.apache.org/job/Pig-spark/211/#showFailuresLink, it shows that the following unit tests fail:
 org.apache.pig.test.TestMultiQueryLocal.testMultiQueryWithTwoStores
 org.apache.pig.test.TestMultiQueryLocal.testMultiQueryWithThreeStores
 org.apache.pig.test.TestMultiQueryLocal.testMultiQueryWithTwoLoads



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (PIG-4189) Make cross join work with Spark

2015-06-10 Thread Mohit Sabharwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mohit Sabharwal resolved PIG-4189.
--
Resolution: Duplicate
  Assignee: Mohit Sabharwal

The CROSS operation is implemented in two flavors in Pig:
1) Regular CROSS, using the GFCross UDF
2) Nested CROSS, using POCross

Both work with Spark due to the patches in the linked JIRAs.

 Make cross join work with Spark
 ---

 Key: PIG-4189
 URL: https://issues.apache.org/jira/browse/PIG-4189
 Project: Pig
  Issue Type: Sub-task
  Components: spark
Reporter: Praveen Rachabattuni
Assignee: Mohit Sabharwal
 Fix For: spark-branch


 Related e2e tests: Cross_1 - Cross_5
 Sample script:
 a = load '/user/pig/tests/data/singlefile/studenttab10k' as (name, age, gpa);
 b = load '/user/pig/tests/data/singlefile/votertab10k' as (name, age, 
 registration, contributions);
 c = filter a by age < 19 and gpa > 1.0;
 d = filter b by age < 19;
 e = cross c, d;
 store e into '/user/pig/out/praveenr-1411378727-nightly.conf/Cross_1.out';
 Log:
 [Executor task launch worker-1] ERROR org.apache.spark.executor.Executor - 
 Exception in task ID 2
 java.lang.RuntimeException: 
 org.apache.pig.backend.executionengine.ExecException: ERROR 2078: Caught 
 error from UDF: org.apache.pig.impl.builtin.GFCross [Unable to get 
 parallelism hint from job conf]
   at 
 org.apache.pig.backend.hadoop.executionengine.spark.converter.POOutputConsumerIterator.readNext(POOutputConsumerIterator.java:57)
   at 
 org.apache.pig.backend.hadoop.executionengine.spark.converter.POOutputConsumerIterator.hasNext(POOutputConsumerIterator.java:63)
   at 
 scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:41)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327)
   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
   at org.apache.spark.scheduler.Task.run(Task.scala:53)
   at 
 org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
   at 
 org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:49)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:178)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:744)
 Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2078: 
 Caught error from UDF: org.apache.pig.impl.builtin.GFCross [Unable to get 
 parallelism hint from job conf]
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:372)
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNextDataBag(POUserFunc.java:388)
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:331)
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:378)
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNextTuple(POForEach.java:298)
   at 
 org.apache.pig.backend.hadoop.executionengine.spark.converter.ForEachConverter$ForEachFunction$1.getNextResult(ForEachConverter.java:53)
   at 
 org.apache.pig.backend.hadoop.executionengine.spark.converter.POOutputConsumerIterator.readNext(POOutputConsumerIterator.java:36)
   ... 15 more
 Caused by: java.io.IOException: Unable to get parallelism hint from job conf
   at org.apache.pig.impl.builtin.GFCross.exec(GFCross.java:61)
   at org.apache.pig.impl.builtin.GFCross.exec(GFCross.java:1)
   at 
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:344)
   ... 21 more



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4588) Move tests under 'test-spark' target

2015-06-09 Thread Mohit Sabharwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mohit Sabharwal updated PIG-4588:
-
Attachment: PIG-4588.1.patch

 Move tests under 'test-spark' target
 

 Key: PIG-4588
 URL: https://issues.apache.org/jira/browse/PIG-4588
 Project: Pig
  Issue Type: Sub-task
  Components: spark
Affects Versions: spark-branch
Reporter: Mohit Sabharwal
Assignee: Mohit Sabharwal
 Fix For: spark-branch

 Attachments: PIG-4588.1.patch, PIG-4588.patch


 Run test-spark and test-spark-local tests in the same ant target.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4585) Use newAPIHadoopRDD instead of newAPIHadoopFile

2015-06-09 Thread Mohit Sabharwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mohit Sabharwal updated PIG-4585:
-
Attachment: PIG-4585.1.patch

 Use newAPIHadoopRDD instead of newAPIHadoopFile
 ---

 Key: PIG-4585
 URL: https://issues.apache.org/jira/browse/PIG-4585
 Project: Pig
  Issue Type: Sub-task
  Components: spark
Affects Versions: spark-branch
Reporter: Mohit Sabharwal
Assignee: Mohit Sabharwal
 Fix For: spark-branch

 Attachments: PIG-4585.1.patch, PIG-4585.patch


 LoadConverter currently uses SparkContext.newAPIHadoopFile which won't work 
 for non-filesystem based input sources, like HBase.
 newAPIHadoopFile assumes a FileInputFormat and attempts to  
 [verify|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L1065]
  this in the constructor, which fails for HBaseTableInputFormat (which is not 
 a FileInputFormat)
 {code}
   NewFileInputFormat.setInputPaths(job, path)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4593) Enable TestMultiQueryLocal in spark mode

2015-06-09 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14579417#comment-14579417
 ] 

Mohit Sabharwal commented on PIG-4593:
--

+1 (non-binding)

 Enable TestMultiQueryLocal in spark mode
 --

 Key: PIG-4593
 URL: https://issues.apache.org/jira/browse/PIG-4593
 Project: Pig
  Issue Type: Sub-task
  Components: spark
Reporter: liyunzhang_intel
Assignee: liyunzhang_intel
 Fix For: spark-branch

 Attachments: PIG-4593.patch


 in https://builds.apache.org/job/Pig-spark/211/#showFailuresLink, it shows that the following unit tests fail:
 org.apache.pig.test.TestMultiQueryLocal.testMultiQueryWithTwoStores
 org.apache.pig.test.TestMultiQueryLocal.testMultiQueryWithThreeStores
 org.apache.pig.test.TestMultiQueryLocal.testMultiQueryWithTwoLoads



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4596) Fix unit test failures about MergeJoinConverter in spark mode

2015-06-09 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14579431#comment-14579431
 ] 

Mohit Sabharwal commented on PIG-4596:
--

+1 (non-binding)

 Fix unit test failures about MergeJoinConverter in spark mode
 -

 Key: PIG-4596
 URL: https://issues.apache.org/jira/browse/PIG-4596
 Project: Pig
  Issue Type: Sub-task
  Components: spark
Reporter: liyunzhang_intel
Assignee: liyunzhang_intel
 Fix For: spark-branch

 Attachments: PIG-4596.patch


 using the following command to test TestMergeJoin:
 ant -Dtestcase=TestMergeJoin -Dexectype=spark  -Dhadoopversion=23  test
  
 The following unit test fails:
 org.apache.pig.test.TestMergeJoin.testMergeJoinWithNulls
 This test fails because null values from table a and table b are considered 
 equal when table a merge-joins table b.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4589) Fix unit test failure in TestCase

2015-06-08 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578125#comment-14578125
 ] 

Mohit Sabharwal commented on PIG-4589:
--

+1 (non-binding)

 Fix unit test failure in TestCase
 -

 Key: PIG-4589
 URL: https://issues.apache.org/jira/browse/PIG-4589
 Project: Pig
  Issue Type: Sub-task
  Components: spark
Reporter: kexianda
Assignee: kexianda
 Fix For: spark-branch

 Attachments: PIG-4589.patch


 ant -Dtestcase=TestCase -Dexectype=spark -DdebugPort= -Dhadoopversion=23 
 test
  You will find the following unit test failure:
 * org.apache.pig.test.TestCase.testWithDereferenceOperator
 It failed because the actual result of the group operator is not in the same 
 order as the expected result.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4586) Cleanup: Rename POConverter to RDDConverter

2015-06-04 Thread Mohit Sabharwal (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14573206#comment-14573206
 ] 

Mohit Sabharwal commented on PIG-4586:
--

[~kellyzly], the PO prefix is used by operators. But POConverter is not an 
operator. So I think it will confuse someone looking at the code for the first 
time.

RDDConverter is an alternative name (a class that converts physical operators 
to RDDs).

Let me know if you have any other suggestions for the name.
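
For context, the contract being renamed is roughly this (a simplified sketch, with generics reconstructed from the call sites rather than copied from the source):

{code}
import java.io.IOException;
import java.util.List;

import org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator;
import org.apache.spark.rdd.RDD;

// Converts a physical operator, given the RDDs of its predecessors, into
// a new RDD - hence "RDDConverter" rather than an operator-style "PO" name.
public interface RDDConverter<IN, OUT, T extends PhysicalOperator> {
    RDD<OUT> convert(List<RDD<IN>> predecessors, T physicalOperator)
            throws IOException;
}
{code}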

 Cleanup: Rename POConverter to RDDConverter
 ---

 Key: PIG-4586
 URL: https://issues.apache.org/jira/browse/PIG-4586
 Project: Pig
  Issue Type: Sub-task
  Components: spark
Affects Versions: spark-branch
Reporter: Mohit Sabharwal
Assignee: Mohit Sabharwal
 Fix For: spark-branch

 Attachments: PIG-4586.1.patch, PIG-4586.patch


 PO prefix should apply to operators



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4585) Use newAPIHadoopRDD instead of newAPIHadoopFile

2015-06-03 Thread Mohit Sabharwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mohit Sabharwal updated PIG-4585:
-
Attachment: (was: PIG-4585.1.patch)

 Use newAPIHadoopRDD instead of newAPIHadoopFile
 ---

 Key: PIG-4585
 URL: https://issues.apache.org/jira/browse/PIG-4585
 Project: Pig
  Issue Type: Sub-task
  Components: spark
Affects Versions: spark-branch
Reporter: Mohit Sabharwal
Assignee: Mohit Sabharwal
 Fix For: spark-branch

 Attachments: PIG-4585.patch


 LoadConverter currently uses SparkContext.newAPIHadoopFile which won't work 
 for non-filesystem based input sources, like HBase.
 newAPIHadoopFile assumes a FileInputFormat and attempts to  
 [verify|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L1065]
  this in the constructor, which fails for HBaseTableInputFormat (which is not 
 a FileInputFormat)
 {code}
   NewFileInputFormat.setInputPaths(job, path)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4585) Use newAPIHadoopRDD instead of newAPIHadoopFile

2015-06-03 Thread Mohit Sabharwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mohit Sabharwal updated PIG-4585:
-
Attachment: PIG-4585.1.patch

 Use newAPIHadoopRDD instead of newAPIHadoopFile
 ---

 Key: PIG-4585
 URL: https://issues.apache.org/jira/browse/PIG-4585
 Project: Pig
  Issue Type: Sub-task
  Components: spark
Affects Versions: spark-branch
Reporter: Mohit Sabharwal
Assignee: Mohit Sabharwal
 Fix For: spark-branch

 Attachments: PIG-4585.1.patch, PIG-4585.patch


 LoadConverter currently uses SparkContext.newAPIHadoopFile which won't work 
 for non-filesystem based input sources, like HBase.
 newAPIHadoopFile assumes a FileInputFormat and attempts to  
 [verify|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L1065]
  this in the constructor, which fails for HBaseTableInputFormat (which is not 
 a FileInputFormat)
 {code}
   NewFileInputFormat.setInputPaths(job, path)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (PIG-4586) Cleanup: Rename POConverter to RDDConverter

2015-06-03 Thread Mohit Sabharwal (JIRA)
Mohit Sabharwal created PIG-4586:


 Summary: Cleanup: Rename POConverter to RDDConverter
 Key: PIG-4586
 URL: https://issues.apache.org/jira/browse/PIG-4586
 Project: Pig
  Issue Type: Sub-task
  Components: spark
Affects Versions: spark-branch
Reporter: Mohit Sabharwal
Assignee: Mohit Sabharwal
 Fix For: spark-branch
 Attachments: PIG-4586.patch

PO prefix should apply to operators



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4586) Cleanup: Rename POConverter to RDDConverter

2015-06-03 Thread Mohit Sabharwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mohit Sabharwal updated PIG-4586:
-
Attachment: PIG-4586.patch

 Cleanup: Rename POConverter to RDDConverter
 ---

 Key: PIG-4586
 URL: https://issues.apache.org/jira/browse/PIG-4586
 Project: Pig
  Issue Type: Sub-task
  Components: spark
Affects Versions: spark-branch
Reporter: Mohit Sabharwal
Assignee: Mohit Sabharwal
 Fix For: spark-branch

 Attachments: PIG-4586.patch


 PO prefix should apply to operators



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4585) Use newAPIHadoopRDD instead of newAPIHadoopFile

2015-06-03 Thread Mohit Sabharwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mohit Sabharwal updated PIG-4585:
-
Status: Patch Available  (was: Open)

 Use newAPIHadoopRDD instead of newAPIHadoopFile
 ---

 Key: PIG-4585
 URL: https://issues.apache.org/jira/browse/PIG-4585
 Project: Pig
  Issue Type: Sub-task
  Components: spark
Affects Versions: spark-branch
Reporter: Mohit Sabharwal
Assignee: Mohit Sabharwal
 Fix For: spark-branch

 Attachments: PIG-4585.patch


 LoadConverter currently uses SparkContext.newAPIHadoopFile which won't work 
 for non-filesystem based input sources, like HBase.
 newAPIHadoopFile assumes a FileInputFormat and attempts to  
 [verify|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L1065]
  this in the constructor, which fails for HBaseTableInputFormat (which is not 
 a FileInputFormat)
 {code}
   NewFileInputFormat.setInputPaths(job, path)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4586) Cleanup: Rename POConverter to RDDConverter

2015-06-03 Thread Mohit Sabharwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mohit Sabharwal updated PIG-4586:
-
Attachment: PIG-4586.1.patch

 Cleanup: Rename POConverter to RDDConverter
 ---

 Key: PIG-4586
 URL: https://issues.apache.org/jira/browse/PIG-4586
 Project: Pig
  Issue Type: Sub-task
  Components: spark
Affects Versions: spark-branch
Reporter: Mohit Sabharwal
Assignee: Mohit Sabharwal
 Fix For: spark-branch

 Attachments: PIG-4586.1.patch, PIG-4586.patch


 PO prefix should apply to operators



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4585) Use newAPIHadoopRDD instead of newAPIHadoopFile

2015-06-03 Thread Mohit Sabharwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mohit Sabharwal updated PIG-4585:
-
Description: 
LoadConverter currently uses SparkContext.newAPIHadoopFile which won't work for 
non-filesystem based input sources, like HBase.

newAPIHadoopFile assumes a FileInputFormat and attempts to  
[verify|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L1065]
 this in the constructor, which fails for HBaseTableInputFormat (which is not a 
FileInputFormat)

{code}
  NewFileInputFormat.setInputPaths(job, path)
{code}

  was:
LoadConverter currently uses SparkContext.newAPIHadoopFile which won't work for 
non-filesystem based input sources, like HBase.

newAPIHadoopFile assumes a FileInputFormat and attempts to  
[verify|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L1065]
 this in the constructor, which fails for HBaseTableInputFormat (which is not a 
FileInputFormat)


 Use newAPIHadoopRDD instead of newAPIHadoopFile
 ---

 Key: PIG-4585
 URL: https://issues.apache.org/jira/browse/PIG-4585
 Project: Pig
  Issue Type: Sub-task
  Components: spark
Affects Versions: spark-branch
Reporter: Mohit Sabharwal
Assignee: Mohit Sabharwal
 Fix For: spark-branch


 LoadConverter currently uses SparkContext.newAPIHadoopFile which won't work 
 for non-filesystem based input sources, like HBase.
 newAPIHadoopFile assumes a FileInputFormat and attempts to  
 [verify|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L1065]
  this in the constructor, which fails for HBaseTableInputFormat (which is not 
 a FileInputFormat)
 {code}
   NewFileInputFormat.setInputPaths(job, path)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (PIG-4585) Use newAPIHadoopRDD instead of newAPIHadoopFile

2015-06-03 Thread Mohit Sabharwal (JIRA)
Mohit Sabharwal created PIG-4585:


 Summary: Use newAPIHadoopRDD instead of newAPIHadoopFile
 Key: PIG-4585
 URL: https://issues.apache.org/jira/browse/PIG-4585
 Project: Pig
  Issue Type: Sub-task
  Components: spark
Affects Versions: spark-branch
Reporter: Mohit Sabharwal
Assignee: Mohit Sabharwal
 Fix For: spark-branch


LoadConverter currently uses SparkContext.newAPIHadoopFile which won't work for 
non-filesystem based input sources, like HBase.

newAPIHadoopFile assumes a FileInputFormat and attempts to  
[verify|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L1065]
 this in the constructor, which fails for HBaseTableInputFormat (which is not a 
FileInputFormat)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

