Re: New Apache Pig Committer: Nandor Kollar

2018-09-06 Thread Zhang, Liyun
Congratulations!


Best Regards
ZhangLiyun/Kelly Zhang


On 2018/9/7, 3:59 AM, "Koji Noguchi"  wrote:

On behalf of the Apache Pig PMC, it is my pleasure to announce that
Nandor Kollar has accepted the invitation to become an Apache Pig committer.
We appreciate all the work Nandor has done and look forward to seeing
continued involvement.

Please join me in congratulating Nandor!

Thanks,
Koji




RE: Review Request 59530: PIG-5157 Upgrade to Spark 2.0

2017-06-13 Thread Zhang, Liyun
No need to add null checks. Thanks for the explanation. Please update the patch 
against the latest trunk code; that will make it easier for me to test.

Best Regards
Kelly Zhang/Zhang,Liyun



-Original Message-
From: Nandor Kollar [mailto:nore...@reviews.apache.org] On Behalf Of Nandor 
Kollar
Sent: Tuesday, June 13, 2017 3:52 PM
To: Rohini Palaniswamy <rohini.adi...@gmail.com>; Zhang, Liyun 
<liyun.zh...@intel.com>; Adam Szita <sz...@cloudera.com>
Cc: Zhang, Liyun <liyun.zh...@intel.com>; pig <dev@pig.apache.org>; Nandor 
Kollar <nkol...@cloudera.com>
Subject: Re: Review Request 59530: PIG-5157 Upgrade to Spark 2.0



> On June 13, 2017, 7:33 a.m., kelly zhang wrote:
> > src/org/apache/pig/tools/pigstats/spark/SparkJobStats1.java
> > Lines 60-63 (patched)
> > <https://reviews.apache.org/r/59530/diff/4/?file=1749088#file1749088line60>
> >
> > why 
> > inputMetricExists,outputMetricExist,shuffleReadMetricExist,shuffleWriteMetricExist
> >  are deleted in SparkJobStats2.java?

These metrics are no longer Optional, but have an initialized value. For 
example, this is the shuffleWriteMetrics: val shuffleWriteMetrics: 
ShuffleWriteMetrics = new ShuffleWriteMetrics(). I don't think we need to check 
for the existence of these metrics, since they are initialized all the time, but if 
you think it is better to check for null before using them, I can add null 
checks for it.
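
Roughly, the difference between the two shims looks like this (just a sketch from memory of the Spark APIs, not the actual SparkJobStats code; accessor names may differ from the patch):

    import org.apache.spark.executor.TaskMetrics;

    // Spark 2.x side (SparkJobStats2-style): the shuffle write metrics object is
    // always initialized, so it can be read without an existence check. In Spark
    // 1.x the same accessor returns a scala.Option and has to be tested with
    // isDefined() first, which is what the removed *MetricExist flags guarded.
    final class ShuffleWriteBytes {
        static long bytesWritten(TaskMetrics metrics) {
            return metrics.shuffleWriteMetrics().bytesWritten();
        }
    }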


- Nandor


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/59530/#review177699
---


On June 12, 2017, 9:20 p.m., Nandor Kollar wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/59530/
> ---
> 
> (Updated June 12, 2017, 9:20 p.m.)
> 
> 
> Review request for pig, liyun zhang, Rohini Palaniswamy, and Adam Szita.
> 
> 
> Repository: pig-git
> 
> 
> Description
> ---
> 
> Upgrade to Spark 2.1 API using shims.
> 
> 
> Diffs
> -
> 
>   build.xml 4040fcec8f88d448ed7442461fbf0dea8cd1136e 
>   ivy.xml 971724380f086d214ce62c7ab7879b08b6926802 
>   ivy/libraries.properties a0eb00acd2df42324540df4a9d762c64c608a6d3 
>   
> src/org/apache/pig/backend/hadoop/executionengine/spark/FlatMapFunctionAdapter.java
>  PRE-CREATION 
>   
> src/org/apache/pig/backend/hadoop/executionengine/spark/JobMetricsListener.java
>  f81341233447203abc4800cc7b22a4f419e10262 
>   
> src/org/apache/pig/backend/hadoop/executionengine/spark/PairFlatMapFunctionAdapter.java
>  PRE-CREATION 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/SparkLauncher.java 
> c6351e01a48f297ea2e432401ffd65c4f27f8078 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/SparkShim.java 
> PRE-CREATION 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/SparkShim1.java 
> PRE-CREATION 
>   src/org/apache/pig/backend/hadoop/executionengine/spark/SparkShim2.java 
> PRE-CREATION 
>   
> src/org/apache/pig/backend/hadoop/executionengine/spark/converter/CollectedGroupConverter.java
>  83311dfa5bb25209a5366c2db7e8d483c31d94cd 
>   
> src/org/apache/pig/backend/hadoop/executionengine/spark/converter/FRJoinConverter.java
>  382258e7ff9105aa397c5a2888df0c11e9562ec9 
>   
> src/org/apache/pig/backend/hadoop/executionengine/spark/converter/ForEachConverter.java
>  b58415e7e18ca4cf1331beef06e9214600a51424 
>   
> src/org/apache/pig/backend/hadoop/executionengine/spark/converter/GlobalRearrangeConverter.java
>  f571b808839c2de9415a3e8e4b229a7f4b2eebd7 
>   
> src/org/apache/pig/backend/hadoop/executionengine/spark/converter/LimitConverter.java
>  fe1b54c8f128661d7d19c276d3bb2de7874d3086 
>   
> src/org/apache/pig/backend/hadoop/executionengine/spark/converter/MergeCogroupConverter.java
>  adf78ecab0da10d3b1a7fdde8af2b42dd899810f 
>   
> src/org/apache/pig/backend/hadoop/executionengine/spark/converter/MergeJoinConverter.java
>  d1c43b1e06adc4c9fe45a83b8110402e3756 
>   
> src/org/apache/pig/backend/hadoop/executionengine/spark/converter/PoissonSampleConverter.java
>  e003bbd95763b2d189ff9ec540c89abe52592420 
>   
> src/org/apache/pig/backend/hadoop/executionengine/spark/converter/SecondaryKeySortUtil.java
>  00d29b44848546ed16dde2baa8c61b36939971b2 
>   
> src/org/apache/pig/backend/hadoop/executionengine/spark/converter/SkewedJoinConverter.java
>  c55ba3145495a53d69db2dd56434dcc9b3bf8ed5 
>   
> src/org/apache/pig/backend/hadoop/executionengine/spark/converter/SortConverter.java
>  baabfa090323e3bef087e259ce19df2e4c34dd63 
>   
>

RE: [VOTE] Release Pig 0.17.0 (candidate 0)

2017-06-06 Thread Zhang, Liyun
Thanks Adam,
Sorry for the late reply.
I ran a simple Pig on Spark test using 
pig-0.17.0.tar.gz (http://home.apache.org/~szita/pig-0.17.0-rc0/pig-0.17.0.tar.gz) 
and found no problems.



Best Regards
Kelly Zhang/Zhang,Liyun



-Original Message-
From: Koji Noguchi [mailto:knogu...@yahoo-inc.com.INVALID] 
Sent: Wednesday, June 7, 2017 5:42 AM
To: dev@pig.apache.org
Subject: Re: [VOTE] Release Pig 0.17.0 (candidate 0)

Thanks Adam.
Is the asc file missing for the tarball? (pig-0.17.0.tar.gz.asc)
Koji


On Friday, June 2, 2017, 5:00:44 PM EDT, Adam Szita <sz...@cloudera.com> wrote:

Hi,

I have created a candidate build for Pig 0.17.0. The most important changes 
since 0.16.0 are:

(1) Dropped support for Hadoop 1
(2) Included Spark as execution engine

Keys used to sign the release are available at 
http://svn.apache.org/viewvc/pig/trunk/KEYS?view=markup .

Please download, test, and try it out:

http://home.apache.org/~szita/pig-0.17.0-rc0/

Release notes and the rat report are available from the same location.

Should we release this?
Vote closes on Wednesday, June 7th.

Adam


RE: [ANNOUNCE] Welcome new Pig Committer - Adam Szita

2017-05-22 Thread Zhang, Liyun
Congratulations, Adam!

-Original Message-
From: Xuefu Zhang [mailto:xu...@uber.com] 
Sent: Tuesday, May 23, 2017 2:09 AM
To: dev@pig.apache.org
Cc: u...@pig.apache.org
Subject: Re: [ANNOUNCE] Welcome new Pig Committer - Adam Szita

Congratulations, Adam!

On Mon, May 22, 2017 at 10:51 AM, Rohini Palaniswamy < rohini.adi...@gmail.com> 
wrote:

> Hi all,
> It is my pleasure to announce that Adam Szita has been voted in as 
> a committer to Apache Pig. Please join me in congratulating Adam. Adam 
> has been actively contributing to core Pig and Pig on Spark. We 
> appreciate all the work he has done and are looking forward to more 
> contributions from him.
>
> Welcome aboard, Adam.
>
> Regards,
> Rohini
>


RE: Preparing for Pig 0.16.1 release

2017-02-08 Thread Zhang, Liyun
Hi Rohini:
   Very glad to know that Pig 0.17 (including Pig on Spark) will be released in 
March or April. What confuses me now is that the code in the review 
board (https://reviews.apache.org/r/45667/) is not the latest code.
Which version of the code (review board or branch) is ready to merge to trunk? If 
the spark branch code is used, I will upload the newest patch to the review board and 
the community can review the new commits. I would appreciate your suggestion. 





Best Regards
Kelly Zhang/Zhang,Liyun

-Original Message-
From: Rohini Palaniswamy [mailto:rohini.adi...@gmail.com] 
Sent: Wednesday, February 1, 2017 5:21 AM
To: dev@pig.apache.org
Subject: Re: Preparing for Pig 0.16.1 release

Most patches are in and 0.16.1 branch is close to ready to start doing release 
candidate builds.  The next release Pig 0.17 will be a major release with big 
features like Pig on Spark and lot of change and will be around end of March or 
early April. Pig 0.16.1 will most likely be the last patch release before it. 
So if you need any issues that has to go in the Pig 0.16.1 patch release, 
please raise it now.

If there are no issues raised in the next couple of days, will start the RC 
process end of this week.

Regards,
Rohini

On Thu, Dec 15, 2016 at 3:05 PM, Rohini Palaniswamy <rohini.adi...@gmail.com
> wrote:

> Folks,
> Many important fixes have into 0.16 branch and it would be good to 
> have a 0.16.1 release out with those fixes. If there are any jiras you 
> think should go in, please mark them with 0.16.1. Since this is a 
> patch release, the focus will be mainly on critical or important bug 
> fixes. We will target mid Jan for the release.
>
> Regards,
> Rohini
>


RE: Why pig on spark use RDD API rather than DataFrame API ?

2017-01-08 Thread Zhang, Liyun
Hi Jeff:
  Thanks for your interest. When this project was started (August 2014) the 
DataFrame API was not available, and that is why we don't use it in the 
project. Engineers at InMobi raised a similar idea before. In my view, if the 
DataFrame API is more suitable than the RDD API, we can consider it in later 
optimization work after the first release. For now you can file a subtask on 
PIG-4856 (an umbrella jira for optimization work) and work on it if you are 
interested.



Best Regards
Kelly Zhang/Zhang,Liyun



-Original Message-
From: Jeff Zhang [mailto:zjf...@gmail.com] 
Sent: Sunday, January 8, 2017 10:13 AM
To: dev@pig.apache.org
Subject: Why pig on spark use RDD API rather than DataFrame API ?

Hi Folks,

I am very interested in the pig on spark project. When I read the code, I 
found that the current implementation is based on the Spark RDD API. I don't know 
the original background (maybe the DataFrame API was not available when this project 
was started), but for now I feel the DataFrame API might be more suitable than 
the RDD API. Here are 2 advantages of the DataFrame API I can think of:
1.  The DataFrame API is easier to use than the RDD API; although it is not as flexible 
as the RDD API, I think Pig's tuple data structure is very similar to that of 
DataFrame. It should be possible to map each Pig operation to a DataFrame 
operation (see the sketch below). If not, we can give feedback to the Spark community.
2.  Spark's catalyst provides lots of optimizations on DataFrame. If we use the 
DataFrame API, we can leverage those optimizations in catalyst rather than 
reinvent the wheel in Pig.
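
For instance (an illustrative sketch only, using the Spark 2.x Java Dataset API with made-up column names; this is not from the Pig code base), the usual Pig group-and-count pattern could map to a DataFrame aggregation like this:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class PigGroupCountOnDataFrame {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("pig-on-dataframe-sketch").getOrCreate();

            // Roughly the DataFrame equivalent of:
            //   A = LOAD 'input' AS (id, name);
            //   B = GROUP A BY id;
            //   C = FOREACH B GENERATE group, COUNT(A);
            Dataset<Row> a = spark.read().option("sep", "\t").csv("input").toDF("id", "name");
            Dataset<Row> c = a.groupBy("id").count();   // catalyst can optimize this plan
            c.show();
            spark.stop();
        }
    }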

What do you think ? Thanks


RE: [ANNOUNCE] Welcome new Pig Committer - Liyun Zhang

2016-12-21 Thread Zhang, Liyun
Thanks all, I'm glad to get this recognition, and I appreciate all your interest 
in Pig on Spark and hope it will be released in the future.

From: Koji Noguchi [mailto:knogu...@yahoo-inc.com]
Sent: Wednesday, December 21, 2016 7:10 AM
To: dev@pig.apache.org; u...@pig.apache.org
Cc: Praveen R <prav...@sigmoidanalytics.com>; Zhang, Liyun 
<liyun.zh...@intel.com>; Mohit Sabharwal (mo...@cloudera.com) 
<mo...@cloudera.com>; xu...@uber.com
Subject: Re: [ANNOUNCE] Welcome new Pig Committer - Liyun Zhang

Congrats!!!


From: Rohini Palaniswamy 
<rohini.adi...@gmail.com<mailto:rohini.adi...@gmail.com>>
To: "u...@pig.apache.org<mailto:u...@pig.apache.org>" 
<u...@pig.apache.org<mailto:u...@pig.apache.org>>
Cc: Praveen R 
<prav...@sigmoidanalytics.com<mailto:prav...@sigmoidanalytics.com>>; "Zhang, 
Liyun" <liyun.zh...@intel.com<mailto:liyun.zh...@intel.com>>; 
"dev@pig.apache.org<mailto:dev@pig.apache.org>" 
<dev@pig.apache.org<mailto:dev@pig.apache.org>>; "Mohit Sabharwal 
(mo...@cloudera.com<mailto:mo...@cloudera.com>)" 
<mo...@cloudera.com<mailto:mo...@cloudera.com>>; 
"xu...@uber.com<mailto:xu...@uber.com>" <xu...@uber.com<mailto:xu...@uber.com>>
Sent: Tuesday, December 20, 2016 12:36 PM
Subject: Re: [ANNOUNCE] Welcome new Pig Committer - Liyun Zhang

Congratulations Liyun !!!

On Mon, Dec 19, 2016 at 10:25 PM, Jianfeng (Jeff) Zhang <
jzh...@hortonworks.com<mailto:jzh...@hortonworks.com>> wrote:

> Congratulations Liyun!
>
>
>
> Best Regard,
> Jeff Zhang
>
>
>
>
>
> On 12/20/16, 11:29 AM, "Pallavi Rao" 
> <pallavi@inmobi.com<mailto:pallavi@inmobi.com>> wrote:
>
> >Congratulations Liyun!
>
>



RE: [ANNOUNCE] Welcome new Pig Committer - Liyun Zhang

2016-12-15 Thread Zhang, Liyun
Thanks for all your help on this project!


Regards,
Zhang,Liyun




-Original Message-
From: Ke, Xianda [mailto:xianda...@intel.com] 
Sent: Friday, December 16, 2016 9:51 AM
To: dev@pig.apache.org; u...@pig.apache.org
Subject: RE: [ANNOUNCE] Welcome new Pig Committer - Liyun Zhang

Congrats, Liyun!

Regards,
Xianda

-Original Message-
From: Daniel Dai [mailto:da...@hortonworks.com] 
Sent: Friday, December 16, 2016 5:55 AM
To: dev@pig.apache.org; u...@pig.apache.org
Subject: [ANNOUNCE] Welcome new Pig Committer - Liyun Zhang

It is my pleasure to announce that Liyun Zhang became the newest addition to 
the Pig Committers!
Liyun has been actively contributing to Pig on Spark.

Please join me in congratulating Liyun!


FW: File could only be replicated to 0 nodes, instead of 1

2016-12-07 Thread Zhang, Liyun
Hi:
  You can google “File could only be replicated to 0 nodes, instead of 1”; there 
are several reasons for it. In most cases it is because of a lack of disk space 
or all datanodes dying.

Best Regards
Kelly Zhang/Zhang,Liyun





From: mingda li [mailto:limingda1...@gmail.com]
Sent: Wednesday, December 7, 2016 1:01 PM
To: dev@pig.apache.org; u...@pig.apache.org
Subject: File could only be replicated to 0 nodes, instead of 1

Hi,
I am running a multiple join over 100G of TPC-DS data with a bad join order on our cluster. 
Each time, it returns a log file with the following exception. Has anyone 
met it before? Is it caused by having more data than disk space?

org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /tmp/temp-1180529634/tmp-491747926/_temporary/_attempt_201607142217_0115_r_00_0/part-r-0 could only be replicated to 0 nodes, instead of 1
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1558)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:696)
        at sun.reflect.GeneratedMethodAccessor851.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
        at java.security.AccessController.doPrivileged(Native Method)
.
Pig Stack Trace
---
ERROR 1066: Unable to open iterator for alias limit_data

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias limit_data
        at org.apache.pig.PigServer.openIterator(PigServer.java:935)
        at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:754)
        at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:376)


More detailed information is in attachment.


RE: How to test the efficiency of multiple join

2016-12-06 Thread Zhang, Liyun
Hi:
   I think the query time of the multiple-join part is not related to the 
number given to the LIMIT operator (4 in your case). When the query is 
executed, limit_data is evaluated after Bad_OrderRes; the limit (limit_data) 
starts only after the join (Bad_OrderRes) has finished.
If I have missed something, please tell me.


Best Regards
Kelly Zhang/Zhang,Liyun



-Original Message-
From: mingda li [mailto:limingda1...@gmail.com] 
Sent: Wednesday, December 7, 2016 8:18 AM
To: dev@pig.apache.org; u...@pig.apache.org
Subject: How to test the efficiency of multiple join

Dear all,

I want to test the efficiency of different multiple join orders. However, since 
a pig query is executed lazily, I need to use DUMP or STORE to force the query 
to execute.

Now, I use the following query to test the efficiency.

Bad_OrderIn = JOIN inventory BY inv_item_sk, catalog_sales BY cs_item_sk;
Bad_OrderRes = JOIN Bad_OrderIn BY (cs_item_sk, cs_order_number), catalog_returns BY (cr_item_sk, cr_order_number);
limit_data = LIMIT Bad_OrderRes 4;
Dump limit_data;

Do you think it is OK to just show 4 of the results? Can this query's execution 
time represent the efficiency of the multiple join? I am not sure whether it will just 
get 4 items and stop without processing the other items.

Bests,
Mingda


RE: [ANNOUNCE] Congratulations to our new PMC member Koji Noguchi

2016-08-07 Thread Zhang, Liyun
Congrats Koji!

-Original Message-
From: Daniel Dai [mailto:da...@hortonworks.com] 
Sent: Saturday, August 06, 2016 7:28 AM
To: dev@pig.apache.org; u...@pig.apache.org
Subject: [ANNOUNCE] Congratulations to our new PMC member Koji Noguchi

Please welcome Koji Noguchi as our latest Pig PMC member.

Congrats Koji!


RE: Can anyone who has the experience on pigmix share configuration and expected results?

2016-07-28 Thread Zhang, Liyun
istory.JhCounters":{"name":"COUNTERS","groups":[{"name":"org.apache.hadoop.mapreduce.FileSystemCounter","displayName":"File
 System Counters","counts":[{"name":"FILE_BYTES_READ","displayName":"FILE: 
Number of bytes 
read","value":0},{"name":"FILE_BYTES_WRITTEN","displayName":"FILE: Number of 
bytes written","value":169316},{"name":"FILE_READ_OPS","displayName":"FILE: 
Number of read 
operations","value":0},{"name":"FILE_LARGE_READ_OPS","displayName":"FILE: 
Number of large read 
operations","value":0},{"name":"FILE_WRITE_OPS","displayName":"FILE: Number of 
write operations","value":0},{"name":"HDFS_BYTES_READ","displayName":"HDFS: 
Number of bytes 
read","value":0},{"name":"HDFS_BYTES_WRITTEN","displayName":"HDFS: Number of 
bytes written","value":0},{"name":"HDFS_READ_OPS","displayName":"HDFS: Number 
of read 
operations","value":0},{"name":"HDFS_LARGE_READ_OPS","displayName":"HDFS: 
Number of large read 
operations","value":0},{"name":"HDFS_WRITE_OPS","displayName":"HDFS: Number of 
write 
operations","value":0}]},{"name":"org.apache.hadoop.mapreduce.TaskCounter","displayName":"Map-Reduce
 Framework","counts":[{"name":"COMBINE_INPUT_RECORDS","displayName":"Combine 
input 
records","value":0},{"name":"COMBINE_OUTPUT_RECORDS","displayName":"Combine 
output records","value":0},{"name":"REDUCE_INPUT_GROUPS","displayName":"Reduce 
input groups","value":0},{"name":"REDUCE_SHUFFLE_BYTES","displayName":"Reduce 
shuffle 
bytes","value":21039704},{"name":"REDUCE_INPUT_RECORDS","displayName":"Reduce 
input records","value":0},{"name":"REDUCE_OUTPUT_RECORDS","displayName":"Reduce 
output records","value":0},{"name":"SPILLED_RECORDS","displayName":"Spilled 
Records","value":0},{"name":"SHUFFLED_MAPS","displayName":"Shuffled Maps 
","value":6405},{"name":"FAILED_SHUFFLE","displayName":"Failed 
Shuffles","value":0},{"name":"MERGED_MAP_OUTPUTS","displayName":"Merged Map 
outputs","value":0},{"name":"GC_TIME_MILLIS","displayName":"GC time elapsed 
(ms)","value":3617},{"name":"CPU_MILLISECONDS","displayName":"CPU time spent 
(ms)","value":148570},{"name":"PHYSICAL_MEMORY_BYTES","displayName":"Physical 
memory (bytes) 
snapshot","value":346775552},{"name":"VIRTUAL_MEMORY_BYTES","displayName":"Virtual
 memory (bytes) 
snapshot","value":2975604736},{"name":"COMMITTED_HEAP_BYTES","displayName":"Total
 committed heap usage (bytes)","value":1490026496}]},{"name":"Shuffle 
Errors","displayName":"Shuffle 
Errors","counts":[{"name":"BAD_ID","displayName":"BAD_ID","value":0},{"name":"CONNECTION","displayName":"CONNECTION","value":0},{"name":"IO_ERROR","displayName":"IO_ERROR","value":0},{"name":"WRONG_LENGTH","displayName":"WRONG_LENGTH","value":0},{"name":"WRONG_MAP","displayName":"WRONG_MAP","value":0},{"name":"WRONG_REDUCE","displayName":"WRONG_REDUCE","value":0}]},{"name":"org.apache.hadoop.mapreduce.lib.output.FileOutputFormatCounter","displayName":"File
 Output Format Counters 
","counts":[{"name":"BYTES_WRITTEN","displayName":"Bytes 
Written","value":0}]}]}},"clockSplits":[363810,597022,913686,4199950,340,339,340,340,339,340,340,340],"cpuUsages":[14016,15693,7,96634,0,0,0,0,0,0,0,0],"vMemKbytes

RE: Review Request 45667: Support Pig On Spark

2016-06-13 Thread Zhang, Liyun
Pallavi created this review board, so I don't have the privilege to upload a new patch to 
it. I have sent the new patch to Pallavi and she will upload it later.

-Original Message-
From: kelly zhang [mailto:nore...@reviews.apache.org] On Behalf Of kelly zhang
Sent: Tuesday, June 14, 2016 11:25 AM
To: Rohini Palaniswamy; Daniel Dai
Cc: Zhang, Liyun; Pallavi Rao; pig
Subject: Re: Review Request 45667: Support Pig On Spark



> On May 22, 2016, 9:57 p.m., Rohini Palaniswamy wrote:
> > test/org/apache/pig/test/TestBuiltin.java, line 3255 
> > <https://reviews.apache.org/r/45667/diff/1/?file=1323870#file1323870
> > line3255>
> >
> > This testcase is broken if you have 0-0 repeating twice. It is not 
> > UniqueID anymore.

0-0 repeating twice is because we use the task ID in UniqueID#exec:

    public String exec(Tuple input) throws IOException {
        String taskIndex = PigMapReduce.sJobConfInternal.get().get(PigConstants.TASK_INDEX);
        String sequenceId = taskIndex + "-" + Long.toString(sequence);
        sequence++;
        return sequenceId;
    }

In MR, we initialize PigConstants.TASK_INDEX in 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce.Reduce#setup:

    protected void setup(Context context) throws IOException, InterruptedException {
        ...
        context.getConfiguration().set(PigConstants.TASK_INDEX,
                Integer.toString(context.getTaskAttemptID().getTaskID().getId()));
        ...
    }

But Spark does not provide a function like PigGenericMapReduce.Reduce#setup to 
initialize PigConstants.TASK_INDEX when the job starts.
I suggest filing a new jira (initialize PigConstants.TASK_INDEX when the spark job 
starts) and skipping this unit test until that jira is resolved.
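
If such a jira gets filed, one possible shape for the fix (my sketch only, not part of this patch; whether Spark's partition id is the right analogue of the MR task id would need to be decided there) could be:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.pig.PigConstants;
    import org.apache.spark.TaskContext;

    // Sketch: initialize PigConstants.TASK_INDEX from the running Spark task,
    // mirroring what PigGenericMapReduce.Reduce#setup does on the MR side.
    public final class SparkTaskIndexSetup {
        public static void setTaskIndex(Configuration conf) {
            TaskContext ctx = TaskContext.get();   // null when not inside a task
            if (ctx != null) {
                conf.set(PigConstants.TASK_INDEX, Integer.toString(ctx.partitionId()));
            }
        }
    }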


> On May 22, 2016, 9:57 p.m., Rohini Palaniswamy wrote:
> > src/org/apache/pig/data/SelfSpillBag.java, line 32 
> > <https://reviews.apache.org/r/45667/diff/1/?file=1323848#file1323848
> > line32>
> >
> > Why is bag even being serialized by Spark?

SelfSpillBag is used in TestHBaseStorage; if it is not marked transient, a 
NotSerializableException is thrown.


> On May 22, 2016, 9:57 p.m., Rohini Palaniswamy wrote:
> > test/org/apache/pig/newplan/logical/relational/TestLocationInPhysica
> > lPlan.java, line 66 
> > <https://reviews.apache.org/r/45667/diff/1/?file=1323865#file1323865
> > line66>
> >
> > Why does A[3,4] repeat?

The pig script is like:

    "A = LOAD '" + Util.encodeEscape(input.getAbsolutePath()) + "' using PigStorage();\n"
    "B = GROUP A BY $0;\n"
    "A = FOREACH B GENERATE COUNT(A);\n"
    "STORE A INTO '" + Util.encodeEscape(output.getAbsolutePath()) + "';");

The spark plan is :
A: 
Store(/tmp/pig_junit_tmp1755582848/test6087259092054964214output:org.apache.pig.builtin.PigStorage)
 - scope-9
|
|---A: New For Each(false)[tuple] - scope-13
|   |
|   Project[bag][1] - scope-11
|   
|   POUserFunc(org.apache.pig.builtin.COUNT$Final)[long] - scope-12
|   |
|   |---Project[bag][1] - scope-28
|
|---Reduce By(false,false)[tuple] - scope-18
|   |
|   Project[bytearray][0] - scope-19
|   |
|   POUserFunc(org.apache.pig.builtin.COUNT$Intermediate)[tuple] - 
scope-20
|   |
|   |---Project[bag][1] - scope-21
|
|---B: Local Rearrange[tuple]{bytearray}(false) - scope-24
|   |
|   Project[bytearray][0] - scope-26
|
|---A: New For Each(false,false)[bag] - scope-14
|   |
|   Project[bytearray][0] - scope-15
|   |
|   POUserFunc(org.apache.pig.builtin.COUNT$Initial)[tuple] - 
scope-16
|   |
|   |---Project[bag][1] - scope-17
|
|---Pre Combiner Local Rearrange[tuple]{Unknown} - scope-27
|
|---A: 
Load(/tmp/pig_junit_tmp1755582848/test7108242581632795697input:PigStorage) - 
scope-0

 There are two ForEach (scope-13 and scope-14) in the sparkplan so A[3,4] 
appears twice.

Comparing with MR plan:
#--
# Map Reduce Plan 
#--
MapReduce node scope-10
Map Plan
B: Local Rearrange[tuple]{bytearray}(false) - scope-22
|   |
|   Project[bytearray][0] - scope-24
|
|---A: New For Each(false,false)[bag] - scope-11
|   |
|   Project[bytearray][0] - scope-12
|   |
|   POUserFunc(org.apache.pig.builtin.COUNT$Initial)[tuple] - scope-13
|   |
|   |---Project[bag][1] - scope-14
|
|---Pre Combiner Local Rearrange[tuple]{Unknown} - scope-25
|
|---A: 
Load(/tmp/pig_junit_tmp910232853/test2548400580131197161inpu

RE: Welcome our new Pig PMC chair Daniel Dai

2016-03-26 Thread Zhang, Liyun
Congrats Daniel!


Best Regards
Kelly Zhang/Zhang,Liyun



-Original Message-
From: Praveen R [mailto:prav...@sigmoidanalytics.com] 
Sent: Sunday, March 27, 2016 3:47 AM
To: dev@pig.apache.org
Subject: Re: Welcome our new Pig PMC chair Daniel Dai

Congrats Daniel.

On Fri, 25 Mar 2016, 8:53 p.m. Lorand Bendig, <lben...@gmail.com> wrote:

> Congratulations, Daniel!
> -Lorand
>
> 2016-03-23 23:23 GMT+01:00 Rohini Palaniswamy <rohini.adi...@gmail.com>:
>
> > Hi folks,
> > I am very happy to announce that we elected Daniel Dai as our 
> > new Pig PMC Chair and it is official now.  Please join me in 
> > congratulating
> Daniel.
> >
> > Regards,
> > Rohini
> >
>


RE: Welcome to our new Pig PMC member Xuefu Zhang

2016-02-24 Thread Zhang, Liyun
Congratulations Xuefu!


Kelly Zhang/Zhang,Liyun
Best Regards



-Original Message-
From: Jarek Jarcec Cecho [mailto:jar...@gmail.com] On Behalf Of Jarek Jarcec 
Cecho
Sent: Thursday, February 25, 2016 6:36 AM
To: dev@pig.apache.org
Cc: u...@pig.apache.org
Subject: Re: Welcome to our new Pig PMC member Xuefu Zhang

Congratulations Xuefu!

Jarcec

> On Feb 24, 2016, at 1:29 PM, Rohini Palaniswamy <rohini.adi...@gmail.com> 
> wrote:
> 
> It is my pleasure to announce that Xuefu Zhang is our newest addition 
> to the Pig PMC. Xuefu is a long time committer of Pig and has been 
> actively involved in driving the Pig on Spark effort for the past year.
> 
> Please join me in congratulating Xuefu !!!
> 
> Regards,
> Rohini



a new mr operator should be generated when POSplit is encountered?

2015-05-12 Thread Zhang, Liyun
:org.apache.pig.impl.io.InterStorage)
 - scope-59
|
|---C: Package(Packager)[tuple]{int} - scope-16
Global sort: false


MapReduce node scope-64
Map Plan
Union[tuple] - scope-65
|
|---H: Local Rearrange[tuple]{int}(false) - scope-50
|   |   |
|   |   Project[int][0] - scope-51
|   |
|   |---D: New For Each(false,false)[bag] - scope-32
|   |   |
|   |   Project[int][0] - scope-22
|   |   |
|   |   POUserFunc(myudfs.AccumulativeSumBag)[chararray] - scope-25
|   |   |
|   |   |---RelationToExpressionProject[bag][*] - scope-24
|   |   |
|   |   |---F: New For Each(false)[bag] - scope-31
|   |   |   |
|   |   |   Project[bytearray][1] - scope-29
|   |   |
|   |   |---E: POSort[bag]() - scope-28
|   |   |   |
|   |   |   Project[bytearray][1] - scope-27
|   |   |
|   |   |---Project[bag][1] - scope-26
|   |
|   
|---Load(hdfs://zly2.sh.intel.com:8020/tmp/temp523129898/tmp344582360:org.apache.pig.impl.io.InterStorage)
 - scope-60
|
|---H: Local Rearrange[tuple]{int}(false) - scope-52
|   |
|   Project[int][0] - scope-53
|
|---G: New For Each(false,false)[bag] - scope-45
|   |
|   Project[int][0] - scope-35
|   |
|   POUserFunc(myudfs.AccumulativeSumBag)[chararray] - scope-38
|   |
|   |---RelationToExpressionProject[bag][*] - scope-37
|   |
|   |---F: New For Each(false)[bag] - scope-44
|   |   |
|   |   Project[bytearray][1] - scope-42
|   |
|   |---E: POSort[bag]() - scope-41
|   |   |
|   |   Project[bytearray][1] - scope-40
|   |
|   |---Project[bag][1] - scope-39
|

|---Load(hdfs://zly2.sh.intel.com:8020/tmp/temp523129898/tmp344582360:org.apache.pig.impl.io.InterStorage)
 - scope-62
Reduce Plan
H: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-57
|
|---H: Package(JoinPackager(true,true))[tuple]{int} - scope-49
Global sort: false




Kelly Zhang/Zhang,Liyun
Best Regards



I cannot upload a patch from https://issues.apache.org/jira/xxx now. Does anyone know why?

2015-05-05 Thread Zhang, Liyun
Hi all:
  I met a problem when uploading a patch:
   When I choose my patch and upload it, then click the Attach button, it shows 
"Please indicate the file you wish to upload" and the upload fails. Does anyone know 
why?







Kelly Zhang/Zhang,Liyun
Best Regards



a question about TestSkewedJoin# testSkewedJoinKeyPartition

2015-04-17 Thread Zhang, Liyun
Hi all:
   I want to ask a question about TestSkewedJoin# testSkewedJoinKeyPartition:

@Test
public void testSkewedJoinKeyPartition() throws IOException {
    String outputDir = "testSkewedJoinKeyPartition";
    try{
         Util.deleteFile(cluster, outputDir);
    }catch(Exception e){
        // it is ok if directory not exist
    }

    pigServer.registerQuery("A = LOAD '" + INPUT_FILE1 + "' as (id, name, n);");
    pigServer.registerQuery("B = LOAD '" + INPUT_FILE2 + "' as (id, name);");
    pigServer.registerQuery("E = join A by id, B by id using 'skewed' parallel 7;");
    pigServer.store("E", outputDir);

    int[][] lineCount = new int[3][7];

    FileStatus[] outputFiles = fs.listStatus(new Path(outputDir), Util.getSuccessMarkerPathFilter());
    // check how many times a key appear in each part- file
    for (int i=0; i<7; i++) {
        String filename = outputFiles[i].getPath().toString();
        Util.copyFromClusterToLocal(cluster, filename, OUTPUT_DIR + "/" + i);
        BufferedReader reader = new BufferedReader(new FileReader(OUTPUT_DIR + "/" + i));
        String line = null;
        while((line = reader.readLine()) != null) {
            String[] cols = line.split("\t");
            int key = Integer.parseInt(cols[0])/100 -1;
            lineCount[key][i] ++;
        }
        reader.close();
    }

    int fc = 0;
    for(int i=0; i<3; i++) {
        for(int j=0; j<7; j++) {
            if (lineCount[i][j] > 0) {
                fc ++;
            }
        }
    }
    // atleast one key should be a skewed key
    // check atleast one key should appear in more than 1 part- file
    assertTrue(fc > 3);
}


When I run this unit test, I found the results are in OUTPUT_DIR/0 ~ OUTPUT_DIR/6 
(because the parallel number is 7). One key appears in more than 1 part-file.

But when I run the script from the command line, I found the results are in OUTPUT_DIR/part-2, 
OUTPUT_DIR/part-4 and OUTPUT_DIR/part-6. The other part-x files are empty. One 
key only appears in 1 part-file.
A = LOAD './SkewedJoinInput1.txt' as (id, name, n);
B = LOAD './SkewedJoinInput2.txt' as (id, name);
E = join A by id, B by id using 'skewed' parallel 7;
store E into './testSkewedJoin.out';


I don't understand why the results differ between running in the unit test 
environment and running the script directly from the command line.

I'm appreciated if anyone can give me some suggestions.




Kelly Zhang/Zhang,Liyun
Best Regards



a question about MRCompiler#visitSort

2015-03-31 Thread Zhang, Liyun
Hi all,
  I want to ask a question about the following script:
testlimit.pig

a = load './testlimit.txt' as (x:int, y:chararray);
b = order a by x;
c = limit b 1;
store c into './testlimit.out';




In MR it will generate 4 MapReduce nodes (scope-11, scope-14, scope-29, scope-40):

scope-11: loads the input data and stores it to a tmp file.
scope-14: sample - loads the tmp file and generates the quantile file 
hdfs://zly1.sh.intel.com:8020/tmp/temp2146669591/tmp300898425. I think the 
quantile file contains the instance of WeightedRangePartitioner, which shows how 
the keys are distributed.
scope-29: uses the quantile file to sort. My question here: 
WeightedRangePartitioner only shows how the keys are distributed and makes every 
reducer receive an equal amount of data from the map. But can this guarantee the sort?


#--
# Map Reduce Plan
#--
MapReduce node scope-11
Map Plan
Store(hdfs://zly1.sh.intel.com:8020/tmp/temp2146669591/tmp694083214:org.apache.pig.impl.io.InterStorage)
 - scope-12
|
|---a: New For Each(false,false)[bag] - scope-7
 |   |
 |   Cast[int] - scope-2
 |   |
 |   |---Project[bytearray][0] - scope-1
 |   |
 |   Cast[chararray] - scope-5
 |   |
 |   |---Project[bytearray][1] - scope-4
 |
 |---a: 
Load(hdfs://zly1.sh.intel.com:8020/user/root/testlimit.txt:org.apache.pig.builtin.PigStorage)
 - scope-0
Global sort: false


MapReduce node scope-14
Map Plan
b: Local Rearrange[tuple]{tuple}(false) - scope-18
|   |
|   Constant(all) - scope-17
|
|---New For Each(false)[tuple] - scope-16
 |   |
 |   Project[int][0] - scope-15
 |
 
|---Load(hdfs://zly1.sh.intel.com:8020/tmp/temp2146669591/tmp694083214:org.apache.pig.impl.builtin.RandomSampleLoader('org.apache.pig.impl.io.InterStorage','100'))
 - scope-13
Reduce Plan
Store(hdfs://zly1.sh.intel.com:8020/tmp/temp2146669591/tmp300898425:org.apache.pig.impl.io.InterStorage)
 - scope-27
|
|---New For Each(false)[tuple] - scope-26
 |   |
 |   POUserFunc(org.apache.pig.impl.builtin.FindQuantiles)[tuple] - scope-25
 |   |
 |   |---Project[tuple][*] - scope-24
 |
 |---New For Each(false,false)[tuple] - scope-23
 |   |
 |   Constant(2) - scope-22
 |   |
 |   Project[bag][1] - scope-20
 |
 |---Package(Packager)[tuple]{chararray} - scope-19
Global sort: false
Secondary sort: true


MapReduce node scope-29
Map Plan
b: Local Rearrange[tuple]{int}(false) - scope-30
|   |
|   Project[int][0] - scope-8
|
|---Load(hdfs://zly1.sh.intel.com:8020/tmp/temp2146669591/tmp694083214:org.apache.pig.impl.io.InterStorage)
 - scope-28
Combine Plan
Local Rearrange[tuple]{int}(false) - scope-35
|   |
|   Project[int][0] - scope-8
|
|---Limit - scope-34
 |
 |---New For Each(true)[tuple] - scope-33
 |   |
 |   Project[bag][1] - scope-32
 |
 |---Package(LitePackager)[tuple]{int} - scope-31
Reduce Plan
c: 
Store(hdfs://zly1.sh.intel.com:8020/tmp/temp2146669591/tmp538566422:org.apache.pig.impl.io.InterStorage)
 - scope-10
|
|---Limit - scope-39
 |
 |---New For Each(true)[tuple] - scope-38
 |   |
 |   Project[bag][1] - scope-37
 |
 |---Package(LitePackager)[tuple]{int} - scope-36
Global sort: true
Quantile file: hdfs://zly1.sh.intel.com:8020/tmp/temp2146669591/tmp300898425


MapReduce node scope-40
Map Plan
b: Local Rearrange[tuple]{int}(false) - scope-42
|   |
|   Project[int][0] - scope-43
|
|---Load(hdfs://zly1.sh.intel.com:8020/tmp/temp2146669591/tmp538566422:org.apache.pig.impl.io.InterStorage)
 - scope-41
Reduce Plan
c: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-49
|
|---Limit - scope-48
 |
 |---New For Each(true)[bag] - scope-47
 |   |
 |   Project[tuple][1] - scope-46
 |
 |---Package(LitePackager)[tuple]{int} - scope-45
Global sort: false





Kelly Zhang/Zhang,Liyun
Best Regards



RE: Welcome our new Pig PMC chair Rohini Palaniswamy

2015-03-19 Thread Zhang, Liyun

Congratulations, Rohini!

Best regards
Kelly Zhang/Zhang,Liyun



-Original Message-
From: Xuefu Zhang [mailto:xzh...@cloudera.com] 
Sent: Thursday, March 19, 2015 10:29 AM
To: dev@pig.apache.org
Cc: u...@pig.apache.org
Subject: Re: Welcome our new Pig PMC chair Rohini Palaniswamy

Congratulations, Rohini!

--Xuefu

On Wed, Mar 18, 2015 at 6:48 PM, Cheolsoo Park cheol...@apache.org wrote:

 Hi all,

 Now it's official that Rohini Palaniswamy is our new Pig PMC chair. 
 Please join me in congratulating Rohini for her new role. Congrats!

 Thanks!
 Cheolsoo



a question about MRCompiler#visitSort

2015-03-04 Thread Zhang, Liyun
://zly1.sh.intel.com:8020/tmp/temp2146669591/tmp538566422:org.apache.pig.impl.io.InterStorage)
 - scope-41
Reduce Plan
c: Store(fakefile:org.apache.pig.builtin.PigStorage) - scope-49
|
|---Limit - scope-48
 |
 |---New For Each(true)[bag] - scope-47
 |   |
 |   Project[tuple][1] - scope-46
 |
 |---Package(LitePackager)[tuple]{int} - 
scope-45
Global sort: false




Best regards
Zhang,Liyun



RE: a question about TestAccumulator#testAccumBasic

2015-03-03 Thread Zhang, Liyun
Hi Daniel:
  Thanks for your comment. I have figured out why 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator#isAccumulative()
 is false in the second case.


Best regards
Zhang,Liyun

-Original Message-
From: Daniel Dai [mailto:da...@hortonworks.com] 
Sent: Wednesday, March 04, 2015 2:39 AM
To: dev@pig.apache.org
Subject: Re: a question about TestAccumulator#testAccumBasic

In the first case the Accumulator gets used and in the second case it does not (I believe 
due to the second UDF BagCount in the plan). There are certain conditions on whether 
the Accumulator can be used in the plan or not; the logic is in 
AccumulatorOptimizer.

Daniel

On 3/3/15, 12:40 AM, Zhang, Liyun liyun.zh...@intel.com wrote:

Hi Daniel:
  Thanks for your reply!
According to your comment: The first case will use Accumulator, so 
accumulate -> cleanup will be called, but no exec. The second case will 
not use Accumulator, exec will be called instead of accumulate -> cleanup.

I guess what you mean is that if 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator#isAccumulative() 
is true, it will execute 
((Accumulator)func).accumulate((Tuple)result.result); while if 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator#isAccumulative() 
is false, it will execute
func.exec((Tuple) result.result);


but I think the second case also use AccumulatorBagCount.  The 
difference between the first and second case is second case use 
org.apache.pig.test.utils.BagCount while the first one are not.
 The first pig script: (TestAccumulator line 151~154)
  A = load ' + INPUT_FILE1 + ' as (id:int, fruit);
  B = group A by id;
  C = foreach B generate group,
org.apache.pig.test.utils.AccumulatorBagCount(A);


The second script: (TestAccumulator line 169~171)
  A = load ' + INPUT_FILE1 + ' as (id:int, fruit);  B = group A by 
id;
  C = foreach B generate group,
org.apache.pig.test.utils.AccumulatorBagCount(A),
org.apache.pig.test.utils.BagCount(A);



Best regards
Zhang,Liyun



-Original Message-
From: Daniel Dai [mailto:da...@hortonworks.com]
Sent: Tuesday, March 03, 2015 3:45 AM
To: dev@pig.apache.org
Subject: Re: a question about TestAccumulator#testAccumBasic

The first case will use Accumulator, so accumulate -> cleanup will be 
called, but no exec. The second case will not use Accumulator, exec 
will be called instead of accumulate -> cleanup.

Daniel

On 3/1/15, 7:21 PM, Zhang, Liyun liyun.zh...@intel.com wrote:

Hi all:
  I have a question about  TestAccumulator#testAccumBasic.
 The first pig script: (TestAccumulator line 151~154)
  A = load ' + INPUT_FILE1 + ' as (id:int, fruit);
  B = group A by id;
  C = foreach B generate group,
org.apache.pig.test.utils.AccumulatorBagCount(A);

  It uses org.apache.pig.test.utils.AccumulatorBagCount, in 
org.apache.pig.test.utils.AccumulatorBagCount#exec
  org.apache.pig.test.utils.AccumulatorBagCount#exec
public Integer exec(Tuple tuple) throws IOException {
throw new IOException(exec() should not be called.); } My 
question:It should throw exception when script is excuted but why not 
throw exception?

The second script: (TestAccumulator line 169~171)
  A = load ' + INPUT_FILE1 + ' as (id:int, fruit);  B = group A by 
id;
  C = foreach B generate group,
org.apache.pig.test.utils.AccumulatorBagCount(A),
org.apache.pig.test.utils.BagCount(A);
 It uses , org.apache.pig.test.utils.AccumulatorBagCount and ), 
org.apache.pig.test.utils.BagCount.
  The code checks whether if it throws exception, if not throw 
exception, the unit test fails.


TestAccumulator#testAccumBasic
@Test
public void testAccumBasic() throws IOException{
151// test group by
152pigServer.registerQuery(A = load ' + INPUT_FILE1 + ' as
(id:int, fruit););
   153 pigServer.registerQuery(B = group A by id;);
154pigServer.registerQuery(C = foreach B generate group,
org.apache.pig.test.utils.AccumulatorBagCount(A););

HashMapInteger, Integer expected = new HashMapInteger,
Integer();
expected.put(100, 2);
expected.put(200, 1);
expected.put(300, 3);
expected.put(400, 1);

IteratorTuple iter = pigServer.openIterator(C);

while(iter.hasNext()) {
Tuple t = iter.next();
assertEquals(expected.get((Integer)t.get(0)),
(Integer)t.get(1));
}

169pigServer.registerQuery(B = group A by id;);
170   pigServer.registerQuery(C = foreach B generate group,   +
org.apache.pig.test.utils.AccumulatorBagCount(A),
org.apache.pig.test.utils.BagCount(A););

try{
iter = pigServer.openIterator(C);

while(iter.hasNext()) {
Tuple t = iter.next();
assertEquals(expected.get((Integer)t.get(0)),
(Integer)t.get(1));
}
fail(accumulator should not be called.);
}catch(IOException e) {
// should throw exception from AccumulatorBagCount

RE: a question about TestAccumulator#testAccumBasic

2015-03-03 Thread Zhang, Liyun
Hi Daniel:
  Thanks for your reply!
According to your comment: The first case will use Accumulator, so accumulate 
-> cleanup will be called, but no exec. The second case will not use 
Accumulator, exec will be called instead of accumulate -> cleanup.

I guess what you mean is that if 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator#isAccumulative()
 is true, it will execute ((Accumulator)func).accumulate((Tuple)result.result); 
while if 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator#isAccumulative()
 is false, it will execute 
func.exec((Tuple) result.result);


but I think the second case also uses AccumulatorBagCount. The difference 
between the first and second case is that the second case uses 
org.apache.pig.test.utils.BagCount while the first one does not. 
 The first pig script: (TestAccumulator line 151~154)
  A = load '" + INPUT_FILE1 + "' as (id:int, fruit);
  B = group A by id;
  C = foreach B generate group, org.apache.pig.test.utils.AccumulatorBagCount(A);


The second script: (TestAccumulator line 169~171)
  A = load '" + INPUT_FILE1 + "' as (id:int, fruit);
  B = group A by id;
  C = foreach B generate group, org.apache.pig.test.utils.AccumulatorBagCount(A), org.apache.pig.test.utils.BagCount(A);
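
For reference, a minimal sketch of an accumulator-style UDF (my own example assuming Pig's org.apache.pig.Accumulator interface; the real AccumulatorBagCount in the test sources may differ), to make the two code paths concrete:

    import java.io.IOException;
    import org.apache.pig.Accumulator;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.DataBag;
    import org.apache.pig.data.Tuple;

    // Counts the tuples of the grouped bag incrementally. When the plan is
    // accumulative, Pig feeds the bag in chunks through accumulate() and then
    // calls getValue()/cleanup(); otherwise exec() receives the whole bag at once.
    public class CountingUDF extends EvalFunc<Long> implements Accumulator<Long> {
        private long count = 0;

        @Override
        public void accumulate(Tuple input) throws IOException {
            DataBag bag = (DataBag) input.get(0);
            count += bag.size();
        }

        @Override
        public Long getValue() {
            return count;
        }

        @Override
        public void cleanup() {
            count = 0;
        }

        @Override
        public Long exec(Tuple input) throws IOException {
            // Non-accumulative path: the full bag arrives in one call.
            return ((DataBag) input.get(0)).size();
        }
    }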



Best regards
Zhang,Liyun



-Original Message-
From: Daniel Dai [mailto:da...@hortonworks.com] 
Sent: Tuesday, March 03, 2015 3:45 AM
To: dev@pig.apache.org
Subject: Re: a question about TestAccumulator#testAccumBasic

The first case will use Accumulator, so accumulate -> cleanup will be called, 
but no exec. The second case will not use Accumulator, exec will be called 
instead of accumulate -> cleanup.

Daniel

On 3/1/15, 7:21 PM, Zhang, Liyun liyun.zh...@intel.com wrote:

Hi all:
  I have a question about  TestAccumulator#testAccumBasic.
 The first pig script: (TestAccumulator line 151~154)
  A = load ' + INPUT_FILE1 + ' as (id:int, fruit);
  B = group A by id;
  C = foreach B generate group,
org.apache.pig.test.utils.AccumulatorBagCount(A);

  It uses org.apache.pig.test.utils.AccumulatorBagCount, in 
org.apache.pig.test.utils.AccumulatorBagCount#exec
  org.apache.pig.test.utils.AccumulatorBagCount#exec
public Integer exec(Tuple tuple) throws IOException {
throw new IOException(exec() should not be called.); } My 
question:It should throw exception when script is excuted but why not 
throw exception?

The second script: (TestAccumulator line 169~171)
  A = load ' + INPUT_FILE1 + ' as (id:int, fruit);  B = group A by 
id;
  C = foreach B generate group,
org.apache.pig.test.utils.AccumulatorBagCount(A),
org.apache.pig.test.utils.BagCount(A);
 It uses , org.apache.pig.test.utils.AccumulatorBagCount and ), 
org.apache.pig.test.utils.BagCount.
  The code checks whether if it throws exception, if not throw 
exception, the unit test fails.


TestAccumulator#testAccumBasic
@Test
public void testAccumBasic() throws IOException{
151// test group by
152pigServer.registerQuery(A = load ' + INPUT_FILE1 + ' as
(id:int, fruit););
   153 pigServer.registerQuery(B = group A by id;);
154pigServer.registerQuery(C = foreach B generate group,
org.apache.pig.test.utils.AccumulatorBagCount(A););

HashMapInteger, Integer expected = new HashMapInteger,
Integer();
expected.put(100, 2);
expected.put(200, 1);
expected.put(300, 3);
expected.put(400, 1);

IteratorTuple iter = pigServer.openIterator(C);

while(iter.hasNext()) {
Tuple t = iter.next();
assertEquals(expected.get((Integer)t.get(0)),
(Integer)t.get(1));
}

169pigServer.registerQuery(B = group A by id;);
170   pigServer.registerQuery(C = foreach B generate group,   +
org.apache.pig.test.utils.AccumulatorBagCount(A),
org.apache.pig.test.utils.BagCount(A););

try{
iter = pigServer.openIterator(C);

while(iter.hasNext()) {
Tuple t = iter.next();
assertEquals(expected.get((Integer)t.get(0)),
(Integer)t.get(1));
}
fail(accumulator should not be called.);
}catch(IOException e) {
// should throw exception from AccumulatorBagCount.
}

// test cogroup
pigServer.registerQuery(A = load ' + INPUT_FILE1 + ' as 
(id:int, fruit););
pigServer.registerQuery(B = load ' + INPUT_FILE1 + ' as 
(id:int, fruit););
pigServer.registerQuery(C = cogroup A by id, B by id;);
pigServer.registerQuery(D = foreach C generate group,   +
org.apache.pig.test.utils.AccumulatorBagCount(A),
org.apache.pig.test.utils.AccumulatorBagCount(B););

HashMapInteger, String expected2 = new HashMapInteger,
String();
expected2.put(100, 2,2);
expected2.put(200, 1,1);
expected2.put(300, 3,3);
expected2.put(400, 1,1);

iter = pigServer.openIterator(D);

while(iter.hasNext

a question about TestAccumulator#testAccumBasic

2015-03-01 Thread Zhang, Liyun
Hi all:
  I have a question about TestAccumulator#testAccumBasic.
 The first pig script: (TestAccumulator line 151~154)
  A = load '" + INPUT_FILE1 + "' as (id:int, fruit);
  B = group A by id;
  C = foreach B generate group, org.apache.pig.test.utils.AccumulatorBagCount(A);

  It uses org.apache.pig.test.utils.AccumulatorBagCount; in 
org.apache.pig.test.utils.AccumulatorBagCount#exec:
    public Integer exec(Tuple tuple) throws IOException {
        throw new IOException("exec() should not be called.");
    }
My question: it should throw an exception when the script is executed, but why is no 
exception thrown?

The second script: (TestAccumulator line 169~171)
  A = load '" + INPUT_FILE1 + "' as (id:int, fruit);
  B = group A by id;
  C = foreach B generate group, org.apache.pig.test.utils.AccumulatorBagCount(A), org.apache.pig.test.utils.BagCount(A);
 It uses org.apache.pig.test.utils.AccumulatorBagCount and org.apache.pig.test.utils.BagCount.
  The code checks whether it throws an exception; if no exception is thrown, the 
unit test fails.


TestAccumulator#testAccumBasic
@Test
public void testAccumBasic() throws IOException{
151     // test group by
152     pigServer.registerQuery("A = load '" + INPUT_FILE1 + "' as (id:int, fruit);");
153     pigServer.registerQuery("B = group A by id;");
154     pigServer.registerQuery("C = foreach B generate group, org.apache.pig.test.utils.AccumulatorBagCount(A);");

        HashMap<Integer, Integer> expected = new HashMap<Integer, Integer>();
        expected.put(100, 2);
        expected.put(200, 1);
        expected.put(300, 3);
        expected.put(400, 1);

        Iterator<Tuple> iter = pigServer.openIterator("C");

        while(iter.hasNext()) {
            Tuple t = iter.next();
            assertEquals(expected.get((Integer)t.get(0)), (Integer)t.get(1));
        }

169     pigServer.registerQuery("B = group A by id;");
170     pigServer.registerQuery("C = foreach B generate group, " +
                "org.apache.pig.test.utils.AccumulatorBagCount(A), org.apache.pig.test.utils.BagCount(A);");

        try{
            iter = pigServer.openIterator("C");

            while(iter.hasNext()) {
                Tuple t = iter.next();
                assertEquals(expected.get((Integer)t.get(0)), (Integer)t.get(1));
            }
            fail("accumulator should not be called.");
        }catch(IOException e) {
            // should throw exception from AccumulatorBagCount.
        }

        // test cogroup
        pigServer.registerQuery("A = load '" + INPUT_FILE1 + "' as (id:int, fruit);");
        pigServer.registerQuery("B = load '" + INPUT_FILE1 + "' as (id:int, fruit);");
        pigServer.registerQuery("C = cogroup A by id, B by id;");
        pigServer.registerQuery("D = foreach C generate group, " +
                "org.apache.pig.test.utils.AccumulatorBagCount(A), org.apache.pig.test.utils.AccumulatorBagCount(B);");

        HashMap<Integer, String> expected2 = new HashMap<Integer, String>();
        expected2.put(100, "2,2");
        expected2.put(200, "1,1");
        expected2.put(300, "3,3");
        expected2.put(400, "1,1");

        iter = pigServer.openIterator("D");

        while(iter.hasNext()) {
            Tuple t = iter.next();
            assertEquals(expected2.get((Integer)t.get(0)), t.get(1).toString() + "," + t.get(2).toString());
        }
}

   Can anyone help me solve my question?

Best regards
Zhang,Liyun



RE: how to test NativeMapReduceOper?

2015-02-17 Thread Zhang, Liyun
Thanks for your suggestions!

From: Rohini Palaniswamy [rohini.adi...@gmail.com]
Sent: Tuesday, February 17, 2015 12:10 PM
To: dev@pig.apache.org
Cc: pig-...@hadoop.apache.org
Subject: Re: how to test NativeMapReduceOper?

nightly.conf - Native group tests.

On Sat, Feb 14, 2015 at 10:02 PM, Zhang, Liyun liyun.zh...@intel.com
wrote:

 Hi all
  I want to ask a question about how to write a pig script to test
  NativeMapReduceOper. I would appreciate it if someone could provide a pig script for
  me.

 Best regards
 Zhang,Liyun




how to test NativeMapReduceOper?

2015-02-15 Thread Zhang, Liyun
Hi all
  I want to ask a question about how to write a pig script to test 
NativeMapReduceOper. I would appreciate it if someone could provide a pig script for me.

Best regards
Zhang,Liyun



How to deal with visitLocalRearrange in spark mode

2015-02-05 Thread Zhang, Liyun
Hi all:
Now I'm working on PIG-4374 (https://issues.apache.org/jira/browse/PIG-4374, Add 
SparkPlan in spark package). I met a problem with the following script in spark mode.
Join.pig
A = load '/SkewedJoinInput1.txt' as (id,name,n);
B = load '/SkewedJoinInput2.txt' as (id,name);
C = group A by id;
D = group B by id;
E = join C by group, D by group;
store E into '/skewedjoin.out';
explain E;

The physical plan will change to a mr plan which contains 3 mapreduce nodes 
(see attached mr_join.txt)
LOGroup converts to POLocalRearrange, POGlobalRearrange, POPackage.
LOJoin converts to POLocalRearrange, POGlobalRearrange, POPackage, POPackage.

   In mapreduce mode, a MapReduceOper has a mapPlan, a reducePlan and a 
combinePlan.
   
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.visitLocalRearrange
   
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.addToMap
   private void addToMap(PhysicalOperator op) throws PlanException, IOException{
 if (compiledInputs.length == 1) {
 //For speed
 MapReduceOper mro = compiledInputs[0];
 if (!mro.isMapDone()) {
 mro.mapPlan.addAsLeaf(op);
 } else if (mro.isMapDone() && !mro.isReduceDone()) {
 FileSpec fSpec = getTempFileSpec();

 POStore st = getStore(); // MyComment: It will first add a 
POStore in mro.reducePlan and store the mro result in a tmp file.
// Then create a new MROper 
which contains a poload which loads previous tmp file
 st.setSFile(fSpec);
 mro.reducePlan.addAsLeaf(st);
 mro.setReduceDone(true);
 mro = startNew(fSpec, mro);
 mro.mapPlan.addAsLeaf(op);
 compiledInputs[0] = mro;
 } else {
 int errCode = 2022;
 String msg = "Both map and reduce phases have been done. This is unexpected while compiling.";
 throw new PlanException(msg, errCode, PigException.BUG);
 }
 curMROp = mro;

  
  }

  In the SparkOper I created, there is only one plan.
  How can I deal with the situation I mentioned above? Now I use the following 
way to deal with it:
  org.apache.pig.backend.hadoop.executionengine.spark.plan.SparkCompiler.visitLocalRearrange
  org.apache.pig.backend.hadoop.executionengine.spark.plan.SparkCompiler.addToMap
  private void addToMap(POLocalRearrange op) throws PlanException, IOException {
      if (compiledInputs.length == 1) {
          SparkOper sparkOp = compiledInputs[0];
          List<PhysicalOperator> preds = plan.getPredecessors(op); //MyComment: It will first search the predecessor of POLocalRearrange,
          if (preds != null && preds.size() > 0 && preds.size() == 1) {
              if (!(preds.get(0) instanceof POLoad)) { // If the predecessor is not a POLoad (usually the predecessor of POLocalRearrange is a POLoad 
when using group, join)
  FileSpec fSpec = getTempFileSpec();//it will 
add a POStore in sparkOper.plan and store the sparkOper result in a tmp file
  POStore st = getStore();   // Then 
create a new SparkOper which contains a poload which loads previous tmp file
  st.setSFile(fSpec);
  sparkOp.plan.addAsLeaf(st);
  sparkOp = startNew(fSpec, sparkOp);
  compiledInputs[0] = sparkOp;
  }
  }
  sparkOp.plan.addAsLeaf(op);
  curSparkOp = sparkOp;
  } else {
  }
  .
  }


Can anyone tell me how Tez deals with this situation? I want to reference 
something from the other execution modes like mapreduce and tez.


Best regards
Zhang,Liyun

A = load '/SkewedJoinInput1.txt' as (id,name,n);
B = load '/SkewedJoinInput2.txt' as (id,name);
C = group A by id;
D = group B by id;
E = join C by group, D by group;
store E into '/skewedjoin.out';
explain E;



#---
# New Logical Plan:
#---
E: (Name: LOStore Schema: 
C::group#10:bytearray,C::A#23:bag{#29:tuple(id#10:bytearray,name#11:bytearray,n#12:bytearray)},D::group#15:bytearray,D::B#25:bag{#30:tuple(id#15:bytearray,name#16:bytearray)})
|
|---E: (Name: LOJoin(HASH) Schema: 
C::group#10:bytearray,C::A#23:bag{#29:tuple(id#10:bytearray,name#11:bytearray,n#12:bytearray)},D::group#15:bytearray,D::B#25:bag{#30:tuple(id#15:bytearray,name#16:bytearray)})
|   |
|   group:(Name: Project Type: bytearray Uid: 10 Input: 0 Column: 0)
|   |
|   group:(Name: Project Type: bytearray Uid: 15 Input: 1 Column: 0)
|
|---C: (Name: LOCogroup Schema: 
group#10:bytearray,A#23:bag{#29:tuple(id#10:bytearray

Can anyone give me a sample script where a load has predecessors?

2015-01-21 Thread Zhang, Liyun
Hi all
I found that:
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler#compile
Line 384~433: this code deals with the situation where a load has 
predecessors. Can anyone give me a sample script where a load has predecessors?
  else if (predecessors != null && predecessors.size() > 0) {
// When processing an entire script (multiquery), we can
// get into a situation where a load has
// predecessors. This means that it depends on some store
// earlier in the plan. We need to take that dependency
// and connect the respective MR operators, while at the
// same time removing the connection between the Physical
// operators. That way the jobs will run in the right
// order.
if (op instanceof POLoad) {

if (predecessors.size() != 1) {
int errCode = 2125;
                String msg = "Expected at most one predecessor of load. Got " + predecessors.size();
throw new PlanException(msg, errCode, PigException.BUG);
}

PhysicalOperator p = predecessors.get(0);
MapReduceOper oper = null;
if(p instanceof POStore || p instanceof PONative){
oper = phyToMROpMap.get(p);
}else{
int errCode = 2126;
                String msg = "Predecessor of load should be a store or mapreduce operator. Got " + p.getClass();
throw new PlanException(msg, errCode, PigException.BUG);
}

// Need new operator
curMROp = getMROp();
curMROp.mapPlan.add(op);
MRPlan.add(curMROp);

plan.disconnect(op, p);
MRPlan.connect(oper, curMROp);
phyToMROpMap.put(op, curMROp);
return;
}

Collections.sort(predecessors);
compiledInputs = new MapReduceOper[predecessors.size()];
int i = -1;
for (PhysicalOperator pred : predecessors) {
                if(pred instanceof POSplit && splitsSeen.containsKey(pred.getOperatorKey())){
compiledInputs[++i] = 
startNew(((POSplit)pred).getSplitStore(), 
splitsSeen.get(pred.getOperatorKey()));
continue;
}
compile(pred);
compiledInputs[++i] = curMROp;
}
} else {




Best Regards
Zhang,Liyun



RE: Re:Re: Is ther a way to run one test of special unit test?

2015-01-19 Thread Zhang, Liyun
Hi:
If you want to run a single unit test, you can use the following command:
ant -Dtestcase=$TestFileName -Dexectype=$mode -DdebugPort=$debugPort test

$mode can be: MR, tez, spark
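
For example (my instantiation of the template above, using a test class discussed elsewhere in this archive; I'm assuming the debugPort option can be omitted when you don't need to attach a debugger):

ant -Dtestcase=TestSkewedJoin -Dexectype=mr test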


Best regards
Zhang,Liyun





-Original Message-
From: lulynn_2008 [mailto:lulynn_2...@163.com] 
Sent: Monday, January 19, 2015 11:57 AM
To: u...@pig.apache.org
Cc: dev@pig.apache.org
Subject: Re:Re: Is ther a way to run one test of special unit test?

Thanks for the reply. But Pig uses ant.

At 2015-01-16 15:06:51, Pradeep Gollakota pradeep...@gmail.com wrote:
If you're using maven AND using surefire plugin 2.7.3+ AND using Junit 
4, then you can do this by specifying -Dtest=TestClass#methodName

ref:
http://maven.apache.org/surefire/maven-surefire-plugin/examples/single-
test.html

On Thu, Jan 15, 2015 at 8:02 PM, Cheolsoo Park piaozhe...@gmail.com wrote:

 I don't think you can disable test cases on the fly in JUnit. You 
 will need to add @Ignore annotation and recompile the test file. 
 Correct me if I am wrong.

 On Thu, Jan 15, 2015 at 6:55 PM, lulynn_2008 lulynn_2...@163.com wrote:

  Hi All,
 
  There are multiple tests in one Test* file. Is there a way to just 
  run only one pointed test?
 
  Thanks
 



RE: add data into hive table in pig

2015-01-14 Thread Zhang, Liyun
Hi:
   In 
http://zh.hortonworks.com/community/forums/topic/problems-using-hcatatalog-with-pig-java-lang-nosuchmethoderror-org-apache-hadoo/,
 someone has a similar problem to yours. Maybe you can find your solution there.




Best regards
Zhang,Liyun



-Original Message-
From: 李运田 [mailto:cumt...@163.com] 
Sent: Thursday, January 15, 2015 10:59 AM
To: dev@pig.apache.org; user
Subject: add data into hive table in pig

pigServer.registerQuery("a = load '" + inputTable + "' using org.apache.hcatalog.pig.HCatLoader();");
pigServer.registerQuery("store a into '' using org.apache.hcatalog.pig.HCatStorer();");
but I always get an error like this:
Exception in thread "main" java.lang.NoSuchMethodError: 
org.apache.hadoop.hive.ql.plan.TableDesc.<init>(Ljava/lang/Class;Ljava/lang/Class;Ljava/lang/Class;Ljava/util/Properties;)V
 at 
org.apache.hcatalog.common.HCatUtil.configureOutputStorageHandler(HCatUtil.java:481)
 at 
org.apache.hcatalog.mapreduce.HCatOutputFormat.setOutput(HCatOutputFormat.java:181)
 at 
org.apache.hcatalog.mapreduce.HCatOutputFormat.setOutput(HCatOutputFormat.java:65)
 at org.apache.hcatalog.pig.HCatStorer.setStoreLocation(HCatStorer.java:125)
Please give me some advice, thanks.


about how to implement ship in other mode like spark

2015-01-02 Thread Zhang, Liyun
Hi all,
  I want to ask a question about ship in pig:
Ship with streaming, it will send streaming binary and supporting files, if 
any, from the client node to the compute nodes.
  I found that the implementation of ship in Mapreduce mode is:

/home/zly/prj/oss/pig/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java
line 721:
setupDistributedCache(pigContext, conf, pigContext.getProperties(),
        "pig.streaming.ship.files", true);

This function gets all pig.streaming.ship.files from the properties, then 
copies the ship files to hadoop using fs.copyFromLocalFile; at the same time, 
the symlink feature is turned on by using DistributedCache.createSymlink(conf). For 
example, if the ship file /tmp/teststreaming.pl is copied from local to hadoop, 
the hadoop file will be hdfs://:8020/tmp/temp/tmp-xxx#teststreaming.pl. 
/tmp/hadoop-root/mapred/local/1419842279890/tmp-1268857767 is a cache for 
hdfs://:8020/tmp/temp/tmp-xxx#teststreaming.pl, and teststreaming.pl will 
be generated as a link to 
/tmp/hadoop-root/mapred/local/1419842279890/tmp-1268857767 in the current 
execution path. If I want to implement ship in another mode like spark, is the 
only thing I need to do to copy the shipped files from the ship path to the 
current execution path?
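
A rough sketch of that idea in plain Java (not actual Pig code; it only copies the
files listed in the pig.streaming.ship.files property into the current working
directory):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class ShipFilesSketch {
    // Copy every shipped file next to the running task, mimicking the
    // DistributedCache symlink behaviour described above.
    public static void shipToWorkingDir(String shipFiles) throws IOException {
        if (shipFiles == null || shipFiles.isEmpty()) {
            return;
        }
        Path workingDir = Paths.get(System.getProperty("user.dir"));
        for (String file : shipFiles.split(",")) {
            Path src = Paths.get(file.trim());
            Files.copy(src, workingDir.resolve(src.getFileName()),
                    StandardCopyOption.REPLACE_EXISTING);
        }
    }
}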



Best regards
Zhang,Liyun



RE: use pig in eclipse

2014-12-31 Thread Zhang, Liyun
Hi 李运田:
  The following error is in your mail:
   java.lang.NoSuchFieldException: runnerState  at 
java.lang.Class.getDeclaredField(Class.java:1948)

Are you using hadoop 2 while compiling pig with hadoop 1?

How to compile pig with hadoop2:
ant -Dhadoopversion=23 jar






-Original Message-
From: 李运田 [mailto:cumt...@163.com] 
Sent: Tuesday, December 30, 2014 5:39 PM
To: dev@pig.apache.org; user
Subject: use pig in eclipse

My eclipse and pig are on the same linux machine.
This is my pig configuration in eclipse:
 props.setProperty("fs.defaultFS", "hdfs://10.210.90.101:8020");
 props.setProperty("hadoop.job.user", "hadoop");
 props.setProperty("mapreduce.framework.name", "yarn");
 props.setProperty("yarn.resourcemanager.hostname", "10.210.90.101");
 props.setProperty("yarn.resourcemanager.admin.address", "10.210.90.101:8141");
 props.setProperty("yarn.resourcemanager.address", "10.210.90.101:8050");
 props.setProperty("yarn.resourcemanager.resource-tracker.address", "10.210.90.101:8025");
 props.setProperty("yarn.resourcemanager.scheduler.address", "10.210.90.101:8030");
I have added core-site.xml, yarn-site.xml, ... into the eclipse project.
I can run pig scripts in grunt,
but when I run
pigServer = new PigServer(ExecType.MAPREDUCE, props);
pigServer.registerQuery("tmp = LOAD '/user/hadoop/aa.txt';");
pigServer.registerQuery("tmp_table_limit = order tmp by $0;");
pigServer.store("tmp_table_limit", "/user/hadoop/shi.txt");
I always get this error:
14/12/30 17:28:33 WARN hadoop20.PigJobControl: falling back to default 
JobControl (not using hadoop 0.20 ?)
java.lang.NoSuchFieldException: runnerState  at 
java.lang.Class.getDeclaredField(Class.java:1948)
 at 
org.apache.pig.backend.hadoop20.PigJobControl.<clinit>(PigJobControl.java:51)
 
 
 
 
 
 
 
help!!


about how to implement ship in other mode like spark

2014-12-29 Thread Zhang, Liyun
Hi all,
  I want to ask a question about ship in pig:
Ship with streaming, it will send streaming binary and supporting files, if 
any, from the client node to the compute nodes.
  I found that the implementation of ship in Mapreduce mode is:

/home/zly/prj/oss/pig/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/JobControlCompiler.java
line 721:
setupDistributedCache(pigContext, conf, pigContext.getProperties(),
        "pig.streaming.ship.files", true);

This function gets all pig.streaming.ship.files from the properties, then 
copies the ship files to hadoop using fs.copyFromLocalFile; at the same time, 
the symlink feature is turned on by using DistributedCache.createSymlink(conf). For 
example, if the ship file /tmp/teststreaming.pl is copied from local to hadoop, 
the hadoop file will be hdfs://:8020/tmp/temp/tmp-xxx#teststreaming.pl. 
/tmp/hadoop-root/mapred/local/1419842279890/tmp-1268857767 is a cache for 
hdfs://:8020/tmp/temp/tmp-xxx#teststreaming.pl, and teststreaming.pl will 
be generated as a link to 
/tmp/hadoop-root/mapred/local/1419842279890/tmp-1268857767 in the current 
execution path. If I want to implement ship in another mode like spark, is the 
only thing I need to do to copy the shipped files from the ship path to the 
current execution path?



Best regards
Zhang,Liyun



some question about streaming_local.conf

2014-12-26 Thread Zhang, Liyun
Hi all:
  I have 2 questions about pig/test/e2e/pig/tests/streaming_local.conf:
1.
$cfg = {
    'driver' => 'Pig',
    'nummachines' => 5,

    'groups' => [
        {
        # This group is for local mode testing
        'name' => 'StreamingLocal',
        'sortBenchmark' => 1,
        'sortResults' => 1,
        'floatpostprocess' => 1,
        'delimiter' => '   ',
        'tests' => [
            {
            #Section 1.1: perl script, no parameters
            'num' => 1,
            'execonly' => 'local',   # this line
            'pig' => q#

All e2e test cases are only executed in local mode now. Can these e2e tests run 
in other modes, like mapreduce, tez or spark?
When I replace 'execonly' => 'local' with 'execonly' => 'spark', all cases pass 
once POStream is implemented in spark mode.
I think we can remove 'execonly' => 'local' and run these e2e tests in other 
modes.

2. When using ship with streaming, it will send the streaming binary and 
supporting files, if any, from the client node to the compute nodes. I found we 
use perl ./libexec/GroupBy.pl in StreamingLocal_3.pig; this path is relative to 
the current execution path. Can we use perl GroupBy.pl instead, since the file 
./libexec/GroupBy.pl has already been shipped to the compute nodes?
StreamingLocal_3.pig (test/e2e/pig):
define CMD `perl ./libexec/GroupBy.pl '\t' 0` ship('./libexec/GroupBy.pl');
A = load './data/singlefile/studenttab10k';
B = group A by $0;
C = foreach B generate flatten(A);
D = stream C through CMD;
store D into './testout/root-1419582821-streaming_local.conf/StreamingLocal_3.out';



Best regards
Zhang,Liyun



RE: Is there any way to guarantee the sequence of group field as the input when using group operator in pig

2014-12-19 Thread Zhang, Liyun
Hi Remi:
  Thanks for your reply. I agree that group makes no guarantee by contract; 
the order of the results need not match the input. So we need to make some 
changes in org.apache.pig.test.TestForEachNestedPlan.testInnerDistinct() and 
org.apache.pig.test.TestForEachNestedPlan.testInnerOrderByAliasReuse(), 
because those two tests check the group results against the input order. 
I have submitted PIG-4282_1.patch. Can anyone help review? Thanks very much.

TestForEachNestedPlan.testInnerDistinct(), line 219:

List<Tuple> expectedResults =
        Util.getTuplesFromConstantTupleStrings(
                new String[] {"(10,68)", "(20,78)"});

int counter = 0;
while (iter.hasNext()) {   // checks the group results against the input order
    assertEquals(expectedResults.get(counter++).toString(),
            iter.next().toString());
}

assertEquals(expectedResults.size(), counter);
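
One possible order-insensitive version of that check, comparing the string forms 
of the tuples as sets (just a sketch, not the actual PIG-4282_1.patch; assumes 
java.util.Set and java.util.HashSet are imported):

Set<String> expected = new HashSet<String>();
for (Tuple t : expectedResults) {
    expected.add(t.toString());
}
Set<String> actual = new HashSet<String>();
while (iter.hasNext()) {
    actual.add(iter.next().toString());
}
assertEquals(expected, actual);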




Best Regards
Zhang,Liyun



-Original Message-
From: remi.catheri...@orange.com [mailto:remi.catheri...@orange.com] 
Sent: Thursday, December 18, 2014 10:56 PM
To: dev@pig.apache.org
Subject: RE: Is there any way to guarantee the sequence of group field as the 
input when using group operator in pig

Hi all,

If you need any kind of ordering in the output, you should rely on the sort 
operator; it was designed for such needs. The fact that different engines produce 
differently ordered groups is due to each engine's specific optimizations. If you 
ask Pig to re-order the groups you just remove the benefit of those 
optimizations. I would rather keep group the way it is, because I know I can 
rely on sort (and pay its price) if I need ordering, or get the best speed if I 
don't need any specific ordering.

My conclusion is: group makes no guarantee by contract, so this is neither a 
problem nor a bug. It is a misuse of group compared to sort.
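
For example, with the aliases from the groupdistinct.pig script quoted below, an 
explicit sort on the group key would look like this (just a sketch; $0 is the 
group field of C):

C1 = order C by $0;
dump C1;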

Regards,
Remi

-Original Message-
From: Zhang, Liyun [mailto:liyun.zh...@intel.com] 
Sent: Thursday, December 18, 2014 07:38
To: pig-...@hadoop.apache.org
Subject: Is there any way to guarantee the sequence of group field as the input 
when using group operator in pig

Hi all,
   I met a problem that group operator has different results in different 
engines like spark and mapreduce (PIG-4282, 
https://issues.apache.org/jira/browse/PIG-4282).

groupdistinct.pig
A = load 'input1.txt' as (age:int,gpa:int);
B = group A by age;
C = foreach B {
 D = A.gpa;
 E = distinct D;
generate group, MIN(E);
};
dump C;
input1.txt is:
10 89
20 78
10 68
10 89
20 92
the mapreduce output is:
(10,68),(20,78)
the spark output is
(20,78),(10,68)
These two results are different because the sequence of the field 'group' is not 
the same.

Is there any way to guarantee the sequence of group field as the input when 
using group operator in pig?


Best regards
Zhang,Liyun


_

Ce message et ses pieces jointes peuvent contenir des informations 
confidentielles ou privilegiees et ne doivent donc pas etre diffuses, exploites 
ou copies sans autorisation. Si vous avez recu ce message par erreur, veuillez 
le signaler a l'expediteur et le detruire ainsi que les pieces jointes. Les 
messages electroniques etant susceptibles d'alteration, Orange decline toute 
responsabilite si ce message a ete altere, deforme ou falsifie. Merci.

This message and its attachments may contain confidential or privileged 
information that may be protected by law; they should not be distributed, used 
or copied without authorisation.
If you have received this email in error, please notify the sender and delete 
this message and its attachments.
As emails may be altered, Orange is not liable for messages that have been 
modified, changed or falsified.
Thank you.



Is there any way to guarantee the sequence of “group” field as the input when using “group” operator in pig

2014-12-17 Thread Zhang, Liyun
Hi all,
   I met a problem that “group” operator has different results in different 
engines like spark and mapreduce (PIG-4282, 
https://issues.apache.org/jira/browse/PIG-4282).

groupdistinct.pig
A = load 'input1.txt' as (age:int,gpa:int);
B = group A by age;
C = foreach B {
 D = A.gpa;
 E = distinct D;
generate group, MIN(E);
};
dump C;
input1.txt is:
10 89
20 78
10 68
10 89
20 92
the mapreduce output is:
(10,68),(20,78)
the spark output is
(20,78),(10,68)
These two results are different because the sequence of the field ‘group’ is not 
the same.

Is there any way to guarantee the sequence of “group” field as the input when 
using “group” operator in pig?


Best regards
Zhang,Liyun



Can anyone help review PIG-4265

2014-11-03 Thread Zhang, Liyun
Hi all,
  Can anyone help review PIG-4265: AlgebraicDoubleMathBase has Java double 
precision problems?

The Java double precision problem can cause invalid results from the SUM eval 
function. The detailed steps to reproduce the bug are in 
PIG-4265 (https://issues.apache.org/jira/browse/PIG-4265).
About the Java double precision problem you can refer to 
https://community.oracle.com/thread/2448849?tstart=0
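
As a small illustration of the underlying double precision issue in plain Java 
(not the PIG-4265 code itself):

double sum = 0.0;
for (int i = 0; i < 10; i++) {
    sum += 0.1;
}
System.out.println(sum);        // prints 0.9999999999999999, not 1.0
System.out.println(sum == 1.0); // false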


Best Regards
Zhang,Liyun



some problem i found in UDFContext

2014-10-23 Thread Zhang, Liyun
Hi all:
Now I'm fixing PIG-4232. I want to ask a question about UDFContext.
I found that when UDFContext#getUDFProperties is called, it will put a new 
entry, whose key is a UDFContextKey (generated from class c) and whose value is 
an empty Properties object, when the specified property is not found (p == null).
public Properties getUDFProperties(Class c) {
    UDFContextKey k = generateKey(c, null);
    Properties p = udfConfs.get(k);
    if (p == null) {
        p = new Properties();
        udfConfs.put(k, p);
    }
    return p;
}

UDFContext#setupUDFContext: before deserializing udfc, it first checks 
udfc.isUDFConfEmpty(). I think this check is problematic when 
UDFContext#getUDFProperties(Class c) is executed first in the current thread and 
UDFContext#setupUDFContext(Configuration job) is executed afterwards. In this 
situation, UDFContext#udfConfs only has an entry whose value is an empty 
Properties object, but udfc.isUDFConfEmpty() returns false.

public static void setupUDFContext(Configuration job) throws IOException {
    UDFContext udfc = UDFContext.getUDFContext();
    udfc.addJobConf(job);
    // don't deserialize in front-end
    if (udfc.isUDFConfEmpty()) {
        udfc.deserialize();
    }
}

public boolean isUDFConfEmpty() {
    return udfConfs.isEmpty();
}

I think the following change can solve the situation mentioned:

public boolean isUDFConfEmpty() {
-    return udfConfs.isEmpty();
+    // return udfConfs.isEmpty();
+    if (udfConfs.isEmpty()) {
+        return true;
+    } else {
+        boolean res = true;
+        for (UDFContextKey udfContextKey : udfConfs.keySet()) {
+            if (!udfConfs.get(udfContextKey).isEmpty()) {
+                res = false;
+                break;
+            }
+        }
+        return res;
+    }
}
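
An equivalent, possibly simpler form of the same check (just a sketch, assuming 
udfConfs maps UDFContextKey to Properties as in the snippets above):

public boolean isUDFConfEmpty() {
    // empty if every registered Properties object is itself empty
    for (Properties p : udfConfs.values()) {
        if (!p.isEmpty()) {
            return false;
        }
    }
    return true;
}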

Can anyone tell me whether there is another way to solve the situation I 
mentioned? Thanks very much.



Best Regards
Zhang,Liyun



RE: need use ant jar -Dhadoopversion=23 to build, then e2e tests works?

2014-09-27 Thread Zhang, Liyun
I have filed JIRA PIG-4205 and attached a patch for it. Please review.

Best Regards
Zhang,Liyun



-Original Message-
From: Daniel Dai [mailto:da...@hortonworks.com] 
Sent: Saturday, September 27, 2014 2:50 AM
To: dev@pig.apache.org
Subject: Re: need use ant jar -Dhadoopversion=23 to build, then e2e tests 
works?

Most probably your HADOOP_HOME is not defined correctly. Do you have 
-Dharness.hadoop.home in the command line?

On Fri, Sep 26, 2014 at 8:21 AM, Rohini Palaniswamy rohini.adi...@gmail.com 
wrote:
 No. Seems like a bug introduced after breaking down the fat jar. 
 Please file a jira.

 On Thu, Sep 25, 2014 at 10:56 PM, Zhang, Liyun liyun.zh...@intel.com
 wrote:

 Hi all,
   I'm using e2e test of pig(
 https://cwiki.apache.org/confluence/display/PIG/HowToTest#HowToTest-HowtoRune2eTests).
 My hadoop env is hadoop1.
 Because I use hadoop-1, I use ant jar to build.

 When I execute following command:

 ant -Dharness.old.pig=old_pig -Dharness.cluster.conf=hadoop_conf_dir
 -Dharness.cluster.bin=hadoop_script test-e2e-deploy

 following error is found:
 67 Going to run /home/zly/prj/oss/pig/test/e2e/pig/../../../bin/pig 
 -e mkdir /user/pig/out/root-1411632015-nightly.conf/
 168 Cannot locate pig-core-h2.jar. do 'ant -Dhadoopversion=23 
 jar', and try again


 It seems that I need to use ant jar -Dhadoopversion=23 to build, then 
 test-e2e-deploy can succeed.

 Can anyone tell me whether my understanding is right?


 Best Regards
 Zhang,Liyun





need use ant jar -Dhadoopversion=23 to build, then e2e tests works?

2014-09-25 Thread Zhang, Liyun
Hi all,
  I'm using e2e test of 
pig(https://cwiki.apache.org/confluence/display/PIG/HowToTest#HowToTest-HowtoRune2eTests).
  My hadoop env is hadoop1.
Because I use hadoop-1, I use ant jar to build.

When I execute following command:

ant -Dharness.old.pig=old_pig -Dharness.cluster.conf=hadoop_conf_dir 
-Dharness.cluster.bin=hadoop_script test-e2e-deploy

following error is found:
67 Going to run /home/zly/prj/oss/pig/test/e2e/pig/../../../bin/pig -e mkdir 
/user/pig/out/root-1411632015-nightly.conf/
168 Cannot locate pig-core-h2.jar. do 'ant -Dhadoopversion=23 jar', and try 
again


It seems that I need to use ant jar -Dhadoopversion=23 to build, then 
test-e2e-deploy can succeed.

Can anyone tell me whether my understanding is right?


Best Regards
Zhang,Liyun



RE: [VOTE] Drop support for Hadoop 0.20 from Pig 0.14

2014-09-17 Thread Zhang, Liyun
+1

Best Regards
Zhang,Liyun


-Original Message-
From: Rohini Palaniswamy [mailto:rohini.adi...@gmail.com] 
Sent: Wednesday, September 17, 2014 12:38 PM
To: dev@pig.apache.org
Subject: [VOTE] Drop support for Hadoop 0.20 from Pig 0.14

Hi,
   Hadoop has matured far from Hadoop 0.20 and has had two major releases after 
that and there has been no development on branch-0.20 (
http://svn.apache.org/viewvc/hadoop/common/branches/branch-0.20/) for 3 years 
now. It is high time we drop support for Hadoop 0.20 and only support Hadoop 
1.x and 2.x lines going forward. This will reduce the maintenance effort and 
also enable us to write more efficient code and cut down on reflections.

Vote closes on Tuesday, Sep 23 2014.

Thanks,
Rohini


RE: [VOTE] Drop support for JDK 6 from Pig 0.14

2014-09-17 Thread Zhang, Liyun
+1

-Original Message-
From: Julien Le Dem [mailto:jul...@twitter.com.INVALID] 
Sent: Wednesday, September 17, 2014 1:09 PM
To: dev@pig.apache.org
Subject: Re: [VOTE] Drop support for JDK 6 from Pig 0.14

+1

On Tuesday, September 16, 2014, Rohini Palaniswamy rohini.adi...@gmail.com
wrote:

 Hi,
Hadoop is dropping support for JDK6 from hadoop-2.7 this year as 
 mentioned in the mail below. Pig should also move to JDK7 to be able 
 to compile against future hadoop 2.x releases and start making 
 releases with jars (binaries, maven repo) compiled in JDK 7. This 
 would also open it up for developers to code with JDK7 specific APIs.

 Vote closes on Tuesday, Sep 23 2014.

 Thanks,
 Rohini




 -- Forwarded message --
 From: Arun C Murthy a...@hortonworks.com
 Date: Tue, Aug 19, 2014 at 10:52 AM
 Subject: Dropping support for JDK6 in Apache Hadoop
 To: d...@hbase.apache.org, d...@hive.apache.org, dev@pig.apache.org, d...@oozie.apache.org
 Cc: common-...@hadoop.apache.org


 [Apologies for the wide distribution.]

 Dear HBase/Hive/Pig/Oozie communities,

  We, over at Hadoop, are considering dropping support for JDK6 this year.

  As you may be aware, we just released hadoop-2.5.0 and are now 
 considering making the next release, i.e. hadoop-2.6.0, the *last* 
 release of Apache Hadoop which supports JDK6. This means, from 
 hadoop-2.7.0 onwards we will not support JDK6 anymore and we *may* start 
 relying on JDK7-specific apis.

  Now, the above is just a proposal and we do not want to pull the 
 trigger without talking to projects downstream - hence the request for 
 your feedback.

  Please feel free to forward this to other communities you might deem 
 to be at risk from this too.

 thanks,
 Arun



RE: Creating a branch for Pig on Spark (PIG-4059)

2014-08-25 Thread Zhang, Liyun
Please add me; I work on Pig on Spark.

-Original Message-
From: Cheolsoo Park [mailto:piaozhe...@gmail.com] 
Sent: Tuesday, August 26, 2014 10:57 AM
To: Jarek Jarcec Cecho
Cc: dev@pig.apache.org; Praveen R
Subject: Re: Creating a branch for Pig on Spark (PIG-4059)

Hi guys,

I asked about branch committership to the infra mailing list, and here is the 
reply-

Many projects have what they consider 'partial committers', that is, folks who 
have access to specific parts of a project's svn tree. Some projects do this for 
GSoC participants, others as a mechanism for moving to 'full committership' 
within the project.

Do note though that in the eyes of the ASF someone with an ICLA and an account 
with any permissions to commit code anywhere in the public svn tree is a 
committer. IOW, you would vote, have ICLAs filed, and request account creation 
as per normal, and then merely adjust the karma in asf-authorization-template 
(and/or LDAP).


Looks like we need to vote and follow the normal process just like any other 
new committer.

@Praveen, Jarcec,
I think Mayur and Praveen from Sigmoid Analytics need branch committership.
Will anyone else work on Pig-on-Spark? Please reply.

Once I have a full list of people, I will open a vote for Pig PMCs.

Thanks,
Cheolsoo


On Mon, Aug 25, 2014 at 11:51 AM, Cheolsoo Park piaozhe...@gmail.com
wrote:

 Additionally, I will give branch-specific commit permission to 
 people who will work on Pig on Spark (assuming it is possible).

 Please let me know if you have any objection on this too.


 On Mon, Aug 25, 2014 at 10:25 AM, Jarek Jarcec Cecho 
 jar...@apache.org
 wrote:

 No objections from my side, thank you for creating the branch 
 Cheolsoo and kudos to the Sigmoid Analytics team for the great work!

 Jarcec

 On Aug 25, 2014, at 7:14 PM, Cheolsoo Park piaozhe...@gmail.com wrote:

  Hi devs,
 
  Sigmoid Analytics has been working on Pig-on-Spark (PIG-4059), and 
  they
 want to merge their work into Apache.
 
  I am going to create a Spark branch for them. Please let me know 
  if
 you have any concerns.
 
  Thanks,
  Cheolsoo