[jira] Subscription: PIG patch available

2015-09-16 Thread jira
Issue Subscription
Filter: PIG patch available (30 issues)

Subscriber: pigdaily

Key Summary
PIG-4677  Display failure information on stop on failure
https://issues.apache.org/jira/browse/PIG-4677
PIG-4670  Embedded Python scripts still parse line by line
https://issues.apache.org/jira/browse/PIG-4670
PIG-4667  Enable Pig on Spark to run on Yarn Client mode
https://issues.apache.org/jira/browse/PIG-4667
PIG-4663  HBaseStorage should allow the MaxResultsPerColumnFamily limit to avoid memory or scan timeout issues
https://issues.apache.org/jira/browse/PIG-4663
PIG-4656  Improve String serialization and comparator performance in BinInterSedes
https://issues.apache.org/jira/browse/PIG-4656
PIG-4644  PORelationToExprProject.clone() is broken
https://issues.apache.org/jira/browse/PIG-4644
PIG-4598  Allow user defined plan optimizer rules
https://issues.apache.org/jira/browse/PIG-4598
PIG-4581  thread safe issue in NodeIdGenerator
https://issues.apache.org/jira/browse/PIG-4581
PIG-4539  New PigUnit
https://issues.apache.org/jira/browse/PIG-4539
PIG-4515  org.apache.pig.builtin.Distinct throws ClassCastException
https://issues.apache.org/jira/browse/PIG-4515
PIG-4468  Pig's jackson version conflicts with that of hadoop 2.6.0
https://issues.apache.org/jira/browse/PIG-4468
PIG-4455  Should use DependencyOrderWalker instead of DepthFirstWalker in MRPrinter
https://issues.apache.org/jira/browse/PIG-4455
PIG-4417  Pig's register command should support automatic fetching of jars from repo.
https://issues.apache.org/jira/browse/PIG-4417
PIG-4373  Implement PIG-3861 in Tez
https://issues.apache.org/jira/browse/PIG-4373
PIG-4341  Add CMX support to pig.tmpfilecompression.codec
https://issues.apache.org/jira/browse/PIG-4341
PIG-4323  PackageConverter hanging in Spark
https://issues.apache.org/jira/browse/PIG-4323
PIG-4313  StackOverflowError in LIMIT operation on Spark
https://issues.apache.org/jira/browse/PIG-4313
PIG-4251  Pig on Storm
https://issues.apache.org/jira/browse/PIG-4251
PIG-4111  Make Pig compile with avro-1.7.7
https://issues.apache.org/jira/browse/PIG-4111
PIG-4002  Disable combiner when map-side aggregation is used
https://issues.apache.org/jira/browse/PIG-4002
PIG-3952  PigStorage accepts '-tagSplit' to return full split information
https://issues.apache.org/jira/browse/PIG-3952
PIG-3911  Define unique fields with @OutputSchema
https://issues.apache.org/jira/browse/PIG-3911
PIG-3877  Getting Geo Latitude/Longitude from Address Lines
https://issues.apache.org/jira/browse/PIG-3877
PIG-3873  Geo distance calculation using Haversine
https://issues.apache.org/jira/browse/PIG-3873
PIG-3866  Create ThreadLocal classloader per PigContext
https://issues.apache.org/jira/browse/PIG-3866
PIG-3864  ToDate(userstring, format, timezone) computes DateTime with strange handling of Daylight Saving Time with location based timezones
https://issues.apache.org/jira/browse/PIG-3864
PIG-3851  Upgrade jline to 2.11
https://issues.apache.org/jira/browse/PIG-3851
PIG-3668  COR built-in function when at least one of the coefficient values is NaN
https://issues.apache.org/jira/browse/PIG-3668
PIG-3635  Fix e2e tests for Hadoop 2.X on Windows
https://issues.apache.org/jira/browse/PIG-3635
PIG-3587  add functionality for rolling over dates
https://issues.apache.org/jira/browse/PIG-3587

You may edit this subscription at:
https://issues.apache.org/jira/secure/FilterSubscription!default.jspa?subId=16328&filterId=12322384


[jira] [Updated] (PIG-4673) Built In UDF - REPLACE_MULTI : For a given string, search and replace all occurrences of search keys with replacement values.

2015-09-16 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-4673:

   Resolution: Fixed
 Hadoop Flags: Reviewed
Fix Version/s: (was: site)
   0.16.0
   Status: Resolved  (was: Patch Available)

Patch committed to trunk. Only committers can commit code. 

Thanks Murali!

> Built In UDF - REPLACE_MULTI : For a given string, search and replace all 
> occurrences of search keys with replacement values. 
> --
>
> Key: PIG-4673
> URL: https://issues.apache.org/jira/browse/PIG-4673
> Project: Pig
>  Issue Type: New Feature
>  Components: piggybank
>Affects Versions: site
>Reporter: Murali Rao
>Assignee: Murali Rao
>Priority: Minor
>  Labels: None
> Fix For: 0.16.0
>
> Attachments: PIG-4673-1.patch, replace_multi_udf.patch
>
>
> Let's say we have the string 'A1B2C3D4'. Our objective is to replace A with 1, 
> B with 2, C with 3 and D with 4 to derive the string '11223344'. 
> Using the existing REPLACE method: 
> REPLACE(REPLACE(REPLACE(REPLACE('A1B2C3D4','A','1'),'B','2'),'C','3'),'D','4')
>  
> With the proposed REPLACE_MULTI UDF:
> General syntax: 
> REPLACE_MULTI ( sourceString,  [  search1#replacement1, ... ] )
> REPLACE_MULTI ( 'A1B2C3D4',  [ 'A'#'1','B'#'2', 'C'#'3', 'D'#'4' ] )
> Advantages: 
>   1. Fewer function calls. 
>   2. Easier to code and more readable.
>   
> Let me know your thoughts/inputs on having this UDF in Piggy Bank. I will take 
> this up based on the feedback.
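
For illustration, a rough Java sketch of the behaviour the description asks for (this is not the piggybank code from the attached patch; the class name, exec signature details, and null handling here are assumptions):

{code:java}
import java.io.IOException;
import java.util.Map;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Sketch of a REPLACE_MULTI-style UDF: one pass over a map of
// searchKey -> replacement instead of nested REPLACE(...) calls.
public class ReplaceMultiSketch extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() < 2 || input.get(0) == null) {
            return null;
        }
        String source = (String) input.get(0);
        @SuppressWarnings("unchecked")
        Map<String, Object> replacements = (Map<String, Object>) input.get(1);
        if (replacements == null) {
            return source;
        }
        for (Map.Entry<String, Object> e : replacements.entrySet()) {
            // Literal (non-regex) replacement of every occurrence of the key.
            source = source.replace(e.getKey(), String.valueOf(e.getValue()));
        }
        return source;
    }
}
{code}

A call such as REPLACE_MULTI('A1B2C3D4', ['A'#'1','B'#'2','C'#'3','D'#'4']) from the description would then collapse the four nested REPLACE calls into a single UDF invocation.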



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4674) TOMAP should infer schema

2015-09-16 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-4674:

  Resolution: Fixed
Hadoop Flags: Reviewed
  Status: Resolved  (was: Patch Available)

Patch committed to trunk. Thanks Rohini for review!

> TOMAP should infer schema
> -
>
> Key: PIG-4674
> URL: https://issues.apache.org/jira/browse/PIG-4674
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.16.0
>
> Attachments: PIG-4674-1.patch, PIG-4674-2.patch, PIG-4674-3.patch
>
>
> The TOMAP output schema is currently a map without a value schema. The value 
> schema should be inferred when available.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (PIG-4679) Performance degradation due to InputSizeReducerEstimator since PIG-3754

2015-09-16 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14791485#comment-14791485
 ] 

Daniel Dai edited comment on PIG-4679 at 9/17/15 2:39 AM:
--

Patch committed to trunk. Thanks Rohini for review!


was (Author: daijy):
Patch committed to trunk. Thanks Thejas for review!

> Performance degradation due to InputSizeReducerEstimator since PIG-3754
> ---
>
> Key: PIG-4679
> URL: https://issues.apache.org/jira/browse/PIG-4679
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.16.0
>
> Attachments: PIG-4679-0.patch, PIG-4679-1.patch
>
>
> On encountering a non-HDFS location in the input (for example, a JOIN 
> involving both HBase tables and intermediate temp files), the Pig 0.14 
> ReducerEstimator returns the total input size as -1 (unknown), whereas 
> Pig 0.12.1 returned the sum of the temp file sizes as the total size. 
> Since -1 is returned as the input size, Pig ends up using only one reducer 
> for the job.
> STEPS TO REPRODUCE:
> 1. Create an HBase table with enough data, using the PerformanceEvaluation 
> tool to generate it:
> {code:java}
> hbase org.apache.hadoop.hbase.PerformanceEvaluation --presplit=20 
> --rows=100 sequentialWrite 10
> {code}
> 2. Dump the table data into a file which we can then use in a Pig JOIN. 
> The following Pig script generates the data file:
> {code:java}
> $ pig
> A = LOAD 'hbase://TestTable' USING 
> org.apache.pig.backend.hadoop.hbase.HBaseStorage('info:data', '-loadKey') AS 
> (row_key: chararray, data: chararray);
> STORE A INTO 'hdfs:///tmp/re_test/test_table_data' USING PigStorage('|');
> {code}
> 3. Check the file size to make sure that it is more than 1,000,000,000 bytes, 
> which is the default bytes-per-reducer Pig configuration:
> {code:java}
> $ hdfs dfs -count hdfs:///tmp/re_test/test_table_data
> QA:   1   411028000 
> hdfs:///tmp/re_test/test_table_data
> PROD: 1   571028000 
> hdfs:///tmp/re_test/test_table_data
> {code}
> 4. Run a Pig script that joins the HBase table with the data file. QA and 
> PROD will use different numbers of reducers: QA (176243) should run 1 reducer 
> and PROD (176258) should run 11 reducers (10,280,000,000 / 1,000,000,000).
> {code:java}
> $ pig
> A = LOAD 'hbase://TestTable' USING 
> org.apache.pig.backend.hadoop.hbase.HBaseStorage('info:data', '-loadKey') AS 
> (row_key: chararray, data: chararray);
> B = LOAD 'hdfs:///tmp/re_test/test_table_data' USING PigStorage('|') AS 
> (row_key: chararray, data: chararray);
> C = JOIN A BY row_key, B BY row_key;
> STORE C INTO 'hdfs:///tmp/re_test/test_table_data_join' USING PigStorage('|');
> {code}
> Pig 0.12.1 ran 11 reducers; Pig 0.13+ runs only 1 reducer.
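
As a back-of-the-envelope illustration of the estimator arithmetic above (this is not Pig's actual InputSizeReducerEstimator API; the constant and method names are made up for the sketch):

{code:java}
// Illustrative arithmetic only; names are not Pig's real API.
class ReducerEstimateSketch {
    static final long DEFAULT_BYTES_PER_REDUCER = 1000000000L;

    // Pre-PIG-3754 behaviour: sum what is measurable (e.g. temp files on HDFS)
    // and skip inputs whose size is unknown, instead of giving up with -1.
    static int estimateReducers(long[] inputSizes, int maxReducers) {
        long total = 0;
        for (long size : inputSizes) {
            if (size > 0) {          // skip unknown (-1) inputs such as HBase tables
                total += size;
            }
        }
        if (total <= 0) {
            return 1;                // nothing measurable: fall back to one reducer
        }
        int reducers = (int) Math.ceil((double) total / DEFAULT_BYTES_PER_REDUCER);
        return Math.min(reducers, maxReducers);
    }
}
{code}

With the roughly 10,280,000,000 measurable bytes quoted in step 4, this returns ceil(10.28) = 11 reducers, while treating the whole input as -1 collapses the estimate to 1.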



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4679) Performance degradation due to InputSizeReducerEstimator since PIG-3754

2015-09-16 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-4679:

  Resolution: Fixed
Hadoop Flags: Reviewed
  Status: Resolved  (was: Patch Available)

Patch committed to trunk. Thanks Thejas for review!

> Performance degradation due to InputSizeReducerEstimator since PIG-3754
> ---
>
> Key: PIG-4679
> URL: https://issues.apache.org/jira/browse/PIG-4679
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.16.0
>
> Attachments: PIG-4679-0.patch, PIG-4679-1.patch
>
>
> On encountering a non-HDFS location in the input (for example, a JOIN 
> involving both HBase tables and intermediate temp files), the Pig 0.14 
> ReducerEstimator returns the total input size as -1 (unknown), whereas 
> Pig 0.12.1 returned the sum of the temp file sizes as the total size. 
> Since -1 is returned as the input size, Pig ends up using only one reducer 
> for the job.
> STEPS TO REPRODUCE:
> 1. Create an HBase table with enough data, using the PerformanceEvaluation 
> tool to generate it:
> {code:java}
> hbase org.apache.hadoop.hbase.PerformanceEvaluation --presplit=20 
> --rows=100 sequentialWrite 10
> {code}
> 2. Dump the table data into a file which we can then use in a Pig JOIN. 
> The following Pig script generates the data file:
> {code:java}
> $ pig
> A = LOAD 'hbase://TestTable' USING 
> org.apache.pig.backend.hadoop.hbase.HBaseStorage('info:data', '-loadKey') AS 
> (row_key: chararray, data: chararray);
> STORE A INTO 'hdfs:///tmp/re_test/test_table_data' USING PigStorage('|');
> {code}
> 3. Check the file size to make sure that it is more than 1,000,000,000 bytes, 
> which is the default bytes-per-reducer Pig configuration:
> {code:java}
> $ hdfs dfs -count hdfs:///tmp/re_test/test_table_data
> QA:   1   411028000 
> hdfs:///tmp/re_test/test_table_data
> PROD: 1   571028000 
> hdfs:///tmp/re_test/test_table_data
> {code}
> 4. Run a Pig script that joins the HBase table with the data file. QA and 
> PROD will use different numbers of reducers: QA (176243) should run 1 reducer 
> and PROD (176258) should run 11 reducers (10,280,000,000 / 1,000,000,000).
> {code:java}
> $ pig
> A = LOAD 'hbase://TestTable' USING 
> org.apache.pig.backend.hadoop.hbase.HBaseStorage('info:data', '-loadKey') AS 
> (row_key: chararray, data: chararray);
> B = LOAD 'hdfs:///tmp/re_test/test_table_data' USING PigStorage('|') AS 
> (row_key: chararray, data: chararray);
> C = JOIN A BY row_key, B BY row_key;
> STORE C INTO 'hdfs:///tmp/re_test/test_table_data_join' USING PigStorage('|');
> {code}
> Pig 0.12.1 ran 11 reducers; Pig 0.13+ runs only 1 reducer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4676) Upgrade Hive to 1.2.1

2015-09-16 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14791481#comment-14791481
 ] 

Daniel Dai commented on PIG-4676:
-

PIG-4676-fixtest.patch committed.

> Upgrade Hive to 1.2.1
> -
>
> Key: PIG-4676
> URL: https://issues.apache.org/jira/browse/PIG-4676
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.16.0
>
> Attachments: PIG-4676-1.patch, PIG-4676-fixtest.patch
>
>
> Upgrade Hive dependency to the latest version.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4673) Built In UDF - REPLACE_MULTI : For a given string, search and replace all occurrences of search keys with replacement values.

2015-09-16 Thread Murali Rao (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14791312#comment-14791312
 ] 

Murali Rao commented on PIG-4673:
-

[~daijy]: Thanks for the review. While committing the code to SVN, I am getting 
the error below. Please let me know how to get write access to the repo.

org.apache.subversion.javahl.ClientException: svn: E170001: Commit failed
(details follow):

svn: E170001: MKACTIVITY of
'/repos/asf/!svn/act/0fb982d8-4f01-0010-a887-1302968552fb': 403 Forbidden (
http://svn.apache.org)

> Built In UDF - REPLACE_MULTI : For a given string, search and replace all 
> occurrences of search keys with replacement values. 
> --
>
> Key: PIG-4673
> URL: https://issues.apache.org/jira/browse/PIG-4673
> Project: Pig
>  Issue Type: New Feature
>  Components: piggybank
>Affects Versions: site
>Reporter: Murali Rao
>Assignee: Murali Rao
>Priority: Minor
>  Labels: None
> Fix For: site
>
> Attachments: PIG-4673-1.patch, replace_multi_udf.patch
>
>
> Let's say we have the string 'A1B2C3D4'. Our objective is to replace A with 1, 
> B with 2, C with 3 and D with 4 to derive the string '11223344'. 
> Using the existing REPLACE method: 
> REPLACE(REPLACE(REPLACE(REPLACE('A1B2C3D4','A','1'),'B','2'),'C','3'),'D','4')
>  
> With the proposed REPLACE_MULTI UDF:
> General syntax: 
> REPLACE_MULTI ( sourceString,  [  search1#replacement1, ... ] )
> REPLACE_MULTI ( 'A1B2C3D4',  [ 'A'#'1','B'#'2', 'C'#'3', 'D'#'4' ] )
> Advantages: 
>   1. Fewer function calls. 
>   2. Easier to code and more readable.
>   
> Let me know your thoughts/inputs on having this UDF in Piggy Bank. I will take 
> this up based on the feedback.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-3251) Bzip2TextInputFormat requires double the memory of maximum record size

2015-09-16 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14791291#comment-14791291
 ] 

Rohini Palaniswamy commented on PIG-3251:
-

Looks good so far. Will wait for the unit test investigation. Just one 
suggestion for a small change to improve readability: change

public static final String PIG_BZIPINPUT_USEHADOOPS = 
"pig.bzipinput.usehadoops";

to

public static final String PIG_BZIP_USE_HADOOP_INPUTFORMAT = 
"pig.bzip.use.hadoop.inputformat";




> Bzip2TextInputFormat requires double the memory of maximum record size
> --
>
> Key: PIG-3251
> URL: https://issues.apache.org/jira/browse/PIG-3251
> Project: Pig
>  Issue Type: Improvement
>Reporter: Koji Noguchi
>Assignee: Koji Noguchi
>Priority: Minor
> Fix For: 0.16.0
>
> Attachments: pig-3251-trunk-v01.patch, pig-3251-trunk-v02.patch, 
> pig-3251-trunk-v03.patch, pig-3251-trunk-v04.patch, pig-3251-trunk-v05.patch, 
> pig-3251-trunk-v06.patch
>
>
> While looking at a user's OOM heap dump, I noticed that Pig's 
> Bzip2TextInputFormat consumes memory in both
> Bzip2TextInputFormat.buffer (a ByteArrayOutputStream) 
> and the actual Text that is returned as the line.
> For example, for one record of 160 MB, the buffer was 268 MB and 
> the Text was 160 MB.  
> We can probably eliminate one of them.
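
As an illustration of the doubling the description refers to, here is a minimal sketch assuming the flow it implies (record bytes accumulated in a ByteArrayOutputStream, then copied into the returned Text); it is not the actual Bzip2TextInputFormat code:

{code:java}
import java.io.ByteArrayOutputStream;

import org.apache.hadoop.io.Text;

// The record's bytes end up held twice: once in the ByteArrayOutputStream's
// internal buffer (which grows by doubling, hence ~268 MB = 256 MiB for a
// 160 MB record) and once in the Text object after the copy below.
class DoubleBufferingSketch {
    static Text toLine(ByteArrayOutputStream buffer) {
        byte[] bytes = buffer.toByteArray();   // copies the bytes out of the buffer
        Text line = new Text();
        line.set(bytes, 0, bytes.length);      // copies them again into the Text
        return line;                           // the buffer still holds its own copy
    }
}
{code}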



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4677) Display failure information on stop on failure

2015-09-16 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14791270#comment-14791270
 ] 

Rohini Palaniswamy commented on PIG-4677:
-

[~mitdesai],
   checkStopOnFailure currently throws an exception, which makes it exit early. 
So it ends up skipping cleanupOnFailure, printing the "Job Stats" information, 
etc., i.e. all code starting from 
MRScriptState.get().emitProgressUpdatedNotification(100);. 

  So instead of throwing an exception, exiting early, and relying on the system 
shutdown hook to kill the remaining jobs, you will have to call 
failJob(message) on the jobs in JobControl.getReadyJobsList() and 
JobControl.getRunningJobList() in checkStopOnFailure. That will kill all jobs 
and then follow the regular code path.
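
A minimal sketch of that suggestion, using the Hadoop JobControl/ControlledJob API (ControlledJob.failJob(String), JobControl.getReadyJobsList(), JobControl.getRunningJobList()); the method shape is hypothetical and not the actual MapReduceLauncher code:

{code:java}
import java.io.IOException;
import java.util.List;

import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

// Instead of throwing and exiting early, fail the remaining jobs so the
// launcher falls through to its normal cleanup and "Job Stats" reporting path.
class StopOnFailureSketch {
    static void checkStopOnFailure(JobControl jc, boolean stopOnFailure, int failedJobs)
            throws IOException, InterruptedException {
        if (!stopOnFailure || failedJobs == 0) {
            return;
        }
        String message = "Stopping execution on job failure (-stop_on_failure)";
        List<ControlledJob> ready = jc.getReadyJobsList();
        for (ControlledJob job : ready) {
            job.failJob(message);      // kill jobs that have not started yet
        }
        for (ControlledJob job : jc.getRunningJobList()) {
            job.failJob(message);      // kill jobs that are already running
        }
        // No exception thrown: the regular code path (cleanupOnFailure, stats,
        // progress notifications) still runs after this returns.
    }
}
{code}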

> Display failure information on stop on failure
> --
>
> Key: PIG-4677
> URL: https://issues.apache.org/jira/browse/PIG-4677
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.11.1
>Reporter: Mit Desai
>Assignee: Mit Desai
> Attachments: PIG-4677.patch
>
>
> When the stop-on-failure option is specified, Pig abruptly exits without 
> displaying the job stats or failed-job information that it usually shows in 
> case of failures.
> {code}
> 2015-06-04 20:35:38,170 [uber-SubtaskRunner] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>   - 9% complete
> 2015-06-04 20:35:38,171 [uber-SubtaskRunner] INFO  
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher
>   - Running jobs are 
> [job_1428329756093_3741748,job_1428329756093_3741752,job_1428329756093_3741753,job_1428329756093_3741754,job_1428329756093_3741756]
> 2015-06-04 20:35:40,201 [uber-SubtaskRunner] ERROR 
> org.apache.pig.tools.grunt.Grunt  - ERROR 6017: Job failed!
> Hadoop Job IDs executed by Pig: 
> job_1428329756093_3739816,job_1428329756093_3741752,job_1428329756093_3739814,job_1428329756093_3741748,job_1428329756093_3741756,job_1428329756093_3741753,job_1428329756093_3741754
> <<< Invocation of Main class completed <<<
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4674) TOMAP should infer schema

2015-09-16 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14791258#comment-14791258
 ] 

Rohini Palaniswamy commented on PIG-4674:
-

+1

> TOMAP should infer schema
> -
>
> Key: PIG-4674
> URL: https://issues.apache.org/jira/browse/PIG-4674
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.16.0
>
> Attachments: PIG-4674-1.patch, PIG-4674-2.patch, PIG-4674-3.patch
>
>
> The TOMAP output schema is currently a map without a value schema. The value 
> schema should be inferred when available.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4674) TOMAP should infer schema

2015-09-16 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-4674:

Attachment: PIG-4674-3.patch

Yes, there is a hole. Applied the suggested change.

> TOMAP should infer schema
> -
>
> Key: PIG-4674
> URL: https://issues.apache.org/jira/browse/PIG-4674
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.16.0
>
> Attachments: PIG-4674-1.patch, PIG-4674-2.patch, PIG-4674-3.patch
>
>
> The TOMAP output schema is currently a map without a value schema. The value 
> schema should be inferred when available.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4536) LIMIT and DISTINCT inside nested foreach should have combiner optimization

2015-09-16 Thread Rohini Palaniswamy (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rohini Palaniswamy updated PIG-4536:

Summary: LIMIT and DISTINCT inside nested foreach should have combiner 
optimization  (was: LIMIT inside nested foreach should have combiner 
optimization)

> LIMIT and DISTINCT inside nested foreach should have combiner optimization
> --
>
> Key: PIG-4536
> URL: https://issues.apache.org/jira/browse/PIG-4536
> Project: Pig
>  Issue Type: Improvement
>Reporter: Rohini Palaniswamy
>  Labels: Performance
>
> {code}
> data_group = GROUP A BY (f1, f2) PARALLEL 100;
> group_result = FOREACH data_group {
> B = LIMIT A.f3 1;
> GENERATE group,  SUM(A.f3), SUM(A.f4), SUM(A.f5), SUM(A.f6),FLATTEN(B);
> };
> {code}
> A script like this has combiner optimization turned off, so it consumes a lot 
> of memory and is slow. We should implement LIMIT using the combiner in cases 
> like this.
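
For intuition, a plain-Java sketch of why LIMIT is combiner-friendly (this deliberately ignores Pig's actual Algebraic/combiner plumbing; the class and method names are illustrative):

{code:java}
import java.util.ArrayList;
import java.util.List;

// LIMIT n can be applied to each partial bag on the map/combine side and again
// after merging, because taking the first n of pre-truncated partials gives the
// same size result as taking the first n of the full bag.
class PartialLimitSketch<T> {
    private final int n;

    PartialLimitSketch(int n) { this.n = n; }

    // Map/combine side: keep at most n elements of a partial bag.
    List<T> partial(List<T> bag) {
        return new ArrayList<T>(bag.subList(0, Math.min(n, bag.size())));
    }

    // Reduce side: merge the pre-truncated partials and apply the final LIMIT.
    List<T> merge(List<List<T>> partials) {
        List<T> out = new ArrayList<T>();
        for (List<T> p : partials) {
            for (T t : p) {
                if (out.size() == n) {
                    return out;
                }
                out.add(t);
            }
        }
        return out;
    }
}
{code}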



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-3251) Bzip2TextInputFormat requires double the memory of maximum record size

2015-09-16 Thread Koji Noguchi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Noguchi updated PIG-3251:
--
Attachment: pig-3251-trunk-v06.patch

Updated the patch (pig-3251-trunk-v06.patch) to:
* Use Hadoop's TextInputFormat only for 0.23/2.X.
* Make it configurable, turning it *on* by default. (Need a good config name.)
* Move the update of this fake codec with extension "bz" to PigServer.java. 
(Probably the wrong place, but not sure where else to put it.)
* Update TestBZip so that it runs the tests twice for 0.23/2.X, with the 
option turned on and off.

Then I realized that of the two failures I previously reported on this JIRA, 
testBlockHeaderEndingWithCR and testBlockHeaderEndingAtSplitNotByteAligned, 
only the former is fixed by MAPREDUCE-5656, not the latter. 
Looking into testBlockHeaderEndingAtSplitNotByteAligned.
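
A minimal sketch of the toggle being described, using the property name from the current patch ("pig.bzipinput.usehadoops", still under discussion) and an illustrative chooser method that is not the actual patch code:

{code:java}
import org.apache.hadoop.conf.Configuration;

// On Hadoop 0.23/2.x, prefer Hadoop's own text input format for bzip2 input
// unless the property explicitly turns that behaviour off.
class BzipInputFormatChooser {
    // Property name as in pig-3251-trunk-v06.patch; the final name may change.
    static final String PIG_BZIPINPUT_USEHADOOPS = "pig.bzipinput.usehadoops";

    static String chooseTextInputFormat(Configuration conf, boolean isHadoop2OrLater) {
        // Turned *on* by default, matching the behaviour described above.
        boolean useHadoops = conf.getBoolean(PIG_BZIPINPUT_USEHADOOPS, true);
        if (isHadoop2OrLater && useHadoops) {
            return "org.apache.hadoop.mapreduce.lib.input.TextInputFormat";
        }
        // Otherwise keep using Pig's own bzip2-aware text input format.
        return "Bzip2TextInputFormat"; // placeholder for Pig's class
    }
}
{code}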



> Bzip2TextInputFormat requires double the memory of maximum record size
> --
>
> Key: PIG-3251
> URL: https://issues.apache.org/jira/browse/PIG-3251
> Project: Pig
>  Issue Type: Improvement
>Reporter: Koji Noguchi
>Assignee: Koji Noguchi
>Priority: Minor
> Fix For: 0.16.0
>
> Attachments: pig-3251-trunk-v01.patch, pig-3251-trunk-v02.patch, 
> pig-3251-trunk-v03.patch, pig-3251-trunk-v04.patch, pig-3251-trunk-v05.patch, 
> pig-3251-trunk-v06.patch
>
>
> While looking at a user's OOM heap dump, I noticed that Pig's 
> Bzip2TextInputFormat consumes memory in both
> Bzip2TextInputFormat.buffer (a ByteArrayOutputStream) 
> and the actual Text that is returned as the line.
> For example, for one record of 160 MB, the buffer was 268 MB and 
> the Text was 160 MB.  
> We can probably eliminate one of them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4679) Performance degradation due to InputSizeReducerEstimator since PIG-3754

2015-09-16 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14791110#comment-14791110
 ] 

Rohini Palaniswamy commented on PIG-4679:
-

+1

> Performance degradation due to InputSizeReducerEstimator since PIG-3754
> ---
>
> Key: PIG-4679
> URL: https://issues.apache.org/jira/browse/PIG-4679
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.16.0
>
> Attachments: PIG-4679-0.patch, PIG-4679-1.patch
>
>
> On encountering a non-HDFS location in the input (for example, a JOIN 
> involving both HBase tables and intermediate temp files), the Pig 0.14 
> ReducerEstimator returns the total input size as -1 (unknown), whereas 
> Pig 0.12.1 returned the sum of the temp file sizes as the total size. 
> Since -1 is returned as the input size, Pig ends up using only one reducer 
> for the job.
> STEPS TO REPRODUCE:
> 1. Create an HBase table with enough data, using the PerformanceEvaluation 
> tool to generate it:
> {code:java}
> hbase org.apache.hadoop.hbase.PerformanceEvaluation --presplit=20 
> --rows=100 sequentialWrite 10
> {code}
> 2. Dump the table data into a file which we can then use in a Pig JOIN. 
> The following Pig script generates the data file:
> {code:java}
> $ pig
> A = LOAD 'hbase://TestTable' USING 
> org.apache.pig.backend.hadoop.hbase.HBaseStorage('info:data', '-loadKey') AS 
> (row_key: chararray, data: chararray);
> STORE A INTO 'hdfs:///tmp/re_test/test_table_data' USING PigStorage('|');
> {code}
> 3. Check the file size to make sure that it is more than 1,000,000,000 bytes, 
> which is the default bytes-per-reducer Pig configuration:
> {code:java}
> $ hdfs dfs -count hdfs:///tmp/re_test/test_table_data
> QA:   1   411028000 
> hdfs:///tmp/re_test/test_table_data
> PROD: 1   571028000 
> hdfs:///tmp/re_test/test_table_data
> {code}
> 4. Run a Pig script that joins the HBase table with the data file. QA and 
> PROD will use different numbers of reducers: QA (176243) should run 1 reducer 
> and PROD (176258) should run 11 reducers (10,280,000,000 / 1,000,000,000).
> {code:java}
> $ pig
> A = LOAD 'hbase://TestTable' USING 
> org.apache.pig.backend.hadoop.hbase.HBaseStorage('info:data', '-loadKey') AS 
> (row_key: chararray, data: chararray);
> B = LOAD 'hdfs:///tmp/re_test/test_table_data' USING PigStorage('|') AS 
> (row_key: chararray, data: chararray);
> C = JOIN A BY row_key, B BY row_key;
> STORE C INTO 'hdfs:///tmp/re_test/test_table_data_join' USING PigStorage('|');
> {code}
> Pig 0.12.1 ran 11 reducers; Pig 0.13+ runs only 1 reducer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4674) TOMAP should infer schema

2015-09-16 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14791105#comment-14791105
 ] 

Rohini Palaniswamy commented on PIG-4674:
-

{code}
+byte valueType = DataType.BYTEARRAY;
.
+if (valueType == DataType.BYTEARRAY || valueType == 
input.getFields().get(i).type) {
+valueType = input.getFields().get(i).type;
+} else {
+valueType = DataType.BYTEARRAY;
+break;
+}
{code}

should be

{code}
+Byte valueType = null;
+if (valueType == null) {
+valueType = input.getFields().get(i).type;
+} else if (valueType != input.getFields().get(i).type) {
+valueType = DataType.BYTEARRAY;
+break;
+}
{code}

Without that, if we had a0, a1, a2, a3 with a1 as bytearray and a3 as int, it 
would become map[int] instead of map[bytearray]. 
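
A minimal sketch of that inference rule, using Pig's DataType constants and Schema API; the method shape is illustrative and the outputSchema plumbing around it is omitted:

{code:java}
import org.apache.pig.data.DataType;
import org.apache.pig.impl.logicalLayer.FrontendException;
import org.apache.pig.impl.logicalLayer.schema.Schema;

// Walk the value fields of the TOMAP arguments (odd positions), keep their
// common type, and fall back to bytearray as soon as two value fields disagree.
class ToMapValueTypeSketch {
    static byte inferValueType(Schema input) throws FrontendException {
        Byte valueType = null;
        for (int i = 1; i < input.size(); i += 2) {    // values sit at odd indexes
            byte t = input.getField(i).type;
            if (valueType == null) {
                valueType = t;                          // first value field seen
            } else if (valueType != t) {
                valueType = DataType.BYTEARRAY;         // mixed types -> bytearray
                break;
            }
        }
        return valueType == null ? DataType.BYTEARRAY : valueType;
    }
}
{code}

With the a1/a3 example above, this yields map[bytearray] rather than map[int].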

> TOMAP should infer schema
> -
>
> Key: PIG-4674
> URL: https://issues.apache.org/jira/browse/PIG-4674
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.16.0
>
> Attachments: PIG-4674-1.patch, PIG-4674-2.patch
>
>
> The TOMAP output schema is currently a map without a value schema. The value 
> schema should be inferred when available.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4676) Upgrade Hive to 1.2.1

2015-09-16 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14790982#comment-14790982
 ] 

Rohini Palaniswamy commented on PIG-4676:
-

+1 for 
https://issues.apache.org/jira/secure/attachment/12756297/PIG-4676-fixtest.patch

> Upgrade Hive to 1.2.1
> -
>
> Key: PIG-4676
> URL: https://issues.apache.org/jira/browse/PIG-4676
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.16.0
>
> Attachments: PIG-4676-1.patch, PIG-4676-fixtest.patch
>
>
> Upgrade Hive dependency to the latest version.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4673) Built In UDF - REPLACE_MULTI : For a given string, search and replace all occurrences of search keys with replacement values.

2015-09-16 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-4673:

Attachment: PIG-4673-1.patch

Moved the code to piggybank and adjusted the format. Otherwise looks good. Will 
check in shortly.

> Built In UDF - REPLACE_MULTI : For a given string, search and replace all 
> occurrences of search keys with replacement values. 
> --
>
> Key: PIG-4673
> URL: https://issues.apache.org/jira/browse/PIG-4673
> Project: Pig
>  Issue Type: New Feature
>  Components: piggybank
>Affects Versions: site
>Reporter: Murali Rao
>Assignee: Murali Rao
>Priority: Minor
>  Labels: None
> Fix For: site
>
> Attachments: PIG-4673-1.patch, replace_multi_udf.patch
>
>
> Let's say we have the string 'A1B2C3D4'. Our objective is to replace A with 1, 
> B with 2, C with 3 and D with 4 to derive the string '11223344'. 
> Using the existing REPLACE method: 
> REPLACE(REPLACE(REPLACE(REPLACE('A1B2C3D4','A','1'),'B','2'),'C','3'),'D','4')
>  
> With the proposed REPLACE_MULTI UDF:
> General syntax: 
> REPLACE_MULTI ( sourceString,  [  search1#replacement1, ... ] )
> REPLACE_MULTI ( 'A1B2C3D4',  [ 'A'#'1','B'#'2', 'C'#'3', 'D'#'4' ] )
> Advantages: 
>   1. Fewer function calls. 
>   2. Easier to code and more readable.
>   
> Let me know your thoughts/inputs on having this UDF in Piggy Bank. I will take 
> this up based on the feedback.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4679) Performance degradation due to InputSizeReducerEstimator since PIG-3754

2015-09-16 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-4679:

Status: Patch Available  (was: Open)

> Performance degradation due to InputSizeReducerEstimator since PIG-3754
> ---
>
> Key: PIG-4679
> URL: https://issues.apache.org/jira/browse/PIG-4679
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.16.0
>
> Attachments: PIG-4679-0.patch, PIG-4679-1.patch
>
>
> On encountering a non-HDFS location in the input (for example, a JOIN 
> involving both HBase tables and intermediate temp files), the Pig 0.14 
> ReducerEstimator returns the total input size as -1 (unknown), whereas 
> Pig 0.12.1 returned the sum of the temp file sizes as the total size. 
> Since -1 is returned as the input size, Pig ends up using only one reducer 
> for the job.
> STEPS TO REPRODUCE:
> 1. Create an HBase table with enough data, using the PerformanceEvaluation 
> tool to generate it:
> {code:java}
> hbase org.apache.hadoop.hbase.PerformanceEvaluation --presplit=20 
> --rows=100 sequentialWrite 10
> {code}
> 2. Dump the table data into a file which we can then use in a Pig JOIN. 
> The following Pig script generates the data file:
> {code:java}
> $ pig
> A = LOAD 'hbase://TestTable' USING 
> org.apache.pig.backend.hadoop.hbase.HBaseStorage('info:data', '-loadKey') AS 
> (row_key: chararray, data: chararray);
> STORE A INTO 'hdfs:///tmp/re_test/test_table_data' USING PigStorage('|');
> {code}
> 3. Check the file size to make sure that it is more than 1,000,000,000 bytes, 
> which is the default bytes-per-reducer Pig configuration:
> {code:java}
> $ hdfs dfs -count hdfs:///tmp/re_test/test_table_data
> QA:   1   411028000 
> hdfs:///tmp/re_test/test_table_data
> PROD: 1   571028000 
> hdfs:///tmp/re_test/test_table_data
> {code}
> 4. Run a Pig script that joins the HBase table with the data file. QA and 
> PROD will use different numbers of reducers: QA (176243) should run 1 reducer 
> and PROD (176258) should run 11 reducers (10,280,000,000 / 1,000,000,000).
> {code:java}
> $ pig
> A = LOAD 'hbase://TestTable' USING 
> org.apache.pig.backend.hadoop.hbase.HBaseStorage('info:data', '-loadKey') AS 
> (row_key: chararray, data: chararray);
> B = LOAD 'hdfs:///tmp/re_test/test_table_data' USING PigStorage('|') AS 
> (row_key: chararray, data: chararray);
> C = JOIN A BY row_key, B BY row_key;
> STORE C INTO 'hdfs:///tmp/re_test/test_table_data_join' USING PigStorage('|');
> {code}
> Pig 0.12.1 ran 11 reducers; Pig 0.13+ runs only 1 reducer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4679) Performance degradation due to InputSizeReducerEstimator since PIG-3754

2015-09-16 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-4679:

Attachment: PIG-4679-1.patch

> Performance degradation due to InputSizeReducerEstimator since PIG-3754
> ---
>
> Key: PIG-4679
> URL: https://issues.apache.org/jira/browse/PIG-4679
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.16.0
>
> Attachments: PIG-4679-0.patch, PIG-4679-1.patch
>
>
> On encountering a non-HDFS location in the input (for example, a JOIN 
> involving both HBase tables and intermediate temp files), the Pig 0.14 
> ReducerEstimator returns the total input size as -1 (unknown), whereas 
> Pig 0.12.1 returned the sum of the temp file sizes as the total size. 
> Since -1 is returned as the input size, Pig ends up using only one reducer 
> for the job.
> STEPS TO REPRODUCE:
> 1. Create an HBase table with enough data, using the PerformanceEvaluation 
> tool to generate it:
> {code:java}
> hbase org.apache.hadoop.hbase.PerformanceEvaluation --presplit=20 
> --rows=100 sequentialWrite 10
> {code}
> 2. Dump the table data into a file which we can then use in a Pig JOIN. 
> The following Pig script generates the data file:
> {code:java}
> $ pig
> A = LOAD 'hbase://TestTable' USING 
> org.apache.pig.backend.hadoop.hbase.HBaseStorage('info:data', '-loadKey') AS 
> (row_key: chararray, data: chararray);
> STORE A INTO 'hdfs:///tmp/re_test/test_table_data' USING PigStorage('|');
> {code}
> 3. Check the file size to make sure that it is more than 1,000,000,000 bytes, 
> which is the default bytes-per-reducer Pig configuration:
> {code:java}
> $ hdfs dfs -count hdfs:///tmp/re_test/test_table_data
> QA:   1   411028000 
> hdfs:///tmp/re_test/test_table_data
> PROD: 1   571028000 
> hdfs:///tmp/re_test/test_table_data
> {code}
> 4. Run a Pig script that joins the HBase table with the data file. QA and 
> PROD will use different numbers of reducers: QA (176243) should run 1 reducer 
> and PROD (176258) should run 11 reducers (10,280,000,000 / 1,000,000,000).
> {code:java}
> $ pig
> A = LOAD 'hbase://TestTable' USING 
> org.apache.pig.backend.hadoop.hbase.HBaseStorage('info:data', '-loadKey') AS 
> (row_key: chararray, data: chararray);
> B = LOAD 'hdfs:///tmp/re_test/test_table_data' USING PigStorage('|') AS 
> (row_key: chararray, data: chararray);
> C = JOIN A BY row_key, B BY row_key;
> STORE C INTO 'hdfs:///tmp/re_test/test_table_data_join' USING PigStorage('|');
> {code}
> Pig 0.12.1 ran 11 reducers; Pig 0.13+ runs only 1 reducer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4673) Built In UDF - REPLACE_MULTI : For a given string, search and replace all occurrences of search keys with replacement values.

2015-09-16 Thread Murali Rao (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14790765#comment-14790765
 ] 

Murali Rao commented on PIG-4673:
-

[~daijy]: Attached the patch; please check and let me know your inputs.

> Built In UDF - REPLACE_MULTI : For a given string, search and replace all 
> occurrences of search keys with replacement values. 
> --
>
> Key: PIG-4673
> URL: https://issues.apache.org/jira/browse/PIG-4673
> Project: Pig
>  Issue Type: New Feature
>  Components: piggybank
>Affects Versions: site
>Reporter: Murali Rao
>Assignee: Murali Rao
>Priority: Minor
>  Labels: None
> Fix For: site
>
> Attachments: replace_multi_udf.patch
>
>
> Let's say we have the string 'A1B2C3D4'. Our objective is to replace A with 1, 
> B with 2, C with 3 and D with 4 to derive the string '11223344'. 
> Using the existing REPLACE method: 
> REPLACE(REPLACE(REPLACE(REPLACE('A1B2C3D4','A','1'),'B','2'),'C','3'),'D','4')
>  
> With the proposed REPLACE_MULTI UDF:
> General syntax: 
> REPLACE_MULTI ( sourceString,  [  search1#replacement1, ... ] )
> REPLACE_MULTI ( 'A1B2C3D4',  [ 'A'#'1','B'#'2', 'C'#'3', 'D'#'4' ] )
> Advantages: 
>   1. Fewer function calls. 
>   2. Easier to code and more readable.
>   
> Let me know your thoughts/inputs on having this UDF in Piggy Bank. I will take 
> this up based on the feedback.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4673) Built In UDF - REPLACE_MULTI : For a given string, search and replace all occurrences of search keys with replacement values.

2015-09-16 Thread Murali Rao (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Murali Rao updated PIG-4673:

Attachment: replace_multi_udf.patch

Attaching the patch for the REPLACE_MULTI UDF. The patch contains the UDF and a test file.

> Built In UDF - REPLACE_MULTI : For a given string, search and replace all 
> occurrences of search keys with replacement values. 
> --
>
> Key: PIG-4673
> URL: https://issues.apache.org/jira/browse/PIG-4673
> Project: Pig
>  Issue Type: New Feature
>  Components: piggybank
>Affects Versions: site
>Reporter: Murali Rao
>Assignee: Murali Rao
>Priority: Minor
>  Labels: None
> Fix For: site
>
> Attachments: replace_multi_udf.patch
>
>
> Let's say we have the string 'A1B2C3D4'. Our objective is to replace A with 1, 
> B with 2, C with 3 and D with 4 to derive the string '11223344'. 
> Using the existing REPLACE method: 
> REPLACE(REPLACE(REPLACE(REPLACE('A1B2C3D4','A','1'),'B','2'),'C','3'),'D','4')
>  
> With the proposed REPLACE_MULTI UDF:
> General syntax: 
> REPLACE_MULTI ( sourceString,  [  search1#replacement1, ... ] )
> REPLACE_MULTI ( 'A1B2C3D4',  [ 'A'#'1','B'#'2', 'C'#'3', 'D'#'4' ] )
> Advantages: 
>   1. Fewer function calls. 
>   2. Easier to code and more readable.
>   
> Let me know your thoughts/inputs on having this UDF in Piggy Bank. I will take 
> this up based on the feedback.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4676) Upgrade Hive to 1.2.1

2015-09-16 Thread Daniel Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Dai updated PIG-4676:

Attachment: PIG-4676-fixtest.patch

TestLoaderStorerShipCacheFiles is broken by the patch. Attaching a fix.

> Upgrade Hive to 1.2.1
> -
>
> Key: PIG-4676
> URL: https://issues.apache.org/jira/browse/PIG-4676
> Project: Pig
>  Issue Type: Improvement
>  Components: impl
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.16.0
>
> Attachments: PIG-4676-1.patch, PIG-4676-fixtest.patch
>
>
> Upgrade Hive dependency to the latest version.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-4673) Built In UDF - REPLACE_MULTI : For a given string, search and replace all occurrences of search keys with replacement values.

2015-09-16 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14790701#comment-14790701
 ] 

Daniel Dai commented on PIG-4673:
-

I cannot see the patch, can you attach to the Jira?

> Built In UDF - REPLACE_MULTI : For a given string, search and replace all 
> occurrences of search keys with replacement values. 
> --
>
> Key: PIG-4673
> URL: https://issues.apache.org/jira/browse/PIG-4673
> Project: Pig
>  Issue Type: New Feature
>  Components: piggybank
>Affects Versions: site
>Reporter: Murali Rao
>Assignee: Murali Rao
>Priority: Minor
>  Labels: None
> Fix For: site
>
>
> Let's say we have the string 'A1B2C3D4'. Our objective is to replace A with 1, 
> B with 2, C with 3 and D with 4 to derive the string '11223344'. 
> Using the existing REPLACE method: 
> REPLACE(REPLACE(REPLACE(REPLACE('A1B2C3D4','A','1'),'B','2'),'C','3'),'D','4')
>  
> With the proposed REPLACE_MULTI UDF:
> General syntax: 
> REPLACE_MULTI ( sourceString,  [  search1#replacement1, ... ] )
> REPLACE_MULTI ( 'A1B2C3D4',  [ 'A'#'1','B'#'2', 'C'#'3', 'D'#'4' ] )
> Advantages: 
>   1. Fewer function calls. 
>   2. Easier to code and more readable.
>   
> Let me know your thoughts/inputs on having this UDF in Piggy Bank. I will take 
> this up based on the feedback.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Jenkins build became unstable: Pig-trunk-commit #2234

2015-09-16 Thread Apache Jenkins Server
See 



[jira] [Commented] (PIG-4667) Enable Pig on Spark to run on Yarn Client mode

2015-09-16 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14781849#comment-14781849
 ] 

Xuefu Zhang commented on PIG-4667:
--

+1

> Enable Pig on Spark to run on Yarn Client mode
> --
>
> Key: PIG-4667
> URL: https://issues.apache.org/jira/browse/PIG-4667
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Srikanth Sundarrajan
>Assignee: Srikanth Sundarrajan
> Fix For: spark-branch
>
> Attachments: PIG-4667-logs.tgz, PIG-4667-v1.patch, PIG-4667.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Review Request 38352: Enable Pig on Spark to run on Yarn Client mode

2015-09-16 Thread Xuefu Zhang

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/38352/#review99223
---

Ship it!


Ship It!

- Xuefu Zhang


On Sept. 16, 2015, 12:06 p.m., Srikanth Sundarrajan wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/38352/
> ---
> 
> (Updated Sept. 16, 2015, 12:06 p.m.)
> 
> 
> Review request for pig.
> 
> 
> Bugs: PIG-4667
> https://issues.apache.org/jira/browse/PIG-4667
> 
> 
> Repository: pig-git
> 
> 
> Description
> ---
> 
> Enable Pig on Spark to run on Yarn Client mode
> 
> 
> Diffs
> -
> 
>   bin/pig 15341d1 
>   build.xml b17b0e1 
>   ivy.xml 2ebebdc 
>   ivy/libraries.properties 4d1f61e 
>   src/docs/src/documentation/content/xdocs/start.xml 97d3a4d 
> 
> Diff: https://reviews.apache.org/r/38352/diff/
> 
> 
> Testing
> ---
> 
> Script used for testing
> A = LOAD '/tmp/x' USING PigStorage('\t') AS (line);
> STORE A INTO '/tmp/y' USING PigStorage(',');
> 
> Used the following environment setting before launching the script.
> declare -x HADOOP_CONF_DIR="/opt/hadoop-2.6.0.2.2.0.0-2041/etc/hadoop/"
> declare -x HADOOP_HOME="/opt/hadoop-2.6.0.2.2.0.0-2041/"
> declare -x SPARK_MASTER="yarn-client"
> 
> 
> Thanks,
> 
> Srikanth Sundarrajan
> 
>



[jira] [Updated] (PIG-4667) Enable Pig on Spark to run on Yarn Client mode

2015-09-16 Thread Srikanth Sundarrajan (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Srikanth Sundarrajan updated PIG-4667:
--
Attachment: PIG-4667-v1.patch

Hi [~xuefuz], I have attached a revised patch which removes the changes to the 
Kryo and Guava lib versions. I have verified that yarn-client works with these 
changes.

> Enable Pig on Spark to run on Yarn Client mode
> --
>
> Key: PIG-4667
> URL: https://issues.apache.org/jira/browse/PIG-4667
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Srikanth Sundarrajan
>Assignee: Srikanth Sundarrajan
> Fix For: spark-branch
>
> Attachments: PIG-4667-logs.tgz, PIG-4667-v1.patch, PIG-4667.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Review Request 38352: Enable Pig on Spark to run on Yarn Client mode

2015-09-16 Thread Srikanth Sundarrajan

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/38352/
---

(Updated Sept. 16, 2015, 12:06 p.m.)


Review request for pig.


Bugs: PIG-4667
https://issues.apache.org/jira/browse/PIG-4667


Repository: pig-git


Description
---

Enable Pig on Spark to run on Yarn Client mode


Diffs (updated)
-

  bin/pig 15341d1 
  build.xml b17b0e1 
  ivy.xml 2ebebdc 
  ivy/libraries.properties 4d1f61e 
  src/docs/src/documentation/content/xdocs/start.xml 97d3a4d 

Diff: https://reviews.apache.org/r/38352/diff/


Testing
---

Script used for testing
A = LOAD '/tmp/x' USING PigStorage('\t') AS (line);
STORE A INTO '/tmp/y' USING PigStorage(',');

Used the following environment setting before launching the script.
declare -x HADOOP_CONF_DIR="/opt/hadoop-2.6.0.2.2.0.0-2041/etc/hadoop/"
declare -x HADOOP_HOME="/opt/hadoop-2.6.0.2.2.0.0-2041/"
declare -x SPARK_MASTER="yarn-client"


Thanks,

Srikanth Sundarrajan



[jira] [Commented] (PIG-4673) Built In UDF - REPLACE_MULTI : For a given string, search and replace all occurrences of search keys with replacement values.

2015-09-16 Thread Murali Rao (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14747063#comment-14747063
 ] 

Murali Rao commented on PIG-4673:
-

[~daijy]: I have added a patch with the UDF and a test file for the REPLACE_MULTI 
UDF. Please review and let me know your inputs. Many thanks.

> Built In UDF - REPLACE_MULTI : For a given string, search and replace all 
> occurrences of search keys with replacement values. 
> --
>
> Key: PIG-4673
> URL: https://issues.apache.org/jira/browse/PIG-4673
> Project: Pig
>  Issue Type: New Feature
>  Components: piggybank
>Affects Versions: site
>Reporter: Murali Rao
>Assignee: Murali Rao
>Priority: Minor
>  Labels: None
> Fix For: site
>
>
> Let's say we have the string 'A1B2C3D4'. Our objective is to replace A with 1, 
> B with 2, C with 3 and D with 4 to derive the string '11223344'. 
> Using the existing REPLACE method: 
> REPLACE(REPLACE(REPLACE(REPLACE('A1B2C3D4','A','1'),'B','2'),'C','3'),'D','4')
>  
> With the proposed REPLACE_MULTI UDF:
> General syntax: 
> REPLACE_MULTI ( sourceString,  [  search1#replacement1, ... ] )
> REPLACE_MULTI ( 'A1B2C3D4',  [ 'A'#'1','B'#'2', 'C'#'3', 'D'#'4' ] )
> Advantages: 
>   1. Fewer function calls. 
>   2. Easier to code and more readable.
>   
> Let me know your thoughts/inputs on having this UDF in Piggy Bank. I will take 
> this up based on the feedback.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4673) Built In UDF - REPLACE_MULTI : For a given string, search and replace all occurrences of search keys with replacement values.

2015-09-16 Thread Murali Rao (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Murali Rao updated PIG-4673:

   Labels: None  (was: )
Fix Version/s: site
 Release Note: Built In UDF - REPLACE_MULTI: a method that takes a tuple with 
the source string as the first parameter and a map of search keys to 
replacement values. The method replaces all occurrences of each search key in 
the source string with its replacement value.
Affects Version/s: site
   Status: Patch Available  (was: Open)

Built In UDF - REPLACE_MULTI  - Patch

> Built In UDF - REPLACE_MULTI : For a given string, search and replace all 
> occurrences of search keys with replacement values. 
> --
>
> Key: PIG-4673
> URL: https://issues.apache.org/jira/browse/PIG-4673
> Project: Pig
>  Issue Type: New Feature
>  Components: piggybank
>Affects Versions: site
>Reporter: Murali Rao
>Assignee: Murali Rao
>Priority: Minor
>  Labels: None
> Fix For: site
>
>
> Let's say we have the string 'A1B2C3D4'. Our objective is to replace A with 1, 
> B with 2, C with 3 and D with 4 to derive the string '11223344'. 
> Using the existing REPLACE method: 
> REPLACE(REPLACE(REPLACE(REPLACE('A1B2C3D4','A','1'),'B','2'),'C','3'),'D','4')
>  
> With the proposed REPLACE_MULTI UDF:
> General syntax: 
> REPLACE_MULTI ( sourceString,  [  search1#replacement1, ... ] )
> REPLACE_MULTI ( 'A1B2C3D4',  [ 'A'#'1','B'#'2', 'C'#'3', 'D'#'4' ] )
> Advantages: 
>   1. Fewer function calls. 
>   2. Easier to code and more readable.
>   
> Let me know your thoughts/inputs on having this UDF in Piggy Bank. I will take 
> this up based on the feedback.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)