[jira] [Updated] (PIG-5362) Parameter substitution of shell cmd results doesn't handle backslash

2018-10-12 Thread Satish Subhashrao Saley (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-5362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-5362:
-
Attachment: (was: test-TestParamSubPreproc.txt)

> Parameter substitution of shell cmd results doesn't handle backslash  
> -
>
> Key: PIG-5362
> URL: https://issues.apache.org/jira/browse/PIG-5362
> Project: Pig
>  Issue Type: Bug
>  Components: parser
>Reporter: Will Lauer
>Assignee: Will Lauer
>Priority: Minor
> Fix For: 0.18.0
>
> Attachments: pig.patch, pig2.patch, pig3.patch, test-failure.txt
>
>
> It looks like there is a bug in how parameter substitution is handled in 
> PreprocessorContext.java that causes parameter values that contain 
> backslashes to be processed incorrectly, resulting in the backslashes being 
> lost. For example, if you had the following:
> {code:java}
> %DECLARE A `echo \$foo\\bar`
> B = LOAD $A 
> {code}
> You would expect the echo command to produce the output {{$foo\bar}}, but the 
> actual value that gets substituted is {{\$foobar}}. This happens because the 
> {{substitute}} method in PreprocessorContext.java uses a regular expression 
> replacement instead of a basic string substitution, and $ and \ are special 
> characters in the replacement string. The code attempts to escape $, but does 
> not escape backslash.
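
For readers unfamiliar with the replacement-string rules of java.util.regex, here is a minimal, self-contained sketch of the pitfall. It is not Pig's actual PreprocessorContext code; the variable names and the exact escaping steps are illustrative assumptions.

{code:java}
import java.util.regex.Matcher;

public class ParamSubstSketch {
    public static void main(String[] args) {
        String script = "B = LOAD $A";
        String value  = "$foo\\bar";   // the characters  $foo\bar  from the shell command

        // Escape only the $ in the value (roughly what the buggy path does), then
        // do a regex-based replacement. The remaining backslash is treated as an
        // escape in the replacement string and disappears.
        String dollarEscaped = value.replace("$", "\\$");
        System.out.println(script.replaceAll("\\$A\\b", dollarEscaped)); // B = LOAD $foobar

        // Quoting the whole replacement escapes both \ and $, so the value
        // survives intact.
        String quoted = Matcher.quoteReplacement(value);
        System.out.println(script.replaceAll("\\$A\\b", quoted));        // B = LOAD $foo\bar
    }
}
{code}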



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PIG-5362) Parameter substitution of shell cmd results doesn't handle backslash

2018-10-12 Thread Satish Subhashrao Saley (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-5362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-5362:
-
Attachment: test-failure.txt

> Parameter substitution of shell cmd results doesn't handle backslash  
> -
>
> Key: PIG-5362
> URL: https://issues.apache.org/jira/browse/PIG-5362
> Project: Pig
>  Issue Type: Bug
>  Components: parser
>Reporter: Will Lauer
>Assignee: Will Lauer
>Priority: Minor
> Fix For: 0.18.0
>
> Attachments: pig.patch, pig2.patch, pig3.patch, test-failure.txt
>
>
> It looks like there is a bug in how parameter substitution is handled in 
> PreprocessorContext.java that causes parameter values that contain 
> backslashes to be processed incorrectly, resulting in the backslashes being 
> lost. For example, if you had the following:
> {code:java}
> %DECLARE A `echo \$foo\\bar`
> B = LOAD $A 
> {code}
> You would expect the echo command to produce the output {{$foo\bar}}, but the 
> actual value that gets substituted is {{\$foobar}}. This happens because the 
> {{substitute}} method in PreprocessorContext.java uses a regular expression 
> replacement instead of a basic string substitution, and $ and \ are special 
> characters in the replacement string. The code attempts to escape $, but does 
> not escape backslash.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PIG-5362) Parameter substitution of shell cmd results doesn't handle backslash

2018-10-12 Thread Satish Subhashrao Saley (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-5362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16648417#comment-16648417
 ] 

Satish Subhashrao Saley commented on PIG-5362:
--

There are test failures in TestParamSubPreproc. I have attached the test log.

 

> Parameter substitution of shell cmd results doesn't handle backslash  
> -
>
> Key: PIG-5362
> URL: https://issues.apache.org/jira/browse/PIG-5362
> Project: Pig
>  Issue Type: Bug
>  Components: parser
>Reporter: Will Lauer
>Assignee: Will Lauer
>Priority: Minor
> Fix For: 0.18.0
>
> Attachments: pig.patch, pig2.patch, pig3.patch, 
> test-TestParamSubPreproc.txt
>
>
> It looks like there is a bug in how parameter substitution is handled in 
> PreprocessorContext.java that causes parameter values that contain 
> backslashes to be processed incorrectly, resulting in the backslashes being 
> lost. For example, if you had the following:
> {code:java}
> %DECLARE A `echo \$foo\\bar`
> B = LOAD $A 
> {code}
> You would expect the echo command to produce the output {{$foo\bar}}, but the 
> actual value that gets substituted is {{\$foobar}}. This happens because the 
> {{substitute}} method in PreprocessorContext.java uses a regular expression 
> replacement instead of a basic string substitution, and $ and \ are special 
> characters in the replacement string. The code attempts to escape $, but does 
> not escape backslash.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PIG-5362) Parameter substitution of shell cmd results doesn't handle backslash

2018-10-12 Thread Satish Subhashrao Saley (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-5362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-5362:
-
Attachment: test-TestParamSubPreproc.txt

> Parameter substitution of shell cmd results doesn't handle backslash  
> -
>
> Key: PIG-5362
> URL: https://issues.apache.org/jira/browse/PIG-5362
> Project: Pig
>  Issue Type: Bug
>  Components: parser
>Reporter: Will Lauer
>Assignee: Will Lauer
>Priority: Minor
> Fix For: 0.18.0
>
> Attachments: pig.patch, pig2.patch, pig3.patch, 
> test-TestParamSubPreproc.txt
>
>
> It looks like there is a bug in how parameter substitution is handled in 
> PreprocessorContext.java that causes parameter values that contain 
> backslashes to be processed incorrectly, resulting in the backslashes being 
> lost. For example, if you had the following:
> {code:java}
> %DECLARE A `echo \$foo\\bar`
> B = LOAD $A 
> {code}
> You would expect the echo command to produce the output {{$foo\bar}}, but the 
> actual value that gets substituted is {{\$foobar}}. This happens because the 
> {{substitute}} method in PreprocessorContext.java uses a regular expression 
> replacement instead of a basic string substitution, and $ and \ are special 
> characters in the replacement string. The code attempts to escape $, but does 
> not escape backslash.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PIG-5365) Add support for PARALLEL clause in LOAD statement

2018-10-12 Thread Satish Subhashrao Saley (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-5365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-5365:
-
Description: 
It is tiresome to keep telling users to increase pig.maxCombinedSplitSize to 
512MB or 1G when they are reading TBs of data to avoid launching too many map 
tasks (50-100K) for loading data. It has unnecessary overhead in terms of 
container launches and wastes a lot of resources. 

It would be good to have a new setting to configure the max number of tasks, 
which will override pig.maxCombinedSplitSize and combine more splits into one 
task. For example, with pig.max.input.splits=3 and a data size of 2TB, it will 
combine more than 128MB (the default pig.maxCombinedSplitSize) per task so as 
to have a maximum of 30K tasks. That will go in as a default in 
pig-default.properties and apply to all users.

 Thank you [~rohini] for filing the issue.

  was:
It is tiresome to keep telling users to increase pig.maxCombinedSplitSize to 
512MB or 1G when they are reading TBs of data to avoid launching too many map 
tasks (50-100K) for loading data. It has unnecessary overhead in terms of 
container launches and wastes a lot of resources. 

It would be good to have a new setting to configure the max number of tasks, 
which will override pig.maxCombinedSplitSize and combine more splits into one 
task. For example, with pig.max.input.splits=3 and a data size of 2TB, it will 
combine more than 128MB (the default pig.maxCombinedSplitSize) per task so as 
to have a maximum of 30K tasks. That will go in as a default in 
pig-default.properties and apply to all users.

 


> Add support for PARALLEL clause in LOAD statement
> -
>
> Key: PIG-5365
> URL: https://issues.apache.org/jira/browse/PIG-5365
> Project: Pig
>  Issue Type: New Feature
>Reporter: Satish Subhashrao Saley
>Assignee: Satish Subhashrao Saley
>Priority: Major
>
> It is tiresome to keep telling users to increase pig.maxCombinedSplitSize to 
> 512MB or 1G when they are reading TBs of data to avoid launching too many map 
> tasks (50-100K) for loading data. It has unnecessary overhead in terms of 
> container launches and wastes a lot of resources. 
> It would be good to have a new setting to configure the max number of tasks, 
> which will override pig.maxCombinedSplitSize and combine more splits into one 
> task. For example, with pig.max.input.splits=3 and a data size of 2TB, it 
> will combine more than 128MB (the default pig.maxCombinedSplitSize) per task 
> so as to have a maximum of 30K tasks. That will go in as a default in 
> pig-default.properties and apply to all users.
>  Thank you [~rohini] for filing the issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (PIG-5365) Add support for PARALLEL clause in LOAD statement

2018-10-12 Thread Satish Subhashrao Saley (JIRA)
Satish Subhashrao Saley created PIG-5365:


 Summary: Add support for PARALLEL clause in LOAD statement
 Key: PIG-5365
 URL: https://issues.apache.org/jira/browse/PIG-5365
 Project: Pig
  Issue Type: New Feature
Reporter: Satish Subhashrao Saley
Assignee: Satish Subhashrao Saley


It is tiresome to keep telling users to increase pig.maxCombinedSplitSize to 
512MB or 1G when they are reading TBs of data to avoid launching too many map 
tasks (50-100K) for loading data. It has unnecessary overhead in terms of 
container launches and wastes a lot of resources. 

It would be good to have a new setting to configure the max number of tasks, 
which will override pig.maxCombinedSplitSize and combine more splits into one 
task. For example, with pig.max.input.splits=3 and a data size of 2TB, it will 
combine more than 128MB (the default pig.maxCombinedSplitSize) per task so as 
to have a maximum of 30K tasks. That will go in as a default in 
pig-default.properties and apply to all users.
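
As a rough sketch of the intended behavior: the proposed setting does not exist yet, and the property name, default, and formula below are illustrative assumptions, not the final implementation.

{code:java}
public class MaxInputSplitsSketch {
    // default pig.maxCombinedSplitSize (128MB)
    private static final long DEFAULT_COMBINED_SPLIT_SIZE = 128L * 1024 * 1024;

    // Smallest combined split size that keeps the number of load tasks at or
    // below the configured cap, never dropping below the existing default.
    static long effectiveCombinedSplitSize(long totalInputBytes, long maxInputSplits) {
        long sizeForCap = (totalInputBytes + maxInputSplits - 1) / maxInputSplits; // ceiling division
        return Math.max(DEFAULT_COMBINED_SPLIT_SIZE, sizeForCap);
    }

    public static void main(String[] args) {
        long twoTb = 2L * 1024 * 1024 * 1024 * 1024;
        // e.g. cap the job at 30K load tasks for a 2TB input
        System.out.println(effectiveCombinedSplitSize(twoTb, 30_000));
    }
}
{code}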

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PIG-5359) Reduce time spent in split serialization

2018-10-11 Thread Satish Subhashrao Saley (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-5359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-5359:
-
Attachment: PIG-5359-amend-2.patch

> Reduce time spent in split serialization
> 
>
> Key: PIG-5359
> URL: https://issues.apache.org/jira/browse/PIG-5359
> Project: Pig
>  Issue Type: Improvement
>Reporter: Satish Subhashrao Saley
>Assignee: Satish Subhashrao Saley
>Priority: Major
> Fix For: 0.18.0
>
> Attachments: PIG-5359-3.patch, PIG-5359-amend-1.patch, 
> PIG-5359-amend-2.patch
>
>
> 1. Unnecessary serialization of splits in Tez.
>  In LoaderProcessor, pig calls
>  
> [https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/optimizer/LoaderProcessor.java#L172]
> {code:java}
> tezOp.getLoaderInfo().setInputSplitInfo(MRInputHelpers.generateInputSplitsToMem(conf,
>  false, 0));
> {code}
> It ends up serializing the splits just to print a log message.
> [https://github.com/apache/tez/blob/master/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/hadoop/MRInputHelpers.java#L317]
> {code:java}
>   public static InputSplitInfoMem generateInputSplitsToMem(Configuration conf,
>   boolean groupSplits, boolean sortSplits, int targetTasks)
>   throws IOException, ClassNotFoundException, InterruptedException {
>   
>   
>   LOG.info("NumSplits: " + splitInfoMem.getNumTasks() + ", 
> SerializedSize: "
> + splitInfoMem.getSplitsProto().getSerializedSize());
> return splitInfoMem;
> {code}
> [https://github.com/apache/tez/blob/master/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/hadoop/InputSplitInfoMem.java#L106]
> {code:java}
>   public MRSplitsProto getSplitsProto() {
> if (isNewSplit) {
>   try {
> return createSplitsProto(newFormatSplits, new 
> SerializationFactory(conf));
> {code}
> [https://github.com/apache/tez/blob/master/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/hadoop/InputSplitInfoMem.java#L152-L170]
> {code:java}
>   private static MRSplitsProto createSplitsProto(
>   org.apache.hadoop.mapreduce.InputSplit[] newSplits,
>   SerializationFactory serializationFactory) throws IOException,
>   InterruptedException {
> MRSplitsProto.Builder splitsBuilder = MRSplitsProto.newBuilder();
> for (org.apache.hadoop.mapreduce.InputSplit newSplit : newSplits) {
>   splitsBuilder.addSplits(MRInputHelpers.createSplitProto(newSplit, 
> serializationFactory));
> }
> return splitsBuilder.build();
>   }
> {code}
> [https://github.com/apache/tez/blob/master/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/hadoop/MRInputHelpers.java#L221-L259]
> 2. In TezDagBuilder, if splitsSerializedSize > spillThreshold, then the 
> InputSplits serialized in MRSplitsProto are not used by Pig, and it serializes 
> them again directly to disk via JobSplitWriter.createSplitFiles. So the 
> InputSplit serialization logic is called again, which is wasteful and 
> expensive in cases like HCat.
> [https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/TezDagBuilder.java#L946-L947]
> {code:java}
> MRSplitsProto splitsProto = inputSplitInfo.getSplitsProto();
> int splitsSerializedSize = splitsProto.getSerializedSize();
> {code}
> The getSplitsProto call creates an MRSplitsProto, which consists of a list of 
> MRSplitProto objects. An MRSplitProto holds the serialized bytes of each 
> InputSplit. If splitsSerializedSize > spillThreshold, Pig writes the splits to 
> disk via
> {code:java}
> if(splitsSerializedSize > spillThreshold) {
> inputPayLoad.setBoolean(
> 
> org.apache.tez.mapreduce.hadoop.MRJobConfig.MR_TEZ_SPLITS_VIA_EVENTS,
> false);
> // Write splits to disk
> Path inputSplitsDir = FileLocalizer.getTemporaryPath(pc);
> log.info("Writing input splits to " + inputSplitsDir
> + " for vertex " + vertex.getName()
> + " as the serialized size in memory is "
> + splitsSerializedSize + ". Configured "
> + PigConfiguration.PIG_TEZ_INPUT_SPLITS_MEM_THRESHOLD
> + " is " + spillThreshold);
> inputSplitInfo = MRToTezHelper.writeInputSplitInfoToDisk(
> (InputSplitInfoMem)inputSplitInfo, inputSplitsDir, payloadConf, 
> fs);
> {code}
> [https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/TezDagBuilder.java#L960]
>  
> [https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/util/MRToTezHelper.java#L302-L314]
> Solution:
>  1. Do not serialize the split in LoaderProcessor.java
>  2. In TezDagBuilder.java, serialize each input split and keep adding its 
> size, and if it exceeds spillThreshold, then write the splits to disk, reusing 
> the serialized buffers for each split.

[jira] [Updated] (PIG-5359) Reduce time spent in split serialization

2018-10-11 Thread Satish Subhashrao Saley (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-5359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-5359:
-
Attachment: (was: PIG-5359-amend-2.patch)

> Reduce time spent in split serialization
> 
>
> Key: PIG-5359
> URL: https://issues.apache.org/jira/browse/PIG-5359
> Project: Pig
>  Issue Type: Improvement
>Reporter: Satish Subhashrao Saley
>Assignee: Satish Subhashrao Saley
>Priority: Major
> Fix For: 0.18.0
>
> Attachments: PIG-5359-3.patch, PIG-5359-amend-1.patch
>
>
> 1. Unnecessary serialization of splits in Tez.
>  In LoaderProcessor, pig calls
>  
> [https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/optimizer/LoaderProcessor.java#L172]
> {code:java}
> tezOp.getLoaderInfo().setInputSplitInfo(MRInputHelpers.generateInputSplitsToMem(conf,
>  false, 0));
> {code}
> It ends up serializing the splits just to print a log message.
> [https://github.com/apache/tez/blob/master/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/hadoop/MRInputHelpers.java#L317]
> {code:java}
>   public static InputSplitInfoMem generateInputSplitsToMem(Configuration conf,
>   boolean groupSplits, boolean sortSplits, int targetTasks)
>   throws IOException, ClassNotFoundException, InterruptedException {
>   
>   
>   LOG.info("NumSplits: " + splitInfoMem.getNumTasks() + ", 
> SerializedSize: "
> + splitInfoMem.getSplitsProto().getSerializedSize());
> return splitInfoMem;
> {code}
> [https://github.com/apache/tez/blob/master/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/hadoop/InputSplitInfoMem.java#L106]
> {code:java}
>   public MRSplitsProto getSplitsProto() {
> if (isNewSplit) {
>   try {
> return createSplitsProto(newFormatSplits, new 
> SerializationFactory(conf));
> {code}
> [https://github.com/apache/tez/blob/master/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/hadoop/InputSplitInfoMem.java#L152-L170]
> {code:java}
>   private static MRSplitsProto createSplitsProto(
>   org.apache.hadoop.mapreduce.InputSplit[] newSplits,
>   SerializationFactory serializationFactory) throws IOException,
>   InterruptedException {
> MRSplitsProto.Builder splitsBuilder = MRSplitsProto.newBuilder();
> for (org.apache.hadoop.mapreduce.InputSplit newSplit : newSplits) {
>   splitsBuilder.addSplits(MRInputHelpers.createSplitProto(newSplit, 
> serializationFactory));
> }
> return splitsBuilder.build();
>   }
> {code}
> [https://github.com/apache/tez/blob/master/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/hadoop/MRInputHelpers.java#L221-L259]
> 2. In TezDagBuilder, if splitsSerializedSize > spillThreshold, then the 
> InputSplits serialized in MRSplitsProto are not used by Pig, and it serializes 
> them again directly to disk via JobSplitWriter.createSplitFiles. So the 
> InputSplit serialization logic is called again, which is wasteful and 
> expensive in cases like HCat.
> [https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/TezDagBuilder.java#L946-L947]
> {code:java}
> MRSplitsProto splitsProto = inputSplitInfo.getSplitsProto();
> int splitsSerializedSize = splitsProto.getSerializedSize();
> {code}
> The getSplitsProto call creates an MRSplitsProto, which consists of a list of 
> MRSplitProto objects. An MRSplitProto holds the serialized bytes of each 
> InputSplit. If splitsSerializedSize > spillThreshold, Pig writes the splits to 
> disk via
> {code:java}
> if(splitsSerializedSize > spillThreshold) {
> inputPayLoad.setBoolean(
> 
> org.apache.tez.mapreduce.hadoop.MRJobConfig.MR_TEZ_SPLITS_VIA_EVENTS,
> false);
> // Write splits to disk
> Path inputSplitsDir = FileLocalizer.getTemporaryPath(pc);
> log.info("Writing input splits to " + inputSplitsDir
> + " for vertex " + vertex.getName()
> + " as the serialized size in memory is "
> + splitsSerializedSize + ". Configured "
> + PigConfiguration.PIG_TEZ_INPUT_SPLITS_MEM_THRESHOLD
> + " is " + spillThreshold);
> inputSplitInfo = MRToTezHelper.writeInputSplitInfoToDisk(
> (InputSplitInfoMem)inputSplitInfo, inputSplitsDir, payloadConf, 
> fs);
> {code}
> [https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/TezDagBuilder.java#L960]
>  
> [https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/util/MRToTezHelper.java#L302-L314]
> Solution:
>  1. Do not serialize the split in LoaderProcessor.java
>  2. In TezDagBuilder.java, serialize each input split and keep adding its 
> size, and if it exceeds spillThreshold, then write the splits to disk, reusing 
> the serialized buffers for each split.

[jira] [Created] (PIG-5364) Using && in EL expressions in oozie bundle.xml files generates parse errors

2018-10-11 Thread Satish Subhashrao Saley (JIRA)
Satish Subhashrao Saley created PIG-5364:


 Summary: Using && in EL expressions in oozie bundle.xml files 
generates parse errors
 Key: PIG-5364
 URL: https://issues.apache.org/jira/browse/PIG-5364
 Project: Pig
  Issue Type: Bug
Reporter: Satish Subhashrao Saley
Assignee: Satish Subhashrao Saley


[~wla...@yahoo-inc.com] reported -

I need to put a logical AND into an expression in an oozie bundle (in the 
enabled flag on the coordinator). When I try "{{A && B}}", oozie gives me a 
perfectly understandable error about this being invalid XML ({{XML schema 
error, The entity name must immediately follow the '&' in the entity 
reference.}}). But when I try to replace this with an XML entity ({{A &amp;&amp; B}}), 
I get an unexpected error that seems to indicate that a space (or something 
else) is getting injected unexpectedly into my expression:
{noformat}
Bundle Job submission Error: [E1004: Expression language evaluation error, 
Encountered "&", expected one of ["}", ".", ">", "gt", "<", "lt", "==", "eq", 
"<=", "le", ">=", "ge", "!=", "ne", "[", "+", "-", "*", "/", "div", "%", "mod", 
"and", "&&", "or", "||", ":", "(", "?"]]
{noformat}
Any idea how I can get the expression "A && B" to be parsed correctly in an 
oozie bundle.xml file?

It turned out to be a bug in Oozie. 
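
A quick standalone check of the XML side of this, using only JDK classes; it is not Oozie code, just an illustration of why the raw form is rejected while the entity-escaped form parses.

{code:java}
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;

public class ElEscapeCheck {
    public static void main(String[] args) throws Exception {
        String raw     = "<enabled>${A && B}</enabled>";
        String escaped = "<enabled>${A &amp;&amp; B}</enabled>";
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        try {
            // '&' must start an entity reference, so a bare "&&" is invalid XML
            dbf.newDocumentBuilder().parse(new ByteArrayInputStream(raw.getBytes()));
        } catch (org.xml.sax.SAXParseException e) {
            System.out.println("raw form rejected: " + e.getMessage());
        }
        // the escaped form parses, and its text content is "${A && B}"
        dbf.newDocumentBuilder().parse(new ByteArrayInputStream(escaped.getBytes()));
        System.out.println("entity-escaped form parses");
    }
}
{code}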



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PIG-5359) Reduce time spent in split serialization

2018-10-11 Thread Satish Subhashrao Saley (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-5359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-5359:
-
Attachment: PIG-5359-amend-2.patch

> Reduce time spent in split serialization
> 
>
> Key: PIG-5359
> URL: https://issues.apache.org/jira/browse/PIG-5359
> Project: Pig
>  Issue Type: Improvement
>Reporter: Satish Subhashrao Saley
>Assignee: Satish Subhashrao Saley
>Priority: Major
> Fix For: 0.18.0
>
> Attachments: PIG-5359-3.patch, PIG-5359-amend-1.patch, 
> PIG-5359-amend-2.patch
>
>
> 1. Unnecessary serialization of splits in Tez.
>  In LoaderProcessor, pig calls
>  
> [https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/optimizer/LoaderProcessor.java#L172]
> {code:java}
> tezOp.getLoaderInfo().setInputSplitInfo(MRInputHelpers.generateInputSplitsToMem(conf,
>  false, 0));
> {code}
> It ends up serializing the splits just to print a log message.
> [https://github.com/apache/tez/blob/master/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/hadoop/MRInputHelpers.java#L317]
> {code:java}
>   public static InputSplitInfoMem generateInputSplitsToMem(Configuration conf,
>   boolean groupSplits, boolean sortSplits, int targetTasks)
>   throws IOException, ClassNotFoundException, InterruptedException {
>   
>   
>   LOG.info("NumSplits: " + splitInfoMem.getNumTasks() + ", 
> SerializedSize: "
> + splitInfoMem.getSplitsProto().getSerializedSize());
> return splitInfoMem;
> {code}
> [https://github.com/apache/tez/blob/master/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/hadoop/InputSplitInfoMem.java#L106]
> {code:java}
>   public MRSplitsProto getSplitsProto() {
> if (isNewSplit) {
>   try {
> return createSplitsProto(newFormatSplits, new 
> SerializationFactory(conf));
> {code}
> [https://github.com/apache/tez/blob/master/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/hadoop/InputSplitInfoMem.java#L152-L170]
> {code:java}
>   private static MRSplitsProto createSplitsProto(
>   org.apache.hadoop.mapreduce.InputSplit[] newSplits,
>   SerializationFactory serializationFactory) throws IOException,
>   InterruptedException {
> MRSplitsProto.Builder splitsBuilder = MRSplitsProto.newBuilder();
> for (org.apache.hadoop.mapreduce.InputSplit newSplit : newSplits) {
>   splitsBuilder.addSplits(MRInputHelpers.createSplitProto(newSplit, 
> serializationFactory));
> }
> return splitsBuilder.build();
>   }
> {code}
> [https://github.com/apache/tez/blob/master/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/hadoop/MRInputHelpers.java#L221-L259]
> 2. In TezDagBuilder, if splitsSerializedSize > spillThreshold, then the 
> InputSplits serialized in MRSplitsProto are not used by Pig, and it serializes 
> them again directly to disk via JobSplitWriter.createSplitFiles. So the 
> InputSplit serialization logic is called again, which is wasteful and 
> expensive in cases like HCat.
> [https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/TezDagBuilder.java#L946-L947]
> {code:java}
> MRSplitsProto splitsProto = inputSplitInfo.getSplitsProto();
> int splitsSerializedSize = splitsProto.getSerializedSize();
> {code}
> The getSplitsProto call creates an MRSplitsProto, which consists of a list of 
> MRSplitProto objects. An MRSplitProto holds the serialized bytes of each 
> InputSplit. If splitsSerializedSize > spillThreshold, Pig writes the splits to 
> disk via
> {code:java}
> if(splitsSerializedSize > spillThreshold) {
> inputPayLoad.setBoolean(
> 
> org.apache.tez.mapreduce.hadoop.MRJobConfig.MR_TEZ_SPLITS_VIA_EVENTS,
> false);
> // Write splits to disk
> Path inputSplitsDir = FileLocalizer.getTemporaryPath(pc);
> log.info("Writing input splits to " + inputSplitsDir
> + " for vertex " + vertex.getName()
> + " as the serialized size in memory is "
> + splitsSerializedSize + ". Configured "
> + PigConfiguration.PIG_TEZ_INPUT_SPLITS_MEM_THRESHOLD
> + " is " + spillThreshold);
> inputSplitInfo = MRToTezHelper.writeInputSplitInfoToDisk(
> (InputSplitInfoMem)inputSplitInfo, inputSplitsDir, payloadConf, 
> fs);
> {code}
> [https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/TezDagBuilder.java#L960]
>  
> [https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/util/MRToTezHelper.java#L302-L314]
> Solution:
>  1. Do not serialize the split in LoaderProcessor.java
>  2. In TezDagBuilder.java, serialize each input split and keep adding its 
> size, and if it exceeds spillThreshold, then write the splits to disk, reusing 
> the serialized buffers for each split.

[jira] [Commented] (PIG-5359) Reduce time spent in split serialization

2018-10-11 Thread Satish Subhashrao Saley (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-5359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16646844#comment-16646844
 ] 

Satish Subhashrao Saley commented on PIG-5359:
--

Updated the amend patch since TestTezAutoParallelism was failing.

> Reduce time spent in split serialization
> 
>
> Key: PIG-5359
> URL: https://issues.apache.org/jira/browse/PIG-5359
> Project: Pig
>  Issue Type: Improvement
>Reporter: Satish Subhashrao Saley
>Assignee: Satish Subhashrao Saley
>Priority: Major
> Fix For: 0.18.0
>
> Attachments: PIG-5359-3.patch, PIG-5359-amend-1.patch
>
>
> 1. Unnecessary serialization of splits in Tez.
>  In LoaderProcessor, pig calls
>  
> [https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/optimizer/LoaderProcessor.java#L172]
> {code:java}
> tezOp.getLoaderInfo().setInputSplitInfo(MRInputHelpers.generateInputSplitsToMem(conf,
>  false, 0));
> {code}
> It ends up serializing the splits just to print a log message.
> [https://github.com/apache/tez/blob/master/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/hadoop/MRInputHelpers.java#L317]
> {code:java}
>   public static InputSplitInfoMem generateInputSplitsToMem(Configuration conf,
>   boolean groupSplits, boolean sortSplits, int targetTasks)
>   throws IOException, ClassNotFoundException, InterruptedException {
>   
>   
>   LOG.info("NumSplits: " + splitInfoMem.getNumTasks() + ", 
> SerializedSize: "
> + splitInfoMem.getSplitsProto().getSerializedSize());
> return splitInfoMem;
> {code}
> [https://github.com/apache/tez/blob/master/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/hadoop/InputSplitInfoMem.java#L106]
> {code:java}
>   public MRSplitsProto getSplitsProto() {
> if (isNewSplit) {
>   try {
> return createSplitsProto(newFormatSplits, new 
> SerializationFactory(conf));
> {code}
> [https://github.com/apache/tez/blob/master/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/hadoop/InputSplitInfoMem.java#L152-L170]
> {code:java}
>   private static MRSplitsProto createSplitsProto(
>   org.apache.hadoop.mapreduce.InputSplit[] newSplits,
>   SerializationFactory serializationFactory) throws IOException,
>   InterruptedException {
> MRSplitsProto.Builder splitsBuilder = MRSplitsProto.newBuilder();
> for (org.apache.hadoop.mapreduce.InputSplit newSplit : newSplits) {
>   splitsBuilder.addSplits(MRInputHelpers.createSplitProto(newSplit, 
> serializationFactory));
> }
> return splitsBuilder.build();
>   }
> {code}
> [https://github.com/apache/tez/blob/master/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/hadoop/MRInputHelpers.java#L221-L259]
> 2. In TezDagBuilder, if splitsSerializedSize > spillThreshold, then the 
> InputSplits serialized in MRSplitsProto are not used by Pig, and it serializes 
> them again directly to disk via JobSplitWriter.createSplitFiles. So the 
> InputSplit serialization logic is called again, which is wasteful and 
> expensive in cases like HCat.
> [https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/TezDagBuilder.java#L946-L947]
> {code:java}
> MRSplitsProto splitsProto = inputSplitInfo.getSplitsProto();
> int splitsSerializedSize = splitsProto.getSerializedSize();
> {code}
> The getSplitsProto call creates an MRSplitsProto, which consists of a list of 
> MRSplitProto objects. An MRSplitProto holds the serialized bytes of each 
> InputSplit. If splitsSerializedSize > spillThreshold, Pig writes the splits to 
> disk via
> {code:java}
> if(splitsSerializedSize > spillThreshold) {
> inputPayLoad.setBoolean(
> 
> org.apache.tez.mapreduce.hadoop.MRJobConfig.MR_TEZ_SPLITS_VIA_EVENTS,
> false);
> // Write splits to disk
> Path inputSplitsDir = FileLocalizer.getTemporaryPath(pc);
> log.info("Writing input splits to " + inputSplitsDir
> + " for vertex " + vertex.getName()
> + " as the serialized size in memory is "
> + splitsSerializedSize + ". Configured "
> + PigConfiguration.PIG_TEZ_INPUT_SPLITS_MEM_THRESHOLD
> + " is " + spillThreshold);
> inputSplitInfo = MRToTezHelper.writeInputSplitInfoToDisk(
> (InputSplitInfoMem)inputSplitInfo, inputSplitsDir, payloadConf, 
> fs);
> {code}
> [https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/TezDagBuilder.java#L960]
>  
> [https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/util/MRToTezHelper.java#L302-L314]
> Solution:
>  1. Do not serialize the split in LoaderProcessor.java
>  2. In TezDagBuilder.java, serialize each input split and keep adding its 
> size, and if it exceeds spillThreshold, then write the splits to disk, reusing 
> the serialized buffers for each split.

[jira] [Updated] (PIG-5359) Reduce time spent in split serialization

2018-10-10 Thread Satish Subhashrao Saley (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-5359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-5359:
-
Attachment: PIG-5359-amend-1.patch

> Reduce time spent in split serialization
> 
>
> Key: PIG-5359
> URL: https://issues.apache.org/jira/browse/PIG-5359
> Project: Pig
>  Issue Type: Improvement
>Reporter: Satish Subhashrao Saley
>Assignee: Satish Subhashrao Saley
>Priority: Major
> Fix For: 0.18.0
>
> Attachments: PIG-5359-3.patch, PIG-5359-amend-1.patch
>
>
> 1. Unnecessary serialization of splits in Tez.
>  In LoaderProcessor, pig calls
>  
> [https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/optimizer/LoaderProcessor.java#L172]
> {code:java}
> tezOp.getLoaderInfo().setInputSplitInfo(MRInputHelpers.generateInputSplitsToMem(conf,
>  false, 0));
> {code}
> It ends up serializing the splits just to print a log message.
> [https://github.com/apache/tez/blob/master/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/hadoop/MRInputHelpers.java#L317]
> {code:java}
>   public static InputSplitInfoMem generateInputSplitsToMem(Configuration conf,
>   boolean groupSplits, boolean sortSplits, int targetTasks)
>   throws IOException, ClassNotFoundException, InterruptedException {
>   
>   
>   LOG.info("NumSplits: " + splitInfoMem.getNumTasks() + ", 
> SerializedSize: "
> + splitInfoMem.getSplitsProto().getSerializedSize());
> return splitInfoMem;
> {code}
> [https://github.com/apache/tez/blob/master/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/hadoop/InputSplitInfoMem.java#L106]
> {code:java}
>   public MRSplitsProto getSplitsProto() {
> if (isNewSplit) {
>   try {
> return createSplitsProto(newFormatSplits, new 
> SerializationFactory(conf));
> {code}
> [https://github.com/apache/tez/blob/master/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/hadoop/InputSplitInfoMem.java#L152-L170]
> {code:java}
>   private static MRSplitsProto createSplitsProto(
>   org.apache.hadoop.mapreduce.InputSplit[] newSplits,
>   SerializationFactory serializationFactory) throws IOException,
>   InterruptedException {
> MRSplitsProto.Builder splitsBuilder = MRSplitsProto.newBuilder();
> for (org.apache.hadoop.mapreduce.InputSplit newSplit : newSplits) {
>   splitsBuilder.addSplits(MRInputHelpers.createSplitProto(newSplit, 
> serializationFactory));
> }
> return splitsBuilder.build();
>   }
> {code}
> [https://github.com/apache/tez/blob/master/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/hadoop/MRInputHelpers.java#L221-L259]
> 2. In TezDagBuilder, if splitsSerializedSize > spillThreshold, then the 
> InputSplits serialized in MRSplitsProto are not used by Pig, and it serializes 
> them again directly to disk via JobSplitWriter.createSplitFiles. So the 
> InputSplit serialization logic is called again, which is wasteful and 
> expensive in cases like HCat.
> [https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/TezDagBuilder.java#L946-L947]
> {code:java}
> MRSplitsProto splitsProto = inputSplitInfo.getSplitsProto();
> int splitsSerializedSize = splitsProto.getSerializedSize();
> {code}
> The getSplitsProto call creates an MRSplitsProto, which consists of a list of 
> MRSplitProto objects. An MRSplitProto holds the serialized bytes of each 
> InputSplit. If splitsSerializedSize > spillThreshold, Pig writes the splits to 
> disk via
> {code:java}
> if(splitsSerializedSize > spillThreshold) {
> inputPayLoad.setBoolean(
> 
> org.apache.tez.mapreduce.hadoop.MRJobConfig.MR_TEZ_SPLITS_VIA_EVENTS,
> false);
> // Write splits to disk
> Path inputSplitsDir = FileLocalizer.getTemporaryPath(pc);
> log.info("Writing input splits to " + inputSplitsDir
> + " for vertex " + vertex.getName()
> + " as the serialized size in memory is "
> + splitsSerializedSize + ". Configured "
> + PigConfiguration.PIG_TEZ_INPUT_SPLITS_MEM_THRESHOLD
> + " is " + spillThreshold);
> inputSplitInfo = MRToTezHelper.writeInputSplitInfoToDisk(
> (InputSplitInfoMem)inputSplitInfo, inputSplitsDir, payloadConf, 
> fs);
> {code}
> [https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/TezDagBuilder.java#L960]
>  
> [https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/util/MRToTezHelper.java#L302-L314]
> Solution:
>  1. Do not serialize the split in LoaderProcessor.java
>  2. In TezDagBuilder.java, serialize each input split and keep adding its 
> size and if it exceeds spillThreshold, then write the splits to disk reusing 
> the serialized buffers for each split.
>  
> 
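
A rough sketch of solution point 2, built around the Tez classes quoted above. The wrapper class, the import paths, and the result holder are assumptions for illustration; createSplitProto's visibility and exact signature have not been verified beyond the snippets quoted in this issue.

{code:java}
import java.io.IOException;
import org.apache.hadoop.io.serializer.SerializationFactory;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.tez.mapreduce.hadoop.MRInputHelpers;
import org.apache.tez.mapreduce.protos.MRRuntimeProtos.MRSplitProto;
import org.apache.tez.mapreduce.protos.MRRuntimeProtos.MRSplitsProto;

public class SplitSerializationSketch {

    /** Result of a single serialization pass: the protos plus their total size. */
    static final class SerializedSplits {
        final MRSplitsProto splitsProto;
        final long serializedSize;
        SerializedSplits(MRSplitsProto splitsProto, long serializedSize) {
            this.splitsProto = splitsProto;
            this.serializedSize = serializedSize;
        }
        boolean shouldSpillToDisk(long spillThreshold) {
            return serializedSize > spillThreshold;
        }
    }

    // Serialize each split exactly once, keeping a running total of the size so
    // the caller can decide between shipping the splits via events and spilling
    // them to disk while reusing the already-serialized protos either way.
    static SerializedSplits serializeOnce(InputSplit[] splits, SerializationFactory factory)
            throws IOException, InterruptedException {
        MRSplitsProto.Builder builder = MRSplitsProto.newBuilder();
        long serializedSize = 0;
        for (InputSplit split : splits) {
            MRSplitProto proto = MRInputHelpers.createSplitProto(split, factory);
            serializedSize += proto.getSerializedSize();
            builder.addSplits(proto);
        }
        return new SerializedSplits(builder.build(), serializedSize);
    }
}
{code}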

[jira] [Commented] (PIG-5342) Add setting to turn off bloom join combiner

2018-10-04 Thread Satish Subhashrao Saley (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16638956#comment-16638956
 ] 

Satish Subhashrao Saley commented on PIG-5342:
--

Could you please amend the commit? The BloomFilterPartitioner class wasn't 
committed. 
{code:java}
     [echo] *** Building Main Sources ***

     [echo] *** To compile with all warnings enabled, supply -Dall.warnings=1 
on command line ***

     [echo] *** Else, you will only be warned about deprecations ***

     [echo] *** Hadoop version used: 2 ; HBase version used: 1 ; Spark version 
used: 2 ***

    [javac] Compiling 1106 source files to /Users/saley/src/pig/build/classes

    [javac] 
/Users/saley/src/pig/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/TezCompiler.java:113:
 error: cannot find symbol

    [javac] import 
org.apache.pig.backend.hadoop.executionengine.tez.runtime.BloomFilterPartitioner;

    [javac]                                                                 ^

    [javac]   symbol:   class BloomFilterPartitioner

    [javac]   location: package 
org.apache.pig.backend.hadoop.executionengine.tez.runtime

    [javac] 
/Users/saley/src/pig/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/TezCompiler.java:1495:
 error: cannot find symbol

    [javac]             edge.partitionerClass = BloomFilterPartitioner.class;

    [javac]                                     ^

    [javac]   symbol:   class BloomFilterPartitioner

    [javac]   location: class TezCompiler

    [javac] Note: Some input files use or override a deprecated API.

    [javac] Note: Recompile with -Xlint:deprecation for details.

    [javac] Note: Some input files use unchecked or unsafe operations.

    [javac] Note: Recompile with -Xlint:unchecked for details.

    [javac] 2 errors

{code}

> Add setting to turn off bloom join combiner
> ---
>
> Key: PIG-5342
> URL: https://issues.apache.org/jira/browse/PIG-5342
> Project: Pig
>  Issue Type: Sub-task
>Reporter: Satish Subhashrao Saley
>Assignee: Satish Subhashrao Saley
>Priority: Major
> Fix For: 0.18.0
>
> Attachments: PIG-5342-1.patch, PIG-5342-2.patch, PIG-5342-3.patch, 
> PIG-5342-4.patch, PIG-5342-5.patch, PIG-5342-6.patch, PIG-5342-7.patch, 
> PIG-5342-8.patch
>
>
> 1) Need a new setting, pig.bloomjoin.nocombiner, to turn off the combiner for 
> bloom join. When the keys are all unique, the combiner is unnecessary overhead.
> 2) In the previous case, the keys were the bloom filter index and the values 
> were the join key. Combining involved doing a distinct on the bag of values, 
> which has memory issues for more than 10 million records. That needs to be 
> flipped and a distinct combiner used to scale to billions of records.
> 3) Mention in the documentation that bloom join is also ideal in cases of a 
> right outer join with a smaller dataset on the right. Replicate join only 
> supports left outer join.
>  
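
A small sketch of how a job might opt out of the combiner once the setting exists. The property name comes from the description above; the PigServer usage, exec type, and the query itself are illustrative assumptions.

{code:java}
import java.util.Properties;
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class BloomJoinNoCombinerSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // keys are known to be unique, so the combiner would be pure overhead
        props.setProperty("pig.bloomjoin.nocombiner", "true");

        PigServer pig = new PigServer(ExecType.LOCAL, props);
        pig.registerQuery("big   = LOAD 'big_input'   AS (k:chararray, v:int);");
        pig.registerQuery("small = LOAD 'small_input' AS (k:chararray, w:int);");
        pig.registerQuery("joined = JOIN big BY k, small BY k USING 'bloom';");
        pig.store("joined", "bloom_join_out");
    }
}
{code}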



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PIG-5317) Upgrade old dependencies: commons-lang, hsqldb, commons-logging

2018-10-03 Thread Satish Subhashrao Saley (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-5317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16637635#comment-16637635
 ] 

Satish Subhashrao Saley commented on PIG-5317:
--

I tested PIG-5317_without_new_dep_2.patch; it looks good. +1 (non-binding)

> Upgrade old dependencies: commons-lang, hsqldb, commons-logging
> ---
>
> Key: PIG-5317
> URL: https://issues.apache.org/jira/browse/PIG-5317
> Project: Pig
>  Issue Type: Improvement
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
>Priority: Minor
> Fix For: 0.18.0
>
> Attachments: PIG-5317_1.patch, PIG-5317_2.patch, 
> PIG-5317_amend.patch, PIG-5317_without_new_dep.patch, 
> PIG-5317_without_new_dep_2.patch
>
>
> Pig depends on old version of commons-lang, hsqldb and commons-logging. It 
> would be nice to upgrade the version of these dependencies, for commons-lang 
> Pig should depend on commons-lang3 instead (which is already present in the 
> ivy.xml)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PIG-5342) Add setting to turn off bloom join combiner

2018-10-03 Thread Satish Subhashrao Saley (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-5342:
-
Attachment: (was: PIG-5342-7.patch)

> Add setting to turn off bloom join combiner
> ---
>
> Key: PIG-5342
> URL: https://issues.apache.org/jira/browse/PIG-5342
> Project: Pig
>  Issue Type: Sub-task
>Reporter: Satish Subhashrao Saley
>Assignee: Satish Subhashrao Saley
>Priority: Major
> Attachments: PIG-5342-1.patch, PIG-5342-2.patch, PIG-5342-3.patch, 
> PIG-5342-4.patch, PIG-5342-5.patch, PIG-5342-6.patch, PIG-5342-7.patch, 
> PIG-5342-8.patch
>
>
> 1) Need a new setting, pig.bloomjoin.nocombiner, to turn off the combiner for 
> bloom join. When the keys are all unique, the combiner is unnecessary overhead.
> 2) In the previous case, the keys were the bloom filter index and the values 
> were the join key. Combining involved doing a distinct on the bag of values, 
> which has memory issues for more than 10 million records. That needs to be 
> flipped and a distinct combiner used to scale to billions of records.
> 3) Mention in the documentation that bloom join is also ideal in cases of a 
> right outer join with a smaller dataset on the right. Replicate join only 
> supports left outer join.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PIG-5342) Add setting to turn off bloom join combiner

2018-10-03 Thread Satish Subhashrao Saley (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-5342:
-
Attachment: PIG-5342-8.patch

> Add setting to turn off bloom join combiner
> ---
>
> Key: PIG-5342
> URL: https://issues.apache.org/jira/browse/PIG-5342
> Project: Pig
>  Issue Type: Sub-task
>Reporter: Satish Subhashrao Saley
>Assignee: Satish Subhashrao Saley
>Priority: Major
> Attachments: PIG-5342-1.patch, PIG-5342-2.patch, PIG-5342-3.patch, 
> PIG-5342-4.patch, PIG-5342-5.patch, PIG-5342-6.patch, PIG-5342-7.patch, 
> PIG-5342-7.patch, PIG-5342-8.patch
>
>
> 1) Need a new setting, pig.bloomjoin.nocombiner, to turn off the combiner for 
> bloom join. When the keys are all unique, the combiner is unnecessary overhead.
> 2) In the previous case, the keys were the bloom filter index and the values 
> were the join key. Combining involved doing a distinct on the bag of values, 
> which has memory issues for more than 10 million records. That needs to be 
> flipped and a distinct combiner used to scale to billions of records.
> 3) Mention in the documentation that bloom join is also ideal in cases of a 
> right outer join with a smaller dataset on the right. Replicate join only 
> supports left outer join.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PIG-5342) Add setting to turn off bloom join combiner

2018-10-03 Thread Satish Subhashrao Saley (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-5342:
-
Attachment: PIG-5342-7.patch

> Add setting to turn off bloom join combiner
> ---
>
> Key: PIG-5342
> URL: https://issues.apache.org/jira/browse/PIG-5342
> Project: Pig
>  Issue Type: Sub-task
>Reporter: Satish Subhashrao Saley
>Assignee: Satish Subhashrao Saley
>Priority: Major
> Attachments: PIG-5342-1.patch, PIG-5342-2.patch, PIG-5342-3.patch, 
> PIG-5342-4.patch, PIG-5342-5.patch, PIG-5342-6.patch, PIG-5342-7.patch, 
> PIG-5342-7.patch, PIG-5342-8.patch
>
>
> 1) Need a new setting, pig.bloomjoin.nocombiner, to turn off the combiner for 
> bloom join. When the keys are all unique, the combiner is unnecessary overhead.
> 2) In the previous case, the keys were the bloom filter index and the values 
> were the join key. Combining involved doing a distinct on the bag of values, 
> which has memory issues for more than 10 million records. That needs to be 
> flipped and a distinct combiner used to scale to billions of records.
> 3) Mention in the documentation that bloom join is also ideal in cases of a 
> right outer join with a smaller dataset on the right. Replicate join only 
> supports left outer join.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PIG-5342) Add setting to turn off bloom join combiner

2018-10-03 Thread Satish Subhashrao Saley (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-5342:
-
Attachment: PIG-5342-7.patch

> Add setting to turn off bloom join combiner
> ---
>
> Key: PIG-5342
> URL: https://issues.apache.org/jira/browse/PIG-5342
> Project: Pig
>  Issue Type: Sub-task
>Reporter: Satish Subhashrao Saley
>Assignee: Satish Subhashrao Saley
>Priority: Major
> Attachments: PIG-5342-1.patch, PIG-5342-2.patch, PIG-5342-3.patch, 
> PIG-5342-4.patch, PIG-5342-5.patch, PIG-5342-6.patch, PIG-5342-7.patch
>
>
> 1) Need a new setting, pig.bloomjoin.nocombiner, to turn off the combiner for 
> bloom join. When the keys are all unique, the combiner is unnecessary overhead.
> 2) In the previous case, the keys were the bloom filter index and the values 
> were the join key. Combining involved doing a distinct on the bag of values, 
> which has memory issues for more than 10 million records. That needs to be 
> flipped and a distinct combiner used to scale to billions of records.
> 3) Mention in the documentation that bloom join is also ideal in cases of a 
> right outer join with a smaller dataset on the right. Replicate join only 
> supports left outer join.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PIG-5359) Reduce time spent in split serialization

2018-10-02 Thread Satish Subhashrao Saley (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-5359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16635847#comment-16635847
 ] 

Satish Subhashrao Saley commented on PIG-5359:
--

Updated the patch on the review board.

> Reduce time spent in split serialization
> 
>
> Key: PIG-5359
> URL: https://issues.apache.org/jira/browse/PIG-5359
> Project: Pig
>  Issue Type: Improvement
>Reporter: Satish Subhashrao Saley
>Assignee: Satish Subhashrao Saley
>Priority: Major
>
> 1. Unnecessary serialization of splits in Tez.
>  In LoaderProcessor, pig calls
>  
> [https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/optimizer/LoaderProcessor.java#L172]
> {code:java}
> tezOp.getLoaderInfo().setInputSplitInfo(MRInputHelpers.generateInputSplitsToMem(conf,
>  false, 0));
> {code}
> It ends up serializing the splits just to print a log message.
> [https://github.com/apache/tez/blob/master/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/hadoop/MRInputHelpers.java#L317]
> {code:java}
>   public static InputSplitInfoMem generateInputSplitsToMem(Configuration conf,
>   boolean groupSplits, boolean sortSplits, int targetTasks)
>   throws IOException, ClassNotFoundException, InterruptedException {
>   
>   
>   LOG.info("NumSplits: " + splitInfoMem.getNumTasks() + ", 
> SerializedSize: "
> + splitInfoMem.getSplitsProto().getSerializedSize());
> return splitInfoMem;
> {code}
> [https://github.com/apache/tez/blob/master/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/hadoop/InputSplitInfoMem.java#L106]
> {code:java}
>   public MRSplitsProto getSplitsProto() {
> if (isNewSplit) {
>   try {
> return createSplitsProto(newFormatSplits, new 
> SerializationFactory(conf));
> {code}
> [https://github.com/apache/tez/blob/master/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/hadoop/InputSplitInfoMem.java#L152-L170]
> {code:java}
>   private static MRSplitsProto createSplitsProto(
>   org.apache.hadoop.mapreduce.InputSplit[] newSplits,
>   SerializationFactory serializationFactory) throws IOException,
>   InterruptedException {
> MRSplitsProto.Builder splitsBuilder = MRSplitsProto.newBuilder();
> for (org.apache.hadoop.mapreduce.InputSplit newSplit : newSplits) {
>   splitsBuilder.addSplits(MRInputHelpers.createSplitProto(newSplit, 
> serializationFactory));
> }
> return splitsBuilder.build();
>   }
> {code}
> [https://github.com/apache/tez/blob/master/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/hadoop/MRInputHelpers.java#L221-L259]
> 2. In TezDagBuilder, if splitsSerializedSize > spillThreshold, then the 
> InputSplits serialized in MRSplitsProto are not used by Pig, and it serializes 
> them again directly to disk via JobSplitWriter.createSplitFiles. So the 
> InputSplit serialization logic is called again, which is wasteful and 
> expensive in cases like HCat.
> [https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/TezDagBuilder.java#L946-L947]
> {code:java}
> MRSplitsProto splitsProto = inputSplitInfo.getSplitsProto();
> int splitsSerializedSize = splitsProto.getSerializedSize();
> {code}
> The getSplitsProto call creates an MRSplitsProto, which consists of a list of 
> MRSplitProto objects. An MRSplitProto holds the serialized bytes of each 
> InputSplit. If splitsSerializedSize > spillThreshold, Pig writes the splits to 
> disk via
> {code:java}
> if(splitsSerializedSize > spillThreshold) {
> inputPayLoad.setBoolean(
> 
> org.apache.tez.mapreduce.hadoop.MRJobConfig.MR_TEZ_SPLITS_VIA_EVENTS,
> false);
> // Write splits to disk
> Path inputSplitsDir = FileLocalizer.getTemporaryPath(pc);
> log.info("Writing input splits to " + inputSplitsDir
> + " for vertex " + vertex.getName()
> + " as the serialized size in memory is "
> + splitsSerializedSize + ". Configured "
> + PigConfiguration.PIG_TEZ_INPUT_SPLITS_MEM_THRESHOLD
> + " is " + spillThreshold);
> inputSplitInfo = MRToTezHelper.writeInputSplitInfoToDisk(
> (InputSplitInfoMem)inputSplitInfo, inputSplitsDir, payloadConf, 
> fs);
> {code}
> [https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/TezDagBuilder.java#L960]
>  
> [https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/util/MRToTezHelper.java#L302-L314]
> Solution:
>  1. Do not serialize the split in LoaderProcessor.java
>  2. In TezDagBuilder.java, serialize each input split and keep adding its 
> size and if it exceeds spillThreshold, then write the splits to disk reusing 
> the serialized buffers for each split.
>  
> Thank you [~rohini] for identifying the issue.



--
This message 

[jira] [Updated] (PIG-5359) Reduce time spent in split serialization

2018-10-01 Thread Satish Subhashrao Saley (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-5359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-5359:
-
Status: Patch Available  (was: Open)

> Reduce time spent in split serialization
> 
>
> Key: PIG-5359
> URL: https://issues.apache.org/jira/browse/PIG-5359
> Project: Pig
>  Issue Type: Improvement
>Reporter: Satish Subhashrao Saley
>Assignee: Satish Subhashrao Saley
>Priority: Major
>
> 1. Unnecessary serialization of splits in Tez.
>  In LoaderProcessor, pig calls
>  
> [https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/optimizer/LoaderProcessor.java#L172]
> {code:java}
> tezOp.getLoaderInfo().setInputSplitInfo(MRInputHelpers.generateInputSplitsToMem(conf,
>  false, 0));
> {code}
> It ends up serializing the splits just to print a log message.
> [https://github.com/apache/tez/blob/master/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/hadoop/MRInputHelpers.java#L317]
> {code:java}
>   public static InputSplitInfoMem generateInputSplitsToMem(Configuration conf,
>   boolean groupSplits, boolean sortSplits, int targetTasks)
>   throws IOException, ClassNotFoundException, InterruptedException {
>   
>   
>   LOG.info("NumSplits: " + splitInfoMem.getNumTasks() + ", 
> SerializedSize: "
> + splitInfoMem.getSplitsProto().getSerializedSize());
> return splitInfoMem;
> {code}
> [https://github.com/apache/tez/blob/master/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/hadoop/InputSplitInfoMem.java#L106]
> {code:java}
>   public MRSplitsProto getSplitsProto() {
> if (isNewSplit) {
>   try {
> return createSplitsProto(newFormatSplits, new 
> SerializationFactory(conf));
> {code}
> [https://github.com/apache/tez/blob/master/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/hadoop/InputSplitInfoMem.java#L152-L170]
> {code:java}
>   private static MRSplitsProto createSplitsProto(
>   org.apache.hadoop.mapreduce.InputSplit[] newSplits,
>   SerializationFactory serializationFactory) throws IOException,
>   InterruptedException {
> MRSplitsProto.Builder splitsBuilder = MRSplitsProto.newBuilder();
> for (org.apache.hadoop.mapreduce.InputSplit newSplit : newSplits) {
>   splitsBuilder.addSplits(MRInputHelpers.createSplitProto(newSplit, 
> serializationFactory));
> }
> return splitsBuilder.build();
>   }
> {code}
> [https://github.com/apache/tez/blob/master/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/hadoop/MRInputHelpers.java#L221-L259]
> 2. In TezDagBuilder, if splitsSerializedSize > spillThreshold, then the 
> InputSplits serialized in MRSplitsProto are not used by Pig, and it serializes 
> them again directly to disk via JobSplitWriter.createSplitFiles. So the 
> InputSplit serialization logic is called again, which is wasteful and 
> expensive in cases like HCat.
> [https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/TezDagBuilder.java#L946-L947]
> {code:java}
> MRSplitsProto splitsProto = inputSplitInfo.getSplitsProto();
> int splitsSerializedSize = splitsProto.getSerializedSize();
> {code}
> The getSplitsProto call creates an MRSplitsProto, which consists of a list of 
> MRSplitProto objects. An MRSplitProto holds the serialized bytes of each 
> InputSplit. If splitsSerializedSize > spillThreshold, Pig writes the splits to 
> disk via
> {code:java}
> if(splitsSerializedSize > spillThreshold) {
> inputPayLoad.setBoolean(
> 
> org.apache.tez.mapreduce.hadoop.MRJobConfig.MR_TEZ_SPLITS_VIA_EVENTS,
> false);
> // Write splits to disk
> Path inputSplitsDir = FileLocalizer.getTemporaryPath(pc);
> log.info("Writing input splits to " + inputSplitsDir
> + " for vertex " + vertex.getName()
> + " as the serialized size in memory is "
> + splitsSerializedSize + ". Configured "
> + PigConfiguration.PIG_TEZ_INPUT_SPLITS_MEM_THRESHOLD
> + " is " + spillThreshold);
> inputSplitInfo = MRToTezHelper.writeInputSplitInfoToDisk(
> (InputSplitInfoMem)inputSplitInfo, inputSplitsDir, payloadConf, 
> fs);
> {code}
> [https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/TezDagBuilder.java#L960]
>  
> [https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/util/MRToTezHelper.java#L302-L314]
> Solution:
>  1. Do not serialize the split in LoaderProcessor.java
>  2. In TezDagBuilder.java, serialize each input split and keep adding its 
> size and if it exceeds spillThreshold, then write the splits to disk reusing 
> the serialized buffers for each split.
>  
> Thank you [~rohini] for identifying the issue.



--
This message was sent by Atlassian JIRA

[jira] [Commented] (PIG-5342) Add setting to turn off bloom join combiner

2018-10-01 Thread Satish Subhashrao Saley (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16634662#comment-16634662
 ] 

Satish Subhashrao Saley commented on PIG-5342:
--

Updated patch.

> Add setting to turn off bloom join combiner
> ---
>
> Key: PIG-5342
> URL: https://issues.apache.org/jira/browse/PIG-5342
> Project: Pig
>  Issue Type: Sub-task
>Reporter: Satish Subhashrao Saley
>Assignee: Satish Subhashrao Saley
>Priority: Major
> Attachments: PIG-5342-1.patch, PIG-5342-2.patch, PIG-5342-3.patch, 
> PIG-5342-4.patch, PIG-5342-5.patch, PIG-5342-6.patch
>
>
> 1) Need a new setting, pig.bloomjoin.nocombiner, to turn off the combiner for 
> bloom join. When the keys are all unique, the combiner is unnecessary overhead.
> 2) In the previous case, the keys were the bloom filter index and the values 
> were the join key. Combining involved doing a distinct on the bag of values, 
> which has memory issues for more than 10 million records. That needs to be 
> flipped and a distinct combiner used to scale to billions of records.
> 3) Mention in the documentation that bloom join is also ideal in cases of a 
> right outer join with a smaller dataset on the right. Replicate join only 
> supports left outer join.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PIG-5342) Add setting to turn off bloom join combiner

2018-10-01 Thread Satish Subhashrao Saley (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-5342:
-
Attachment: PIG-5342-6.patch

> Add setting to turn off bloom join combiner
> ---
>
> Key: PIG-5342
> URL: https://issues.apache.org/jira/browse/PIG-5342
> Project: Pig
>  Issue Type: Sub-task
>Reporter: Satish Subhashrao Saley
>Assignee: Satish Subhashrao Saley
>Priority: Major
> Attachments: PIG-5342-1.patch, PIG-5342-2.patch, PIG-5342-3.patch, 
> PIG-5342-4.patch, PIG-5342-5.patch, PIG-5342-6.patch
>
>
> 1) Need a new setting pig.bloomjoin.nocombiner to turn off combiner for bloom 
> join. When the keys are all unique, the combiner is unnecessary overhead.
> 2) In the previous case, the keys were the bloom filter index and the values were
> the join keys. Combining involved doing a distinct on the bag of values, which
> has memory issues for more than 10 million records. That needs to be flipped
> and a distinct combiner used to scale to billions of records.
> 3) Mention in documentation that bloom join is also ideal in cases of right 
> outer join with smaller dataset on the right. Replicate join only supports 
> left outer join.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PIG-5342) Add setting to turn off bloom join combiner

2018-10-01 Thread Satish Subhashrao Saley (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-5342:
-
Attachment: (was: PIG-5342-6.patch)

> Add setting to turn off bloom join combiner
> ---
>
> Key: PIG-5342
> URL: https://issues.apache.org/jira/browse/PIG-5342
> Project: Pig
>  Issue Type: Sub-task
>Reporter: Satish Subhashrao Saley
>Assignee: Satish Subhashrao Saley
>Priority: Major
> Attachments: PIG-5342-1.patch, PIG-5342-2.patch, PIG-5342-3.patch, 
> PIG-5342-4.patch, PIG-5342-5.patch, PIG-5342-6.patch
>
>
> 1) Need a new setting pig.bloomjoin.nocombiner to turn off combiner for bloom 
> join. When the keys are all unique, the combiner is unnecessary overhead.
> 2) In the previous case, the keys were the bloom filter index and the values were
> the join keys. Combining involved doing a distinct on the bag of values, which
> has memory issues for more than 10 million records. That needs to be flipped
> and a distinct combiner used to scale to billions of records.
> 3) Mention in documentation that bloom join is also ideal in cases of right 
> outer join with smaller dataset on the right. Replicate join only supports 
> left outer join.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PIG-5342) Add setting to turn off bloom join combiner

2018-10-01 Thread Satish Subhashrao Saley (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-5342:
-
Attachment: PIG-5342-6.patch

> Add setting to turn off bloom join combiner
> ---
>
> Key: PIG-5342
> URL: https://issues.apache.org/jira/browse/PIG-5342
> Project: Pig
>  Issue Type: Sub-task
>Reporter: Satish Subhashrao Saley
>Assignee: Satish Subhashrao Saley
>Priority: Major
> Attachments: PIG-5342-1.patch, PIG-5342-2.patch, PIG-5342-3.patch, 
> PIG-5342-4.patch, PIG-5342-5.patch, PIG-5342-6.patch
>
>
> 1) Need a new setting pig.bloomjoin.nocombiner to turn off combiner for bloom 
> join. When the keys are all unique, the combiner is unnecessary overhead.
> 2) In the previous case, the keys were the bloom filter index and the values were
> the join keys. Combining involved doing a distinct on the bag of values, which
> has memory issues for more than 10 million records. That needs to be flipped
> and a distinct combiner used to scale to billions of records.
> 3) Mention in documentation that bloom join is also ideal in cases of right 
> outer join with smaller dataset on the right. Replicate join only supports 
> left outer join.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PIG-3038) Support for Credentials for UDF,Loader and Storer

2018-09-26 Thread Satish Subhashrao Saley (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-3038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16629099#comment-16629099
 ] 

Satish Subhashrao Saley commented on PIG-3038:
--

Updated patch

> Support for Credentials for UDF,Loader and Storer
> -
>
> Key: PIG-3038
> URL: https://issues.apache.org/jira/browse/PIG-3038
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.10.0
>Reporter: Rohini Palaniswamy
>Assignee: Satish Subhashrao Saley
>Priority: Major
> Fix For: 0.18.0
>
> Attachments: PIG-3038-5.patch
>
>
>   Pig does not have a clean way (APIs) to support adding Credentials (hbase
> token, hcat/hive metastore token) to the Job and retrieving them.
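
To illustrate the kind of hook being asked for here, a purely hypothetical sketch (the names below are illustrative and are not taken from the attached patch): a UDF, Loader or Storer would get a chance to contribute its delegation tokens at job-submission time.

{code:java}
// Hypothetical sketch only, not the PIG-3038 API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.Credentials;

public interface CredentialsProvider {
    // Called on the frontend before launch so an implementation can add, for example,
    // an HBase or Hive metastore delegation token to the job's credentials.
    void addCredentials(Credentials credentials, Configuration conf);
}
{code}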



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PIG-3038) Support for Credentials for UDF,Loader and Storer

2018-09-26 Thread Satish Subhashrao Saley (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-3038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-3038:
-
Attachment: PIG-3038-5.patch

> Support for Credentials for UDF,Loader and Storer
> -
>
> Key: PIG-3038
> URL: https://issues.apache.org/jira/browse/PIG-3038
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.10.0
>Reporter: Rohini Palaniswamy
>Assignee: Satish Subhashrao Saley
>Priority: Major
> Fix For: 0.18.0
>
> Attachments: PIG-3038-5.patch
>
>
>   Pig does not have a clean way (APIs) to support adding Credentials (hbase
> token, hcat/hive metastore token) to the Job and retrieving them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PIG-3038) Support for Credentials for UDF,Loader and Storer

2018-09-21 Thread Satish Subhashrao Saley (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-3038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-3038:
-
Status: Patch Available  (was: Open)

> Support for Credentials for UDF,Loader and Storer
> -
>
> Key: PIG-3038
> URL: https://issues.apache.org/jira/browse/PIG-3038
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.10.0
>Reporter: Rohini Palaniswamy
>Assignee: Satish Subhashrao Saley
>Priority: Major
> Fix For: 0.18.0
>
>
>   Pig does not have a clean way (APIs) to support adding Credentials (hbase
> token, hcat/hive metastore token) to the Job and retrieving them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PIG-5317) Upgrade old dependencies: commons-lang, hsqldb, commons-logging

2018-09-19 Thread Satish Subhashrao Saley (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-5317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16621188#comment-16621188
 ] 

Satish Subhashrao Saley commented on PIG-5317:
--

You can try the {{CastScalar}} e2e test in Tez mode. It will fail.
{code:java}
ERROR TestDriver::runTestGroup at : 729 Failed to run test CastScalar_11 
_Failed running ./CastScalar_11.pig

Dumping logfile <>/CastScalar_11.log ===
Pig Stack Trace
---
ERROR 2998: Unhandled internal error. org/apache/commons/lang3/ArrayUtils

java.lang.NoClassDefFoundError: org/apache/commons/lang3/ArrayUtils
at org.apache.pig.backend.hadoop.executionengine.tez.util.TezCompilerUtil.replaceOutput(TezCompilerUtil.java:192)
at org.apache.pig.backend.hadoop.executionengine.tez.util.TezCompilerUtil.connectTezOpToNewSuccesor(TezCompilerUtil.java:182)
at org.apache.pig.backend.hadoop.executionengine.tez.plan.optimizer.MultiQueryOptimizerTez.removeSplittee(MultiQueryOptimizerTez.java:324)
at org.apache.pig.backend.hadoop.executionengine.tez.plan.optimizer.MultiQueryOptimizerTez.visitTezOp(MultiQueryOptimizerTez.java:289)
at org.apache.pig.backend.hadoop.executionengine.tez.plan.TezOperator.visit(TezOperator.java:265)
at org.apache.pig.backend.hadoop.executionengine.tez.plan.TezOperator.visit(TezOperator.java:56)
at org.apache.pig.impl.plan.ReverseDependencyOrderWalker.walk(ReverseDependencyOrderWalker.java:71)
at org.apache.pig.impl.plan.PlanVisitor.visit(PlanVisitor.java:46)
at org.apache.pig.backend.hadoop.executionengine.tez.TezLauncher.optimize(TezLauncher.java:482)
at org.apache.pig.backend.hadoop.executionengine.tez.TezLauncher.compile(TezLauncher.java:431)
at org.apache.pig.backend.hadoop.executionengine.tez.TezLauncher.launchPig(TezLauncher.java:172)
at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:290)
at org.apache.pig.PigServer.launchPlan(PigServer.java:1479)
at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1464)
at org.apache.pig.PigServer.execute(PigServer.java:1453)
at org.apache.pig.PigServer.executeBatch(PigServer.java:489)
at org.apache.pig.PigServer.executeBatch(PigServer.java:472)
at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:172)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:235)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:206)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:81)
at org.apache.pig.Main.run(Main.java:630)
at org.apache.pig.Main.main(Main.java:175)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:239)
at org.apache.hadoop.util.RunJar.main(RunJar.java:153)
Caused by: java.lang.ClassNotFoundException: org.apache.commons.lang3.ArrayUtils
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 29 more
{code}

> Upgrade old dependencies: commons-lang, hsqldb, commons-logging
> ---
>
> Key: PIG-5317
> URL: https://issues.apache.org/jira/browse/PIG-5317
> Project: Pig
>  Issue Type: Improvement
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
>Priority: Minor
> Fix For: 0.18.0
>
> Attachments: PIG-5317_1.patch, PIG-5317_2.patch, 
> PIG-5317_amend.patch, PIG-5317_without_new_dep.patch
>
>
> Pig depends on old version of commons-lang, hsqldb and commons-logging. It 
> would be nice to upgrade the version of these dependencies, for commons-lang 
> Pig should depend on commons-lang3 instead (which is already present in the 
> ivy.xml)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (PIG-5359) Reduce time spent in split serialization

2018-09-17 Thread Satish Subhashrao Saley (JIRA)
Satish Subhashrao Saley created PIG-5359:


 Summary: Reduce time spent in split serialization
 Key: PIG-5359
 URL: https://issues.apache.org/jira/browse/PIG-5359
 Project: Pig
  Issue Type: Improvement
Reporter: Satish Subhashrao Saley
Assignee: Satish Subhashrao Saley


1. Unnecessary serialization of splits in Tez.
 In LoaderProcessor, pig calls
 
[https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/plan/optimizer/LoaderProcessor.java#L172]
{code:java}
tezOp.getLoaderInfo().setInputSplitInfo(MRInputHelpers.generateInputSplitsToMem(conf,
 false, 0));
{code}
It ends up serializing the splits just to print a log message.

[https://github.com/apache/tez/blob/master/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/hadoop/MRInputHelpers.java#L317]
{code:java}
  public static InputSplitInfoMem generateInputSplitsToMem(Configuration conf,
      boolean groupSplits, boolean sortSplits, int targetTasks)
      throws IOException, ClassNotFoundException, InterruptedException {
    // ...
    LOG.info("NumSplits: " + splitInfoMem.getNumTasks() + ", SerializedSize: "
        + splitInfoMem.getSplitsProto().getSerializedSize());
    return splitInfoMem;

{code}
[https://github.com/apache/tez/blob/master/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/hadoop/InputSplitInfoMem.java#L106]
{code:java}
  public MRSplitsProto getSplitsProto() {
    if (isNewSplit) {
      try {
        return createSplitsProto(newFormatSplits, new SerializationFactory(conf));

{code}
[https://github.com/apache/tez/blob/master/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/hadoop/InputSplitInfoMem.java#L152-L170]
{code:java}
  private static MRSplitsProto createSplitsProto(
      org.apache.hadoop.mapreduce.InputSplit[] newSplits,
      SerializationFactory serializationFactory) throws IOException,
      InterruptedException {
    MRSplitsProto.Builder splitsBuilder = MRSplitsProto.newBuilder();

    for (org.apache.hadoop.mapreduce.InputSplit newSplit : newSplits) {
      splitsBuilder.addSplits(MRInputHelpers.createSplitProto(newSplit, serializationFactory));
    }
    return splitsBuilder.build();
  }

{code}
[https://github.com/apache/tez/blob/master/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/hadoop/MRInputHelpers.java#L221-L259]

2. In TezDagBuilder, if splitsSerializedSize > spillThreshold, then the
InputSplits already serialized into MRSplitsProto are not used by Pig; it serializes
them again directly to disk via JobSplitWriter.createSplitFiles. So the InputSplit
serialization logic runs a second time, which is wasteful and expensive in cases
like HCat.

[https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/TezDagBuilder.java#L946-L947]
{code:java}
MRSplitsProto splitsProto = inputSplitInfo.getSplitsProto();
int splitsSerializedSize = splitsProto.getSerializedSize();
{code}
The getSplitsProto call creates an MRSplitsProto, which consists of a list of
MRSplitProto entries; each MRSplitProto holds the serialized bytes of an InputSplit. If
splitsSerializedSize > spillThreshold, Pig writes the splits to disk via
{code:java}
if (splitsSerializedSize > spillThreshold) {
    inputPayLoad.setBoolean(
            org.apache.tez.mapreduce.hadoop.MRJobConfig.MR_TEZ_SPLITS_VIA_EVENTS,
            false);
    // Write splits to disk
    Path inputSplitsDir = FileLocalizer.getTemporaryPath(pc);
    log.info("Writing input splits to " + inputSplitsDir
            + " for vertex " + vertex.getName()
            + " as the serialized size in memory is "
            + splitsSerializedSize + ". Configured "
            + PigConfiguration.PIG_TEZ_INPUT_SPLITS_MEM_THRESHOLD
            + " is " + spillThreshold);
    inputSplitInfo = MRToTezHelper.writeInputSplitInfoToDisk(
            (InputSplitInfoMem) inputSplitInfo, inputSplitsDir, payloadConf, fs);

{code}
[https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/TezDagBuilder.java#L960]
 
[https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/util/MRToTezHelper.java#L302-L314]

Solution:
 1. Do not serialize the split in LoaderProcessor.java
 2. In TezDagBuilder.java, serialize each input split and keep a running total of the
serialized sizes; if the total exceeds spillThreshold, write the splits to disk,
reusing the already-serialized buffer for each split (see the sketch below).
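
A minimal sketch of what item 2 could look like (a hypothetical helper, not the actual patch), assuming the serialized MRSplitProto buffers are kept so that nothing is serialized twice:

{code:java}
// Illustrative only: serialize each split once, keep a running total of the serialized
// sizes, and signal a spill as soon as the total crosses spillThreshold. The caller can
// then write the already-serialized protos to disk instead of re-serializing every split.
private static boolean serializeSplits(
        org.apache.hadoop.mapreduce.InputSplit[] splits,
        SerializationFactory serializationFactory,
        long spillThreshold,
        List<MRSplitProto> serializedSplits) throws IOException, InterruptedException {
    long runningSize = 0;
    for (org.apache.hadoop.mapreduce.InputSplit split : splits) {
        MRSplitProto proto = MRInputHelpers.createSplitProto(split, serializationFactory);
        serializedSplits.add(proto);               // keep the buffer; never re-serialize it
        runningSize += proto.getSerializedSize();
        if (runningSize > spillThreshold) {
            return true;                           // caller writes serializedSplits to disk
        }
    }
    return false;                                  // small enough to ship via Tez events
}
{code}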

 

Thank you [~rohini] for identifying the issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PIG-5355) Negative progress report by HBaseTableRecordReader

2018-09-10 Thread Satish Subhashrao Saley (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-5355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16609512#comment-16609512
 ] 

Satish Subhashrao Saley commented on PIG-5355:
--

updated patch

> Negative progress report by HBaseTableRecordReader
> --
>
> Key: PIG-5355
> URL: https://issues.apache.org/jira/browse/PIG-5355
> Project: Pig
>  Issue Type: Bug
>Reporter: Satish Subhashrao Saley
>Assignee: Satish Subhashrao Saley
>Priority: Major
> Attachments: PIG-5355-1.patch, PIG-5355-2.patch, PIG-5355-3.patch
>
>
> The logic for padding the current row does not consider the updated padded 
> row during the comparison. It ends up with a different length than expected.
> This results in a negative value for {{processed}}.
> {code}
> byte[] lastPadded = currRow_;
> if (currRow_.length < endRow_.length) {
> lastPadded = Bytes.padTail(currRow_, endRow_.length - 
> currRow_.length);
> }
> if (currRow_.length < startRow_.length) {
> lastPadded = Bytes.padTail(currRow_, startRow_.length - 
> currRow_.length);
> }
> byte [] prependHeader = {1, 0};
> BigInteger bigLastRow = new BigInteger(Bytes.add(prependHeader, 
> lastPadded));
> if (bigLastRow.compareTo(bigEnd_) > 0) {
> return progressSoFar_;
> }
> BigDecimal processed = new 
> BigDecimal(bigLastRow.subtract(bigStart_));
> {code}
> The fix is to use {{lastPadded}} in the second {{if}} comparison and 
> {{Bytes.padTail}} call inside that {{if}}
> PIG-4700 added progress reporting. This enabled ProgressHelper in Tez. It 
> calls {{getProgress}} [here 
> |https://github.com/apache/tez/blob/master/tez-api/src/main/java/org/apache/tez/common/ProgressHelper.java#L50]
>  on {{PigRecordReader}} 
> https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigRecordReader.java#L159
>  . Since Pig is reporting negative progress, job is getting killed by AM.
>  
>  
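
To make the described fix concrete, here is a sketch of the corrected padding logic (illustrative, not the attached patch): the second padding step starts from the result of the first one instead of from {{currRow_}} again.

{code:java}
// Sketch of the fix described above: base the second comparison and the padTail
// call on lastPadded, so the earlier end-row padding is not thrown away.
byte[] lastPadded = currRow_;
if (currRow_.length < endRow_.length) {
    lastPadded = Bytes.padTail(currRow_, endRow_.length - currRow_.length);
}
if (lastPadded.length < startRow_.length) {
    lastPadded = Bytes.padTail(lastPadded, startRow_.length - lastPadded.length);
}
{code}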



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PIG-5355) Negative progress report by HBaseTableRecordReader

2018-09-10 Thread Satish Subhashrao Saley (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-5355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-5355:
-
Attachment: PIG-5355-3.patch

> Negative progress report by HBaseTableRecordReader
> --
>
> Key: PIG-5355
> URL: https://issues.apache.org/jira/browse/PIG-5355
> Project: Pig
>  Issue Type: Bug
>Reporter: Satish Subhashrao Saley
>Assignee: Satish Subhashrao Saley
>Priority: Major
> Attachments: PIG-5355-1.patch, PIG-5355-2.patch, PIG-5355-3.patch
>
>
> The logic for padding the current row does not consider the updated padded 
> row during the comparison. It ends up with a different length than expected.
> This results in a negative value for {{processed}}.
> {code}
> byte[] lastPadded = currRow_;
> if (currRow_.length < endRow_.length) {
> lastPadded = Bytes.padTail(currRow_, endRow_.length - 
> currRow_.length);
> }
> if (currRow_.length < startRow_.length) {
> lastPadded = Bytes.padTail(currRow_, startRow_.length - 
> currRow_.length);
> }
> byte [] prependHeader = {1, 0};
> BigInteger bigLastRow = new BigInteger(Bytes.add(prependHeader, 
> lastPadded));
> if (bigLastRow.compareTo(bigEnd_) > 0) {
> return progressSoFar_;
> }
> BigDecimal processed = new 
> BigDecimal(bigLastRow.subtract(bigStart_));
> {code}
> The fix is to use {{lastPadded}} in the second {{if}} comparison and 
> {{Bytes.padTail}} call inside that {{if}}
> PIG-4700 added progress reporting. This enabled ProgressHelper in Tez. It 
> calls {{getProgress}} [here 
> |https://github.com/apache/tez/blob/master/tez-api/src/main/java/org/apache/tez/common/ProgressHelper.java#L50]
>  on {{PigRecordReader}} 
> https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigRecordReader.java#L159
>  . Since Pig is reporting negative progress, job is getting killed by AM.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (PIG-3038) Support for Credentials for UDF,Loader and Storer

2018-09-10 Thread Satish Subhashrao Saley (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-3038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley reassigned PIG-3038:


Assignee: Satish Subhashrao Saley

> Support for Credentials for UDF,Loader and Storer
> -
>
> Key: PIG-3038
> URL: https://issues.apache.org/jira/browse/PIG-3038
> Project: Pig
>  Issue Type: New Feature
>Affects Versions: 0.10.0
>Reporter: Rohini Palaniswamy
>Assignee: Satish Subhashrao Saley
>Priority: Major
> Fix For: 0.18.0
>
>
>   Pig does not have a clean way (APIs) to support adding Credentials (hbase
> token, hcat/hive metastore token) to the Job and retrieving them.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PIG-5355) Negative progress report by HBaseTableRecordReader

2018-09-07 Thread Satish Subhashrao Saley (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-5355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-5355:
-
Attachment: PIG-5355-2.patch

> Negative progress report by HBaseTableRecordReader
> --
>
> Key: PIG-5355
> URL: https://issues.apache.org/jira/browse/PIG-5355
> Project: Pig
>  Issue Type: Bug
>Reporter: Satish Subhashrao Saley
>Assignee: Satish Subhashrao Saley
>Priority: Major
> Attachments: PIG-5355-1.patch, PIG-5355-2.patch
>
>
> The logic for padding the current row does not consider the updated padded 
> row during the comparison. It ends up with a different length than expected.
> This results in a negative value for {{processed}}.
> {code}
> byte[] lastPadded = currRow_;
> if (currRow_.length < endRow_.length) {
> lastPadded = Bytes.padTail(currRow_, endRow_.length - 
> currRow_.length);
> }
> if (currRow_.length < startRow_.length) {
> lastPadded = Bytes.padTail(currRow_, startRow_.length - 
> currRow_.length);
> }
> byte [] prependHeader = {1, 0};
> BigInteger bigLastRow = new BigInteger(Bytes.add(prependHeader, 
> lastPadded));
> if (bigLastRow.compareTo(bigEnd_) > 0) {
> return progressSoFar_;
> }
> BigDecimal processed = new 
> BigDecimal(bigLastRow.subtract(bigStart_));
> {code}
> The fix is to use {{lastPadded}} in the second {{if}} comparison and 
> {{Bytes.padTail}} call inside that {{if}}
> PIG-4700 added progress reporting. This enabled ProgressHelper in Tez. It 
> calls {{getProgress}} [here 
> |https://github.com/apache/tez/blob/master/tez-api/src/main/java/org/apache/tez/common/ProgressHelper.java#L50]
>  on {{PigRecordReader}} 
> https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigRecordReader.java#L159
>  . Since Pig is reporting negative progress, job is getting killed by AM.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PIG-5355) Negative progress report by HBaseTableRecordReader

2018-09-04 Thread Satish Subhashrao Saley (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-5355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-5355:
-
Attachment: PIG-5355-1.patch

> Negative progress report by HBaseTableRecordReader
> --
>
> Key: PIG-5355
> URL: https://issues.apache.org/jira/browse/PIG-5355
> Project: Pig
>  Issue Type: Bug
>Reporter: Satish Subhashrao Saley
>Assignee: Satish Subhashrao Saley
>Priority: Major
> Attachments: PIG-5355-1.patch
>
>
> The logic for padding the current row does not consider the updated padded 
> row during the comparison. It ends up with a different length than expected.
> This results in a negative value for {{processed}}.
> {code}
> byte[] lastPadded = currRow_;
> if (currRow_.length < endRow_.length) {
> lastPadded = Bytes.padTail(currRow_, endRow_.length - 
> currRow_.length);
> }
> if (currRow_.length < startRow_.length) {
> lastPadded = Bytes.padTail(currRow_, startRow_.length - 
> currRow_.length);
> }
> byte [] prependHeader = {1, 0};
> BigInteger bigLastRow = new BigInteger(Bytes.add(prependHeader, 
> lastPadded));
> if (bigLastRow.compareTo(bigEnd_) > 0) {
> return progressSoFar_;
> }
> BigDecimal processed = new 
> BigDecimal(bigLastRow.subtract(bigStart_));
> {code}
> The fix is to use {{lastPadded}} in the second {{if}} comparison and 
> {{Bytes.padTail}} call inside that {{if}}
> PIG-4700 added progress reporting. This enabled ProgressHelper in Tez. It 
> calls {{getProgress}} [here 
> |https://github.com/apache/tez/blob/master/tez-api/src/main/java/org/apache/tez/common/ProgressHelper.java#L50]
>  on {{PigRecordReader}} 
> https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigRecordReader.java#L159
>  . Since Pig is reporting negative progress, job is getting killed by AM.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PIG-5355) Negative progress report by HBaseTableRecordReader

2018-09-04 Thread Satish Subhashrao Saley (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-5355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-5355:
-
Status: Patch Available  (was: Open)

> Negative progress report by HBaseTableRecordReader
> --
>
> Key: PIG-5355
> URL: https://issues.apache.org/jira/browse/PIG-5355
> Project: Pig
>  Issue Type: Bug
>Reporter: Satish Subhashrao Saley
>Assignee: Satish Subhashrao Saley
>Priority: Major
> Attachments: PIG-5355-1.patch
>
>
> The logic for padding the current row does not consider the updated padded 
> row during the comparison. It ends up with a different length than expected.
> This results in a negative value for {{processed}}.
> {code}
> byte[] lastPadded = currRow_;
> if (currRow_.length < endRow_.length) {
> lastPadded = Bytes.padTail(currRow_, endRow_.length - 
> currRow_.length);
> }
> if (currRow_.length < startRow_.length) {
> lastPadded = Bytes.padTail(currRow_, startRow_.length - 
> currRow_.length);
> }
> byte [] prependHeader = {1, 0};
> BigInteger bigLastRow = new BigInteger(Bytes.add(prependHeader, 
> lastPadded));
> if (bigLastRow.compareTo(bigEnd_) > 0) {
> return progressSoFar_;
> }
> BigDecimal processed = new 
> BigDecimal(bigLastRow.subtract(bigStart_));
> {code}
> The fix is to use {{lastPadded}} in the second {{if}} comparison and 
> {{Bytes.padTail}} call inside that {{if}}
> PIG-4700 added progress reporting. This enabled ProgressHelper in Tez. It 
> calls {{getProgress}} [here 
> |https://github.com/apache/tez/blob/master/tez-api/src/main/java/org/apache/tez/common/ProgressHelper.java#L50]
>  on {{PigRecordReader}} 
> https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigRecordReader.java#L159
>  . Since Pig is reporting negative progress, job is getting killed by AM.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PIG-5355) Negative progress report by HBaseTableRecordReader

2018-08-29 Thread Satish Subhashrao Saley (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-5355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-5355:
-
Description: 
The logic for padding the current row does not consider the updated padded row 
during the comparison. It ends up with a different length than expected. This
results in a negative value for {{processed}}.

{code}
byte[] lastPadded = currRow_;
if (currRow_.length < endRow_.length) {
lastPadded = Bytes.padTail(currRow_, endRow_.length - 
currRow_.length);
}
if (currRow_.length < startRow_.length) {
lastPadded = Bytes.padTail(currRow_, startRow_.length - 
currRow_.length);
}

byte [] prependHeader = {1, 0};
BigInteger bigLastRow = new BigInteger(Bytes.add(prependHeader, 
lastPadded));
if (bigLastRow.compareTo(bigEnd_) > 0) {
return progressSoFar_;
}
BigDecimal processed = new 
BigDecimal(bigLastRow.subtract(bigStart_));
{code}
The fix is to use {{lastPadded}} in the second {{if}} comparison and 
{{Bytes.padTail}} call inside that {{if}}

PIG-4700 added progress reporting. This enabled ProgressHelper in Tez. It calls 
{{getProgress}} [here 
|https://github.com/apache/tez/blob/master/tez-api/src/main/java/org/apache/tez/common/ProgressHelper.java#L50]
 on {{PigRecordReader}} 
https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigRecordReader.java#L159
 . Since Pig is reporting negative progress, job is getting killed by AM.
 

 

  was:
The logic for padding the current row does not consider the updated padded row 
during the comparison. It ends up with a different length than expected. This
results in a negative value for {{processed}}.

{code}
byte[] lastPadded = currRow_;
if (currRow_.length < endRow_.length) {
lastPadded = Bytes.padTail(currRow_, endRow_.length - 
currRow_.length);
}
if (currRow_.length < startRow_.length) {
lastPadded = Bytes.padTail(currRow_, startRow_.length - 
currRow_.length);
}

byte [] prependHeader = {1, 0};
BigInteger bigLastRow = new BigInteger(Bytes.add(prependHeader, 
lastPadded));
if (bigLastRow.compareTo(bigEnd_) > 0) {
return progressSoFar_;
}
BigDecimal processed = new 
BigDecimal(bigLastRow.subtract(bigStart_));
{code}
The fix is to use {{lastPadded}} in the second {{if}} comparison and 
{{Bytes.padTail}} call inside that {{if}}
 

 


> Negative progress report by HBaseTableRecordReader
> --
>
> Key: PIG-5355
> URL: https://issues.apache.org/jira/browse/PIG-5355
> Project: Pig
>  Issue Type: Bug
>Reporter: Satish Subhashrao Saley
>Assignee: Satish Subhashrao Saley
>Priority: Major
>
> The logic for padding the current row does not consider the updated padded 
> row during the comparison. It ends up with a different length than expected.
> This results in a negative value for {{processed}}.
> {code}
> byte[] lastPadded = currRow_;
> if (currRow_.length < endRow_.length) {
> lastPadded = Bytes.padTail(currRow_, endRow_.length - 
> currRow_.length);
> }
> if (currRow_.length < startRow_.length) {
> lastPadded = Bytes.padTail(currRow_, startRow_.length - 
> currRow_.length);
> }
> byte [] prependHeader = {1, 0};
> BigInteger bigLastRow = new BigInteger(Bytes.add(prependHeader, 
> lastPadded));
> if (bigLastRow.compareTo(bigEnd_) > 0) {
> return progressSoFar_;
> }
> BigDecimal processed = new 
> BigDecimal(bigLastRow.subtract(bigStart_));
> {code}
> The fix is to use {{lastPadded}} in the second {{if}} comparison and 
> {{Bytes.padTail}} call inside that {{if}}
> PIG-4700 added progress reporting. This enabled ProgressHelper in Tez. It 
> calls {{getProgress}} [here 
> |https://github.com/apache/tez/blob/master/tez-api/src/main/java/org/apache/tez/common/ProgressHelper.java#L50]
>  on {{PigRecordReader}} 
> https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/mapReduceLayer/PigRecordReader.java#L159
>  . Since Pig is reporting negative progress, job is getting killed by AM.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (PIG-5355) Negative progress report by HBaseTableRecordReader

2018-08-29 Thread Satish Subhashrao Saley (JIRA)
Satish Subhashrao Saley created PIG-5355:


 Summary: Negative progress report by HBaseTableRecordReader
 Key: PIG-5355
 URL: https://issues.apache.org/jira/browse/PIG-5355
 Project: Pig
  Issue Type: Bug
Reporter: Satish Subhashrao Saley


The logic for padding the current row does not consider the updated padded row 
during the comparison. It ends up with a different length than expected. This
results in a negative value for {{processed}}.

{code}
byte[] lastPadded = currRow_;
if (currRow_.length < endRow_.length) {
lastPadded = Bytes.padTail(currRow_, endRow_.length - 
currRow_.length);
}
if (currRow_.length < startRow_.length) {
lastPadded = Bytes.padTail(currRow_, startRow_.length - 
currRow_.length);
}

byte [] prependHeader = {1, 0};
BigInteger bigLastRow = new BigInteger(Bytes.add(prependHeader, 
lastPadded));
if (bigLastRow.compareTo(bigEnd_) > 0) {
return progressSoFar_;
}
BigDecimal processed = new 
BigDecimal(bigLastRow.subtract(bigStart_));
{code}
The fix is to use {{lastPadded}} in the second {{if}} comparison and 
{{Bytes.padTail}} call inside that {{if}}
 

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (PIG-5355) Negative progress report by HBaseTableRecordReader

2018-08-29 Thread Satish Subhashrao Saley (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-5355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley reassigned PIG-5355:


Assignee: Satish Subhashrao Saley

> Negative progress report by HBaseTableRecordReader
> --
>
> Key: PIG-5355
> URL: https://issues.apache.org/jira/browse/PIG-5355
> Project: Pig
>  Issue Type: Bug
>Reporter: Satish Subhashrao Saley
>Assignee: Satish Subhashrao Saley
>Priority: Major
>
> The logic for padding the current row does not consider the updated padded 
> row during the comparison. It ends up with a different length than expected.
> This results in a negative value for {{processed}}.
> {code}
> byte[] lastPadded = currRow_;
> if (currRow_.length < endRow_.length) {
> lastPadded = Bytes.padTail(currRow_, endRow_.length - 
> currRow_.length);
> }
> if (currRow_.length < startRow_.length) {
> lastPadded = Bytes.padTail(currRow_, startRow_.length - 
> currRow_.length);
> }
> byte [] prependHeader = {1, 0};
> BigInteger bigLastRow = new BigInteger(Bytes.add(prependHeader, 
> lastPadded));
> if (bigLastRow.compareTo(bigEnd_) > 0) {
> return progressSoFar_;
> }
> BigDecimal processed = new 
> BigDecimal(bigLastRow.subtract(bigStart_));
> {code}
> The fix is to use {{lastPadded}} in the second {{if}} comparison and 
> {{Bytes.padTail}} call inside that {{if}}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PIG-5342) Add setting to turn off bloom join combiner

2018-07-06 Thread Satish Subhashrao Saley (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-5342:
-
Attachment: PIG-5342-5.patch

> Add setting to turn off bloom join combiner
> ---
>
> Key: PIG-5342
> URL: https://issues.apache.org/jira/browse/PIG-5342
> Project: Pig
>  Issue Type: Sub-task
>Reporter: Satish Subhashrao Saley
>Assignee: Satish Subhashrao Saley
>Priority: Major
> Attachments: PIG-5342-1.patch, PIG-5342-2.patch, PIG-5342-3.patch, 
> PIG-5342-4.patch, PIG-5342-5.patch
>
>
> 1) Need a new setting pig.bloomjoin.nocombiner to turn off combiner for bloom 
> join. When the keys are all unique, the combiner is unnecessary overhead.
> 2) In the previous case, the keys were the bloom filter index and the values were
> the join keys. Combining involved doing a distinct on the bag of values, which
> has memory issues for more than 10 million records. That needs to be flipped
> and a distinct combiner used to scale to billions of records.
> 3) Mention in documentation that bloom join is also ideal in cases of right 
> outer join with smaller dataset on the right. Replicate join only supports 
> left outer join.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (PIG-5347) Add new target for generating dependency tree

2018-07-06 Thread Satish Subhashrao Saley (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-5347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley resolved PIG-5347.
--
Resolution: Invalid

Ah... it is already there in the ivy-resolve target.

> Add new target for generating dependency tree
> -
>
> Key: PIG-5347
> URL: https://issues.apache.org/jira/browse/PIG-5347
> Project: Pig
>  Issue Type: Bug
>Reporter: Satish Subhashrao Saley
>Assignee: Satish Subhashrao Saley
>Priority: Major
>
> It would be really helpful in debugging dependency conflicts if we had an
> easy way to get the dependency tree. The ivy:report task
> (http://ant.apache.org/ivy/history/latest-milestone/use/report.html)
> generates an HTML report showing dependencies.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (PIG-5347) Add new target for generating dependency tree

2018-07-06 Thread Satish Subhashrao Saley (JIRA)
Satish Subhashrao Saley created PIG-5347:


 Summary: Add new target for generating dependency tree
 Key: PIG-5347
 URL: https://issues.apache.org/jira/browse/PIG-5347
 Project: Pig
  Issue Type: Bug
Reporter: Satish Subhashrao Saley
Assignee: Satish Subhashrao Saley


It would be really helpful in debugging dependency conflicts if we had an
easy way to get the dependency tree. The ivy:report task
(http://ant.apache.org/ivy/history/latest-milestone/use/report.html)
generates an HTML report showing dependencies.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PIG-5342) Add setting to turn off bloom join combiner

2018-07-06 Thread Satish Subhashrao Saley (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-5342:
-
Attachment: PIG-5342-4.patch

> Add setting to turn off bloom join combiner
> ---
>
> Key: PIG-5342
> URL: https://issues.apache.org/jira/browse/PIG-5342
> Project: Pig
>  Issue Type: Sub-task
>Reporter: Satish Subhashrao Saley
>Assignee: Satish Subhashrao Saley
>Priority: Major
> Attachments: PIG-5342-1.patch, PIG-5342-2.patch, PIG-5342-3.patch, 
> PIG-5342-4.patch
>
>
> 1) Need a new setting pig.bloomjoin.nocombiner to turn off combiner for bloom 
> join. When the keys are all unique, the combiner is unnecessary overhead.
> 2) In the previous case, the keys were the bloom filter index and the values were
> the join keys. Combining involved doing a distinct on the bag of values, which
> has memory issues for more than 10 million records. That needs to be flipped
> and a distinct combiner used to scale to billions of records.
> 3) Mention in documentation that bloom join is also ideal in cases of right 
> outer join with smaller dataset on the right. Replicate join only supports 
> left outer join.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PIG-5342) Add setting to turn off bloom join combiner

2018-07-06 Thread Satish Subhashrao Saley (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16535081#comment-16535081
 ] 

Satish Subhashrao Saley commented on PIG-5342:
--

Updated patch

> Add setting to turn off bloom join combiner
> ---
>
> Key: PIG-5342
> URL: https://issues.apache.org/jira/browse/PIG-5342
> Project: Pig
>  Issue Type: Sub-task
>Reporter: Satish Subhashrao Saley
>Assignee: Satish Subhashrao Saley
>Priority: Major
> Attachments: PIG-5342-1.patch, PIG-5342-2.patch, PIG-5342-3.patch, 
> PIG-5342-4.patch
>
>
> 1) Need a new setting pig.bloomjoin.nocombiner to turn off combiner for bloom 
> join. When the keys are all unique, the combiner is unnecessary overhead.
> 2) In the previous case, the keys were the bloom filter index and the values were
> the join keys. Combining involved doing a distinct on the bag of values, which
> has memory issues for more than 10 million records. That needs to be flipped
> and a distinct combiner used to scale to billions of records.
> 3) Mention in documentation that bloom join is also ideal in cases of right 
> outer join with smaller dataset on the right. Replicate join only supports 
> left outer join.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PIG-5342) Add setting to turn off bloom join combiner

2018-06-28 Thread Satish Subhashrao Saley (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16526504#comment-16526504
 ] 

Satish Subhashrao Saley commented on PIG-5342:
--

Updated patch.

 

> Add setting to turn off bloom join combiner
> ---
>
> Key: PIG-5342
> URL: https://issues.apache.org/jira/browse/PIG-5342
> Project: Pig
>  Issue Type: Sub-task
>Reporter: Satish Subhashrao Saley
>Assignee: Satish Subhashrao Saley
>Priority: Major
> Attachments: PIG-5342-1.patch, PIG-5342-2.patch, PIG-5342-3.patch
>
>
> 1) Need a new setting pig.bloomjoin.nocombiner to turn off combiner for bloom 
> join. When the keys are all unique, the combiner is unnecessary overhead.
> 2) In the previous case, the keys were the bloom filter index and the values were
> the join keys. Combining involved doing a distinct on the bag of values, which
> has memory issues for more than 10 million records. That needs to be flipped
> and a distinct combiner used to scale to billions of records.
> 3) Mention in documentation that bloom join is also ideal in cases of right 
> outer join with smaller dataset on the right. Replicate join only supports 
> left outer join.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PIG-5342) Add setting to turn off bloom join combiner

2018-06-28 Thread Satish Subhashrao Saley (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-5342:
-
Attachment: PIG-5342-3.patch

> Add setting to turn off bloom join combiner
> ---
>
> Key: PIG-5342
> URL: https://issues.apache.org/jira/browse/PIG-5342
> Project: Pig
>  Issue Type: Sub-task
>Reporter: Satish Subhashrao Saley
>Assignee: Satish Subhashrao Saley
>Priority: Major
> Attachments: PIG-5342-1.patch, PIG-5342-2.patch, PIG-5342-3.patch
>
>
> 1) Need a new setting pig.bloomjoin.nocombiner to turn off combiner for bloom 
> join. When the keys are all unique, the combiner is unnecessary overhead.
> 2) In the previous case, the keys were the bloom filter index and the values were
> the join keys. Combining involved doing a distinct on the bag of values, which
> has memory issues for more than 10 million records. That needs to be flipped
> and a distinct combiner used to scale to billions of records.
> 3) Mention in documentation that bloom join is also ideal in cases of right 
> outer join with smaller dataset on the right. Replicate join only supports 
> left outer join.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PIG-5342) Add setting to turn off bloom join combiner

2018-06-15 Thread Satish Subhashrao Saley (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16514304#comment-16514304
 ] 

Satish Subhashrao Saley commented on PIG-5342:
--

Updated the patch.

> Add setting to turn off bloom join combiner
> ---
>
> Key: PIG-5342
> URL: https://issues.apache.org/jira/browse/PIG-5342
> Project: Pig
>  Issue Type: Sub-task
>Reporter: Satish Subhashrao Saley
>Assignee: Satish Subhashrao Saley
>Priority: Major
> Attachments: PIG-5342-1.patch, PIG-5342-2.patch
>
>
> 1) Need a new setting pig.bloomjoin.nocombiner to turn off combiner for bloom 
> join. When the keys are all unique, the combiner is unnecessary overhead.
> 2) Mention in documentation that bloom join is also ideal in cases of right 
> outer join with smaller dataset on the right. Replicate join only supports 
> left outer join.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PIG-5342) Add setting to turn off bloom join combiner

2018-06-15 Thread Satish Subhashrao Saley (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-5342:
-
Attachment: PIG-5342-2.patch

> Add setting to turn off bloom join combiner
> ---
>
> Key: PIG-5342
> URL: https://issues.apache.org/jira/browse/PIG-5342
> Project: Pig
>  Issue Type: Sub-task
>Reporter: Satish Subhashrao Saley
>Assignee: Satish Subhashrao Saley
>Priority: Major
> Attachments: PIG-5342-1.patch, PIG-5342-2.patch
>
>
> 1) Need a new setting pig.bloomjoin.nocombiner to turn off combiner for bloom 
> join. When the keys are all unique, the combiner is unnecessary overhead.
> 2) Mention in documentation that bloom join is also ideal in cases of right 
> outer join with smaller dataset on the right. Replicate join only supports 
> left outer join.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (PIG-5255) Improvements to bloom join

2018-06-13 Thread Satish Subhashrao Saley (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-5255?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511566#comment-16511566
 ] 

Satish Subhashrao Saley commented on PIG-5255:
--

Created subtask PIG-5342 to address items 1 and 2.

> Improvements to bloom join
> --
>
> Key: PIG-5255
> URL: https://issues.apache.org/jira/browse/PIG-5255
> Project: Pig
>  Issue Type: New Feature
>Reporter: Rohini Palaniswamy
>Assignee: Satish Subhashrao Saley
>Priority: Major
> Fix For: 0.18.0
>
>
> 1) Need a new setting pig.bloomjoin.nocombiner to turn off combiner for bloom 
> join. When the keys are all unique, the combiner is unnecessary overhead.
> 2) Mention in documentation that bloom join is also ideal in cases of right 
> outer join with smaller dataset on the right. Replicate join only supports 
> left outer join.
> 3) Write own bloom implementation for Murmur3 and Murmur3 with Kirsch & 
> Mitzenmacher optimization which Cassandra uses 
> (http://spyced.blogspot.com/2009/01/all-you-ever-wanted-to-know-about.html). 
> Currently we use Hadoop's bloomfilter implementation which only has Jenkins 
> and Murmur2. Murmur3 is faster and offers better distribution.
> 4) Move from BitSet to RoaringBitMap for
>   - Speed and better compression
>   - Scale
>   Currently bloom join does not scale to billions of keys. We really need large
> bloom filters in those cases, and the cost of broadcasting them is greater than the
> actual data size. For example: a join of 32B records (4TB of data) with 4 billion
> records whose keys are mostly unique. Let's say we construct 61 partitioned
> bloom filters of 3MB each (still not a good enough bit vector size for the
> number of keys); that is close to 200MB. If we broadcast 200MB to 30K tasks it
> becomes 6TB, which is higher than the actual data size. In practice the broadcast
> would only be downloaded once per node. Even considering that, in a 6K-node
> cluster the amount of data transfer would be around 1.2TB. Using
> RoaringBitMap should make a big difference in this case.
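
For reference, a small sketch of the data-structure swap suggested in item 4 (illustrative only; hash and vectorSize are assumed inputs): a bloom filter bit vector backed by org.roaringbitmap.RoaringBitmap instead of java.util.BitSet.

{code:java}
// Illustrative sketch, not Pig code: RoaringBitmap offers the same set/test operations
// as BitSet but with compression, and serializedSizeInBytes() gives the on-the-wire
// size that matters when broadcasting the partitioned filters.
RoaringBitmap bits = new RoaringBitmap();
int bucket = (hash & Integer.MAX_VALUE) % vectorSize;  // hash, vectorSize: assumed inputs
bits.add(bucket);                                      // set the bit
boolean maybePresent = bits.contains(bucket);          // membership test
int broadcastSize = bits.serializedSizeInBytes();      // compressed size to broadcast
{code}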



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PIG-5342) Add setting to turn off combiner

2018-06-13 Thread Satish Subhashrao Saley (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-5342:
-
Status: Patch Available  (was: Open)

> Add setting to turn off combiner
> 
>
> Key: PIG-5342
> URL: https://issues.apache.org/jira/browse/PIG-5342
> Project: Pig
>  Issue Type: Sub-task
>Reporter: Satish Subhashrao Saley
>Assignee: Satish Subhashrao Saley
>Priority: Major
> Attachments: PIG-5342-1.patch
>
>
> 1) Need a new setting pig.bloomjoin.nocombiner to turn off combiner for bloom 
> join. When the keys are all unique, the combiner is unnecessary overhead.
> 2) Mention in documentation that bloom join is also ideal in cases of right 
> outer join with smaller dataset on the right. Replicate join only supports 
> left outer join.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PIG-5342) Add setting to turn off combiner

2018-06-13 Thread Satish Subhashrao Saley (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-5342:
-
Attachment: PIG-5342-1.patch

> Add setting to turn off combiner
> 
>
> Key: PIG-5342
> URL: https://issues.apache.org/jira/browse/PIG-5342
> Project: Pig
>  Issue Type: Sub-task
>Reporter: Satish Subhashrao Saley
>Assignee: Satish Subhashrao Saley
>Priority: Major
> Attachments: PIG-5342-1.patch
>
>
> 1) Need a new setting pig.bloomjoin.nocombiner to turn off combiner for bloom 
> join. When the keys are all unique, the combiner is unnecessary overhead.
> 2) Mention in documentation that bloom join is also ideal in cases of right 
> outer join with smaller dataset on the right. Replicate join only supports 
> left outer join.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PIG-5342) Add setting to turn off combiner

2018-06-13 Thread Satish Subhashrao Saley (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-5342:
-
Description: 
1) Need a new setting pig.bloomjoin.nocombiner to turn off combiner for bloom 
join. When the keys are all unique, the combiner is unnecessary overhead.
2) Mention in documentation that bloom join is also ideal in cases of right 
outer join with smaller dataset on the right. Replicate join only supports left 
outer join.

 

> Add setting to turn off combiner
> 
>
> Key: PIG-5342
> URL: https://issues.apache.org/jira/browse/PIG-5342
> Project: Pig
>  Issue Type: Sub-task
>Reporter: Satish Subhashrao Saley
>Assignee: Satish Subhashrao Saley
>Priority: Major
>
> 1) Need a new setting pig.bloomjoin.nocombiner to turn off combiner for bloom 
> join. When the keys are all unique, the combiner is unnecessary overhead.
> 2) Mention in documentation that bloom join is also ideal in cases of right 
> outer join with smaller dataset on the right. Replicate join only supports 
> left outer join.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (PIG-5342) Add setting to turn off combiner

2018-06-13 Thread Satish Subhashrao Saley (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley reassigned PIG-5342:


Assignee: Satish Subhashrao Saley

> Add setting to turn off combiner
> 
>
> Key: PIG-5342
> URL: https://issues.apache.org/jira/browse/PIG-5342
> Project: Pig
>  Issue Type: Sub-task
>Reporter: Satish Subhashrao Saley
>Assignee: Satish Subhashrao Saley
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (PIG-5342) Add setting to turn off combiner

2018-06-13 Thread Satish Subhashrao Saley (JIRA)
Satish Subhashrao Saley created PIG-5342:


 Summary: Add setting to turn off combiner
 Key: PIG-5342
 URL: https://issues.apache.org/jira/browse/PIG-5342
 Project: Pig
  Issue Type: Sub-task
Reporter: Satish Subhashrao Saley






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (PIG-5255) Improvements to bloom join

2018-06-13 Thread Satish Subhashrao Saley (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-5255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley reassigned PIG-5255:


Assignee: Satish Subhashrao Saley  (was: Rohini Palaniswamy)

> Improvements to bloom join
> --
>
> Key: PIG-5255
> URL: https://issues.apache.org/jira/browse/PIG-5255
> Project: Pig
>  Issue Type: New Feature
>Reporter: Rohini Palaniswamy
>Assignee: Satish Subhashrao Saley
>Priority: Major
> Fix For: 0.18.0
>
>
> 1) Need a new setting pig.bloomjoin.nocombiner to turn off combiner for bloom 
> join. When the keys are all unique, the combiner is unnecessary overhead.
> 2) Mention in documentation that bloom join is also ideal in cases of right 
> outer join with smaller dataset on the right. Replicate join only supports 
> left outer join.
> 3) Write own bloom implementation for Murmur3 and Murmur3 with Kirsch & 
> Mitzenmacher optimization which Cassandra uses 
> (http://spyced.blogspot.com/2009/01/all-you-ever-wanted-to-know-about.html). 
> Currently we use Hadoop's bloomfilter implementation which only has Jenkins 
> and Murmur2. Murmur3 is faster and offers better distribution.
> 4) Move from BitSet to RoaringBitMap for
>   - Speed and better compression
>   - Scale
>   Currently bloom join does not scale to billions of keys. We really need large
> bloom filters in those cases, and the cost of broadcasting them is greater than the
> actual data size. For example: a join of 32B records (4TB of data) with 4 billion
> records whose keys are mostly unique. Let's say we construct 61 partitioned
> bloom filters of 3MB each (still not a good enough bit vector size for the
> number of keys); that is close to 200MB. If we broadcast 200MB to 30K tasks it
> becomes 6TB, which is higher than the actual data size. In practice the broadcast
> would only be downloaded once per node. Even considering that, in a 6K-node
> cluster the amount of data transfer would be around 1.2TB. Using
> RoaringBitMap should make a big difference in this case.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (PIG-5315) pig.script is not set for scripts run via PigServer

2017-12-04 Thread Satish Subhashrao Saley (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-5315:
-
Attachment: PIG-5315-2.patch

> pig.script is not set for scripts run via PigServer
> ---
>
> Key: PIG-5315
> URL: https://issues.apache.org/jira/browse/PIG-5315
> Project: Pig
>  Issue Type: Bug
>Reporter: Rohini Palaniswamy
>Assignee: Satish Subhashrao Saley
>Priority: Minor
>  Labels: newbie
> Fix For: 0.18.0
>
> Attachments: PIG-5315-1.patch, PIG-5315-2.patch
>
>
> ScriptState.get().setScript() is only called in Main and BoundScript and not 
> in PigServer.registerScript
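
As a purely hypothetical illustration of the gap (not the attached patch), something along these lines would have to run inside PigServer.registerScript so that pig.script is also populated for programmatically submitted scripts; readScriptText is a stand-in helper, not an existing method.

{code:java}
// Hypothetical sketch only: record the script in ScriptState the same way Main and
// BoundScript already do, so pig.script is set for PigServer.registerScript callers.
String script = readScriptText(scriptInputStream);  // stand-in for reading the registered script
ScriptState.get().setScript(script);
{code}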



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PIG-5310) MergeJoin throwing NullPointer Exception

2017-11-28 Thread Satish Subhashrao Saley (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16269890#comment-16269890
 ] 

Satish Subhashrao Saley commented on PIG-5310:
--

I checked the execution plans for left/right outer joins using merge join; they take
a different code path.

> MergeJoin throwing NullPointer Exception
> 
>
> Key: PIG-5310
> URL: https://issues.apache.org/jira/browse/PIG-5310
> Project: Pig
>  Issue Type: Bug
>Reporter: Satish Subhashrao Saley
>Assignee: Satish Subhashrao Saley
> Attachments: PIG-5310-1.patch, PIG-5310-2.patch
>
>
> Merge join throws a NullPointerException if the left input's first key doesn't
> exist in the right input and is smaller than the first key of the right input.
> For ex
> |left|right|
> |1|3|
> |1|5|
> |1| |
> Error we get - 
> {code}
> ERROR 2998: Unhandled internal error. Vertex failed, vertexName=scope-16, 
> vertexId=vertex_1509400259446_0001_1_02, diagnostics=[Task failed, 
> taskId=task_1509400259446_0001_1_02_00, diagnostics=[TaskAttempt 0 
> failed, info=[Error: Error while running task ( failure ) : 
> attempt_1509400259446_0001_1_02_00_0:java.lang.NullPointerException
>   at java.lang.Integer.compareTo(Integer.java:1216)
>   at java.lang.Integer.compareTo(Integer.java:52)
>   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNextTuple(POMergeJoin.java:525)
>   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:305)
>   at org.apache.pig.backend.hadoop.executionengine.tez.plan.operator.POStoreTez.getNextTuple(POStoreTez.java:123)
>   at org.apache.pig.backend.hadoop.executionengine.tez.runtime.PigProcessor.runPipeline(PigProcessor.java:416)
>   at org.apache.pig.backend.hadoop.executionengine.tez.runtime.PigProcessor.run(PigProcessor.java:281)
>   at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:370)
>   at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
>   at org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1945)
>   at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
>   at org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
>   at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Here, the key used in the join is an integer. The Integer.compareTo(other) method
> throws a NullPointerException if the comparison is made against null.
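
The failure mode is easy to reproduce in isolation (a trivial sketch, independent of Pig):

{code:java}
Integer leftKey = 1;
Integer rightKey = null;          // the right side has no matching key
leftKey.compareTo(rightKey);      // throws java.lang.NullPointerException
{code}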



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (PIG-5310) MergeJoin throwing NullPointer Exception

2017-11-28 Thread Satish Subhashrao Saley (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16267896#comment-16267896
 ] 

Satish Subhashrao Saley edited comment on PIG-5310 at 11/28/17 7:53 PM:


Fixed the test case. 
The test case earlier was not throwing an exception without the fix because I was
not loading the key as an integer. By default it was a DataByteArray.


was (Author: satishsaley):
Fixed the test case.

> MergeJoin throwing NullPointer Exception
> 
>
> Key: PIG-5310
> URL: https://issues.apache.org/jira/browse/PIG-5310
> Project: Pig
>  Issue Type: Bug
>Reporter: Satish Subhashrao Saley
>Assignee: Satish Subhashrao Saley
> Attachments: PIG-5310-1.patch, PIG-5310-2.patch
>
>
> Merge join throws a NullPointerException if the left input's first key doesn't
> exist in the right input and is smaller than the first key of the right input.
> For ex
> |left|right|
> |1|3|
> |1|5|
> |1| |
> Error we get - 
> {code}
> ERROR 2998: Unhandled internal error. Vertex failed, vertexName=scope-16, 
> vertexId=vertex_1509400259446_0001_1_02, diagnostics=[Task failed, 
> taskId=task_1509400259446_0001_1_02_00, diagnostics=[TaskAttempt 0 
> failed, info=[Error: Error while running task ( failure ) : 
> attempt_1509400259446_0001_1_02_00_0:java.lang.NullPointerException
>   at java.lang.Integer.compareTo(Integer.java:1216)
>   at java.lang.Integer.compareTo(Integer.java:52)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNextTuple(POMergeJoin.java:525)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:305)
>   at 
> org.apache.pig.backend.hadoop.executionengine.tez.plan.operator.POStoreTez.getNextTuple(POStoreTez.java:123)
>   at 
> org.apache.pig.backend.hadoop.executionengine.tez.runtime.PigProcessor.runPipeline(PigProcessor.java:416)
>   at 
> org.apache.pig.backend.hadoop.executionengine.tez.runtime.PigProcessor.run(PigProcessor.java:281)
>   at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:370)
>   at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
>   at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1945)
>   at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
>   at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
>   at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Here, the key used in the join is an integer. The Integer.compareTo(other) method 
> throws a NullPointerException if the comparison is made against null. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (PIG-5310) MergeJoin throwing NullPointer Exception

2017-11-28 Thread Satish Subhashrao Saley (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-5310:
-
Description: 
Merge join throws a NullPointerException if the left input's first key doesn't 
exist in the right input and is smaller than the first key of the right input.
For example:

|left|right|
|1|3|
|1|5|
|1| |

Error we get - 
{code}
ERROR 2998: Unhandled internal error. Vertex failed, vertexName=scope-16, 
vertexId=vertex_1509400259446_0001_1_02, diagnostics=[Task failed, 
taskId=task_1509400259446_0001_1_02_00, diagnostics=[TaskAttempt 0 failed, 
info=[Error: Error while running task ( failure ) : 
attempt_1509400259446_0001_1_02_00_0:java.lang.NullPointerException
at java.lang.Integer.compareTo(Integer.java:1216)
at java.lang.Integer.compareTo(Integer.java:52)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNextTuple(POMergeJoin.java:525)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:305)
at 
org.apache.pig.backend.hadoop.executionengine.tez.plan.operator.POStoreTez.getNextTuple(POStoreTez.java:123)
at 
org.apache.pig.backend.hadoop.executionengine.tez.runtime.PigProcessor.runPipeline(PigProcessor.java:416)
at 
org.apache.pig.backend.hadoop.executionengine.tez.runtime.PigProcessor.run(PigProcessor.java:281)
at 
org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:370)
at 
org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
at 
org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:61)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1945)
at 
org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:61)
at 
org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:37)
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}

Here, the key used in the join is an integer. The Integer.compareTo(other) method 
throws a NullPointerException if the comparison is made against null. 

  was:
Merge join throws a NullPointerException if the left input's first key doesn't 
exist in the right input and is smaller than the first key of the right input.
For example:

|left|right|
|1|3|
|1|5|
|1| |


> MergeJoin throwing NullPointer Exception
> 
>
> Key: PIG-5310
> URL: https://issues.apache.org/jira/browse/PIG-5310
> Project: Pig
>  Issue Type: Bug
>Reporter: Satish Subhashrao Saley
>Assignee: Satish Subhashrao Saley
> Attachments: PIG-5310-1.patch, PIG-5310-2.patch
>
>
> Merge join throws a NullPointerException if the left input's first key doesn't 
> exist in the right input and is smaller than the first key of the right input.
> For example:
> |left|right|
> |1|3|
> |1|5|
> |1| |
> Error we get - 
> {code}
> ERROR 2998: Unhandled internal error. Vertex failed, vertexName=scope-16, 
> vertexId=vertex_1509400259446_0001_1_02, diagnostics=[Task failed, 
> taskId=task_1509400259446_0001_1_02_00, diagnostics=[TaskAttempt 0 
> failed, info=[Error: Error while running task ( failure ) : 
> attempt_1509400259446_0001_1_02_00_0:java.lang.NullPointerException
>   at java.lang.Integer.compareTo(Integer.java:1216)
>   at java.lang.Integer.compareTo(Integer.java:52)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POMergeJoin.getNextTuple(POMergeJoin.java:525)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:305)
>   at 
> org.apache.pig.backend.hadoop.executionengine.tez.plan.operator.POStoreTez.getNextTuple(POStoreTez.java:123)
>   at 
> org.apache.pig.backend.hadoop.executionengine.tez.runtime.PigProcessor.runPipeline(PigProcessor.java:416)
>   at 
> org.apache.pig.backend.hadoop.executionengine.tez.runtime.PigProcessor.run(PigProcessor.java:281)
>   at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:370)
>   at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:73)
>   at 
> 

[jira] [Updated] (PIG-5310) MergeJoin throwing NullPointer Exception

2017-11-27 Thread Satish Subhashrao Saley (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-5310:
-
Attachment: PIG-5310-2.patch

Fixed the test case.

> MergeJoin throwing NullPointer Exception
> 
>
> Key: PIG-5310
> URL: https://issues.apache.org/jira/browse/PIG-5310
> Project: Pig
>  Issue Type: Bug
>Reporter: Satish Subhashrao Saley
>Assignee: Satish Subhashrao Saley
> Attachments: PIG-5310-1.patch, PIG-5310-2.patch
>
>
> Merge join throws a NullPointerException if the left input's first key doesn't 
> exist in the right input and is smaller than the first key of the right input.
> For example:
> |left|right|
> |1|3|
> |1|5|
> |1| |



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (PIG-5315) pig.script is not set for scripts run via PigServer

2017-11-27 Thread Satish Subhashrao Saley (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-5315:
-
Status: Patch Available  (was: Open)

> pig.script is not set for scripts run via PigServer
> ---
>
> Key: PIG-5315
> URL: https://issues.apache.org/jira/browse/PIG-5315
> Project: Pig
>  Issue Type: Bug
>Reporter: Rohini Palaniswamy
>Assignee: Satish Subhashrao Saley
>Priority: Minor
>  Labels: newbie
> Fix For: 0.18.0
>
> Attachments: PIG-5315-1.patch
>
>
> ScriptState.get().setScript() is only called in Main and BoundScript and not 
> in PigServer.registerScript



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (PIG-5315) pig.script is not set for scripts run via PigServer

2017-11-27 Thread Satish Subhashrao Saley (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-5315:
-
Attachment: PIG-5315-1.patch

> pig.script is not set for scripts run via PigServer
> ---
>
> Key: PIG-5315
> URL: https://issues.apache.org/jira/browse/PIG-5315
> Project: Pig
>  Issue Type: Bug
>Reporter: Rohini Palaniswamy
>Assignee: Satish Subhashrao Saley
>Priority: Minor
>  Labels: newbie
> Fix For: 0.18.0
>
> Attachments: PIG-5315-1.patch
>
>
> ScriptState.get().setScript() is only called in Main and BoundScript and not 
> in PigServer.registerScript



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (PIG-5314) Abort method is not implemented in PigProcessor

2017-11-27 Thread Satish Subhashrao Saley (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-5314:
-
Status: Patch Available  (was: Open)

> Abort method is not implemented in PigProcessor
> ---
>
> Key: PIG-5314
> URL: https://issues.apache.org/jira/browse/PIG-5314
> Project: Pig
>  Issue Type: Bug
>Reporter: Rohini Palaniswamy
>Assignee: Satish Subhashrao Saley
> Fix For: 0.18.0
>
> Attachments: PIG-5314-1.patch, PIG-5314-2.patch, PIG-5314-3.patch
>
>
> Found a hung job caused by a task stuck in an infinite loop at 
> https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/runtime/PigProcessor.java#L308-L310
> {code}
> 2017-11-08 23:23:47,904 [INFO] [TezChild] |task.TezTaskRunner2|: returning 
> canCommit=false since task is not in a running state
> {code}
> The task runner keeps returning false for canCommit because task abort has 
> already been called, which Pig ignored.
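
As a rough, self-contained sketch of the missing piece (placeholder names only; the real Tez and Pig signatures may differ): an abort() implementation would set a flag that the commit-wait loop checks, so the task exits instead of spinning forever once canCommit can no longer become true.

{code:java}
import java.util.function.BooleanSupplier;

// Illustrative abort-flag pattern; not the actual PigProcessor/Tez API.
public class AbortAwareCommitLoop {
    private volatile boolean aborted = false;

    // Called by the framework when the task attempt is being killed.
    public void abort() {
        aborted = true;
    }

    // Simplified stand-in for the wait-for-canCommit loop referenced above.
    public void waitForCommit(BooleanSupplier canCommit) throws InterruptedException {
        while (!canCommit.getAsBoolean()) {
            if (aborted) {
                // Without this check the loop spins forever once the attempt is dead.
                throw new InterruptedException("task aborted while waiting to commit");
            }
            Thread.sleep(100);
        }
    }
}
{code}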



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (PIG-5314) Abort method is not implemented in PigProcessor

2017-11-27 Thread Satish Subhashrao Saley (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-5314:
-
Attachment: PIG-5314-3.patch

updated

> Abort method is not implemented in PigProcessor
> ---
>
> Key: PIG-5314
> URL: https://issues.apache.org/jira/browse/PIG-5314
> Project: Pig
>  Issue Type: Bug
>Reporter: Rohini Palaniswamy
>Assignee: Satish Subhashrao Saley
> Fix For: 0.18.0
>
> Attachments: PIG-5314-1.patch, PIG-5314-2.patch, PIG-5314-3.patch
>
>
> Found a hung job caused by a task stuck in an infinite loop at 
> https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/runtime/PigProcessor.java#L308-L310
> {code}
> 2017-11-08 23:23:47,904 [INFO] [TezChild] |task.TezTaskRunner2|: returning 
> canCommit=false since task is not in a running state
> {code}
> The task runner keeps returning false for canCommit because task abort has 
> already been called, which Pig ignored.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (PIG-5314) Abort method is not implemented in PigProcessor

2017-11-21 Thread Satish Subhashrao Saley (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-5314:
-
Attachment: PIG-5314-2.patch

> Abort method is not implemented in PigProcessor
> ---
>
> Key: PIG-5314
> URL: https://issues.apache.org/jira/browse/PIG-5314
> Project: Pig
>  Issue Type: Bug
>Reporter: Rohini Palaniswamy
>Assignee: Satish Subhashrao Saley
> Fix For: 0.18.0
>
> Attachments: PIG-5314-1.patch, PIG-5314-2.patch
>
>
> Found a hung job caused by a task stuck in an infinite loop at 
> https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/runtime/PigProcessor.java#L308-L310
> {code}
> 2017-11-08 23:23:47,904 [INFO] [TezChild] |task.TezTaskRunner2|: returning 
> canCommit=false since task is not in a running state
> {code}
> The task runner keeps returning false for canCommit because task abort has 
> already been called, which Pig ignored.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (PIG-5314) Abort method is not implemented in PigProcessor

2017-11-20 Thread Satish Subhashrao Saley (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-5314:
-
Attachment: PIG-5314-1.patch

> Abort method is not implemented in PigProcessor
> ---
>
> Key: PIG-5314
> URL: https://issues.apache.org/jira/browse/PIG-5314
> Project: Pig
>  Issue Type: Bug
>Reporter: Rohini Palaniswamy
>Assignee: Satish Subhashrao Saley
> Fix For: 0.18.0
>
> Attachments: PIG-5314-1.patch
>
>
> Found a hung job caused by a task stuck in an infinite loop at 
> https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/runtime/PigProcessor.java#L308-L310
> {code}
> 2017-11-08 23:23:47,904 [INFO] [TezChild] |task.TezTaskRunner2|: returning 
> canCommit=false since task is not in a running state
> {code}
> The task runner keeps returning false for canCommit because task abort has 
> already been called, which Pig ignored.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (PIG-5315) pig.script is not set for scripts run via PigServer

2017-11-17 Thread Satish Subhashrao Saley (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5315?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley reassigned PIG-5315:


Assignee: Satish Subhashrao Saley

> pig.script is not set for scripts run via PigServer
> ---
>
> Key: PIG-5315
> URL: https://issues.apache.org/jira/browse/PIG-5315
> Project: Pig
>  Issue Type: Bug
>Reporter: Rohini Palaniswamy
>Assignee: Satish Subhashrao Saley
>Priority: Minor
>  Labels: newbie
> Fix For: 0.18.0
>
>
> ScriptState.get().setScript() is only called in Main and BoundScript and not 
> in PigServer.registerScript



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (PIG-5314) Abort method is not implemented in PigProcessor

2017-11-17 Thread Satish Subhashrao Saley (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley reassigned PIG-5314:


Assignee: Satish Subhashrao Saley

> Abort method is not implemented in PigProcessor
> ---
>
> Key: PIG-5314
> URL: https://issues.apache.org/jira/browse/PIG-5314
> Project: Pig
>  Issue Type: Bug
>Reporter: Rohini Palaniswamy
>Assignee: Satish Subhashrao Saley
> Fix For: 0.18.0
>
>
> Found a hung job caused by a task stuck in an infinite loop at 
> https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/tez/runtime/PigProcessor.java#L308-L310
> {code}
> 2017-11-08 23:23:47,904 [INFO] [TezChild] |task.TezTaskRunner2|: returning 
> canCommit=false since task is not in a running state
> {code}
> The task runner keeps returning false for canCommit because task abort has 
> already been called, which Pig ignored.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (PIG-5310) MergeJoin throwing NullPointer Exception

2017-10-27 Thread Satish Subhashrao Saley (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-5310:
-
Attachment: PIG-5310-1.patch

> MergeJoin throwing NullPointer Exception
> 
>
> Key: PIG-5310
> URL: https://issues.apache.org/jira/browse/PIG-5310
> Project: Pig
>  Issue Type: Bug
>Reporter: Satish Subhashrao Saley
>Assignee: Satish Subhashrao Saley
> Attachments: PIG-5310-1.patch
>
>
> Merge join throws a NullPointerException if the left input's first key doesn't 
> exist in the right input and is smaller than the first key of the right input.
> For example:
> |left|right|
> |1|3|
> |1|5|
> |1| |



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (PIG-5310) MergeJoin throwing NullPointer Exception

2017-10-27 Thread Satish Subhashrao Saley (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-5310:
-
Status: Patch Available  (was: Open)

> MergeJoin throwing NullPointer Exception
> 
>
> Key: PIG-5310
> URL: https://issues.apache.org/jira/browse/PIG-5310
> Project: Pig
>  Issue Type: Bug
>Reporter: Satish Subhashrao Saley
>Assignee: Satish Subhashrao Saley
> Attachments: PIG-5310-1.patch
>
>
> Merge join throws a NullPointerException if the left input's first key doesn't 
> exist in the right input and is smaller than the first key of the right input.
> For example:
> |left|right|
> |1|3|
> |1|5|
> |1| |



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (PIG-5310) MergeJoin throwing NullPointer Exception

2017-10-25 Thread Satish Subhashrao Saley (JIRA)
Satish Subhashrao Saley created PIG-5310:


 Summary: MergeJoin throwing NullPointer Exception
 Key: PIG-5310
 URL: https://issues.apache.org/jira/browse/PIG-5310
 Project: Pig
  Issue Type: Bug
Reporter: Satish Subhashrao Saley
Assignee: Satish Subhashrao Saley


Merge join throws a NullPointerException if the left input's first key doesn't 
exist in the right input and is smaller than the first key of the right input.
For example:

|left|right|
|1|3|
|1|5|
|1| |



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (PIG-4120) Broadcast the index file in case of POMergeCoGroup and POMergeJoin

2017-10-03 Thread Satish Subhashrao Saley (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-4120:
-
Attachment: PIG-4120-5.patch

> Broadcast the index file in case of POMergeCoGroup and POMergeJoin
> --
>
> Key: PIG-4120
> URL: https://issues.apache.org/jira/browse/PIG-4120
> Project: Pig
>  Issue Type: Sub-task
>  Components: tez
>Reporter: Rohini Palaniswamy
>Assignee: Satish Subhashrao Saley
> Fix For: 0.18.0
>
> Attachments: PIG-4120-1.patch, PIG-4120-2.patch, PIG-4120-3.patch, 
> PIG-4120-4.patch, PIG-4120-5.patch
>
>
> Currently merge join and merge cogroup use two DAGs - the first DAG creates 
> the index file in HDFS and the second DAG does the merge join. Similar to 
> replicate join, we can broadcast the index file, cache it, and use it in 
> merge join and merge cogroup. This will give better performance and also 
> eliminate the need for the second DAG.
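
For context, a minimal embedded-Pig sketch of the kind of merge join whose index this change would broadcast (paths, aliases, and schemas here are hypothetical, and both inputs are assumed to be pre-sorted on the join key, as USING 'merge' requires):

{code:java}
import org.apache.pig.PigServer;

// Illustrative only: a merge join; building its index currently needs a separate DAG.
public class MergeJoinExample {
    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer("local");   // or "tez"
        pig.registerQuery("L = LOAD 'left_sorted' AS (k:int, v:chararray);");
        pig.registerQuery("R = LOAD 'right_sorted' AS (k:int, w:chararray);");
        pig.registerQuery("J = JOIN L BY k, R BY k USING 'merge';");
        pig.store("J", "joined_out");
        pig.shutdown();
    }
}
{code}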



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (PIG-4120) Broadcast the index file in case of POMergeCoGroup and POMergeJoin

2017-09-29 Thread Satish Subhashrao Saley (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-4120:
-
Attachment: PIG-4120-4.patch

> Broadcast the index file in case of POMergeCoGroup and POMergeJoin
> --
>
> Key: PIG-4120
> URL: https://issues.apache.org/jira/browse/PIG-4120
> Project: Pig
>  Issue Type: Sub-task
>  Components: tez
>Reporter: Rohini Palaniswamy
>Assignee: Satish Subhashrao Saley
> Fix For: 0.18.0
>
> Attachments: PIG-4120-1.patch, PIG-4120-2.patch, PIG-4120-3.patch, 
> PIG-4120-4.patch
>
>
> Currently merge join and merge cogroup use two DAGs - the first DAG creates 
> the index file in HDFS and the second DAG does the merge join. Similar to 
> replicate join, we can broadcast the index file, cache it, and use it in 
> merge join and merge cogroup. This will give better performance and also 
> eliminate the need for the second DAG.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (PIG-4120) Broadcast the index file in case of POMergeCoGroup and POMergeJoin

2017-09-27 Thread Satish Subhashrao Saley (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-4120:
-
Attachment: PIG-4120-3.patch

> Broadcast the index file in case of POMergeCoGroup and POMergeJoin
> --
>
> Key: PIG-4120
> URL: https://issues.apache.org/jira/browse/PIG-4120
> Project: Pig
>  Issue Type: Sub-task
>  Components: tez
>Reporter: Rohini Palaniswamy
>Assignee: Satish Subhashrao Saley
> Fix For: 0.18.0
>
> Attachments: PIG-4120-1.patch, PIG-4120-2.patch, PIG-4120-3.patch
>
>
> Currently merge join and merge cogroup use two DAGs - the first DAG creates 
> the index file in HDFS and the second DAG does the merge join. Similar to 
> replicate join, we can broadcast the index file, cache it, and use it in 
> merge join and merge cogroup. This will give better performance and also 
> eliminate the need for the second DAG.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (PIG-5306) REGEX_EXTRACT() logs every line that doesn't match

2017-09-22 Thread Satish Subhashrao Saley (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-5306:
-
Attachment: PIG-5306-1.patch

> REGEX_EXTRACT() logs every line that doesn't match
> --
>
> Key: PIG-5306
> URL: https://issues.apache.org/jira/browse/PIG-5306
> Project: Pig
>  Issue Type: Bug
>Reporter: Satish Subhashrao Saley
>Assignee: Satish Subhashrao Saley
>Priority: Minor
> Attachments: PIG-5306-1.patch
>
>
> Pig logs a warning message for every call that doesn't match a 
> capture group. The documentation only says this case returns NULL. From a 
> developer standpoint, the messages are unlikely to be useful.
> https://github.com/apache/pig/blob/trunk/src/org/apache/pig/builtin/REGEX_EXTRACT.java#L107
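
A standalone sketch of the behavior under discussion (illustrative only, not the Pig source): on a non-matching record the UDF simply returns null, so there is little value in also logging a warning per record.

{code:java}
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative stand-in for REGEX_EXTRACT(input, regex, groupIndex).
public class RegexExtractSketch {
    static String extract(String input, String regex, int group) {
        if (input == null) {
            return null;
        }
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            return null;   // quietly return null instead of warning per record
        }
        return m.group(group);
    }

    public static void main(String[] args) {
        System.out.println(extract("id=42", "id=(\\d+)", 1));          // 42
        System.out.println(extract("no match here", "id=(\\d+)", 1));  // null
    }
}
{code}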



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (PIG-5306) REGEX_EXTRACT() logs every line that doesn't match

2017-09-22 Thread Satish Subhashrao Saley (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-5306:
-
Status: Patch Available  (was: Open)

> REGEX_EXTRACT() logs every line that doesn't match
> --
>
> Key: PIG-5306
> URL: https://issues.apache.org/jira/browse/PIG-5306
> Project: Pig
>  Issue Type: Bug
>Reporter: Satish Subhashrao Saley
>Assignee: Satish Subhashrao Saley
>Priority: Minor
> Attachments: PIG-5306-1.patch
>
>
> Pig logs a warning message for every call that doesn't match a 
> capture group. The documentation only says this case returns NULL. From a 
> developer standpoint, the messages are unlikely to be useful.
> https://github.com/apache/pig/blob/trunk/src/org/apache/pig/builtin/REGEX_EXTRACT.java#L107



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (PIG-5306) REGEX_EXTRACT() logs every line that doesn't match

2017-09-22 Thread Satish Subhashrao Saley (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley reassigned PIG-5306:


Assignee: Satish Subhashrao Saley

> REGEX_EXTRACT() logs every line that doesn't match
> --
>
> Key: PIG-5306
> URL: https://issues.apache.org/jira/browse/PIG-5306
> Project: Pig
>  Issue Type: Bug
>Reporter: Satish Subhashrao Saley
>Assignee: Satish Subhashrao Saley
>Priority: Minor
>
> Pig logs a warning message for every call that doesn't match a 
> capture group. The documentation only says this case returns NULL. From a 
> developer standpoint, the messages are unlikely to be useful.
> https://github.com/apache/pig/blob/trunk/src/org/apache/pig/builtin/REGEX_EXTRACT.java#L107



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (PIG-5306) REGEX_EXTRACT() logs every line that doesn't match

2017-09-22 Thread Satish Subhashrao Saley (JIRA)
Satish Subhashrao Saley created PIG-5306:


 Summary: REGEX_EXTRACT() logs every line that doesn't match
 Key: PIG-5306
 URL: https://issues.apache.org/jira/browse/PIG-5306
 Project: Pig
  Issue Type: Bug
Reporter: Satish Subhashrao Saley
Priority: Minor



Pig logs a warning message for every call that doesn't match a capture 
group. The documentation only says this case returns NULL. From a developer 
standpoint, the messages are unlikely to be useful.

https://github.com/apache/pig/blob/trunk/src/org/apache/pig/builtin/REGEX_EXTRACT.java#L107



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PIG-4120) Broadcast the index file in case of POMergeCoGroup and POMergeJoin

2017-09-22 Thread Satish Subhashrao Saley (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16176757#comment-16176757
 ] 

Satish Subhashrao Saley commented on PIG-4120:
--

Updated patch in review board.

> Broadcast the index file in case of POMergeCoGroup and POMergeJoin
> --
>
> Key: PIG-4120
> URL: https://issues.apache.org/jira/browse/PIG-4120
> Project: Pig
>  Issue Type: Sub-task
>  Components: tez
>Reporter: Rohini Palaniswamy
>Assignee: Satish Subhashrao Saley
> Fix For: 0.18.0
>
> Attachments: PIG-4120-1.patch, PIG-4120-2.patch
>
>
> Currently merge join and merge cogroup use two DAGs - the first DAG creates 
> the index file in HDFS and the second DAG does the merge join. Similar to 
> replicate join, we can broadcast the index file, cache it, and use it in 
> merge join and merge cogroup. This will give better performance and also 
> eliminate the need for the second DAG.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (PIG-4120) Broadcast the index file in case of POMergeCoGroup and POMergeJoin

2017-09-22 Thread Satish Subhashrao Saley (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-4120:
-
Status: Patch Available  (was: Open)

> Broadcast the index file in case of POMergeCoGroup and POMergeJoin
> --
>
> Key: PIG-4120
> URL: https://issues.apache.org/jira/browse/PIG-4120
> Project: Pig
>  Issue Type: Sub-task
>  Components: tez
>Reporter: Rohini Palaniswamy
>Assignee: Satish Subhashrao Saley
> Fix For: 0.18.0
>
> Attachments: PIG-4120-1.patch, PIG-4120-2.patch
>
>
> Currently merge join and merge cogroup use two DAGs - the first DAG creates 
> the index file in HDFS and the second DAG does the merge join. Similar to 
> replicate join, we can broadcast the index file, cache it, and use it in 
> merge join and merge cogroup. This will give better performance and also 
> eliminate the need for the second DAG.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (PIG-4120) Broadcast the index file in case of POMergeCoGroup and POMergeJoin

2017-09-22 Thread Satish Subhashrao Saley (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-4120:
-
Attachment: PIG-4120-2.patch

> Broadcast the index file in case of POMergeCoGroup and POMergeJoin
> --
>
> Key: PIG-4120
> URL: https://issues.apache.org/jira/browse/PIG-4120
> Project: Pig
>  Issue Type: Sub-task
>  Components: tez
>Reporter: Rohini Palaniswamy
>Assignee: Satish Subhashrao Saley
> Fix For: 0.18.0
>
> Attachments: PIG-4120-1.patch, PIG-4120-2.patch
>
>
> Currently merge join and merge cogroup use two DAGs - the first DAG creates 
> the index file in HDFS and the second DAG does the merge join. Similar to 
> replicate join, we can broadcast the index file, cache it, and use it in 
> merge join and merge cogroup. This will give better performance and also 
> eliminate the need for the second DAG.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (PIG-4120) Broadcast the index file in case of POMergeCoGroup and POMergeJoin

2017-09-19 Thread Satish Subhashrao Saley (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-4120:
-
Attachment: PIG-4120-1.patch

> Broadcast the index file in case of POMergeCoGroup and POMergeJoin
> --
>
> Key: PIG-4120
> URL: https://issues.apache.org/jira/browse/PIG-4120
> Project: Pig
>  Issue Type: Sub-task
>  Components: tez
>Reporter: Rohini Palaniswamy
>Assignee: Satish Subhashrao Saley
> Fix For: 0.18.0
>
> Attachments: PIG-4120-1.patch
>
>
> Currently merge join and merge cogroup use two DAGs - the first DAG creates 
> the index file in HDFS and the second DAG does the merge join. Similar to 
> replicate join, we can broadcast the index file, cache it, and use it in 
> merge join and merge cogroup. This will give better performance and also 
> eliminate the need for the second DAG.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (PIG-4662) New optimizer rule: filter nulls before inner joins

2017-09-05 Thread Satish Subhashrao Saley (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-4662:
-
Attachment: PIG-4662-1.patch

> New optimizer rule: filter nulls before inner joins
> ---
>
> Key: PIG-4662
> URL: https://issues.apache.org/jira/browse/PIG-4662
> Project: Pig
>  Issue Type: Improvement
>Reporter: Ido Hadanny
>Assignee: Satish Subhashrao Saley
>Priority: Minor
>  Labels: Performance
> Fix For: 0.18.0
>
> Attachments: PIG-4662-1.patch
>
>
> As stated in the docs, rewriting an inner join and filtering nulls from 
> inputs can be a big performance gain: 
> http://pig.apache.org/docs/r0.14.0/perf.html#nulls
> We would like to add an optimizer rule which detects inner joins, and filters 
> nulls in all inputs:
> A = filter A by t is not null;
> B = filter B by x is not null;
> C = join A by t, B by x;
> see also: 
> http://stackoverflow.com/questions/32088389/is-the-pig-optimizer-filtering-nulls-before-joining



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PIG-5273) _SUCCESS file should be created at the end of the job

2017-09-01 Thread Satish Subhashrao Saley (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5273?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16151334#comment-16151334
 ] 

Satish Subhashrao Saley commented on PIG-5273:
--

Updated patch.

> _SUCCESS file should be created at the end of the job
> -
>
> Key: PIG-5273
> URL: https://issues.apache.org/jira/browse/PIG-5273
> Project: Pig
>  Issue Type: Bug
>Reporter: Satish Subhashrao Saley
>Assignee: Satish Subhashrao Saley
> Attachments: PIG-5273-1.patch, PIG-5273-2.patch
>
>
> One of the users ran into issues because the _SUCCESS file was created by 
> FileOutputCommitter.commitJob(), and storeCleanup(), which is called after 
> that in PigOutputCommitter, failed to store the schema due to a network 
> outage. abortJob was then called, and the StoreFunc.cleanupOnFailure method 
> in it deleted the output directory. Downstream jobs that started because of 
> the _SUCCESS file ran with empty data.
> Possible solutions:
> 1) Move storeCleanup before commit. Found that order was reversed in 
> https://issues.apache.org/jira/browse/PIG-2642, probably due to 
> FileOutputCommitter version 1 and might not be a problem with 
> FileOutputCommitter version 2. This would still not help when there are 
> multiple outputs as main problem is cleanupOnFailure in abortJob deleting 
> directories.
> 2) We can change cleanupOnFailure to not delete output directories. It still 
> does not help. The Oozie action retry might kick in and delete the directory 
> while the downstream has already started running because of the _SUCCESS 
> file. 
> 3) It cannot be done in the OutputCommitter at all as multiple output 
> committers are called in parallel in Tez. We can have Pig suppress _SUCCESS 
> creation and try creating them all at the end in TezLauncher if job has 
> succeeded before calling cleanupOnSuccess. Can probably add it as a 
> configurable setting and turn it on by default in our clusters. This is probably 
> the most viable solution.
> Thank you [~rohini] for finding out the issue and providing the solution.
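
A rough sketch of option 3 under stated assumptions (this is not the actual TezLauncher code; method and variable names are hypothetical): after the whole DAG has succeeded, the launcher itself would touch a _SUCCESS marker in each store location, instead of each committer creating it mid-job.

{code:java}
import java.io.IOException;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Illustrative only: create _SUCCESS markers once, after the job has fully succeeded.
public class SuccessMarkerSketch {
    static void markSuccess(Configuration conf, List<String> outputDirs) throws IOException {
        for (String dir : outputDirs) {
            Path marker = new Path(dir, "_SUCCESS");
            FileSystem fs = marker.getFileSystem(conf);
            fs.createNewFile(marker);   // zero-byte marker, written only at the very end
        }
    }
}
{code}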



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (PIG-5273) _SUCCESS file should be created at the end of the job

2017-09-01 Thread Satish Subhashrao Saley (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-5273:
-
Attachment: PIG-5273-2.patch

> _SUCCESS file should be created at the end of the job
> -
>
> Key: PIG-5273
> URL: https://issues.apache.org/jira/browse/PIG-5273
> Project: Pig
>  Issue Type: Bug
>Reporter: Satish Subhashrao Saley
>Assignee: Satish Subhashrao Saley
> Attachments: PIG-5273-1.patch, PIG-5273-2.patch
>
>
> One of the users ran into issues because the _SUCCESS file was created by 
> FileOutputCommitter.commitJob(), and storeCleanup(), which is called after 
> that in PigOutputCommitter, failed to store the schema due to a network 
> outage. abortJob was then called, and the StoreFunc.cleanupOnFailure method 
> in it deleted the output directory. Downstream jobs that started because of 
> the _SUCCESS file ran with empty data.
> Possible solutions:
> 1) Move storeCleanup before commit. Found that order was reversed in 
> https://issues.apache.org/jira/browse/PIG-2642, probably due to 
> FileOutputCommitter version 1 and might not be a problem with 
> FileOutputCommitter version 2. This would still not help when there are 
> multiple outputs as main problem is cleanupOnFailure in abortJob deleting 
> directories.
> 2) We can change cleanupOnFailure to not delete output directories. It still 
> does not help. The Oozie action retry might kick in and delete the directory 
> while the downstream has already started running because of the _SUCCESS 
> file. 
> 3) It cannot be done in the OutputCommitter at all as multiple output 
> committers are called in parallel in Tez. We can have Pig suppress _SUCCESS 
> creation and try creating them all at the end in TezLauncher if job has 
> succeeded before calling cleanupOnSuccess. Can probably add it as a 
> configurable setting and turn it on by default in our clusters. This is probably 
> the most viable solution.
> Thank you [~rohini] for finding out the issue and providing the solution.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (PIG-5273) _SUCCESS file should be created at the end of the job

2017-09-01 Thread Satish Subhashrao Saley (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-5273:
-
Attachment: (was: PIG-5273-2.patch)

> _SUCCESS file should be created at the end of the job
> -
>
> Key: PIG-5273
> URL: https://issues.apache.org/jira/browse/PIG-5273
> Project: Pig
>  Issue Type: Bug
>Reporter: Satish Subhashrao Saley
>Assignee: Satish Subhashrao Saley
> Attachments: PIG-5273-1.patch, PIG-5273-2.patch
>
>
> One of the users ran into issues because the _SUCCESS file was created by 
> FileOutputCommitter.commitJob(), and storeCleanup(), which is called after 
> that in PigOutputCommitter, failed to store the schema due to a network 
> outage. abortJob was then called, and the StoreFunc.cleanupOnFailure method 
> in it deleted the output directory. Downstream jobs that started because of 
> the _SUCCESS file ran with empty data.
> Possible solutions:
> 1) Move storeCleanup before commit. Found that order was reversed in 
> https://issues.apache.org/jira/browse/PIG-2642, probably due to 
> FileOutputCommitter version 1 and might not be a problem with 
> FileOutputCommitter version 2. This would still not help when there are 
> multiple outputs as main problem is cleanupOnFailure in abortJob deleting 
> directories.
> 2) We can change cleanupOnFailure to not delete output directories. It still 
> does not help. The Oozie action retry might kick in and delete the directory 
> while the downstream has already started running because of the _SUCCESS 
> file. 
> 3) It cannot be done in the OutputCommitter at all as multiple output 
> committers are called in parallel in Tez. We can have Pig suppress _SUCCESS 
> creation and try creating them all at the end in TezLauncher if job has 
> succeeded before calling cleanupOnSuccess. Can probably add it as a 
> configurable setting and turn it on by default in our clusters. This is probably 
> the most viable solution.
> Thank you [~rohini] for finding out the issue and providing the solution.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (PIG-5273) _SUCCESS file should be created at the end of the job

2017-09-01 Thread Satish Subhashrao Saley (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-5273:
-
Attachment: PIG-5273-2.patch

> _SUCCESS file should be created at the end of the job
> -
>
> Key: PIG-5273
> URL: https://issues.apache.org/jira/browse/PIG-5273
> Project: Pig
>  Issue Type: Bug
>Reporter: Satish Subhashrao Saley
>Assignee: Satish Subhashrao Saley
> Attachments: PIG-5273-1.patch, PIG-5273-2.patch
>
>
> One of the users ran into issues because the _SUCCESS file was created by 
> FileOutputCommitter.commitJob(), and storeCleanup(), which is called after 
> that in PigOutputCommitter, failed to store the schema due to a network 
> outage. abortJob was then called, and the StoreFunc.cleanupOnFailure method 
> in it deleted the output directory. Downstream jobs that started because of 
> the _SUCCESS file ran with empty data.
> Possible solutions:
> 1) Move storeCleanup before commit. Found that order was reversed in 
> https://issues.apache.org/jira/browse/PIG-2642, probably due to 
> FileOutputCommitter version 1 and might not be a problem with 
> FileOutputCommitter version 2. This would still not help when there are 
> multiple outputs as main problem is cleanupOnFailure in abortJob deleting 
> directories.
> 2) We can change cleanupOnFailure to not delete output directories. It still 
> does not help. The Oozie action retry might kick in and delete the directory 
> while the downstream has already started running because of the _SUCCESS 
> file. 
> 3) It cannot be done in the OutputCommitter at all as multiple output 
> committers are called in parallel in Tez. We can have Pig suppress _SUCCESS 
> creation and try creating them all at the end in TezLauncher if job has 
> succeeded before calling cleanupOnSuccess. Can probably add it as a 
> configurable setting and turn it on by default in our clusters. This is probably 
> the most viable solution.
> Thank you [~rohini] for finding out the issue and providing the solution.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PIG-5282) Upgade to Java 8

2017-08-17 Thread Satish Subhashrao Saley (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16130884#comment-16130884
 ] 

Satish Subhashrao Saley commented on PIG-5282:
--

Updated the patch. In test/e2e/pig/udfs/java/build.xml and 
test/perf/pigmix/build.xml, we don't specify any Java version, so the version 
of Java used to run the ant command gets picked up by default. 

> Upgade to Java 8
> 
>
> Key: PIG-5282
> URL: https://issues.apache.org/jira/browse/PIG-5282
> Project: Pig
>  Issue Type: Improvement
>Reporter: Nandor Kollar
>Assignee: Satish Subhashrao Saley
> Fix For: 0.18.0
>
> Attachments: PIG-5282-1.patch, PIG-5282-2.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (PIG-5282) Upgade to Java 8

2017-08-17 Thread Satish Subhashrao Saley (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-5282:
-
Attachment: PIG-5282-2.patch

> Upgade to Java 8
> 
>
> Key: PIG-5282
> URL: https://issues.apache.org/jira/browse/PIG-5282
> Project: Pig
>  Issue Type: Improvement
>Reporter: Nandor Kollar
>Assignee: Satish Subhashrao Saley
> Fix For: 0.18.0
>
> Attachments: PIG-5282-1.patch, PIG-5282-2.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (PIG-5282) Upgade to Java 8

2017-08-17 Thread Satish Subhashrao Saley (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-5282:
-
Attachment: PIG-5282-1.patch

> Upgade to Java 8
> 
>
> Key: PIG-5282
> URL: https://issues.apache.org/jira/browse/PIG-5282
> Project: Pig
>  Issue Type: Improvement
>Reporter: Nandor Kollar
>Assignee: Satish Subhashrao Saley
> Fix For: 0.18.0
>
> Attachments: PIG-5282-1.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (PIG-5282) Upgade to Java 8

2017-08-17 Thread Satish Subhashrao Saley (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-5282:
-
Status: Patch Available  (was: Open)

> Upgade to Java 8
> 
>
> Key: PIG-5282
> URL: https://issues.apache.org/jira/browse/PIG-5282
> Project: Pig
>  Issue Type: Improvement
>Reporter: Nandor Kollar
>Assignee: Satish Subhashrao Saley
> Fix For: 0.18.0
>
> Attachments: PIG-5282-1.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (PIG-5282) Upgade to Java 8

2017-08-17 Thread Satish Subhashrao Saley (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley reassigned PIG-5282:


Assignee: Satish Subhashrao Saley

> Upgade to Java 8
> 
>
> Key: PIG-5282
> URL: https://issues.apache.org/jira/browse/PIG-5282
> Project: Pig
>  Issue Type: Improvement
>Reporter: Nandor Kollar
>Assignee: Satish Subhashrao Saley
> Fix For: 0.18.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (PIG-5273) _SUCCESS file should be created at the end of the job

2017-08-16 Thread Satish Subhashrao Saley (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-5273:
-
Attachment: PIG-5273-1.patch

> _SUCCESS file should be created at the end of the job
> -
>
> Key: PIG-5273
> URL: https://issues.apache.org/jira/browse/PIG-5273
> Project: Pig
>  Issue Type: Bug
>Reporter: Satish Subhashrao Saley
>Assignee: Satish Subhashrao Saley
> Attachments: PIG-5273-1.patch
>
>
> One of the users ran into issues because the _SUCCESS file was created by 
> FileOutputCommitter.commitJob(), and storeCleanup(), which is called after 
> that in PigOutputCommitter, failed to store the schema due to a network 
> outage. abortJob was then called, and the StoreFunc.cleanupOnFailure method 
> in it deleted the output directory. Downstream jobs that started because of 
> the _SUCCESS file ran with empty data.
> Possible solutions:
> 1) Move storeCleanup before commit. Found that order was reversed in 
> https://issues.apache.org/jira/browse/PIG-2642, probably due to 
> FileOutputCommitter version 1 and might not be a problem with 
> FileOutputCommitter version 2. This would still not help when there are 
> multiple outputs as main problem is cleanupOnFailure in abortJob deleting 
> directories.
> 2) We can change cleanupOnFailure to not delete output directories. It still 
> does not help. The Oozie action retry might kick in and delete the directory 
> while the downstream has already started running because of the _SUCCESS 
> file. 
> 3) It cannot be done in the OutputCommitter at all as multiple output 
> committers are called in parallel in Tez. We can have Pig suppress _SUCCESS 
> creation and try creating them all at the end in TezLauncher if job has 
> succeeded before calling cleanupOnSuccess. Can probably add it as a 
> configurable setting and turn it on by default in our clusters. This is probably 
> the most viable solution.
> Thank you [~rohini] for finding out the issue and providing the solution.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (PIG-5273) _SUCCESS file should be created at the end of the job

2017-08-16 Thread Satish Subhashrao Saley (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-5273:
-
Status: Patch Available  (was: Open)

> _SUCCESS file should be created at the end of the job
> -
>
> Key: PIG-5273
> URL: https://issues.apache.org/jira/browse/PIG-5273
> Project: Pig
>  Issue Type: Bug
>Reporter: Satish Subhashrao Saley
>Assignee: Satish Subhashrao Saley
> Attachments: PIG-5273-1.patch
>
>
> One of the users ran into issues because the _SUCCESS file was created by 
> FileOutputCommitter.commitJob(), and storeCleanup(), which is called after 
> that in PigOutputCommitter, failed to store the schema due to a network 
> outage. abortJob was then called, and the StoreFunc.cleanupOnFailure method 
> in it deleted the output directory. Downstream jobs that started because of 
> the _SUCCESS file ran with empty data.
> Possible solutions:
> 1) Move storeCleanup before commit. Found that order was reversed in 
> https://issues.apache.org/jira/browse/PIG-2642, probably due to 
> FileOutputCommitter version 1 and might not be a problem with 
> FileOutputCommitter version 2. This would still not help when there are 
> multiple outputs as main problem is cleanupOnFailure in abortJob deleting 
> directories.
> 2) We can change cleanupOnFailure to not delete output directories. It still 
> does not help. The Oozie action retry might kick in and delete the directory 
> while the downstream has already started running because of the _SUCCESS 
> file. 
> 3) It cannot be done in the OutputCommitter at all as multiple output 
> committers are called in parallel in Tez. We can have Pig suppress _SUCCESS 
> creation and try creating them all at the end in TezLauncher if job has 
> succeeded before calling cleanupOnSuccess. Can probably add it as a 
> configurable setting and turn it on by default in our clusters. This is probably 
> the most viable solution.
> Thank you [~rohini] for finding out the issue and providing the solution.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (PIG-5264) Remove deprecated keys from PigConfiguration

2017-08-03 Thread Satish Subhashrao Saley (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16113706#comment-16113706
 ] 

Satish Subhashrao Saley commented on PIG-5264:
--

bq.We should put ResourceStatistics.setmBytes back
Attached PIG-5264-amend-1.patch 

> Remove deprecated keys from PigConfiguration
> 
>
> Key: PIG-5264
> URL: https://issues.apache.org/jira/browse/PIG-5264
> Project: Pig
>  Issue Type: Improvement
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
>Priority: Minor
> Fix For: 0.18.0
>
> Attachments: PIG-5264_1.patch, PIG-5264-amend-1.patch
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> PigConfiguration includes several deprecated constants (like INSERT_ENABLED, 
> SCHEMA_TUPLE_SHOULD_ALLOW_FORCE, etc.). These should be removed, as all of 
> them have been deprecated for multiple versions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (PIG-5264) Remove deprecated keys from PigConfiguration

2017-08-03 Thread Satish Subhashrao Saley (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-5264:
-
Attachment: PIG-5264-amend-1.patch

> Remove deprecated keys from PigConfiguration
> 
>
> Key: PIG-5264
> URL: https://issues.apache.org/jira/browse/PIG-5264
> Project: Pig
>  Issue Type: Improvement
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
>Priority: Minor
> Fix For: 0.18.0
>
> Attachments: PIG-5264_1.patch, PIG-5264-amend-1.patch
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> PigConfiguration includes several deprecated constants (like INSERT_ENABLED, 
> SCHEMA_TUPLE_SHOULD_ALLOW_FORCE, etc.). These should be removed, as all of 
> them have been deprecated for multiple versions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (PIG-5278) Unit test failures because of PIG-5264

2017-07-31 Thread Satish Subhashrao Saley (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-5278:
-
Description: 
Following unit tests are failing after commit 
cdd48d8c448221b2bde7f423dd26bbfc51102399
PIG-5264 
https://github.com/apache/pig/commit/cdd48d8c448221b2bde7f423dd26bbfc51102399

# TestAutoLocalMode
# TestMultiQueryCompiler

  was:
Following unit tests are failing after commit 
cdd48d8c448221b2bde7f423dd26bbfc51102399
PIG-5264 
https://github.com/apache/pig/commit/cdd48d8c448221b2bde7f423dd26bbfc51102399

#TestAutoLocalMode
#TestMultiQueryCompiler


> Unit test failures because of PIG-5264
> --
>
> Key: PIG-5278
> URL: https://issues.apache.org/jira/browse/PIG-5278
> Project: Pig
>  Issue Type: Bug
>Reporter: Satish Subhashrao Saley
> Attachments: TEST-org.apache.pig.test.TestAutoLocalMode.txt, 
> TEST-org.apache.pig.test.TestMultiQueryCompiler.txt
>
>
> Following unit tests are failing after commit 
> cdd48d8c448221b2bde7f423dd26bbfc51102399
> PIG-5264 
> https://github.com/apache/pig/commit/cdd48d8c448221b2bde7f423dd26bbfc51102399
> # TestAutoLocalMode
> # TestMultiQueryCompiler



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (PIG-5278) Unit test failures because of PIG-5264

2017-07-31 Thread Satish Subhashrao Saley (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-5278:
-
Attachment: TEST-org.apache.pig.test.TestMultiQueryCompiler.txt

> Unit test failures because of PIG-5264
> --
>
> Key: PIG-5278
> URL: https://issues.apache.org/jira/browse/PIG-5278
> Project: Pig
>  Issue Type: Bug
>Reporter: Satish Subhashrao Saley
> Attachments: TEST-org.apache.pig.test.TestAutoLocalMode.txt, 
> TEST-org.apache.pig.test.TestMultiQueryCompiler.txt
>
>
> Following unit tests are failing after commit 
> cdd48d8c448221b2bde7f423dd26bbfc51102399
> PIG-5264 
> https://github.com/apache/pig/commit/cdd48d8c448221b2bde7f423dd26bbfc51102399
> #TestAutoLocalMode
> #TestMultiQueryCompiler



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (PIG-5278) Unit test failures because of PIG-5264

2017-07-31 Thread Satish Subhashrao Saley (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-5278:
-
Attachment: TEST-org.apache.pig.test.TestAutoLocalMode.txt

> Unit test failures because of PIG-5264
> --
>
> Key: PIG-5278
> URL: https://issues.apache.org/jira/browse/PIG-5278
> Project: Pig
>  Issue Type: Bug
>Reporter: Satish Subhashrao Saley
> Attachments: TEST-org.apache.pig.test.TestAutoLocalMode.txt
>
>
> Following unit tests are failing after commit 
> cdd48d8c448221b2bde7f423dd26bbfc51102399
> PIG-5264 
> https://github.com/apache/pig/commit/cdd48d8c448221b2bde7f423dd26bbfc51102399
> #TestAutoLocalMode
> #TestMultiQueryCompiler



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (PIG-5278) Unit test failures because of PIG-5264

2017-07-31 Thread Satish Subhashrao Saley (JIRA)
Satish Subhashrao Saley created PIG-5278:


 Summary: Unit test failures because of PIG-5264
 Key: PIG-5278
 URL: https://issues.apache.org/jira/browse/PIG-5278
 Project: Pig
  Issue Type: Bug
Reporter: Satish Subhashrao Saley


Following unit tests are failing after commit 
cdd48d8c448221b2bde7f423dd26bbfc51102399
PIG-5264 
https://github.com/apache/pig/commit/cdd48d8c448221b2bde7f423dd26bbfc51102399

#TestAutoLocalMode
#TestMultiQueryCompiler



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (PIG-5273) _SUCCESS file should be created at the end of the job

2017-07-14 Thread Satish Subhashrao Saley (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Satish Subhashrao Saley updated PIG-5273:
-
Description: 
One of the users ran into issues because the _SUCCESS file was created by 
FileOutputCommitter.commitJob(), and storeCleanup(), which is called after that 
in PigOutputCommitter, failed to store the schema due to a network outage. 
abortJob was then called, and the StoreFunc.cleanupOnFailure method in it 
deleted the output directory. Downstream jobs that started because of the 
_SUCCESS file ran with empty data.
Possible solutions:
1) Move storeCleanup before commit. Found that order was reversed in 
https://issues.apache.org/jira/browse/PIG-2642, probably due to 
FileOutputCommitter version 1 and might not be a problem with 
FileOutputCommitter version 2. This would still not help when there are 
multiple outputs as main problem is cleanupOnFailure in abortJob deleting 
directories.
2) We can change cleanupOnFailure to not delete output directories. It still does 
not help. The Oozie action retry might kick in and delete the directory while 
the downstream has already started running because of the _SUCCESS file. 
3) It cannot be done in the OutputCommitter at all as multiple output 
committers are called in parallel in Tez. We can have Pig suppress _SUCCESS 
creation and try creating them all at the end in TezLauncher if job has 
succeeded before calling cleanupOnSuccess. Can probably add it as a 
configurable setting and turn it on by default in our clusters. This is probably 
the most viable solution.

Thank you [~rohini] for finding out the issue and providing the solution.

  was:
One of the users ran into issues because the _SUCCESS file was created by 
FileOutputCommitter.commitJob(), and storeCleanup(), which is called after that 
in PigOutputCommitter, failed to store the schema due to a network outage. 
abortJob was then called, and the StoreFunc.cleanupOnFailure method in it 
deleted the output directory. Downstream jobs that started because of the 
_SUCCESS file ran with empty data.
Possible solutions:
1) Move storeCleanup before commit. Found that order was reversed in 
https://issues.apache.org/jira/browse/PIG-2642, probably due to 
FileOutputCommitter version 1 and might not be a problem with 
FileOutputCommitter version 2. This would still not help when there are 
multiple outputs as main problem is cleanupOnFailure in abortJob deleting 
directories.
2) We can change cleanupOnFailure to not delete output directories. It still does 
not help. The Oozie action retry might kick in and delete the directory while 
the downstream has already started running because of the _SUCCESS file. 
3) It cannot be done in the OutputCommitter at all as multiple output 
committers are called in parallel in Tez. We can have Pig suppress _SUCCESS 
creation and try creating them all at the end in TezLauncher if job has 
succeeded before calling cleanupOnSuccess. Can probably add it as a 
configurable setting and turn it on by default in our clusters. This is probably 
the most viable solution.


> _SUCCESS file should be created at the end of the job
> -
>
> Key: PIG-5273
> URL: https://issues.apache.org/jira/browse/PIG-5273
> Project: Pig
>  Issue Type: Bug
>Reporter: Satish Subhashrao Saley
>Assignee: Satish Subhashrao Saley
>
> One of the users ran into issues because the _SUCCESS file was created by 
> FileOutputCommitter.commitJob(), and storeCleanup(), which is called after 
> that in PigOutputCommitter, failed to store the schema due to a network 
> outage. abortJob was then called, and the StoreFunc.cleanupOnFailure method 
> in it deleted the output directory. Downstream jobs that started because of 
> the _SUCCESS file ran with empty data.
> Possible solutions:
> 1) Move storeCleanup before commit. Found that order was reversed in 
> https://issues.apache.org/jira/browse/PIG-2642, probably due to 
> FileOutputCommitter version 1 and might not be a problem with 
> FileOutputCommitter version 2. This would still not help when there are 
> multiple outputs as main problem is cleanupOnFailure in abortJob deleting 
> directories.
> 2) We can change cleanupOnFailure to not delete output directories. It still 
> does not help. The Oozie action retry might kick in and delete the directory 
> while the downstream has already started running because of the _SUCCESS 
> file. 
> 3) It cannot be done in the OutputCommitter at all as multiple output 
> committers are called in parallel in Tez. We can have Pig suppress _SUCCESS 
> creation and try creating them all at the end in TezLauncher if job has 
> succeeded before calling cleanupOnSuccess. Can probably add it as a 
> configurable setting and turn it on by default in our clusters. This is probably 
> the most viable solution.
> Thank you [~rohini] for finding out the issue and providing the solution.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
