[jira] [Commented] (PIG-4976) streaming job with store clause stuck if the script fail

2016-09-20 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15506435#comment-15506435
 ] 

Nandor Kollar commented on PIG-4976:


Ok, so in this case it seems that the file is not getting created in 
FileOutputHandler:
{code}
File file = new File(this.fileName);
BufferedPositionedInputStream fileInStream =
    new BufferedPositionedInputStream(new FileInputStream(file));
{code}
This won't create the file. I added a test to show the problem and created a 
patch based on PIG-4976-3.patch (PIG-4976-4.patch) with an additional test for 
this case; it also changes FileOutputHandler to create the output file. I'm not 
sure what we should do if the output file already exists: should we just append 
to it, or should we throw an exception instead?

> streaming job with store clause stuck if the script fail
> 
>
> Key: PIG-4976
> URL: https://issues.apache.org/jira/browse/PIG-4976
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.17.0
>
> Attachments: PIG-4976-1.patch, PIG-4976-2.patch, PIG-4976-3.patch
>
>
> When investigating PIG-4972, I also noticed that the Pig job gets stuck when 
> the perl script has a syntax error. This happens if we have an output clause 
> in the stream specification (meaning a file is used for staging). The bug 
> exists in both Tez and MR, and it is not a regression.
> Here is an example:
> {code}
> define CMD `perl kk.pl` output('foo') ship('kk.pl');
> A = load 'studenttab10k' as (name, age, gpa);
> B = foreach A generate name;
> C = stream B through CMD;
> store C into 'ooo';
> {code}
> kk.pl is any perl script containing a syntax error.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4976) streaming job with store clause stuck if the script fail

2016-09-20 Thread Nandor Kollar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PIG-4976:
---
Attachment: PIG-4976-4.patch

> streaming job with store clause stuck if the script fail
> 
>
> Key: PIG-4976
> URL: https://issues.apache.org/jira/browse/PIG-4976
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.17.0
>
> Attachments: PIG-4976-1.patch, PIG-4976-2.patch, PIG-4976-3.patch, 
> PIG-4976-4.patch





[jira] [Commented] (PIG-4976) streaming job with store clause stuck if the script fail

2016-09-16 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15496132#comment-15496132
 ] 

Nandor Kollar commented on PIG-4976:


When I applied PIG-4976-2.patch I had the same problem as [~knoguchi]: the test 
case hung. I noticed that the test passed when I applied PIG-4976-1.patch first, 
then PIG-4976-3.patch. I also noticed that when I applied just the first patch, 
the test hung, but not because of the syntax error in the Perl script; it hung 
because of a NullPointerException while trying to close the OutputHandler:
{code}
Exception in thread "Thread-31" java.lang.NullPointerException
	at org.apache.pig.impl.streaming.OutputHandler.close(OutputHandler.java:178)
	at org.apache.pig.impl.streaming.ExecutableManager.killProcess(ExecutableManager.java:188)
	at org.apache.pig.impl.streaming.ExecutableManager.access$200(ExecutableManager.java:52)
	at org.apache.pig.impl.streaming.ExecutableManager$ProcessInputThread.run(ExecutableManager.java:372)
2016-09-16 13:53:03,784 ERROR [Thread-31] streaming.ExecutableManager (ExecutableManager.java:run(369)) - Error while reading from POStream and passing it to the streaming process
java.io.FileNotFoundException: foo (No such file or directory)
{code}

It looks like 'foo' should exist before executing the test. To me, it looks 
like we have two similar issues here (and need two test cases): a syntax error 
in the script, and writing to a file that doesn't exist.
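The NullPointerException in close() suggests the stream field can be null when the process is killed before the output file was ever opened. A minimal sketch of the null-guard pattern that avoids this (an illustration, not OutputHandler's actual code):

```java
import java.io.Closeable;
import java.io.IOException;

public class NullSafeCloseDemo {
    // A null check like this in a close() method avoids the
    // NullPointerException seen above when the stream was never opened
    // (e.g. because the output file did not exist).
    static void closeIfOpen(Closeable stream) throws IOException {
        if (stream != null) {
            stream.close();
        }
    }

    public static void main(String[] args) throws IOException {
        closeIfOpen(null); // no NPE even though nothing was ever opened
        System.out.println("ok");
    }
}
```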

> streaming job with store clause stuck if the script fail
> 
>
> Key: PIG-4976
> URL: https://issues.apache.org/jira/browse/PIG-4976
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.17.0
>
> Attachments: PIG-4976-1.patch, PIG-4976-2.patch, PIG-4976-3.patch





[jira] [Commented] (PIG-4976) streaming job with store clause stuck if the script fail

2016-09-22 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15513257#comment-15513257
 ] 

Nandor Kollar commented on PIG-4976:


Ok, thanks for the explanation. I agree, there is no need to create a new file 
in Pig as I would have done in patch #3. The only thing I'm wondering about is 
whether we need two separate test cases: one for a non-existing file and one 
for an existing file with a syntactically incorrect script, or whether the one 
we have is enough. What do you think?

> streaming job with store clause stuck if the script fail
> 
>
> Key: PIG-4976
> URL: https://issues.apache.org/jira/browse/PIG-4976
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.17.0
>
> Attachments: PIG-4976-1.patch, PIG-4976-2.patch, PIG-4976-3.patch, 
> PIG-4976-4.patch, PIG-4976-5-knoguchi.patch





[jira] [Commented] (PIG-2125) Make Pig work with hadoop .NEXT

2016-08-22 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15432037#comment-15432037
 ] 

Nandor Kollar commented on PIG-2125:


How can I make sure that removing the ClientProtocolProvider won't break the 
Hadoop integration? Is there a specific unit test for that, or should I execute 
the entire test suite (which takes a lot of time)?

> Make Pig work with hadoop .NEXT
> ---
>
> Key: PIG-2125
> URL: https://issues.apache.org/jira/browse/PIG-2125
> Project: Pig
>  Issue Type: New Feature
>  Components: impl
>Affects Versions: 0.9.1, 0.10.0
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.9.2, 0.10.0
>
> Attachments: ContextFactory.java, PIG-2125-1.patch, 
> PIG-2125-10.patch, PIG-2125-10_0.9.patch, PIG-2125-2.patch, PIG-2125-3.patch, 
> PIG-2125-4.patch, PIG-2125-5.patch, PIG-2125-6.patch, PIG-2125-7.patch, 
> PIG-2125-7_0.9.patch, PIG-2125-8.patch, PIG-2125-9.patch, 
> PIG-2125-9_0.9.patch, PIG-2125-buildxml-0.9.patch, PIG-2125-commitJob.patch, 
> PIG-2125-ivy-0.9-3.patch, PIG-2125-zebra.patch, e2e-hadoop23.patch
>
>
> We need to make Pig work with hadoop .NEXT, the svn branch currently is: 
> https://svn.apache.org/repos/asf/hadoop/common/branches/MR-279





[jira] [Commented] (PIG-4976) streaming job with store clause stuck if the script fail

2016-09-27 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15525748#comment-15525748
 ] 

Nandor Kollar commented on PIG-4976:


Ok, PIG-4976-5-knoguchi.patch LGTM: no tests hang in TestStreamingLocal on my 
Mac.

> streaming job with store clause stuck if the script fail
> 
>
> Key: PIG-4976
> URL: https://issues.apache.org/jira/browse/PIG-4976
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.17.0
>
> Attachments: PIG-4976-1.patch, PIG-4976-2.patch, PIG-4976-3.patch, 
> PIG-4976-4.patch, PIG-4976-5-knoguchi.patch





[jira] [Created] (PIG-5034) Remove org.apache.hadoop.hive.serde2.objectinspector.primitive package

2016-09-29 Thread Nandor Kollar (JIRA)
Nandor Kollar created PIG-5034:
--

 Summary: Remove 
org.apache.hadoop.hive.serde2.objectinspector.primitive package
 Key: PIG-5034
 URL: https://issues.apache.org/jira/browse/PIG-5034
 Project: Pig
  Issue Type: Improvement
Reporter: Nandor Kollar
Priority: Minor


The object inspector classes in 
org.apache.hadoop.hive.serde2.objectinspector.primitive are no longer required: 
since Pig depends on Hive 1.2.1, all of these have been moved to the Hive code 
base.





[jira] [Commented] (PIG-5048) HiveUDTF fail if it is the first expression in projection

2016-10-27 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15612202#comment-15612202
 ] 

Nandor Kollar commented on PIG-5048:


Do we need this UnlimitedNullTuple class at all? The only place where it is 
used is in POForEach:
{code}
if (inp.returnStatus == POStatus.STATUS_EOP) {
    if (parentPlan != null && parentPlan.endOfAllInput
            && !endOfAllInputProcessed && endOfAllInputProcessing) {
        // continue pull one more output
        inp = new Result(POStatus.STATUS_OK, new UnlimitedNullTuple());
    } else {
        return inp;
    }
}
{code}
As far as I understood, this is used to allow a UDF to produce a last record in 
close. Does close here mean the cleanup phase of map tasks? What if we used 
RESULT_EMPTY from PhysicalOperator instead of UnlimitedNullTuple with OK 
status? The description of STATUS_NULL says 'This is represented as 'null' 
with STATUS_OK', and it seems this is what we need instead of 
UnlimitedNullTuple. [~daijy] could you please review my second patch, and help 
me understand why UnlimitedNullTuple was required? I'd like to add a test case 
where the UDF produces a last record in close to ensure that my patch doesn't 
break it, but I don't know when this happens.
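The suggested replacement can be sketched with plain-Java stand-ins for Pig's Result and POStatus types (a toy model under assumed semantics; the names are borrowed from Pig but these are not the actual classes):

```java
public class ResultDemo {
    // Minimal stand-in for Pig's POStatus values used in this discussion.
    enum Status { OK, EOP }

    // Minimal stand-in for Pig's Result: a status plus an attached value.
    static class Result {
        final Status status;
        final Object value; // a null value with OK status models RESULT_EMPTY
        Result(Status status, Object value) {
            this.status = status;
            this.value = value;
        }
    }

    // The idea from the comment above: when end-of-all-input processing
    // needs to pull one more record, hand back a null record with OK status
    // instead of wrapping a placeholder UnlimitedNullTuple.
    static Result pullOnceMore(Result inp, boolean endOfAllInput) {
        if (inp.status == Status.EOP && endOfAllInput) {
            return new Result(Status.OK, null); // RESULT_EMPTY-style record
        }
        return inp;
    }

    public static void main(String[] args) {
        Result r = pullOnceMore(new Result(Status.EOP, null), true);
        if (r.status != Status.OK || r.value != null) {
            throw new AssertionError("expected OK status with null value");
        }
        System.out.println("ok");
    }
}
```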

> HiveUDTF fail if it is the first expression in projection
> -
>
> Key: PIG-5048
> URL: https://issues.apache.org/jira/browse/PIG-5048
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Daniel Dai
>Assignee: Nandor Kollar
> Fix For: 0.17.0, 0.16.1
>
> Attachments: PIG-5048.patch
>
>
> The following script fails:
> {code}
> define explode HiveUDTF('explode');
> A = load 'bag.txt' as (a0:{(b0:chararray)});
> B = foreach A generate explode(a0);
> dump B;
> {code}
> Message: Unimplemented at 
> org.apache.pig.data.UnlimitedNullTuple.size(UnlimitedNullTuple.java:31)
> If it is not the first projection, the script passes:
> {code}
> define explode HiveUDTF('explode');
> A = load 'bag.txt' as (a0:{(b0:chararray)});
> B = foreach A generate a0, explode(a0);
> dump B;
> {code}
> Thanks [~nkollar] for reporting it!





[jira] [Updated] (PIG-5048) HiveUDTF fail if it is the first expression in projection

2016-10-27 Thread Nandor Kollar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PIG-5048:
---
Attachment: PIG-5048-1.patch

> HiveUDTF fail if it is the first expression in projection
> -
>
> Key: PIG-5048
> URL: https://issues.apache.org/jira/browse/PIG-5048
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Daniel Dai
>Assignee: Nandor Kollar
> Fix For: 0.17.0, 0.16.1
>
> Attachments: PIG-5048-1.patch, PIG-5048.patch





[jira] [Commented] (PIG-5048) HiveUDTF fail if it is the first expression in projection

2016-11-09 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15651841#comment-15651841
 ] 

Nandor Kollar commented on PIG-5048:


Thank you Daniel for the review and for helping me learn more about how 
HiveUDFs work!

> HiveUDTF fail if it is the first expression in projection
> -
>
> Key: PIG-5048
> URL: https://issues.apache.org/jira/browse/PIG-5048
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Daniel Dai
>Assignee: Nandor Kollar
> Fix For: 0.17.0, 0.16.1
>
> Attachments: PIG-5048-1.patch, PIG-5048-2.patch, PIG-5048-3.patch, 
> PIG-5048-4.patch, PIG-5048.patch





[jira] [Updated] (PIG-5048) HiveUDTF fail if it is the first expression in projection

2016-11-09 Thread Nandor Kollar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PIG-5048:
---
Attachment: PIG-5048-4.patch

> HiveUDTF fail if it is the first expression in projection
> -
>
> Key: PIG-5048
> URL: https://issues.apache.org/jira/browse/PIG-5048
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Daniel Dai
>Assignee: Nandor Kollar
> Fix For: 0.17.0, 0.16.1
>
> Attachments: PIG-5048-1.patch, PIG-5048-2.patch, PIG-5048-3.patch, 
> PIG-5048-4.patch, PIG-5048.patch





[jira] [Commented] (PIG-5048) HiveUDTF fail if it is the first expression in projection

2016-11-09 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15650405#comment-15650405
 ] 

Nandor Kollar commented on PIG-5048:


Ok, deleted the UDF and UDAF tests, since those are covered in the e2e test 
suite. I think all the UDTF tests are needed: there is one for the simple bag 
projection (this was the case where the issue was discovered), one for two 
projections, and one for GenericUDTFCount2 to verify that UDTF close() is not 
broken.

> HiveUDTF fail if it is the first expression in projection
> -
>
> Key: PIG-5048
> URL: https://issues.apache.org/jira/browse/PIG-5048
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Daniel Dai
>Assignee: Nandor Kollar
> Fix For: 0.17.0, 0.16.1
>
> Attachments: PIG-5048-1.patch, PIG-5048-2.patch, PIG-5048-3.patch, 
> PIG-5048-4.patch, PIG-5048.patch





[jira] [Commented] (PIG-5048) HiveUDTF fail if it is the first expression in projection

2016-11-08 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15647088#comment-15647088
 ] 

Nandor Kollar commented on PIG-5048:


Thanks Daniel for the comments. I added these tests to the JUnit test suite 
because it was easier to execute and verify; should I delete those which are 
already covered in e2e tests, or is it fine to keep them? I'll change 
hive-contrib to a test dependency.
As for the collector, the UDTF exec was called twice: first from the normal 
process, then for close, and I noticed that in the test, the output for the 
process case was cleared in the close call. I just realized that this is 
probably a test issue with the mock Storage I used in the tests; putNext 
doesn't make a deep copy of the tuples:
{code}
@Override
public void putNext(Tuple t) throws IOException {
    mockRecordWriter.dataBeingWritten.add(TF.newTuple(t.getAll()));
}
{code}
Since in the tests the tuple contains a bag, the output bags of both process 
and close will point to the same bag instance. I'll figure out how to test 
this properly.
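The aliasing effect described above can be reproduced with plain Java collections (a standalone illustration of shallow copying, not Pig's actual Tuple classes):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ShallowCopyDemo {
    public static void main(String[] args) {
        // A "tuple" whose single field is a mutable bag-like collection.
        List<String> bag = new ArrayList<>(Arrays.asList("a", "b"));
        List<Object> tuple = new ArrayList<>();
        tuple.add(bag);

        // Shallow copy, analogous to TF.newTuple(t.getAll()): the copied
        // tuple references the same bag instance, not a copy of its contents.
        List<Object> copied = new ArrayList<>(tuple);

        bag.clear(); // clearing the bag after "writing" the copy...

        // ...also empties the bag reachable through the copied tuple.
        int sizeSeenByCopy = ((List<?>) copied.get(0)).size();
        if (sizeSeenByCopy != 0) {
            throw new AssertionError("expected the copy to alias the bag");
        }
        System.out.println("ok");
    }
}
```

This is why a later mutation in close() appeared to clear output that process had already emitted.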

> HiveUDTF fail if it is the first expression in projection
> -
>
> Key: PIG-5048
> URL: https://issues.apache.org/jira/browse/PIG-5048
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Daniel Dai
>Assignee: Nandor Kollar
> Fix For: 0.17.0, 0.16.1
>
> Attachments: PIG-5048-1.patch, PIG-5048-2.patch, PIG-5048.patch





[jira] [Updated] (PIG-5048) HiveUDTF fail if it is the first expression in projection

2016-11-08 Thread Nandor Kollar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PIG-5048:
---
Attachment: PIG-5048-3.patch

> HiveUDTF fail if it is the first expression in projection
> -
>
> Key: PIG-5048
> URL: https://issues.apache.org/jira/browse/PIG-5048
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Daniel Dai
>Assignee: Nandor Kollar
> Fix For: 0.17.0, 0.16.1
>
> Attachments: PIG-5048-1.patch, PIG-5048-2.patch, PIG-5048-3.patch, 
> PIG-5048.patch





[jira] [Commented] (PIG-5048) HiveUDTF fail if it is the first expression in projection

2016-11-08 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15647550#comment-15647550
 ] 

Nandor Kollar commented on PIG-5048:


Attached patch version 3: decided not to use the mock Store in the tests and 
changed hive-contrib to a test dependency.

> HiveUDTF fail if it is the first expression in projection
> -
>
> Key: PIG-5048
> URL: https://issues.apache.org/jira/browse/PIG-5048
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Daniel Dai
>Assignee: Nandor Kollar
> Fix For: 0.17.0, 0.16.1
>
> Attachments: PIG-5048-1.patch, PIG-5048-2.patch, PIG-5048-3.patch, 
> PIG-5048.patch





[jira] [Commented] (PIG-5026) Remove src/META-INF/services/org.apache.hadoop.mapreduce.protocol.ClientProtocolProvider

2016-10-18 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15584881#comment-15584881
 ] 

Nandor Kollar commented on PIG-5026:


All unit tests passed both in MR and in Tez mode on my Mac. [~daijy] do you 
think it is now safe to commit this change?

> Remove 
> src/META-INF/services/org.apache.hadoop.mapreduce.protocol.ClientProtocolProvider
> 
>
> Key: PIG-5026
> URL: https://issues.apache.org/jira/browse/PIG-5026
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.16.0
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
>Priority: Minor
> Attachments: PIG-5026.patch
>
>
> The ClientProtocolProvider service is implemented in the Hadoop client, so 
> remove the service provider configuration file from Pig code. This file was a 
> workaround in PIG-2125, and it looks like, due to this and MAPREDUCE-6473, 
> Pig related unit tests break in Hive.





[jira] [Commented] (PIG-5048) HiveUDTF fail if it is the first expression in projection

2016-10-25 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15604499#comment-15604499
 ] 

Nandor Kollar commented on PIG-5048:


I just realized that I wanted to say I can attach a patch, but missed 'I'.

> HiveUDTF fail if it is the first expression in projection
> -
>
> Key: PIG-5048
> URL: https://issues.apache.org/jira/browse/PIG-5048
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.17.0, 0.16.1





[jira] [Updated] (PIG-5048) HiveUDTF fail if it is the first expression in projection

2016-10-25 Thread Nandor Kollar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PIG-5048:
---
Attachment: PIG-5048.patch

> HiveUDTF fail if it is the first expression in projection
> -
>
> Key: PIG-5048
> URL: https://issues.apache.org/jira/browse/PIG-5048
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.17.0, 0.16.1
>
> Attachments: PIG-5048.patch





[jira] [Commented] (PIG-5048) HiveUDTF fail if it is the first expression in projection

2016-10-24 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15602951#comment-15602951
 ] 

Nandor Kollar commented on PIG-5048:


It seems that with this trivial modification in UnlimitedNullTuple
{code}
public int size() {
    return -1;
}
{code}
the UDTF gives the (almost) correct result, though I'm afraid this is not the 
best solution. I say almost because I also noticed that there is an extra 
empty tuple at the end (no idea why), but this is present even when the 
projection includes both a0 and explode(a0). Also, it would be nice to 
implement the other methods in this class. [~daijy] if you don't mind, I can 
attach a patch.

> HiveUDTF fail if it is the first expression in projection
> -
>
> Key: PIG-5048
> URL: https://issues.apache.org/jira/browse/PIG-5048
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Daniel Dai
>Assignee: Daniel Dai
> Fix For: 0.17.0, 0.16.1





[jira] [Commented] (PIG-5057) IndexOutOfBoundsException when pig reducer processOnePackageOutput

2016-11-14 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15663167#comment-15663167
 ] 

Nandor Kollar commented on PIG-5057:


Could you please add a test case too which demonstrates the problem?

> IndexOutOfBoundsException when pig reducer processOnePackageOutput
> --
>
> Key: PIG-5057
> URL: https://issues.apache.org/jira/browse/PIG-5057
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.14.0, 0.15.0
> Environment: pig-0.14/pig-0.15 cdh5.3.2
>Reporter: shenxianqiang
>Priority: Minor
> Attachments: PIG-5057.patch
>
>
> When running a Pig job, the reducer throws an out-of-bounds exception, which 
> causes the job to fail.
> {quote}
> 2016-11-14 15:31:04,752 WARN [main] org.apache.hadoop.io.compress.LzoCodec: 
> org.apache.hadoop.io.compress.LzoCodec is deprecated. You should use 
> com.hadoop.compression.lzo.LzoCodec instead to generate LZO compressed data.
> 2016-11-14 15:32:08,300 WARN [main] org.apache.hadoop.mapred.YarnChild: 
> Exception running child : java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
>   at java.util.ArrayList.rangeCheck(ArrayList.java:635)
>   at java.util.ArrayList.get(ArrayList.java:411)
>   at org.apache.pig.data.DefaultTuple.get(DefaultTuple.java:117)
>   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.Packager.getValueTuple(Packager.java:234)
>   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPackage$PeekedBag$1.next(POPackage.java:435)
>   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPackage$PeekedBag$1.next(POPackage.java:408)
>   at org.apache.pig.data.DefaultAbstractBag.addAll(DefaultAbstractBag.java:151)
>   at org.apache.pig.data.DefaultAbstractBag.addAll(DefaultAbstractBag.java:137)
>   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.Packager.attachInput(Packager.java:125)
>   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POPackage.getNextTuple(POPackage.java:283)
>   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.processOnePackageOutput(PigGenericMapReduce.java:431)
>   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:422)
>   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:269)
>   at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:171)
>   at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:627)
>   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:389)
>   at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1892)
>   at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
> {quote}





[jira] [Commented] (PIG-4952) Calculate the value of parallism for spark mode

2016-11-23 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15691496#comment-15691496
 ] 

Nandor Kollar commented on PIG-4952:


Apart from the blog post you found, I haven't found anything else yet.

> Calculate the value of parallism for spark mode
> ---
>
> Key: PIG-4952
> URL: https://issues.apache.org/jira/browse/PIG-4952
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: Nandor Kollar
> Fix For: spark-branch
>
> Attachments: PIG-4952_1.patch
>
>
> Calculate the value of parallism for spark mode like what 
> org.apache.pig.backend.hadoop.executionengine.tez.plan.optimizer.ParallelismSetter
>  does.





[jira] [Assigned] (PIG-4952) Calculate the value of parallism for spark mode

2016-11-22 Thread Nandor Kollar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar reassigned PIG-4952:
--

Assignee: Nandor Kollar

> Calculate the value of parallism for spark mode
> ---
>
> Key: PIG-4952
> URL: https://issues.apache.org/jira/browse/PIG-4952
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: Nandor Kollar
> Fix For: spark-branch
>
>
> Calculate the value of parallism for spark mode like what 
> org.apache.pig.backend.hadoop.executionengine.tez.plan.optimizer.ParallelismSetter
>  does.





[jira] [Commented] (PIG-4952) Calculate the value of parallism for spark mode

2016-11-22 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15687248#comment-15687248
 ] 

Nandor Kollar commented on PIG-4952:


[~kellyzly] could you please explain what ParallelismSetter should do in Spark 
mode? As far as I understand, for the Spark operations that form a stage 
boundary, the partition count should be a configuration option (as in your 
example for join). I took a look at the current converter implementations and 
found one or two operations where this is not yet implemented; I will attach a 
patch soon.

> Calculate the value of parallism for spark mode
> ---
>
> Key: PIG-4952
> URL: https://issues.apache.org/jira/browse/PIG-4952
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: Nandor Kollar
> Fix For: spark-branch
>
>
> Calculate the value of parallism for spark mode like what 
> org.apache.pig.backend.hadoop.executionengine.tez.plan.optimizer.ParallelismSetter
>  does.





[jira] [Updated] (PIG-4952) Calculate the value of parallism for spark mode

2016-11-22 Thread Nandor Kollar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PIG-4952:
---
Attachment: PIG-4952_1.patch

> Calculate the value of parallism for spark mode
> ---
>
> Key: PIG-4952
> URL: https://issues.apache.org/jira/browse/PIG-4952
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: Nandor Kollar
> Fix For: spark-branch
>
> Attachments: PIG-4952_1.patch
>
>
> Calculate the value of parallism for spark mode like what 
> org.apache.pig.backend.hadoop.executionengine.tez.plan.optimizer.ParallelismSetter
>  does.





[jira] [Commented] (PIG-5069) Skewed Join is crashing job

2016-11-28 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15701371#comment-15701371
 ] 

Nandor Kollar commented on PIG-5069:


It looks similar to PIG-3417. I took a look at that issue, and it seems that the 
secondary key optimizer causes the problem. If you turn it off 
(-Dpig.exec.nosecondarykey=true), does it still fail?

> Skewed Join is crashing job
> ---
>
> Key: PIG-5069
> URL: https://issues.apache.org/jira/browse/PIG-5069
> Project: Pig
>  Issue Type: Bug
>Reporter: Carlos Flores
>
> The script below was working fine, but when I added the skewed join it began 
> to give errors.
> ERROR: java.lang.Long cannot be cast to org.apache.pig.data.Tuple
> {code:sql}
> SET mapred.job.queue.name marathon;
> SET pig.maxCombinedSplitSize 2147483648;
> SET default_parallel 500;
> dim_member_skill_final_opp_1 = LOAD '/user/username/SkillsDashboardUS/OPP-JOIN' USING LiAvroStorage();
> top_skills_1 = LOAD '/user/username/SkillsDashboardUS/Top_Skills_Only' using LiAvroStorage();
> 
> dim_member_skill_final_opp = GROUP dim_member_skill_final_opp_1 by (country_sk,skill);
> top_skills = GROUP top_skills_1 by (country_sk,skill);
> opp_country = JOIN dim_member_skill_final_opp BY (group), top_skills BY (group) using 'skewed';
> opp_country_generate = FOREACH opp_country GENERATE
> FLATTEN(top_skills::group) as (country_sk,skill),
> FLATTEN(top_skills::top_skills_1) as (country_sk2,title_sk,skill2,sum_of_members),
> FLATTEN(dim_member_skill_final_opp::dim_member_skill_final_opp_1) as (member_sk,country_sk1,skill1);
> opp_generate = FOREACH opp_country_generate GENERATE
> country_sk,
> title_sk,
> member_sk;
> opp_distinct = DISTINCT opp_generate;
> opp_grouping = GROUP opp_distinct BY (country_sk,title_sk);
> opp_count = FOREACH opp_grouping GENERATE
> FLATTEN(group) AS (country_sk,title_sk),
> COUNT(opp_distinct) AS sum_of_members;
> store opp_count into '/user/username/update/OPP-Index-US-skewed' using LiAvroStorage();
> {code}





[jira] [Created] (PIG-5053) Can't change e2e test HDFS user home using Ant

2016-10-28 Thread Nandor Kollar (JIRA)
Nandor Kollar created PIG-5053:
--

 Summary: Can't change e2e test HDFS user home using Ant
 Key: PIG-5053
 URL: https://issues.apache.org/jira/browse/PIG-5053
 Project: Pig
  Issue Type: Improvement
Reporter: Nandor Kollar
Assignee: Nandor Kollar
Priority: Minor


HDFS user home is /user/pig by default for e2e tests. I can change this in the 
Perl script, but there's no corresponding parameter when I start the e2e tests 
via Ant.





[jira] [Updated] (PIG-5053) Can't change e2e test HDFS user home using Ant

2016-10-28 Thread Nandor Kollar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PIG-5053:
---
Attachment: PIG-5053.patch

> Can't change e2e test HDFS user home using Ant
> --
>
> Key: PIG-5053
> URL: https://issues.apache.org/jira/browse/PIG-5053
> Project: Pig
>  Issue Type: Improvement
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
>Priority: Minor
> Attachments: PIG-5053.patch
>
>
> HDFS user home is /user/pig by default for e2e tests. I can change this in 
> the Perl script, but there's no corresponding parameter when I start the e2e 
> tests via Ant.





[jira] [Updated] (PIG-5053) Can't change HDFS user home in e2e tests using Ant

2016-10-28 Thread Nandor Kollar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PIG-5053:
---
Summary: Can't change HDFS user home in e2e tests using Ant  (was: Can't 
change e2e test HDFS user home using Ant)

> Can't change HDFS user home in e2e tests using Ant
> --
>
> Key: PIG-5053
> URL: https://issues.apache.org/jira/browse/PIG-5053
> Project: Pig
>  Issue Type: Improvement
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
>Priority: Minor
> Attachments: PIG-5053.patch
>
>
> HDFS user home is /user/pig by default for e2e tests. I can change this in 
> the Perl script, but there's no corresponding parameter when I start the e2e 
> tests via Ant.





[jira] [Updated] (PIG-5048) HiveUDTF fail if it is the first expression in projection

2016-11-03 Thread Nandor Kollar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PIG-5048:
---
Attachment: PIG-5048-2.patch

> HiveUDTF fail if it is the first expression in projection
> -
>
> Key: PIG-5048
> URL: https://issues.apache.org/jira/browse/PIG-5048
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Daniel Dai
>Assignee: Nandor Kollar
> Fix For: 0.17.0, 0.16.1
>
> Attachments: PIG-5048-1.patch, PIG-5048-2.patch, PIG-5048.patch
>
>
> The following script fails:
> {code}
> define explode HiveUDTF('explode');
> A = load 'bag.txt' as (a0:{(b0:chararray)});
> B = foreach A generate explode(a0);
> dump B;
> {code}
> Message: Unimplemented at 
> org.apache.pig.data.UnlimitedNullTuple.size(UnlimitedNullTuple.java:31)
> If it is not the first projection, the script passes:
> {code}
> define explode HiveUDTF('explode');
> A = load 'bag.txt' as (a0:{(b0:chararray)});
> B = foreach A generate a0, explode(a0);
> dump B;
> {code}
> Thanks [~nkollar] for reporting it!





[jira] [Commented] (PIG-5048) HiveUDTF fail if it is the first expression in projection

2016-11-03 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15632699#comment-15632699
 ] 

Nandor Kollar commented on PIG-5048:


Attached a new version of my patch; it includes these changes:
- test cases for Hive UDFs, UDTFs and UDAFs
- extracted the UnlimitedNullTuple to a constant in POForEach
- added the Hive contrib package to the dependencies to be able to use 
GenericUDTFCount2 in the tests
- UnlimitedNullTuple's size method no longer throws an exception but returns 
Integer.MAX_VALUE
- in the HiveUDTF class, the collector is reused both in close and in the 
normal process case, so if init merely cleared the current bag instead of 
creating a new one, close() would erase the result of the normal process phase

One thing I still don't like: if close doesn't produce any new tuples because 
close is not implemented at all in the UDF, an empty tuple is still appended to 
the end of the output. I don't know how to handle this case, since we can't 
tell whether the Hive UDF actually did something in close but produced an empty 
result (in which case I think we have to append the empty result to the 
output), or close was not implemented at all (in which case appending an empty 
tuple doesn't make sense). [~daijy] what do you think? Could you please help 
with the review?
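The collector-reuse issue described above can be sketched with plain Java collections (a hedged illustration only: `CollectorSketch`, `collect` and `flush` are made-up names standing in for Pig's bag and collector classes, not the actual patch):

```java
import java.util.ArrayList;
import java.util.List;

public class CollectorSketch {
    private List<String> bag = new ArrayList<>();

    void collect(String row) { bag.add(row); }

    // Hand over the rows collected so far and start a fresh bag. If this method
    // called bag.clear() instead, the previously returned list would be emptied
    // too, and the close() phase would erase the output of the process phase.
    List<String> flush() {
        List<String> result = bag;
        bag = new ArrayList<>();
        return result;
    }

    public static void main(String[] args) {
        CollectorSketch c = new CollectorSketch();
        c.collect("processRow");
        List<String> processOutput = c.flush(); // rows from the normal process phase
        c.collect("closeRow");                  // rows produced during close()
        c.flush();
        System.out.println(processOutput);      // still holds the process-phase row
    }
}
```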

> HiveUDTF fail if it is the first expression in projection
> -
>
> Key: PIG-5048
> URL: https://issues.apache.org/jira/browse/PIG-5048
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Reporter: Daniel Dai
>Assignee: Nandor Kollar
> Fix For: 0.17.0, 0.16.1
>
> Attachments: PIG-5048-1.patch, PIG-5048-2.patch, PIG-5048.patch
>
>
> The following script fails:
> {code}
> define explode HiveUDTF('explode');
> A = load 'bag.txt' as (a0:{(b0:chararray)});
> B = foreach A generate explode(a0);
> dump B;
> {code}
> Message: Unimplemented at 
> org.apache.pig.data.UnlimitedNullTuple.size(UnlimitedNullTuple.java:31)
> If it is not the first projection, the script passes:
> {code}
> define explode HiveUDTF('explode');
> A = load 'bag.txt' as (a0:{(b0:chararray)});
> B = foreach A generate a0, explode(a0);
> dump B;
> {code}
> Thanks [~nkollar] for reporting it!





[jira] [Assigned] (PIG-5034) Remove org.apache.hadoop.hive.serde2.objectinspector.primitive package

2016-11-03 Thread Nandor Kollar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar reassigned PIG-5034:
--

Assignee: Nandor Kollar

> Remove org.apache.hadoop.hive.serde2.objectinspector.primitive package
> --
>
> Key: PIG-5034
> URL: https://issues.apache.org/jira/browse/PIG-5034
> Project: Pig
>  Issue Type: Improvement
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
>Priority: Minor
>
> Object inspector classes in 
> org.apache.hadoop.hive.serde2.objectinspector.primitive are no longer 
> required since Pig depends on Hive 1.2.1 and all of these were moved to Hive 
> code base.





[jira] [Updated] (PIG-5034) Remove org.apache.hadoop.hive.serde2.objectinspector.primitive package

2016-11-03 Thread Nandor Kollar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PIG-5034:
---
Status: Patch Available  (was: Open)

> Remove org.apache.hadoop.hive.serde2.objectinspector.primitive package
> --
>
> Key: PIG-5034
> URL: https://issues.apache.org/jira/browse/PIG-5034
> Project: Pig
>  Issue Type: Improvement
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
>Priority: Minor
> Attachments: PIG-5034.patch
>
>
> Object inspector classes in 
> org.apache.hadoop.hive.serde2.objectinspector.primitive are no longer 
> required since Pig depends on Hive 1.2.1 and all of these were moved to Hive 
> code base.





[jira] [Updated] (PIG-5034) Remove org.apache.hadoop.hive.serde2.objectinspector.primitive package

2016-11-03 Thread Nandor Kollar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5034?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PIG-5034:
---
Attachment: PIG-5034.patch

> Remove org.apache.hadoop.hive.serde2.objectinspector.primitive package
> --
>
> Key: PIG-5034
> URL: https://issues.apache.org/jira/browse/PIG-5034
> Project: Pig
>  Issue Type: Improvement
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
>Priority: Minor
> Attachments: PIG-5034.patch
>
>
> Object inspector classes in 
> org.apache.hadoop.hive.serde2.objectinspector.primitive are no longer 
> required since Pig depends on Hive 1.2.1 and all of these were moved to Hive 
> code base.





[jira] [Updated] (PIG-5039) TestTypeCheckingValidatorNewLP.TestTypeCheckingValidatorNewLP is failing

2016-10-11 Thread Nandor Kollar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PIG-5039:
---
Attachment: PIG-5039.patch

> TestTypeCheckingValidatorNewLP.TestTypeCheckingValidatorNewLP is failing
> 
>
> Key: PIG-5039
> URL: https://issues.apache.org/jira/browse/PIG-5039
> Project: Pig
>  Issue Type: Bug
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
>Priority: Minor
> Attachments: PIG-5039.patch
>
>
> The test asserts for "Cannot resolve load function to use for casting from 
> bytearray to double at", but the cast is actually to chararray.





[jira] [Created] (PIG-5039) TestTypeCheckingValidatorNewLP.TestTypeCheckingValidatorNewLP is failing

2016-10-11 Thread Nandor Kollar (JIRA)
Nandor Kollar created PIG-5039:
--

 Summary: 
TestTypeCheckingValidatorNewLP.TestTypeCheckingValidatorNewLP is failing
 Key: PIG-5039
 URL: https://issues.apache.org/jira/browse/PIG-5039
 Project: Pig
  Issue Type: Bug
Reporter: Nandor Kollar
Assignee: Nandor Kollar
Priority: Minor


The test asserts for "Cannot resolve load function to use for casting from 
bytearray to double at", but the cast is actually to chararray.





[jira] [Updated] (PIG-5039) TestTypeCheckingValidatorNewLP.TestTypeCheckingValidatorNewLP is failing

2016-10-11 Thread Nandor Kollar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PIG-5039:
---
Status: Patch Available  (was: Open)

> TestTypeCheckingValidatorNewLP.TestTypeCheckingValidatorNewLP is failing
> 
>
> Key: PIG-5039
> URL: https://issues.apache.org/jira/browse/PIG-5039
> Project: Pig
>  Issue Type: Bug
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
>Priority: Minor
> Attachments: PIG-5039.patch
>
>
> The test asserts for "Cannot resolve load function to use for casting from 
> bytearray to double at", but the cast is actually to chararray.





[jira] [Work started] (PIG-5026) Remove src/META-INF/services/org.apache.hadoop.mapreduce.protocol.ClientProtocolProvider

2016-10-11 Thread Nandor Kollar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on PIG-5026 started by Nandor Kollar.
--
> Remove 
> src/META-INF/services/org.apache.hadoop.mapreduce.protocol.ClientProtocolProvider
> 
>
> Key: PIG-5026
> URL: https://issues.apache.org/jira/browse/PIG-5026
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.16.0
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
>Priority: Minor
> Attachments: PIG-5026.patch
>
>
> The ClientProtocolProvider service is implemented in the Hadoop client, so the 
> service provider configuration file can be removed from the Pig code. This 
> file was a workaround in PIG-2125, and it looks like Pig-related unit tests 
> break in Hive due to this and MAPREDUCE-6473.





[jira] [Updated] (PIG-5026) Remove src/META-INF/services/org.apache.hadoop.mapreduce.protocol.ClientProtocolProvider

2016-10-11 Thread Nandor Kollar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PIG-5026:
---
Attachment: PIG-5026.patch

> Remove 
> src/META-INF/services/org.apache.hadoop.mapreduce.protocol.ClientProtocolProvider
> 
>
> Key: PIG-5026
> URL: https://issues.apache.org/jira/browse/PIG-5026
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.16.0
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
>Priority: Minor
> Attachments: PIG-5026.patch
>
>
> The ClientProtocolProvider service is implemented in the Hadoop client, so the 
> service provider configuration file can be removed from the Pig code. This 
> file was a workaround in PIG-2125, and it looks like Pig-related unit tests 
> break in Hive due to this and MAPREDUCE-6473.





[jira] [Commented] (PIG-5039) TestTypeCheckingValidatorNewLP.TestTypeCheckingValidatorNewLP is failing

2016-10-11 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15565546#comment-15565546
 ] 

Nandor Kollar commented on PIG-5039:


Thanks for the review! :)

> TestTypeCheckingValidatorNewLP.TestTypeCheckingValidatorNewLP is failing
> 
>
> Key: PIG-5039
> URL: https://issues.apache.org/jira/browse/PIG-5039
> Project: Pig
>  Issue Type: Bug
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
>Priority: Minor
> Fix For: 0.17.0
>
> Attachments: PIG-5039.patch
>
>
> The test asserts for "Cannot resolve load function to use for casting from 
> bytearray to double at", but the cast is actually to chararray.





[jira] [Updated] (PIG-5026) Remove src/META-INF/services/org.apache.hadoop.mapreduce.protocol.ClientProtocolProvider

2016-10-11 Thread Nandor Kollar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PIG-5026:
---
Status: Patch Available  (was: In Progress)

> Remove 
> src/META-INF/services/org.apache.hadoop.mapreduce.protocol.ClientProtocolProvider
> 
>
> Key: PIG-5026
> URL: https://issues.apache.org/jira/browse/PIG-5026
> Project: Pig
>  Issue Type: Improvement
>Affects Versions: 0.16.0
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
>Priority: Minor
> Attachments: PIG-5026.patch
>
>
> The ClientProtocolProvider service is implemented in the Hadoop client, so the 
> service provider configuration file can be removed from the Pig code. This 
> file was a workaround in PIG-2125, and it looks like Pig-related unit tests 
> break in Hive due to this and MAPREDUCE-6473.





[jira] [Updated] (PIG-4750) REPLACE_MULTI should compile Pattern once and reuse it

2016-10-10 Thread Nandor Kollar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PIG-4750:
---
Attachment: PIG-4750.patch

> REPLACE_MULTI should compile Pattern once and reuse it
> --
>
> Key: PIG-4750
> URL: https://issues.apache.org/jira/browse/PIG-4750
> Project: Pig
>  Issue Type: Improvement
>Reporter: Rohini Palaniswamy
>Assignee: Murali Rao
>  Labels: newbie
> Attachments: PIG-4750.patch
>
>
> Details in 
> https://issues.apache.org/jira/browse/PIG-4673?focusedCommentId=14876190=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14876190





[jira] [Assigned] (PIG-4750) REPLACE_MULTI should compile Pattern once and reuse it

2016-10-10 Thread Nandor Kollar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar reassigned PIG-4750:
--

Assignee: Nandor Kollar  (was: Murali Rao)

> REPLACE_MULTI should compile Pattern once and reuse it
> --
>
> Key: PIG-4750
> URL: https://issues.apache.org/jira/browse/PIG-4750
> Project: Pig
>  Issue Type: Improvement
>Reporter: Rohini Palaniswamy
>Assignee: Nandor Kollar
>  Labels: newbie
> Attachments: PIG-4750.patch
>
>
> Details in 
> https://issues.apache.org/jira/browse/PIG-4673?focusedCommentId=14876190=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14876190





[jira] [Comment Edited] (PIG-4750) REPLACE_MULTI should compile Pattern once and reuse it

2016-10-10 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15561830#comment-15561830
 ] 

Nandor Kollar edited comment on PIG-4750 at 10/10/16 9:48 AM:
--

Attached a patch and two test cases. This patch changes the REPLACE_MULTI 
semantics:
- replacement rules are no longer chained; all are applied in one pass, so the 
rules "a" -> "b" and "b" -> "c" applied to "a b" now produce "b c" instead of 
"c c" as before
- keys in the replacement map are treated as plain strings, not regexes: 
"|" -> "x" replaces every "|" in the input with "x"

[~rohini] could you please have a look at my patch?


was (Author: nkollar):
Attached a patch and two test cases. This patch changes the REPLACE_MULTI 
semantic:
- no chaining of replacement rule, all are applied now in one go, rule "a" -> 
"b" and "b" -> "c" on "a b" will result in "b c" and not "c c", like before
- keys in the replacement map are treated as plain string, and not regex: "|" 
-> "x" will replace all "|" to "x" in input
[~rohini] could you please have a look at my patch?

> REPLACE_MULTI should compile Pattern once and reuse it
> --
>
> Key: PIG-4750
> URL: https://issues.apache.org/jira/browse/PIG-4750
> Project: Pig
>  Issue Type: Improvement
>Reporter: Rohini Palaniswamy
>Assignee: Nandor Kollar
>  Labels: newbie
> Attachments: PIG-4750.patch
>
>
> Details in 
> https://issues.apache.org/jira/browse/PIG-4673?focusedCommentId=14876190=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14876190





[jira] [Commented] (PIG-4750) REPLACE_MULTI should compile Pattern once and reuse it

2016-10-10 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15561830#comment-15561830
 ] 

Nandor Kollar commented on PIG-4750:


Attached a patch and two test cases. This patch changes the REPLACE_MULTI 
semantics:
- replacement rules are no longer chained; all are applied in one pass, so the 
rules "a" -> "b" and "b" -> "c" applied to "a b" now produce "b c" instead of 
"c c" as before
- keys in the replacement map are treated as plain strings, not regexes: 
"|" -> "x" replaces every "|" in the input with "x"
[~rohini] could you please have a look at my patch?
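The single-pass, literal-key behaviour described above can be sketched with the JDK regex API (a hedged sketch, not the actual REPLACE_MULTI implementation; the class and method names are illustrative):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class ReplaceMultiSketch {
    // Build one alternation from the (plain-string) keys, compile it once,
    // and apply all replacements in a single pass so rules do not chain.
    static String replaceMulti(String input, Map<String, String> rules) {
        String alternation = rules.keySet().stream()
                .map(Pattern::quote) // keys are literals, not regexes
                .collect(Collectors.joining("|"));
        Matcher m = Pattern.compile(alternation).matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            m.appendReplacement(sb, Matcher.quoteReplacement(rules.get(m.group())));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> rules = new LinkedHashMap<>();
        rules.put("a", "b");
        rules.put("b", "c");
        // Single pass: "a b" -> "b c" (chained application would give "c c")
        System.out.println(replaceMulti("a b", rules));
    }
}
```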

> REPLACE_MULTI should compile Pattern once and reuse it
> --
>
> Key: PIG-4750
> URL: https://issues.apache.org/jira/browse/PIG-4750
> Project: Pig
>  Issue Type: Improvement
>Reporter: Rohini Palaniswamy
>Assignee: Nandor Kollar
>  Labels: newbie
> Attachments: PIG-4750.patch
>
>
> Details in 
> https://issues.apache.org/jira/browse/PIG-4673?focusedCommentId=14876190=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14876190





[jira] [Updated] (PIG-4750) REPLACE_MULTI should compile Pattern once and reuse it

2016-10-10 Thread Nandor Kollar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PIG-4750:
---
Status: Patch Available  (was: Open)

> REPLACE_MULTI should compile Pattern once and reuse it
> --
>
> Key: PIG-4750
> URL: https://issues.apache.org/jira/browse/PIG-4750
> Project: Pig
>  Issue Type: Improvement
>Reporter: Rohini Palaniswamy
>Assignee: Nandor Kollar
>  Labels: newbie
> Attachments: PIG-4750.patch
>
>
> Details in 
> https://issues.apache.org/jira/browse/PIG-4673?focusedCommentId=14876190=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14876190





[jira] [Commented] (PIG-4930) Skewed Join Breaks On Empty Sampled Input When Key is From Map

2016-12-16 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15754306#comment-15754306
 ] 

Nandor Kollar commented on PIG-4930:


[~daijy] would you mind if I worked on this item?

> Skewed Join Breaks On Empty Sampled Input When Key is From Map
> --
>
> Key: PIG-4930
> URL: https://issues.apache.org/jira/browse/PIG-4930
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.9.2, 0.16.0
>Reporter: William Butler
>Assignee: Daniel Dai
> Fix For: 0.17.0
>
> Attachments: empty_skew.diff
>
>
> When using a skewed join, if the left relation gets its key from a map and 
> said relation is empty, then the skewed join fails during the sampling phase 
> with:
> org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception 
> while executing (Name: Local Rearrange[tuple]{tuple}(false) - scope-27 
> Operator Key: scope-27): 
> org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception 
> while executing [POMapLookUp (Name: POMapLookUp[bytearray] - scope-14 
> Operator Key: scope-14) children: null at [null[3,17]]]: 
> java.lang.ClassCastException: java.lang.String cannot be cast to java.util.Map
>   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:314)
>   at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNextTuple(POLocalRearrange.java:287)
>   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:280)
>   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:275)
>   at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:65)
>   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>   at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
>   at org.apache.hadoop.mapred.Child.main(Child.java:249)
> I think the problem is more fundamental to Pig's skewed join implementation 
> than maps, but it is easily demonstrable with them. I have written an 
> additional test in TestSkewedJoin that demonstrates the problem. The join 
> works correctly if we remove "using 'skewed'"





[jira] [Commented] (PIG-4748) DateTimeWritable forgets Chronology

2016-12-16 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15754626#comment-15754626
 ] 

Nandor Kollar commented on PIG-4748:


Using zone ID instead of offset looks good to me.
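The zone-ID-vs-offset distinction can be illustrated with the JDK's java.time API (a hedged analogy to the Joda-Time types in the issue, not the actual patch): a fixed offset such as "+01:00" pins the same instant but forgets the region and its DST rules, while a zone ID such as "Europe/Berlin" survives a round trip intact.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

public class ZoneIdVsOffset {
    public static void main(String[] args) {
        Instant instant = Instant.parse("2016-01-15T10:00:00Z");
        ZonedDateTime byZoneId = instant.atZone(ZoneId.of("Europe/Berlin"));
        ZonedDateTime byOffset = instant.atZone(ZoneOffset.of("+01:00"));
        // Same instant and the same local wall-clock time in winter...
        System.out.println(byZoneId.toLocalDateTime().equals(byOffset.toLocalDateTime()));
        // ...but the zones differ, so DST rules are lost with the bare offset.
        System.out.println(byZoneId.getZone().equals(byOffset.getZone()));
    }
}
```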

> DateTimeWritable forgets Chronology
> ---
>
> Key: PIG-4748
> URL: https://issues.apache.org/jira/browse/PIG-4748
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.16.0
>Reporter: Martin Junghanns
>Assignee: Adam Szita
> Fix For: 0.17.0
>
> Attachments: PIG-4748.patch
>
>
> The following test fails:
> {code}
> @Test
> public void foo() throws IOException {
> DateTime nowIn = DateTime.now();
> DateTimeWritable in = new DateTimeWritable(nowIn);
> ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
> DataOutputStream dataOut = new DataOutputStream(outputStream);
> in.write(dataOut);
> dataOut.flush();
> // read from byte[]
> DateTimeWritable out = new DateTimeWritable();
> ByteArrayInputStream inputStream = new ByteArrayInputStream(
>   outputStream.toByteArray());
> DataInputStream dataIn = new DataInputStream(inputStream);
> out.readFields(dataIn);
> assertEquals(in.get(), out.get());
> }
> {code}
> In equals(), the original instance has
> {code}
> ISOChronology[Europe/Berlin]
> {code}
> while the deserialized instance has
> {code}
> ISOChronology[+01:00]
> {code}





[jira] [Commented] (PIG-4815) Add xml format support for 'explain' in spark engine

2016-12-12 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15741770#comment-15741770
 ] 

Nandor Kollar commented on PIG-4815:


[~xuefuz] it looks like XMLSparkPrinter is missing from the commit, and the 
build is broken now. Can you please add this file too? It is included in Adam's 
second patch.

> Add xml format support for 'explain' in spark engine 
> -
>
> Key: PIG-4815
> URL: https://issues.apache.org/jira/browse/PIG-4815
> Project: Pig
>  Issue Type: Task
>  Components: spark
>Reporter: Prateek Vaishnav
>Assignee: Adam Szita
> Fix For: spark-branch
>
> Attachments: PIG-4815.2.patch, PIG-4815.patch
>
>






[jira] [Commented] (PIG-3417) Skewed Join On Tuple Column Kills Job

2016-12-12 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15741601#comment-15741601
 ] 

Nandor Kollar commented on PIG-3417:


Investigated this issue, and it seems that the problem is with the 
optimization of the sampling job. When the join key is a composite key, it 
gets flattened in the sampling job, but since the secondary key optimizer 
expects composite keys to be wrapped in tuples, we get a ClassCastException 
(this also explains why the query didn't fail when the secondary key 
optimizer was switched off). I think not flattening the tuples in the 
sampling job would solve the problem: PartitionSkewedKeys would work on a 
((key1, key2, ...), (tuple mem size, key count)) format for composite keys, 
and on a (key, (tuple mem size, key count)) format for non-composite keys. 
This way we can apply the secondary key optimizer to the sampling job too. 
Attached a patch; tests in TestSkewedJoin (including Nick's test case) passed 
in both MR and Tez mode, and I'm waiting for the result of the entire test 
suite to make sure it doesn't break other test cases.
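The two sample-record layouts proposed above can be sketched with plain Python tuples (illustrative stand-ins only; the real PartitionSkewedKeys operates on Pig Tuple objects, and `sample_record` is a hypothetical helper, not Pig code):

```python
# Illustrative sketch of the two sample-record layouts discussed above.
# Plain Python tuples stand in for Pig tuples; this is not Pig's API.

def sample_record(key, mem_size, key_count, composite=False):
    """Build a ((k1, k2, ...), (mem, count)) record for composite keys,
    keeping the composite key wrapped in a single tuple (not flattened),
    or a (key, (mem, count)) record for a simple key."""
    if composite:
        return (tuple(key), (mem_size, key_count))
    return (key, (mem_size, key_count))

# simple key vs. composite (tuple) key
simple = sample_record("alice", 128, 3)
composite = sample_record(["alice", 20], 128, 3, composite=True)
```

Keeping the composite key wrapped is what lets the secondary key optimizer treat both layouts uniformly.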

> Skewed Join On Tuple Column Kills Job 
> --
>
> Key: PIG-3417
> URL: https://issues.apache.org/jira/browse/PIG-3417
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.11.1
>Reporter: Nick White
>Priority: Critical
> Attachments: TestSkewJoinWithTuples.java
>
>
> I've attached a test case that fails, but should pass. The test case groups 
> two relations separately, then full-outer joins them on the grouped columns. 
> The test case passes if "using 'skewed'" is removed.





[jira] [Commented] (PIG-4952) Calculate the value of parallism for spark mode

2016-12-07 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15728194#comment-15728194
 ] 

Nandor Kollar commented on PIG-4952:


This is simpler than I thought! Looks good; however, it seems to me that the 
parallelism is not set for sort, skew join and rank. Can you please have a look 
at the patch I attached? Does it make sense to set the parallelism level for 
these operators too, based on your patch?

> Calculate the value of parallism for spark mode
> ---
>
> Key: PIG-4952
> URL: https://issues.apache.org/jira/browse/PIG-4952
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4952.patch, PIG-4952_1.patch
>
>
> Calculate the value of parallism for spark mode like what 
> org.apache.pig.backend.hadoop.executionengine.tez.plan.optimizer.ParallelismSetter
>  does.





[jira] [Commented] (PIG-4952) Calculate the value of parallism for spark mode

2016-12-07 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15728334#comment-15728334
 ] 

Nandor Kollar commented on PIG-4952:


[~kellyzly] you search for the max default parallelism value by iterating 
through the parent RDDs, but since you use the SparkContext, isn't this value 
the same for each parent RDD? How about this instead for tmpParallelism:
{code}
int tmpParallelism = predecessors.get(i).getNumPartitions();
{code}
How can we test whether we achieved a performance improvement?
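For illustration, the suggestion boils down to taking the largest partition count among the predecessor RDDs instead of the SparkContext default; `FakeRDD` and `pick_parallelism` below are hypothetical stand-ins, not Pig or Spark API:

```python
# Rough model of the suggested parallelism choice: take the largest
# partition count among the predecessor RDDs. The real code would call
# RDD.getNumPartitions() on actual Spark RDDs; FakeRDD is a stand-in.

class FakeRDD:
    def __init__(self, num_partitions):
        self._n = num_partitions

    def getNumPartitions(self):
        return self._n

def pick_parallelism(predecessors, floor=1):
    # max over predecessors, never below a floor (for the no-predecessor case)
    return max([floor] + [p.getNumPartitions() for p in predecessors])
```

This avoids the issue that SparkContext's default parallelism is identical for every predecessor.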

> Calculate the value of parallism for spark mode
> ---
>
> Key: PIG-4952
> URL: https://issues.apache.org/jira/browse/PIG-4952
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-4952.patch, PIG-4952_1.patch
>
>
> Calculate the value of parallism for spark mode like what 
> org.apache.pig.backend.hadoop.executionengine.tez.plan.optimizer.ParallelismSetter
>  does.





[jira] [Commented] (PIG-3417) Skewed Join On Tuple Column Kills Job

2016-12-15 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15750855#comment-15750855
 ] 

Nandor Kollar commented on PIG-3417:


Looks like the patch didn't break any tests, in either Tez or MR mode. 
[~rohini], [~daijy], [~knoguchi] what do you think? Could one of you please 
take a look at my patch? I am interested in your thoughts.

> Skewed Join On Tuple Column Kills Job 
> --
>
> Key: PIG-3417
> URL: https://issues.apache.org/jira/browse/PIG-3417
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.11.1
>Reporter: Nick White
>Assignee: Nandor Kollar
>Priority: Critical
> Fix For: 0.17.0
>
> Attachments: PIG-3417.patch, TestSkewJoinWithTuples.java
>
>
> I've attached a test case that fails, but should pass. The test case groups 
> two relations separately, then full-outer joins them on the grouped columns. 
> The test case passes if "using 'skewed'" is removed.





[jira] [Updated] (PIG-3417) Skewed Join On Tuple Column Kills Job

2016-12-15 Thread Nandor Kollar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PIG-3417:
---
Fix Version/s: 0.17.0
   Status: Patch Available  (was: Open)

> Skewed Join On Tuple Column Kills Job 
> --
>
> Key: PIG-3417
> URL: https://issues.apache.org/jira/browse/PIG-3417
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.11.1
>Reporter: Nick White
>Assignee: Nandor Kollar
>Priority: Critical
> Fix For: 0.17.0
>
> Attachments: PIG-3417.patch, TestSkewJoinWithTuples.java
>
>
> I've attached a test case that fails, but should pass. The test case groups 
> two relations separately, then full-outer joins them on the grouped columns. 
> The test case passes if "using 'skewed'" is removed.





[jira] [Commented] (PIG-3417) Skewed Join On Tuple Column Kills Job

2016-12-17 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15757043#comment-15757043
 ] 

Nandor Kollar commented on PIG-3417:


Thanks Rohini for your comments, I'll upload a new patch soon.

> Skewed Join On Tuple Column Kills Job 
> --
>
> Key: PIG-3417
> URL: https://issues.apache.org/jira/browse/PIG-3417
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.11.1
>Reporter: Nick White
>Assignee: Nandor Kollar
>Priority: Critical
> Fix For: 0.17.0
>
> Attachments: PIG-3417.patch, TestSkewJoinWithTuples.java
>
>
> I've attached a test case that fails, but should pass. The test case groups 
> two relations separately, then full-outer joins them on the grouped columns. 
> The test case passes if "using 'skewed'" is removed.





[jira] [Updated] (PIG-4930) Skewed Join Breaks On Empty Sampled Input When Key is From Map

2016-12-17 Thread Nandor Kollar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PIG-4930:
---
Attachment: PIG-4930.patch

> Skewed Join Breaks On Empty Sampled Input When Key is From Map
> --
>
> Key: PIG-4930
> URL: https://issues.apache.org/jira/browse/PIG-4930
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.9.2, 0.16.0
>Reporter: William Butler
>Assignee: Daniel Dai
> Fix For: 0.17.0
>
> Attachments: PIG-4930.patch, empty_skew.diff
>
>
> When using a skewed join, if the left relation gets its key from a map and 
> said relation is empty, then the skewed join fails during the sampling phase 
> with:
> org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception 
> while executing (Name: Local Rearrange[tuple]{tuple}(false) - scope-27 
> Operator Key: scope-27): 
> org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception 
> while executing [POMapLookUp (Name: POMapLookUp[bytearray] - scope-14 
> Operator Key: scope-14) children: null at [null[3,17]]]: 
> java.lang.ClassCastException: java.lang.String cannot be cast to java.util.Map
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:314)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNextTuple(POLocalRearrange.java:287)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:280)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:275)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:65)
>   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>   at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
>   at org.apache.hadoop.mapred.Child.main(Child.java:249)
> I think the problem is more fundamental to Pig's skewed join implementation 
> than maps, but it is easily demonstrable with them. I have written an 
> additional test in TestSkewedJoin that demonstrates the problem. The join 
> works correctly if we remove "using 'skewed'"





[jira] [Updated] (PIG-3417) Skewed Join On Tuple Column Kills Job

2016-12-12 Thread Nandor Kollar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PIG-3417:
---
Attachment: PIG-3417.patch

> Skewed Join On Tuple Column Kills Job 
> --
>
> Key: PIG-3417
> URL: https://issues.apache.org/jira/browse/PIG-3417
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.11.1
>Reporter: Nick White
>Priority: Critical
> Attachments: PIG-3417.patch, TestSkewJoinWithTuples.java
>
>
> I've attached a test case that fails, but should pass. The test case groups 
> two relations separately, then full-outer joins them on the grouped columns. 
> The test case passes if "using 'skewed'" is removed.





[jira] [Assigned] (PIG-3417) Skewed Join On Tuple Column Kills Job

2016-12-12 Thread Nandor Kollar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar reassigned PIG-3417:
--

Assignee: Nandor Kollar

> Skewed Join On Tuple Column Kills Job 
> --
>
> Key: PIG-3417
> URL: https://issues.apache.org/jira/browse/PIG-3417
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.11.1
>Reporter: Nick White
>Assignee: Nandor Kollar
>Priority: Critical
> Attachments: PIG-3417.patch, TestSkewJoinWithTuples.java
>
>
> I've attached a test case that fails, but should pass. The test case groups 
> two relations separately, then full-outer joins them on the grouped columns. 
> The test case passes if "using 'skewed'" is removed.





[jira] [Updated] (PIG-5104) Union_15 e2e test failing on Spark

2017-01-13 Thread Nandor Kollar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PIG-5104:
---
Description: 
While working on PIG-4891 I noticed that the Union_15 e2e test is failing in 
Spark mode with this exception:

Caused by: java.lang.RuntimeException: 
org.apache.pig.backend.executionengine.ExecException: ERROR 2078: Caught error 
from UDF: org.apache.pig.impl.builtin.GFCross [Unable to get parallelism hint 
from job conf]
at 
org.apache.pig.backend.hadoop.executionengine.spark.converter.OutputConsumerIterator.readNext(OutputConsumerIterator.java:89)
at 
org.apache.pig.backend.hadoop.executionengine.spark.converter.OutputConsumerIterator.hasNext(OutputConsumerIterator.java:96)
at 
scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2078: 
Caught error from UDF: org.apache.pig.impl.builtin.GFCross [Unable to get 
parallelism hint from job conf]
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:358)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNextDataBag(POUserFunc.java:374)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:335)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:404)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNextTuple(POForEach.java:321)
at 
org.apache.pig.backend.hadoop.executionengine.spark.converter.ForEachConverter$ForEachFunction$1$1.getNextResult(ForEachConverter.java:87)
at 
org.apache.pig.backend.hadoop.executionengine.spark.converter.OutputConsumerIterator.readNext(OutputConsumerIterator.java:69)
... 11 more
Caused by: java.io.IOException: Unable to get parallelism hint from job conf
at org.apache.pig.impl.builtin.GFCross.exec(GFCross.java:66)
at org.apache.pig.impl.builtin.GFCross.exec(GFCross.java:37)
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:330)

  was:
While working on PIG-4891 I noticed that Union_15 e2e test is failing on Spark 
mode with this exception:
Caused by: java.lang.RuntimeException: 
org.apache.pig.backend.executionengine.ExecException: ERROR 2078: Caught error 
from UDF: org.apache.pig.impl.builtin.GFCross [Unable to get parallelism hint 
from job conf]
at 
org.apache.pig.backend.hadoop.executionengine.spark.converter.OutputConsumerIterator.readNext(OutputConsumerIterator.java:89)
at 
org.apache.pig.backend.hadoop.executionengine.spark.converter.OutputConsumerIterator.hasNext(OutputConsumerIterator.java:96)
at 
scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:42)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:369)
at 
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2078: 
Caught error from UDF: org.apache.pig.impl.builtin.GFCross [Unable to get 
parallelism hint from job conf]
at 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:358)
at 

[jira] [Updated] (PIG-3891) FileBasedOutputSizeReader does not calculate size of files in sub-directories

2016-12-02 Thread Nandor Kollar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PIG-3891:
---
Attachment: PIG-3891-5.patch

> FileBasedOutputSizeReader does not calculate size of files in sub-directories
> -
>
> Key: PIG-3891
> URL: https://issues.apache.org/jira/browse/PIG-3891
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.12.0
>Reporter: Rohini Palaniswamy
>Assignee: Nandor Kollar
> Attachments: PIG-3891-1.patch, PIG-3891-2.patch, PIG-3891-3.patch, 
> PIG-3891-4.patch, PIG-3891-5.patch
>
>
> FileBasedOutputSizeReader only includes files in the top level output 
> directory. So if files are stored under subdirectories (For eg: 
> MultiStorage), it does not have the bytes written correctly. 
> 0.11 shows the correct number of total bytes written and this is a 
> regression. A quick look at the code shows that the 
> JobStats.addOneOutputStats() in 0.11 also does not recursively iterate and 
> code is same as  FileBasedOutputSizeReader. Need to investigate where the 
> correct value comes from in 0.11 and fix it in 0.12.1/0.13.





[jira] [Commented] (PIG-3891) FileBasedOutputSizeReader does not calculate size of files in sub-directories

2016-12-02 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15715158#comment-15715158
 ] 

Nandor Kollar commented on PIG-3891:


Thanks [~rohini] for your comments, I made the required adjustments:
- Reverted CHANGES.txt
- Reverted the changes on ExecType and TezMiniCluster, though I don't think my 
patch changed the visibility of the exec types, since those were declared 
inside an interface.
- Changed the assert message.
- Rewrote the assert as you suggested; indeed, it looks better without the 
branch. The test passes in both Tez and MR mode.

Attached new version: PIG-3891-5.patch. Could you please have a look at it?

> FileBasedOutputSizeReader does not calculate size of files in sub-directories
> -
>
> Key: PIG-3891
> URL: https://issues.apache.org/jira/browse/PIG-3891
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.12.0
>Reporter: Rohini Palaniswamy
>Assignee: Nandor Kollar
> Attachments: PIG-3891-1.patch, PIG-3891-2.patch, PIG-3891-3.patch, 
> PIG-3891-4.patch
>
>
> FileBasedOutputSizeReader only includes files in the top level output 
> directory. So if files are stored under subdirectories (For eg: 
> MultiStorage), it does not have the bytes written correctly. 
> 0.11 shows the correct number of total bytes written and this is a 
> regression. A quick look at the code shows that the 
> JobStats.addOneOutputStats() in 0.11 also does not recursively iterate and 
> code is same as  FileBasedOutputSizeReader. Need to investigate where the 
> correct value comes from in 0.11 and fix it in 0.12.1/0.13.
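The gist of the fix under review — recursing into subdirectories when summing output size — can be sketched on a local filesystem; `total_output_size` below is an illustrative stand-in for the Hadoop FileSystem traversal the actual reader needs, not Pig's code:

```python
import os
import tempfile

def total_output_size(path):
    """Sum file sizes recursively; a non-recursive reader would miss
    files stored in subdirectories (e.g. MultiStorage output)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            total += os.path.getsize(os.path.join(dirpath, name))
    return total

# tiny demo: one file at the top level, one in a subdirectory
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "part"))
with open(os.path.join(root, "a.txt"), "wb") as f:
    f.write(b"12345")
with open(os.path.join(root, "part", "b.txt"), "wb") as f:
    f.write(b"123")
size = total_output_size(root)
```

A top-level-only scan would report 5 bytes here; the recursive sum reports 8.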





[jira] [Updated] (PIG-3891) FileBasedOutputSizeReader does not calculate size of files in sub-directories

2016-12-02 Thread Nandor Kollar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PIG-3891:
---
Attachment: PIG-3891-5.patch

> FileBasedOutputSizeReader does not calculate size of files in sub-directories
> -
>
> Key: PIG-3891
> URL: https://issues.apache.org/jira/browse/PIG-3891
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.12.0
>Reporter: Rohini Palaniswamy
>Assignee: Nandor Kollar
> Attachments: PIG-3891-1.patch, PIG-3891-2.patch, PIG-3891-3.patch, 
> PIG-3891-4.patch, PIG-3891-5.patch
>
>
> FileBasedOutputSizeReader only includes files in the top level output 
> directory. So if files are stored under subdirectories (For eg: 
> MultiStorage), it does not have the bytes written correctly. 
> 0.11 shows the correct number of total bytes written and this is a 
> regression. A quick look at the code shows that the 
> JobStats.addOneOutputStats() in 0.11 also does not recursively iterate and 
> code is same as  FileBasedOutputSizeReader. Need to investigate where the 
> correct value comes from in 0.11 and fix it in 0.12.1/0.13.





[jira] [Updated] (PIG-3891) FileBasedOutputSizeReader does not calculate size of files in sub-directories

2016-12-02 Thread Nandor Kollar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PIG-3891:
---
Attachment: (was: PIG-3891-5.patch)

> FileBasedOutputSizeReader does not calculate size of files in sub-directories
> -
>
> Key: PIG-3891
> URL: https://issues.apache.org/jira/browse/PIG-3891
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.12.0
>Reporter: Rohini Palaniswamy
>Assignee: Nandor Kollar
> Attachments: PIG-3891-1.patch, PIG-3891-2.patch, PIG-3891-3.patch, 
> PIG-3891-4.patch, PIG-3891-5.patch
>
>
> FileBasedOutputSizeReader only includes files in the top level output 
> directory. So if files are stored under subdirectories (For eg: 
> MultiStorage), it does not have the bytes written correctly. 
> 0.11 shows the correct number of total bytes written and this is a 
> regression. A quick look at the code shows that the 
> JobStats.addOneOutputStats() in 0.11 also does not recursively iterate and 
> code is same as  FileBasedOutputSizeReader. Need to investigate where the 
> correct value comes from in 0.11 and fix it in 0.12.1/0.13.





[jira] [Commented] (PIG-3891) FileBasedOutputSizeReader does not calculate size of files in sub-directories

2016-12-02 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15715771#comment-15715771
 ] 

Nandor Kollar commented on PIG-3891:


Thanks Rohini for committing and reviewing my patch!

> FileBasedOutputSizeReader does not calculate size of files in sub-directories
> -
>
> Key: PIG-3891
> URL: https://issues.apache.org/jira/browse/PIG-3891
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.12.0
>Reporter: Rohini Palaniswamy
>Assignee: Nandor Kollar
> Attachments: PIG-3891-1.patch, PIG-3891-2.patch, PIG-3891-3.patch, 
> PIG-3891-4.patch, PIG-3891-5.patch
>
>
> FileBasedOutputSizeReader only includes files in the top level output 
> directory. So if files are stored under subdirectories (For eg: 
> MultiStorage), it does not have the bytes written correctly. 
> 0.11 shows the correct number of total bytes written and this is a 
> regression. A quick look at the code shows that the 
> JobStats.addOneOutputStats() in 0.11 also does not recursively iterate and 
> code is same as  FileBasedOutputSizeReader. Need to investigate where the 
> correct value comes from in 0.11 and fix it in 0.12.1/0.13.





[jira] [Commented] (PIG-4858) Implement Skewed join for spark engine

2017-01-06 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15804056#comment-15804056
 ] 

Nandor Kollar commented on PIG-4858:


Yes, and with this change we also have to make sure that the e2e test for this 
(the one annotated with 'skew join with tuple key') passes with the Spark exec 
type.

> Implement Skewed join for spark engine
> --
>
> Key: PIG-4858
> URL: https://issues.apache.org/jira/browse/PIG-4858
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: Xianda Ke
> Fix For: spark-branch
>
> Attachments: PIG-4858.patch, PIG-4858_2.patch, PIG-4858_3.patch, 
> SkewedJoinInSparkMode.pdf
>
>
> Now we use regular join to replace skewed join. Need to implement it.





[jira] [Comment Edited] (PIG-4930) Skewed Join Breaks On Empty Sampled Input When Key is From Map

2017-01-06 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15804038#comment-15804038
 ] 

Nandor Kollar edited comment on PIG-4930 at 1/6/17 1:20 PM:


Sure, I can, could you please assign the Jira to me?


was (Author: nkollar):
Sure, I can, could you please assign the Jira to?

> Skewed Join Breaks On Empty Sampled Input When Key is From Map
> --
>
> Key: PIG-4930
> URL: https://issues.apache.org/jira/browse/PIG-4930
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.9.2, 0.16.0
>Reporter: William Butler
>Assignee: Nandor Kollar
> Fix For: 0.17.0
>
> Attachments: PIG-4930.patch, empty_skew.diff
>
>
> When using a skewed join, if the left relation gets its key from a map and 
> said relation is empty, then the skewed join fails during the sampling phase 
> with:
> org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception 
> while executing (Name: Local Rearrange[tuple]{tuple}(false) - scope-27 
> Operator Key: scope-27): 
> org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception 
> while executing [POMapLookUp (Name: POMapLookUp[bytearray] - scope-14 
> Operator Key: scope-14) children: null at [null[3,17]]]: 
> java.lang.ClassCastException: java.lang.String cannot be cast to java.util.Map
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:314)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNextTuple(POLocalRearrange.java:287)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:280)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:275)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:65)
>   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>   at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
>   at org.apache.hadoop.mapred.Child.main(Child.java:249)
> I think the problem is more fundamental to Pig's skewed join implementation 
> than maps, but it is easily demonstrable with them. I have written an 
> additional test in TestSkewedJoin that demonstrates the problem. The join 
> works correctly if we remove "using 'skewed'"





[jira] [Assigned] (PIG-4891) Implement FR join by broadcasting small rdd not making more copys of data

2017-01-05 Thread Nandor Kollar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar reassigned PIG-4891:
--

Assignee: Nandor Kollar

> Implement FR join by broadcasting small rdd not making more copys of data
> -
>
> Key: PIG-4891
> URL: https://issues.apache.org/jira/browse/PIG-4891
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: Nandor Kollar
> Fix For: spark-branch
>
>
> In the current implementation of FRJoin (PIG-4771), we just set the 
> replication factor of the data to 10 to make data access more efficient, 
> because the current FRJoin algorithms can be reused this way. We need to 
> figure out how to implement FRJoin by broadcasting the small RDD in the 
> current code base, if we find that performance can be improved a lot by 
> using a broadcast RDD.
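As a rough illustration of the broadcast idea, the small side is shipped to every task once and joined by hash lookup; plain dicts below stand in for a Spark broadcast variable, and none of the names are Pig's actual implementation:

```python
# Toy illustration of a broadcast hash join: build a hash table from the
# small side once (what the broadcast variable would hold), then stream
# the big side through it. Plain dicts stand in for Spark constructs.

def broadcast_hash_join(big_rows, small_rows, key):
    small_by_key = {}
    for row in small_rows:
        small_by_key.setdefault(row[key], []).append(row)
    joined = []
    for row in big_rows:
        for match in small_by_key.get(row[key], []):
            merged = dict(row)
            merged.update({k: v for k, v in match.items() if k != key})
            joined.append(merged)
    return joined

big = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
small = [{"id": 1, "city": "x"}]
result = broadcast_hash_join(big, small, "id")
```

Shipping the hash table once per executor is what replaces the replication-factor trick.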





[jira] [Updated] (PIG-3417) Skewed Join On Tuple Column Kills Job

2016-12-18 Thread Nandor Kollar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-3417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PIG-3417:
---
Attachment: PIG-3417_2.patch

> Skewed Join On Tuple Column Kills Job 
> --
>
> Key: PIG-3417
> URL: https://issues.apache.org/jira/browse/PIG-3417
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.11.1
>Reporter: Nick White
>Assignee: Nandor Kollar
>Priority: Critical
> Fix For: 0.17.0
>
> Attachments: PIG-3417.patch, PIG-3417_2.patch, 
> TestSkewJoinWithTuples.java
>
>
> I've attached a test case that fails, but should pass. The test case groups 
> two relations separately, then full-outer joins them on the grouped columns. 
> The test case passes if "using 'skewed'" is removed.





[jira] [Commented] (PIG-3417) Skewed Join On Tuple Column Kills Job

2016-12-18 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15759517#comment-15759517
 ] 

Nandor Kollar commented on PIG-3417:


Attached PIG-3417_2.patch with these changes:
- the key tuple is not flattened, but the statistics are
- golden files changed accordingly; the TestTezCompiler tests passed with 
these changes, and no test failed in TestMRCompiler
- added an assert on the output to testSkewJoinWithTuples and added an e2e 
test too
We might not even need testSkewJoinWithTuples, since the e2e tests now cover 
this case. [~rohini] could you please have a look at the second version of my 
patch? 

> Skewed Join On Tuple Column Kills Job 
> --
>
> Key: PIG-3417
> URL: https://issues.apache.org/jira/browse/PIG-3417
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.11.1
>Reporter: Nick White
>Assignee: Nandor Kollar
>Priority: Critical
> Fix For: 0.17.0
>
> Attachments: PIG-3417.patch, PIG-3417_2.patch, 
> TestSkewJoinWithTuples.java
>
>
> I've attached a test case that fails, but should pass. The test case groups 
> two relations separately, then full-outer joins them on the grouped columns. 
> The test case passes if "using 'skewed'" is removed.





[jira] [Comment Edited] (PIG-3417) Skewed Join On Tuple Column Kills Job

2016-12-18 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15759517#comment-15759517
 ] 

Nandor Kollar edited comment on PIG-3417 at 12/18/16 9:28 PM:
--

Attached PIG-3417_2.patch with these changes:
- the key tuple is not flattened, but the statistics are
- golden files changed accordingly; the TestTezCompiler tests passed with 
these changes, and no test failed in TestMRCompiler
- added an assert on the output to testSkewJoinWithTuples and added an e2e 
test too

We might not even need testSkewJoinWithTuples, since the e2e tests now cover 
this case. [~rohini] could you please have a look at the second version of my 
patch? 


was (Author: nkollar):
Attached PIG-3417_2.patch with these changes:
- key tuple is not flattened, but statistics are
- golden files changed accordingly. TestTezCompiler test passed with these 
changes, and no test failed in TestMRCompiler
- added an assert for the output to testSkewJoinWithTuples and added an e2e 
test too
We might not even need to have testSkewJoinWithTuples, since the e2e tests now 
cover this case. [~rohini] could you please have have a look at the second 
version of my patch? 

> Skewed Join On Tuple Column Kills Job 
> --
>
> Key: PIG-3417
> URL: https://issues.apache.org/jira/browse/PIG-3417
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.11.1
>Reporter: Nick White
>Assignee: Nandor Kollar
>Priority: Critical
> Fix For: 0.17.0
>
> Attachments: PIG-3417.patch, PIG-3417_2.patch, 
> TestSkewJoinWithTuples.java
>
>
> I've attached a test case that fails, but should pass. The test case groups 
> two relations separately, then full-outer joins them on the grouped columns. 
> The test case passes if "using 'skewed'" is removed.





[jira] [Commented] (PIG-4930) Skewed Join Breaks On Empty Sampled Input When Key is From Map

2016-12-18 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-4930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15759565#comment-15759565
 ] 

Nandor Kollar commented on PIG-4930:


When there is no data, PoissonSampleLoader still emits one tuple: 
(NUMROWS_TUPLE_MARKER, 0), and then later, during physical plan processing, the 
cast fails (in this case it fails because the join key is expected to be a map, 
but it would fail for a tuple key too). I think PoissonSampleLoader shouldn't 
emit any tuple if the dataset is empty. [~daijy] what do you think? Should I 
add an e2e test to cover the empty dataset case, or is the test William 
provided enough?
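A minimal sketch of the guard proposed above. The names (`sample`, the marker string, the list-of-rows representation) are hypothetical stand-ins for illustration, not the actual PoissonSampleLoader internals:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: emit the NUMROWS marker tuple only when the sampled
// input actually contained rows, so an empty dataset yields no tuples at all
// and nothing reaches the downstream cast that throws ClassCastException.
public class MarkerSketch {
    static List<Object[]> sample(List<Object[]> rows) {
        List<Object[]> out = new ArrayList<>(rows);
        if (!rows.isEmpty()) {
            // the marker carries the row count used for skew estimation
            out.add(new Object[] {"NUMROWS_TUPLE_MARKER", rows.size()});
        }
        return out; // empty input -> empty output, no marker
    }

    public static void main(String[] args) {
        System.out.println(sample(new ArrayList<>()).size()); // 0
    }
}
```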

> Skewed Join Breaks On Empty Sampled Input When Key is From Map
> --
>
> Key: PIG-4930
> URL: https://issues.apache.org/jira/browse/PIG-4930
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.9.2, 0.16.0
>Reporter: William Butler
>Assignee: Nandor Kollar
> Fix For: 0.17.0
>
> Attachments: PIG-4930.patch, empty_skew.diff
>
>
> When using a skewed join, if the left relation gets its key from a map and 
> said relation is empty, then the skewed join fails during the sampling phase 
> with:
> org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception 
> while executing (Name: Local Rearrange[tuple]{tuple}(false) - scope-27 
> Operator Key: scope-27): 
> org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception 
> while executing [POMapLookUp (Name: POMapLookUp[bytearray] - scope-14 
> Operator Key: scope-14) children: null at [null[3,17]]]: 
> java.lang.ClassCastException: java.lang.String cannot be cast to java.util.Map
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:314)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNextTuple(POLocalRearrange.java:287)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:280)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:275)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:65)
>   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>   at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
>   at org.apache.hadoop.mapred.Child.main(Child.java:249)
> I think the problem is more fundamental to Pig's skewed join implementation 
> than maps, but it is easily demonstrable with them. I have written an 
> additional test in TestSkewedJoin that demonstrates the problem. The join 
> works correctly if we remove "using 'skewed'"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (PIG-4930) Skewed Join Breaks On Empty Sampled Input When Key is From Map

2016-12-18 Thread Nandor Kollar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar reassigned PIG-4930:
--

Assignee: Nandor Kollar  (was: Daniel Dai)

> Skewed Join Breaks On Empty Sampled Input When Key is From Map
> --
>
> Key: PIG-4930
> URL: https://issues.apache.org/jira/browse/PIG-4930
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.9.2, 0.16.0
>Reporter: William Butler
>Assignee: Nandor Kollar
> Fix For: 0.17.0
>
> Attachments: PIG-4930.patch, empty_skew.diff
>
>
> When using a skewed join, if the left relation gets its key from a map and 
> said relation is empty, then the skewed join fails during the sampling phase 
> with:
> org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception 
> while executing (Name: Local Rearrange[tuple]{tuple}(false) - scope-27 
> Operator Key: scope-27): 
> org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception 
> while executing [POMapLookUp (Name: POMapLookUp[bytearray] - scope-14 
> Operator Key: scope-14) children: null at [null[3,17]]]: 
> java.lang.ClassCastException: java.lang.String cannot be cast to java.util.Map
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:314)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNextTuple(POLocalRearrange.java:287)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:280)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:275)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:65)
>   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>   at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
>   at org.apache.hadoop.mapred.Child.main(Child.java:249)
> I think the problem is more fundamental to Pig's skewed join implementation 
> than maps, but it is easily demonstrable with them. I have written an 
> additional test in TestSkewedJoin that demonstrates the problem. The join 
> works correctly if we remove "using 'skewed'"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (PIG-4930) Skewed Join Breaks On Empty Sampled Input When Key is From Map

2016-12-18 Thread Nandor Kollar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-4930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PIG-4930:
---
Status: Patch Available  (was: Open)

> Skewed Join Breaks On Empty Sampled Input When Key is From Map
> --
>
> Key: PIG-4930
> URL: https://issues.apache.org/jira/browse/PIG-4930
> Project: Pig
>  Issue Type: Bug
>Affects Versions: 0.16.0, 0.9.2
>Reporter: William Butler
>Assignee: Nandor Kollar
> Fix For: 0.17.0
>
> Attachments: PIG-4930.patch, empty_skew.diff
>
>
> When using a skewed join, if the left relation gets its key from a map and 
> said relation is empty, then the skewed join fails during the sampling phase 
> with:
> org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception 
> while executing (Name: Local Rearrange[tuple]{tuple}(false) - scope-27 
> Operator Key: scope-27): 
> org.apache.pig.backend.executionengine.ExecException: ERROR 0: Exception 
> while executing [POMapLookUp (Name: POMapLookUp[bytearray] - scope-14 
> Operator Key: scope-14) children: null at [null[3,17]]]: 
> java.lang.ClassCastException: java.lang.String cannot be cast to java.util.Map
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:314)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNextTuple(POLocalRearrange.java:287)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:280)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:275)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:65)
>   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>   at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
>   at org.apache.hadoop.mapred.Child.main(Child.java:249)
> I think the problem is more fundamental to Pig's skewed join implementation 
> than maps, but it is easily demonstrable with them. I have written an 
> additional test in TestSkewedJoin that demonstrates the problem. The join 
> works correctly if we remove "using 'skewed'"



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-3417) Skewed Join On Tuple Column Kills Job

2016-12-19 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15762054#comment-15762054
 ] 

Nandor Kollar commented on PIG-3417:


Thanks Rohini, I agree, let's leave the unit test out and commit just the e2e 
test. Also, PIG-5069 seems to be the same issue, at least based on the 
information provided: it fails with a ClassCastException, the script does a 
skewed join on tuple keys, and Carlos said that without the skewed join it was 
fine. Given this, do you think we can close that item as a duplicate of this 
Jira?

> Skewed Join On Tuple Column Kills Job 
> --
>
> Key: PIG-3417
> URL: https://issues.apache.org/jira/browse/PIG-3417
> Project: Pig
>  Issue Type: Bug
>  Components: impl
>Affects Versions: 0.11.1
>Reporter: Nick White
>Assignee: Nandor Kollar
>Priority: Critical
> Fix For: 0.17.0
>
> Attachments: PIG-3417.patch, PIG-3417_2.patch, 
> TestSkewJoinWithTuples.java
>
>
> I've attached a test case that fails, but should pass. The test case groups 
> two relations separately, then full-outer joins them on the grouped columns. 
> The test case passes if "using 'skewed'" is removed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (PIG-5157) Upgrade to Spark 2.0

2017-03-23 Thread Nandor Kollar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar reassigned PIG-5157:
--

Assignee: Nandor Kollar

> Upgrade to Spark 2.0
> 
>
> Key: PIG-5157
> URL: https://issues.apache.org/jira/browse/PIG-5157
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: spark-branch
>
>
> Upgrade to Spark 2.0 (or latest)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (PIG-5167) Limit_4 is failing with spark exec type

2017-03-24 Thread Nandor Kollar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PIG-5167:
---
Attachment: (was: PIG-5167_3.patch)

> Limit_4 is failing with spark exec type
> ---
>
> Key: PIG-5167
> URL: https://issues.apache.org/jira/browse/PIG-5167
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: spark-branch
>
> Attachments: PIG-5167_2.patch, PIG-5167.patch
>
>
> results are different:
> {code}
> diff <(head -n 5 Limit_4.out/out_sorted) <(head -n 5 
> Limit_4_benchmark.out/out_sorted)
> 1,5c1,5
> < 50  3.00
> < 74  2.22
> < alice carson66  2.42
> < alice quirinius 71  0.03
> < alice van buren 28  2.50
> ---
> > bob allen   0.28
> > bob allen   22  0.92
> > bob allen   25  2.54
> > bob allen   26  2.35
> > bob allen   27  2.17
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (PIG-5167) Limit_4 is failing with spark exec type

2017-03-24 Thread Nandor Kollar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PIG-5167:
---
Attachment: PIG-5167_3.patch

> Limit_4 is failing with spark exec type
> ---
>
> Key: PIG-5167
> URL: https://issues.apache.org/jira/browse/PIG-5167
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: spark-branch
>
> Attachments: PIG-5167_2.patch, PIG-5167_3.patch, PIG-5167.patch
>
>
> results are different:
> {code}
> diff <(head -n 5 Limit_4.out/out_sorted) <(head -n 5 
> Limit_4_benchmark.out/out_sorted)
> 1,5c1,5
> < 50  3.00
> < 74  2.22
> < alice carson66  2.42
> < alice quirinius 71  0.03
> < alice van buren 28  2.50
> ---
> > bob allen   0.28
> > bob allen   22  0.92
> > bob allen   25  2.54
> > bob allen   26  2.35
> > bob allen   27  2.17
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (PIG-5167) Limit_4 is failing with spark exec type

2017-03-24 Thread Nandor Kollar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PIG-5167:
---
Status: Patch Available  (was: In Progress)

> Limit_4 is failing with spark exec type
> ---
>
> Key: PIG-5167
> URL: https://issues.apache.org/jira/browse/PIG-5167
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: spark-branch
>
> Attachments: PIG-5167_2.patch, PIG-5167_3.patch, PIG-5167.patch
>
>
> results are different:
> {code}
> diff <(head -n 5 Limit_4.out/out_sorted) <(head -n 5 
> Limit_4_benchmark.out/out_sorted)
> 1,5c1,5
> < 50  3.00
> < 74  2.22
> < alice carson66  2.42
> < alice quirinius 71  0.03
> < alice van buren 28  2.50
> ---
> > bob allen   0.28
> > bob allen   22  0.92
> > bob allen   25  2.54
> > bob allen   26  2.35
> > bob allen   27  2.17
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (PIG-5167) Limit_4 is failing with spark exec type

2017-03-24 Thread Nandor Kollar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PIG-5167:
---
Attachment: PIG-5167_3.patch

> Limit_4 is failing with spark exec type
> ---
>
> Key: PIG-5167
> URL: https://issues.apache.org/jira/browse/PIG-5167
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: spark-branch
>
> Attachments: PIG-5167_2.patch, PIG-5167_3.patch, PIG-5167.patch
>
>
> results are different:
> {code}
> diff <(head -n 5 Limit_4.out/out_sorted) <(head -n 5 
> Limit_4_benchmark.out/out_sorted)
> 1,5c1,5
> < 50  3.00
> < 74  2.22
> < alice carson66  2.42
> < alice quirinius 71  0.03
> < alice van buren 28  2.50
> ---
> > bob allen   0.28
> > bob allen   22  0.92
> > bob allen   25  2.54
> > bob allen   26  2.35
> > bob allen   27  2.17
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5134) Fix TestAvroStorage unit test in Spark mode

2017-03-24 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15940712#comment-15940712
 ] 

Nandor Kollar commented on PIG-5134:


[~kellyzly] an update: I managed to solve this without using Kryo, but I don't 
really like the solution I came up with; using Kryo would be a better choice, I 
think. In my solution, I implemented the readObject and writeObject methods so 
that they read/write the Avro schema as well as the data from/to the 
InputStream/OutputStream. This is done only for AvroTupleWrapper, but I'm 
afraid we'll have to implement the same logic for the other Avro wrapper 
classes too. I noticed that a similar issue related to Spark and Avro 
compatibility was already resolved: AVRO-1502. It seems that it was only 
fixed for SpecificRecords, not for GenericRecords, which we use in Pig. 
[~rohini] do you have a recommendation on which option we should follow?
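The readObject/writeObject pattern described above can be sketched as follows. This is a self-contained illustration only: a hypothetical `Datum` class stands in for Avro's non-serializable GenericData.Record, and a plain string plays the role of the schema-plus-encoded-datum payload that the real AvroTupleWrapper would write:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Sketch of custom Java serialization for a wrapper around an object that
// does not implement Serializable itself (the AvroTupleWrapper situation).
public class WrapperSketch implements Serializable {
    // stand-in for Avro's GenericData.Record, which is not Serializable
    static final class Datum {
        final String value;
        Datum(String v) { value = v; }
    }

    // transient: default serialization must skip the wrapped object
    private transient Datum datum;

    public WrapperSketch(String value) { this.datum = new Datum(value); }
    public String value() { return datum.value; }

    private void writeObject(ObjectOutputStream out) throws IOException {
        out.defaultWriteObject();
        out.writeUTF(datum.value); // hand-encode the wrapped object's state
    }

    private void readObject(ObjectInputStream in)
            throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        datum = new Datum(in.readUTF()); // rebuild it during deserialization
    }

    // serialize and deserialize through an in-memory byte stream
    public static WrapperSketch roundTrip(WrapperSketch w) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(w);
        }
        try (ObjectInputStream ois = new ObjectInputStream(
                new ByteArrayInputStream(bos.toByteArray()))) {
            return (WrapperSketch) ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip(new WrapperSketch("hat")).value()); // hat
    }
}
```

In the real wrapper, writeObject would presumably serialize the schema as JSON and the record via an Avro datum writer, and readObject would reverse both steps.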

> Fix TestAvroStorage unit test in Spark mode
> ---
>
> Key: PIG-5134
> URL: https://issues.apache.org/jira/browse/PIG-5134
> Project: Pig
>  Issue Type: Bug
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: Nandor Kollar
> Fix For: spark-branch
>
> Attachments: PIG-5134.patch
>
>
> It seems that test fails, because Avro GenericData#Record doesn't implement 
> Serializable interface:
> {code}
> 2017-02-23 09:14:41,887 ERROR [main] spark.JobGraphBuilder 
> (JobGraphBuilder.java:sparkOperToRDD(183)) - throw exception in 
> sparkOperToRDD: 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0 
> in stage 9.0 (TID 9) had a not serializable result: 
> org.apache.avro.generic.GenericData$Record
> Serialization stack:
>   - object not serializable (class: 
> org.apache.avro.generic.GenericData$Record, value: {"key": "stuff in closet", 
> "value1": {"thing": "hat", "count": 7}, "value2": {"thing": "coat", "count": 
> 2}})
>   - field (class: org.apache.pig.impl.util.avro.AvroTupleWrapper, name: 
> avroObject, type: interface org.apache.avro.generic.IndexedRecord)
>   - object (class org.apache.pig.impl.util.avro.AvroTupleWrapper, 
> org.apache.pig.impl.util.avro.AvroTupleWrapper@3d3a58c1)
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
> {code}
> The failing tests is a new test introduced with merging trunk to spark 
> branch, that's why we didn't see this error before.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5163) MultiQuery_Streaming_1 is failing with spark exec type

2017-03-28 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944842#comment-15944842
 ] 

Nandor Kollar commented on PIG-5163:


It looks like this is an issue with multi-query optimization. Before multiquery 
optimization, the spark plan looks like this:
{code}
Spark node scope-30
Store(hdfs://localhost:50373/tmp/temp274219070/tmp-1212075796:org.apache.pig.impl.io.InterStorage)
 - scope-31
|
|---B: POStream[perl -ne 'print $_;' 
(stdin-org.apache.pig.builtin.PigStreaming/stdout-org.apache.pig.builtin.PigStreaming)]
 - scope-8
|
|---A: New For Each(false,false,false)[bag] - scope-7
|   |
|   Project[bytearray][0] - scope-1
|   |
|   Project[bytearray][1] - scope-3
|   |
|   Project[bytearray][2] - scope-5
|
|---A: 
Load(hdfs://localhost:50373/user/nkollar/studenttab10k:org.apache.pig.builtin.PigStorage)
 - scope-0

Spark node scope-33
B: 
Store(hdfs://localhost:50373/user/nkollar/out.1:org.apache.pig.builtin.PigStorage)
 - scope-12
|
|---Load(hdfs://localhost:50373/tmp/temp274219070/tmp-1212075796:org.apache.pig.impl.io.InterStorage)
 - scope-32

Spark node scope-38
D: 
Store(hdfs://localhost:50373/user/nkollar/out.2:org.apache.pig.builtin.PigStorage)
 - scope-29
|
|---D: New For Each(true,true)[tuple] - scope-28
|   |
|   Project[bag][1] - scope-26
|   |
|   Project[bag][2] - scope-27
|
|---D: Package(Packager)[tuple]{bytearray} - scope-21
|
|---D: Global Rearrange[tuple] - scope-20
|
|---D: Local Rearrange[tuple]{bytearray}(false) - scope-22
|   |   |
|   |   Project[bytearray][0] - scope-23
|   |
|   
|---Load(hdfs://localhost:50373/tmp/temp274219070/tmp-1212075796:org.apache.pig.impl.io.InterStorage)
 - scope-34
|
|---D: Local Rearrange[tuple]{bytearray}(false) - scope-24
|   |
|   Project[bytearray][0] - scope-25
|
|---C: POStream[perl -ne 'print $_;' 
(stdin-org.apache.pig.builtin.PigStreaming/stdout-org.apache.pig.builtin.PigStreaming)]
 - scope-17
|

|---Load(hdfs://localhost:50373/tmp/temp274219070/tmp-1212075796:org.apache.pig.impl.io.InterStorage)
 - scope-36
{code}
and after it:
{code}
Spark node scope-30
Split - scope-42
|   |
|   B: 
Store(hdfs://localhost:50373/user/nkollar/out.1:org.apache.pig.builtin.PigStorage)
 - scope-12
|   |
|   D: 
Store(hdfs://localhost:50373/user/nkollar/out.2:org.apache.pig.builtin.PigStorage)
 - scope-29
|   |
|   |---D: New For Each(true,true)[tuple] - scope-28
|   |   |
|   |   Project[bag][1] - scope-26
|   |   |
|   |   Project[bag][2] - scope-27
|   |
|   |---D: Package(Packager)[tuple]{bytearray} - scope-21
|   |
|   |---D: Global Rearrange[tuple] - scope-20
|   |
|   |---D: Local Rearrange[tuple]{bytearray}(false) - scope-22
|   |   |   |
|   |   |   Project[bytearray][0] - scope-23
|   |
|   |---D: Local Rearrange[tuple]{bytearray}(false) - scope-24
|   |   |
|   |   Project[bytearray][0] - scope-25
|   |
|   |---C: POStream[perl -ne 'print $_;' 
(stdin-org.apache.pig.builtin.PigStreaming/stdout-org.apache.pig.builtin.PigStreaming)]
 - scope-17
|
|---B: POStream[perl -ne 'print $_;' 
(stdin-org.apache.pig.builtin.PigStreaming/stdout-org.apache.pig.builtin.PigStreaming)]
 - scope-8
|
|---A: New For Each(false,false,false)[bag] - scope-7
|   |
|   Project[bytearray][0] - scope-1
|   |
|   Project[bytearray][1] - scope-3
|   |
|   Project[bytearray][2] - scope-5
|
|---A: 
Load(hdfs://localhost:50373/user/nkollar/studenttab10k:org.apache.pig.builtin.PigStorage)
 - scope-0
{code}
The local rearrange in scope-22 doesn't have an input. [~kellyzly] shouldn't 
scope-22 have gone away after multiquery optimization?

> MultiQuery_Streaming_1 is failing with spark exec type
> --
>
> Key: PIG-5163
> URL: https://issues.apache.org/jira/browse/PIG-5163
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: spark-branch
>
>
> 2nd output was empty, looks like pig on spark didn't generate any data.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5134) Fix TestAvroStorage unit test in Spark mode

2017-03-27 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15942974#comment-15942974
 ] 

Nandor Kollar commented on PIG-5134:


Actually this issue came up before: Kryo was downgraded to 2.21 on the spark 
branch (but we need 2.22 to support Hive UDFs and ORC): PIG-4693

> Fix TestAvroStorage unit test in Spark mode
> ---
>
> Key: PIG-5134
> URL: https://issues.apache.org/jira/browse/PIG-5134
> Project: Pig
>  Issue Type: Bug
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: Nandor Kollar
> Fix For: spark-branch
>
> Attachments: PIG-5134_2.patch, PIG-5134.patch
>
>
> It seems that test fails, because Avro GenericData#Record doesn't implement 
> Serializable interface:
> {code}
> 2017-02-23 09:14:41,887 ERROR [main] spark.JobGraphBuilder 
> (JobGraphBuilder.java:sparkOperToRDD(183)) - throw exception in 
> sparkOperToRDD: 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0 
> in stage 9.0 (TID 9) had a not serializable result: 
> org.apache.avro.generic.GenericData$Record
> Serialization stack:
>   - object not serializable (class: 
> org.apache.avro.generic.GenericData$Record, value: {"key": "stuff in closet", 
> "value1": {"thing": "hat", "count": 7}, "value2": {"thing": "coat", "count": 
> 2}})
>   - field (class: org.apache.pig.impl.util.avro.AvroTupleWrapper, name: 
> avroObject, type: interface org.apache.avro.generic.IndexedRecord)
>   - object (class org.apache.pig.impl.util.avro.AvroTupleWrapper, 
> org.apache.pig.impl.util.avro.AvroTupleWrapper@3d3a58c1)
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
> {code}
> The failing tests is a new test introduced with merging trunk to spark 
> branch, that's why we didn't see this error before.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5134) Fix TestAvroStorage unit test in Spark mode

2017-03-27 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15942812#comment-15942812
 ] 

Nandor Kollar commented on PIG-5134:


[~rohini] yes, unfortunately using Kryo would lead to issues: the Hive UDF test 
cases fail. As you said, Spark doesn't use a shaded Kryo, but Hive does, so the 
Hive UDTF test cases fail with:
{code}
Caused by: java.lang.ClassNotFoundException: 
com.esotericsoftware.shaded.org.objenesis.strategy.InstantiatorStrategy
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
{code}
I'll try to find a solution for this. Also, it seems Joda dates are not 
compatible with Kryo, but I think that's a simpler problem.

> Fix TestAvroStorage unit test in Spark mode
> ---
>
> Key: PIG-5134
> URL: https://issues.apache.org/jira/browse/PIG-5134
> Project: Pig
>  Issue Type: Bug
>  Components: spark
>Reporter: liyunzhang_intel
>Assignee: Nandor Kollar
> Fix For: spark-branch
>
> Attachments: PIG-5134_2.patch, PIG-5134.patch
>
>
> It seems that test fails, because Avro GenericData#Record doesn't implement 
> Serializable interface:
> {code}
> 2017-02-23 09:14:41,887 ERROR [main] spark.JobGraphBuilder 
> (JobGraphBuilder.java:sparkOperToRDD(183)) - throw exception in 
> sparkOperToRDD: 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0 
> in stage 9.0 (TID 9) had a not serializable result: 
> org.apache.avro.generic.GenericData$Record
> Serialization stack:
>   - object not serializable (class: 
> org.apache.avro.generic.GenericData$Record, value: {"key": "stuff in closet", 
> "value1": {"thing": "hat", "count": 7}, "value2": {"thing": "coat", "count": 
> 2}})
>   - field (class: org.apache.pig.impl.util.avro.AvroTupleWrapper, name: 
> avroObject, type: interface org.apache.avro.generic.IndexedRecord)
>   - object (class org.apache.pig.impl.util.avro.AvroTupleWrapper, 
> org.apache.pig.impl.util.avro.AvroTupleWrapper@3d3a58c1)
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
> {code}
> The failing tests is a new test introduced with merging trunk to spark 
> branch, that's why we didn't see this error before.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (PIG-5163) MultiQuery_Streaming_1 is failing with spark exec type

2017-03-29 Thread Nandor Kollar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PIG-5163:
---
Attachment: PIG-5163_1.patch

> MultiQuery_Streaming_1 is failing with spark exec type
> --
>
> Key: PIG-5163
> URL: https://issues.apache.org/jira/browse/PIG-5163
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-5163_1.patch
>
>
> 2nd output was empty, looks like pig on spark didn't generate any data.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (PIG-5163) MultiQuery_Streaming_1 is failing with spark exec type

2017-03-29 Thread Nandor Kollar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PIG-5163:
---
Attachment: (was: PIG-5163_1.patch)

> MultiQuery_Streaming_1 is failing with spark exec type
> --
>
> Key: PIG-5163
> URL: https://issues.apache.org/jira/browse/PIG-5163
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-5163_1.patch
>
>
> 2nd output was empty, looks like pig on spark didn't generate any data.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (PIG-5204) Implement illustrate in Spark

2017-03-29 Thread Nandor Kollar (JIRA)
Nandor Kollar created PIG-5204:
--

 Summary: Implement illustrate in Spark
 Key: PIG-5204
 URL: https://issues.apache.org/jira/browse/PIG-5204
 Project: Pig
  Issue Type: Improvement
  Components: spark
Affects Versions: spark-branch
Reporter: Nandor Kollar


Illustrate is not supported in Spark exec type right now.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (PIG-5204) Implement illustrate in Spark

2017-03-29 Thread Nandor Kollar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PIG-5204:
---
Priority: Minor  (was: Major)

> Implement illustrate in Spark
> -
>
> Key: PIG-5204
> URL: https://issues.apache.org/jira/browse/PIG-5204
> Project: Pig
>  Issue Type: Improvement
>  Components: spark
>Affects Versions: spark-branch
>Reporter: Nandor Kollar
>Priority: Minor
>
> Illustrate is not supported in Spark exec type right now.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5158) Several e2e tests are marked to run only in Tez or MR mode only

2017-03-29 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15947318#comment-15947318
 ] 

Nandor Kollar commented on PIG-5158:


Yeah, I think this was the original idea. [~knoguchi] can you confirm that 
this was the original idea in Tez, i.e. replacing Limit_5 with Limit_12? I 
think it's fine for Spark too.

> Several e2e tests are marked to run only in Tez or MR mode only
> ---
>
> Key: PIG-5158
> URL: https://issues.apache.org/jira/browse/PIG-5158
> Project: Pig
>  Issue Type: Task
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: spark-branch
>
> Attachments: PIG-5158_2.patch, PIG-5158_3.patch, PIG-5158_4.patch, 
> PIG-5158.patch
>
>
> While executing the e2e tests in spark mode, I noticed that several tests are 
> marked with 'execonly' => 'mapred,local' or 'execonly' => 'mapred,tez' Revise 
> these tests, add spark for those, where it makes sense to test on Spark.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (PIG-5163) MultiQuery_Streaming_1 is failing with spark exec type

2017-03-29 Thread Nandor Kollar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PIG-5163:
---
Attachment: PIG-5163_1.patch

> MultiQuery_Streaming_1 is failing with spark exec type
> --
>
> Key: PIG-5163
> URL: https://issues.apache.org/jira/browse/PIG-5163
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-5163_1.patch
>
>
> 2nd output was empty, looks like pig on spark didn't generate any data.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (PIG-5163) MultiQuery_Streaming_1 is failing with spark exec type

2017-03-29 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15947177#comment-15947177
 ] 

Nandor Kollar commented on PIG-5163:


I did some more investigation, and it seems that the problem is related to 
PIG-4675. We maintain a mapping of operator keys between spark operators, and in 
this case, after multiquery optimization and before join group optimization, we 
have a mapping in SparkOperator[scope-30] multiQueryOptimizeConnectionMap: scope-22 
-> scope-8. But after join group optimization the local rearrange in scope-22 
is deleted and replaced with a POJoinGroupSpark that has the same operator key as 
the global rearrange. I think replacing the mapping in the 
multiQueryOptimizeConnectionMap would fix the problem (see attached patch). 
However, I'm not sure why we need this map.
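The fix idea described above can be sketched in isolation. This is a hypothetical helper, not Pig's actual code: the class name, the `remap` method, and the key `scope-25` are all made up for illustration. It only shows the invariant that matters here: when an optimizer deletes an operator and replaces it with one carrying a different key, any connection-map entries still referring to the old key must be rewritten, or later phases follow a dangling key.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical sketch (not Pig's code) of remapping stale operator keys in a
 * connection map after an optimizer replaces one operator with another.
 */
public class ConnectionMapRemap {

    /** Replace every occurrence of oldKey (as key or value) with newKey. */
    static Map<String, String> remap(Map<String, String> connections,
                                     String oldKey, String newKey) {
        Map<String, String> result = new HashMap<>();
        for (Map.Entry<String, String> e : connections.entrySet()) {
            String k = e.getKey().equals(oldKey) ? newKey : e.getKey();
            String v = e.getValue().equals(oldKey) ? newKey : e.getValue();
            result.put(k, v);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> connections = new HashMap<>();
        // scope-22 (the local rearrange) feeds scope-8 (the POStream)
        connections.put("scope-22", "scope-8");
        // join group optimization deletes scope-22 and replaces it with a
        // POJoinGroupSpark; "scope-25" is an invented key for its replacement
        Map<String, String> fixed = remap(connections, "scope-22", "scope-25");
        System.out.println(fixed.get("scope-25")); // scope-8
    }
}
```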

> MultiQuery_Streaming_1 is failing with spark exec type
> --
>
> Key: PIG-5163
> URL: https://issues.apache.org/jira/browse/PIG-5163
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
> Attachments: PIG-5163_1.patch
>
>
> 2nd output was empty, looks like pig on spark didn't generate any data.





[jira] [Updated] (PIG-5158) Several e2e tests are marked to run only in Tez or MR mode only

2017-03-29 Thread Nandor Kollar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PIG-5158:
---
Attachment: PIG-5158_4.patch

> Several e2e tests are marked to run only in Tez or MR mode only
> ---
>
> Key: PIG-5158
> URL: https://issues.apache.org/jira/browse/PIG-5158
> Project: Pig
>  Issue Type: Task
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: spark-branch
>
> Attachments: PIG-5158_2.patch, PIG-5158_3.patch, PIG-5158_4.patch, 
> PIG-5158.patch
>
>
> While executing the e2e tests in spark mode, I noticed that several tests are 
> marked with 'execonly' => 'mapred,local' or 'execonly' => 'mapred,tez'. Revise 
> these tests and add spark for those where it makes sense to test on Spark.





[jira] [Commented] (PIG-5176) Several ComputeSpec test cases fail

2017-03-30 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15948768#comment-15948768
 ] 

Nandor Kollar commented on PIG-5176:


Let's leave this open for now. I'll verify, but I'm afraid it is still an issue 
on my cluster. It might be related to the Spark version I use on my cluster; 
this needs more investigation.

> Several ComputeSpec test cases fail
> ---
>
> Key: PIG-5176
> URL: https://issues.apache.org/jira/browse/PIG-5176
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: spark-branch
>
> Attachments: PIG-5176.patch
>
>
> Several ComputeSpec test cases failed on my cluster:
> ComputeSpec_5 - ComputeSpec_13
> These scripts have a ship() part in the define clause, where the ship includes 
> the script file too, so we add the same file to the Spark context twice. This 
> is not a problem with Hadoop, but it looks like Spark rejects adding the same 
> filename twice:
> {code}
> Caused by: java.lang.IllegalArgumentException: requirement failed: File 
> PigStreamingDepend.pl already registered.
> at scala.Predef$.require(Predef.scala:233)
> at 
> org.apache.spark.rpc.netty.NettyStreamManager.addFile(NettyStreamManager.scala:69)
> at org.apache.spark.SparkContext.addFile(SparkContext.scala:1386)
> at org.apache.spark.SparkContext.addFile(SparkContext.scala:1348)
> at 
> org.apache.spark.api.java.JavaSparkContext.addFile(JavaSparkContext.scala:662)
> at 
> org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.addResourceToSparkJobWorkingDirectory(SparkLauncher.java:462)
> at 
> org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.shipFiles(SparkLauncher.java:371)
> at 
> org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.addFilesToSparkJob(SparkLauncher.java:357)
> at 
> org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.uploadResources(SparkLauncher.java:235)
> at 
> org.apache.pig.backend.hadoop.executionengine.spark.SparkLauncher.launchPig(SparkLauncher.java:222)
> at 
> org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.launchPig(HExecutionEngine.java:290)
> {code}
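One way to avoid the failure in the stack trace above is to deduplicate shipped file names before registering them, since Spark's stream manager requires each name to be added at most once. The sketch below is hypothetical, not Pig's actual fix: the class name and the `addFile` callback (a stand-in for `SparkContext#addFile`) are invented, as is the `udf.jar` example entry; only `PigStreamingDepend.pl` comes from the report.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Consumer;

/**
 * Hypothetical sketch: register each shipped file name only once, so a file
 * listed both implicitly (the streaming script) and via ship() is not added
 * to the Spark context twice.
 */
public class ShipFileDeduplicator {
    private final Set<String> registered = new LinkedHashSet<>();

    /** Registers each unseen file via addFile; returns the files actually added. */
    public List<String> shipAll(List<String> files, Consumer<String> addFile) {
        List<String> added = new ArrayList<>();
        for (String f : files) {
            if (registered.add(f)) {   // Set#add returns false for duplicates
                addFile.accept(f);     // stand-in for SparkContext#addFile
                added.add(f);
            }
        }
        return added;
    }

    public static void main(String[] args) {
        ShipFileDeduplicator dedup = new ShipFileDeduplicator();
        dedup.shipAll(
                List.of("PigStreamingDepend.pl", "PigStreamingDepend.pl", "udf.jar"),
                f -> System.out.println("addFile: " + f));
        // prints addFile: PigStreamingDepend.pl, then addFile: udf.jar
    }
}
```

Keeping the set on the deduplicator (rather than per call) also covers the case where the same file is shipped by two separate define clauses in one script.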







[jira] [Updated] (PIG-5163) MultiQuery_Streaming_1 is failing with spark exec type

2017-03-29 Thread Nandor Kollar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar updated PIG-5163:
---
Attachment: (was: PIG-5163_1.patch)

> MultiQuery_Streaming_1 is failing with spark exec type
> --
>
> Key: PIG-5163
> URL: https://issues.apache.org/jira/browse/PIG-5163
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
>
> 2nd output was empty, looks like pig on spark didn't generate any data.





[jira] [Commented] (PIG-5158) Several e2e tests are marked to run only in Tez or MR mode only

2017-03-29 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15947324#comment-15947324
 ] 

Nandor Kollar commented on PIG-5158:


Uploaded PIG-5158_4.patch: illustrate is not yet implemented for Spark, so I'm 
skipping the related test case on Spark for now. I created a separate Jira to 
track this: PIG-5204.

> Several e2e tests are marked to run only in Tez or MR mode only
> ---
>
> Key: PIG-5158
> URL: https://issues.apache.org/jira/browse/PIG-5158
> Project: Pig
>  Issue Type: Task
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: spark-branch
>
> Attachments: PIG-5158_2.patch, PIG-5158_3.patch, PIG-5158_4.patch, 
> PIG-5158.patch
>
>
> While executing the e2e tests in spark mode, I noticed that several tests are 
> marked with 'execonly' => 'mapred,local' or 'execonly' => 'mapred,tez'. Revise 
> these tests and add spark for those where it makes sense to test on Spark.





[jira] [Comment Edited] (PIG-5163) MultiQuery_Streaming_1 is failing with spark exec type

2017-03-29 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15948015#comment-15948015
 ] 

Nandor Kollar edited comment on PIG-5163 at 3/29/17 10:42 PM:
--

[~kellyzly] thank you for the review! MapReduceLauncher was modified because I 
think it is better to use existing constants instead of hardcoded strings, but 
reverting that change is fine; it wasn't related to this fix.
I have one more question regarding multiquery optimization: do we need 
multiQueryOptimizeConnectionMap? The plan after the optimization looked 
strange; I don't understand why we connect the two spark operators via a map 
instead of replacing scope-34 (loading of the temp file) with scope-8 (POStream) 
in the optimized plan. My patch fixes the e2e test, and if it doesn't break any 
other test case (I'll execute the entire e2e test suite to make sure it 
doesn't) then it is fine, though we might have to reconsider the need for 
multiQueryOptimizeConnectionMap later.


was (Author: nkollar):
[~kellyzly] thank you for the review! MapReduceLauncher was modified, because I 
think instead of hardcoded strings it is better to use existing constants, but 
reverting that change is fine, it wasn't related to this fix.
I've one more question regarding multiquery optimization: do we need 
multiQueryOptimizeConnectionMap? The plan after the optimization looked 
strange, I don't understand why do we connect the two spark operators via a map 
instead of replacing scope-34 (loading of temp file) with scope-8 (POStream) in 
the optimized plan. My patch fixes the e2e test, and if it doesn't break any 
other test case (I'll execute the entire e2e test suite to make sure it 
doesn't) then it is fine, though we might have to reconsider the need of 
multiQueryOptimizeConnectionMap later.

> MultiQuery_Streaming_1 is failing with spark exec type
> --
>
> Key: PIG-5163
> URL: https://issues.apache.org/jira/browse/PIG-5163
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: spark-branch
>
> Attachments: PIG-5163_1.patch
>
>
> 2nd output was empty, looks like pig on spark didn't generate any data.





[jira] [Commented] (PIG-5163) MultiQuery_Streaming_1 is failing with spark exec type

2017-03-29 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15948015#comment-15948015
 ] 

Nandor Kollar commented on PIG-5163:


[~kellyzly] thank you for the review! MapReduceLauncher was modified because I 
think it is better to use existing constants instead of hardcoded strings, but 
reverting that change is fine; it wasn't related to this fix.
I have one more question regarding multiquery optimization: do we need 
multiQueryOptimizeConnectionMap? The plan after the optimization looked 
strange; I don't understand why we connect the two spark operators via a map 
instead of replacing scope-34 (loading of the temp file) with scope-8 (POStream) 
in the optimized plan. My patch fixes the e2e test, and if it doesn't break any 
other test case (I'll execute the entire e2e test suite to make sure it 
doesn't) then it is fine, though we might have to reconsider the need for 
multiQueryOptimizeConnectionMap later.

> MultiQuery_Streaming_1 is failing with spark exec type
> --
>
> Key: PIG-5163
> URL: https://issues.apache.org/jira/browse/PIG-5163
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: spark-branch
>
> Attachments: PIG-5163_1.patch
>
>
> 2nd output was empty, looks like pig on spark didn't generate any data.





[jira] [Resolved] (PIG-5172) StreamingPerformance_2 is failing with spark exec type

2017-03-29 Thread Nandor Kollar (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-5172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nandor Kollar resolved PIG-5172.

Resolution: Duplicate

This test fails for the same reason.

> StreamingPerformance_2 is failing with spark exec type
> --
>
> Key: PIG-5172
> URL: https://issues.apache.org/jira/browse/PIG-5172
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Nandor Kollar
> Fix For: spark-branch
>
>
> results are different





[jira] [Commented] (PIG-5163) MultiQuery_Streaming_1 is failing with spark exec type

2017-03-29 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5163?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15946895#comment-15946895
 ] 

Nandor Kollar commented on PIG-5163:


[~kellyzly] the plan after multiquery optimization you mentioned above is in 
fact the plan after both multiquery optimization and JoinGroupOptimizerSpark. 
The plan I mentioned is after multiquery optimization but before 
JoinGroupOptimizerSpark. The join group optimizer, as far as I understand, just 
merges LocalRearrange, GlobalRearrange and Package into one operator, 
POJoinGroupSpark. When it tries to merge the LR, GR, P pattern in scope-22, 
since the multiquery optimizer deleted the predecessor (the loading of the 
temporary file in scope-34), POJoinGroupSpark will have only one predecessor:
{code}
List<PhysicalOperator> predOfLRAList = plan.getPredecessors(lra);
{code}
will be null for scope-22. Correct me if I'm wrong, but according to this *I 
think the bug is somewhere in MultiQueryOptimizerSpark#visitSparkOp*.

By the way, I think JoinGroupSparkConverter#convert should have failed: I don't 
think joining a single RDD makes sense at all, does it? So instead of
{code}
SparkUtil.assertPredecessorSizeGreaterThan(predecessors, op, 0)
{code}
we should check for
{code}
SparkUtil.assertPredecessorSizeGreaterThan(predecessors, op, 1)
{code}
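The effect of tightening that threshold can be sketched as follows. This is a self-contained stand-in, not Pig's SparkUtil: the class, the exception type, and the string RDD placeholders are invented for illustration. The point is that a join converter should refuse to run with fewer than two inputs, so a plan bug that drops a predecessor fails fast instead of silently producing empty output.

```java
import java.util.List;

/**
 * Hypothetical stand-in for SparkUtil's predecessor-count precondition:
 * fail fast when an operator has too few inputs.
 */
public class PredecessorCheck {

    static void assertPredecessorSizeGreaterThan(List<?> predecessors,
                                                 String op, int size) {
        if (predecessors == null || predecessors.size() <= size) {
            throw new IllegalStateException("operator " + op
                    + " expected more than " + size + " predecessors, got "
                    + (predecessors == null ? 0 : predecessors.size()));
        }
    }

    public static void main(String[] args) {
        // Two inputs: a join is well-formed, so the stricter check passes.
        assertPredecessorSizeGreaterThan(List.of("rdd1", "rdd2"), "POJoinGroupSpark", 1);
        // One input: with the threshold raised from 0 to 1, this fails fast
        // instead of attempting a one-sided join.
        try {
            assertPredecessorSizeGreaterThan(List.of("rdd1"), "POJoinGroupSpark", 1);
        } catch (IllegalStateException expected) {
            System.out.println(expected.getMessage());
        }
    }
}
```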

> MultiQuery_Streaming_1 is failing with spark exec type
> --
>
> Key: PIG-5163
> URL: https://issues.apache.org/jira/browse/PIG-5163
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: liyunzhang_intel
> Fix For: spark-branch
>
>
> 2nd output was empty, looks like pig on spark didn't generate any data.





[jira] [Commented] (PIG-5170) SkewedJoin_14 is failing with spark exec type

2017-03-31 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5170?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15950708#comment-15950708
 ] 

Nandor Kollar commented on PIG-5170:


Thanks [~kexianda]! [~kellyzly] I'll close these issues as duplicates then, 
since we now know the problem. Do we plan to support PIG-5206 in 0.17? If we 
don't, then we should exclude the related e2e test cases for now.

> SkewedJoin_14 is failing with spark exec type
> -
>
> Key: PIG-5170
> URL: https://issues.apache.org/jira/browse/PIG-5170
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Xianda Ke
> Fix For: spark-branch
>
>
> results are different





[jira] [Commented] (PIG-5167) Limit_4 is failing with spark exec type

2017-03-24 Thread Nandor Kollar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5167?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15940432#comment-15940432
 ] 

Nandor Kollar commented on PIG-5167:


[~kellyzly] actually I updated PIG-5158 with patch #3: excluded Limit_5 and 
included Limit_12. I guess that's what we need to fix the Limit_5 failure, right?

> Limit_4 is failing with spark exec type
> ---
>
> Key: PIG-5167
> URL: https://issues.apache.org/jira/browse/PIG-5167
> Project: Pig
>  Issue Type: Sub-task
>  Components: spark
>Reporter: Nandor Kollar
>Assignee: Nandor Kollar
> Fix For: spark-branch
>
> Attachments: PIG-5167_2.patch, PIG-5167_3.patch, PIG-5167.patch
>
>
> results are different:
> {code}
> diff <(head -n 5 Limit_4.out/out_sorted) <(head -n 5 
> Limit_4_benchmark.out/out_sorted)
> 1,5c1,5
> < 50  3.00
> < 74  2.22
> < alice carson66  2.42
> < alice quirinius 71  0.03
> < alice van buren 28  2.50
> ---
> > bob allen   0.28
> > bob allen   22  0.92
> > bob allen   25  2.54
> > bob allen   26  2.35
> > bob allen   27  2.17
> {code}




