[jira] [Commented] (PIG-2901) Errors and lacks in document "Pig Latin Basics"

2012-09-07 Thread Miyakawa Taku (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13451263#comment-13451263
 ] 

Miyakawa Taku commented on PIG-2901:


Thank you, now I understand what the section means.

Could you apply the patch?


> Errors and lacks in document "Pig Latin Basics"
> ---
>
> Key: PIG-2901
> URL: https://issues.apache.org/jira/browse/PIG-2901
> Project: Pig
> Issue Type: Bug
> Components: documentation
> Affects Versions: 0.10.0
> Reporter: Miyakawa Taku
> Assignee: Miyakawa Taku
> Labels: documentation
> Attachments: PIG-2901.patch
>
>
> This is a patch to fix errors and omissions in the document "Pig Latin Basics".
> # States that COGROUP groups records with a null key _from different 
> relations_ separately.
> # "A map key must be a -scalar- +chararray+ "
> # Removes a statement which says that a star expression is a tuple expression 
> (it seems incorrect)
> # Fixes a subject confusion of a sentence "When two bytearrays are used in 
> arithmetic expressions..."
> # Updates a link to Java API documentation.
> # Fixes a tuple example: "LOAD 'data' as..." -> "A = LOAD 'data' as..."
> # "the asterisk (\*) is used to project all -tuples- +fields+ "
> # A result of COGROUP with two relations contains _three_ fields, not _two_
> # Removes an example of COGROUP INNER, which is deprecated
> # Removes a sentence which says "JOIN operator always performs an inner
> join". Actually, JOIN can also perform an outer join.
> # JOIN "Performs an outer join of two -ore more- relations"
> # Replaces an example of "-Dpig.additional.jars" with a jar file on HDFS. The 
> current version incorrectly shows an example of a Pig script on HDFS.
> # Fixes typos, missing hyperlinks, inappropriate indentation, and incorrect
> chaptering.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


Re: Modifying databag on the fly

2012-09-07 Thread Dmitriy Ryaboy
FYI -- we wound up going with a much cleaner and memory-friendly
solution of returning a new databag implementation which simply
proxied all the calls to the original bag, but returned a special
Iterator which applied the necessary transformation to tuples on the
fly. That way, we don't need to have the whole thing in memory twice
and cause spillage.
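
The idea can be sketched roughly as follows. This is only an illustration:
plain java.util types stand in for Pig's DataBag and Tuple, and the names
TransformingBag and dropField are made up for the example, not taken from the
actual implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

// Illustration only: Iterable<List<Object>> stands in for a Pig DataBag of Tuples.
class LazyBagSketch {

    // Wraps an existing "bag" and transforms each tuple only when it is iterated,
    // so a second full copy of the bag is never materialized.
    static final class TransformingBag implements Iterable<List<Object>> {
        private final Iterable<List<Object>> original;
        private final Function<List<Object>, List<Object>> transform;

        TransformingBag(Iterable<List<Object>> original,
                        Function<List<Object>, List<Object>> transform) {
            this.original = original;
            this.transform = transform;
        }

        @Override
        public Iterator<List<Object>> iterator() {
            final Iterator<List<Object>> it = original.iterator();
            return new Iterator<List<Object>>() {
                @Override public boolean hasNext() { return it.hasNext(); }
                @Override public List<Object> next() { return transform.apply(it.next()); }
            };
        }
    }

    // Hypothetical helper: returns a copy of the tuple without the field at `index`.
    static List<Object> dropField(List<Object> tuple, int index) {
        List<Object> out = new ArrayList<>(tuple);
        out.remove(index);
        return out;
    }

    public static void main(String[] args) {
        List<List<Object>> bag = Arrays.asList(
            Arrays.<Object>asList(111, 222, 3, 121),
            Arrays.<Object>asList(112, 223, 2, 131),
            Arrays.<Object>asList(113, 224, 4, 141));
        // Each tuple loses its 3rd field as it is read; the original bag is untouched.
        for (List<Object> t : new TransformingBag(bag, tup -> dropField(tup, 2))) {
            System.out.println(t);
        }
    }
}
```

Only one transformed tuple exists at a time, which is why the wrapped bag can
keep spilling as usual.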

D

On Wed, Sep 5, 2012 at 7:38 PM, Alan Gates wrote:
>
> On Sep 5, 2012, at 6:30 PM, Prasanth J wrote:
>
>> Ahh.. Now it makes more sense.
>>
>> I think I got the solution. I was adding to List and then finally 
>> creating a DataBag with that list.. Instead I should create a bag and keep 
>> adding to it..!! Is that correct?
> Yes.
>
> Alan.
>
>> Thanks Alan.
>>
>> Thanks
>> -- Prasanth
>>
>> On Sep 5, 2012, at 9:24 PM, Alan Gates wrote:
>>
>>> You cannot modify a bag once it is written.  The implementation is written 
>>> around the assumption that bags are immutable after they are written.
>>>
>>> Creating a new bag should not create an OOM exception, as bags are built to 
>>> spill when they grow too large.  In fact it's this spilling feature that 
>>> makes in place modification impossible.
>>>
>>> Alan.
>>>
>>> On Sep 5, 2012, at 6:08 PM, Prasanth J wrote:
>>>
>>>> Hello devs
>>>>
>>>> I have a specific case where I need to modify the contents of a DataBag
>>>> (remove a field from each tuple), but I want to do it in place and do not
>>>> want to create another DataBag with a new set of tuples.
>>>> The situation is, say I have the following input for a UDF
>>>>
>>>> {(111,222,3,121), (112,223,2,131), (113,224,4,141)}
>>>>
>>>> I want to iterate through this bag and generate an output bag removing the
>>>> 3rd field of each tuple in the bag to get the following output
>>>> {(111,222,121), (112,223,131), (113,224,141)}
>>>>
>>>> Since the number of tuples in this bag is expected to be large, I cannot
>>>> create a new set of tuples and create a bag, as this will cause an OOM
>>>> exception.
>>>>
>>>> Also, I do not want to flatten this bag, as it will be passed to the
>>>> DISTINCT operator for computing distinct elements in the bag.
>>>> As seen from the javadocs for DataBag, there is no way to convert a bag on
>>>> the fly. I wonder if there is any other way to solve this?
>>>>
>>>> Thanks
>>>> -- Prasanth
>>>>
>>>
>>
>


[jira] [Created] (PIG-2912) Pig should clone JobConf while creating JobContextImpl and TaskAttemptContextImpl in Hadoop23

2012-09-07 Thread Rohini Palaniswamy (JIRA)
Rohini Palaniswamy created PIG-2912:
---

Summary: Pig should clone JobConf while creating JobContextImpl and TaskAttemptContextImpl in Hadoop23
Key: PIG-2912
URL: https://issues.apache.org/jira/browse/PIG-2912
Project: Pig
Issue Type: Bug
Affects Versions: 0.9.3, 0.10.1
Reporter: Rohini Palaniswamy
Assignee: Rohini Palaniswamy
Fix For: 0.9.3, 0.11, 0.10.1


There is a change in the semantics of
JobContext::JobContext(Configuration, JobID). While in .20 the Config was
cloned, in .23 the Config is adopted (if it's a JobConf). This causes the same
Configuration instance to be written to for different tables in the same job.

This affects multi-store commands in Pig on Hadoop 23/2.0. The
cloning in HadoopShims was part of PIG-2578 but was reverted due to other issues.
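
The difference between the two behaviors can be illustrated with a small
stand-alone sketch. It makes assumptions throughout: a plain Map stands in for
the Hadoop Configuration/JobConf, Context is a hypothetical stand-in for the
job context, and "output.table" is an invented key.

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only: Map<String, String> stands in for a Hadoop Configuration/JobConf.
class ConfCloneSketch {

    // Stand-in for a job context; `copy` selects the .20-style behavior (clone the
    // configuration) or the .23-style behavior (adopt the caller's instance).
    static final class Context {
        final Map<String, String> conf;

        Context(Map<String, String> conf, boolean copy) {
            this.conf = copy ? new HashMap<>(conf) : conf;
        }
    }

    public static void main(String[] args) {
        Map<String, String> shared = new HashMap<>();

        // Adopted: both contexts write into the same instance, so the second
        // store's setting overwrites the first one.
        Context a = new Context(shared, false);
        Context b = new Context(shared, false);
        a.conf.put("output.table", "table_a");
        b.conf.put("output.table", "table_b");
        System.out.println("adopted: " + a.conf.get("output.table"));

        // Cloned: each context keeps its own settings.
        Context c = new Context(shared, true);
        Context d = new Context(shared, true);
        c.conf.put("output.table", "table_c");
        d.conf.put("output.table", "table_d");
        System.out.println("cloned:  " + c.conf.get("output.table"));
    }
}
```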



[jira] [Resolved] (PIG-1891) Enable StoreFunc to make intelligent decision based on job success or failure

2012-09-07 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates resolved PIG-1891.
-

   Resolution: Fixed
Fix Version/s: 0.11
 Release Note: This adds a new method, cleanupOnSuccess, to the StoreFunc 
interface, and thus will cause backward compatibility issues for users who 
directly implement this interface.  Most store functions implement 
StoreFuncImpl, which will shield them from this issue as it implements the new 
method.

Patch checked in.  Thanks Eli.
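
For readers wondering what the compatibility note means in practice, here is a
rough sketch. The types are simplified stand-ins, not Pig's real store function
classes (the real hook methods take more parameters), and PublishingStore is a
hypothetical example.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: simplified stand-ins, not Pig's real store function types.
class CleanupSketch {

    // Stand-in for the store interface. Anyone implementing it directly must
    // now also provide the newly added success hook.
    interface StoreLike {
        void cleanupOnFailure(String location);
        void cleanupOnSuccess(String location);
    }

    // Stand-in for the base class: it supplies no-op hooks, so subclasses written
    // before the new method existed still compile unchanged.
    static abstract class StoreBase implements StoreLike {
        @Override public void cleanupOnFailure(String location) { }
        @Override public void cleanupOnSuccess(String location) { }
    }

    // Hypothetical store that publishes its output only after the whole job succeeded.
    static final class PublishingStore extends StoreBase {
        final List<String> published = new ArrayList<>();

        @Override public void cleanupOnSuccess(String location) {
            published.add(location);
        }
    }

    public static void main(String[] args) {
        PublishingStore store = new PublishingStore();
        store.cleanupOnFailure("/tmp/failed_run"); // inherited no-op, nothing is published
        store.cleanupOnSuccess("/tmp/good_run");   // job-level success, publish the output
        System.out.println(store.published);
    }
}
```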

> Enable StoreFunc to make intelligent decision based on job success or failure
> -
>
> Key: PIG-1891
> URL: https://issues.apache.org/jira/browse/PIG-1891
> Project: Pig
> Issue Type: New Feature
> Affects Versions: 0.10.0
> Reporter: Alex Rovner
> Assignee: Eli Reisman
> Priority: Minor
> Labels: patch
> Fix For: 0.11
>
> Attachments: PIG-1891-1.patch, PIG-1891-2.patch, PIG-1891-3.patch
>
>
> We are in the process of using Pig for various data processing and component
> integration tasks. Here is where we feel Pig storage funcs are lacking:
> They are not aware of whether the overall job has succeeded. This creates a
> problem for storage funcs which need to "upload" results into another system:
> DB, FTP, another file system etc.
> I looked at the DBStorage in the piggybank
> (http://svn.apache.org/viewvc/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/DBStorage.java?view=markup)
> and what I see is essentially a mechanism which for each task does the
> following:
> 1. Creates a record writer (in this case, opens a connection to the DB).
> 2. Opens a transaction.
> 3. Writes records into a batch.
> 4. Executes commit or rollback depending on whether the task was successful.
> While this approach works great on a task level, it does not work at all on a
> job level.
> If certain tasks succeed but the overall job fails, partial records are
> going to get uploaded into the DB.
> Any ideas on a workaround?
> Our current workaround is fairly ugly: we created a Java wrapper that
> launches Pig jobs and then uploads to the DBs once the Pig job is successful.
> While the approach works, it's not really integrated into Pig.



[jira] [Assigned] (PIG-1891) Enable StoreFunc to make intelligent decision based on job success or failure

2012-09-07 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates reassigned PIG-1891:
---

Assignee: Eli Reisman

> Enable StoreFunc to make intelligent decision based on job success or failure
> -
>
> Key: PIG-1891
> URL: https://issues.apache.org/jira/browse/PIG-1891
> Project: Pig
> Issue Type: New Feature
> Affects Versions: 0.10.0
> Reporter: Alex Rovner
> Assignee: Eli Reisman
> Priority: Minor
> Labels: patch
> Attachments: PIG-1891-1.patch, PIG-1891-2.patch, PIG-1891-3.patch
>
>



[jira] [Updated] (PIG-2911) GenMRSkewJoinProcessor uses File.Separator instead of Path.Separator

2012-09-07 Thread Thejas M Nair (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair updated PIG-2911:
---

Resolution: Invalid
Status: Resolved  (was: Patch Available)

Sorry, created the bug on the wrong product! :)


> GenMRSkewJoinProcessor uses File.Separator instead of Path.Separator
> 
>
> Key: PIG-2911
> URL: https://issues.apache.org/jira/browse/PIG-2911
> Project: Pig
> Issue Type: Bug
> Reporter: Thejas M Nair
> Assignee: Thejas M Nair
> Attachments: PIG-2911.1.patch
>
>
> This causes testcase skewjoin.q to fail on Windows.



[jira] [Updated] (PIG-2911) GenMRSkewJoinProcessor uses File.Separator instead of Path.Separator

2012-09-07 Thread Thejas M Nair (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair updated PIG-2911:
---

Status: Patch Available  (was: Open)

> GenMRSkewJoinProcessor uses File.Separator instead of Path.Separator
> 
>
> Key: PIG-2911
> URL: https://issues.apache.org/jira/browse/PIG-2911
> Project: Pig
> Issue Type: Bug
> Reporter: Thejas M Nair
> Assignee: Thejas M Nair
> Attachments: PIG-2911.1.patch
>
>
> This causes testcase skewjoin.q to fail on Windows.



[jira] [Updated] (PIG-2911) GenMRSkewJoinProcessor uses File.Separator instead of Path.Separator

2012-09-07 Thread Thejas M Nair (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair updated PIG-2911:
---

Attachment: PIG-2911.1.patch

> GenMRSkewJoinProcessor uses File.Separator instead of Path.Separator
> 
>
> Key: PIG-2911
> URL: https://issues.apache.org/jira/browse/PIG-2911
> Project: Pig
> Issue Type: Bug
> Reporter: Thejas M Nair
> Assignee: Thejas M Nair
> Attachments: PIG-2911.1.patch
>
>
> This causes testcase skewjoin.q to fail on Windows.



[jira] [Created] (PIG-2911) GenMRSkewJoinProcessor uses File.Separator instead of Path.Separator

2012-09-07 Thread Thejas M Nair (JIRA)
Thejas M Nair created PIG-2911:
--

Summary: GenMRSkewJoinProcessor uses File.Separator instead of Path.Separator
Key: PIG-2911
URL: https://issues.apache.org/jira/browse/PIG-2911
Project: Pig
Issue Type: Bug
Reporter: Thejas M Nair
Assignee: Thejas M Nair
Attachments: PIG-2911.1.patch

This causes testcase skewjoin.q to fail on Windows.
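
The general pitfall named in the summary can be illustrated as below. This is
not the code from the attached patch: the join helper and the path components
are invented for the example, and HDFS_SEPARATOR simply mirrors the value of
Hadoop's Path.SEPARATOR ("/") so that the sketch compiles without Hadoop on the
classpath.

```java
import java.io.File;

// Illustration only: shows why a platform-dependent separator is a poor fit
// for paths that are not local file system paths.
class SeparatorSketch {

    // Mirrors the value of org.apache.hadoop.fs.Path.SEPARATOR.
    static final String HDFS_SEPARATOR = "/";

    // Hypothetical helper: joins path components with the given separator.
    static String join(String separator, String... parts) {
        return String.join(separator, parts);
    }

    public static void main(String[] args) {
        // Platform dependent: "\" on Windows, "/" on Unix-like systems.
        System.out.println(join(File.separator, "tmp", "skewjoin", "part-0"));
        // Always "/", regardless of the platform the client runs on.
        System.out.println(join(HDFS_SEPARATOR, "tmp", "skewjoin", "part-0"));
    }
}
```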




[jira] [Commented] (PIG-2898) Multithreaded execution of e2e tests

2012-09-07 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13450854#comment-13450854
 ] 

Rohini Palaniswamy commented on PIG-2898:
-

Ivan,
  All the changes to the e2e framework that got it working with H23, along with
benchmark caching, are removed by this patch. I think this is because you started
before https://issues.apache.org/jira/browse/PIG-2484 went into the 0.9 branch.
You will have to update the patch to include those changes.

> Multithreaded execution of e2e tests
> 
>
> Key: PIG-2898
> URL: https://issues.apache.org/jira/browse/PIG-2898
> Project: Pig
> Issue Type: Improvement
> Components: e2e harness
> Reporter: Andrey Klochkov
> Assignee: Andrey Klochkov
> Attachments: pig-2898-for-svn-branch-0.9.patch
>
>
> Today it takes ~19 hours to run the full set of e2e tests in mapred mode. The
> bottleneck is on the client side, and per our observations it would help a
> lot if the e2e harness were able to run tests in parallel threads.
> We prototyped changes in the e2e harness that allow running tests in a
> configurable number of threads. Preliminary results show a more than 6x
> reduction in execution time when using a small 3-node M/R cluster with a
> modest configuration. We are going to share a patch shortly.



[jira] [Updated] (PIG-2898) Multithreaded execution of e2e tests

2012-09-07 Thread Ivan A. Veselovsky (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan A. Veselovsky updated PIG-2898:


Attachment: pig-2898-for-svn-branch-0.9.patch

The patch pig-2898-for-svn-branch-0.9.patch is attached.

> Multithreaded execution of e2e tests
> 
>
> Key: PIG-2898
> URL: https://issues.apache.org/jira/browse/PIG-2898
> Project: Pig
> Issue Type: Improvement
> Components: e2e harness
> Reporter: Andrey Klochkov
> Assignee: Andrey Klochkov
> Attachments: pig-2898-for-svn-branch-0.9.patch
>
>



[jira] [Updated] (PIG-2898) Multithreaded execution of e2e tests

2012-09-07 Thread Ivan A. Veselovsky (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-2898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan A. Veselovsky updated PIG-2898:


Patch Info: Patch Available

We provided a parallelized mode of e2e test execution using
Parallel::ForkManager.
Two parameters affect the behavior:
1) file.fork.factor -- max number of subprocesses when running test
configuration files (.conf);
2) fork.factor -- max number of subprocesses when running tests within one
.conf file.
The total number of subprocesses cannot exceed the product of the two values.
A value <= 1 means no parallelization.
Example: ant -Dfork.factor=3 -Dfile.fork.factor=3 ... test-e2e
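
As a quick illustration of the bound, the rule above can be written out as
follows. This only restates the arithmetic and is not the harness code (which
is Perl); treating a value <= 1 as a single process is an interpretation of the
"no parallelization" rule, and the method names are made up.

```java
// Illustration only: restates the subprocess bound, not the harness implementation.
class ForkFactorSketch {

    // Assumption: a value <= 1 means one process at that level.
    static int effective(int factor) {
        return Math.max(1, factor);
    }

    // Upper bound on concurrently running subprocesses: the product of both factors.
    static int maxSubprocesses(int fileForkFactor, int forkFactor) {
        return effective(fileForkFactor) * effective(forkFactor);
    }

    public static void main(String[] args) {
        // Corresponds to: ant -Dfork.factor=3 -Dfile.fork.factor=3 ... test-e2e
        System.out.println(maxSubprocesses(3, 3));
        // .conf files run one at a time, tests within a file run up to 4 at once.
        System.out.println(maxSubprocesses(0, 4));
    }
}
```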

The attached patch is to be applied to 
http://svn.apache.org/repos/asf/pig/branches/branch-0.9/ branch.

The patch testing procedure gives the following results for the patch:
 [exec] -1 overall.
 [exec]
 [exec] +1 @author.  The patch does not contain any @author tags.
 [exec]
 [exec] +1 tests included.  The patch appears to include 24 new or modified tests.
 [exec]
 [exec] -1 javadoc.  The javadoc tool appears to have generated 1 warning messages.
 [exec]
 [exec] +1 javac.  The applied patch does not increase the total number of javac compiler warnings.
 [exec]
 [exec] +1 findbugs.  The patch does not introduce any new Findbugs warnings.
 [exec]
 [exec] +1 release audit.  The applied patch does not increase the total number of release audit warnings.
 [exec]

> Multithreaded execution of e2e tests
> 
>
> Key: PIG-2898
> URL: https://issues.apache.org/jira/browse/PIG-2898
> Project: Pig
> Issue Type: Improvement
> Components: e2e harness
> Reporter: Andrey Klochkov
> Assignee: Andrey Klochkov
>



POCollectedGroup and LoadFunc indicator interface

2012-09-07 Thread Vasco Visser
Hi,

I am new to the list. I've been working on the Pig code base,
adding my own blocking map side POs (e.g., map side join, map side
grouping) for when assertions can be made with regard to fragmentation
of input relations. This is partly inspired by the new block placement
policy possibilities in hadoop-2.

Anyway, my question to the list is the following. Whilst looking at
the code for POCollectedGroup I noticed that this PO expects split
content to be sorted. On the other hand, the Collectable loader
interface only seems to indicate that keys are unique across splits.
Why is there this discrepancy? Is there a good reason not to have an
indicator interface that captures all input requirements, e.g., something
like OrderedCollectableLoadFunc?
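
To make the suggestion concrete, here is a rough sketch of what such an
indicator could look like. Everything in it is a simplified stand-in: the
interfaces are empty markers rather than Pig's real loader interfaces, and the
loader classes and canUseCollectedGroup are hypothetical.

```java
// Illustration only: empty marker interfaces standing in for the loader contracts.
class IndicatorSketch {

    // Stand-in: the loader guarantees all instances of a key are in the same split.
    interface Collectable { }

    // Stand-in: the loader guarantees split content is sorted on the key.
    interface Ordered { }

    // The suggested combined indicator, capturing both input requirements.
    interface OrderedCollectable extends Collectable, Ordered { }

    static final class SortedLoader implements OrderedCollectable { }

    static final class UnsortedLoader implements Collectable { }

    // Under the observation above, a collected group needs both guarantees,
    // so the check is against the combined indicator.
    static boolean canUseCollectedGroup(Object loader) {
        return loader instanceof OrderedCollectable;
    }

    public static void main(String[] args) {
        System.out.println(canUseCollectedGroup(new SortedLoader()));   // true
        System.out.println(canUseCollectedGroup(new UnsortedLoader())); // false
    }
}
```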


regards,
Vasco


[jira] [Commented] (PIG-2904) Scripting UDFs should allow DEFINE statements to pass parameters to the UDF's constructor

2012-09-07 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-2904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13450454#comment-13450454
 ] 

Cheolsoo Park commented on PIG-2904:


Hello Julien,

I have a question about this jira.

I've been reading Pig UDF code to understand why I need different DEFINEs in
order to pass different constructor parameters to the AvroStorage constructor.
What I found is that aliases are mapped to FuncSpec instances that store not
only function names but also the args specified in DEFINE statements. Later, when
aliases are expanded by LogicalPlanBuilder, function objects are instantiated
from those FuncSpec instances, with the result that the args specified in DEFINE
statements are used to instantiate function objects instead of the ones specified
in LOAD statements.
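
A simplified model of the behavior I am describing is sketched below. It is
only my reading of the code, not Pig's actual classes: Spec stands in for
FuncSpec, define and resolve are hypothetical methods, and the schema names are
invented.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustration only: a simplified model of alias expansion, not Pig's code.
class AliasSketch {

    // Stand-in for FuncSpec: a class name plus constructor arguments.
    static final class Spec {
        final String className;
        final List<String> ctorArgs;

        Spec(String className, String... ctorArgs) {
            this.className = className;
            this.ctorArgs = Arrays.asList(ctorArgs);
        }
    }

    private final Map<String, Spec> defined = new HashMap<>();

    // Models: DEFINE alias ClassName('arg', ...);
    void define(String alias, Spec spec) {
        defined.put(alias, spec);
    }

    // Models alias expansion: a DEFINEd alias wins, so its stored arguments are
    // used rather than the ones written at the call site (e.g. in LOAD).
    Spec resolve(String name, String... callSiteArgs) {
        Spec fromDefine = defined.get(name);
        return fromDefine != null ? fromDefine : new Spec(name, callSiteArgs);
    }

    public static void main(String[] args) {
        AliasSketch plan = new AliasSketch();
        plan.define("AvroStorage", new Spec("AvroStorage", "schemaA"));
        System.out.println(plan.resolve("AvroStorage", "schemaB").ctorArgs); // [schemaA]
        System.out.println(plan.resolve("OtherStorage", "x").ctorArgs);      // [x]
    }
}
```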

My question is whether this jira is meant to solve the same problem or not. I am
a bit confused because the title says "scripting UDFs", but I thought that
scripting UDFs are EvalFuncs, and EvalFuncs take no parameters in their
constructors. Please forgive me if I am misunderstanding something here. I am
still learning Pig internals.

Thanks!

> Scripting UDFs should allow DEFINE statements to pass parameters to the UDF's 
> constructor
> -
>
> Key: PIG-2904
> URL: https://issues.apache.org/jira/browse/PIG-2904
> Project: Pig
> Issue Type: New Feature
> Reporter: Julien Le Dem
>

