[jira] [Commented] (PIG-2901) Errors and lacks in document "Pig Latin Basics"
[ https://issues.apache.org/jira/browse/PIG-2901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13451263#comment-13451263 ]

Miyakawa Taku commented on PIG-2901:

Thank you, now I understand what the section means. Could you apply the patch?

> Errors and lacks in document "Pig Latin Basics"
> ---
>
> Key: PIG-2901
> URL: https://issues.apache.org/jira/browse/PIG-2901
> Project: Pig
> Issue Type: Bug
> Components: documentation
> Affects Versions: 0.10.0
> Reporter: Miyakawa Taku
> Assignee: Miyakawa Taku
> Labels: documentation
> Attachments: PIG-2901.patch
>
>
> This is a patch to fix errors and lacks in document "Pig Latin Basics".
> # States that COGROUP groups records with a null key _from different relations_ separately.
> # "A map key must be a -scalar- +chararray+ "
> # Removes a statement which says that a star expression is a tuple expression (it seems incorrect)
> # Fixes a subject confusion of a sentence "When two bytearrays are used in arithmetic expressions..."
> # Updates a link to Java API documentation.
> # Fixes a tuple example: "LOAD 'data' as..." -> "A = LOAD 'data' as..."
> # "the asterisk (\*) is used to project all -tuples- +fields+ "
> # A result of COGROUP with two relations contains _three_ fields, not _two_
> # Removes an example of COGROUP INNER, which is deprecated
> # Removes a sentence which says "JOIN operator always performs an inner join". Actually, JOIN also performs an outer join.
> # JOIN "Performs an outer join of two -ore more- relations"
> # Replaces an example of "-Dpig.additional.jars" with a jar file on HDFS. The current version incorrectly shows an example of a Pig script on HDFS.
> # Fixes typos, lack of hyperlinks, inappropriate indentation, and incorrect chaptering.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
Re: Modifying databag on the fly
FYI -- we wound up going with a much cleaner and more memory-friendly solution: we return a new databag implementation that simply proxies all calls to the original bag, but returns a special Iterator that applies the necessary transformation to each tuple on the fly. That way, we don't need to hold the whole thing in memory twice and cause spillage.

D

On Wed, Sep 5, 2012 at 7:38 PM, Alan Gates wrote:
>
> On Sep 5, 2012, at 6:30 PM, Prasanth J wrote:
>
>> Ahh.. Now it makes more sense.
>>
>> I think I got the solution. I was adding to a List and then finally
>> creating a DataBag with that list. Instead I should create a bag and keep
>> adding to it! Is that correct?
> Yes.
>
> Alan.
>
>> Thanks Alan.
>>
>> Thanks
>> -- Prasanth
>>
>> On Sep 5, 2012, at 9:24 PM, Alan Gates wrote:
>>
>>> You cannot modify a bag once it is written. The implementation is written
>>> around the assumption that bags are immutable after they are written.
>>>
>>> Creating a new bag should not create an OOM exception, as bags are built to
>>> spill when they grow too large. In fact it's this spilling feature that
>>> makes in-place modification impossible.
>>>
>>> Alan.
>>>
>>> On Sep 5, 2012, at 6:08 PM, Prasanth J wrote:
>>>
>>>> Hello devs
>>>>
>>>> I have a specific case where I need to modify the contents of a DataBag
>>>> (remove a field from each tuple), but I want to do it in place and do not
>>>> want to create another DataBag with a new set of tuples.
>>>>
>>>> Say I have the following input tuple for a UDF:
>>>>
>>>> {(111,222,3,121), (112,223,2,131), (113,224,4,141)}
>>>>
>>>> I want to iterate through this bag and generate an output bag with the
>>>> 3rd field of each tuple removed, to get the following output:
>>>>
>>>> {(111,222,121), (112,223,131), (113,224,141)}
>>>>
>>>> Since the number of tuples in this bag is expected to be large, I cannot
>>>> create a new set of tuples and create a bag, as this will cause an OOM
>>>> exception. Also, I do not want to flatten this bag, as it will be passed
>>>> to the DISTINCT operator for computing the distinct elements in the bag.
>>>> As far as I can see from the javadocs for DataBag, there is no way to
>>>> convert a bag on the fly. I wonder if there is any other way to solve this?
>>>>
>>>> Thanks
>>>> -- Prasanth
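The proxy approach described at the top of this thread can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code that was actually written: it uses plain java.util types as stand-ins (an Iterable of List<Object> in place of a Pig DataBag of Tuples), and the names TransformingBagView, DropFieldDemo and dropField are made up for the example. The point it shows is that the wrapper holds a reference to the original bag and builds each transformed tuple only when the iterator reaches it, so no second full copy of the bag is ever materialized.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

/** Read-only view over a bag that transforms each tuple as it is iterated. */
final class TransformingBagView<T, R> implements Iterable<R> {
    private final Iterable<T> source;        // the original bag; never copied
    private final Function<T, R> transform;  // applied per tuple, on demand

    TransformingBagView(Iterable<T> source, Function<T, R> transform) {
        this.source = source;
        this.transform = transform;
    }

    @Override
    public Iterator<R> iterator() {
        final Iterator<T> it = source.iterator();
        return new Iterator<R>() {
            @Override public boolean hasNext() { return it.hasNext(); }
            // Only one transformed tuple exists at a time.
            @Override public R next() { return transform.apply(it.next()); }
        };
    }
}

final class DropFieldDemo {
    /** Returns a copy of one tuple without the field at the given position. */
    static List<Object> dropField(List<Object> tuple, int index) {
        List<Object> out = new ArrayList<>(tuple);
        out.remove(index);  // int argument: removes by position, not by value
        return out;
    }

    public static void main(String[] args) {
        // Stand-in for the input bag from the original question.
        List<List<Object>> bag = Arrays.asList(
            Arrays.<Object>asList(111, 222, 3, 121),
            Arrays.<Object>asList(112, 223, 2, 131),
            Arrays.<Object>asList(113, 224, 4, 141));

        Iterable<List<Object>> view =
            new TransformingBagView<List<Object>, List<Object>>(bag, t -> dropField(t, 2));

        for (List<Object> tuple : view) {
            System.out.println(tuple);
        }
    }
}
```

A real implementation would additionally have to forward the remaining DataBag methods (size, spill-related calls, and so on) to the wrapped bag, which this stand-in leaves out.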
[jira] [Created] (PIG-2912) Pig should clone JobConf while creating JobContextImpl and TaskAttemptContextImpl in Hadoop23
Rohini Palaniswamy created PIG-2912:
---

Summary: Pig should clone JobConf while creating JobContextImpl and TaskAttemptContextImpl in Hadoop23
Key: PIG-2912
URL: https://issues.apache.org/jira/browse/PIG-2912
Project: Pig
Issue Type: Bug
Affects Versions: 0.9.3, 0.10.1
Reporter: Rohini Palaniswamy
Assignee: Rohini Palaniswamy
Fix For: 0.9.3, 0.11, 0.10.1

There is a change in the semantics of JobContext::JobContext(Configuration, JobID). In .20 the Config was cloned; in .23 the Config is adopted (if it's a JobConf). As a result, the same Configuration instance is written to for different tables in the same job. This affects multi-store commands in Pig on Hadoop 23/2.0. The cloning in HadoopShims was part of PIG-2578 but was reverted due to other issues.
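The difference between adopting and cloning a configuration can be illustrated with a small stand-in. This is a sketch under loud assumptions: a HashMap plays the role of the Hadoop Configuration, Context plays the role of a job or task context, and the key "output.table" is invented for the example. With Hadoop itself the copy would presumably be made through the existing copy constructors (for instance new JobConf(conf)) before the context is created, but the actual patch is not shown in this message.

```java
import java.util.HashMap;
import java.util.Map;

final class ConfCloneDemo {
    /** Stand-in for a job context that keeps whatever config instance it is given. */
    static final class Context {
        final Map<String, String> conf;
        Context(Map<String, String> conf) { this.conf = conf; }
    }

    /** Adopts the caller's instance (the .23 behaviour described above). */
    static Context adopt(Map<String, String> conf) {
        return new Context(conf);
    }

    /** Copies first, so every context owns private settings (the .20 behaviour). */
    static Context cloneThenCreate(Map<String, String> conf) {
        return new Context(new HashMap<>(conf));
    }

    public static void main(String[] args) {
        Map<String, String> jobConf = new HashMap<>();

        // Two stores in one job, both adopting the same instance:
        Context a = adopt(jobConf);
        Context b = adopt(jobConf);
        a.conf.put("output.table", "A");  // hypothetical key
        b.conf.put("output.table", "B");  // overwrites the setting store A relies on
        System.out.println("adopted: A sees " + a.conf.get("output.table"));

        // Same scenario with a private copy per context:
        Context c = cloneThenCreate(jobConf);
        Context d = cloneThenCreate(jobConf);
        c.conf.put("output.table", "C");
        d.conf.put("output.table", "D");
        System.out.println("cloned: C sees " + c.conf.get("output.table"));
    }
}
```

In the adopted case the second store silently changes what the first store reads back, which is the multi-store symptom the issue describes.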
[jira] [Resolved] (PIG-1891) Enable StoreFunc to make intelligent decision based on job success or failure
[ https://issues.apache.org/jira/browse/PIG-1891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates resolved PIG-1891.
-
Resolution: Fixed
Fix Version/s: 0.11
Release Note: This adds a new method, cleanupOnSuccess, to the StoreFunc interface, and thus will cause backward compatibility issues for users who directly implement this interface. Most store functions implement StoreFuncImpl, which will shield them from this issue as it implements the new method.

Patch checked in. Thanks Eli.

> Enable StoreFunc to make intelligent decision based on job success or failure
> -
>
> Key: PIG-1891
> URL: https://issues.apache.org/jira/browse/PIG-1891
> Project: Pig
> Issue Type: New Feature
> Affects Versions: 0.10.0
> Reporter: Alex Rovner
> Assignee: Eli Reisman
> Priority: Minor
> Labels: patch
> Fix For: 0.11
>
> Attachments: PIG-1891-1.patch, PIG-1891-2.patch, PIG-1891-3.patch
>
>
> We are in the process of using Pig for various data processing and component integration. Here is where we feel Pig storage funcs are lacking:
> They are not aware of whether the overall job has succeeded. This creates a problem for storage funcs that need to "upload" results into another system: DB, FTP, another file system etc.
> I looked at the DBStorage in the piggybank (http://svn.apache.org/viewvc/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/DBStorage.java?view=markup) and what I see is essentially a mechanism which for each task does the following:
> 1. Creates a recordwriter (in this case, opens a connection to the DB)
> 2. Opens a transaction.
> 3. Writes records into a batch
> 4. Executes commit or rollback depending on whether the task was successful.
> While this approach works great on a task level, it does not work at all on a job level.
> If certain tasks succeed but the overall job fails, partial records are going to get uploaded into the DB.
> Any ideas on the workaround?
> Our current workaround is fairly ugly: we created a Java wrapper that launches Pig jobs and then uploads to the DBs once Pig's job is successful. While the approach works, it's not really integrated into Pig.
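A job-level hook makes a stage-then-publish pattern possible, which addresses the partial-upload problem described in the issue. The sketch below only illustrates that pattern and is not Pig's API: the interface JobOutcomeHooks, the class StagedUploadStore and the simplified one-argument method signatures are assumptions made for the example, and in-memory lists stand in for the staging area and the external system.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified stand-in for job-level callbacks on a store function. */
interface JobOutcomeHooks {
    void cleanupOnSuccess(String location);
    void cleanupOnFailure(String location);
}

/** Buffers records in a staging area and publishes them only if the whole job succeeded. */
final class StagedUploadStore implements JobOutcomeHooks {
    private final List<String> staged = new ArrayList<>();     // e.g. a staging table
    private final List<String> published = new ArrayList<>();  // e.g. the target DB

    /** Called per record by tasks; nothing becomes externally visible yet. */
    void putNext(String record) {
        staged.add(record);
    }

    @Override
    public void cleanupOnSuccess(String location) {
        published.addAll(staged);  // the "upload" happens once, after job success
        staged.clear();
    }

    @Override
    public void cleanupOnFailure(String location) {
        staged.clear();  // discard partial output; the target never sees it
    }

    List<String> published() {
        return published;
    }

    public static void main(String[] args) {
        StagedUploadStore failing = new StagedUploadStore();
        failing.putNext("row-1");
        failing.cleanupOnFailure("db://example");  // hypothetical location string
        System.out.println("after failure: " + failing.published());

        StagedUploadStore ok = new StagedUploadStore();
        ok.putNext("row-1");
        ok.putNext("row-2");
        ok.cleanupOnSuccess("db://example");
        System.out.println("after success: " + ok.published());
    }
}
```

The design choice is that task-level commits write only to staging, so a job that fails after some tasks succeeded leaves the target system untouched.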
[jira] [Assigned] (PIG-1891) Enable StoreFunc to make intelligent decision based on job success or failure
[ https://issues.apache.org/jira/browse/PIG-1891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Gates reassigned PIG-1891:
---
Assignee: Eli Reisman

> Enable StoreFunc to make intelligent decision based on job success or failure
> -
>
> Key: PIG-1891
> URL: https://issues.apache.org/jira/browse/PIG-1891
> Project: Pig
> Issue Type: New Feature
> Affects Versions: 0.10.0
> Reporter: Alex Rovner
> Assignee: Eli Reisman
> Priority: Minor
> Labels: patch
> Attachments: PIG-1891-1.patch, PIG-1891-2.patch, PIG-1891-3.patch
>
>
> We are in the process of using Pig for various data processing and component integration. Here is where we feel Pig storage funcs are lacking:
> They are not aware of whether the overall job has succeeded. This creates a problem for storage funcs that need to "upload" results into another system: DB, FTP, another file system etc.
> I looked at the DBStorage in the piggybank (http://svn.apache.org/viewvc/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/DBStorage.java?view=markup) and what I see is essentially a mechanism which for each task does the following:
> 1. Creates a recordwriter (in this case, opens a connection to the DB)
> 2. Opens a transaction.
> 3. Writes records into a batch
> 4. Executes commit or rollback depending on whether the task was successful.
> While this approach works great on a task level, it does not work at all on a job level.
> If certain tasks succeed but the overall job fails, partial records are going to get uploaded into the DB.
> Any ideas on the workaround?
> Our current workaround is fairly ugly: we created a Java wrapper that launches Pig jobs and then uploads to the DBs once Pig's job is successful. While the approach works, it's not really integrated into Pig.
[jira] [Updated] (PIG-2911) GenMRSkewJoinProcessor uses File.Separator instead of Path.Separator
[ https://issues.apache.org/jira/browse/PIG-2911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-2911:
---
Resolution: Invalid
Status: Resolved (was: Patch Available)

Sorry, created the bug on wrong product! :)

> GenMRSkewJoinProcessor uses File.Separator instead of Path.Separator
>
>
> Key: PIG-2911
> URL: https://issues.apache.org/jira/browse/PIG-2911
> Project: Pig
> Issue Type: Bug
> Reporter: Thejas M Nair
> Assignee: Thejas M Nair
> Attachments: PIG-2911.1.patch
>
>
> This causes testcase skewjoin.q to fail on windows.
[jira] [Updated] (PIG-2911) GenMRSkewJoinProcessor uses File.Separator instead of Path.Separator
[ https://issues.apache.org/jira/browse/PIG-2911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-2911:
---
Status: Patch Available (was: Open)

> GenMRSkewJoinProcessor uses File.Separator instead of Path.Separator
>
>
> Key: PIG-2911
> URL: https://issues.apache.org/jira/browse/PIG-2911
> Project: Pig
> Issue Type: Bug
> Reporter: Thejas M Nair
> Assignee: Thejas M Nair
> Attachments: PIG-2911.1.patch
>
>
> This causes testcase skewjoin.q to fail on windows.
[jira] [Updated] (PIG-2911) GenMRSkewJoinProcessor uses File.Separator instead of Path.Separator
[ https://issues.apache.org/jira/browse/PIG-2911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Thejas M Nair updated PIG-2911:
---
Attachment: PIG-2911.1.patch

> GenMRSkewJoinProcessor uses File.Separator instead of Path.Separator
>
>
> Key: PIG-2911
> URL: https://issues.apache.org/jira/browse/PIG-2911
> Project: Pig
> Issue Type: Bug
> Reporter: Thejas M Nair
> Assignee: Thejas M Nair
> Attachments: PIG-2911.1.patch
>
>
> This causes testcase skewjoin.q to fail on windows.
[jira] [Created] (PIG-2911) GenMRSkewJoinProcessor uses File.Separator instead of Path.Separator
Thejas M Nair created PIG-2911:
--

Summary: GenMRSkewJoinProcessor uses File.Separator instead of Path.Separator
Key: PIG-2911
URL: https://issues.apache.org/jira/browse/PIG-2911
Project: Pig
Issue Type: Bug
Reporter: Thejas M Nair
Assignee: Thejas M Nair
Attachments: PIG-2911.1.patch

This causes testcase skewjoin.q to fail on windows.
[jira] [Commented] (PIG-2898) Multithreaded execution of e2e tests
[ https://issues.apache.org/jira/browse/PIG-2898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13450854#comment-13450854 ]

Rohini Palaniswamy commented on PIG-2898:
-

Ivan,
All the changes to the e2e framework that got it working with H23, as well as the benchmark caching, are removed by this patch. I think this is because you started before https://issues.apache.org/jira/browse/PIG-2484 went into the 0.9 branch. You will have to update the patch so that those changes are included.

> Multithreaded execution of e2e tests
>
>
> Key: PIG-2898
> URL: https://issues.apache.org/jira/browse/PIG-2898
> Project: Pig
> Issue Type: Improvement
> Components: e2e harness
> Reporter: Andrey Klochkov
> Assignee: Andrey Klochkov
> Attachments: pig-2898-for-svn-branch-0.9.patch
>
>
> Today it takes ~19 hours to run the full set of e2e tests in mapred mode. The bottleneck here is the client side, and per our observations it can help a lot if the e2e harness is able to run tests in parallel threads.
> We prototyped changes in the e2e harness that allow running tests in a configurable number of threads. Preliminary results show more than a 6x reduction in execution time when using a small 3-node M/R cluster with a modest configuration. Going to share a patch shortly.
[jira] [Updated] (PIG-2898) Multithreaded execution of e2e tests
[ https://issues.apache.org/jira/browse/PIG-2898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ivan A. Veselovsky updated PIG-2898:

Attachment: pig-2898-for-svn-branch-0.9.patch

The patch pig-2898-for-svn-branch-0.9.patch is attached.

> Multithreaded execution of e2e tests
>
>
> Key: PIG-2898
> URL: https://issues.apache.org/jira/browse/PIG-2898
> Project: Pig
> Issue Type: Improvement
> Components: e2e harness
> Reporter: Andrey Klochkov
> Assignee: Andrey Klochkov
> Attachments: pig-2898-for-svn-branch-0.9.patch
>
>
> Today it takes ~19 hours to run the full set of e2e tests in mapred mode. The bottleneck here is the client side, and per our observations it can help a lot if the e2e harness would be able to run tests in parallel threads.
> We prototyped changes in e2e harness allowing to run tests in a configurable number of threads. Preliminary results show more than 6x reduction in execution time when using a small 3-nodes M/R cluster with modest configuration. Going to share a patch shortly.
[jira] [Updated] (PIG-2898) Multithreaded execution of e2e tests
[ https://issues.apache.org/jira/browse/PIG-2898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ivan A. Veselovsky updated PIG-2898:

Patch Info: Patch Available

We provided a parallelized mode of e2e test execution using Parallel::ForkManager.
Two parameters affect the behavior:
1) file.fork.factor -- max number of subprocesses when running test configuration files (.conf);
2) fork.factor -- max number of subprocesses when running tests within one .conf file.
The total max number of subprocesses cannot exceed the product of the two values. A value <= 1 means no parallelizing.
Example:
ant -Dfork.factor=3 -Dfile.fork.factor=3 ... test-e2e

The attached patch is to be applied to the http://svn.apache.org/repos/asf/pig/branches/branch-0.9/ branch.

The patch testing procedure gives the following results for the patch:
[exec] -1 overall.
[exec]
[exec] +1 @author. The patch does not contain any @author tags.
[exec]
[exec] +1 tests included. The patch appears to include 24 new or modified tests.
[exec]
[exec] -1 javadoc. The javadoc tool appears to have generated 1 warning messages.
[exec]
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec]
[exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
[exec]
[exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.
[exec]

> Multithreaded execution of e2e tests
>
>
> Key: PIG-2898
> URL: https://issues.apache.org/jira/browse/PIG-2898
> Project: Pig
> Issue Type: Improvement
> Components: e2e harness
> Reporter: Andrey Klochkov
> Assignee: Andrey Klochkov
>
> Today it takes ~19 hours to run the full set of e2e tests in mapred mode. The bottleneck here is the client side, and per our observations it can help a lot if the e2e harness is able to run tests in parallel threads.
> We prototyped changes in the e2e harness that allow running tests in a configurable number of threads. Preliminary results show more than a 6x reduction in execution time when using a small 3-node M/R cluster with a modest configuration. Going to share a patch shortly.
POCollectedGroup and LoadFunc indicator interface
Hi,

I am new to the list. I've been working on the Pig code base, adding my own blocking map-side POs (e.g., map-side join, map-side grouping) for cases where assertions can be made about the fragmentation of input relations. This is partly inspired by the new block placement policy possibilities in hadoop-2.

Anyway, my question to the list is the following. Whilst looking at the code for POCollectedGroup, I noticed that this PO expects split content to be sorted. On the other hand, the Collectable loader interface only seems to indicate that keys are unique across splits. Why the discrepancy? Is there a good reason not to have an indicator interface that captures all input requirements, e.g., something like OrderedCollectableLoadFunc?

regards,
Vasco
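The kind of indicator interface being asked about could look roughly like the sketch below. Everything here is a stand-in written for illustration: CollectableLoader loosely mirrors the idea of Pig's collectable loader contract, OrderedCollectableLoader is the hypothetical combined marker suggested in the question (it does not exist in Pig as far as this message shows), and the planner-side check is a guess at how such a marker might be consulted.

```java
/** Stand-in for the existing contract: all instances of a key land in one split. */
interface CollectableLoader {
    void ensureAllKeyInstancesInSameSplit();
}

/**
 * Hypothetical marker: in addition to the collectable guarantee, the loader
 * promises that records inside each split are sorted on the grouping key.
 */
interface OrderedCollectableLoader extends CollectableLoader {
}

final class CollectedGroupCheck {
    /** A map-side collected group needs both guarantees, so test for the marker. */
    static boolean canUseCollectedGroup(Object loader) {
        return loader instanceof OrderedCollectableLoader;
    }

    public static void main(String[] args) {
        CollectableLoader collectableOnly = () -> { };  // keys co-located, order unknown
        CollectableLoader sortedToo = new OrderedCollectableLoader() {
            @Override public void ensureAllKeyInstancesInSameSplit() { }
        };

        System.out.println(canUseCollectedGroup(collectableOnly));
        System.out.println(canUseCollectedGroup(sortedToo));
    }
}
```

With a single marker type, an operator that depends on both properties can state its full input requirement in one instanceof check instead of relying on an undocumented sortedness assumption.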
[jira] [Commented] (PIG-2904) Scripting UDFs should allow DEFINE statements to pass parameters to the UDF's constructor
[ https://issues.apache.org/jira/browse/PIG-2904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13450454#comment-13450454 ]

Cheolsoo Park commented on PIG-2904:

Hello Julien,

I have a question about this jira. I've been reading the Pig UDF code to understand why I need different DEFINEs in order to pass different constructor parameters to the AvroStorage constructor. What I found is that aliases are mapped to FuncSpec instances that store not only function names but also the args specified in DEFINE statements. Later, when aliases are expanded by LogicalPlanBuilder, function objects are instantiated from those FuncSpec instances, with the result that the args specified in DEFINE statements are used to instantiate function objects instead of the ones specified in LOAD statements.

My question is whether this jira is meant to solve the same problem or not. I am a bit confused because the title says "scripting UDFs", but I thought that scripting UDFs are EvalFuncs, and EvalFuncs take no parameters in their constructors.

Please forgive me if I am misunderstanding something here. I am still learning Pig internals.

Thanks!

> Scripting UDFs should allow DEFINE statements to pass parameters to the UDF's constructor
> -
>
> Key: PIG-2904
> URL: https://issues.apache.org/jira/browse/PIG-2904
> Project: Pig
> Issue Type: New Feature
> Reporter: Julien Le Dem
>