[jira] [Commented] (DATAFU-95) Improve wrong JDK error message
[ https://issues.apache.org/jira/browse/DATAFU-95?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15097858#comment-15097858 ]

Eyal Allweil commented on DATAFU-95:

As an immediate, easy-to-do improvement, writing what Java version is required in the main README on GitHub would be great.

> Improve wrong JDK error message
> ---
>
> Key: DATAFU-95
> URL: https://issues.apache.org/jira/browse/DATAFU-95
> Project: DataFu
> Issue Type: Improvement
> Reporter: Jakob Homan
> Priority: Minor
>
> Right now if one tries to build against JDK1.7, the resulting failure is a bit unclear:
> {noformat}Download
> https://repo1.maven.org/maven2/org/eclipse/equinox/app/1.3.200-v20130910-1609/app-1.3.200-v20130910-1609.jar
> /Users/jahoman/repos/datafu/build-plugin/src/main/java/org/adrianwalker/multilinestring/MultilineProcessor.java:18:
> error: cannot find symbol
> @SupportedSourceVersion(SourceVersion.RELEASE_8)
> ^
> symbol: variable RELEASE_8
> location: class SourceVersion
> 1 error
> :build-plugin:compileJava FAILED
> FAILURE: Build failed with an exception.
> {noformat}
> It may be better to use something like [The Sweeney|https://github.com/boxheed/gradle-sweeney-plugin] to enforce this and provide a better, faster message on failure.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (DATAFU-114) Make FirstTupleFromBag implement Accumulator
Eyal Allweil created DATAFU-114: --- Summary: Make FirstTupleFromBag implement Accumulator Key: DATAFU-114 URL: https://issues.apache.org/jira/browse/DATAFU-114 Project: DataFu Issue Type: Improvement Affects Versions: 1.3.0 Environment: All Reporter: Eyal Allweil Priority: Minor FirstTupleFromBag only needs the first tuple from the bag, but because it doesn't implement Accumulator the entire bag needs to be passed to it in-memory. The fix is very minor and will make the UDF support large bags. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
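The idea in this issue can be illustrated with a minimal sketch. It is not the attached FirstTupleFromBag.java; `BatchAccumulator` is a hypothetical, simplified stand-in for Pig's Accumulator contract (accumulate / getValue / cleanup) so that the example compiles without Pig on the classpath, and plain Java lists stand in for the batches of the bag.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for Pig's Accumulator contract, simplified for illustration.
interface BatchAccumulator<I, O> {
    void accumulate(List<I> batch); // called once per batch of the bag
    O getValue();                   // called after the last batch
    void cleanup();                 // called before the next key's bag arrives
}

// Remembers only the first element ever seen; every later batch is ignored,
// so memory use does not grow with the size of the bag.
class FirstElementSketch<T> implements BatchAccumulator<T, T> {
    private T first;       // stays null until a non-empty batch arrives
    private boolean found; // true once the first element has been captured

    @Override
    public void accumulate(List<T> batch) {
        if (!found && !batch.isEmpty()) {
            first = batch.get(0);
            found = true;
        }
    }

    @Override
    public T getValue() {
        return first;
    }

    @Override
    public void cleanup() {
        first = null;
        found = false;
    }

    public static void main(String[] args) {
        FirstElementSketch<String> acc = new FirstElementSketch<>();
        acc.accumulate(Arrays.asList("a", "b"));
        acc.accumulate(Arrays.asList("c"));
        System.out.println(acc.getValue());
    }
}
```

The point of the design is that the state carried between batches is a single element, regardless of how many batches follow.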
[jira] [Updated] (DATAFU-114) Make FirstTupleFromBag implement Accumulator
[ https://issues.apache.org/jira/browse/DATAFU-114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eyal Allweil updated DATAFU-114:
Attachment: FirstTupleFromBag.java

I wasn't able to test this patch because I can't get the build working on my system (Ubuntu LTS); I'm getting the error described [here|https://issues.apache.org/jira/browse/DATAFU-95]. I can't seem to make Gradle use a different Java to get it to compile. However, since the implementation of Accumulator is relatively straightforward, I hopefully haven't made any mistakes, and I would appreciate it if someone whose build is working tried it out and pulled the patch.

> Make FirstTupleFromBag implement Accumulator
>
> Key: DATAFU-114
> URL: https://issues.apache.org/jira/browse/DATAFU-114
> Project: DataFu
> Issue Type: Improvement
> Affects Versions: 1.3.0
> Environment: All
> Reporter: Eyal Allweil
> Priority: Minor
> Labels: easyfix, newbie, performance
> Attachments: FirstTupleFromBag.java
>
> FirstTupleFromBag only needs the first tuple from the bag, but because it doesn't implement Accumulator the entire bag needs to be passed to it in-memory. The fix is very minor and will make the UDF support large bags.
[jira] [Commented] (DATAFU-114) Make FirstTupleFromBag implement Accumulator
[ https://issues.apache.org/jira/browse/DATAFU-114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15114990#comment-15114990 ]

Eyal Allweil commented on DATAFU-114:

Any comments? Can this patch be pulled?
[jira] [Commented] (DATAFU-114) Make FirstTupleFromBag implement Accumulator
[ https://issues.apache.org/jira/browse/DATAFU-114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15131991#comment-15131991 ]

Eyal Allweil commented on DATAFU-114:

Anyone?
[jira] [Commented] (DATAFU-114) Make FirstTupleFromBag implement Accumulator
[ https://issues.apache.org/jira/browse/DATAFU-114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15135730#comment-15135730 ] Eyal Allweil commented on DATAFU-114: - The test looks fine, and so does your patch for DATAFU-95 - I was able to build and test (after adding the test to BagTests.java). What I still can't do is get an Eclipse project working - the gradlew completes, but the project which results doesn't have source folders or dependencies. In the past I had trouble generating patches from Git which RB accepted, but maybe that's been taken care of. Thanks! > Make FirstTupleFromBag implement Accumulator > > > Key: DATAFU-114 > URL: https://issues.apache.org/jira/browse/DATAFU-114 > Project: DataFu > Issue Type: Improvement >Affects Versions: 1.3.0 > Environment: All >Reporter: Eyal Allweil >Assignee: Eyal Allweil >Priority: Minor > Labels: easyfix, newbie, performance > Fix For: 1.3.1 > > Attachments: FirstTupleFromBag.java > > > FirstTupleFromBag only needs the first tuple from the bag, but because it > doesn't implement Accumulator the entire bag needs to be passed to it > in-memory. The fix is very minor and will make the UDF support large bags. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DATAFU-114) Make FirstTupleFromBag implement Accumulator
[ https://issues.apache.org/jira/browse/DATAFU-114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15150312#comment-15150312 ]

Eyal Allweil commented on DATAFU-114:

Thanks! After I imported the projects individually, as you suggested, it works fine in Eclipse. I suggest adding a sentence about it to the base README file to help out future contributors.
[jira] [Created] (DATAFU-115) Make TupleFromBag implement Accumulator
Eyal Allweil created DATAFU-115: --- Summary: Make TupleFromBag implement Accumulator Key: DATAFU-115 URL: https://issues.apache.org/jira/browse/DATAFU-115 Project: DataFu Issue Type: Improvement Affects Versions: 1.3.0 Reporter: Eyal Allweil Priority: Minor Fix For: 1.3.1 Similar to [DATAFU-114|https://issues.apache.org/jira/browse/DATAFU-114]. TupleFromBag doesn't need to hold the bag in memory, and can iterate through it until it reaches the desired tuple. By implementing Accumulator, larger bags can be used and with a smaller memory footprint. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
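A rough sketch of the approach described here, under stated assumptions: it is not the DATAFU-115 patch. The class name, the `Iterable` batches, and the `defaultValue` parameter are hypothetical choices for illustration; the three methods only mirror the shape of Pig's Accumulator contract.

```java
import java.util.Arrays;

// Illustrative only: walks batches of a bag until the element at targetIndex
// is reached, carrying just a counter and one element between batches.
class ElementAtIndexSketch<T> {
    private final int targetIndex;
    private int seen;      // number of elements consumed so far, across all batches
    private T result;
    private boolean found;

    ElementAtIndexSketch(int targetIndex) {
        if (targetIndex < 0) {
            throw new IllegalArgumentException("targetIndex must be non-negative");
        }
        this.targetIndex = targetIndex;
    }

    // Called once per batch; stops iterating as soon as the target is found.
    void accumulate(Iterable<T> batch) {
        if (found) {
            return;
        }
        for (T item : batch) {
            if (seen == targetIndex) {
                result = item;
                found = true;
                return;
            }
            seen++;
        }
    }

    // Returns the element at targetIndex, or defaultValue if the bag was too short.
    T getValue(T defaultValue) {
        return found ? result : defaultValue;
    }

    // Resets the state so the instance can be reused for the next bag.
    void cleanup() {
        seen = 0;
        result = null;
        found = false;
    }

    public static void main(String[] args) {
        ElementAtIndexSketch<String> acc = new ElementAtIndexSketch<>(3);
        acc.accumulate(Arrays.asList("a", "b"));
        acc.accumulate(Arrays.asList("c", "d", "e"));
        System.out.println(acc.getValue("none"));
    }
}
```

Because only `seen` and `result` survive between calls, the memory footprint is independent of the bag size, which is the benefit the issue describes.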
[jira] [Updated] (DATAFU-115) Make TupleFromBag implement Accumulator
[ https://issues.apache.org/jira/browse/DATAFU-115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eyal Allweil updated DATAFU-115: Attachment: DATAFU-115.patch Relatively straightforward patch ... there's one difference from the previous behavior, that if an exception is thrown, I ignore it and try to continue iterating to the desired index. I tried uploading it to the review board, see if [this link|https://reviews.apache.org/r/44351/] works. > Make TupleFromBag implement Accumulator > --- > > Key: DATAFU-115 > URL: https://issues.apache.org/jira/browse/DATAFU-115 > Project: DataFu > Issue Type: Improvement >Affects Versions: 1.3.0 >Reporter: Eyal Allweil >Priority: Minor > Labels: performance > Fix For: 1.3.1 > > Attachments: DATAFU-115.patch > > > Similar to [DATAFU-114|https://issues.apache.org/jira/browse/DATAFU-114]. > TupleFromBag doesn't need to hold the bag in memory, and can iterate through > it until it reaches the desired tuple. By implementing Accumulator, larger > bags can be used and with a smaller memory footprint. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (DATAFU-115) Make TupleFromBag implement Accumulator
[ https://issues.apache.org/jira/browse/DATAFU-115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eyal Allweil updated DATAFU-115:
Flags: Patch
[jira] [Created] (DATAFU-116) Make SetIntersect and SetDifference implement Accumulator
Eyal Allweil created DATAFU-116: --- Summary: Make SetIntersect and SetDifference implement Accumulator Key: DATAFU-116 URL: https://issues.apache.org/jira/browse/DATAFU-116 Project: DataFu Issue Type: Improvement Affects Versions: 1.3.0 Reporter: Eyal Allweil SetIntersect and SetDifference accept only sorted bags, and the output is always smaller than the inputs. Therefore an accumulator implementation should be possible and it will improve memory usage (somewhat) and allow Pig to optimize loops with these operations better. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DATAFU-116) Make SetIntersect and SetDifference implement Accumulator
[ https://issues.apache.org/jira/browse/DATAFU-116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15185409#comment-15185409 ] Eyal Allweil commented on DATAFU-116: - As far as I can tell, when the accumulator is used, Pig passes _pig.accumulative.batchsize_ tuples from each bag until all the tuples are exhausted. I think an implementation that iterates over the bags and only keeps some of the tuples in between batches is possible - hopefully very few, but the worst case is all of them, which is no worse than the current implementation. I'm assuming Pig passes batches in this way based on the code in [POPackage|https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPackage.java] and from looking through all the documentation I could find on accumulators. If I'm wrong it does mean that an accumulator implementation isn't worthwhile. > Make SetIntersect and SetDifference implement Accumulator > - > > Key: DATAFU-116 > URL: https://issues.apache.org/jira/browse/DATAFU-116 > Project: DataFu > Issue Type: Improvement >Affects Versions: 1.3.0 >Reporter: Eyal Allweil > > SetIntersect and SetDifference accept only sorted bags, and the output is > always smaller than the inputs. Therefore an accumulator implementation > should be possible and it will improve memory usage (somewhat) and allow Pig > to optimize loops with these operations better. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
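The batch-wise idea in this comment could look roughly like the following sketch. It assumes, as the comment itself does, that both sorted bags arrive in batches; it also assumes two inputs without duplicates and uses plain Java lists. The class and method names are hypothetical, and this is not DataFu's SetIntersect.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

// Illustrative only: intersects two sorted inputs that arrive in batches,
// keeping between batches only the elements that cannot yet be decided.
class SortedIntersectSketch<T extends Comparable<T>> {
    private final Deque<T> pendingLeft = new ArrayDeque<>();
    private final Deque<T> pendingRight = new ArrayDeque<>();
    private final List<T> output = new ArrayList<>();

    void accumulate(List<T> leftBatch, List<T> rightBatch) {
        pendingLeft.addAll(leftBatch);
        pendingRight.addAll(rightBatch);
        // Standard merge step: advance whichever side has the smaller head.
        while (!pendingLeft.isEmpty() && !pendingRight.isEmpty()) {
            int cmp = pendingLeft.peekFirst().compareTo(pendingRight.peekFirst());
            if (cmp == 0) {
                output.add(pendingLeft.pollFirst());
                pendingRight.pollFirst();
            } else if (cmp < 0) {
                pendingLeft.pollFirst();
            } else {
                pendingRight.pollFirst();
            }
        }
    }

    // Elements held over until the next batch (the worst case is a whole input).
    int pendingCount() {
        return pendingLeft.size() + pendingRight.size();
    }

    List<T> getValue() {
        return new ArrayList<>(output);
    }

    void cleanup() {
        pendingLeft.clear();
        pendingRight.clear();
        output.clear();
    }

    public static void main(String[] args) {
        SortedIntersectSketch<Integer> s = new SortedIntersectSketch<>();
        s.accumulate(Arrays.asList(1, 3, 5), Arrays.asList(3, 4));
        s.accumulate(Arrays.asList(7, 9), Arrays.asList(5, 9));
        System.out.println(s.getValue());
    }
}
```

After each call at most one side has leftovers, which matches the comment's expectation that usually few tuples are retained while the worst case is no worse than holding an entire bag.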
[jira] [Commented] (DATAFU-116) Make SetIntersect and SetDifference implement Accumulator
[ https://issues.apache.org/jira/browse/DATAFU-116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189158#comment-15189158 ]

Eyal Allweil commented on DATAFU-116:

As far as I know, the behavior you're describing is how Pig deals with UDFs that implement the Accumulator interface. If the UDF doesn't implement it (if it only extends EvalFunc), the parameters (including bags) are passed in memory in their entirety. I'm basing this on [this quote from Programming Pig|http://stackoverflow.com/a/15813789/150992]. That's why I'm suggesting this change.
[jira] [Created] (DATAFU-117) New UDF - CountDistinctUpTo
Eyal Allweil created DATAFU-117: --- Summary: New UDF - CountDistinctUpTo Key: DATAFU-117 URL: https://issues.apache.org/jira/browse/DATAFU-117 Project: DataFu Issue Type: New Feature Reporter: Eyal Allweil A UDF that counts distinct tuples within a bag, but only up to a preset limit. If the bag contains more distinct tuples than the limit, the UDF returns the limit. This UDF can run reasonably well even on large bags if the limit chosen is small enough though the count is done in memory. We use this UDF in PayPal for filtering, when we don't need to use the actual tuples afterward. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
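The counting rule described here can be sketched as follows. This is a minimal illustration, not the DATAFU-117 patch: the class name, the `add` method returning a "limit reached" flag, and the use of a `HashSet` are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: counts distinct items but never stores more than `limit` of them.
class CountDistinctUpToSketch<T> {
    private final int limit;
    private final Set<T> seen = new HashSet<>();

    CountDistinctUpToSketch(int limit) {
        if (limit < 0) {
            throw new IllegalArgumentException("limit must be non-negative");
        }
        this.limit = limit;
    }

    // Returns true once the limit has been reached, so a caller can stop feeding input.
    boolean add(T item) {
        if (seen.size() < limit) {
            seen.add(item);
        }
        return seen.size() >= limit;
    }

    // Number of distinct items seen, capped at the limit.
    int getValue() {
        return seen.size();
    }

    // Clears the set in place so the instance can be reused for the next bag.
    void cleanup() {
        seen.clear();
    }

    public static void main(String[] args) {
        CountDistinctUpToSketch<String> counter = new CountDistinctUpToSketch<>(3);
        List<String> bag = Arrays.asList("a", "b", "a", "c", "d", "e");
        for (String item : bag) {
            if (counter.add(item)) {
                break; // limit reached, the remaining items do not matter
            }
        }
        System.out.println(counter.getValue());
    }
}
```

Memory use is bounded by the limit rather than by the number of distinct tuples in the bag, which is why a small limit keeps the in-memory count cheap even on large bags.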
[jira] [Updated] (DATAFU-117) New UDF - CountDistinctUpTo
[ https://issues.apache.org/jira/browse/DATAFU-117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eyal Allweil updated DATAFU-117: Attachment: DATAFU-117.patch Patch including new UDF and test (in BagTests) > New UDF - CountDistinctUpTo > --- > > Key: DATAFU-117 > URL: https://issues.apache.org/jira/browse/DATAFU-117 > Project: DataFu > Issue Type: New Feature >Reporter: Eyal Allweil > Attachments: DATAFU-117.patch > > > A UDF that counts distinct tuples within a bag, but only up to a preset > limit. If the bag contains more distinct tuples than the limit, the UDF > returns the limit. > This UDF can run reasonably well even on large bags if the limit chosen is > small enough though the count is done in memory. > We use this UDF in PayPal for filtering, when we don't need to use the actual > tuples afterward. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DATAFU-115) Make TupleFromBag implement Accumulator
[ https://issues.apache.org/jira/browse/DATAFU-115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15213559#comment-15213559 ] Eyal Allweil commented on DATAFU-115: - I'm not sure why, but I can't see this patch in the master branch. I can see https://issues.apache.org/jira/browse/DATAFU-114 - [FirstTupleFromBag|https://github.com/apache/incubator-datafu/blob/master/datafu-pig/src/main/java/datafu/pig/bags/FirstTupleFromBag.java] appears changed - but [TupleFromBag|https://github.com/apache/incubator-datafu/blob/master/datafu-pig/src/main/java/datafu/pig/bags/TupleFromBag.java] looks like it hasn't been changed since August. Does the public GitHub represent the repository accurately? > Make TupleFromBag implement Accumulator > --- > > Key: DATAFU-115 > URL: https://issues.apache.org/jira/browse/DATAFU-115 > Project: DataFu > Issue Type: Improvement >Affects Versions: 1.3.0 >Reporter: Eyal Allweil >Assignee: Eyal Allweil >Priority: Minor > Labels: performance > Fix For: 1.3.1 > > Attachments: DATAFU-115.patch > > > Similar to [DATAFU-114|https://issues.apache.org/jira/browse/DATAFU-114]. > TupleFromBag doesn't need to hold the bag in memory, and can iterate through > it until it reaches the desired tuple. By implementing Accumulator, larger > bags can be used and with a smaller memory footprint. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DATAFU-115) Make TupleFromBag implement Accumulator
[ https://issues.apache.org/jira/browse/DATAFU-115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15215634#comment-15215634 ]

Eyal Allweil commented on DATAFU-115:

Thanks!
[jira] [Commented] (DATAFU-117) New UDF - CountDistinctUpTo
[ https://issues.apache.org/jira/browse/DATAFU-117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15233995#comment-15233995 ] Eyal Allweil commented on DATAFU-117: - Thanks for the feedback! I will incorporate it in a future patch, but I'm trying to check whether I can revise this UDF to have an Algebraic implementation (which should further improve its performance). I'll open a review board on this version as soon as I can. > New UDF - CountDistinctUpTo > --- > > Key: DATAFU-117 > URL: https://issues.apache.org/jira/browse/DATAFU-117 > Project: DataFu > Issue Type: New Feature >Reporter: Eyal Allweil > Attachments: DATAFU-117.patch > > > A UDF that counts distinct tuples within a bag, but only up to a preset > limit. If the bag contains more distinct tuples than the limit, the UDF > returns the limit. > This UDF can run reasonably well even on large bags if the limit chosen is > small enough though the count is done in memory. > We use this UDF in PayPal for filtering, when we don't need to use the actual > tuples afterward. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DATAFU-117) New UDF - CountDistinctUpTo
[ https://issues.apache.org/jira/browse/DATAFU-117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15258239#comment-15258239 ] Eyal Allweil commented on DATAFU-117: - Ok, I opened a review board for it - can you see it? It's at https://reviews.apache.org/r/46701/ I think all your previous comments are addressed there, except for the one about "this.set.add(o) && (this.set.size() == maxAmount". I don't think this can exceed the max size, because a single add operation can only increment the set's size by one, and the UDF is executed in a single thread. I ran a few tests comparing this UDF to a Pig nested foreach with DISTINCT followed by the builtin COUNT. On small inputs they perform about the same - even up to a million records - but if you have a situation with more skew (I checked 10 million records, with about 4 million distincts) then this UDF with a max value of say, 1, runs in about four minutes, and the nested foreach+distinct+count takes more than an hour - probably because it needs to keep all the distinct records in memory, rather than just reaching the desired threshold. > New UDF - CountDistinctUpTo > --- > > Key: DATAFU-117 > URL: https://issues.apache.org/jira/browse/DATAFU-117 > Project: DataFu > Issue Type: New Feature >Reporter: Eyal Allweil > Attachments: DATAFU-117.patch > > > A UDF that counts distinct tuples within a bag, but only up to a preset > limit. If the bag contains more distinct tuples than the limit, the UDF > returns the limit. > This UDF can run reasonably well even on large bags if the limit chosen is > small enough though the count is done in memory. > We use this UDF in PayPal for filtering, when we don't need to use the actual > tuples afterward. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (DATAFU-117) New UDF - CountDistinctUpTo
[ https://issues.apache.org/jira/browse/DATAFU-117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eyal Allweil updated DATAFU-117: Attachment: DATAFU-117-2.patch This replaces the previous patch file, addresses (most of) Matthew's comments, and adds an Algebraic implementation to the UDF. > New UDF - CountDistinctUpTo > --- > > Key: DATAFU-117 > URL: https://issues.apache.org/jira/browse/DATAFU-117 > Project: DataFu > Issue Type: New Feature >Reporter: Eyal Allweil > Attachments: DATAFU-117-2.patch, DATAFU-117.patch > > > A UDF that counts distinct tuples within a bag, but only up to a preset > limit. If the bag contains more distinct tuples than the limit, the UDF > returns the limit. > This UDF can run reasonably well even on large bags if the limit chosen is > small enough though the count is done in memory. > We use this UDF in PayPal for filtering, when we don't need to use the actual > tuples afterward. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (DATAFU-117) New UDF - CountDistinctUpTo
[ https://issues.apache.org/jira/browse/DATAFU-117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15258239#comment-15258239 ] Eyal Allweil edited comment on DATAFU-117 at 5/9/16 8:50 AM: - Ok, I opened a review board for it - It's at https://reviews.apache.org/r/46701/ I think all your previous comments are addressed there, except for the one about "this.set.add(o) && (this.set.size() == maxAmount". I don't think this can exceed the max size, because a single add operation can only increment the set's size by one, and the UDF is executed in a single thread. I ran a few tests comparing this UDF to a Pig nested foreach with DISTINCT followed by the builtin COUNT. On small inputs they perform about the same - even up to a million records - but if you have a situation with more skew (I checked 10 million records, with about 4 million distincts) then this UDF with a max value of say, 1,000,000, runs in a few minutes, and the nested foreach+distinct+count takes more than an hour - probably because it needs to keep all the distinct records in memory, rather than just reaching the desired threshold. was (Author: eyal): Ok, I opened a review board for it - can you see it? It's at https://reviews.apache.org/r/46701/ I think all your previous comments are addressed there, except for the one about "this.set.add(o) && (this.set.size() == maxAmount". I don't think this can exceed the max size, because a single add operation can only increment the set's size by one, and the UDF is executed in a single thread. I ran a few tests comparing this UDF to a Pig nested foreach with DISTINCT followed by the builtin COUNT. 
On small inputs they perform about the same - even up to a million records - but if you have a situation with more skew (I checked 10 million records, with about 4 million distincts) then this UDF with a max value of say, 1, runs in about four minutes, and the nested foreach+distinct+count takes more than an hour - probably because it needs to keep all the distinct records in memory, rather than just reaching the desired threshold. > New UDF - CountDistinctUpTo > --- > > Key: DATAFU-117 > URL: https://issues.apache.org/jira/browse/DATAFU-117 > Project: DataFu > Issue Type: New Feature >Reporter: Eyal Allweil > Attachments: DATAFU-117-2.patch, DATAFU-117.patch > > > A UDF that counts distinct tuples within a bag, but only up to a preset > limit. If the bag contains more distinct tuples than the limit, the UDF > returns the limit. > This UDF can run reasonably well even on large bags if the limit chosen is > small enough though the count is done in memory. > We use this UDF in PayPal for filtering, when we don't need to use the actual > tuples afterward. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DATAFU-117) New UDF - CountDistinctUpTo
[ https://issues.apache.org/jira/browse/DATAFU-117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277798#comment-15277798 ]

Eyal Allweil commented on DATAFU-117:

Is anyone available to review this?
[jira] [Updated] (DATAFU-117) New UDF - CountDistinctUpTo
[ https://issues.apache.org/jira/browse/DATAFU-117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eyal Allweil updated DATAFU-117:
Attachment: DATAFU-117-3.patch

Incorporates changes from [review|https://reviews.apache.org/r/46701/]

> New UDF - CountDistinctUpTo
> ---
>
> Key: DATAFU-117
> URL: https://issues.apache.org/jira/browse/DATAFU-117
> Project: DataFu
> Issue Type: New Feature
> Reporter: Eyal Allweil
> Attachments: DATAFU-117-2.patch, DATAFU-117-3.patch, DATAFU-117.patch
>
> A UDF that counts distinct tuples within a bag, but only up to a preset limit. If the bag contains more distinct tuples than the limit, the UDF returns the limit.
> This UDF can run reasonably well even on large bags if the limit chosen is small enough though the count is done in memory.
> We use this UDF in PayPal for filtering, when we don't need to use the actual tuples afterward.
[jira] [Updated] (DATAFU-117) New UDF - CountDistinctUpTo
[ https://issues.apache.org/jira/browse/DATAFU-117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eyal Allweil updated DATAFU-117: Attachment: DATAFU-117-4.patch This patch incorporates the last remaining comment from the review (clearing instead of reassigning the set in cleanup) > New UDF - CountDistinctUpTo > --- > > Key: DATAFU-117 > URL: https://issues.apache.org/jira/browse/DATAFU-117 > Project: DataFu > Issue Type: New Feature >Reporter: Eyal Allweil >Assignee: Eyal Allweil > Attachments: DATAFU-117-2.patch, DATAFU-117-3.patch, > DATAFU-117-4.patch, DATAFU-117.patch > > > A UDF that counts distinct tuples within a bag, but only up to a preset > limit. If the bag contains more distinct tuples than the limit, the UDF > returns the limit. > This UDF can run reasonably well even on large bags if the limit chosen is > small enough though the count is done in memory. > We use this UDF in PayPal for filtering, when we don't need to use the actual > tuples afterward. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (DATAFU-119) New UDF - TupleDiff
Eyal Allweil created DATAFU-119:
---

Summary: New UDF - TupleDiff
Key: DATAFU-119
URL: https://issues.apache.org/jira/browse/DATAFU-119
Project: DataFu
Issue Type: New Feature
Reporter: Eyal Allweil
Assignee: Eyal Allweil

A UDF that, given two tuples, prints out the differences between them in human-readable form. This is not meant for production - we use it in PayPal for regression tests, to compare the results of two runs.

Differences are calculated based on position, but the tuples' schemas are used, if available, for displaying more friendly results. If no schema is available the output uses field numbers.

It should be used when you want a more fine-grained description of what has changed, unlike [org.apache.pig.builtin.DIFF|https://pig.apache.org/docs/r0.14.0/func.html#diff]. Also, because DIFF takes as its input two bags to be compared, they must fit in memory. This UDF only takes one pair of tuples at a time, so it can run on large inputs.

We use a macro much like the following in conjunction with this UDF:

DEFINE diff_macro(diff_macro_old, diff_macro_new, diff_macro_pk, diff_macro_ignored_field) returns diffs {
    DEFINE TupleDiff datafu.pig.util.TupleDiff;
    old = FOREACH $diff_macro_old GENERATE $diff_macro_pk, TOTUPLE(*) AS original;
    new = FOREACH $diff_macro_new GENERATE $diff_macro_pk, TOTUPLE(*) AS original;
    join_data = JOIN new BY $diff_macro_pk full, old BY $diff_macro_pk;
    join_data = FOREACH join_data GENERATE TupleDiff(old::original, new::original, '$diff_macro_ignored_field') AS tupleDiff, old::original, new::original;
    $diffs = FILTER join_data BY tupleDiff IS NOT NULL;
};

Currently, the output from the macro looks like this (when comma-separated):

added,,
missing,,
changed field2 field4,,

The UDF takes a variable number of parameters - the two tuples to be compared, and any number of field names or numbers to be ignored. We use this to ignore fields representing execution or creation time (the macro I've given as an example assumes only one ignored field).

The current implementation "drills down" into tuples, but not bags or maps - tuple boundaries are indicated with parentheses, like this:

changed outerEmbeddedTuple(innerEmbeddedTuple(fieldNameThatIsDifferent) innerEmbeddedTuple(anotherFieldThatIsDifferent))

I have a few final things left to do and then I'll put it up on reviewboard.
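The positional comparison described above might be sketched like this. It is an illustration under assumptions, not the TupleDiff UDF itself: tuples are modeled as plain Java lists, the schema as an optional list of field names, nested tuples are not drilled into, and the choice that a missing old tuple yields "added" while a missing new tuple yields "missing" is a guess based on the sample output.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

// Illustrative only: compares two flat tuples position by position.
class TupleDiffSketch {
    // fieldNames may be null, in which case field numbers are reported instead.
    // ignored may contain field names or field numbers (as strings).
    static String diff(List<?> oldTuple, List<?> newTuple, List<String> fieldNames, String... ignored) {
        if (oldTuple == null && newTuple == null) {
            return null;
        }
        if (oldTuple == null) {
            return "added";   // assumption: the row exists only in the new run
        }
        if (newTuple == null) {
            return "missing"; // assumption: the row exists only in the old run
        }
        Set<String> ignoredFields = new HashSet<>(Arrays.asList(ignored));
        StringBuilder changed = new StringBuilder();
        int width = Math.max(oldTuple.size(), newTuple.size());
        for (int i = 0; i < width; i++) {
            String position = String.valueOf(i);
            String name = (fieldNames != null && i < fieldNames.size()) ? fieldNames.get(i) : position;
            if (ignoredFields.contains(name) || ignoredFields.contains(position)) {
                continue;
            }
            Object oldValue = i < oldTuple.size() ? oldTuple.get(i) : null;
            Object newValue = i < newTuple.size() ? newTuple.get(i) : null;
            if (!Objects.equals(oldValue, newValue)) {
                changed.append(' ').append(name);
            }
        }
        // null means "no difference", which the macro above filters out.
        return changed.length() == 0 ? null : "changed" + changed;
    }

    public static void main(String[] args) {
        List<String> schema = Arrays.asList("field1", "field2", "field3", "field4");
        List<Object> oldRow = Arrays.<Object>asList(1, "x", 3, "t1");
        List<Object> newRow = Arrays.<Object>asList(1, "y", 3, "t2");
        System.out.println(diff(oldRow, newRow, schema));
        System.out.println(diff(oldRow, newRow, schema, "field4"));
    }
}
```

Returning null for identical tuples lines up with the macro's final `FILTER ... IS NOT NULL` step, so only rows with a difference survive.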
[jira] [Updated] (DATAFU-119) New UDF - TupleDiff
[ https://issues.apache.org/jira/browse/DATAFU-119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eyal Allweil updated DATAFU-119: Description: A UDF that given two tuples, prints out the differences between them in human-readable form. This is not meant for production - we use it in PayPal for regression tests, to compare the results of two runs. Differences are calculated based on position, but the tuples' schemas are used, if available, for displaying more friendly results. If no schema is available the output uses field numbers. It should be used when you want a more fine-grained description of what has changed, unlike [org.apache.pig.builtin.DIFF|https://pig.apache.org/docs/r0.14.0/func.html#diff]. Also, because DIFF takes as its input two bags to be compared, they must fit in memory. This UDF only takes one pair of tuples at a time, so it can run on large inputs. We use a macro much like the following in conjunction with this UDF: {noformat} DEFINE diff_macro(diff_macro_old, diff_macro_new, diff_macro_pk, diff_macro_ignored_field) returns diffs { DEFINE TupleDiff datafu.pig.util.TupleDiff; old = FOREACH $diff_macro_old GENERATE $diff_macro_pk, TOTUPLE(*) AS original; new = FOREACH $diff_macro_new GENERATE $diff_macro_pk, TOTUPLE(*) AS original; join_data = JOIN new BY $diff_macro_pk full, old BY $diff_macro_pk; join_data = FOREACH join_data GENERATE TupleDiff(old::original, new::original, '$diff_macro_ignored_field') AS tupleDiff, old::original, new::original; $diffs = FILTER join_data BY tupleDiff IS NOT NULL ; }; {noformat} Currently, the output from the macro looks like this (when comma-separated): {noformat} added,, missing,, changed field2 field4,, {noformat} The UDF takes a variable number of parameters - the two tuples to be compared, and any number of field names or numbers to be ignored. 
We use this to ignore fields representing execution or creation time (the macro I've given as an example assumes only one ignored field) The current implementation "drills down" into tuples, but not bags or maps - tuple boundaries are indicated with parentheses, like this: {noformat} changed outerEmbeddedTuple(innerEmbeddedTuple(fieldNameThatIsDifferent) innerEmbeddedTuple(anotherFieldThatIsDifferent)) {noformat} I have a few final things left to do and then I'll put it up on reviewboard. was: A UDF that given two tuples, prints out the differences between them in human-readable form. This is not meant for production - we use it in PayPal for regression tests, to compare the results of two runs. Differences are calculated based on position, but the tuples' schemas are used, if available, for displaying more friendly results. If no schema is available the output uses field numbers. It should be used when you want a more fine-grained description of what has changed, unlike [org.apache.pig.builtin.DIFF|https://pig.apache.org/docs/r0.14.0/func.html#diff]. Also, because DIFF takes as its input two bags to be compared, they must fit in memory. This UDF only takes one pair of tuples at a time, so it can run on large inputs. 
We use a macro much like the following in conjunction with this UDF: DEFINE diff_macro(diff_macro_old, diff_macro_new, diff_macro_pk, diff_macro_ignored_field) returns diffs { DEFINE TupleDiff datafu.pig.util.TupleDiff; old = FOREACH $diff_macro_old GENERATE $diff_macro_pk, TOTUPLE(*) AS original; new = FOREACH $diff_macro_new GENERATE $diff_macro_pk, TOTUPLE(*) AS original; join_data = JOIN new BY $diff_macro_pk full, old BY $diff_macro_pk; join_data = FOREACH join_data GENERATE TupleDiff(old::original, new::original, '$diff_macro_ignored_field') AS tupleDiff, old::original, new::original; $diffs = FILTER join_data BY tupleDiff IS NOT NULL ; }; Currently, the output from the macro looks like this (when comma-separated): added,, missing,, changed field2 field4,, The UDF takes a variable number of parameters - the two tuples to be compared, and any number of field names or numbers to be ignored. We use this to ignore fields representing execution or creation time (the macro I've given as an example assumes only one ignored field) The current implementation "drills down" into tuples, but not bags or maps - tuple boundaries are indicated with parentheses, like this: changed outerEmbeddedTuple(innerEmbeddedTuple(fieldNameThatIsDifferent) innerEmbeddedTuple(anotherFieldThatIsDifferent)) I have a few final things left to do and then I'll put it up on reviewboard. > New UDF - TupleDiff >
[jira] [Commented] (DATAFU-119) New UDF - TupleDiff
[ https://issues.apache.org/jira/browse/DATAFU-119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15350489#comment-15350489 ] Eyal Allweil commented on DATAFU-119: - I put up a [reviewboard|https://reviews.apache.org/r/49248/] for this. After some internal discussions, I wonder if the output isn't too specific for general use - I find it very convenient during development for comparing outputs, but it's very much skewed towards human-readability - to make it easy to use the output in Pig, it should have a real schema, not chararray - possibly something with the field names from the original tuples, but boolean or int values to indicate change types. I'd be happy to hear feedback about this. > New UDF - TupleDiff > --- > > Key: DATAFU-119 > URL: https://issues.apache.org/jira/browse/DATAFU-119 > Project: DataFu > Issue Type: New Feature >Reporter: Eyal Allweil >Assignee: Eyal Allweil > > A UDF that given two tuples, prints out the differences between them in > human-readable form. This is not meant for production - we use it in PayPal > for regression tests, to compare the results of two runs. Differences are > calculated based on position, but the tuples' schemas are used, if available, > for displaying more friendly results. If no schema is available the output > uses field numbers. > It should be used when you want a more fine-grained description of what has > changed, unlike > [org.apache.pig.builtin.DIFF|https://pig.apache.org/docs/r0.14.0/func.html#diff]. > Also, because DIFF takes as its input two bags to be compared, they must fit > in memory. This UDF only takes one pair of tuples at a time, so it can run on > large inputs. 
> We use a macro much like the following in conjunction with this UDF: > {noformat} > DEFINE diff_macro(diff_macro_old, diff_macro_new, diff_macro_pk, > diff_macro_ignored_field) returns diffs { > DEFINE TupleDiff datafu.pig.util.TupleDiff; > > old = FOREACH $diff_macro_old GENERATE $diff_macro_pk, TOTUPLE(*) AS > original; > new = FOREACH $diff_macro_new GENERATE $diff_macro_pk, TOTUPLE(*) AS > original; > > join_data = JOIN new BY $diff_macro_pk full, old BY $diff_macro_pk; > > join_data = FOREACH join_data GENERATE TupleDiff(old::original, > new::original, '$diff_macro_ignored_field') AS tupleDiff, old::original, > new::original; > > $diffs = FILTER join_data BY tupleDiff IS NOT NULL ; > }; > {noformat} > Currently, the output from the macro looks like this (when comma-separated): > {noformat} > added,, > missing,, > changed field2 field4,, > {noformat} > The UDF takes a variable number of parameters - the two tuples to be > compared, and any number of field names or numbers to be ignored. We use this > to ignore fields representing execution or creation time (the macro I've > given as an example assumes only one ignored field) > The current implementation "drills down" into tuples, but not bags or maps - > tuple boundaries are indicated with parentheses, like this: > {noformat} > changed outerEmbeddedTuple(innerEmbeddedTuple(fieldNameThatIsDifferent) > innerEmbeddedTuple(anotherFieldThatIsDifferent)) > {noformat} > I have a few final things left to do and then I'll put it up on reviewboard. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
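The positional comparison described in the issue can be sketched in plain Java, without any Pig types. This is only an illustration and not the code from the patch: the class name `TupleDiffSketch`, the method `diff`, the use of `List<Object>` in place of a Pig tuple, and the reading of "added" / "missing" as "the old (or new) side of the join is null" are all assumptions. The output strings imitate the format shown in the description.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

// Hypothetical sketch of a positional tuple diff; NOT the code from the DATAFU-119 patch.
class TupleDiffSketch {

    // Compares two "tuples" by position. fieldNames may be null, in which case
    // field numbers are used in the output. Fields can be ignored by name or number.
    // Returns null when nothing (other than ignored fields) differs.
    static String diff(List<Object> oldTuple, List<Object> newTuple,
                       List<String> fieldNames, Set<String> ignored) {
        if (oldTuple == null && newTuple == null) {
            return null;
        }
        if (oldTuple == null) {
            return "added";   // assumption: record exists only in the new run
        }
        if (newTuple == null) {
            return "missing"; // assumption: record exists only in the old run
        }
        List<String> changed = new ArrayList<>();
        int size = Math.max(oldTuple.size(), newTuple.size());
        for (int i = 0; i < size; i++) {
            String number = String.valueOf(i);
            String name = (fieldNames != null && i < fieldNames.size())
                    ? fieldNames.get(i) : number;
            if (ignored.contains(name) || ignored.contains(number)) {
                continue;
            }
            Object a = i < oldTuple.size() ? oldTuple.get(i) : null;
            Object b = i < newTuple.size() ? newTuple.get(i) : null;
            if (!Objects.equals(a, b)) {
                changed.add(name);
            }
        }
        return changed.isEmpty() ? null : "changed " + String.join(" ", changed);
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("id", "field1", "field2", "field3", "field4", "created");
        List<Object> oldT = Arrays.<Object>asList(1, "a", "b", "c", "d", 100L);
        List<Object> newT = Arrays.<Object>asList(1, "a", "X", "c", "Y", 200L);
        Set<String> ignored = new HashSet<>(Arrays.asList("created"));
        System.out.println(diff(oldT, newT, names, ignored)); // prints "changed field2 field4"
    }
}
```

Returning null for identical tuples mirrors the macro's final FILTER ... IS NOT NULL step, so only differing records survive. Nested tuples are not handled in this sketch.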
[jira] [Commented] (DATAFU-119) New UDF - TupleDiff
[ https://issues.apache.org/jira/browse/DATAFU-119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15471164#comment-15471164 ] Eyal Allweil commented on DATAFU-119: - Any feedback about this? > New UDF - TupleDiff > --- > > Key: DATAFU-119 > URL: https://issues.apache.org/jira/browse/DATAFU-119 > Project: DataFu > Issue Type: New Feature >Reporter: Eyal Allweil >Assignee: Eyal Allweil -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DATAFU-119) New UDF - TupleDiff
[ https://issues.apache.org/jira/browse/DATAFU-119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15500764#comment-15500764 ] Eyal Allweil commented on DATAFU-119: - I've run it on results that were in the tens of millions. I think the main reason for using it / including it in DataFu is that if you're developing Pig code, and running it on a cluster (or on any given environment), being able to stay in the Pig ecosystem is convenient for fast development cycles. If your original job can run on the given environment, a comparison job can run there efficiently, too. And there's less copying because you leave the previous results in HDFS under a different name, and compare easily. The output is human-readable, but the expected result is that most records return null, because they're identical, and the ones that do come out are usually edge cases that turned out different. That's the reasoning behind having "something" like this UDF. The output type and its not having a schema is a different story - it would be better to have a schema. But I'm hesitant to spend the time to do it if it isn't likely that someone else will want to write a different output format for it. > New UDF - TupleDiff > --- > > Key: DATAFU-119 > URL: https://issues.apache.org/jira/browse/DATAFU-119 > Project: DataFu > Issue Type: New Feature >Reporter: Eyal Allweil >Assignee: Eyal Allweil > > A UDF that given two tuples, prints out the differences between them in > human-readable form. This is not meant for production - we use it in PayPal > for regression tests, to compare the results of two runs. Differences are > calculated based on position, but the tuples' schemas are used, if available, > for displaying more friendly results. If no schema is available the output > uses field numbers. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (DATAFU-122) Documentation error/typo on tips and tricks involving Coalesce
[ https://issues.apache.org/jira/browse/DATAFU-122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eyal Allweil resolved DATAFU-122. - Resolution: Fixed > Documentation error/typo on tips and tricks involving Coalesce > -- > > Key: DATAFU-122 > URL: https://issues.apache.org/jira/browse/DATAFU-122 > Project: DataFu > Issue Type: Bug >Reporter: Ryan Clough >Assignee: Eyal Allweil >Priority: Trivial > Labels: documentation, typo > Fix For: 1.3.2 > > > http://datafu.incubator.apache.org/docs/datafu/guide/more-tips-and-tricks.html > On this page, an example is given for Coalesce: > {code} > DEFINE EmptyBagToNullFields datafu.pig.util.Coalesce(); > data = FOREACH data GENERATE Coalesce(val,0) as result; > {code} > In this example, "EmptyBagToNullFields" should be replaced with "Coalesce", > which is what is used in the code following the define statement. My guess is > this is a copy-paste error from an example further down where > EmptyBagToNullFields is actually used. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (DATAFU-122) Documentation error/typo on tips and tricks involving Coalesce
[ https://issues.apache.org/jira/browse/DATAFU-122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eyal Allweil updated DATAFU-122: Assignee: Eyal Allweil Labels: documentation typo (was: docuentation typo) Fix Version/s: 1.3.2 Thanks Ryan! I've fixed this in our sources, and it will show up when we release our next version (probably 1.3.2). > Documentation error/typo on tips and tricks involving Coalesce > -- > > Key: DATAFU-122 > URL: https://issues.apache.org/jira/browse/DATAFU-122 > Project: DataFu > Issue Type: Bug >Reporter: Ryan Clough >Assignee: Eyal Allweil >Priority: Trivial > Labels: documentation, typo > Fix For: 1.3.2 > > > http://datafu.incubator.apache.org/docs/datafu/guide/more-tips-and-tricks.html > On this page, an example is given for Coalesce: > {code} > DEFINE EmptyBagToNullFields datafu.pig.util.Coalesce(); > data = FOREACH data GENERATE Coalesce(val,0) as result; > {code} > In this example, "EmptyBagToNullFields" should be replaced with "Coalesce", > which is what is used in the code following the define statement. My guess is > this is a copy-paste error from an example further down where > EmptyBagToNullFields is actually used. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DATAFU-85) Add SPRINTF to provide this functionality to Pig < 0.14.0
[ https://issues.apache.org/jira/browse/DATAFU-85?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15571787#comment-15571787 ] Eyal Allweil commented on DATAFU-85: Given the time that has passed, and that it can't be backported (easily), I think this issue can/should be closed. > Add SPRINTF to provide this functionality to Pig < 0.14.0 > - > > Key: DATAFU-85 > URL: https://issues.apache.org/jira/browse/DATAFU-85 > Project: DataFu > Issue Type: Bug >Reporter: Russell Jurney >Assignee: Russell Jurney > > I need SPRINTF in DataFu for a book I'm working on. I'd like to add this to > DataFu so that CDH, HDP, MapR, etc. users can use SPRINTF as soon as DataFu > cuts a new release. > See PIG-3939 > Thoughts? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DATAFU-28) Tests are too slow
[ https://issues.apache.org/jira/browse/DATAFU-28?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15571961#comment-15571961 ] Eyal Allweil commented on DATAFU-28: On my machine the datafu-pig tests run in 18 minutes (I ran them with ./gradlew :datafu-pig:test). Is this issue still relevant, or is that an acceptable time? > Tests are too slow > -- > > Key: DATAFU-28 > URL: https://issues.apache.org/jira/browse/DATAFU-28 > Project: DataFu > Issue Type: Improvement >Reporter: Matthew Hayes > > I ran the tests on my laptop and it took nearly 2 hours. > The worst offenders are {{datafu.test.pig.sampling}}, > {{datafu.test.pig.stats}}, and {{datafu.test.pig.stats.entropy}}. > ||Package ||Tests|| Failures|| Duration|| Success rate|| > |datafu.test.pig.bags|27 |0| 1m10.72s|100%| > |datafu.test.pig.geo |1 |0 |9.757s |100%| > |datafu.test.pig.hash|4 |0 |41.039s| 100%| > |datafu.test.pig.linkanalysis|5 |0| 32.677s |100%| > |datafu.test.pig.random |1| 0| 11.789s|100%| > |datafu.test.pig.sampling |25|0 |38m25.81s| 100%| > |datafu.test.pig.sessions |7 |0 |2m50.67s |100%| > |datafu.test.pig.sets |9 |0 |5m46.70s |100%| > |datafu.test.pig.stats| 52| 0 |26m11.98s| 100%| > |datafu.test.pig.stats.entropy|40|0 |31m30.97s |100%| > |datafu.test.pig.urls|1 |0 |1m35.24s |100%| > |datafu.test.pig.util|21 |0| 4m51.64s|100%| -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DATAFU-81) Unit test failures on Java 1.8
[ https://issues.apache.org/jira/browse/DATAFU-81?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15578653#comment-15578653 ] Eyal Allweil commented on DATAFU-81: These tests all pass for me with openjdk-8-amd64 on Ubuntu 16.04, both in Eclipse and in Gradle (I changed build.gradle and MultilineProcessor.java to test it, though). > Unit test failures on Java 1.8 > -- > > Key: DATAFU-81 > URL: https://issues.apache.org/jira/browse/DATAFU-81 > Project: DataFu > Issue Type: Bug >Reporter: Matthew Hayes > > I have Java 8 installed and I noticed the following bag tests fail: > * bagJoinFullOuterTest > * bagLeftOuterJoinTest > * distinctByMultiComplexFieldTest > * duplicateAliasTest > It seems like there may be an ordering assumption that doesn't work in Java > 8. There may be other cases like this too. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
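If the failures do come from an ordering assumption, one way to make such tests robust is to compare expected and actual output without regard to order. The sketch below is an assumption about how that could look, not code from the DataFu test suite: `OrderInsensitiveCompare` and `sameIgnoringOrder` are hypothetical names, and output tuples are represented as plain strings. A frequent source of this kind of breakage is reliance on `HashMap` / `HashSet` iteration order, which Java does not specify; whether that is the cause here is not confirmed by the issue.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical test helper; not part of DataFu.
class OrderInsensitiveCompare {

    // True when both lists hold the same elements with the same multiplicities,
    // regardless of order (bag semantics, so duplicates still count).
    static boolean sameIgnoringOrder(List<String> expected, List<String> actual) {
        List<String> e = new ArrayList<>(expected);
        List<String> a = new ArrayList<>(actual);
        Collections.sort(e);
        Collections.sort(a);
        return e.equals(a);
    }

    public static void main(String[] args) {
        List<String> expected = Arrays.asList("(1,a)", "(2,b)", "(2,b)");
        List<String> actual = Arrays.asList("(2,b)", "(1,a)", "(2,b)");
        System.out.println(sameIgnoringOrder(expected, actual)); // prints "true"
    }
}
```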
[jira] [Updated] (DATAFU-65) Aho-Corasick Pig UDF
[ https://issues.apache.org/jira/browse/DATAFU-65?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eyal Allweil updated DATAFU-65: --- Issue Type: New Feature (was: Bug) > Aho-Corasick Pig UDF > > > Key: DATAFU-65 > URL: https://issues.apache.org/jira/browse/DATAFU-65 > Project: DataFu > Issue Type: New Feature >Affects Versions: 1.3.0 > Environment: Drought >Reporter: Russell Jurney > Attachments: DATAFU-65.diff > > Original Estimate: 8h > Remaining Estimate: 8h > > I need to use the Aho-Corasick algorithm for efficient sub-string matching. A > java implementation is available at > https://github.com/robert-bor/aho-corasick and is available on maven central: > http://maven-repository.com/artifact/org.arabidopsis.ahocorasick/ahocorasick/2.x > A Pig UDF will be very helpful to me. > How do I add a maven dependency with gradle? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DATAFU-45) RFE: CartesianProduct
[ https://issues.apache.org/jira/browse/DATAFU-45?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15584898#comment-15584898 ] Eyal Allweil commented on DATAFU-45: Hi Sam, Did you ever solve this? I agree with Matthew that this should be doable via plain Pig - if not, I'd open a bug there. > RFE: CartesianProduct > - > > Key: DATAFU-45 > URL: https://issues.apache.org/jira/browse/DATAFU-45 > Project: DataFu > Issue Type: New Feature >Reporter: Sam Steingold > > Given two bags, produce their [Cartesian > product|http://en.wikipedia.org/wiki/Cartesian_product]: > {code} > B1: bag{T1} > B2: bag{T2} > CartesianProduct(B1,B2): bag{(T1,T2)} > {code} > Use case: > {code} > toks = TOKENIZE((charray)$0,','); > kwds = CartesianProduct(toks, {1.0/(double)SIZE(toks)}); > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
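For reference, the operation requested in the issue is small when written out in Java. This is a sketch only, with bags modelled as `List`s and each output tuple as a two-element `List<Object>`; `CartesianProductSketch` and `product` are hypothetical names and no such UDF is implied to exist in DataFu.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of bag{T1} x bag{T2} -> bag{(T1,T2)}; not a DataFu UDF.
class CartesianProductSketch {

    // Pairs every element of b1 with every element of b2, in b1-major order.
    static <A, B> List<List<Object>> product(List<A> b1, List<B> b2) {
        List<List<Object>> out = new ArrayList<>(b1.size() * b2.size());
        for (A a : b1) {
            for (B b : b2) {
                out.add(Arrays.<Object>asList(a, b));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Mirrors the use case in the issue: each token paired with 1.0 / number of tokens.
        List<String> toks = Arrays.asList("x", "y");
        List<Double> weight = Arrays.asList(1.0 / toks.size());
        System.out.println(product(toks, weight)); // prints "[[x, 0.5], [y, 0.5]]"
    }
}
```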
[jira] [Commented] (DATAFU-16) weighted reservoir sampling with exponential jumps UDF
[ https://issues.apache.org/jira/browse/DATAFU-16?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15586085#comment-15586085 ] Eyal Allweil commented on DATAFU-16: It looks like this got added - can this issue be closed? > weighted reservoir sampling with exponential jumps UDF > -- > > Key: DATAFU-16 > URL: https://issues.apache.org/jira/browse/DATAFU-16 > Project: DataFu > Issue Type: New Feature > Environment: Mac, Linux > pig-0.11 >Reporter: jian wang >Assignee: jian wang >Priority: Minor > Attachments: ScoredExpJmpReservoir.java, ScoredReservoir.java, > WeightedSamplingCorrectnessTests.java > > > Create a weightedReservoirSampleWithExpJump UDF to implement the weighted > reservoir sampling algorithm with exponential jumps. Investigation is tracked > in https://github.com/linkedin/datafu/issues/80. This task is part of > experiment of different weighted sampling algorithms. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
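For readers unfamiliar with the technique named in the issue title, here is a minimal in-memory sketch of weighted reservoir sampling with exponential jumps (the A-ExpJ scheme of Efraimidis and Spirakis). It is not the attached ScoredExpJmpReservoir.java and not the DataFu UDF: the class name `ExpJumpReservoirSketch`, the `sample` signature, and the in-memory `List` input are all assumptions made for illustration. Weights are assumed to be strictly positive.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Random;

// Hypothetical sketch of A-ExpJ weighted reservoir sampling; not the DataFu implementation.
class ExpJumpReservoirSketch {

    private static final class Entry<T> {
        final double key;
        final T item;
        Entry(double key, T item) { this.key = key; this.item = item; }
    }

    // Samples up to k items without replacement, favouring larger weights.
    static <T> List<T> sample(List<T> items, double[] weights, int k, Random rnd) {
        List<T> out = new ArrayList<>();
        if (k <= 0) {
            return out;
        }
        // Min-heap on the key, so the smallest key (the threshold) is at the head.
        PriorityQueue<Entry<T>> heap =
                new PriorityQueue<>(Comparator.comparingDouble((Entry<T> e) -> e.key));
        int i = 0;
        // Fill the reservoir: key = u^(1/w) with u uniform in [0,1).
        for (; i < items.size() && heap.size() < k; i++) {
            heap.add(new Entry<>(Math.pow(rnd.nextDouble(), 1.0 / weights[i]), items.get(i)));
        }
        if (i < items.size()) {
            // x is the amount of weight to skip before the next replacement.
            double x = Math.log(rnd.nextDouble()) / Math.log(heap.peek().key);
            for (; i < items.size(); i++) {
                x -= weights[i];
                if (x <= 0) {
                    // Draw the new key from (threshold^w, 1) so it beats the current minimum.
                    double t = Math.pow(heap.peek().key, weights[i]);
                    double r = t + (1.0 - t) * rnd.nextDouble();
                    heap.poll();
                    heap.add(new Entry<>(Math.pow(r, 1.0 / weights[i]), items.get(i)));
                    x = Math.log(rnd.nextDouble()) / Math.log(heap.peek().key);
                }
            }
        }
        for (Entry<T> e : heap) {
            out.add(e.item);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> items = new ArrayList<>();
        double[] weights = new double[20];
        for (int i = 0; i < 20; i++) {
            items.add("item" + i);
            weights[i] = 1.0 + i; // later items are more likely to be kept
        }
        System.out.println(sample(items, weights, 5, new Random(7)));
    }
}
```

The point of the jump is that only one random number is drawn per replacement rather than one per input item. Degenerate random draws (exactly 0) are not handled specially in this sketch.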
[jira] [Updated] (DATAFU-25) AliasableEvalFunc should use getInputSchema
[ https://issues.apache.org/jira/browse/DATAFU-25?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eyal Allweil updated DATAFU-25: --- Attachment: DATAFU-25.patch This is a minimal fix that uses getInputSchema() instead of the udf context, when possible. It will also solve [DATAFU-6|https://issues.apache.org/jira/browse/DATAFU-6], though not the example provided there - BagLeftOuterJoin. This is because MonitoredUDF prevents the udf context from working (see [PIG-3554|https://issues.apache.org/jira/browse/PIG-3554]), and BagJoin uses the udf context for other things, not just those that AliasableEvalFunc provides. I couldn't think of a clean way of adding a test for this, but you can verify that it works by adding the MonitoredUDF annotation to TransposeTupleToBag - this will make the TransposeTest.transposeTest fail, unless my patch is used. I didn't want to expose a "fake" udf with the MonitoredUDF annotation, and adding it in the test package means that TransposeTest can't access it. > AliasableEvalFunc should use getInputSchema > --- > > Key: DATAFU-25 > URL: https://issues.apache.org/jira/browse/DATAFU-25 > Project: DataFu > Issue Type: Improvement >Reporter: Matthew Hayes >Assignee: Will Vaughan > Attachments: DATAFU-25.patch > > > AliasableEvalFunc derives from ContextualEvalFunc and stores a map of aliases > in the UDF context. We can instead use getInputSchema, which was added to > Pig 0.11. This may in the process resolve DATAFU-6. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (DATAFU-25) AliasableEvalFunc should use getInputSchema
[ https://issues.apache.org/jira/browse/DATAFU-25?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eyal Allweil reassigned DATAFU-25: -- Assignee: Eyal Allweil (was: Will Vaughan) > AliasableEvalFunc should use getInputSchema > --- > > Key: DATAFU-25 > URL: https://issues.apache.org/jira/browse/DATAFU-25 > Project: DataFu > Issue Type: Improvement >Reporter: Matthew Hayes >Assignee: Eyal Allweil > Attachments: DATAFU-25.patch > > > AliasableEvalFunc derives from ContextualEvalFunc and stores a map of aliases > in the UDF context. We can instead use getInputSchema, which was added to > Pig 0.11. This may in the process resolve DATAFU-6. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (DATAFU-25) AliasableEvalFunc should use getInputSchema
[ https://issues.apache.org/jira/browse/DATAFU-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15589407#comment-15589407 ] Eyal Allweil edited comment on DATAFU-25 at 10/19/16 6:01 PM: -- This is a minimal fix that uses getInputSchema() instead of the udf context, when possible. It will also solve [DATAFU-6|https://issues.apache.org/jira/browse/DATAFU-6], though not the example provided there - BagLeftOuterJoin. This is because MonitoredUDF prevents the udf context from working (see [PIG-3554|https://issues.apache.org/jira/browse/PIG-3554]), and BagJoin uses the udf context for other things, not just those that AliasableEvalFunc provides. I couldn't think of a clean way of adding a test for this, but you can verify that it works by adding the MonitoredUDF annotation to TransposeTupleToBag - this will make the TransposeTest.transposeTest fail, unless my patch is used. I didn't want to expose a "fake" udf with the MonitoredUDF annotation, and adding it in the test package means that TransposeTest can't access it. was (Author: eyal): This is a minimal fix that uses getInputSchema() instead of the udf context, when possible. It will also solve [DATAFU-6|https://issues.apache.org/jira/browse/DATAFU-6], though not the example provided there - BagLeftOuterJoin. This is because MonitoredUDF prevents the udf context from working (see [PIG-3554|https://issues.apache.org/jira/browse/PIG-3554], and BagJoin uses the udf context for other things, not just those that AliasableEvalFunc provides. I couldn't think of a clean way of adding a test for this, but you can verify that it works by adding the MonitoredUDF annotation to TransposeTupleToBag - this will make the TransposeTest.transposeTest fail, unless my patch is used. I didn't want to expose a "fake" udf with the MonitoredUDF annotation, and adding it in the test package means that TransposeTest can't access it. 
> AliasableEvalFunc should use getInputSchema > --- > > Key: DATAFU-25 > URL: https://issues.apache.org/jira/browse/DATAFU-25 > Project: DataFu > Issue Type: Improvement >Reporter: Matthew Hayes >Assignee: Will Vaughan > Attachments: DATAFU-25.patch > > > AliasableEvalFunc derives from ContextualEvalFunc and stores a map of aliases > in the UDF context. We can instead use getInputSchema, which was added to > Pig 0.11. This may in the process resolve DATAFU-6. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DATAFU-6) MonitoredUDF annotation does not work with AliasableEvalFunc
[ https://issues.apache.org/jira/browse/DATAFU-6?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15591185#comment-15591185 ] Eyal Allweil commented on DATAFU-6: --- One caveat about the description - adding the annotation to BagLeftOuterJoin is not a good way to test the problem in AliasableEvalFunc, because it depends on other values in the UDF context, and because of PIG-3554 the context doesn't work. In order to test this specific case, it's better to put the MonitoredUDF annotation on TransposeTupleToBag because it doesn't have any other dependencies on the udf context (other than those provided by AliasableEvalFunc) > MonitoredUDF annotation does not work with AliasableEvalFunc > > > Key: DATAFU-6 > URL: https://issues.apache.org/jira/browse/DATAFU-6 > Project: DataFu > Issue Type: Bug >Reporter: Matthew Hayes > > This was reported by seregasheypak on GitHub > (https://github.com/linkedin/datafu/issues/89). We were able to reproduce > this by adding the annotation to BagLeftOuterJoin and running its tests. > Simply adding the annotation causes problems. In > ContextualEvalFunc.getContextProperties, the properties retrieved for the > class are empty. > seregasheypak: > Hi, If I use > {code} > @MonitoredUDF(timeUnit = TimeUnit.MINUTES, duration = 10, errorCallback = > NplRecMatcherErrorCallback.class) > class NplRecFirstLevelMatcher extends AliasableEvalFunc implements > DebuggableUDF{ > //some cool stuff goes here! 
> } > {code} > I do get exception: > {noformat} > 14/01/15 23:52:52 ERROR udf.NplRecFirstLevelMatcher: Class: class > NplRecFirstLevelMatcher > 14/01/15 23:52:52 ERROR udf.NplRecFirstLevelMatcher: Instance name: 30 > 14/01/15 23:52:52 ERROR udf.NplRecFirstLevelMatcher: Properties: {30={}} > *** ***A debug output from my handler method*** *** > NplRecMatcherErrorCallback.handleError > null > ERROR: java.lang.RuntimeException: Could not retrieve aliases from properties > using aliasMap > java.util.concurrent.ExecutionException: java.lang.RuntimeException: Could > not retrieve aliases from properties using aliasMap > at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:232) > at java.util.concurrent.FutureTask.get(FutureTask.java:91) > at > com.google.common.util.concurrent.ForwardingFuture.get(ForwardingFuture.java:69) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.util.MonitoredUDFExecutor.monitorExec(MonitoredUDFExecutor.java:183) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:335) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:376) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:354) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:372) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:297) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:308) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:241) > at > org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:308) > at > 
org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFilter.getNext(POFilter.java:95) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.runPipeline(PigGenericMapReduce.java:465) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.processOnePackageOutput(PigGenericMapReduce.java:433) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:413) > at > org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:257) > at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:164) > at > org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:610) > at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:444) > at > org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:449) > Caused by: java.lang.RuntimeException: Could not retrieve aliases from > properties using aliasMap > at > datafu.pig.util.AliasableEvalFunc.getFieldAliases(AliasableEvalFunc.java:164) > at > datafu.pig.util.AliasableEvalFunc.getPosit
[jira] [Commented] (DATAFU-9) Add datafu.text.ToJson UDF to serialize any relation/field as a JSON String
[ https://issues.apache.org/jira/browse/DATAFU-9?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15591983#comment-15591983 ] Eyal Allweil commented on DATAFU-9: --- Hi Russell, are you still interested in including this in DataFu? It doesn't look like there are intractable problems left, and DataFu is now based on a Pig version recent enough to have BigInteger and BigDecimal. > Add datafu.text.ToJson UDF to serialize any relation/field as a JSON String > > > Key: DATAFU-9 > URL: https://issues.apache.org/jira/browse/DATAFU-9 > Project: DataFu > Issue Type: New Feature >Reporter: Russell Jurney > Attachments: DATAFU-9.patch > > > See https://github.com/linkedin/datafu/issues/91 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DATAFU-98) New UDF for Histogram / Frequency counting
[ https://issues.apache.org/jira/browse/DATAFU-98?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15605952#comment-15605952 ] Eyal Allweil commented on DATAFU-98: Hi Russell. First of all, I want to apologize for the time it's taken us to get to your contribution. I think it could be quite useful. Having said that, I wonder if the current version - without counters - gives us enough of an advantage over vanilla Pig. I think the following code (modified from your unit test) gives us nearly the same functionality as the UDF in the patch:
{noformat}
data_in = LOAD 'input' as (val:int);
-- data_in: "0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "20"
intermediate_data = FOREACH data_in GENERATE val, (val / 5 * 5) AS binStart;
data_out = FOREACH (GROUP intermediate_data BY binStart) GENERATE group AS binStart, COUNT(intermediate_data) AS binCount;
-- data_out: (0,5),(5,5),(10,2),(20,1)
{noformat}
Unlike your UDF, missing bins are not included. But while including missing bins can be useful, I do wonder whether a single skewed value can cause problems, especially with small bin sizes and long values. (As a performance-related aside, I would try to call FrequencyCounter.toBag() only in the Final implementation, instead of in the first two stages of the algebraic implementation, to minimize the data copied.) So it seems to me the current UDF has the advantage of including the missing bins, and it's obviously more readable and convenient than rewriting the Pig code I wrote above. Did you (or you, [~andrew.musselman]) run any performance tests? Maybe the Algebraic implementation runs faster than the vanilla Pig code by virtue of the combiner use. Last (but not least!), the version you mentioned with counters sounds like it could be really great.
> New UDF for Histogram / Frequency counting > -- > > Key: DATAFU-98 > URL: https://issues.apache.org/jira/browse/DATAFU-98 > Project: DataFu > Issue Type: New Feature >Reporter: Russell Melick > Attachments: DATAFU-98.patch > > > I was thinking of creating a new UDF to compute histograms / frequency counts > of input bags. It seems like it would make sense to support ints, longs, > float, and doubles. > I tried looking around to see if this was already implemented, but > ValueHistogram and AggregateWordHistogram were about the only things I found. > They seem to exist as an example job, and only work for Strings. > https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/aggregate/ValueHistogram.html > https://hadoop.apache.org/docs/r1.2.1/api/org/apache/hadoop/examples/AggregateWordHistogram.html > Should the user specify the bin size or the number of bins? Specifying bin > size probably makes the implementation simpler since you can bin things > without having seen all of the data. > I think it would make sense to implement a version of this that didn't need > any reducers. It could use counters to keep track of the counts per bin > without sending any data to a reducer. You would be able to call this > without a preceding GROUP BY as well. > Here's my proposal for the two udfs. This assumes the input data is two > columns, memberId and numConnections. 
> {code} > DEFINE BinnedFrequency datafu.pig.stats.BinnedFrequency('min=0;binSize=50') > connections = LOAD 'connections' AS memberId, numConnections; > connectionHistogram = FOREACH (GROUP connections ALL) GENERATE > BinnedFrequency(connections.numConnections); > {code} > The output here would be a bag with the frequency counts > {code} > {('0-49', 5), ('50-99', 0), ('100-149', 10)} > {code} > {code} > DEFINE BinnedFrequencyCounter > datafu.pig.stats.BinnedFrequencyCounter('min=0;binSize=50;name=numConnectionsHistogram') > connections = LOAD 'connections' AS memberId, numConnections; > connections = FOREACH connections GENERATE > BinnedFrequencyCounter(numConnections); > {code} > The output here would just be a counter for each bin, all sharing the same > group of numConnectionsHistogram. It would look something like > numConnectionsHistogram.'0-49' = 5 > numConnectionsHistogram.'50-99' = 0 > numConnectionsHistogram.'100-149' = 10 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
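The binning arithmetic discussed in this thread can be sketched in plain Java. This is only an illustration, not the code from DATAFU-98.patch: the class name `BinnedCounts`, its methods, and the `fillMissing` flag are hypothetical, and `Math.floorDiv` is used for negative values, whereas the Pig expression `val / 5 * 5` uses ordinary integer division.

```java
import java.util.Map;
import java.util.TreeMap;

/** Illustrative sketch only; not the code from DATAFU-98.patch. All names are hypothetical. */
final class BinnedCounts {

    /** Start of the bin that contains val; floorDiv keeps negative values in the lower bin. */
    static long binStart(long val, long binSize) {
        if (binSize <= 0) {
            throw new IllegalArgumentException("binSize must be positive");
        }
        return Math.floorDiv(val, binSize) * binSize;
    }

    /** Counts values per bin. With fillMissing, empty bins between the first and last bin get a 0. */
    static Map<Long, Long> histogram(long[] values, long binSize, boolean fillMissing) {
        TreeMap<Long, Long> counts = new TreeMap<>();
        for (long v : values) {
            counts.merge(binStart(v, binSize), 1L, Long::sum);
        }
        if (fillMissing && !counts.isEmpty()) {
            long last = counts.lastKey();
            for (long b = counts.firstKey(); b < last; b += binSize) {
                counts.putIfAbsent(b, 0L);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        long[] data = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 20};
        System.out.println(histogram(data, 5, false)); // {0=5, 5=5, 10=2, 20=1}
        System.out.println(histogram(data, 5, true));  // {0=5, 5=5, 10=2, 15=0, 20=1}
    }
}
```

With `fillMissing` enabled, a single outlier far from the rest of the data adds one empty bin per `binSize` step up to it, which is the skew concern raised in the comment.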
[jira] [Commented] (DATAFU-87) Edit distance
[ https://issues.apache.org/jira/browse/DATAFU-87?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15606106#comment-15606106 ] Eyal Allweil commented on DATAFU-87: Hi Joydeep, I want to begin by apologizing for the time it's taken us to get to your contribution. Did you ever continue with it? Have you compared your implementation with [the one in Apache Commons Text|https://github.com/apache/commons-text/blob/master/src/main/java/org/apache/commons/text/similarity/LevenshteinDistance.java] or [Commons Lang|https://github.com/apache/commons-lang/blob/master/src/main/java/org/apache/commons/lang3/StringUtils.java#L7731]? (I think they follow the same algorithm, described in _Algorithms on Strings, Trees and Sequences_ by Dan Gusfield, with an implementation by Chas Emerick.) > Edit distance > - > > Key: DATAFU-87 > URL: https://issues.apache.org/jira/browse/DATAFU-87 > Project: DataFu > Issue Type: New Feature >Affects Versions: 1.3.0 >Reporter: Joydeep Banerjee > Attachments: DATAFU-87.patch > > > [This is work-in-progress] > Given 2 strings, provide a measure of dissimilarity (Levenshtein distance) > between them. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
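For readers comparing implementations, here is a minimal sketch of the classic two-row dynamic-programming Levenshtein distance. It is not the code from DATAFU-87.patch, nor a copy of the Commons Text or Commons Lang source; the class name `EditDistance` is hypothetical.

```java
/** Illustrative sketch only; not DATAFU-87.patch and not the Apache Commons code. */
final class EditDistance {

    /** Minimum number of single-character insertions, deletions and substitutions turning a into b. */
    static int levenshtein(CharSequence a, CharSequence b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Distance from the empty prefix of a to each prefix of b is the prefix length.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("kitten", "sitting")); // 3
    }
}
```

Keeping only two rows bounds memory by the length of the second string, which matters when a UDF is applied to long strings.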
[jira] [Commented] (DATAFU-25) AliasableEvalFunc should use getInputSchema
[ https://issues.apache.org/jira/browse/DATAFU-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15607593#comment-15607593 ] Eyal Allweil commented on DATAFU-25: I think that if we want to keep the UDF context as fallback, we need to keep the code that sets it. I'll add a comment explaining the logic behind the two methods. Should I commit it afterwards? > AliasableEvalFunc should use getInputSchema > --- > > Key: DATAFU-25 > URL: https://issues.apache.org/jira/browse/DATAFU-25 > Project: DataFu > Issue Type: Improvement >Reporter: Matthew Hayes >Assignee: Eyal Allweil > Attachments: DATAFU-25.patch > > > AliasableEvalFunc derives from ContextualEvalFunc and stores a map of aliases > in the UDF context. We can instead use getInputSchema, which was added to > Pig 0.11. This may in the process resolve DATAFU-6. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (DATAFU-25) AliasableEvalFunc should use getInputSchema
[ https://issues.apache.org/jira/browse/DATAFU-25?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eyal Allweil updated DATAFU-25: --- Fix Version/s: 1.3.2 > AliasableEvalFunc should use getInputSchema > --- > > Key: DATAFU-25 > URL: https://issues.apache.org/jira/browse/DATAFU-25 > Project: DataFu > Issue Type: Improvement >Reporter: Matthew Hayes >Assignee: Eyal Allweil > Fix For: 1.3.2 > > Attachments: DATAFU-25.patch > > > AliasableEvalFunc derives from ContextualEvalFunc and stores a map of aliases > in the UDF context. We can instead use getInputSchema, which was added to > Pig 0.11. This may in the process resolve DATAFU-6. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (DATAFU-25) AliasableEvalFunc should use getInputSchema
[ https://issues.apache.org/jira/browse/DATAFU-25?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eyal Allweil resolved DATAFU-25. Resolution: Fixed This means that DATAFU-6 should probably be closed too. > AliasableEvalFunc should use getInputSchema > --- > > Key: DATAFU-25 > URL: https://issues.apache.org/jira/browse/DATAFU-25 > Project: DataFu > Issue Type: Improvement >Reporter: Matthew Hayes >Assignee: Eyal Allweil > Fix For: 1.3.2 > > Attachments: DATAFU-25.patch > > > AliasableEvalFunc derives from ContextualEvalFunc and stores a map of aliases > in the UDF context. We can instead use getInputSchema, which was added to > Pig 0.11. This may in the process resolve DATAFU-6. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DATAFU-83) InUDF does not validate that types are compatible
[ https://issues.apache.org/jira/browse/DATAFU-83?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15609317#comment-15609317 ] Eyal Allweil commented on DATAFU-83: Hi [~sonalit], I don't see your review board request. Can you check that it's associated with the DataFu group, or attach your updated patch? This seems like a bug worth fixing, even if Pig already has its own [IN operator|https://pig.apache.org/docs/r0.14.0/basic.html#boolops]. > InUDF does not validate that types are compatible > - > > Key: DATAFU-83 > URL: https://issues.apache.org/jira/browse/DATAFU-83 > Project: DataFu > Issue Type: Improvement >Reporter: Matthew Hayes >Priority: Minor > Attachments: DATAFU-83.patch > > > See the example below. The input data is a long, but ints are provided to > match against. Because it uses the Java equals to compare and these are > different types, this will never match, which can lead to confusing results. > I believe it should at least throw an error. > {code} > define I datafu.pig.util.InUDF(); > > data = LOAD 'input' AS (B: bag {T: tuple(v:LONG)}); > > data2 = FOREACH data { > C = FILTER B By I(v, 1,2,3); > GENERATE C; > } > > describe data2; > > STORE data2 INTO 'output'; > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
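The mismatch described in this issue comes from how Java's boxed numeric types compare: `Long.equals` returns false for any `Integer` argument, even when the numeric values are the same. The sketch below illustrates that behavior and one possible fail-fast check. It is not the InUDF source or the attached patch, and `InTypeCheck`, `naiveIn` and `checkedIn` are hypothetical names.

```java
import java.util.Objects;

/** Illustrative sketch only; this is not the DataFu InUDF source or the attached patch. */
final class InTypeCheck {

    /** Membership test using plain equals, the comparison the issue description refers to. */
    static boolean naiveIn(Object value, Object... candidates) {
        for (Object candidate : candidates) {
            if (Objects.equals(value, candidate)) {
                return true;
            }
        }
        return false;
    }

    /** Same test, but throws when a candidate's class differs from the value's class. */
    static boolean checkedIn(Object value, Object... candidates) {
        for (Object candidate : candidates) {
            if (value != null && candidate != null && value.getClass() != candidate.getClass()) {
                throw new IllegalArgumentException(
                    "Cannot compare " + value.getClass().getSimpleName()
                        + " with " + candidate.getClass().getSimpleName());
            }
        }
        return naiveIn(value, candidates);
    }

    public static void main(String[] args) {
        System.out.println(naiveIn(1L, 1, 2, 3));    // false: a Long never equals an Integer
        System.out.println(naiveIn(1L, 1L, 2L, 3L)); // true
    }
}
```

In a Pig UDF the same check could in principle be done once against the input schema rather than per tuple, but that design choice is an assumption here, not something stated in the issue.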
[jira] [Commented] (DATAFU-48) Upgrade Guava to 17.0
[ https://issues.apache.org/jira/browse/DATAFU-48?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15611419#comment-15611419 ] Eyal Allweil commented on DATAFU-48: What was the symptom of the problem with Guava 17.0? I updated it to 17.0 and ran _gradlew clean test_, and all the pig and hourglass tests passed. In fact, I also tried updating to Guava 19.0 (the latest version) and that built successfully, too. > Upgrade Guava to 17.0 > - > > Key: DATAFU-48 > URL: https://issues.apache.org/jira/browse/DATAFU-48 > Project: DataFu > Issue Type: Improvement >Reporter: Philip (flip) Kromer >Assignee: Philip (flip) Kromer >Priority: Minor > Labels: build, dependency, guava, version > Attachments: 0001-DATAFU-48-Upgrade-gradle-to-17.0.patch > > > Specifically motivated by the improvements to hashing library, but also > because we're six versions behind at the moment. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DATAFU-41) BagGroup does not name bag field in some cases
[ https://issues.apache.org/jira/browse/DATAFU-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15616338#comment-15616338 ] Eyal Allweil commented on DATAFU-41: I'm trying to see if I understand what's happening here. Do you mean replacing the line
{noformat}
data3 = FOREACH data2 GENERATE group as id, BagGroup(data,data.key) as grouped;
{noformat}
with
{noformat}
data3 = FOREACH data2 GENERATE group as id, BagGroup(data.(key,val),data.key) as grouped;
{noformat}
When I do this, I do get the schema for data3 as described above, without a name for the grouped data that BagGroup returns. But is this really a bug? It's receiving a bag without a name as input, so what name can it give? The name _data_ isn't being passed to the UDF at all in this case. (I debugged and looked at the input schema's value in _BagGroup.getOutputSchema()_.)
> BagGroup does not name bag field in some cases > -- > > Key: DATAFU-41 > URL: https://issues.apache.org/jira/browse/DATAFU-41 > Project: DataFu > Issue Type: Bug >Reporter: Matthew Hayes > > For this test:
> {code}
> /**
> define BagSum datafu.pig.bags.BagSum();
> define BagGroup datafu.pig.bags.BagGroup();
>
> data = LOAD 'input' USING PigStorage(',') AS (id:int, key:chararray, val:int);
> describe data;
>
> data2 = GROUP data BY id;
>
> describe data2;
>
> data3 = FOREACH data2 GENERATE group as id, BagGroup(data,data.key) as grouped;
>
> describe data3;
>
> data4 = FOREACH data3 {
>   summed = FOREACH grouped GENERATE group as key, SUM($1.val) as total;
>   ordered = ORDER summed BY key;
>   GENERATE id, ordered;
> }
>
> describe data4;
>
> STORE data4 INTO 'output';
> */
> @Multiline
> private String bagSumTest;
>
> @Test
> public void bagSumTest() throws Exception
> {
>   PigTest test = createPigTestFromString(bagSumTest);
>   writeLinesToFile("input", "1,A,1","1,B,2","2,A,3","3,A,4","1,C,5","1,C,6",
>                    "3,A,7","2,B,8","1,A,9","2,A,10");
>   test.runScript();
>   assertOutput(test, "data4",
>                "(1,{(A,10),(B,2),(C,11)})",
>                "(2,{(A,13),(B,8)})",
>                "(3,{(A,11)})");
> }
> {code}
> {{data3}} is described as:
> {code}
> data3: {id: int,grouped: {(group: chararray,data: {(id: int,key: chararray,val: int)})}}
> {code}
> However, if we change {{data}} to {{data.(key,val)}} then {{data3}} is described as:
> {code}
> data3: {id: int,grouped: {(group: chararray,{(key: chararray,val: int)})}}
> {code}
> Note that there is no name, so you have to reference it by {{$1}}. There is a separate issue, DATAFU-40, where even when it has the name {{data}} you can run into problems later.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (DATAFU-123) Allow DataFu to include macros
Eyal Allweil created DATAFU-123: --- Summary: Allow DataFu to include macros Key: DATAFU-123 URL: https://issues.apache.org/jira/browse/DATAFU-123 Project: DataFu Issue Type: Improvement Reporter: Eyal Allweil Assignee: Eyal Allweil
A few changes to allow macros to be contributed to DataFu. If a macro file is placed in src/main/resources, it can be used by registering the DataFu jar. Such macros can then be tested both from within Eclipse and Gradle. There are three small parts:
1) All unit tests that use createPigTest methods will automatically register the DataFu jar.
2) Some changes to fix the PigTests.getJarPath() functionality, which doesn't appear to work. (These changes are aligned with the proposed patch for [DATAFU-106|https://issues.apache.org/jira/browse/DATAFU-106].)
3) A sample macro and test.
The changes here will allow moving forward with [DATAFU-61|https://issues.apache.org/jira/browse/DATAFU-61] and including the macro I suggested for [DATAFU-119|https://issues.apache.org/jira/browse/DATAFU-119]. (I also have additional content in mind.) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DATAFU-119) New UDF - TupleDiff
[ https://issues.apache.org/jira/browse/DATAFU-119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15793097#comment-15793097 ] Eyal Allweil commented on DATAFU-119: - If we add DATAFU-123, we can include the macro I put in the description, so that people can call the UDF conveniently without duplicating the macro themselves.
> New UDF - TupleDiff > --- > > Key: DATAFU-119 > URL: https://issues.apache.org/jira/browse/DATAFU-119 > Project: DataFu > Issue Type: New Feature >Reporter: Eyal Allweil >Assignee: Eyal Allweil
>
> A UDF that, given two tuples, prints out the differences between them in human-readable form. This is not meant for production - we use it in PayPal for regression tests, to compare the results of two runs. Differences are calculated based on position, but the tuples' schemas are used, if available, for displaying more friendly results. If no schema is available the output uses field numbers.
> It should be used when you want a more fine-grained description of what has changed, unlike [org.apache.pig.builtin.DIFF|https://pig.apache.org/docs/r0.14.0/func.html#diff]. Also, because DIFF takes as its input two bags to be compared, they must fit in memory. This UDF only takes one pair of tuples at a time, so it can run on large inputs.
> We use a macro much like the following in conjunction with this UDF:
> {noformat}
> DEFINE diff_macro(diff_macro_old, diff_macro_new, diff_macro_pk, diff_macro_ignored_field) returns diffs {
>   DEFINE TupleDiff datafu.pig.util.TupleDiff;
>
>   old = FOREACH $diff_macro_old GENERATE $diff_macro_pk, TOTUPLE(*) AS original;
>   new = FOREACH $diff_macro_new GENERATE $diff_macro_pk, TOTUPLE(*) AS original;
>
>   join_data = JOIN new BY $diff_macro_pk full, old BY $diff_macro_pk;
>
>   join_data = FOREACH join_data GENERATE TupleDiff(old::original, new::original, '$diff_macro_ignored_field') AS tupleDiff, old::original, new::original;
>
>   $diffs = FILTER join_data BY tupleDiff IS NOT NULL ;
> };
> {noformat}
> Currently, the output from the macro looks like this (when comma-separated):
> {noformat}
> added,,
> missing,,
> changed field2 field4,,
> {noformat}
> The UDF takes a variable number of parameters - the two tuples to be compared, and any number of field names or numbers to be ignored. We use this to ignore fields representing execution or creation time (the macro I've given as an example assumes only one ignored field).
> The current implementation "drills down" into tuples, but not bags or maps - tuple boundaries are indicated with parentheses, like this:
> {noformat}
> changed outerEmbeddedTuple(innerEmbeddedTuple(fieldNameThatIsDifferent) innerEmbeddedTuple(anotherFieldThatIsDifferent))
> {noformat}
> I have a few final things left to do and then I'll put it up on reviewboard.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (DATAFU-106) Test files should be created in a subfolder of projects
[ https://issues.apache.org/jira/browse/DATAFU-106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15817823#comment-15817823 ] Eyal Allweil commented on DATAFU-106: - [~takias], I will try to sort out our Jira issues and mark those that are easier to begin with. Have you worked on Pig UDFs before? Piyush - I will try to finish our review as soon as I can! > Test files should be created in a subfolder of projects > --- > > Key: DATAFU-106 > URL: https://issues.apache.org/jira/browse/DATAFU-106 > Project: DataFu > Issue Type: Improvement >Reporter: Matthew Hayes >Priority: Minor > Fix For: 1.3.1 > > > Test files are currently created in the subdirectory folder (e.g. > datafu-pig/input*). For better organization, we should create them in a > subdirectory. This also makes it easier to exclude them all with gitignore. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (DATAFU-123) Allow DataFu to include macros
[ https://issues.apache.org/jira/browse/DATAFU-123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eyal Allweil updated DATAFU-123: Attachment: DATAFU-123.patch The change ended up being smaller than what I originally described - all I did was add the "pig.import.search.path" property with the value of the src/main/resources directory to PigTests. This means that any macro files that are put there can be tested, both in Gradle and Eclipse. I put some sample counting macros there and a test for them. In general, any macro file placed in src/main/resources can be used by registering the DataFu jar. If we include this patch, we should update the Contributing page so that instructions for contributing Pig macros are easy to find and understand. > Allow DataFu to include macros > --- > > Key: DATAFU-123 > URL: https://issues.apache.org/jira/browse/DATAFU-123 > Project: DataFu > Issue Type: Improvement >Reporter: Eyal Allweil >Assignee: Eyal Allweil > Labels: testability > Attachments: DATAFU-123.patch > > > A few changes to allow macros to be contributed to DataFu. If a macro file is > placed in src/main/resources, it can be used by registering the DataFu jar. > Such macros can then be tested both from within Eclipse and Gradle. > There are three small parts: > 1) All unit tests that use createPigTest methods will automatically register > the DataFu jar. > 2) Some changes to fix the PigTests.getJarPath() functionality, which doesn't > appear to work. (these changes are aligned with the proposed patch for > [DATAFU-106|https://issues.apache.org/jira/browse/DATAFU-106]) > 3) A sample macro and test > The changes here will allow moving forward with > [DATAFU-61|https://issues.apache.org/jira/browse/DATAFU-61] and including the > macro I suggested for > [DATAFU-119|https://issues.apache.org/jira/browse/DATAFU-119]. (I also have > additional content in mind) -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (DATAFU-12) Implement Lead UDF based on version from SQL
[ https://issues.apache.org/jira/browse/DATAFU-12?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15972397#comment-15972397 ] Eyal Allweil commented on DATAFU-12: It looks like this functionality is implemented in Hive - see the following two links: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics#LanguageManualWindowingAndAnalytics-LEADusingdefault1rowleadandnotspecifyingdefaultvalue https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFLead.java Since Pig now supports using Hive UDFs, I think this Jira can be closed. Alternatively, if we want to provide a DataFu implementation, I'll copy the proposed patch and discussion from the GitHub issue mentioned in the description, so it's easier for a future implementer to continue where work stalled.
> Implement Lead UDF based on version from SQL > > > Key: DATAFU-12 > URL: https://issues.apache.org/jira/browse/DATAFU-12 > Project: DataFu > Issue Type: New Feature >Reporter: Matthew Hayes
>
> Min Zhou has provided this suggestion ([Issue #88 on GitHub|https://github.com/linkedin/datafu/pull/88]):
> Lead is an analytic function like Oracle's Lead function. It provides access to more than one tuple of a bag at the same time without a self join. Given a bag of tuples returned from a query, LEAD provides access to a tuple at a given physical offset beyond that position. Generates pairs of all items in a bag.
> If you do not specify offset, then its default is 1. Null is returned if the offset goes beyond the scope of the bag.
> Example 1:
> {noformat}
>    register ba-pig-0.1.jar
>    define Lead datafu.pig.bags.Lead('2');
>    -- INPUT: ({(1),(2),(3),(4)})
>    data = LOAD 'input' AS (data: bag {T: tuple(v:INT)});
>    describe data;
>    -- OUTPUT: ({((1),(2),(3)),((2),(3),(4)),((3),(4),),((4),,)})
>    -- OUTPUT SCHEMA: data2: {lead_data: {(elem0: (v: int),elem1: (v: int),elem2: (v: int))}}
>    data2 = FOREACH data GENERATE Lead(data);
>    describe data2;
>    DUMP data2;
> {noformat}
> Example 2:
> {noformat}
>    register ba-pig-0.1.jar
>    define Lead datafu.pig.bags.Lead();
>    -- INPUT: ({(10,{(1),(2),(3)}),(20,{(4),(5),(6)}),(30,{(7),(8)}),(40,{(9),(10),(11)}),(50,{(12),(13),(14),(15)})})
>    data = LOAD 'input' AS (data: bag {T: tuple(v1:INT,B: bag{T: tuple(v2:INT)})});
>    --describe data;
>    -- OUTPUT: ({((10,{(1),(2),(3)}),(20,{(4),(5),(6)})),((20,{(4),(5),(6)}),(30,{(7),(8)})),((30,{(7),(8)}),(40,{(9),(10),(11)})),((40,{(9),(10),(11)}),(50,{(12),(13),(14),(15)})),((50,{(12),(13),(14),(15)}),)})
>    data2 = FOREACH data GENERATE Lead(data);
>    --describe data2;
>    DUMP data2;
> {noformat}
-- This message was sent by Atlassian JIRA (v6.3.15#6346)
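The semantics described for Lead (each item paired with the items at the following offsets, padded with null past the end of the bag) can be sketched over a plain Java list. This is only an illustration of the behavior shown in Example 1, not the proposed patch or Hive's GenericUDFLead; the class `LeadSketch` and its `lead` method are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch only; not the proposed Lead UDF and not Hive's GenericUDFLead. */
final class LeadSketch {

    /** For each item, returns the item followed by the next `offset` items, null-padded at the end. */
    static <T> List<List<T>> lead(List<T> bag, int offset) {
        if (offset < 1) {
            throw new IllegalArgumentException("offset must be at least 1");
        }
        List<List<T>> out = new ArrayList<>(bag.size());
        for (int i = 0; i < bag.size(); i++) {
            List<T> row = new ArrayList<>(offset + 1);
            for (int k = 0; k <= offset; k++) {
                int idx = i + k;
                // Positions beyond the end of the bag are filled with null.
                row.add(idx < bag.size() ? bag.get(idx) : null);
            }
            out.add(row);
        }
        return out;
    }

    public static void main(String[] args) {
        // Mirrors Example 1 from the issue description (offset 2).
        System.out.println(lead(Arrays.asList(1, 2, 3, 4), 2));
        // [[1, 2, 3], [2, 3, 4], [3, 4, null], [4, null, null]]
    }
}
```

This version indexes into a list held in memory; a Pig UDF working on large bags would more likely keep a sliding window of `offset + 1` tuples, which is an assumption about a possible design rather than anything stated in the issue.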
[jira] [Commented] (DATAFU-124) sessionize() ought to support millisecond periods
[ https://issues.apache.org/jira/browse/DATAFU-124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16067884#comment-16067884 ] Eyal Allweil commented on DATAFU-124: - I reviewed it - looks fine, a nice improvement. I'll try to get it committed soon (unless of course someone has any actionable comments) > sessionize() ought to support millisecond periods > - > > Key: DATAFU-124 > URL: https://issues.apache.org/jira/browse/DATAFU-124 > Project: DataFu > Issue Type: Bug >Reporter: Jacob Tolar > > The sessionize UDF should support a period in milliseconds. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DATAFU-124) sessionize() ought to support millisecond periods
[ https://issues.apache.org/jira/browse/DATAFU-124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16084560#comment-16084560 ] Eyal Allweil commented on DATAFU-124: - Committed. Thanks [~jtolar]! > sessionize() ought to support millisecond periods > - > > Key: DATAFU-124 > URL: https://issues.apache.org/jira/browse/DATAFU-124 > Project: DataFu > Issue Type: Bug >Reporter: Jacob Tolar > > The sessionize UDF should support a period in milliseconds. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DATAFU-124) sessionize() ought to support millisecond periods
[ https://issues.apache.org/jira/browse/DATAFU-124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16086145#comment-16086145 ] Eyal Allweil commented on DATAFU-124: - I'd like to resolve this issue, but it looks like version 1.3.2 isn't marked as released, and 1.3.3 doesn't exist yet (in Jira). [~matterhayes] - do I have permissions to do this? Do you know how to "release" versions in Jira? > sessionize() ought to support millisecond periods > - > > Key: DATAFU-124 > URL: https://issues.apache.org/jira/browse/DATAFU-124 > Project: DataFu > Issue Type: Bug >Reporter: Jacob Tolar > > The sessionize UDF should support a period in milliseconds. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Resolved] (DATAFU-124) sessionize() ought to support millisecond periods
[ https://issues.apache.org/jira/browse/DATAFU-124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eyal Allweil resolved DATAFU-124. - Resolution: Fixed Assignee: Eyal Allweil Fix Version/s: 1.3.3 > sessionize() ought to support millisecond periods > - > > Key: DATAFU-124 > URL: https://issues.apache.org/jira/browse/DATAFU-124 > Project: DataFu > Issue Type: Bug >Reporter: Jacob Tolar >Assignee: Eyal Allweil > Fix For: 1.3.3 > > > The sessionize UDF should support a period in milliseconds. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DATAFU-83) InUDF does not validate that types are compatible
[ https://issues.apache.org/jira/browse/DATAFU-83?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16106894#comment-16106894 ] Eyal Allweil commented on DATAFU-83: Hi Kyle ([~ItsAUsernameRight?]) Your help is very welcome. I have two comments about the state of the contribution - I'll put them both here and in the review board for maximum visibility. 1. I think the output schema of this UDF is always boolean, not the schema of the first input field. I would make the outputSchema method identical to that in an existing Boolean UDF - for example, [Pig's ENDSWITH built-in function|https://github.com/apache/pig/blob/trunk/src/org/apache/pig/builtin/ENDSWITH.java#L62] 2. As Matthew already wrote in the review board, adding a case to the unit test is a good idea - you can probably just duplicate something from [the existing test|https://github.com/apache/incubator-datafu/blob/master/datafu-pig/src/test/java/datafu/test/pig/util/InTests.java]. Thanks! > InUDF does not validate that types are compatible > - > > Key: DATAFU-83 > URL: https://issues.apache.org/jira/browse/DATAFU-83 > Project: DataFu > Issue Type: Improvement >Reporter: Matthew Hayes >Priority: Minor > Attachments: DATAFU-83.patch, rb36702.patch > > > See the example below. The input data is a long, but ints are provided to > match against. Because it uses the Java equals to compare and these are > different types, this will never match, which can lead to confusing results. > I believe it should at least throw an error. > {code} > define I datafu.pig.util.InUDF(); > > data = LOAD 'input' AS (B: bag {T: tuple(v:LONG)}); > > data2 = FOREACH data { > C = FILTER B By I(v, 1,2,3); > GENERATE C; > } > > describe data2; > > STORE data2 INTO 'output'; > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DATAFU-119) New UDF - TupleDiff
[ https://issues.apache.org/jira/browse/DATAFU-119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16114867#comment-16114867 ] Eyal Allweil commented on DATAFU-119: - Sure. I'll do it as soon as I can. > New UDF - TupleDiff > --- > > Key: DATAFU-119 > URL: https://issues.apache.org/jira/browse/DATAFU-119 > Project: DataFu > Issue Type: New Feature >Reporter: Eyal Allweil >Assignee: Eyal Allweil -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DATAFU-119) New UDF - TupleDiff
[ https://issues.apache.org/jira/browse/DATAFU-119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16115744#comment-16115744 ] Eyal Allweil commented on DATAFU-119: - [~matterhayes] - We want the Apache license header on our macro files too, right? If so, I'll add it to the sample macro from [DATAFU-123|https://issues.apache.org/jira/browse/DATAFU-123] as well. > New UDF - TupleDiff > --- > > Key: DATAFU-119 > URL: https://issues.apache.org/jira/browse/DATAFU-119 > Project: DataFu > Issue Type: New Feature >Reporter: Eyal Allweil >Assignee: Eyal Allweil > > A UDF that given two tuples, prints out the differences between them in > human-readable form. This is not meant for production - we use it in PayPal > for regression tests, to compare the results of two runs. Differences are > calculated based on position, but the tuples' schemas are used, if available, > for displaying more friendly results. If no schema is available the output > uses field numbers. > It should be used when you want a more fine-grained description of what has > changed, unlike > [org.apache.pig.builtin.DIFF|https://pig.apache.org/docs/r0.14.0/func.html#diff]. > Also, because DIFF takes as its input two bags to be compared, they must fit > in memory. This UDF only takes one pair of tuples at a time, so it can run on > large inputs. 
> We use a macro much like the following in conjunction with this UDF: > {noformat} > DEFINE diff_macro(diff_macro_old, diff_macro_new, diff_macro_pk, > diff_macro_ignored_field) returns diffs { > DEFINE TupleDiff datafu.pig.util.TupleDiff; > > old = FOREACH $diff_macro_old GENERATE $diff_macro_pk, TOTUPLE(*) AS > original; > new = FOREACH $diff_macro_new GENERATE $diff_macro_pk, TOTUPLE(*) AS > original; > > join_data = JOIN new BY $diff_macro_pk full, old BY $diff_macro_pk; > > join_data = FOREACH join_data GENERATE TupleDiff(old::original, > new::original, '$diff_macro_ignored_field') AS tupleDiff, old::original, > new::original; > > $diffs = FILTER join_data BY tupleDiff IS NOT NULL ; > }; > {noformat} > Currently, the output from the macro looks like this (when comma-separated): > {noformat} > added,, > missing,, > changed field2 field4,, > {noformat} > The UDF takes a variable number of parameters - the two tuples to be > compared, and any number of field names or numbers to be ignored. We use this > to ignore fields representing execution or creation time (the macro I've > given as an example assumes only one ignored field) > The current implementation "drills down" into tuples, but not bags or maps - > tuple boundaries are indicated with parentheses, like this: > {noformat} > changed outerEmbeddedTuple(innerEmbeddedTuple(fieldNameThatIsDifferent) > innerEmbeddedTuple(anotherFieldThatIsDifferent)) > {noformat} > I have a few final things left to do and then I'll put it up on reviewboard. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
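The positional comparison described in the TupleDiff issue can be sketched roughly as follows. This is a simplified Python illustration, not the UDF's actual code: the function name `tuple_diff`, its parameters, and the reading of a missing old tuple as "added" and a missing new tuple as "missing" are assumptions, and nested tuples are not drilled into.

```python
def tuple_diff(old, new, field_names=None, ignored=()):
    """Hypothetical, simplified model of a positional tuple diff.

    old / new   -- tuples (or None when one side of the full join is absent)
    field_names -- optional schema names; field numbers are used when absent
    ignored     -- field names or positions to skip (e.g. creation time)
    """
    if old is None and new is None:
        return None
    if old is None:          # assumption: row exists only in the new run
        return "added"
    if new is None:          # assumption: row exists only in the old run
        return "missing"

    changed = []
    for i in range(max(len(old), len(new))):
        # Prefer the schema name for display, fall back to the position.
        name = field_names[i] if field_names and i < len(field_names) else str(i)
        if name in ignored or i in ignored:
            continue
        a = old[i] if i < len(old) else None
        b = new[i] if i < len(new) else None
        if a != b:
            changed.append(name)
    return "changed " + " ".join(changed) if changed else None
```

Returning `None` for identical tuples mirrors the macro's final `FILTER ... IS NOT NULL` step, which keeps only rows that differ.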
[jira] [Updated] (DATAFU-61) Add TF-IDF Macro to DataFu
[ https://issues.apache.org/jira/browse/DATAFU-61?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eyal Allweil updated DATAFU-61: --- Attachment: DATAFU-61-2.patch Now that macros are supported (and can be tested), I updated this patch. Unfortunately, I couldn't find the sample data, so I just pulled the sample sentences from the Wikipedia page for TF-IDF, and I didn't verify that the results are OK. [~russell.jurney] - want to donate a test case and expected results? > Add TF-IDF Macro to DataFu > -- > > Key: DATAFU-61 > URL: https://issues.apache.org/jira/browse/DATAFU-61 > Project: DataFu > Issue Type: New Feature >Affects Versions: 1.3.0 >Reporter: Russell Jurney > Attachments: DATAFU-61-2.patch, DATAFU-61.patch, DATAFU-61.patch > > > The first macro I would like to add is a Term Frequency, Inverse Document > Frequency implementation. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DATAFU-61) Add TF-IDF Macro to DataFu
[ https://issues.apache.org/jira/browse/DATAFU-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16161118#comment-16161118 ] Eyal Allweil commented on DATAFU-61: Came back to this today and tried a little experiment - I verified (calculating manually) that Russell's code produces the same results as the "augmented TF" IDF flavor for the sample I took from the Wikipedia page. Is that good enough for us? > Add TF-IDF Macro to DataFu > -- > > Key: DATAFU-61 > URL: https://issues.apache.org/jira/browse/DATAFU-61 > Project: DataFu > Issue Type: New Feature >Affects Versions: 1.3.0 >Reporter: Russell Jurney > Attachments: DATAFU-61-2.patch, DATAFU-61.patch, DATAFU-61.patch > > > The first macro I would like to add is a Term Frequency, Inverse Document > Frequency implementation. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
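For readers who want to repeat the manual check mentioned in this comment, the "augmented TF" variant of TF-IDF can be sketched in Python as below. This is an illustration of the textbook formula (tf = 0.5 + 0.5 * count / max count in the document, idf = log(N / document frequency)), not the macro's code; the function name and the use of the natural logarithm are assumptions.

```python
import math
from collections import Counter


def augmented_tf_idf(docs):
    """docs maps a document id to its list of tokens.

    Returns a dict keyed by (doc_id, term) with the augmented TF-IDF score.
    """
    n_docs = len(docs)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))

    scores = {}
    for doc_id, tokens in docs.items():
        counts = Counter(tokens)
        if not counts:
            continue  # an empty document contributes no terms
        max_count = max(counts.values())
        for term, count in counts.items():
            tf = 0.5 + 0.5 * count / max_count   # augmented term frequency
            idf = math.log(n_docs / df[term])    # assumption: natural log
            scores[(doc_id, term)] = tf * idf
    return scores
```

A term that appears in every document gets an IDF of zero, so its score is zero regardless of how often it occurs.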
[jira] [Commented] (DATAFU-119) New UDF - TupleDiff
[ https://issues.apache.org/jira/browse/DATAFU-119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16161198#comment-16161198 ] Eyal Allweil commented on DATAFU-119: - I added the macro to the jar. > New UDF - TupleDiff > --- > > Key: DATAFU-119 > URL: https://issues.apache.org/jira/browse/DATAFU-119 > Project: DataFu > Issue Type: New Feature >Reporter: Eyal Allweil >Assignee: Eyal Allweil > > A UDF that given two tuples, prints out the differences between them in > human-readable form. This is not meant for production - we use it in PayPal > for regression tests, to compare the results of two runs. Differences are > calculated based on position, but the tuples' schemas are used, if available, > for displaying more friendly results. If no schema is available the output > uses field numbers. > It should be used when you want a more fine-grained description of what has > changed, unlike > [org.apache.pig.builtin.DIFF|https://pig.apache.org/docs/r0.14.0/func.html#diff]. > Also, because DIFF takes as its input two bags to be compared, they must fit > in memory. This UDF only takes one pair of tuples at a time, so it can run on > large inputs. 
> We use a macro much like the following in conjunction with this UDF: > {noformat} > DEFINE diff_macro(diff_macro_old, diff_macro_new, diff_macro_pk, > diff_macro_ignored_field) returns diffs { > DEFINE TupleDiff datafu.pig.util.TupleDiff; > > old = FOREACH $diff_macro_old GENERATE $diff_macro_pk, TOTUPLE(*) AS > original; > new = FOREACH $diff_macro_new GENERATE $diff_macro_pk, TOTUPLE(*) AS > original; > > join_data = JOIN new BY $diff_macro_pk full, old BY $diff_macro_pk; > > join_data = FOREACH join_data GENERATE TupleDiff(old::original, > new::original, '$diff_macro_ignored_field') AS tupleDiff, old::original, > new::original; > > $diffs = FILTER join_data BY tupleDiff IS NOT NULL ; > }; > {noformat} > Currently, the output from the macro looks like this (when comma-separated): > {noformat} > added,, > missing,, > changed field2 field4,, > {noformat} > The UDF takes a variable number of parameters - the two tuples to be > compared, and any number of field names or numbers to be ignored. We use this > to ignore fields representing execution or creation time (the macro I've > given as an example assumes only one ignored field) > The current implementation "drills down" into tuples, but not bags or maps - > tuple boundaries are indicated with parentheses, like this: > {noformat} > changed outerEmbeddedTuple(innerEmbeddedTuple(fieldNameThatIsDifferent) > innerEmbeddedTuple(anotherFieldThatIsDifferent)) > {noformat} > I have a few final things left to do and then I'll put it up on reviewboard. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DATAFU-83) InUDF does not validate that types are compatible
[ https://issues.apache.org/jira/browse/DATAFU-83?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16161211#comment-16161211 ] Eyal Allweil commented on DATAFU-83: By the way, [~ItsAUsernameRight?], if you're already looking at InUDF, and you'd like another contribution afterwards, you can also look at [DATAFU-80|https://issues.apache.org/jira/browse/DATAFU-80] - it's another small change to improve InUDF's behavior. (you can ignore the second part of that issue, which deals with Java versions). > InUDF does not validate that types are compatible > - > > Key: DATAFU-83 > URL: https://issues.apache.org/jira/browse/DATAFU-83 > Project: DataFu > Issue Type: Improvement >Reporter: Matthew Hayes >Priority: Minor > Attachments: DATAFU-83.patch, rb36702.patch > > > See the example below. The input data is a long, but ints are provided to > match against. Because it uses the Java equals to compare and these are > different types, this will never match, which can lead to confusing results. > I believe it should at least throw an error. > {code} > define I datafu.pig.util.InUDF(); > > data = LOAD 'input' AS (B: bag {T: tuple(v:LONG)}); > > data2 = FOREACH data { > C = FILTER B By I(v, 1,2,3); > GENERATE C; > } > > describe data2; > > STORE data2 INTO 'output'; > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Assigned] (DATAFU-126) There is a typo in document
[ https://issues.apache.org/jira/browse/DATAFU-126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eyal Allweil reassigned DATAFU-126: --- Assignee: Eyal Allweil > There is a typo in document > --- > > Key: DATAFU-126 > URL: https://issues.apache.org/jira/browse/DATAFU-126 > Project: DataFu > Issue Type: Improvement >Reporter: Kane >Assignee: Eyal Allweil >Priority: Minor > > It should be "functions" not "functiosn" in the document page > https://datafu.incubator.apache.org/docs/datafu/guide.html. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DATAFU-126) There is a typo in document
[ https://issues.apache.org/jira/browse/DATAFU-126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16161252#comment-16161252 ] Eyal Allweil commented on DATAFU-126: - Thanks Kane! I've fixed this in our sources, and it will show up when we release our next version. > There is a typo in document > --- > > Key: DATAFU-126 > URL: https://issues.apache.org/jira/browse/DATAFU-126 > Project: DataFu > Issue Type: Improvement >Reporter: Kane >Assignee: Eyal Allweil >Priority: Minor > > It should be "functions" not "functiosn" in the document page > https://datafu.incubator.apache.org/docs/datafu/guide.html. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Resolved] (DATAFU-126) There is a typo in document
[ https://issues.apache.org/jira/browse/DATAFU-126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eyal Allweil resolved DATAFU-126. - Resolution: Fixed > There is a typo in document > --- > > Key: DATAFU-126 > URL: https://issues.apache.org/jira/browse/DATAFU-126 > Project: DataFu > Issue Type: Improvement >Reporter: Kane >Assignee: Eyal Allweil >Priority: Minor > > It should be "functions" not "functiosn" in the document page > https://datafu.incubator.apache.org/docs/datafu/guide.html. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (DATAFU-127) New macro - sample by keys
Eyal Allweil created DATAFU-127: --- Summary: New macro - sample by keys Key: DATAFU-127 URL: https://issues.apache.org/jira/browse/DATAFU-127 Project: DataFu Issue Type: New Feature Reporter: Eyal Allweil Assignee: Eyal Allweil Two macros that return a sample of a larger table based on a list of keys, with the schema of the larger table. One of the macros filters by dates, the other doesn't. If there are multiple rows with a key that appears in the key list, all of them will be returned (no deduplication is done). The results are returned ordered by the key field in a single file. The implementation uses a replicated join for efficiency, which means the key list must be small enough to fit in memory. The first macro's definition looks as follows: DEFINE sample_by_keys(table, sample_set, join_key_table, join_key_sample) returns out { - table_name - table name to sample - sample_set - a set of keys - join_key_table - join column name in the table - join_key_sample - join column name in the sample -- This message was sent by Atlassian JIRA (v6.4.14#64029)
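The behaviour described for the first macro can be modelled in a few lines of Python. This is only a sketch of the semantics, with rows represented as dicts (an assumption for illustration); holding the key list in an in-memory set plays the role of the replicated join's small side.

```python
def sample_by_keys(table, sample_set, join_key_table, join_key_sample):
    """Return every row of `table` whose key appears in `sample_set`.

    Rows are dicts (an illustrative assumption). No deduplication is done,
    and the result is ordered by the table's join key.
    """
    # The key list is held in memory, like the small side of a replicated join.
    keys = {row[join_key_sample] for row in sample_set}
    matched = [row for row in table if row[join_key_table] in keys]
    return sorted(matched, key=lambda row: row[join_key_table])
```

Because the output keeps the table's rows unchanged, it has the schema of the larger table, as the description requires.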
[jira] [Updated] (DATAFU-127) New macro - sample by keys
[ https://issues.apache.org/jira/browse/DATAFU-127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eyal Allweil updated DATAFU-127: Attachment: DATAFU-127.patch Patch including new macros and tests > New macro - sample by keys > -- > > Key: DATAFU-127 > URL: https://issues.apache.org/jira/browse/DATAFU-127 > Project: DataFu > Issue Type: New Feature >Reporter: Eyal Allweil >Assignee: Eyal Allweil > Labels: macro > Attachments: DATAFU-127.patch > > > Two macros that return a sample of a larger table based on a list of keys, > with the schema of the larger table. One of the macros filters by dates, the > other doesn't. > If there are multiple rows with a key that appears in the key list, all of > them will be returned (no deduplication is done). The results are returned > ordered by the key field in a single file. > The implementation uses a replicated join for efficiency, which means the > key list must be small enough to fit in memory. > The first macro's definition looks as follows: > DEFINE sample_by_keys(table, sample_set, join_key_table, join_key_sample) > returns out { > - table_name - table name to sample > - sample_set - a set of keys > - join_key_table - join column name in the table > - join_key_sample - join column name in the sample -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (DATAFU-61) Add TF-IDF Macro to DataFu
[ https://issues.apache.org/jira/browse/DATAFU-61?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eyal Allweil updated DATAFU-61: --- Labels: macro (was: ) > Add TF-IDF Macro to DataFu > -- > > Key: DATAFU-61 > URL: https://issues.apache.org/jira/browse/DATAFU-61 > Project: DataFu > Issue Type: New Feature >Affects Versions: 1.3.0 >Reporter: Russell Jurney > Labels: macro > Attachments: DATAFU-61-2.patch, DATAFU-61.patch, DATAFU-61.patch > > > The first macro I would like to add is a Term Frequency, Inverse Document > Frequency implementation. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (DATAFU-128) Add documentation for macros
Eyal Allweil created DATAFU-128: --- Summary: Add documentation for macros Key: DATAFU-128 URL: https://issues.apache.org/jira/browse/DATAFU-128 Project: DataFu Issue Type: Improvement Reporter: Eyal Allweil Now that it is possible to add Pig macros to DataFu, we should update the documentation to reflect this, provide guidelines, and point would-be contributors to examples. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (DATAFU-129) New macro - dedup
Eyal Allweil created DATAFU-129: --- Summary: New macro - dedup Key: DATAFU-129 URL: https://issues.apache.org/jira/browse/DATAFU-129 Project: DataFu Issue Type: New Feature Reporter: Eyal Allweil Assignee: Eyal Allweil Macro used to dedup (de-duplicate) a table, based on a key or keys and an ordering (typically a date updated field). One thing to consider - the implementation relies on the ExtremalTupleByNthField UDF in PiggyBank. I've added it to the test dependencies in order for the test to run. While I feel that anyone using Pig typically has PiggyBank in the classpath, this might not be true - do we have an alternative? (maybe adding it to the jarjar?) The macro's definition looks as follows: DEFINE dedup(relation, row_key, order_field) returns out { relation - relation to dedup row_key - field(s) for group by order_field - the field for ordering (to find the most recent record) -- This message was sent by Atlassian JIRA (v6.4.14#64029)
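The intended result of the dedup macro can be illustrated with a short Python sketch. It assumes that "most recent" means the largest value of the ordering field and that rows are dicts; both are assumptions made for illustration, and the sketch says nothing about how ExtremalTupleByNthField is used inside the macro.

```python
def dedup(relation, row_key, order_field):
    """Keep one row per key: the row with the largest `order_field`.

    relation    -- iterable of dict rows (an illustrative assumption)
    row_key     -- list of field names forming the grouping key
    order_field -- field used to pick the most recent record
    """
    latest = {}
    for row in relation:
        key = tuple(row[k] for k in row_key)
        # Replace the stored row only when this one is strictly newer.
        if key not in latest or row[order_field] > latest[key][order_field]:
            latest[key] = row
    return list(latest.values())
```

When two rows share both the key and the ordering value, this sketch keeps the first one it sees; the macro's tie-breaking behaviour is not stated in the issue.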
[jira] [Updated] (DATAFU-129) New macro - dedup
[ https://issues.apache.org/jira/browse/DATAFU-129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eyal Allweil updated DATAFU-129: Attachment: DATAFU-129.patch Macro and test > New macro - dedup > - > > Key: DATAFU-129 > URL: https://issues.apache.org/jira/browse/DATAFU-129 > Project: DataFu > Issue Type: New Feature >Reporter: Eyal Allweil >Assignee: Eyal Allweil > Labels: macro > Attachments: DATAFU-129.patch > > > Macro used to dedup (de-duplicate) a table, based on a key or keys and an > ordering (typically a date updated field). > One thing to consider - the implementation relies on the > ExtremalTupleByNthField UDF in PiggyBank. I've added it to the test > dependencies in order for the test to run. While I feel that anyone using Pig > typically has PiggyBank in the classpath, this might not be true - do we have > an alternative? (maybe adding it to the jarjar?) > The macro's definition looks as follows: > DEFINE dedup(relation, row_key, order_field) returns out { > relation - relation to dedup > row_key - field(s) for group by > order_field - the field for ordering (to find the most recent record) -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DATAFU-128) Add documentation for macros
[ https://issues.apache.org/jira/browse/DATAFU-128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16162936#comment-16162936 ] Eyal Allweil commented on DATAFU-128: - Is the documentation for updating the website accurate? It contains references to svn, which leads me to think it might no longer be relevant ... > Add documentation for macros > > > Key: DATAFU-128 > URL: https://issues.apache.org/jira/browse/DATAFU-128 > Project: DataFu > Issue Type: Improvement >Reporter: Eyal Allweil > > Now that it is possible to add Pig macros to DataFu, we should update the > documentation to reflect this, provide guidelines, and point would-be > contributors to examples. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (DATAFU-130) Add left outer join macro described in the DataFu guide
Eyal Allweil created DATAFU-130: --- Summary: Add left outer join macro described in the DataFu guide Key: DATAFU-130 URL: https://issues.apache.org/jira/browse/DATAFU-130 Project: DataFu Issue Type: New Feature Reporter: Eyal Allweil In our [guide|http://datafu.incubator.apache.org/blog/2013/09/04/datafu-1-0.html], a macro is described for making a three-way left outer join conveniently. We can add this macro to DataFu to make it even easier to use. The macro's code is as follows: {noformat} DEFINE left_outer_join(relation1, key1, relation2, key2, relation3, key3) returns joined { cogrouped = COGROUP $relation1 BY $key1, $relation2 BY $key2, $relation3 BY $key3; $joined = FOREACH cogrouped GENERATE FLATTEN($relation1), FLATTEN(EmptyBagToNullFields($relation2)), FLATTEN(EmptyBagToNullFields($relation3)); } (we would obviously want to add a test for this, too) {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DATAFU-61) Add TF-IDF Macro to DataFu
[ https://issues.apache.org/jira/browse/DATAFU-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16164373#comment-16164373 ] Eyal Allweil commented on DATAFU-61: One last thing - I noticed after I uploaded my patch that it has my email, but I think it would be better for it to have your email, [~russell.jurney], since all I did was write the test. Is it OK if I replace my email with yours before committing this, so we get a (more accurate) "eyal committed with russell" type commit? > Add TF-IDF Macro to DataFu > -- > > Key: DATAFU-61 > URL: https://issues.apache.org/jira/browse/DATAFU-61 > Project: DataFu > Issue Type: New Feature >Affects Versions: 1.3.0 >Reporter: Russell Jurney > Labels: macro > Attachments: DATAFU-61-2.patch, DATAFU-61.patch, DATAFU-61.patch > > > The first macro I would like to add is a Term Frequency, Inverse Document > Frequency implementation. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (DATAFU-119) New UDF - TupleDiff
[ https://issues.apache.org/jira/browse/DATAFU-119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eyal Allweil updated DATAFU-119: Attachment: DATAFU-119-2.patch > New UDF - TupleDiff > --- > > Key: DATAFU-119 > URL: https://issues.apache.org/jira/browse/DATAFU-119 > Project: DataFu > Issue Type: New Feature >Reporter: Eyal Allweil >Assignee: Eyal Allweil > Attachments: DATAFU-119-2.patch > > > A UDF that given two tuples, prints out the differences between them in > human-readable form. This is not meant for production - we use it in PayPal > for regression tests, to compare the results of two runs. Differences are > calculated based on position, but the tuples' schemas are used, if available, > for displaying more friendly results. If no schema is available the output > uses field numbers. > It should be used when you want a more fine-grained description of what has > changed, unlike > [org.apache.pig.builtin.DIFF|https://pig.apache.org/docs/r0.14.0/func.html#diff]. > Also, because DIFF takes as its input two bags to be compared, they must fit > in memory. This UDF only takes one pair of tuples at a time, so it can run on > large inputs. 
> We use a macro much like the following in conjunction with this UDF: > {noformat} > DEFINE diff_macro(diff_macro_old, diff_macro_new, diff_macro_pk, > diff_macro_ignored_field) returns diffs { > DEFINE TupleDiff datafu.pig.util.TupleDiff; > > old = FOREACH $diff_macro_old GENERATE $diff_macro_pk, TOTUPLE(*) AS > original; > new = FOREACH $diff_macro_new GENERATE $diff_macro_pk, TOTUPLE(*) AS > original; > > join_data = JOIN new BY $diff_macro_pk full, old BY $diff_macro_pk; > > join_data = FOREACH join_data GENERATE TupleDiff(old::original, > new::original, '$diff_macro_ignored_field') AS tupleDiff, old::original, > new::original; > > $diffs = FILTER join_data BY tupleDiff IS NOT NULL ; > }; > {noformat} > Currently, the output from the macro looks like this (when comma-separated): > {noformat} > added,, > missing,, > changed field2 field4,, > {noformat} > The UDF takes a variable number of parameters - the two tuples to be > compared, and any number of field names or numbers to be ignored. We use this > to ignore fields representing execution or creation time (the macro I've > given as an example assumes only one ignored field) > The current implementation "drills down" into tuples, but not bags or maps - > tuple boundaries are indicated with parentheses, like this: > {noformat} > changed outerEmbeddedTuple(innerEmbeddedTuple(fieldNameThatIsDifferent) > innerEmbeddedTuple(anotherFieldThatIsDifferent)) > {noformat} > I have a few final things left to do and then I'll put it up on reviewboard. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DATAFU-119) New UDF - TupleDiff
[ https://issues.apache.org/jira/browse/DATAFU-119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16165925#comment-16165925 ] Eyal Allweil commented on DATAFU-119: - The documentation can be part of [DATAFU-128|https://issues.apache.org/jira/browse/DATAFU-128]. > New UDF - TupleDiff > --- > > Key: DATAFU-119 > URL: https://issues.apache.org/jira/browse/DATAFU-119 > Project: DataFu > Issue Type: New Feature >Reporter: Eyal Allweil >Assignee: Eyal Allweil > Attachments: DATAFU-119-2.patch > > > A UDF that given two tuples, prints out the differences between them in > human-readable form. This is not meant for production - we use it in PayPal > for regression tests, to compare the results of two runs. Differences are > calculated based on position, but the tuples' schemas are used, if available, > for displaying more friendly results. If no schema is available the output > uses field numbers. > It should be used when you want a more fine-grained description of what has > changed, unlike > [org.apache.pig.builtin.DIFF|https://pig.apache.org/docs/r0.14.0/func.html#diff]. > Also, because DIFF takes as its input two bags to be compared, they must fit > in memory. This UDF only takes one pair of tuples at a time, so it can run on > large inputs. 
> We use a macro much like the following in conjunction with this UDF: > {noformat} > DEFINE diff_macro(diff_macro_old, diff_macro_new, diff_macro_pk, > diff_macro_ignored_field) returns diffs { > DEFINE TupleDiff datafu.pig.util.TupleDiff; > > old = FOREACH $diff_macro_old GENERATE $diff_macro_pk, TOTUPLE(*) AS > original; > new = FOREACH $diff_macro_new GENERATE $diff_macro_pk, TOTUPLE(*) AS > original; > > join_data = JOIN new BY $diff_macro_pk full, old BY $diff_macro_pk; > > join_data = FOREACH join_data GENERATE TupleDiff(old::original, > new::original, '$diff_macro_ignored_field') AS tupleDiff, old::original, > new::original; > > $diffs = FILTER join_data BY tupleDiff IS NOT NULL ; > }; > {noformat} > Currently, the output from the macro looks like this (when comma-separated): > {noformat} > added,, > missing,, > changed field2 field4,, > {noformat} > The UDF takes a variable number of parameters - the two tuples to be > compared, and any number of field names or numbers to be ignored. We use this > to ignore fields representing execution or creation time (the macro I've > given as an example assumes only one ignored field) > The current implementation "drills down" into tuples, but not bags or maps - > tuple boundaries are indicated with parentheses, like this: > {noformat} > changed outerEmbeddedTuple(innerEmbeddedTuple(fieldNameThatIsDifferent) > innerEmbeddedTuple(anotherFieldThatIsDifferent)) > {noformat} > I have a few final things left to do and then I'll put it up on reviewboard. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DATAFU-61) Add TF-IDF Macro to DataFu
[ https://issues.apache.org/jira/browse/DATAFU-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16165991#comment-16165991 ] Eyal Allweil commented on DATAFU-61: Yes, I'll merge it. I did respond to an open issue in the review request that I only just noticed, something about using COUNT vs. SUM when calculating the IDF part ... as far as I can tell, the existing code is OK but it wouldn't hurt if you or Russell want to take a look at it. > Add TF-IDF Macro to DataFu > -- > > Key: DATAFU-61 > URL: https://issues.apache.org/jira/browse/DATAFU-61 > Project: DataFu > Issue Type: New Feature >Affects Versions: 1.3.0 >Reporter: Russell Jurney > Labels: macro > Attachments: DATAFU-61-2.patch, DATAFU-61.patch, DATAFU-61.patch > > > The first macro I would like to add is a Term Frequency, Inverse Document > Frequency implementation. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Resolved] (DATAFU-61) Add TF-IDF Macro to DataFu
[ https://issues.apache.org/jira/browse/DATAFU-61?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eyal Allweil resolved DATAFU-61. Resolution: Fixed Assignee: Eyal Allweil Merged. > Add TF-IDF Macro to DataFu > -- > > Key: DATAFU-61 > URL: https://issues.apache.org/jira/browse/DATAFU-61 > Project: DataFu > Issue Type: New Feature >Affects Versions: 1.3.0 >Reporter: Russell Jurney >Assignee: Eyal Allweil > Labels: macro > Attachments: DATAFU-61-2.patch, DATAFU-61.patch, DATAFU-61.patch > > > The first macro I would like to add is a Term Frequency, Inverse Document > Frequency implementation. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DATAFU-130) Add left outer join macro described in the DataFu guide
[ https://issues.apache.org/jira/browse/DATAFU-130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16169225#comment-16169225 ] Eyal Allweil commented on DATAFU-130: - I think this is a good Jira issue to put in the [Apache Help Wanted site|https://helpwanted.apache.org/]. If there's no objection, I'll add it there. > Add left outer join macro described in the DataFu guide > --- > > Key: DATAFU-130 > URL: https://issues.apache.org/jira/browse/DATAFU-130 > Project: DataFu > Issue Type: New Feature >Reporter: Eyal Allweil > Labels: macro, newbie > > In our > [guide|http://datafu.incubator.apache.org/blog/2013/09/04/datafu-1-0.html], a > macro is described for making a three-way left outer join conveniently. We > can add this macro to DataFu to make it even easier to use. > The macro's code is as follows: > {noformat} > DEFINE left_outer_join(relation1, key1, relation2, key2, relation3, key3) > returns joined { > cogrouped = COGROUP $relation1 BY $key1, $relation2 BY $key2, $relation3 BY > $key3; > $joined = FOREACH cogrouped GENERATE > FLATTEN($relation1), > FLATTEN(EmptyBagToNullFields($relation2)), > FLATTEN(EmptyBagToNullFields($relation3)); > } > (we would obviously want to add a test for this, too) > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (DATAFU-130) Add left outer join macro described in the DataFu guide
[ https://issues.apache.org/jira/browse/DATAFU-130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eyal Allweil updated DATAFU-130: Description: In our [guide|http://datafu.incubator.apache.org/blog/2013/09/04/datafu-1-0.html], a macro is described for making a three-way left outer join conveniently. We can add this macro to DataFu to make it even easier to use. The macro's code is as follows: {noformat} DEFINE left_outer_join(relation1, key1, relation2, key2, relation3, key3) returns joined { cogrouped = COGROUP $relation1 BY $key1, $relation2 BY $key2, $relation3 BY $key3; $joined = FOREACH cogrouped GENERATE FLATTEN($relation1), FLATTEN(EmptyBagToNullFields($relation2)), FLATTEN(EmptyBagToNullFields($relation3)); } {noformat} (we would obviously want to add a test for this, too) was: In our [guide|http://datafu.incubator.apache.org/blog/2013/09/04/datafu-1-0.html], a macro is described for making a three-way left outer join conveniently. We can add this macro to DataFu to make it even easier to use. The macro's code is as follows: {noformat} DEFINE left_outer_join(relation1, key1, relation2, key2, relation3, key3) returns joined { cogrouped = COGROUP $relation1 BY $key1, $relation2 BY $key2, $relation3 BY $key3; $joined = FOREACH cogrouped GENERATE FLATTEN($relation1), FLATTEN(EmptyBagToNullFields($relation2)), FLATTEN(EmptyBagToNullFields($relation3)); } (we would obviously want to add a test for this, too) {noformat} > Add left outer join macro described in the DataFu guide > --- > > Key: DATAFU-130 > URL: https://issues.apache.org/jira/browse/DATAFU-130 > Project: DataFu > Issue Type: New Feature >Reporter: Eyal Allweil > Labels: macro, newbie > > In our > [guide|http://datafu.incubator.apache.org/blog/2013/09/04/datafu-1-0.html], a > macro is described for making a three-way left outer join conveniently. We > can add this macro to DataFu to make it even easier to use. 
> The macro's code is as follows: > {noformat} > DEFINE left_outer_join(relation1, key1, relation2, key2, relation3, key3) > returns joined { > cogrouped = COGROUP $relation1 BY $key1, $relation2 BY $key2, $relation3 BY > $key3; > $joined = FOREACH cogrouped GENERATE > FLATTEN($relation1), > FLATTEN(EmptyBagToNullFields($relation2)), > FLATTEN(EmptyBagToNullFields($relation3)); > } > {noformat} > (we would obviously want to add a test for this, too) -- This message was sent by Atlassian JIRA (v6.4.14#64029)
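The COGROUP-based macro above can be modelled in Python to show why it behaves as a three-way left outer join. This is a sketch under stated assumptions: rows are tuples, keys are given as positions, and the width of each relation is inferred from its first row; `empty_bag_to_null_fields` is a stand-in written for this example, not DataFu's EmptyBagToNullFields UDF.

```python
from collections import defaultdict
from itertools import product


def empty_bag_to_null_fields(bag, width):
    """Stand-in for the UDF: an empty bag becomes one all-null tuple."""
    return bag if bag else [(None,) * width]


def left_outer_join(rel1, key1, rel2, key2, rel3, key3):
    """Three-way left outer join on tuple rows; keys are field positions."""
    width2 = len(rel2[0]) if rel2 else 0
    width3 = len(rel3[0]) if rel3 else 0

    # COGROUP: one bag per relation for every distinct key value.
    groups = defaultdict(lambda: ([], [], []))
    for slot, (rel, key) in enumerate(((rel1, key1), (rel2, key2), (rel3, key3))):
        for row in rel:
            groups[row[key]][slot].append(row)

    joined = []
    for bag1, bag2, bag3 in groups.values():
        # FLATTEN of an empty bag1 yields no rows, so keys that are missing
        # from the left relation are dropped; empty right-hand bags are
        # padded with nulls so the left rows survive.
        for r1, r2, r3 in product(
            bag1,
            empty_bag_to_null_fields(bag2, width2),
            empty_bag_to_null_fields(bag3, width3),
        ):
            joined.append(r1 + r2 + r3)
    return joined
```

A test for the real macro would likely want to cover the same three cases: a key present everywhere, a key missing from one right-hand relation, and a key present only on the right.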
[jira] [Commented] (DATAFU-12) Implement Lead UDF based on version from SQL
[ https://issues.apache.org/jira/browse/DATAFU-12?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16196058#comment-16196058 ] Eyal Allweil commented on DATAFU-12: [~matterhayes], anyone, what do you think? I wouldn't "waste" our time on something that can already be done in Pig via Hive, and I'd like to close JIRA issues that are no longer relevant. > Implement Lead UDF based on version from SQL > > > Key: DATAFU-12 > URL: https://issues.apache.org/jira/browse/DATAFU-12 > Project: DataFu > Issue Type: New Feature >Reporter: Matthew Hayes > > Min Zhou has provided this suggestion ([Issue #88 on > GitHub|https://github.com/linkedin/datafu/pull/88]): > Lead is an analytic function like Oracle's Lead function. It provides access > to more than one tuple of a bag at the same time without a self join. Given a > bag of tuples returned from a query, LEAD provides access to a tuple at a > given physical offset beyond that position. Generates pairs of all items in a > bag. > If you do not specify offset, then its default is 1. Null is returned if the > offset goes beyond the scope of the bag. 
> Example 1: > {noformat} >register ba-pig-0.1.jar >define Lead datafu.pig.bags.Lead('2'); >-- INPUT: ({(1),(2),(3),(4)}) >data = LOAD 'input' AS (data: bag {T: tuple(v:INT)}); >describe data; >-- OUTPUT: ({((1),(2),(3)),((2),(3),(4)),((3),(4),),((4),,)}) >-- OUTPUT SCHEMA: data2: {lead_data: {(elem0: (v: int),elem1: (v: > int),elem2: (v: int))}} >data2 = FOREACH data GENERATE Lead(data); >describe data2; >DUMP data2; > {noformat} > Example 2: > {noformat} >register ba-pig-0.1.jar >define Lead datafu.pig.bags.Lead(); >-- INPUT: > ({(10,{(1),(2),(3)}),(20,{(4),(5),(6)}),(30,{(7),(8)}),(40,{(9),(10),(11)}),(50,{(12),(13),(14),(15)})}) >data = LOAD 'input' AS (data: bag {T: tuple(v1:INT,B: bag{T: > tuple(v2:INT)})}); >--describe data; >-- OUTPUT: > ({((10,{(1),(2),(3)}),(20,{(4),(5),(6)})),((20,{(4),(5),(6)}),(30,{(7),(8)})),((30,{(7),(8)}),(40,{(9),(10),(11)})),((40,{(9),(10),(11)}),(50,{(12),(13),(14),(15)})),((50,{(12),(13),(14),(15)}),)}) >data2 = FOREACH data GENERATE Lead(data); >--describe data2; >DUMP data2; > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
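The behaviour shown in the two examples can be reproduced with a small Python model. This is a sketch of the described semantics only, with a bag represented as a list and a missing element as `None`; the function name and signature are illustrative assumptions, not the proposed UDF's interface.

```python
def lead(bag, offset=1):
    """For each item, emit it together with the next `offset` items.

    Positions that fall beyond the end of the bag are filled with None,
    matching the "null is returned" rule in the description.
    """
    items = list(bag)
    return [
        tuple(items[i + k] if i + k < len(items) else None
              for k in range(offset + 1))
        for i in range(len(items))
    ]
```

With `offset=2` on the bag `[1, 2, 3, 4]` this yields the same shape as the OUTPUT comment in Example 1, and the default offset of 1 yields the pairs of Example 2.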
[jira] [Updated] (DATAFU-48) Upgrade Guava to 17.0
[ https://issues.apache.org/jira/browse/DATAFU-48?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eyal Allweil updated DATAFU-48: --- Attachment: DATAFU-48-update-gradle-to-20.0.patch I checked, and Guava 20.0 is the last version that we can update to without getting into a Java version conflict. So this is a patch that updates Guava to 20.0. The tests all pass (build plugin, hourglass, and pig) and I ran a simple Pig script that uses the generated DataFu pig jar to see that it's still valid. Let's close this ancient ticket! > Upgrade Guava to 17.0 > - > > Key: DATAFU-48 > URL: https://issues.apache.org/jira/browse/DATAFU-48 > Project: DataFu > Issue Type: Improvement >Reporter: Philip (flip) Kromer >Assignee: Philip (flip) Kromer >Priority: Minor > Labels: build, dependency, guava, version > Attachments: 0001-DATAFU-48-Upgrade-gradle-to-17.0.patch, > DATAFU-48-update-gradle-to-20.0.patch > > > Specifically motivated by the improvements to hashing library, but also > because we're six versions behind at the moment. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DATAFU-87) Edit distance
[ https://issues.apache.org/jira/browse/DATAFU-87?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16197120#comment-16197120 ] Eyal Allweil commented on DATAFU-87: On second thought, since this UDF is now available in Hive, and since Levenshtein distance is a purely local computation, I'm guessing there's no need for a specific DataFu implementation. Shall we close this issue? Here are some links to the Hive UDF. https://issues.apache.org/jira/browse/HIVE-9556 https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-StringFunctions > Edit distance > - > > Key: DATAFU-87 > URL: https://issues.apache.org/jira/browse/DATAFU-87 > Project: DataFu > Issue Type: New Feature >Affects Versions: 1.3.0 >Reporter: Joydeep Banerjee > Attachments: DATAFU-87.patch > > > [This is work-in-progress] > Given 2 strings, provide a measure of dis-similarity (Levenshtein distance) > between them. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
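As a reference point for the measure discussed in this issue, here is a minimal Python sketch of Levenshtein distance using the standard dynamic-programming recurrence. It is not taken from the attached DATAFU-87 patch or from the Hive UDF; the function name `levenshtein` is chosen only for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions and substitutions needed to turn a into b."""
    # Keep the shorter string as b so each row is as small as possible.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ca
                    current[j - 1] + 1,      # insert cb
                    previous[j - 1] + cost,  # substitute (or match)
                )
            )
        previous = current
    return previous[-1]


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
```

The computation needs only the two input strings and two rows of the table, which is why it can run independently on each record.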
[jira] [Created] (DATAFU-131) Update DataFu site to meet graduation requirements
Eyal Allweil created DATAFU-131: --- Summary: Update DataFu site to meet graduation requirements Key: DATAFU-131 URL: https://issues.apache.org/jira/browse/DATAFU-131 Project: DataFu Issue Type: Bug Reporter: Eyal Allweil The following issues were raised with the [DataFu web site|http://datafu.incubator.apache.org] as part of the [graduation discussion on the incubator general mailing list|https://mail-archives.apache.org/mod_mbox/incubator-general/201710.mbox/%3CCAOGo0VZN4mLw-eS-Oz8W1hVhS70Y_BwwikJjRi9Sx3n0s8sFMg%40mail.gmail.com%3E] There's no link to the main ASF website. There's no LICENSE or Thanks link. There's no download link. etc. The quick start guide pages do have download links, but the primary link is to Maven rather than the ASF, and there are no instructions as to how to check sigs or hashes, and no link to the KEYS file that I could find. The SHA-512 checksum must have the extension .sha512 http://www.apache.org/dev/release-distribution.html#sigs-and-sums Also the latest release appears to be 1.3.2 (dated Feb 2017) but the download links point to 1.3.1. The older releases (1.3.1 and 1.3.0) should have been deleted from the release/dist directory by now. There's no Apache feather logo which is often used as the link to the main ASF site. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
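Since the issue notes that the site has no instructions for checking hashes, here is a minimal Python sketch of what such a check could look like for a downloaded artifact and its .sha512 file. The helper names `sha512_of` and `matches_checksum_file` are hypothetical. The parsing is an assumption: checksum files come in more than one layout, so the sketch only looks for the expected hex digest in the file with all whitespace removed. It verifies integrity only; checking a signature against the KEYS file is a separate step done with GPG.

```python
import hashlib
from pathlib import Path


def sha512_of(path, chunk_size: int = 1 << 20) -> str:
    """Return the lowercase hex SHA-512 digest of the file at path."""
    digest = hashlib.sha512()
    with open(path, "rb") as handle:
        # Read in chunks so large release archives need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum_file(artifact, checksum_file) -> bool:
    """Return True if the artifact's digest appears in the checksum file.

    Assumption: the file contains the hex digest, possibly split into
    groups by whitespace and possibly next to a file name.
    """
    text = Path(checksum_file).read_text()
    normalized = "".join(text.split()).lower()
    return sha512_of(artifact) in normalized
```

A call would take the path of the downloaded archive and the path of the .sha512 file published next to it; a `False` result means the download should not be trusted.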
[jira] [Commented] (DATAFU-131) Update DataFu site to meet graduation requirements
[ https://issues.apache.org/jira/browse/DATAFU-131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16199209#comment-16199209 ] Eyal Allweil commented on DATAFU-131: - Here's a link to the Apache site guidelines: https://www.apache.org/foundation/marks/pmcs#navigation > Update DataFu site to meet graduation requirements > -- > > Key: DATAFU-131 > URL: https://issues.apache.org/jira/browse/DATAFU-131 > Project: DataFu > Issue Type: Bug >Reporter: Eyal Allweil > > The following issues were raised with the [DataFu web > site|http://datafu.incubator.apache.org] as part of the [graduation > discussion on the incubator general mailing > list|https://mail-archives.apache.org/mod_mbox/incubator-general/201710.mbox/%3CCAOGo0VZN4mLw-eS-Oz8W1hVhS70Y_BwwikJjRi9Sx3n0s8sFMg%40mail.gmail.com%3E] > There's no link to the main ASF website. > There's no LICENSE or Thanks link. > There's no download link. > etc. > The quick start guide pages do have download links, but the primary > link is to Maven rather than the ASF, and there are no instructions as > to how to check sigs or hashes, and no link to the KEYS file that I > could find. > The SHA-512 checksum must have the extension .sha512 > http://www.apache.org/dev/release-distribution.html#sigs-and-sums > Also the latest release appears to be 1.3.2 (dated Feb 2017) but the > download links point to 1.3.1. > The older releases (1.3.1 and 1.3.0) should have been deleted from the > release/dist directory by now. > There's no Apache feather logo which is often used as the link to the > main ASF site. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DATAFU-48) Upgrade Guava to 17.0
[ https://issues.apache.org/jira/browse/DATAFU-48?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16210769#comment-16210769 ] Eyal Allweil commented on DATAFU-48: None, actually. Hadoop 1 and 2 are using 11.0.2, like us. Hadoop 3 is [using 21|https://issues.apache.org/jira/browse/HADOOP-10101]. > Upgrade Guava to 17.0 > - > > Key: DATAFU-48 > URL: https://issues.apache.org/jira/browse/DATAFU-48 > Project: DataFu > Issue Type: Improvement >Reporter: Philip (flip) Kromer >Assignee: Philip (flip) Kromer >Priority: Minor > Labels: build, dependency, guava, version > Attachments: 0001-DATAFU-48-Upgrade-gradle-to-17.0.patch, > DATAFU-48-update-gradle-to-20.0.patch > > > Specifically motivated by the improvements to hashing library, but also > because we're six versions behind at the moment. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DATAFU-32) Hourglass concrete jobs should have getters and setters for output name and namespace
[ https://issues.apache.org/jira/browse/DATAFU-32?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16210941#comment-16210941 ] Eyal Allweil commented on DATAFU-32: Is this still relevant? If so, I'll open a [Help Wanted task|https://helpwanted.apache.org/] for it. > Hourglass concrete jobs should have getters and setters for output name and > namespace > - > > Key: DATAFU-32 > URL: https://issues.apache.org/jira/browse/DATAFU-32 > Project: DataFu > Issue Type: Improvement >Reporter: Matthew Hayes >Assignee: Matthew Hayes > > With the abstract versions you can override getOutputSchemaName() and > getOutputSchemaNamespace(). But the concrete versions don't expose setters, > so you have to extend the class to override the defaults. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DATAFU-118) Automatically run rat task when running assemble
[ https://issues.apache.org/jira/browse/DATAFU-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16211505#comment-16211505 ] Eyal Allweil commented on DATAFU-118: - (because we have a patch that seems to work on a newer Gradle version linked in the review board) > Automatically run rat task when running assemble > > > Key: DATAFU-118 > URL: https://issues.apache.org/jira/browse/DATAFU-118 > Project: DataFu > Issue Type: Improvement >Reporter: Matthew Hayes >Priority: Minor > > The rat task checks that our files have the right headers. We don't > automatically run it for assemble so it isn't easy for new contributors to > catch issues. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (DATAFU-125) Upgrade Gradle to v4 or later
[ https://issues.apache.org/jira/browse/DATAFU-125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eyal Allweil updated DATAFU-125: Attachment: DATAFU-125.patch This updates Gradle to 3.5.1. It seems to me that gradle-autojar doesn't work in Gradle 4, which is why I settled for an older version. While running _gradlew clean assemble_ I get this warning message: {{:datafu-pig:jarWithDependencies Manifest.writeTo(Writer) has been deprecated and is scheduled to be removed in Gradle 4.0. Please use Manifest.writeTo(Object) instead. }} And the build indeed fails in Gradle 4. However, this update is enough to let us merge DATAFU-118, and it is a far newer version. I ran _assemble_ and _test_ on it; I still want to run a Pig script using the assembled jar but I'll do that next week. What other tasks do we want to check to see that this fix is valid? Or do we want to try to get Gradle 4 working? > Upgrade Gradle to v4 or later > - > > Key: DATAFU-125 > URL: https://issues.apache.org/jira/browse/DATAFU-125 > Project: DataFu > Issue Type: Task >Reporter: Matthew Hayes > Attachments: DATAFU-125.patch > > > We should update to the most recent version of Gradle. We're currently using > 2.4. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DATAFU-125) Upgrade Gradle to v4 or later
[ https://issues.apache.org/jira/browse/DATAFU-125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16214208#comment-16214208 ] Eyal Allweil commented on DATAFU-125: - _check_ and _clean release_ run and return SUCCESS. Are there any special files I should check that are the result of the _release_ task? I also ran a script on the packaged jar (the regular one, not core or the jarjar) and it ran fine. > Upgrade Gradle to v4 or later > - > > Key: DATAFU-125 > URL: https://issues.apache.org/jira/browse/DATAFU-125 > Project: DataFu > Issue Type: Task >Reporter: Matthew Hayes > Attachments: DATAFU-125.patch > > > We should update to the most recent version of Gradle. We're currently using > 2.4. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DATAFU-48) Upgrade Guava to 17.0
[ https://issues.apache.org/jira/browse/DATAFU-48?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16214212#comment-16214212 ] Eyal Allweil commented on DATAFU-48: As an additional check, I ran a Pig script which uses _SimpleRandomSampleWithReplacementVote_ (which uses Guava) to see that it still runs correctly. > Upgrade Guava to 17.0 > - > > Key: DATAFU-48 > URL: https://issues.apache.org/jira/browse/DATAFU-48 > Project: DataFu > Issue Type: Improvement >Reporter: Philip (flip) Kromer >Assignee: Philip (flip) Kromer >Priority: Minor > Labels: build, dependency, guava, version > Attachments: 0001-DATAFU-48-Upgrade-gradle-to-17.0.patch, > DATAFU-48-update-gradle-to-20.0.patch > > > Specifically motivated by the improvements to hashing library, but also > because we're six versions behind at the moment. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DATAFU-17) Improve testing of randomized functions
[ https://issues.apache.org/jira/browse/DATAFU-17?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16214398#comment-16214398 ] Eyal Allweil commented on DATAFU-17: I think we can close this, just as we closed [DATAFU-28|https://issues.apache.org/jira/browse/DATAFU-28]. If all the tests take less than twenty minutes now I don't think it's worth making an effort to minimize the randomized functions. > Improve testing of randomized functions > --- > > Key: DATAFU-17 > URL: https://issues.apache.org/jira/browse/DATAFU-17 > Project: DataFu > Issue Type: Improvement >Reporter: Will Vaughan > > We have a large number of UDFs with a random component that are difficult and > often slow to test. We should improve our testing standards and capabilities > for this class of functions. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Assigned] (DATAFU-131) Update DataFu site to meet graduation requirements
[ https://issues.apache.org/jira/browse/DATAFU-131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eyal Allweil reassigned DATAFU-131: --- Assignee: Matthew Hayes > Update DataFu site to meet graduation requirements > -- > > Key: DATAFU-131 > URL: https://issues.apache.org/jira/browse/DATAFU-131 > Project: DataFu > Issue Type: Bug >Reporter: Eyal Allweil >Assignee: Matthew Hayes > Attachments: DATAFU-131.patch, Screen Shot 2017-10-25 at 7.21.09 > PM.png > > > The following issues were raised with the [DataFu web > site|http://datafu.incubator.apache.org] as part of the [graduation > discussion on the incubator general mailing > list|https://mail-archives.apache.org/mod_mbox/incubator-general/201710.mbox/%3CCAOGo0VZN4mLw-eS-Oz8W1hVhS70Y_BwwikJjRi9Sx3n0s8sFMg%40mail.gmail.com%3E] > There's no link to the main ASF website. > There's no LICENSE or Thanks link. > There's no download link. > etc. > The quick start guide pages do have download links, but the primary > link is to Maven rather than the ASF, and there are no instructions as > to how to check sigs or hashes, and no link to the KEYS file that I > could find. > The SHA-512 checksum must have the extension .sha512 > http://www.apache.org/dev/release-distribution.html#sigs-and-sums > Also the latest release appears to be 1.3.2 (dated Feb 2017) but the > download links point to 1.3.1. > The older releases (1.3.1 and 1.3.0) should have been deleted from the > release/dist directory by now. > There's no Apache feather logo which is often used as the link to the > main ASF site. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (DATAFU-125) Upgrade Gradle to v4 or later
[ https://issues.apache.org/jira/browse/DATAFU-125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16220371#comment-16220371 ] Eyal Allweil commented on DATAFU-125: - When I build with _./gradlew clean release -Prelease=true_, I don't get a zip file. In fact, I don't get jars either - I need to use _assemble_ to make them (both on master with and without upgrading Gradle). Am I using the wrong command? Does it work for you, [~matterhayes]? > Upgrade Gradle to v4 or later > - > > Key: DATAFU-125 > URL: https://issues.apache.org/jira/browse/DATAFU-125 > Project: DataFu > Issue Type: Task >Reporter: Matthew Hayes >Assignee: Eyal Allweil > Attachments: DATAFU-125.patch > > > We should update to the most recent version of Gradle. We're currently using > 2.4. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (DATAFU-48) Upgrade Guava to 20.0
[ https://issues.apache.org/jira/browse/DATAFU-48?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eyal Allweil updated DATAFU-48: --- Summary: Upgrade Guava to 20.0 (was: Upgrade Guava to 17.0) > Upgrade Guava to 20.0 > - > > Key: DATAFU-48 > URL: https://issues.apache.org/jira/browse/DATAFU-48 > Project: DataFu > Issue Type: Improvement >Reporter: Philip (flip) Kromer >Assignee: Philip (flip) Kromer >Priority: Minor > Labels: build, dependency, guava, version > Attachments: 0001-DATAFU-48-Upgrade-gradle-to-17.0.patch, > DATAFU-48-update-gradle-to-20.0.patch > > > Specifically motivated by the improvements to hashing library, but also > because we're six versions behind at the moment. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Resolved] (DATAFU-48) Upgrade Guava to 20.0
[ https://issues.apache.org/jira/browse/DATAFU-48?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eyal Allweil resolved DATAFU-48. Resolution: Fixed Assignee: Eyal Allweil (was: Philip (flip) Kromer) Fix Version/s: 1.3.3 Merged > Upgrade Guava to 20.0 > - > > Key: DATAFU-48 > URL: https://issues.apache.org/jira/browse/DATAFU-48 > Project: DataFu > Issue Type: Improvement >Reporter: Philip (flip) Kromer >Assignee: Eyal Allweil >Priority: Minor > Labels: build, dependency, guava, version > Fix For: 1.3.3 > > Attachments: 0001-DATAFU-48-Upgrade-gradle-to-17.0.patch, > DATAFU-48-update-gradle-to-20.0.patch > > > Specifically motivated by the improvements to hashing library, but also > because we're six versions behind at the moment. -- This message was sent by Atlassian JIRA (v6.4.14#64029)