[jira] [Commented] (DATAFU-95) Improve wrong JDK error message

2016-01-14 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-95?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15097858#comment-15097858
 ] 

Eyal Allweil commented on DATAFU-95:


As an immediate, easy improvement, it would be great to state the required 
Java version in the main README on GitHub.

> Improve wrong JDK error message
> ---
>
> Key: DATAFU-95
> URL: https://issues.apache.org/jira/browse/DATAFU-95
> Project: DataFu
>  Issue Type: Improvement
>Reporter: Jakob Homan
>Priority: Minor
>
> Right now if one tries to build against JDK1.7, the resulting failure is a 
> bit unclear:
> {noformat}Download 
> https://repo1.maven.org/maven2/org/eclipse/equinox/app/1.3.200-v20130910-1609/app-1.3.200-v20130910-1609.jar
> /Users/jahoman/repos/datafu/build-plugin/src/main/java/org/adrianwalker/multilinestring/MultilineProcessor.java:18:
>  error: cannot find symbol
> @SupportedSourceVersion(SourceVersion.RELEASE_8)
>  ^
>   symbol:   variable RELEASE_8
>   location: class SourceVersion
> 1 error
> :build-plugin:compileJava FAILED
> FAILURE: Build failed with an exception.
> {noformat}
> It may be better to use something like [The 
> Sweeney|https://github.com/boxheed/gradle-sweeney-plugin] to enforce this and 
> provide a better, faster message on failure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (DATAFU-114) Make FirstTupleFromBag implement Accumulator

2016-01-14 Thread Eyal Allweil (JIRA)
Eyal Allweil created DATAFU-114:
---

 Summary: Make FirstTupleFromBag implement Accumulator
 Key: DATAFU-114
 URL: https://issues.apache.org/jira/browse/DATAFU-114
 Project: DataFu
  Issue Type: Improvement
Affects Versions: 1.3.0
 Environment: All
Reporter: Eyal Allweil
Priority: Minor


FirstTupleFromBag only needs the first tuple from the bag, but because it 
doesn't implement Accumulator the entire bag needs to be passed to it 
in-memory. The fix is very minor and will make the UDF support large bags.
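
For anyone following along, here is a minimal sketch of the idea (not the attached patch). It assumes a simplified stand-in interface, BatchAccumulator, that receives the bag in batches; Pig's real Accumulator is handed a Tuple wrapping the bag, so the names and signatures below are illustrative only.

```java
import java.util.List;

final class FirstTupleSketch {
    /** Hypothetical, simplified stand-in for Pig's Accumulator interface. */
    interface BatchAccumulator<T> {
        void accumulate(List<T> batch); // called once per batch of the bag
        T getValue();                   // called after the last batch
        void cleanup();                 // called before the next bag
    }

    /** Remembers only the first element it is ever shown. */
    static final class FirstOnly<T> implements BatchAccumulator<T> {
        private T first;
        private boolean seen;

        @Override
        public void accumulate(List<T> batch) {
            // Later batches are ignored, so the full bag is never held here.
            if (!seen && !batch.isEmpty()) {
                first = batch.get(0);
                seen = true;
            }
        }

        @Override
        public T getValue() {
            return first; // null if the bag was empty
        }

        @Override
        public void cleanup() {
            first = null;
            seen = false;
        }
    }
}
```

The state kept between batches is a single reference and a flag, which is why large bags stop being a problem.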





[jira] [Updated] (DATAFU-114) Make FirstTupleFromBag implement Accumulator

2016-01-14 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil updated DATAFU-114:

Attachment: FirstTupleFromBag.java

I wasn't able to test this patch because I can't get the build working on my 
system (Ubuntu LTS). I'm getting the error described 
[here|https://issues.apache.org/jira/browse/DATAFU-95], and I can't seem to 
make Gradle use a different Java version to get it to compile.

However, since the implementation of Accumulator is relatively straightforward, 
I hope I haven't made any mistakes. I would appreciate it if someone whose 
build is working could try it out and pull the patch.

> Make FirstTupleFromBag implement Accumulator
> 
>
> Key: DATAFU-114
> URL: https://issues.apache.org/jira/browse/DATAFU-114
> Project: DataFu
>  Issue Type: Improvement
>Affects Versions: 1.3.0
> Environment: All
>Reporter: Eyal Allweil
>Priority: Minor
>  Labels: easyfix, newbie, performance
> Attachments: FirstTupleFromBag.java
>
>
> FirstTupleFromBag only needs the first tuple from the bag, but because it 
> doesn't implement Accumulator the entire bag needs to be passed to it 
> in-memory. The fix is very minor and will make the UDF support large bags.





[jira] [Commented] (DATAFU-114) Make FirstTupleFromBag implement Accumulator

2016-01-25 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15114990#comment-15114990
 ] 

Eyal Allweil commented on DATAFU-114:
-

Any comments? Can this patch be pulled?

> Make FirstTupleFromBag implement Accumulator
> 
>
> Key: DATAFU-114
> URL: https://issues.apache.org/jira/browse/DATAFU-114
> Project: DataFu
>  Issue Type: Improvement
>Affects Versions: 1.3.0
> Environment: All
>Reporter: Eyal Allweil
>Priority: Minor
>  Labels: easyfix, newbie, performance
> Attachments: FirstTupleFromBag.java
>
>
> FirstTupleFromBag only needs the first tuple from the bag, but because it 
> doesn't implement Accumulator the entire bag needs to be passed to it 
> in-memory. The fix is very minor and will make the UDF support large bags.





[jira] [Commented] (DATAFU-114) Make FirstTupleFromBag implement Accumulator

2016-02-04 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15131991#comment-15131991
 ] 

Eyal Allweil commented on DATAFU-114:
-

Anyone?

> Make FirstTupleFromBag implement Accumulator
> 
>
> Key: DATAFU-114
> URL: https://issues.apache.org/jira/browse/DATAFU-114
> Project: DataFu
>  Issue Type: Improvement
>Affects Versions: 1.3.0
> Environment: All
>Reporter: Eyal Allweil
>Priority: Minor
>  Labels: easyfix, newbie, performance
> Attachments: FirstTupleFromBag.java
>
>
> FirstTupleFromBag only needs the first tuple from the bag, but because it 
> doesn't implement Accumulator the entire bag needs to be passed to it 
> in-memory. The fix is very minor and will make the UDF support large bags.





[jira] [Commented] (DATAFU-114) Make FirstTupleFromBag implement Accumulator

2016-02-06 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15135730#comment-15135730
 ] 

Eyal Allweil commented on DATAFU-114:
-

The test looks fine, and so does your patch for DATAFU-95 - I was able to build 
and test (after adding the test to BagTests.java). What I still can't do is get 
an Eclipse project working - gradlew completes, but the resulting project has 
no source folders or dependencies.

In the past I had trouble generating patches from Git which RB accepted, but 
maybe that's been taken care of.

Thanks!

> Make FirstTupleFromBag implement Accumulator
> 
>
> Key: DATAFU-114
> URL: https://issues.apache.org/jira/browse/DATAFU-114
> Project: DataFu
>  Issue Type: Improvement
>Affects Versions: 1.3.0
> Environment: All
>Reporter: Eyal Allweil
>Assignee: Eyal Allweil
>Priority: Minor
>  Labels: easyfix, newbie, performance
> Fix For: 1.3.1
>
> Attachments: FirstTupleFromBag.java
>
>
> FirstTupleFromBag only needs the first tuple from the bag, but because it 
> doesn't implement Accumulator the entire bag needs to be passed to it 
> in-memory. The fix is very minor and will make the UDF support large bags.





[jira] [Commented] (DATAFU-114) Make FirstTupleFromBag implement Accumulator

2016-02-17 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15150312#comment-15150312
 ] 

Eyal Allweil commented on DATAFU-114:
-

Thanks!

After I imported the projects individually, as you suggested, it works fine 
in Eclipse. I suggest adding a sentence about this to the base README file to 
help future contributors.

> Make FirstTupleFromBag implement Accumulator
> 
>
> Key: DATAFU-114
> URL: https://issues.apache.org/jira/browse/DATAFU-114
> Project: DataFu
>  Issue Type: Improvement
>Affects Versions: 1.3.0
> Environment: All
>Reporter: Eyal Allweil
>Assignee: Eyal Allweil
>Priority: Minor
>  Labels: easyfix, newbie, performance
> Fix For: 1.3.1
>
> Attachments: FirstTupleFromBag.java
>
>
> FirstTupleFromBag only needs the first tuple from the bag, but because it 
> doesn't implement Accumulator the entire bag needs to be passed to it 
> in-memory. The fix is very minor and will make the UDF support large bags.





[jira] [Created] (DATAFU-115) Make TupleFromBag implement Accumulator

2016-03-03 Thread Eyal Allweil (JIRA)
Eyal Allweil created DATAFU-115:
---

 Summary: Make TupleFromBag implement Accumulator
 Key: DATAFU-115
 URL: https://issues.apache.org/jira/browse/DATAFU-115
 Project: DataFu
  Issue Type: Improvement
Affects Versions: 1.3.0
Reporter: Eyal Allweil
Priority: Minor
 Fix For: 1.3.1


Similar to [DATAFU-114|https://issues.apache.org/jira/browse/DATAFU-114]. 

TupleFromBag doesn't need to hold the bag in memory; it can iterate through the 
bag until it reaches the desired tuple. By implementing Accumulator, it can 
handle larger bags with a smaller memory footprint.
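
A rough sketch of what "iterate until it reaches the desired tuple" could look like when the bag arrives in batches. This is not the patch itself: the class name and the List-based accumulate signature are assumptions for illustration, and Pig's actual Accumulator passes a Tuple containing the bag.

```java
import java.util.List;

/** Hypothetical sketch: pick the element at a fixed index from a bag seen in batches. */
final class TupleAtIndexSketch<T> {
    private final long targetIndex;
    private long seen;    // number of elements consumed in earlier batches
    private T found;
    private boolean done;

    TupleAtIndexSketch(long targetIndex) {
        if (targetIndex < 0) {
            throw new IllegalArgumentException("index must be non-negative");
        }
        this.targetIndex = targetIndex;
    }

    void accumulate(List<T> batch) {
        if (done) {
            return; // already found; remaining batches are skipped
        }
        if (targetIndex < seen + batch.size()) {
            // The target falls inside this batch; translate to a local offset.
            found = batch.get((int) (targetIndex - seen));
            done = true;
        }
        seen += batch.size();
    }

    T getValue() {
        return found; // null when the bag had fewer than targetIndex + 1 elements
    }

    void cleanup() {
        seen = 0;
        found = null;
        done = false;
    }
}
```

Only a counter and one reference survive between batches, regardless of the bag's size.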





[jira] [Updated] (DATAFU-115) Make TupleFromBag implement Accumulator

2016-03-03 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil updated DATAFU-115:

Attachment: DATAFU-115.patch

A relatively straightforward patch. There is one difference from the previous 
behavior: if an exception is thrown, I ignore it and try to continue iterating 
to the desired index.

I tried uploading it to the review board; see if [this 
link|https://reviews.apache.org/r/44351/] works.

> Make TupleFromBag implement Accumulator
> ---
>
> Key: DATAFU-115
> URL: https://issues.apache.org/jira/browse/DATAFU-115
> Project: DataFu
>  Issue Type: Improvement
>Affects Versions: 1.3.0
>Reporter: Eyal Allweil
>Priority: Minor
>  Labels: performance
> Fix For: 1.3.1
>
> Attachments: DATAFU-115.patch
>
>
> Similar to [DATAFU-114|https://issues.apache.org/jira/browse/DATAFU-114]. 
> TupleFromBag doesn't need to hold the bag in memory, and can iterate through 
> it until it reaches the desired tuple. By implementing Accumulator, larger 
> bags can be used and with a smaller memory footprint.





[jira] [Updated] (DATAFU-115) Make TupleFromBag implement Accumulator

2016-03-03 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil updated DATAFU-115:

Flags: Patch

> Make TupleFromBag implement Accumulator
> ---
>
> Key: DATAFU-115
> URL: https://issues.apache.org/jira/browse/DATAFU-115
> Project: DataFu
>  Issue Type: Improvement
>Affects Versions: 1.3.0
>Reporter: Eyal Allweil
>Priority: Minor
>  Labels: performance
> Fix For: 1.3.1
>
> Attachments: DATAFU-115.patch
>
>
> Similar to [DATAFU-114|https://issues.apache.org/jira/browse/DATAFU-114]. 
> TupleFromBag doesn't need to hold the bag in memory, and can iterate through 
> it until it reaches the desired tuple. By implementing Accumulator, larger 
> bags can be used and with a smaller memory footprint.





[jira] [Created] (DATAFU-116) Make SetIntersect and SetDifference implement Accumulator

2016-03-08 Thread Eyal Allweil (JIRA)
Eyal Allweil created DATAFU-116:
---

 Summary: Make SetIntersect and SetDifference implement Accumulator
 Key: DATAFU-116
 URL: https://issues.apache.org/jira/browse/DATAFU-116
 Project: DataFu
  Issue Type: Improvement
Affects Versions: 1.3.0
Reporter: Eyal Allweil


SetIntersect and SetDifference accept only sorted bags, and the output is 
always smaller than the inputs. Therefore an accumulator implementation should 
be possible and it will improve memory usage (somewhat) and allow Pig to 
optimize loops with these operations better.





[jira] [Commented] (DATAFU-116) Make SetIntersect and SetDifference implement Accumulator

2016-03-08 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15185409#comment-15185409
 ] 

Eyal Allweil commented on DATAFU-116:
-

As far as I can tell, when the accumulator is used, Pig passes 
_pig.accumulative.batchsize_ tuples from each bag until all the tuples are 
exhausted. I think an implementation that iterates over the bags and only keeps 
some of the tuples in between batches is possible - hopefully very few, but the 
worst case is all of them, which is no worse than the current implementation.

I'm assuming Pig passes batches in this way based on the code in 
[POPackage|https://github.com/apache/pig/blob/trunk/src/org/apache/pig/backend/hadoop/executionengine/physicalLayer/relationalOperators/POPackage.java]
 and from looking through all the documentation I could find on accumulators. 
If I'm wrong it does mean that an accumulator implementation isn't worthwhile.
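
To illustrate why little state should be needed between batches, here is a sketch of a merge-style intersection over two sorted inputs. It is only an illustration under stated assumptions: both inputs are sorted ascending and contain no duplicates, plain Iterators stand in for the batches Pig would deliver, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

final class SortedIntersectSketch {
    /** Intersects two ascending, duplicate-free streams, holding one element per side. */
    static <T extends Comparable<T>> List<T> intersect(Iterator<T> left, Iterator<T> right) {
        List<T> out = new ArrayList<>();
        if (!left.hasNext() || !right.hasNext()) {
            return out;
        }
        T a = left.next();
        T b = right.next();
        while (true) {
            int cmp = a.compareTo(b);
            if (cmp == 0) {
                out.add(a); // present on both sides
                if (!left.hasNext() || !right.hasNext()) {
                    break;
                }
                a = left.next();
                b = right.next();
            } else if (cmp < 0) {
                if (!left.hasNext()) {
                    break;
                }
                a = left.next(); // left is behind, advance it
            } else {
                if (!right.hasNext()) {
                    break;
                }
                b = right.next(); // right is behind, advance it
            }
        }
        return out;
    }
}
```

In a batched setting the elements that would have to be kept are the ones from the side that is ahead, waiting for the other side to catch up, which matches the "hopefully very few, worst case all of them" estimate above.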

> Make SetIntersect and SetDifference implement Accumulator
> -
>
> Key: DATAFU-116
> URL: https://issues.apache.org/jira/browse/DATAFU-116
> Project: DataFu
>  Issue Type: Improvement
>Affects Versions: 1.3.0
>Reporter: Eyal Allweil
>
> SetIntersect and SetDifference accept only sorted bags, and the output is 
> always smaller than the inputs. Therefore an accumulator implementation 
> should be possible and it will improve memory usage (somewhat) and allow Pig 
> to optimize loops with these operations better.





[jira] [Commented] (DATAFU-116) Make SetIntersect and SetDifference implement Accumulator

2016-03-10 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15189158#comment-15189158
 ] 

Eyal Allweil commented on DATAFU-116:
-

As far as I know, the behavior you're describing is how Pig deals with UDFs 
that implement the Accumulator interface. If the UDF doesn't (if it only 
extends EvalFunc), the parameters (including bags) are passed in memory in 
their entirety. I'm basing this on [this quote from Programming 
Pig|http://stackoverflow.com/a/15813789/150992]. That's why I'm suggesting this 
change.



> Make SetIntersect and SetDifference implement Accumulator
> -
>
> Key: DATAFU-116
> URL: https://issues.apache.org/jira/browse/DATAFU-116
> Project: DataFu
>  Issue Type: Improvement
>Affects Versions: 1.3.0
>Reporter: Eyal Allweil
>
> SetIntersect and SetDifference accept only sorted bags, and the output is 
> always smaller than the inputs. Therefore an accumulator implementation 
> should be possible and it will improve memory usage (somewhat) and allow Pig 
> to optimize loops with these operations better.





[jira] [Created] (DATAFU-117) New UDF - CountDistinctUpTo

2016-03-24 Thread Eyal Allweil (JIRA)
Eyal Allweil created DATAFU-117:
---

 Summary: New UDF - CountDistinctUpTo
 Key: DATAFU-117
 URL: https://issues.apache.org/jira/browse/DATAFU-117
 Project: DataFu
  Issue Type: New Feature
Reporter: Eyal Allweil


A UDF that counts distinct tuples within a bag, but only up to a preset limit. 
If the bag contains more distinct tuples than the limit, the UDF returns the 
limit. 

This UDF can run reasonably well even on large bags if the chosen limit is 
small enough, even though the count is done in memory.

We use this UDF in PayPal for filtering, when we don't need to use the actual 
tuples afterward.
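
The core idea can be sketched in a few lines. This is a standalone illustration rather than the UDF from the patch: the class and method names are hypothetical, and an Iterable stands in for the Pig bag.

```java
import java.util.HashSet;
import java.util.Set;

final class CountDistinctUpToSketch {
    /** Counts distinct items, stopping as soon as the limit is reached. */
    static <T> int countDistinctUpTo(Iterable<T> items, int limit) {
        if (limit <= 0) {
            return 0;
        }
        Set<T> seen = new HashSet<>();
        for (T item : items) {
            // add() returns true only for a new element, so the size grows by
            // at most one per iteration and cannot jump past the limit.
            if (seen.add(item) && seen.size() == limit) {
                return limit; // early exit: the rest of the bag is never read
            }
        }
        return seen.size();
    }
}
```

The set never holds more than limit elements, so memory use is bounded by the limit rather than by the number of distinct tuples in the bag.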






[jira] [Updated] (DATAFU-117) New UDF - CountDistinctUpTo

2016-03-24 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil updated DATAFU-117:

Attachment: DATAFU-117.patch

Patch including new UDF and test (in BagTests)

> New UDF - CountDistinctUpTo
> ---
>
> Key: DATAFU-117
> URL: https://issues.apache.org/jira/browse/DATAFU-117
> Project: DataFu
>  Issue Type: New Feature
>Reporter: Eyal Allweil
> Attachments: DATAFU-117.patch
>
>
> A UDF that counts distinct tuples within a bag, but only up to a preset 
> limit. If the bag contains more distinct tuples than the limit, the UDF 
> returns the limit. 
> This UDF can run reasonably well even on large bags if the limit chosen is 
> small enough though the count is done in memory.
> We use this UDF in PayPal for filtering, when we don't need to use the actual 
> tuples afterward.





[jira] [Commented] (DATAFU-115) Make TupleFromBag implement Accumulator

2016-03-27 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15213559#comment-15213559
 ] 

Eyal Allweil commented on DATAFU-115:
-

I'm not sure why, but I can't see this patch in the master branch. I can see 
https://issues.apache.org/jira/browse/DATAFU-114 - 
[FirstTupleFromBag|https://github.com/apache/incubator-datafu/blob/master/datafu-pig/src/main/java/datafu/pig/bags/FirstTupleFromBag.java]
 appears changed - but 
[TupleFromBag|https://github.com/apache/incubator-datafu/blob/master/datafu-pig/src/main/java/datafu/pig/bags/TupleFromBag.java]
 looks like it hasn't been changed since August. Does the public GitHub 
represent the repository accurately?

> Make TupleFromBag implement Accumulator
> ---
>
> Key: DATAFU-115
> URL: https://issues.apache.org/jira/browse/DATAFU-115
> Project: DataFu
>  Issue Type: Improvement
>Affects Versions: 1.3.0
>Reporter: Eyal Allweil
>Assignee: Eyal Allweil
>Priority: Minor
>  Labels: performance
> Fix For: 1.3.1
>
> Attachments: DATAFU-115.patch
>
>
> Similar to [DATAFU-114|https://issues.apache.org/jira/browse/DATAFU-114]. 
> TupleFromBag doesn't need to hold the bag in memory, and can iterate through 
> it until it reaches the desired tuple. By implementing Accumulator, larger 
> bags can be used and with a smaller memory footprint.





[jira] [Commented] (DATAFU-115) Make TupleFromBag implement Accumulator

2016-03-29 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15215634#comment-15215634
 ] 

Eyal Allweil commented on DATAFU-115:
-

Thanks!

> Make TupleFromBag implement Accumulator
> ---
>
> Key: DATAFU-115
> URL: https://issues.apache.org/jira/browse/DATAFU-115
> Project: DataFu
>  Issue Type: Improvement
>Affects Versions: 1.3.0
>Reporter: Eyal Allweil
>Assignee: Eyal Allweil
>Priority: Minor
>  Labels: performance
> Fix For: 1.3.1
>
> Attachments: DATAFU-115.patch
>
>
> Similar to [DATAFU-114|https://issues.apache.org/jira/browse/DATAFU-114]. 
> TupleFromBag doesn't need to hold the bag in memory, and can iterate through 
> it until it reaches the desired tuple. By implementing Accumulator, larger 
> bags can be used and with a smaller memory footprint.





[jira] [Commented] (DATAFU-117) New UDF - CountDistinctUpTo

2016-04-10 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15233995#comment-15233995
 ] 

Eyal Allweil commented on DATAFU-117:
-

Thanks for the feedback! I will incorporate it in a future patch, but I'm 
trying to check whether I can revise this UDF to have an Algebraic 
implementation (which should further improve its performance). I'll open a 
review board on this version as soon as I can.

> New UDF - CountDistinctUpTo
> ---
>
> Key: DATAFU-117
> URL: https://issues.apache.org/jira/browse/DATAFU-117
> Project: DataFu
>  Issue Type: New Feature
>Reporter: Eyal Allweil
> Attachments: DATAFU-117.patch
>
>
> A UDF that counts distinct tuples within a bag, but only up to a preset 
> limit. If the bag contains more distinct tuples than the limit, the UDF 
> returns the limit. 
> This UDF can run reasonably well even on large bags if the limit chosen is 
> small enough though the count is done in memory.
> We use this UDF in PayPal for filtering, when we don't need to use the actual 
> tuples afterward.





[jira] [Commented] (DATAFU-117) New UDF - CountDistinctUpTo

2016-04-26 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15258239#comment-15258239
 ] 

Eyal Allweil commented on DATAFU-117:
-

Ok, I opened a review board for it - can you see it? It's at 
https://reviews.apache.org/r/46701/

I think all your previous comments are addressed there, except for the one 
about "this.set.add(o) && (this.set.size() == maxAmount)". I don't think the 
set can exceed the max size, because a single add operation can only increase 
the set's size by one, and the UDF is executed in a single thread.

I ran a few tests comparing this UDF to a Pig nested foreach with DISTINCT 
followed by the builtin COUNT. On small inputs they perform about the same - 
even up to a million records - but if you have a situation with more skew (I 
checked 10 million records, with about 4 million distincts) then this UDF with 
a max value of say, 1, runs in about four minutes, and the nested 
foreach+distinct+count takes more than an hour - probably because it needs to 
keep all the distinct records in memory, rather than just reaching the desired 
threshold.

> New UDF - CountDistinctUpTo
> ---
>
> Key: DATAFU-117
> URL: https://issues.apache.org/jira/browse/DATAFU-117
> Project: DataFu
>  Issue Type: New Feature
>Reporter: Eyal Allweil
> Attachments: DATAFU-117.patch
>
>
> A UDF that counts distinct tuples within a bag, but only up to a preset 
> limit. If the bag contains more distinct tuples than the limit, the UDF 
> returns the limit. 
> This UDF can run reasonably well even on large bags if the limit chosen is 
> small enough though the count is done in memory.
> We use this UDF in PayPal for filtering, when we don't need to use the actual 
> tuples afterward.





[jira] [Updated] (DATAFU-117) New UDF - CountDistinctUpTo

2016-04-26 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil updated DATAFU-117:

Attachment: DATAFU-117-2.patch

This replaces the previous patch file, addresses (most of) Matthew's comments, 
and adds an Algebraic implementation to the UDF.

> New UDF - CountDistinctUpTo
> ---
>
> Key: DATAFU-117
> URL: https://issues.apache.org/jira/browse/DATAFU-117
> Project: DataFu
>  Issue Type: New Feature
>Reporter: Eyal Allweil
> Attachments: DATAFU-117-2.patch, DATAFU-117.patch
>
>
> A UDF that counts distinct tuples within a bag, but only up to a preset 
> limit. If the bag contains more distinct tuples than the limit, the UDF 
> returns the limit. 
> This UDF can run reasonably well even on large bags if the limit chosen is 
> small enough though the count is done in memory.
> We use this UDF in PayPal for filtering, when we don't need to use the actual 
> tuples afterward.





[jira] [Comment Edited] (DATAFU-117) New UDF - CountDistinctUpTo

2016-05-09 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15258239#comment-15258239
 ] 

Eyal Allweil edited comment on DATAFU-117 at 5/9/16 8:50 AM:
-

Ok, I opened a review board for it - It's at https://reviews.apache.org/r/46701/

I think all your previous comments are addressed there, except for the one 
about "this.set.add(o) && (this.set.size() == maxAmount)". I don't think the 
set can exceed the max size, because a single add operation can only increase 
the set's size by one, and the UDF is executed in a single thread.

I ran a few tests comparing this UDF to a Pig nested foreach with DISTINCT 
followed by the builtin COUNT. On small inputs they perform about the same - 
even up to a million records - but if you have a situation with more skew (I 
checked 10 million records, with about 4 million distincts) then this UDF with 
a max value of say, 1,000,000, runs in a few minutes, and the nested 
foreach+distinct+count takes more than an hour - probably because it needs to 
keep all the distinct records in memory, rather than just reaching the desired 
threshold.


was (Author: eyal):
Ok, I opened a review board for it - can you see it? It's at 
https://reviews.apache.org/r/46701/

I think all your previous comments are addressed there, except for the one 
about "this.set.add(o) && (this.set.size() == maxAmount". I don't think this 
can exceed the max size, because a single add operation can only increment the 
set's size by one, and the UDF is executed in a single thread.

I ran a few tests comparing this UDF to a Pig nested foreach with DISTINCT 
followed by the builtin COUNT. On small inputs they perform about the same - 
even up to a million records - but if you have a situation with more skew (I 
checked 10 million records, with about 4 million distincts) then this UDF with 
a max value of say, 1, runs in about four minutes, and the nested 
foreach+distinct+count takes more than an hour - probably because it needs to 
keep all the distinct records in memory, rather than just reaching the desired 
threshold.

> New UDF - CountDistinctUpTo
> ---
>
> Key: DATAFU-117
> URL: https://issues.apache.org/jira/browse/DATAFU-117
> Project: DataFu
>  Issue Type: New Feature
>Reporter: Eyal Allweil
> Attachments: DATAFU-117-2.patch, DATAFU-117.patch
>
>
> A UDF that counts distinct tuples within a bag, but only up to a preset 
> limit. If the bag contains more distinct tuples than the limit, the UDF 
> returns the limit. 
> This UDF can run reasonably well even on large bags if the limit chosen is 
> small enough though the count is done in memory.
> We use this UDF in PayPal for filtering, when we don't need to use the actual 
> tuples afterward.





[jira] [Commented] (DATAFU-117) New UDF - CountDistinctUpTo

2016-05-10 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15277798#comment-15277798
 ] 

Eyal Allweil commented on DATAFU-117:
-

Is anyone available to review this?

> New UDF - CountDistinctUpTo
> ---
>
> Key: DATAFU-117
> URL: https://issues.apache.org/jira/browse/DATAFU-117
> Project: DataFu
>  Issue Type: New Feature
>Reporter: Eyal Allweil
> Attachments: DATAFU-117-2.patch, DATAFU-117.patch
>
>
> A UDF that counts distinct tuples within a bag, but only up to a preset 
> limit. If the bag contains more distinct tuples than the limit, the UDF 
> returns the limit. 
> This UDF can run reasonably well even on large bags if the limit chosen is 
> small enough though the count is done in memory.
> We use this UDF in PayPal for filtering, when we don't need to use the actual 
> tuples afterward.





[jira] [Updated] (DATAFU-117) New UDF - CountDistinctUpTo

2016-05-19 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil updated DATAFU-117:

Attachment: DATAFU-117-3.patch

Incorporates changes from the [review|https://reviews.apache.org/r/46701/].

> New UDF - CountDistinctUpTo
> ---
>
> Key: DATAFU-117
> URL: https://issues.apache.org/jira/browse/DATAFU-117
> Project: DataFu
>  Issue Type: New Feature
>Reporter: Eyal Allweil
> Attachments: DATAFU-117-2.patch, DATAFU-117-3.patch, DATAFU-117.patch
>
>
> A UDF that counts distinct tuples within a bag, but only up to a preset 
> limit. If the bag contains more distinct tuples than the limit, the UDF 
> returns the limit. 
> This UDF can run reasonably well even on large bags if the limit chosen is 
> small enough though the count is done in memory.
> We use this UDF in PayPal for filtering, when we don't need to use the actual 
> tuples afterward.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (DATAFU-117) New UDF - CountDistinctUpTo

2016-06-08 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil updated DATAFU-117:

Attachment: DATAFU-117-4.patch

This patch incorporates the last remaining comment from the review (clearing 
instead of reassigning the set in cleanup)

> New UDF - CountDistinctUpTo
> ---
>
> Key: DATAFU-117
> URL: https://issues.apache.org/jira/browse/DATAFU-117
> Project: DataFu
>  Issue Type: New Feature
>Reporter: Eyal Allweil
>Assignee: Eyal Allweil
> Attachments: DATAFU-117-2.patch, DATAFU-117-3.patch, 
> DATAFU-117-4.patch, DATAFU-117.patch
>
>
> A UDF that counts distinct tuples within a bag, but only up to a preset 
> limit. If the bag contains more distinct tuples than the limit, the UDF 
> returns the limit. 
> This UDF can run reasonably well even on large bags if the limit chosen is 
> small enough though the count is done in memory.
> We use this UDF in PayPal for filtering, when we don't need to use the actual 
> tuples afterward.





[jira] [Created] (DATAFU-119) New UDF - TupleDiff

2016-06-21 Thread Eyal Allweil (JIRA)
Eyal Allweil created DATAFU-119:
---

 Summary: New UDF - TupleDiff
 Key: DATAFU-119
 URL: https://issues.apache.org/jira/browse/DATAFU-119
 Project: DataFu
  Issue Type: New Feature
Reporter: Eyal Allweil
Assignee: Eyal Allweil


A UDF that, given two tuples, prints out the differences between them in 
human-readable form. This is not meant for production - we use it at PayPal for 
regression tests, to compare the results of two runs. Differences are 
calculated based on position, but the tuples' schemas are used, if available, 
to display friendlier results. If no schema is available, the output uses 
field numbers.

It should be used when you want a more fine-grained description of what has 
changed, unlike 
[org.apache.pig.builtin.DIFF|https://pig.apache.org/docs/r0.14.0/func.html#diff].
 Also, because DIFF takes as its input two bags to be compared, they must fit 
in memory. This UDF only takes one pair of tuples at a time, so it can run on 
large inputs.

We use a macro much like the following in conjunction with this UDF:

DEFINE diff_macro(diff_macro_old, diff_macro_new, diff_macro_pk, diff_macro_ignored_field) returns diffs {
    DEFINE TupleDiff datafu.pig.util.TupleDiff;

    old = FOREACH $diff_macro_old GENERATE $diff_macro_pk, TOTUPLE(*) AS original;
    new = FOREACH $diff_macro_new GENERATE $diff_macro_pk, TOTUPLE(*) AS original;

    join_data = JOIN new BY $diff_macro_pk full, old BY $diff_macro_pk;

    join_data = FOREACH join_data GENERATE TupleDiff(old::original, new::original, '$diff_macro_ignored_field') AS tupleDiff, old::original, new::original;

    $diffs = FILTER join_data BY tupleDiff IS NOT NULL;
};

Currently, the output from the macro looks like this (when comma-separated):

added,,
missing,,
changed field2 field4,,

The UDF takes a variable number of parameters - the two tuples to be compared, 
and any number of field names or numbers to be ignored. We use this to ignore 
fields representing execution or creation time (the macro I've given as an 
example assumes only one ignored field)
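
To make the comparison rules concrete, here is a small standalone sketch of a positional diff with ignored fields. It is not the UDF itself and makes several assumptions: Lists stand in for Pig tuples, "added" is taken to mean the old tuple is missing and "missing" that the new tuple is missing, field numbers are zero-based when no names are given, and nested tuples are not drilled into. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Objects;
import java.util.Set;

final class TupleDiffSketch {
    /**
     * Returns null when nothing differs, otherwise "added", "missing", or
     * "changed" followed by the names (or positions) of differing fields.
     */
    static String diff(List<?> oldT, List<?> newT, List<String> names, String... ignored) {
        if (oldT == null && newT == null) {
            return null;
        }
        if (oldT == null) {
            return "added";   // assumption: the row exists only in the new run
        }
        if (newT == null) {
            return "missing"; // assumption: the row exists only in the old run
        }
        Set<String> skip = new HashSet<>(Arrays.asList(ignored));
        List<String> changed = new ArrayList<>();
        int n = Math.max(oldT.size(), newT.size());
        for (int i = 0; i < n; i++) {
            String position = String.valueOf(i);
            // Use the schema name when available, else the field number.
            String name = (names != null && i < names.size()) ? names.get(i) : position;
            if (skip.contains(name) || skip.contains(position)) {
                continue; // ignored by name or by number
            }
            Object a = i < oldT.size() ? oldT.get(i) : null;
            Object b = i < newT.size() ? newT.get(i) : null;
            if (!Objects.equals(a, b)) {
                changed.add(name);
            }
        }
        return changed.isEmpty() ? null : "changed " + String.join(" ", changed);
    }
}
```

Returning null for identical tuples lines up with the macro's final FILTER on tupleDiff IS NOT NULL, so unchanged rows drop out of the result.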

The current implementation "drills down" into tuples, but not bags or maps - 
tuple boundaries are indicated with parentheses, like this:

changed outerEmbeddedTuple(innerEmbeddedTuple(fieldNameThatIsDifferent) 
innerEmbeddedTuple(anotherFieldThatIsDifferent))
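
The behaviour described above (positional comparison, optional field names, ignored fields, drilling into nested tuples) can be sketched in plain Java roughly as follows. This is only an illustration and not the actual datafu.pig.util.TupleDiff code: tuples are modelled as java.util.List, nested fields are reported by position, the "added"/"missing" rule for an absent side is an assumed reading of the sample output, and the class and method names are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Objects;
import java.util.Set;

/** Illustrative only: a positional tuple diff, not the real TupleDiff UDF. */
final class TupleDiffSketch {

    /**
     * Returns null when the tuples match, "added" or "missing" when one side is
     * absent (assumed semantics), or "changed ..." listing the differing fields.
     * names may be null, in which case field numbers are printed instead.
     */
    static String diff(List<?> oldTuple, List<?> newTuple, List<String> names, Set<String> ignored) {
        if (oldTuple == null && newTuple == null) return null;
        if (oldTuple == null) return "added";
        if (newTuple == null) return "missing";
        List<String> changed = changedFields(oldTuple, newTuple, names, ignored);
        return changed.isEmpty() ? null : "changed " + String.join(" ", changed);
    }

    private static List<String> changedFields(List<?> a, List<?> b, List<String> names, Set<String> ignored) {
        List<String> changed = new ArrayList<>();
        int size = Math.max(a.size(), b.size());
        for (int i = 0; i < size; i++) {
            String label = (names != null && i < names.size()) ? names.get(i) : String.valueOf(i);
            // A field can be ignored either by name or by position.
            if (ignored.contains(label) || ignored.contains(String.valueOf(i))) continue;
            Object x = i < a.size() ? a.get(i) : null;
            Object y = i < b.size() ? b.get(i) : null;
            if (x instanceof List && y instanceof List) {
                // Drill down into nested tuples; nested fields are shown by position here.
                List<String> inner = changedFields((List<?>) x, (List<?>) y, null, Collections.emptySet());
                if (!inner.isEmpty()) changed.add(label + "(" + String.join(" ", inner) + ")");
            } else if (!Objects.equals(x, y)) {
                changed.add(label);
            }
        }
        return changed;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("id", "field2", "ts", "field4");
        System.out.println(diff(Arrays.asList(1, "a", 100, "x"),
                                Arrays.asList(1, "b", 200, "y"),
                                names, Collections.singleton("ts"))); // changed field2 field4
    }
}
```

Because each call only looks at one pair of tuples, memory use does not grow with the size of the relations being compared, which is the property contrasted with DIFF above.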

I have a few final things left to do and then I'll put it up on reviewboard.





[jira] [Updated] (DATAFU-119) New UDF - TupleDiff

2016-06-21 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil updated DATAFU-119:

Description: 
A UDF that, given two tuples, prints out the differences between them in 
human-readable form. This is not meant for production - we use it at PayPal for 
regression tests, to compare the results of two runs. Differences are 
calculated based on position, but the tuples' schemas are used, if available, 
for displaying more friendly results. If no schema is available the output uses 
field numbers.

It should be used when you want a more fine-grained description of what has 
changed, unlike 
[org.apache.pig.builtin.DIFF|https://pig.apache.org/docs/r0.14.0/func.html#diff].
 Also, because DIFF takes as its input two bags to be compared, they must fit 
in memory. This UDF only takes one pair of tuples at a time, so it can run on 
large inputs.

We use a macro much like the following in conjunction with this UDF:

{noformat}
DEFINE diff_macro(diff_macro_old, diff_macro_new, diff_macro_pk, diff_macro_ignored_field) returns diffs {
  DEFINE TupleDiff datafu.pig.util.TupleDiff;

  old = FOREACH $diff_macro_old GENERATE $diff_macro_pk, TOTUPLE(*) AS original;
  new = FOREACH $diff_macro_new GENERATE $diff_macro_pk, TOTUPLE(*) AS original;

  join_data = JOIN new BY $diff_macro_pk FULL, old BY $diff_macro_pk;

  join_data = FOREACH join_data GENERATE
      TupleDiff(old::original, new::original, '$diff_macro_ignored_field') AS tupleDiff,
      old::original,
      new::original;

  $diffs = FILTER join_data BY tupleDiff IS NOT NULL;
};
{noformat}

Currently, the output from the macro looks like this (when comma-separated):

{noformat}
added,,
missing,,
changed field2 field4,,
{noformat}

The UDF takes a variable number of parameters - the two tuples to be compared, 
and any number of field names or numbers to be ignored. We use this to ignore 
fields representing execution or creation time (the macro I've given as an 
example assumes only one ignored field).

The current implementation "drills down" into tuples, but not bags or maps - 
tuple boundaries are indicated with parentheses, like this:

{noformat}
changed outerEmbeddedTuple(innerEmbeddedTuple(fieldNameThatIsDifferent) 
innerEmbeddedTuple(anotherFieldThatIsDifferent))
{noformat}

I have a few final things left to do and then I'll put it up on reviewboard.

  was:
A UDF that given two tuples, prints out the differences between them in 
human-readable form. This is not meant for production - we use it in PayPal for 
regression tests, to compare the results of two runs. Differences are 
calculated based on position, but the tuples' schemas are used, if available, 
for displaying more friendly results. If no schema is available the output uses 
field numbers.

It should be used when you want a more fine-grained description of what has 
changed, unlike 
[org.apache.pig.builtin.DIFF|https://pig.apache.org/docs/r0.14.0/func.html#diff].
 Also, because DIFF takes as its input two bags to be compared, they must fit 
in memory. This UDF only takes one pair of tuples at a time, so it can run on 
large inputs.

We use a macro much like the following in conjunction with this UDF:

DEFINE diff_macro(diff_macro_old, diff_macro_new, 
diff_macro_pk, diff_macro_ignored_field) returns diffs
{
DEFINE TupleDiff datafu.pig.util.TupleDiff;

old =   FOREACH $diff_macro_old GENERATE 
$diff_macro_pk, TOTUPLE(*) AS original;
new =   FOREACH $diff_macro_new GENERATE 
$diff_macro_pk, TOTUPLE(*) AS original;

join_data = JOIN new BY $diff_macro_pk full, old BY 
$diff_macro_pk;

join_data = FOREACH join_data GENERATE 
TupleDiff(old::original, new::original, '$diff_macro_ignored_field') AS 
tupleDiff, old::original, new::original;

$diffs = FILTER join_data BY tupleDiff IS NOT NULL ;
};

Currently, the output from the macro looks like this (when comma-separated):

added,,
missing,,
changed field2 field4,,

The UDF takes a variable number of parameters - the two tuples to be compared, 
and any number of field names or numbers to be ignored. We use this to ignore 
fields representing execution or creation time (the macro I've given as an 
example assumes only one ignored field)

The current implementation "drills down" into tuples, but not bags or maps - 
tuple boundaries are indicated with parentheses, like this:

changed outerEmbeddedTuple(innerEmbeddedTuple(fieldNameThatIsDifferent) 
innerEmbeddedTuple(anotherFieldThatIsDifferent))

I have a few final things left to do and then I'll put it up on reviewboard.


> New UDF - TupleDiff
>

[jira] [Commented] (DATAFU-119) New UDF - TupleDiff

2016-06-26 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15350489#comment-15350489
 ] 

Eyal Allweil commented on DATAFU-119:
-

I put up a [reviewboard|https://reviews.apache.org/r/49248/] for this. After 
some internal discussions, I wonder if the output isn't too specific for 
general use. I find it very convenient during development for comparing 
outputs, but it's very much skewed towards human readability. To make the 
output easy to use in Pig, it should have a real schema rather than a 
chararray, possibly something with the field names from the original tuples 
but with boolean or int values to indicate change types. I'd be happy to hear 
feedback about this.
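
To make the idea of a "real schema" concrete, here is one possible shape for a structured result, sketched in plain Java: one boolean flag per field name instead of a single string. The class and method names are hypothetical, and nothing here reflects what the patch on reviewboard actually does.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

/** Illustrative only: a structured alternative to the chararray output. */
final class StructuredDiffSketch {

    /** One flag per field name, in field order: true when the two values differ. */
    static Map<String, Boolean> changedFlags(List<String> names, List<?> oldTuple, List<?> newTuple) {
        Map<String, Boolean> flags = new LinkedHashMap<>();
        for (int i = 0; i < names.size(); i++) {
            // A field missing on one side is treated as null.
            Object x = i < oldTuple.size() ? oldTuple.get(i) : null;
            Object y = i < newTuple.size() ? newTuple.get(i) : null;
            flags.put(names.get(i), !Objects.equals(x, y));
        }
        return flags;
    }

    public static void main(String[] args) {
        System.out.println(changedFlags(Arrays.asList("id", "field2"),
                                        Arrays.asList(1, "a"),
                                        Arrays.asList(1, "b"))); // {id=false, field2=true}
    }
}
```

A result like this could be filtered or aggregated per field downstream, at the cost of being less pleasant to read by eye than the current output.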

> New UDF - TupleDiff
> ---
>
> Key: DATAFU-119
> URL: https://issues.apache.org/jira/browse/DATAFU-119
> Project: DataFu
>  Issue Type: New Feature
>Reporter: Eyal Allweil
>Assignee: Eyal Allweil
>
> A UDF that given two tuples, prints out the differences between them in 
> human-readable form. This is not meant for production - we use it in PayPal 
> for regression tests, to compare the results of two runs. Differences are 
> calculated based on position, but the tuples' schemas are used, if available, 
> for displaying more friendly results. If no schema is available the output 
> uses field numbers.
> It should be used when you want a more fine-grained description of what has 
> changed, unlike 
> [org.apache.pig.builtin.DIFF|https://pig.apache.org/docs/r0.14.0/func.html#diff].
>  Also, because DIFF takes as its input two bags to be compared, they must fit 
> in memory. This UDF only takes one pair of tuples at a time, so it can run on 
> large inputs.
> We use a macro much like the following in conjunction with this UDF:
> {noformat}
> DEFINE diff_macro(diff_macro_old, diff_macro_new, diff_macro_pk, 
> diff_macro_ignored_field) returns diffs {
>   DEFINE TupleDiff datafu.pig.util.TupleDiff;
>   
>   old =   FOREACH $diff_macro_old GENERATE $diff_macro_pk, TOTUPLE(*) AS 
> original;
>   new =   FOREACH $diff_macro_new GENERATE $diff_macro_pk, TOTUPLE(*) AS 
> original;
>   
>   join_data = JOIN new BY $diff_macro_pk full, old BY $diff_macro_pk;
>   
>   join_data = FOREACH join_data GENERATE TupleDiff(old::original, 
> new::original, '$diff_macro_ignored_field') AS tupleDiff, old::original, 
> new::original;
>   
>   $diffs = FILTER join_data BY tupleDiff IS NOT NULL ;
> };
> {noformat}
> Currently, the output from the macro looks like this (when comma-separated):
> {noformat}
> added,,
> missing,,
> changed field2 field4,,
> {noformat}
> The UDF takes a variable number of parameters - the two tuples to be 
> compared, and any number of field names or numbers to be ignored. We use this 
> to ignore fields representing execution or creation time (the macro I've 
> given as an example assumes only one ignored field)
> The current implementation "drills down" into tuples, but not bags or maps - 
> tuple boundaries are indicated with parentheses, like this:
> {noformat}
> changed outerEmbeddedTuple(innerEmbeddedTuple(fieldNameThatIsDifferent) 
> innerEmbeddedTuple(anotherFieldThatIsDifferent))
> {noformat}
> I have a few final things left to do and then I'll put it up on reviewboard.





[jira] [Commented] (DATAFU-119) New UDF - TupleDiff

2016-09-07 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15471164#comment-15471164
 ] 

Eyal Allweil commented on DATAFU-119:
-

Any feedback about this?

> New UDF - TupleDiff
> ---
>
> Key: DATAFU-119
> URL: https://issues.apache.org/jira/browse/DATAFU-119
> Project: DataFu
>  Issue Type: New Feature
>Reporter: Eyal Allweil
>Assignee: Eyal Allweil
>





[jira] [Commented] (DATAFU-119) New UDF - TupleDiff

2016-09-18 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15500764#comment-15500764
 ] 

Eyal Allweil commented on DATAFU-119:
-

I've run it on results that were in the tens of millions. I think the main 
reason for using it / including it in DataFu is that if you're developing Pig 
code, and running it on a cluster (or on any given environment), being able to 
stay in the Pig ecosystem is convenient for fast development cycles. If your 
original job can run on the given environment, a comparison job can run there 
efficiently, too. And there's less copying, because you leave the previous 
results in HDFS under a different name and compare easily.

The output is human-readable, but the expected result is that most records 
return null, because they're identical, and the ones that do come out are 
usually edge cases that turned out different.

That's the reasoning behind having "something" like this UDF. The output type 
and its lack of a schema are a different story - it would be better to have 
a schema. But I'm hesitant to spend the time to do it if it isn't likely that 
someone else will want to write a different output format for it.

> New UDF - TupleDiff
> ---
>
> Key: DATAFU-119
> URL: https://issues.apache.org/jira/browse/DATAFU-119
> Project: DataFu
>  Issue Type: New Feature
>Reporter: Eyal Allweil
>Assignee: Eyal Allweil
>





[jira] [Resolved] (DATAFU-122) Documentation error/typo on tips and tricks involving Coalesce

2016-10-12 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil resolved DATAFU-122.
-
Resolution: Fixed

> Documentation error/typo on tips and tricks involving Coalesce
> --
>
> Key: DATAFU-122
> URL: https://issues.apache.org/jira/browse/DATAFU-122
> Project: DataFu
>  Issue Type: Bug
>Reporter: Ryan Clough
>Assignee: Eyal Allweil
>Priority: Trivial
>  Labels: documentation, typo
> Fix For: 1.3.2
>
>
> http://datafu.incubator.apache.org/docs/datafu/guide/more-tips-and-tricks.html
> On this page, an example is given for Coalesce:
> {code}
> DEFINE EmptyBagToNullFields datafu.pig.util.Coalesce();
> data = FOREACH data GENERATE Coalesce(val,0) as result;
> {code}
> In this example, "EmptyBagToNullFields" should be replaced with "Coalesce", 
> which is what is used in the code following the DEFINE statement. My guess is 
> this is a copy-paste error from an example further down, where 
> EmptyBagToNullFields is actually used.





[jira] [Updated] (DATAFU-122) Documentation error/typo on tips and tricks involving Coalesce

2016-10-12 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil updated DATAFU-122:

 Assignee: Eyal Allweil
   Labels: documentation typo  (was: docuentation typo)
Fix Version/s: 1.3.2

Thanks Ryan! I've fixed this in our sources, and it will show up when we 
release our next version (probably 1.3.2)

> Documentation error/typo on tips and tricks involving Coalesce
> --
>
> Key: DATAFU-122
> URL: https://issues.apache.org/jira/browse/DATAFU-122
> Project: DataFu
>  Issue Type: Bug
>Reporter: Ryan Clough
>Assignee: Eyal Allweil
>Priority: Trivial
>  Labels: documentation, typo
> Fix For: 1.3.2
>
>
> http://datafu.incubator.apache.org/docs/datafu/guide/more-tips-and-tricks.html
> On this page, an example is given for Coalesce:
> {code}
> DEFINE EmptyBagToNullFields datafu.pig.util.Coalesce();
> data = FOREACH data GENERATE Coalesce(val,0) as result;
> {code}
> In this example, "EmptyBagToNullFields" should be replaced with "Coalesce", 
> which is what is used in the code following the DEFINE statement. My guess is 
> this is a copy-paste error from an example further down, where 
> EmptyBagToNullFields is actually used.





[jira] [Commented] (DATAFU-85) Add SPRINTF to provide this functionality to Pig < 0.14.0

2016-10-13 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-85?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15571787#comment-15571787
 ] 

Eyal Allweil commented on DATAFU-85:


Given the time that has passed, and that it can't be backported (easily), I 
think this issue can/should be closed.

> Add SPRINTF to provide this functionality to Pig < 0.14.0
> -
>
> Key: DATAFU-85
> URL: https://issues.apache.org/jira/browse/DATAFU-85
> Project: DataFu
>  Issue Type: Bug
>Reporter: Russell Jurney
>Assignee: Russell Jurney
>
> I need SPRINTF in DataFu for a book I'm working on. I'd like to add this to 
> DataFu so that CDH, HDP, MapR, etc. users can use SPRINTF as soon as DataFu 
> cuts a new release.
> See PIG-3939
> Thoughts?





[jira] [Commented] (DATAFU-28) Tests are too slow

2016-10-13 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-28?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15571961#comment-15571961
 ] 

Eyal Allweil commented on DATAFU-28:


On my machine the datafu-pig tests run in 18 minutes (I ran them with ./gradlew 
:datafu-pig:test). Is this issue still relevant, or is that an acceptable time?

> Tests are too slow
> --
>
> Key: DATAFU-28
> URL: https://issues.apache.org/jira/browse/DATAFU-28
> Project: DataFu
>  Issue Type: Improvement
>Reporter: Matthew Hayes
>
> I ran the tests on my laptop and it took nearly 2 hours.
> The worst offenders are {{datafu.test.pig.sampling}}, 
> {{datafu.test.pig.stats}}, and {{datafu.test.pig.stats.entropy}}.
> ||Package||Tests||Failures||Duration||Success rate||
> |datafu.test.pig.bags|27|0|1m10.72s|100%|
> |datafu.test.pig.geo|1|0|9.757s|100%|
> |datafu.test.pig.hash|4|0|41.039s|100%|
> |datafu.test.pig.linkanalysis|5|0|32.677s|100%|
> |datafu.test.pig.random|1|0|11.789s|100%|
> |datafu.test.pig.sampling|25|0|38m25.81s|100%|
> |datafu.test.pig.sessions|7|0|2m50.67s|100%|
> |datafu.test.pig.sets|9|0|5m46.70s|100%|
> |datafu.test.pig.stats|52|0|26m11.98s|100%|
> |datafu.test.pig.stats.entropy|40|0|31m30.97s|100%|
> |datafu.test.pig.urls|1|0|1m35.24s|100%|
> |datafu.test.pig.util|21|0|4m51.64s|100%|





[jira] [Commented] (DATAFU-81) Unit test failures on Java 1.8

2016-10-15 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-81?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15578653#comment-15578653
 ] 

Eyal Allweil commented on DATAFU-81:


These tests all pass for me with openjdk-8-amd64 on Ubuntu 16.04, both in 
Eclipse and in Gradle (I changed build.gradle and MultilineProcessor.java to 
test it, though).

> Unit test failures on Java 1.8
> --
>
> Key: DATAFU-81
> URL: https://issues.apache.org/jira/browse/DATAFU-81
> Project: DataFu
>  Issue Type: Bug
>Reporter: Matthew Hayes
>
> I have Java 8 installed and I noticed the following bag tests fail:
> * bagJoinFullOuterTest
> * bagLeftOuterJoinTest
> * distinctByMultiComplexFieldTest
> * duplicateAliasTest
> It seems like there may be an ordering assumption that doesn't work in Java 
> 8.  There may be other cases like this too.
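
If these failures do come from an ordering assumption, one way to make such tests robust is to compare expected and actual results as multisets rather than as ordered lists. The sketch below is plain Java with hypothetical names; it is not taken from the DataFu test suite and does not claim to be how the tests were eventually fixed.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: order-insensitive comparison of two result lists. */
final class BagAssertSketch {

    /** True when both lists hold the same elements with the same multiplicities, in any order. */
    static <T> boolean sameElementsIgnoringOrder(List<T> expected, List<T> actual) {
        return counts(expected).equals(counts(actual));
    }

    /** Counts how many times each element occurs. */
    private static <T> Map<T, Integer> counts(List<T> items) {
        Map<T, Integer> counts = new HashMap<>();
        for (T item : items) counts.merge(item, 1, Integer::sum);
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(sameElementsIgnoringOrder(
                Arrays.asList("(1,a)", "(2,b)"),
                Arrays.asList("(2,b)", "(1,a)"))); // true
    }
}
```

Counting occurrences instead of using a plain set keeps the check strict about duplicates, which matters for bags.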





[jira] [Updated] (DATAFU-65) Aho-Corasick Pig UDF

2016-10-18 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-65?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil updated DATAFU-65:
---
Issue Type: New Feature  (was: Bug)

> Aho-Corasick Pig UDF
> 
>
> Key: DATAFU-65
> URL: https://issues.apache.org/jira/browse/DATAFU-65
> Project: DataFu
>  Issue Type: New Feature
>Affects Versions: 1.3.0
> Environment: Drought
>Reporter: Russell Jurney
> Attachments: DATAFU-65.diff
>
>   Original Estimate: 8h
>  Remaining Estimate: 8h
>
> I need to use the Aho-Corasick algorithm for efficient sub-string matching. A 
> java implementation is available at 
> https://github.com/robert-bor/aho-corasick and is available on maven central: 
> http://maven-repository.com/artifact/org.arabidopsis.ahocorasick/ahocorasick/2.x
>  A Pig UDF will be very helpful to me.
> How do I add a maven dependency with gradle?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DATAFU-45) RFE: CartesianProduct

2016-10-18 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-45?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15584898#comment-15584898
 ] 

Eyal Allweil commented on DATAFU-45:


Hi Sam,

Did you ever solve this? I agree with Matthew that this should be doable via 
plain Pig - if not, I'd open a bug there.

> RFE: CartesianProduct
> -
>
> Key: DATAFU-45
> URL: https://issues.apache.org/jira/browse/DATAFU-45
> Project: DataFu
>  Issue Type: New Feature
>Reporter: Sam Steingold
>
> Given two bags, produce their [Cartesian 
> product|http://en.wikipedia.org/wiki/Cartesian_product]:
> {code}
> B1: bag{T1}
> B2: bag{T2}
> CartesianProduct(B1,B2): bag{(T1,T2)}
> {code}
> Use case:
> {code}
> toks = TOKENIZE((chararray)$0,',');
> kwds = CartesianProduct(toks, {1.0/(double)SIZE(toks)});
> {code}
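
For reference, the requested semantics (every element of the first bag paired with every element of the second) amount to a nested loop. The following plain-Java sketch models bags as lists and each output tuple as a two-element list; the names are hypothetical and this is not a Pig UDF.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: Cartesian product of two in-memory "bags". */
final class CartesianProductSketch {

    /** Pairs every element of b1 with every element of b2, in b1-major order. */
    static <A, B> List<List<Object>> product(List<A> b1, List<B> b2) {
        List<List<Object>> out = new ArrayList<>(b1.size() * b2.size());
        for (A x : b1) {
            for (B y : b2) {
                out.add(Arrays.<Object>asList(x, y));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Mirrors the use case above: each token paired with 1/SIZE(toks).
        List<String> toks = Arrays.asList("x", "y");
        System.out.println(product(toks, Arrays.asList(1.0 / toks.size()))); // [[x, 0.5], [y, 0.5]]
    }
}
```

The output has SIZE(B1) * SIZE(B2) tuples, so this only makes sense when both bags are small enough to hold in memory.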



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DATAFU-16) weighted reservoir sampling with exponential jumps UDF

2016-10-18 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-16?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15586085#comment-15586085
 ] 

Eyal Allweil commented on DATAFU-16:


It looks like this got added - can this issue be closed?

> weighted reservoir sampling with exponential jumps UDF
> --
>
> Key: DATAFU-16
> URL: https://issues.apache.org/jira/browse/DATAFU-16
> Project: DataFu
>  Issue Type: New Feature
> Environment: Mac, Linux
> pig-0.11
>Reporter: jian wang
>Assignee: jian wang
>Priority: Minor
> Attachments: ScoredExpJmpReservoir.java, ScoredReservoir.java, 
> WeightedSamplingCorrectnessTests.java
>
>
> Create a weightedReservoirSampleWithExpJump UDF to implement the weighted 
> reservoir sampling algorithm with exponential jumps. Investigation is tracked 
> in https://github.com/linkedin/datafu/issues/80. This task is part of an 
> experiment with different weighted sampling algorithms.
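
As background, weighted reservoir sampling with exponential jumps is usually described along the following lines (the variant often called A-ExpJ): keep the k items with the largest keys u^(1/w), and instead of drawing a key for every item, draw how much total weight to skip before the next replacement. The plain-Java sketch below follows that common description; it is an assumption that the attached ScoredExpJmpReservoir.java works the same way, and the class and method names here are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Random;

/** Illustrative only: weighted reservoir sampling with exponential jumps. */
final class WeightedReservoirSketch {

    /** A sampled item (by index) together with the random key that decides whether it stays. */
    private static final class Entry {
        final int index;
        final double key;
        Entry(int index, double key) { this.index = index; this.key = key; }
    }

    /** Returns the indices of up to k sampled items; all weights must be positive. */
    static List<Integer> sample(double[] weights, int k, Random rng) {
        // Min-heap on the key, so the weakest reservoir entry is always on top.
        PriorityQueue<Entry> heap = new PriorityQueue<>((a, b) -> Double.compare(a.key, b.key));
        int i = 0;
        // Fill the reservoir with the first k items, each keyed by u^(1/w).
        for (; i < weights.length && heap.size() < k; i++) {
            heap.add(new Entry(i, Math.pow(uniform(rng), 1.0 / weights[i])));
        }
        if (heap.isEmpty()) return new ArrayList<>();
        // Amount of weight to skip before the next replacement (the "jump").
        double jump = Math.log(uniform(rng)) / Math.log(heap.peek().key);
        for (; i < weights.length; i++) {
            jump -= weights[i];
            if (jump <= 0) {
                double w = weights[i];
                double t = Math.pow(heap.peek().key, w);
                double r = t + (1.0 - t) * uniform(rng); // uniform in (t, 1)
                heap.poll();                             // evict the smallest key
                heap.add(new Entry(i, Math.pow(r, 1.0 / w)));
                jump = Math.log(uniform(rng)) / Math.log(heap.peek().key);
            }
        }
        List<Integer> result = new ArrayList<>();
        for (Entry e : heap) result.add(e.index);
        return result;
    }

    /** Uniform draw from the open interval (0, 1), so the logarithm stays finite. */
    private static double uniform(Random rng) {
        double u;
        do { u = rng.nextDouble(); } while (u == 0.0);
        return u;
    }

    public static void main(String[] args) {
        double[] weights = {1, 1, 1, 10, 10, 1, 1, 1};
        System.out.println(sample(weights, 3, new Random(42)));
    }
}
```

The appeal of the jump is that the number of random draws depends on how often the reservoir changes rather than on the number of input items.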



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (DATAFU-25) AliasableEvalFunc should use getInputSchema

2016-10-19 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-25?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil updated DATAFU-25:
---
Attachment: DATAFU-25.patch

This is a minimal fix that uses getInputSchema() instead of the udf context, 
when possible.

It will also solve [DATAFU-6|https://issues.apache.org/jira/browse/DATAFU-6], 
though not the example provided there - BagLeftOuterJoin. This is because 
MonitoredUDF prevents the udf context from working (see 
[PIG-3554|https://issues.apache.org/jira/browse/PIG-3554], and BagJoin uses the 
udf context for other things, not just those that AliasableEvalFunc provides.

I couldn't think of a clean way of adding a test for this, but you can verify 
that it works by adding the MonitoredUDF annotation to TransposeTupleToBag - 
this will make the TransposeTest.transposeTest fail, unless my patch is used. I 
didn't want to expose a "fake" udf with the MonitoredUDF annotation, and adding 
it in the test package means that TransposeTest can't access it.
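
The underlying idea is that the alias-to-position map can be rebuilt from the input schema wherever it is needed, instead of being stored in and read back from the udf context. The plain-Java sketch below shows only that derivation: the list of field aliases stands in for what getInputSchema() would describe, no Pig classes are used, and the names are hypothetical rather than taken from the patch.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: deriving an alias map from a schema's field aliases. */
final class AliasMapSketch {

    /** Maps each field alias to its position; fields without an alias are skipped. */
    static Map<String, Integer> aliasPositions(List<String> fieldAliases) {
        Map<String, Integer> positions = new HashMap<>();
        for (int i = 0; i < fieldAliases.size(); i++) {
            String alias = fieldAliases.get(i);
            if (alias != null) positions.put(alias, i);
        }
        return positions;
    }

    public static void main(String[] args) {
        Map<String, Integer> positions = aliasPositions(Arrays.asList("id", null, "score"));
        System.out.println(positions.get("score")); // 2
    }
}
```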

> AliasableEvalFunc should use getInputSchema
> ---
>
> Key: DATAFU-25
> URL: https://issues.apache.org/jira/browse/DATAFU-25
> Project: DataFu
>  Issue Type: Improvement
>Reporter: Matthew Hayes
>Assignee: Will Vaughan
> Attachments: DATAFU-25.patch
>
>
> AliasableEvalFunc derives from ContextualEvalFunc and stores a map of aliases 
> in the UDF context.  We can instead use getInputSchema, which was added to 
> Pig 0.11.  This may in the process resolve DATAFU-6.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (DATAFU-25) AliasableEvalFunc should use getInputSchema

2016-10-19 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-25?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil reassigned DATAFU-25:
--

Assignee: Eyal Allweil  (was: Will Vaughan)

> AliasableEvalFunc should use getInputSchema
> ---
>
> Key: DATAFU-25
> URL: https://issues.apache.org/jira/browse/DATAFU-25
> Project: DataFu
>  Issue Type: Improvement
>Reporter: Matthew Hayes
>Assignee: Eyal Allweil
> Attachments: DATAFU-25.patch
>
>
> AliasableEvalFunc derives from ContextualEvalFunc and stores a map of aliases 
> in the UDF context.  We can instead use getInputSchema, which was added to 
> Pig 0.11.  This may in the process resolve DATAFU-6.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (DATAFU-25) AliasableEvalFunc should use getInputSchema

2016-10-19 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15589407#comment-15589407
 ] 

Eyal Allweil edited comment on DATAFU-25 at 10/19/16 6:01 PM:
--

This is a minimal fix that uses getInputSchema() instead of the udf context, 
when possible.

It will also solve [DATAFU-6|https://issues.apache.org/jira/browse/DATAFU-6], 
though not the example provided there - BagLeftOuterJoin. This is because 
MonitoredUDF prevents the udf context from working (see 
[PIG-3554|https://issues.apache.org/jira/browse/PIG-3554]), and BagJoin uses 
the udf context for other things, not just those that AliasableEvalFunc 
provides.

I couldn't think of a clean way of adding a test for this, but you can verify 
that it works by adding the MonitoredUDF annotation to TransposeTupleToBag - 
this will make the TransposeTest.transposeTest fail, unless my patch is used. I 
didn't want to expose a "fake" udf with the MonitoredUDF annotation, and adding 
it in the test package means that TransposeTest can't access it.


was (Author: eyal):
This is a minimal fix that uses getInputSchema() instead of the udf context, 
when possible.

It will also solve [DATAFU-6|https://issues.apache.org/jira/browse/DATAFU-6], 
though not the example provided there - BagLeftOuterJoin. This is because 
MonitoredUDF prevents the udf context from working (see 
[PIG-3554|https://issues.apache.org/jira/browse/PIG-3554], and BagJoin uses the 
udf context for other things, not just those that AliasableEvalFunc provides.

I couldn't think of a clean way of adding a test for this, but you can verify 
that it works by adding the MonitoredUDF annotation to TransposeTupleToBag - 
this will make the TransposeTest.transposeTest fail, unless my patch is used. I 
didn't want to expose a "fake" udf with the MonitoredUDF annotation, and adding 
it in the test package means that TransposeTest can't access it.

> AliasableEvalFunc should use getInputSchema
> ---
>
> Key: DATAFU-25
> URL: https://issues.apache.org/jira/browse/DATAFU-25
> Project: DataFu
>  Issue Type: Improvement
>Reporter: Matthew Hayes
>Assignee: Will Vaughan
> Attachments: DATAFU-25.patch
>
>
> AliasableEvalFunc derives from ContextualEvalFunc and stores a map of aliases 
> in the UDF context.  We can instead use getInputSchema, which was added to 
> Pig 0.11.  This may in the process resolve DATAFU-6.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DATAFU-6) MonitoredUDF annotation does not work with AliasableEvalFunc

2016-10-20 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-6?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15591185#comment-15591185
 ] 

Eyal Allweil commented on DATAFU-6:
---

One caveat about the description: adding the annotation to BagLeftOuterJoin is 
not a good way to test the problem in AliasableEvalFunc, because 
BagLeftOuterJoin depends on other values in the UDF context, and because of 
PIG-3554 the context doesn't work. To test this specific case, it's better to 
put the MonitoredUDF annotation on TransposeTupleToBag, because it has no 
dependencies on the UDF context other than those provided by AliasableEvalFunc.

> MonitoredUDF annotation does not work with AliasableEvalFunc
> 
>
> Key: DATAFU-6
> URL: https://issues.apache.org/jira/browse/DATAFU-6
> Project: DataFu
>  Issue Type: Bug
>Reporter: Matthew Hayes
>
> This was reported by seregasheypak on GitHub 
> (https://github.com/linkedin/datafu/issues/89).  We were able to reproduce 
> this by adding the annotation to BagLeftOuterJoin and running its tests.  
> Simply adding the annotation causes problems.  In 
> ContextualEvalFunc.getContextProperties, the properties retrieved for the 
> class are empty.
> seregasheypak:
> Hi, If I use
> {code}
> @MonitoredUDF(timeUnit = TimeUnit.MINUTES, duration = 10, errorCallback = 
> NplRecMatcherErrorCallback.class)
> class NplRecFirstLevelMatcher extends AliasableEvalFunc implements 
> DebuggableUDF{
> //some cool stuff goes here!
> }
> {code}
> I get an exception:
> {noformat}
> 14/01/15 23:52:52 ERROR udf.NplRecFirstLevelMatcher: Class: class 
> NplRecFirstLevelMatcher
> 14/01/15 23:52:52 ERROR udf.NplRecFirstLevelMatcher: Instance name: 30
> 14/01/15 23:52:52 ERROR udf.NplRecFirstLevelMatcher: Properties: {30={}}
> *** ***A debug output from my handler method***  ***
> NplRecMatcherErrorCallback.handleError
> null
> ERROR: java.lang.RuntimeException: Could not retrieve aliases from properties 
> using aliasMap
> java.util.concurrent.ExecutionException: java.lang.RuntimeException: Could 
> not retrieve aliases from properties using aliasMap
>   at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:232)
>   at java.util.concurrent.FutureTask.get(FutureTask.java:91)
>   at 
> com.google.common.util.concurrent.ForwardingFuture.get(ForwardingFuture.java:69)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.util.MonitoredUDFExecutor.monitorExec(MonitoredUDFExecutor.java:183)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:335)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:376)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:354)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:372)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:297)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:308)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:241)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:308)
>   at 
> org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFilter.getNext(POFilter.java:95)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.runPipeline(PigGenericMapReduce.java:465)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.processOnePackageOutput(PigGenericMapReduce.java:433)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:413)
>   at 
> org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapReduce$Reduce.reduce(PigGenericMapReduce.java:257)
>   at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:164)
>   at 
> org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:610)
>   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:444)
>   at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:449)
> Caused by: java.lang.RuntimeException: Could not retrieve aliases from 
> properties using aliasMap
>   at 
> datafu.pig.util.AliasableEvalFunc.getFieldAliases(AliasableEvalFunc.java:164)
>   at 
> datafu.pig.util.AliasableEvalFunc.getPosit

[jira] [Commented] (DATAFU-9) Add datafu.text.ToJson UDF to serialize any relation/field as a JSON String

2016-10-20 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-9?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15591983#comment-15591983
 ] 

Eyal Allweil commented on DATAFU-9:
---

Hi Russell, are you still interested in including this in DataFu? It doesn't 
look like there are intractable problems left, and DataFu is now based on a Pig 
version recent enough to support BigInteger and BigDecimal.

>  Add datafu.text.ToJson UDF to serialize any relation/field as a JSON String
> 
>
> Key: DATAFU-9
> URL: https://issues.apache.org/jira/browse/DATAFU-9
> Project: DataFu
>  Issue Type: New Feature
>Reporter: Russell Jurney
> Attachments: DATAFU-9.patch
>
>
> See https://github.com/linkedin/datafu/issues/91



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DATAFU-98) New UDF for Histogram / Frequency counting

2016-10-25 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-98?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15605952#comment-15605952
 ] 

Eyal Allweil commented on DATAFU-98:


Hi Russell.

First of all, I want to apologize for the time it's taken us to get to your 
contribution. I think it could be quite useful. Having said that, I wonder if 
the current version - without counters - gives us enough of an advantage over 
vanilla Pig. I think the following code (modified from your unit test) gives us 
nearly the same functionality as the UDF in the patch:

{noformat}
data_in = LOAD 'input' as (val:int);
-- data_in: "0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "20"

intermediate_data = FOREACH data_in GENERATE val, (val / 5 * 5) AS binStart;

data_out = FOREACH (GROUP intermediate_data BY binStart) GENERATE group AS 
binStart, COUNT(intermediate_data) AS binCount;
-- data_out: (0,5),(5,5),(10,2),(20,1)

{noformat}

Unlike your UDF, missing bins are not included. But while including missing 
bins can be useful, I do wonder if a single skewed value can cause problems, 
especially with small bin sizes and long values. (As a performance-related 
aside, I would try to call FrequencyCounter.toBag() only in the Final 
implementation, rather than in the first two stages of the algebraic 
implementation, to minimize the data copied.)
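
To make the skew concern concrete, here is a minimal Java sketch. It is not the code from the DATAFU-98 patch; the class and method names are hypothetical. It mirrors the `val / 5 * 5` binning from the Pig snippet above and compares the number of bins actually populated with the number a dense histogram (one that emits missing bins) would have to produce.

```java
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only; names are hypothetical and not taken from the patch. */
final class BinningSketch {

    /** Start of the bin containing val. For non-negative input this equals val / binSize * binSize. */
    static long binStart(long val, long binSize) {
        return Math.floorDiv(val, binSize) * binSize;
    }

    /** Counts per populated bin only, like the GROUP BY version: missing bins are absent. */
    static Map<Long, Long> sparseCounts(long[] values, long binSize) {
        Map<Long, Long> counts = new TreeMap<>();
        for (long v : values) {
            counts.merge(binStart(v, binSize), 1L, Long::sum);
        }
        return counts;
    }

    /** Number of bins a dense histogram needs to cover the whole range, empty bins included. */
    static long denseBinCount(long[] values, long binSize) {
        if (values.length == 0) {
            return 0;
        }
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        for (long v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        return (binStart(max, binSize) - binStart(min, binSize)) / binSize + 1;
    }
}
```

With the unit-test data above and a bin size of 5, the sparse version holds four bins while a dense one needs five. Adding a single outlier such as 1000000000 leaves the sparse map at five entries but pushes the dense bin count to roughly two hundred million, which is the situation I am worried about.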

So it seems to me the current UDF has the advantage of having the missing bins, 
and it's obviously more readable and convenient than rewriting the Pig code I 
wrote above. Did you (or you, [~andrew.musselman]) run any performance tests? 
Maybe the Algebraic implementation runs faster than the vanilla Pig code by 
virtue of the combiner use.

Last (but not least!) the version you mentioned with counters sounds like it 
could be really great.


> New UDF for Histogram / Frequency counting
> --
>
> Key: DATAFU-98
> URL: https://issues.apache.org/jira/browse/DATAFU-98
> Project: DataFu
>  Issue Type: New Feature
>Reporter: Russell Melick
> Attachments: DATAFU-98.patch
>
>
> I was thinking of creating a new UDF to compute histograms / frequency counts 
> of input bags.  It seems like it would make sense to support ints, longs, 
> float, and doubles.  
> I tried looking around to see if this was already implemented, but 
> ValueHistogram and AggregateWordHistogram were about the only things I found. 
>  They seem to exist as an example job, and only work for Strings.
> https://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/lib/aggregate/ValueHistogram.html
> https://hadoop.apache.org/docs/r1.2.1/api/org/apache/hadoop/examples/AggregateWordHistogram.html
> Should the user specify the bin size or the number of bins?  Specifying bin 
> size probably makes the implementation simpler since you can bin things 
> without having seen all of the data.
> I think it would make sense to implement a version of this that didn't need 
> any reducers.  It could use counters to keep track of the counts per bin 
> without sending any data to a reducer.  You would be able to call this 
> without a preceding GROUP BY as well.
> Here's my proposal for the two udfs.  This assumes the input data is two 
> columns, memberId and numConnections.
> {code}
> DEFINE BinnedFrequency datafu.pig.stats.BinnedFrequency('min=0;binSize=50');
> connections = LOAD 'connections' AS (memberId, numConnections);
> connectionHistogram = FOREACH (GROUP connections ALL) GENERATE 
> BinnedFrequency(connections.numConnections);
> {code}
> The output here would be a bag with the frequency counts
> {code}
> {('0-49', 5), ('50-99', 0), ('100-149', 10)}
> {code}
> {code}
> DEFINE BinnedFrequencyCounter 
> datafu.pig.stats.BinnedFrequencyCounter('min=0;binSize=50;name=numConnectionsHistogram');
> connections = LOAD 'connections' AS (memberId, numConnections);
> connections = FOREACH connections GENERATE 
> BinnedFrequencyCounter(numConnections);
> {code}
> The output here would just be a counter for each bin, all sharing the same 
> group of numConnectionsHistogram.  It would look something like
> numConnectionsHistogram.'0-49' = 5
> numConnectionsHistogram.'50-99' = 0
> numConnectionsHistogram.'100-149' = 10





[jira] [Commented] (DATAFU-87) Edit distance

2016-10-25 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-87?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15606106#comment-15606106
 ] 

Eyal Allweil commented on DATAFU-87:


Hi Joydeep,

I want to begin by apologizing for the time it's taken us to get to your 
contribution. Did you ever continue with it? Have you compared your 
implementation with [the one in Apache Commons 
Text|https://github.com/apache/commons-text/blob/master/src/main/java/org/apache/commons/text/similarity/LevenshteinDistance.java]
 or [Commons 
Lang|https://github.com/apache/commons-lang/blob/master/src/main/java/org/apache/commons/lang3/StringUtils.java#L7731]?
 (I think they follow the same algorithm, described in _Algorithms on Strings, 
Trees and Sequences_ by Dan Gusfield, using Chas Emerick's implementation.)
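
For comparison purposes, this is a minimal sketch of the textbook dynamic-programming Levenshtein distance, keeping only two rows of the table. It is not the code from the DATAFU-87 patch and not the Commons implementation; the class and method names are hypothetical.

```java
/** Illustrative two-row Levenshtein distance; names are hypothetical. */
final class EditDistanceSketch {

    /** Minimum number of single-character insertions, deletions and substitutions turning a into b. */
    static int levenshtein(CharSequence a, CharSequence b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];

        // Distance from the empty prefix of a to each prefix of b is the prefix length.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // deleting i characters of a yields the empty prefix of b
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                int substitution = prev[j - 1] + cost;
                curr[j] = Math.min(Math.min(insertion, deletion), substitution);
            }
            // Swap rows so the row just computed becomes the previous row.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }
}
```

A sketch like this can serve as a reference when checking that a UDF and the Commons implementations agree on the same inputs.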

> Edit distance
> -
>
> Key: DATAFU-87
> URL: https://issues.apache.org/jira/browse/DATAFU-87
> Project: DataFu
>  Issue Type: New Feature
>Affects Versions: 1.3.0
>Reporter: Joydeep Banerjee
> Attachments: DATAFU-87.patch
>
>
> [This is work-in-progress]
> Given 2 strings, provide a measure of dis-similarity (Levenshtein distance) 
> between them.





[jira] [Commented] (DATAFU-25) AliasableEvalFunc should use getInputSchema

2016-10-25 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15607593#comment-15607593
 ] 

Eyal Allweil commented on DATAFU-25:


I think that if we want to keep the UDF context as fallback, we need to keep 
the code that sets it.

I'll add a comment explaining the logic behind the two methods. Should I commit 
it afterwards?

> AliasableEvalFunc should use getInputSchema
> ---
>
> Key: DATAFU-25
> URL: https://issues.apache.org/jira/browse/DATAFU-25
> Project: DataFu
>  Issue Type: Improvement
>Reporter: Matthew Hayes
>Assignee: Eyal Allweil
> Attachments: DATAFU-25.patch
>
>
> AliasableEvalFunc derives from ContextualEvalFunc and stores a map of aliases 
> in the UDF context.  We can instead use getInputSchema, which was added to 
> Pig 0.11.  This may in the process resolve DATAFU-6.





[jira] [Updated] (DATAFU-25) AliasableEvalFunc should use getInputSchema

2016-10-26 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-25?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil updated DATAFU-25:
---
Fix Version/s: 1.3.2

> AliasableEvalFunc should use getInputSchema
> ---
>
> Key: DATAFU-25
> URL: https://issues.apache.org/jira/browse/DATAFU-25
> Project: DataFu
>  Issue Type: Improvement
>Reporter: Matthew Hayes
>Assignee: Eyal Allweil
> Fix For: 1.3.2
>
> Attachments: DATAFU-25.patch
>
>
> AliasableEvalFunc derives from ContextualEvalFunc and stores a map of aliases 
> in the UDF context.  We can instead use getInputSchema, which was added to 
> Pig 0.11.  This may in the process resolve DATAFU-6.





[jira] [Resolved] (DATAFU-25) AliasableEvalFunc should use getInputSchema

2016-10-26 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-25?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil resolved DATAFU-25.

Resolution: Fixed

This means that DATAFU-6 should probably be closed too.

> AliasableEvalFunc should use getInputSchema
> ---
>
> Key: DATAFU-25
> URL: https://issues.apache.org/jira/browse/DATAFU-25
> Project: DataFu
>  Issue Type: Improvement
>Reporter: Matthew Hayes
>Assignee: Eyal Allweil
> Fix For: 1.3.2
>
> Attachments: DATAFU-25.patch
>
>
> AliasableEvalFunc derives from ContextualEvalFunc and stores a map of aliases 
> in the UDF context.  We can instead use getInputSchema, which was added to 
> Pig 0.11.  This may in the process resolve DATAFU-6.





[jira] [Commented] (DATAFU-83) InUDF does not validate that types are compatible

2016-10-26 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-83?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15609317#comment-15609317
 ] 

Eyal Allweil commented on DATAFU-83:


Hi [~sonalit],

I don't see your review board request. Can you check that it's associated with 
the DataFu group, or attach your updated patch? This seems like a bug worth 
fixing, even if Pig already has its own [IN 
operator|https://pig.apache.org/docs/r0.14.0/basic.html#boolops].
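
The mismatch this issue describes (a long value compared against int literals with Java equals) can be reproduced outside Pig. The following is a minimal sketch with hypothetical names; it is not the InUDF source, and where the real UDF would perform such a check (for example while computing the output schema) is an assumption left open here.

```java
import java.util.Objects;

/** Illustrative only; not the InUDF implementation. */
final class InTypeCheckSketch {

    /** Membership test using equals alone: a Long never equals an Integer, even for the same number. */
    static boolean naiveIn(Object value, Object... candidates) {
        for (Object candidate : candidates) {
            if (Objects.equals(value, candidate)) {
                return true;
            }
        }
        return false;
    }

    /** Same test, but failing loudly when a candidate's class differs from the value's class. */
    static boolean checkedIn(Object value, Object... candidates) {
        for (Object candidate : candidates) {
            if (value != null && candidate != null
                    && !value.getClass().equals(candidate.getClass())) {
                throw new IllegalArgumentException(
                    "Cannot compare " + value.getClass().getSimpleName()
                        + " with " + candidate.getClass().getSimpleName());
            }
        }
        return naiveIn(value, candidates);
    }
}
```

Calling `naiveIn(1L, 1, 2, 3)` silently returns false, which is the confusing behavior reported, while `checkedIn(1L, 1, 2, 3)` throws and `checkedIn(1L, 1L, 2L, 3L)` returns true.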

> InUDF does not validate that types are compatible
> -
>
> Key: DATAFU-83
> URL: https://issues.apache.org/jira/browse/DATAFU-83
> Project: DataFu
>  Issue Type: Improvement
>Reporter: Matthew Hayes
>Priority: Minor
> Attachments: DATAFU-83.patch
>
>
> See the example below.  The input data is a long, but ints are provided to 
> match against.  Because it uses the Java equals to compare and these are 
> different types, this will never match, which can lead to confusing results.  
> I believe it should at least throw an error.
> {code}
>   define I datafu.pig.util.InUDF();
>   
>   data = LOAD 'input' AS (B: bag {T: tuple(v:LONG)});
>   
>   data2 = FOREACH data {
>     C = FILTER B BY I(v, 1,2,3);
>     GENERATE C;
>   }
>   
>   describe data2;
>   
>   STORE data2 INTO 'output';
> {code}





[jira] [Commented] (DATAFU-48) Upgrade Guava to 17.0

2016-10-27 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-48?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15611419#comment-15611419
 ] 

Eyal Allweil commented on DATAFU-48:


What was the symptom of the problem with Guava 17.0? I updated it to 17.0 and 
ran _gradlew clean test_, and all the pig and hourglass tests passed. In fact, 
I also tried updating to Guava 19.0 (the latest version) and that built 
successfully, too.

> Upgrade Guava to 17.0
> -
>
> Key: DATAFU-48
> URL: https://issues.apache.org/jira/browse/DATAFU-48
> Project: DataFu
>  Issue Type: Improvement
>Reporter: Philip (flip) Kromer
>Assignee: Philip (flip) Kromer
>Priority: Minor
>  Labels: build, dependency, guava, version
> Attachments: 0001-DATAFU-48-Upgrade-gradle-to-17.0.patch
>
>
> Specifically motivated by the improvements to hashing library, but also 
> because we're six versions behind at the moment.





[jira] [Commented] (DATAFU-41) BagGroup does not name bag field in some cases

2016-10-28 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15616338#comment-15616338
 ] 

Eyal Allweil commented on DATAFU-41:


I'm trying to see if I understand what's happening here. Do you mean replacing 
the line

{noformat}
data3 = FOREACH data2 GENERATE group as id, BagGroup(data,data.key) as grouped;
{noformat}

with

{noformat}
data3 = FOREACH data2 GENERATE group as id, BagGroup(data.(key,val),data.key) 
as grouped;
{noformat}

When I do this, I do get the schema for data3 as described above, without a 
name for the grouped data that BagGroup returns. But is this really a bug? The 
UDF receives a bag without a name as input, so what name can it give? The name 
_data_ isn't passed to the UDF at all in this case. (I checked this by debugging 
and inspecting the input schema in _BagGroup.getOutputSchema()_.)


> BagGroup does not name bag field in some cases
> --
>
> Key: DATAFU-41
> URL: https://issues.apache.org/jira/browse/DATAFU-41
> Project: DataFu
>  Issue Type: Bug
>Reporter: Matthew Hayes
>
> For this test:
> {code}
> /**
>   define BagSum datafu.pig.bags.BagSum();
>   define BagGroup datafu.pig.bags.BagGroup();
>   
>   data = LOAD 'input' USING PigStorage(',') AS (id:int, key:chararray, 
> val:int);
>   describe data;
>   
>   data2 = GROUP data BY id;
>   
>   describe data2;
>   
>   data3 = FOREACH data2 GENERATE group as id, BagGroup(data,data.key) as 
> grouped;
>   
>   describe data3;
>   
>   data4 = FOREACH data3 {
>     summed = FOREACH grouped GENERATE group as key, SUM($1.val) as total;
>     ordered = ORDER summed BY key;
>     GENERATE id, ordered;
>   }
>   
>   describe data4;
>   
>   STORE data4 INTO 'output';
>*/
>   @Multiline
>   private String bagSumTest;
>   
>   @Test
>   public void bagSumTest() throws Exception
>   {
>     PigTest test = createPigTestFromString(bagSumTest);
>     writeLinesToFile("input", "1,A,1","1,B,2","2,A,3","3,A,4","1,C,5","1,C,6",
>                      "3,A,7","2,B,8","1,A,9","2,A,10");
>     test.runScript();
>     assertOutput(test, "data4", 
>                  "(1,{(A,10),(B,2),(C,11)})",
>                  "(2,{(A,13),(B,8)})",
>                  "(3,{(A,11)})");
>   }
> {code}
> {{data3}} is described as:
> {code}
> data3: {id: int,grouped: {(group: chararray,data: {(id: int,key: 
> chararray,val: int)})}}
> {code}
> However, if we change {{data}} to {{data.(key,val)}} then {{data3}} is 
> described as:
> {code}
> data3: {id: int,grouped: {(group: chararray,{(key: chararray,val: int)})}}
> {code}
> Note that there is no name, so you have to reference it by {{$1}}.  There is 
> a separate issue, DATAFU-40, where even when it has the name {{data}} you 
> can run into problems later.





[jira] [Created] (DATAFU-123) Allow DataFu to include macros

2017-01-02 Thread Eyal Allweil (JIRA)
Eyal Allweil created DATAFU-123:
---

 Summary: Allow DataFu to include macros 
 Key: DATAFU-123
 URL: https://issues.apache.org/jira/browse/DATAFU-123
 Project: DataFu
  Issue Type: Improvement
Reporter: Eyal Allweil
Assignee: Eyal Allweil


A few changes to allow macros to be contributed to DataFu. If a macro file is 
placed in src/main/resources, it can be used by registering the DataFu jar. 
Such macros can then be tested both from within Eclipse and Gradle.

There are three small parts:

1) All unit tests that use createPigTest methods will automatically register 
the DataFu jar.

2) Some changes to fix the PigTests.getJarPath() functionality, which doesn't 
appear to work. (these changes are aligned with the proposed patch for 
[DATAFU-106|https://issues.apache.org/jira/browse/DATAFU-106])

3) A sample macro and test

The changes here will allow moving forward with 
[DATAFU-61|https://issues.apache.org/jira/browse/DATAFU-61] and including the 
macro I suggested for 
[DATAFU-119|https://issues.apache.org/jira/browse/DATAFU-119]. (I also have 
additional content in mind)





[jira] [Commented] (DATAFU-119) New UDF - TupleDiff

2017-01-02 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15793097#comment-15793097
 ] 

Eyal Allweil commented on DATAFU-119:
-

If we add DATAFU-123, we can include the macro I put in the description so that 
people can use it instead of duplicating it in order to conveniently call the 
UDF.

> New UDF - TupleDiff
> ---
>
> Key: DATAFU-119
> URL: https://issues.apache.org/jira/browse/DATAFU-119
> Project: DataFu
>  Issue Type: New Feature
>Reporter: Eyal Allweil
>Assignee: Eyal Allweil
>
> A UDF that given two tuples, prints out the differences between them in 
> human-readable form. This is not meant for production - we use it in PayPal 
> for regression tests, to compare the results of two runs. Differences are 
> calculated based on position, but the tuples' schemas are used, if available, 
> for displaying more friendly results. If no schema is available the output 
> uses field numbers.
> It should be used when you want a more fine-grained description of what has 
> changed, unlike 
> [org.apache.pig.builtin.DIFF|https://pig.apache.org/docs/r0.14.0/func.html#diff].
>  Also, because DIFF takes as its input two bags to be compared, they must fit 
> in memory. This UDF only takes one pair of tuples at a time, so it can run on 
> large inputs.
> We use a macro much like the following in conjunction with this UDF:
> {noformat}
> DEFINE diff_macro(diff_macro_old, diff_macro_new, diff_macro_pk, 
> diff_macro_ignored_field) returns diffs {
>   DEFINE TupleDiff datafu.pig.util.TupleDiff;
>   
>   old =   FOREACH $diff_macro_old GENERATE $diff_macro_pk, TOTUPLE(*) AS 
> original;
>   new =   FOREACH $diff_macro_new GENERATE $diff_macro_pk, TOTUPLE(*) AS 
> original;
>   
>   join_data = JOIN new BY $diff_macro_pk full, old BY $diff_macro_pk;
>   
>   join_data = FOREACH join_data GENERATE TupleDiff(old::original, 
> new::original, '$diff_macro_ignored_field') AS tupleDiff, old::original, 
> new::original;
>   
>   $diffs = FILTER join_data BY tupleDiff IS NOT NULL ;
> };
> {noformat}
> Currently, the output from the macro looks like this (when comma-separated):
> {noformat}
> added,,
> missing,,
> changed field2 field4,,
> {noformat}
> The UDF takes a variable number of parameters - the two tuples to be 
> compared, and any number of field names or numbers to be ignored. We use this 
> to ignore fields representing execution or creation time (the macro I've 
> given as an example assumes only one ignored field)
> The current implementation "drills down" into tuples, but not bags or maps - 
> tuple boundaries are indicated with parentheses, like this:
> {noformat}
> changed outerEmbeddedTuple(innerEmbeddedTuple(fieldNameThatIsDifferent) 
> innerEmbeddedTuple(anotherFieldThatIsDifferent))
> {noformat}
> I have a few final things left to do and then I'll put it up on reviewboard.
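
The positional comparison described above can be sketched in plain Java, with lists standing in for Pig tuples. This is not the TupleDiff source: the names are hypothetical, nested tuples are not handled, and the mapping of "added" to a missing old tuple and "missing" to a missing new tuple is an assumption based on the macro's argument order. Zero-based field numbers for schema-less input are also an assumption.

```java
import java.util.List;
import java.util.Objects;
import java.util.Set;
import java.util.StringJoiner;

/** Illustrative only; lists stand in for Pig tuples. */
final class TupleDiffSketch {

    /**
     * Returns null when the tuples match, "added" or "missing" when one side is absent,
     * or "changed" followed by the names of the differing fields.
     */
    static String diff(List<?> oldTuple, List<?> newTuple,
                       List<String> fieldNames, Set<String> ignoredFields) {
        if (oldTuple == null && newTuple == null) {
            return null;
        }
        if (oldTuple == null) {
            return "added";   // assumption: present only in the new run
        }
        if (newTuple == null) {
            return "missing"; // assumption: present only in the old run
        }

        StringJoiner changed = new StringJoiner(" ");
        int size = Math.max(oldTuple.size(), newTuple.size());
        for (int i = 0; i < size; i++) {
            // Use the schema name when available, otherwise fall back to the field number.
            String name = (fieldNames != null && i < fieldNames.size())
                ? fieldNames.get(i)
                : String.valueOf(i);
            if (ignoredFields.contains(name)) {
                continue;
            }
            Object oldValue = i < oldTuple.size() ? oldTuple.get(i) : null;
            Object newValue = i < newTuple.size() ? newTuple.get(i) : null;
            if (!Objects.equals(oldValue, newValue)) {
                changed.add(name);
            }
        }
        return changed.length() == 0 ? null : "changed " + changed;
    }
}
```

Passing the name of a timestamp-like field in `ignoredFields` reproduces the use case mentioned above, where execution or creation times differ between runs but should not count as a regression.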





[jira] [Commented] (DATAFU-106) Test files should be created in a subfolder of projects

2017-01-11 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15817823#comment-15817823
 ] 

Eyal Allweil commented on DATAFU-106:
-

[~takias], I will try to sort our Jira issues out and mark those that are 
easier to begin with. Have you worked on Pig UDFs before?

Piyush - I will try to finish our review as soon as I can!

> Test files should be created in a subfolder of projects
> ---
>
> Key: DATAFU-106
> URL: https://issues.apache.org/jira/browse/DATAFU-106
> Project: DataFu
>  Issue Type: Improvement
>Reporter: Matthew Hayes
>Priority: Minor
> Fix For: 1.3.1
>
>
> Test files are currently created directly in the subproject folder (e.g. 
> datafu-pig/input*).  For better organization, we should create them in a 
> subdirectory.  This also makes it easier to exclude them all with gitignore.





[jira] [Updated] (DATAFU-123) Allow DataFu to include macros

2017-03-08 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil updated DATAFU-123:

Attachment: DATAFU-123.patch

The change ended up being smaller than what I originally described - all I did 
was add the "pig.import.search.path" property with the value of the 
src/main/resources directory to PigTests.
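
As a rough illustration of that change, the sketch below builds a properties object carrying the search path. Only the property name and the src/main/resources location come from the patch description; the class and method names are hypothetical, and how PigTests actually hands the properties to Pig is not shown here.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

/** Illustrative only; not the code from DATAFU-123.patch. */
final class MacroSearchPathSketch {

    /** Copies the given properties and points Pig's macro import search path at src/main/resources. */
    static Properties withMacroSearchPath(Properties base, Path projectDir) {
        Properties props = new Properties();
        props.putAll(base);
        Path resources = projectDir.resolve(Paths.get("src", "main", "resources"));
        props.setProperty("pig.import.search.path", resources.toAbsolutePath().toString());
        return props;
    }
}
```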

This means that any macro files that are put there can be tested, both in 
Gradle and Eclipse. I put some sample counting macros there and a test for them.

In general, any macro file placed in src/main/resources can be used by 
registering the DataFu jar.

If we include this patch, we should update the Contributing page so that 
instructions for contributing Pig macros are easy to find and understand.

> Allow DataFu to include macros 
> ---
>
> Key: DATAFU-123
> URL: https://issues.apache.org/jira/browse/DATAFU-123
> Project: DataFu
>  Issue Type: Improvement
>Reporter: Eyal Allweil
>Assignee: Eyal Allweil
>  Labels: testability
> Attachments: DATAFU-123.patch
>
>
> A few changes to allow macros to be contributed to DataFu. If a macro file is 
> placed in src/main/resources, it can be used by registering the DataFu jar. 
> Such macros can then be tested both from within Eclipse and Gradle.
> There are three small parts:
> 1) All unit tests that use createPigTest methods will automatically register 
> the DataFu jar.
> 2) Some changes to fix the PigTests.getJarPath() functionality, which doesn't 
> appear to work. (these changes are aligned with the proposed patch for 
> [DATAFU-106|https://issues.apache.org/jira/browse/DATAFU-106])
> 3) A sample macro and test
> The changes here will allow moving forward with 
> [DATAFU-61|https://issues.apache.org/jira/browse/DATAFU-61] and including the 
> macro I suggested for 
> [DATAFU-119|https://issues.apache.org/jira/browse/DATAFU-119]. (I also have 
> additional content in mind)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (DATAFU-12) Implement Lead UDF based on version from SQL

2017-04-18 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-12?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15972397#comment-15972397
 ] 

Eyal Allweil commented on DATAFU-12:


It looks like this functionality is implemented in Hive - see the following two 
links:

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics#LanguageManualWindowingAndAnalytics-LEADusingdefault1rowleadandnotspecifyingdefaultvalue

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFLead.java

Since Pig now supports using Hive UDFs, I think this Jira can be closed. 
Alternatively, if we want to provide a DataFu implementation, I'll copy the 
proposed patch and discussion from the GitHub issue mentioned in the 
description, so it's easier for a potential implementer to continue where the 
work stalled.

> Implement Lead UDF based on version from SQL
> 
>
> Key: DATAFU-12
> URL: https://issues.apache.org/jira/browse/DATAFU-12
> Project: DataFu
>  Issue Type: New Feature
>Reporter: Matthew Hayes
>
> Min Zhou has provided this suggestion ([Issue #88 on 
> GitHub|https://github.com/linkedin/datafu/pull/88]):
> Lead is an analytic function like Oracle's Lead function. It provides access 
> to more than one tuple of a bag at the same time without a self join. Given a 
> bag of tuple returned from a query, LEAD provides access to a tuple at a 
> given physical offset beyond that position. Generates pairs of all items in a 
> bag.
> If you do not specify offset, then its default is 1. Null is returned if the 
> offset goes beyond the scope of the bag.
> Example 1:
> {noformat}
>register ba-pig-0.1.jar
>define Lead datafu.pig.bags.Lead('2');
>-- INPUT: ({(1),(2),(3),(4)})
>data = LOAD 'input' AS (data: bag {T: tuple(v:INT)});
>describe data;
>-- OUTPUT:  ({((1),(2),(3)),((2),(3),(4)),((3),(4),),((4),,)})
>-- OUTPUT SCHEMA: data2: {lead_data: {(elem0: (v: int),elem1: (v: 
> int),elem2: (v: int))}}
>data2 = FOREACH data GENERATE Lead(data);
>describe data2;
>DUMP data2;
> {noformat}
> Example 2
> {noformat}
>register  ba-pig-0.1.jar
>define Lead datafu.pig.bags.Lead();
>-- INPUT: 
> ({(10,{(1),(2),(3)}),(20,{(4),(5),(6)}),(30,{(7),(8)}),(40,{(9),(10),(11)}),(50,{(12),(13),(14),(15)})})
>data = LOAD 'input' AS (data: bag {T: tuple(v1:INT,B: bag{T: 
> tuple(v2:INT)})});
>--describe data;
>-- OUTPUT: 
> ({((10,{(1),(2),(3)}),(20,{(4),(5),(6)})),((20,{(4),(5),(6)}),(30,{(7),(8)})),((30,{(7),(8)}),(40,{(9),(10),(11)})),((40,{(9),(10),(11)}),(50,{(12),(13),(14),(15)})),((50,{(12),(13),(14),(15)}),)})
>data2 = FOREACH data GENERATE Lead(data);
>--describe data2;
>DUMP data2;
> {noformat}
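
The semantics shown in the two examples above can be sketched in plain Java, with a list standing in for the bag. This is not the code from the proposed patch; the class and method names are hypothetical. Each output row holds the item at a position followed by the next `offset` items, with null where the offset runs past the end of the bag.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only; a list stands in for a Pig bag. */
final class LeadSketch {

    /** For every position, returns the item there plus the following `offset` items, padded with null. */
    static <T> List<List<T>> lead(List<T> bag, int offset) {
        if (offset < 0) {
            throw new IllegalArgumentException("offset must be non-negative");
        }
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < bag.size(); i++) {
            List<T> row = new ArrayList<>(offset + 1);
            for (int k = 0; k <= offset; k++) {
                int index = i + k;
                row.add(index < bag.size() ? bag.get(index) : null);
            }
            out.add(row);
        }
        return out;
    }
}
```

With an offset of 2 and the items 1, 2, 3, 4, the rows follow the pattern of the OUTPUT comment in Example 1; with the default offset of 1, each row is a pair, as in Example 2.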





[jira] [Commented] (DATAFU-124) sessionize() ought to support millisecond periods

2017-06-29 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16067884#comment-16067884
 ] 

Eyal Allweil commented on DATAFU-124:
-

I reviewed it - looks fine, a nice improvement. I'll try to get it committed 
soon (unless of course someone has any actionable comments)

> sessionize() ought to support millisecond periods
> -
>
> Key: DATAFU-124
> URL: https://issues.apache.org/jira/browse/DATAFU-124
> Project: DataFu
>  Issue Type: Bug
>Reporter: Jacob Tolar
>
> The sessionize UDF should support a period in milliseconds.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (DATAFU-124) sessionize() ought to support millisecond periods

2017-07-12 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16084560#comment-16084560
 ] 

Eyal Allweil commented on DATAFU-124:
-

Committed. Thanks [~jtolar]!

> sessionize() ought to support millisecond periods
> -
>
> Key: DATAFU-124
> URL: https://issues.apache.org/jira/browse/DATAFU-124
> Project: DataFu
>  Issue Type: Bug
>Reporter: Jacob Tolar
>
> The sessionize UDF should support a period in milliseconds.





[jira] [Commented] (DATAFU-124) sessionize() ought to support millisecond periods

2017-07-13 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16086145#comment-16086145
 ] 

Eyal Allweil commented on DATAFU-124:
-

I'd like to resolve this issue, but it looks like version 1.3.2 isn't marked as 
released, and 1.3.3 doesn't exist yet (in Jira).

[~matterhayes] - do I have permissions to do this? Do you know how to "release" 
versions in Jira?

> sessionize() ought to support millisecond periods
> -
>
> Key: DATAFU-124
> URL: https://issues.apache.org/jira/browse/DATAFU-124
> Project: DataFu
>  Issue Type: Bug
>Reporter: Jacob Tolar
>
> The sessionize UDF should support a period in milliseconds.





[jira] [Resolved] (DATAFU-124) sessionize() ought to support millisecond periods

2017-07-14 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil resolved DATAFU-124.
-
   Resolution: Fixed
 Assignee: Eyal Allweil
Fix Version/s: 1.3.3

> sessionize() ought to support millisecond periods
> -
>
> Key: DATAFU-124
> URL: https://issues.apache.org/jira/browse/DATAFU-124
> Project: DataFu
>  Issue Type: Bug
>Reporter: Jacob Tolar
>Assignee: Eyal Allweil
> Fix For: 1.3.3
>
>
> The sessionize UDF should support a period in milliseconds.





[jira] [Commented] (DATAFU-83) InUDF does not validate that types are compatible

2017-07-31 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-83?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16106894#comment-16106894
 ] 

Eyal Allweil commented on DATAFU-83:


Hi Kyle ([~ItsAUsernameRight?])

Your help is very welcome. I have two comments about the state of the 
contribution - I'll put them both here and in the review board for maximum 
visibility.

1. I think the output schema of this UDF is always boolean, not the schema of 
the first input field. I would make the outputSchema method identical to that 
in an existing Boolean UDF - for example, [Pig's ENDSWITH built-in 
function|https://github.com/apache/pig/blob/trunk/src/org/apache/pig/builtin/ENDSWITH.java#L62]

2. As Matthew already wrote in the review board, adding a case to the unit test 
is a good idea - you can probably just duplicate something from [the existing 
test|https://github.com/apache/incubator-datafu/blob/master/datafu-pig/src/test/java/datafu/test/pig/util/InTests.java].

Thanks!

> InUDF does not validate that types are compatible
> -
>
> Key: DATAFU-83
> URL: https://issues.apache.org/jira/browse/DATAFU-83
> Project: DataFu
>  Issue Type: Improvement
>Reporter: Matthew Hayes
>Priority: Minor
> Attachments: DATAFU-83.patch, rb36702.patch
>
>
> See the example below.  The input data is a long, but ints are provided to 
> match against.  Because it uses the Java equals to compare and these are 
> different types, this will never match, which can lead to confusing results.  
> I believe it should at least throw an error.
> {code}
>   define I datafu.pig.util.InUDF();
>   
>   data = LOAD 'input' AS (B: bag {T: tuple(v:LONG)});
>   
>   data2 = FOREACH data {
>     C = FILTER B BY I(v, 1,2,3);
>     GENERATE C;
>   }
>   
>   describe data2;
>   
>   STORE data2 INTO 'output';
> {code}
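
One way to picture the proposed validation is the Python sketch below. It models Pig type names as plain strings and raises an error when the candidate values' types differ from the type of the field being tested. The function name `check_in_types` and the string type names are hypothetical and do not reflect InUDF's actual code.

```python
def check_in_types(field_type, candidate_types):
    """Raise TypeError if any candidate's declared type differs from the field's.

    Types are modeled as Pig-style names ("int", "long", ...). Hypothetical
    helper for illustration; not part of DataFu.
    """
    mismatched = [t for t in candidate_types if t != field_type]
    if mismatched:
        raise TypeError(
            "IN candidates of type %s cannot match a field of type %s"
            % (sorted(set(mismatched)), field_type)
        )


# Mirrors the script above: v is a long, the literals 1,2,3 are ints.
try:
    check_in_types("long", ["int", "int", "int"])
    error = None
except TypeError as exc:
    error = str(exc)
```

Failing at this point, rather than silently never matching, is the behavior the issue asks for; whether the real fix should throw or coerce the values is left open here.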





[jira] [Commented] (DATAFU-119) New UDF - TupleDiff

2017-08-04 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16114867#comment-16114867
 ] 

Eyal Allweil commented on DATAFU-119:
-

Sure. I'll do it as soon as I can.

> New UDF - TupleDiff
> ---
>
> Key: DATAFU-119
> URL: https://issues.apache.org/jira/browse/DATAFU-119
> Project: DataFu
>  Issue Type: New Feature
>Reporter: Eyal Allweil
>Assignee: Eyal Allweil
>
> A UDF that, given two tuples, prints out the differences between them in 
> human-readable form. This is not meant for production - we use it at PayPal 
> for regression tests, to compare the results of two runs. Differences are 
> calculated based on position, but the tuples' schemas are used, if available, 
> to display friendlier results. If no schema is available, the output 
> uses field numbers.
> It should be used when you want a more fine-grained description of what has 
> changed, unlike 
> [org.apache.pig.builtin.DIFF|https://pig.apache.org/docs/r0.14.0/func.html#diff].
>  Also, because DIFF takes as its input two bags to be compared, they must fit 
> in memory. This UDF only takes one pair of tuples at a time, so it can run on 
> large inputs.
> We use a macro much like the following in conjunction with this UDF:
> {noformat}
> DEFINE diff_macro(diff_macro_old, diff_macro_new, diff_macro_pk, diff_macro_ignored_field) returns diffs {
>   DEFINE TupleDiff datafu.pig.util.TupleDiff;
>   
>   old = FOREACH $diff_macro_old GENERATE $diff_macro_pk, TOTUPLE(*) AS original;
>   new = FOREACH $diff_macro_new GENERATE $diff_macro_pk, TOTUPLE(*) AS original;
>   
>   join_data = JOIN new BY $diff_macro_pk full, old BY $diff_macro_pk;
>   
>   join_data = FOREACH join_data GENERATE TupleDiff(old::original, new::original, '$diff_macro_ignored_field') AS tupleDiff, old::original, new::original;
>   
>   $diffs = FILTER join_data BY tupleDiff IS NOT NULL;
> };
> {noformat}
> Currently, the output from the macro looks like this (when comma-separated):
> {noformat}
> added,,
> missing,,
> changed field2 field4,,
> {noformat}
> The UDF takes a variable number of parameters - the two tuples to be 
> compared, and any number of field names or numbers to be ignored. We use this 
> to ignore fields representing execution or creation time (the macro I've 
> given as an example assumes only one ignored field).
> The current implementation "drills down" into tuples, but not bags or maps - 
> tuple boundaries are indicated with parentheses, like this:
> {noformat}
> changed outerEmbeddedTuple(innerEmbeddedTuple(fieldNameThatIsDifferent) innerEmbeddedTuple(anotherFieldThatIsDifferent))
> {noformat}
> I have a few final things left to do and then I'll put it up on reviewboard.
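
The behavior described above can be sketched in Python as follows. This is an illustrative model, not the UDF's code: the function names, the way nested schemas are passed as `(name, [inner names])` pairs, and the choice to return `None` when nothing differs are all assumptions. It compares fields by position, uses names when given, skips ignored fields, and descends into nested tuples but not other containers.

```python
def tuple_diff(old, new, names=None, ignored=()):
    """Describe positional differences between two tuples.

    names   - optional field names (the "schema"); a nested tuple's names are
              given as (name, [inner names]). Falls back to field numbers.
    ignored - field names or numbers to skip.
    Returns None when nothing differs. Illustrative only; not DataFu's code.
    """
    if old is None:
        return "added"
    if new is None:
        return "missing"
    changed = _changed_fields(old, new, names, set(ignored))
    return "changed " + " ".join(changed) if changed else None


def _changed_fields(old, new, names, ignored):
    changed = []
    for i in range(max(len(old), len(new))):
        entry = names[i] if names and i < len(names) else str(i)
        name, inner = entry if isinstance(entry, tuple) else (entry, None)
        if name in ignored or i in ignored:
            continue
        a = old[i] if i < len(old) else None
        b = new[i] if i < len(new) else None
        if isinstance(a, tuple) and isinstance(b, tuple):
            # Drill down into nested tuples and wrap their diffs in parentheses.
            nested = _changed_fields(a, b, inner, ignored)
            if nested:
                changed.append("%s(%s)" % (name, " ".join(nested)))
        elif a != b:
            changed.append(name)
    return changed


names = ["field1", "field2", "field3", "field4", "updated"]
old_row = (1, "a", "x", 10, 999)
new_row = (1, "b", "x", 20, 1000)
summary = tuple_diff(old_row, new_row, names, ignored=("updated",))
```

Here `summary` is `"changed field2 field4"`: the `updated` column differs too, but it is listed as ignored, which matches the use case of skipping execution or creation timestamps.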





[jira] [Commented] (DATAFU-119) New UDF - TupleDiff

2017-08-06 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16115744#comment-16115744
 ] 

Eyal Allweil commented on DATAFU-119:
-

[~matterhayes] - We want the Apache license header on our macro files too, 
right? If so, I'll add it to the sample macro from 
[DATAFU-123|https://issues.apache.org/jira/browse/DATAFU-123] as well.

> New UDF - TupleDiff
> ---
>
> Key: DATAFU-119
> URL: https://issues.apache.org/jira/browse/DATAFU-119
> Project: DataFu
>  Issue Type: New Feature
>Reporter: Eyal Allweil
>Assignee: Eyal Allweil
>





[jira] [Updated] (DATAFU-61) Add TF-IDF Macro to DataFu

2017-08-06 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-61?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil updated DATAFU-61:
---
Attachment: DATAFU-61-2.patch

Now that macros are supported (and can be tested), I updated this patch. 
Unfortunately, I couldn't find the sample data, so I just pulled the sample 
sentences from the Wikipedia page for TF-IDF, and I didn't verify that the 
results are OK. [~russell.jurney] - want to donate a test case and expected 
results?

> Add TF-IDF Macro to DataFu
> --
>
> Key: DATAFU-61
> URL: https://issues.apache.org/jira/browse/DATAFU-61
> Project: DataFu
>  Issue Type: New Feature
>Affects Versions: 1.3.0
>Reporter: Russell Jurney
> Attachments: DATAFU-61-2.patch, DATAFU-61.patch, DATAFU-61.patch
>
>
> The first macro I would like to add is a Term Frequency, Inverse Document 
> Frequency implementation.





[jira] [Commented] (DATAFU-61) Add TF-IDF Macro to DataFu

2017-09-11 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16161118#comment-16161118
 ] 

Eyal Allweil commented on DATAFU-61:


Came back to this today and tried a little experiment - I verified (calculating 
manually) that Russell's code produces the same results as the "augmented 
TF" IDF flavor for the sample I took from the Wikipedia page. Is that good 
enough for us?
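
For readers who want to repeat that kind of manual check, here is a small Python sketch of an "augmented TF" times IDF calculation. It is an assumption-laden illustration rather than the macro's logic: the natural logarithm, the `log(N / df)` form of IDF, whitespace tokenization, and the two sample sentences are all choices made here, and the macro may differ in any of them.

```python
import math
from collections import Counter


def augmented_tf_idf(documents):
    """Return {doc_index: {term: score}} using augmented TF and log(N / df).

    Illustrative sketch; the natural log and this exact IDF form are
    assumptions, not necessarily what the macro computes.
    Documents must be non-empty strings.
    """
    counts = [Counter(doc.split()) for doc in documents]
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for c in counts for term in c)
    scores = {}
    for i, c in enumerate(counts):
        max_count = max(c.values())
        scores[i] = {
            # Augmented TF scales the raw count by the document's most frequent term.
            term: (0.5 + 0.5 * n / max_count) * math.log(n_docs / df[term])
            for term, n in c.items()
        }
    return scores


docs = ["this is a a sample", "this is another another example example example"]
scores = augmented_tf_idf(docs)
```

A term that appears in every document, such as "this", scores zero because its IDF is `log(1)`, while "example" is both the most frequent term in its document and unique to it, so it gets the highest score.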

> Add TF-IDF Macro to DataFu
> --
>
> Key: DATAFU-61
> URL: https://issues.apache.org/jira/browse/DATAFU-61
> Project: DataFu
>  Issue Type: New Feature
>Affects Versions: 1.3.0
>Reporter: Russell Jurney
> Attachments: DATAFU-61-2.patch, DATAFU-61.patch, DATAFU-61.patch
>
>
> The first macro I would like to add is a Term Frequency, Inverse Document 
> Frequency implementation.





[jira] [Commented] (DATAFU-119) New UDF - TupleDiff

2017-09-11 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16161198#comment-16161198
 ] 

Eyal Allweil commented on DATAFU-119:
-

I added the macro to the jar.

> New UDF - TupleDiff
> ---
>
> Key: DATAFU-119
> URL: https://issues.apache.org/jira/browse/DATAFU-119
> Project: DataFu
>  Issue Type: New Feature
>Reporter: Eyal Allweil
>Assignee: Eyal Allweil
>





[jira] [Commented] (DATAFU-83) InUDF does not validate that types are compatible

2017-09-11 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-83?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16161211#comment-16161211
 ] 

Eyal Allweil commented on DATAFU-83:


By the way, [~ItsAUsernameRight?], if you're already looking at InUDF, and 
you'd like another contribution afterwards, you can also look at 
[DATAFU-80|https://issues.apache.org/jira/browse/DATAFU-80] - it's another 
small change to improve InUDF's behavior. (you can ignore the second part of 
that issue, which deals with Java versions).


> InUDF does not validate that types are compatible
> -
>
> Key: DATAFU-83
> URL: https://issues.apache.org/jira/browse/DATAFU-83
> Project: DataFu
>  Issue Type: Improvement
>Reporter: Matthew Hayes
>Priority: Minor
> Attachments: DATAFU-83.patch, rb36702.patch
>
>
> See the example below.  The input data is a long, but ints are provided to 
> match against.  Because it uses the Java equals to compare and these are 
> different types, this will never match, which can lead to confusing results.  
> I believe it should at least throw an error.
> {code}
>   define I datafu.pig.util.InUDF();
>   
>   data = LOAD 'input' AS (B: bag {T: tuple(v:LONG)});
>   
>   data2 = FOREACH data {
>     C = FILTER B BY I(v, 1,2,3);
>     GENERATE C;
>   }
>   
>   describe data2;
>   
>   STORE data2 INTO 'output';
> {code}





[jira] [Assigned] (DATAFU-126) There is a typo in document

2017-09-11 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil reassigned DATAFU-126:
---

Assignee: Eyal Allweil

> There is a typo in document
> ---
>
> Key: DATAFU-126
> URL: https://issues.apache.org/jira/browse/DATAFU-126
> Project: DataFu
>  Issue Type: Improvement
>Reporter: Kane
>Assignee: Eyal Allweil
>Priority: Minor
>
> It should be "functions" not "functiosn" in the document page 
> https://datafu.incubator.apache.org/docs/datafu/guide.html.





[jira] [Commented] (DATAFU-126) There is a typo in document

2017-09-11 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16161252#comment-16161252
 ] 

Eyal Allweil commented on DATAFU-126:
-

Thanks Kane! I've fixed this in our sources, and it will show up when we 
release our next version.

> There is a typo in document
> ---
>
> Key: DATAFU-126
> URL: https://issues.apache.org/jira/browse/DATAFU-126
> Project: DataFu
>  Issue Type: Improvement
>Reporter: Kane
>Assignee: Eyal Allweil
>Priority: Minor
>
> It should be "functions" not "functiosn" in the document page 
> https://datafu.incubator.apache.org/docs/datafu/guide.html.





[jira] [Resolved] (DATAFU-126) There is a typo in document

2017-09-11 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil resolved DATAFU-126.
-
Resolution: Fixed

> There is a typo in document
> ---
>
> Key: DATAFU-126
> URL: https://issues.apache.org/jira/browse/DATAFU-126
> Project: DataFu
>  Issue Type: Improvement
>Reporter: Kane
>Assignee: Eyal Allweil
>Priority: Minor
>
> It should be "functions" not "functiosn" in the document page 
> https://datafu.incubator.apache.org/docs/datafu/guide.html.





[jira] [Created] (DATAFU-127) New macro - samply by keys

2017-09-12 Thread Eyal Allweil (JIRA)
Eyal Allweil created DATAFU-127:
---

 Summary: New macro - samply by keys
 Key: DATAFU-127
 URL: https://issues.apache.org/jira/browse/DATAFU-127
 Project: DataFu
  Issue Type: New Feature
Reporter: Eyal Allweil
Assignee: Eyal Allweil


Two macros that return a sample of a larger table based on a list of keys, with 
the schema of the larger table. One of the macros filters by dates, the other 
doesn't.

If there are multiple rows with a key that appears in the key list, all of them 
will be returned (no deduplication is done). The results are returned ordered 
by the key field in a single file.

The implementation uses a replicated join for efficiency, but this means the 
key list must be small enough to fit in memory.

The first macro's definition looks as follows:

DEFINE sample_by_keys(table, sample_set, join_key_table, join_key_sample) returns out {

- table_name      - table name to sample
- sample_set      - a set of keys
- join_key_table  - join column name in the table
- join_key_sample - join column name in the sample
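
The idea behind the macro can be modeled in a few lines of Python. This sketch is not the macro's Pig code: rows are represented as dictionaries, and the in-memory `set` stands in for the replicated side of the join, which is why the key list has to fit in memory. The parameter names follow the definition above; everything else is an assumption.

```python
def sample_by_keys(table, sample_set, join_key_table, join_key_sample):
    """Return rows of `table` whose key appears in `sample_set`, ordered by key.

    table and sample_set are lists of dicts. The key list is loaded into an
    in-memory set, which is the idea behind a replicated join: the small side
    must fit in memory. Illustrative only; not the macro's Pig code.
    """
    keys = {row[join_key_sample] for row in sample_set}
    # No deduplication: every row with a matching key is kept.
    matches = [row for row in table if row[join_key_table] in keys]
    return sorted(matches, key=lambda row: row[join_key_table])


table = [
    {"id": 3, "value": "c"},
    {"id": 1, "value": "a1"},
    {"id": 2, "value": "b"},
    {"id": 1, "value": "a2"},
]
wanted = [{"key": 1}, {"key": 3}]
sample = sample_by_keys(table, wanted, "id", "key")
```

Both rows with `id` 1 come back, followed by the row with `id` 3, and each keeps the full schema of the larger table.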







[jira] [Updated] (DATAFU-127) New macro - samply by keys

2017-09-12 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil updated DATAFU-127:

Attachment: DATAFU-127.patch

Patch including new macros and tests

> New macro - samply by keys
> --
>
> Key: DATAFU-127
> URL: https://issues.apache.org/jira/browse/DATAFU-127
> Project: DataFu
>  Issue Type: New Feature
>Reporter: Eyal Allweil
>Assignee: Eyal Allweil
>  Labels: macro
> Attachments: DATAFU-127.patch
>
>
> Two macros that return a sample of a larger table based on a list of keys, 
> with the schema of the larger table. One of the macros filters by dates, the 
> other doesn't.
> If there are multiple rows with a key that appears in the key list, all of 
> them will be returned (no deduplication is done). The results are returned 
> ordered by the key field in a single file.
> The implementation uses a replicated join for efficiency, but this means the 
> key list must be small enough to fit in memory.
> The first macro's definition looks as follows:
> DEFINE sample_by_keys(table, sample_set, join_key_table, join_key_sample) 
> returns out {
> - table_name  - table name to sample
> - sample_set  - a set of keys
> - join_key_table  - join column name in the table
> - join_key_sample - join column name in the sample





[jira] [Updated] (DATAFU-61) Add TF-IDF Macro to DataFu

2017-09-12 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-61?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil updated DATAFU-61:
---
Labels: macro  (was: )

> Add TF-IDF Macro to DataFu
> --
>
> Key: DATAFU-61
> URL: https://issues.apache.org/jira/browse/DATAFU-61
> Project: DataFu
>  Issue Type: New Feature
>Affects Versions: 1.3.0
>Reporter: Russell Jurney
>  Labels: macro
> Attachments: DATAFU-61-2.patch, DATAFU-61.patch, DATAFU-61.patch
>
>
> The first macro I would like to add is a Term Frequency, Inverse Document 
> Frequency implementation.





[jira] [Created] (DATAFU-128) Add documentation for macros

2017-09-12 Thread Eyal Allweil (JIRA)
Eyal Allweil created DATAFU-128:
---

 Summary: Add documentation for macros
 Key: DATAFU-128
 URL: https://issues.apache.org/jira/browse/DATAFU-128
 Project: DataFu
  Issue Type: Improvement
Reporter: Eyal Allweil


Now that it is possible to add Pig macros to Datafu, we should update the 
documentation to reflect this, and provide guidelines and point would-be 
contributors to examples.





[jira] [Created] (DATAFU-129) New macro - dedup

2017-09-12 Thread Eyal Allweil (JIRA)
Eyal Allweil created DATAFU-129:
---

 Summary: New macro - dedup
 Key: DATAFU-129
 URL: https://issues.apache.org/jira/browse/DATAFU-129
 Project: DataFu
  Issue Type: New Feature
Reporter: Eyal Allweil
Assignee: Eyal Allweil


Macro used to dedup (de-duplicate) a table, based on a key or keys and an 
ordering (typically a date updated field).

One thing to consider - the implementation relies on the 
ExtremalTupleByNthField UDF in PiggyBank. I've added it to the test 
dependencies in order for the test to run. While I feel that anyone using Pig 
typically has PiggyBank in the classpath, this might not be true - do we have 
an alternative? (maybe adding it to the jarjar?)

The macro's definition looks as follows:

DEFINE dedup(relation, row_key, order_field) returns out {

relation - relation to dedup
row_key - field(s) for group by
order_field - the field for ordering (to find the most recent record)
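
As a rough model of what the macro is meant to do, the following Python sketch keeps one row per key: the one with the largest value in the ordering field. It does not use or imitate ExtremalTupleByNthField; representing rows as dictionaries and passing `row_key` as a list of field names are assumptions made for the example.

```python
def dedup(relation, row_key, order_field):
    """Keep, for each key, the row with the largest order_field value.

    relation is a list of dicts and row_key a list of field names.
    Illustrative only; not the macro's Pig code.
    """
    latest = {}
    for row in relation:
        key = tuple(row[k] for k in row_key)
        if key not in latest or row[order_field] > latest[key][order_field]:
            latest[key] = row
    return list(latest.values())


rows = [
    {"id": 1, "name": "old", "updated": "2017-01-01"},
    {"id": 2, "name": "only", "updated": "2017-03-01"},
    {"id": 1, "name": "new", "updated": "2017-02-01"},
]
deduped = dedup(rows, ["id"], "updated")
```

For `id` 1 only the row updated on 2017-02-01 survives. Ties on the ordering field keep the first row seen in this sketch; how the macro breaks ties is not stated above.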






[jira] [Updated] (DATAFU-129) New macro - dedup

2017-09-12 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil updated DATAFU-129:

Attachment: DATAFU-129.patch

Macro and test

> New macro - dedup
> -
>
> Key: DATAFU-129
> URL: https://issues.apache.org/jira/browse/DATAFU-129
> Project: DataFu
>  Issue Type: New Feature
>Reporter: Eyal Allweil
>Assignee: Eyal Allweil
>  Labels: macro
> Attachments: DATAFU-129.patch
>
>
> Macro used to dedup (de-duplicate) a table, based on a key or keys and an 
> ordering (typically a date updated field).
> One thing to consider - the implementation relies on the 
> ExtremalTupleByNthField UDF in PiggyBank. I've added it to the test 
> dependencies in order for the test to run. While I feel that anyone using Pig 
> typically has PiggyBank in the classpath, this might not be true - do we have 
> an alternative? (maybe adding it to the jarjar?)
> The macro's definition looks as follows:
> DEFINE dedup(relation, row_key, order_field) returns out {
> relation - relation to dedup
> row_key - field(s) for group by
> order_field - the field for ordering (to find the most recent record)





[jira] [Commented] (DATAFU-128) Add documentation for macros

2017-09-12 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16162936#comment-16162936
 ] 

Eyal Allweil commented on DATAFU-128:
-

Is the documentation for updating the website accurate? There are references to 
svn in there, which leads me to think it might not be relevant anymore ...

> Add documentation for macros
> 
>
> Key: DATAFU-128
> URL: https://issues.apache.org/jira/browse/DATAFU-128
> Project: DataFu
>  Issue Type: Improvement
>Reporter: Eyal Allweil
>
> Now that it is possible to add Pig macros to Datafu, we should update the 
> documentation to reflect this, and provide guidelines and point would-be 
> contributors to examples.





[jira] [Created] (DATAFU-130) Add left outer join macro described in the DataFu guide

2017-09-12 Thread Eyal Allweil (JIRA)
Eyal Allweil created DATAFU-130:
---

 Summary: Add left outer join macro described in the DataFu guide
 Key: DATAFU-130
 URL: https://issues.apache.org/jira/browse/DATAFU-130
 Project: DataFu
  Issue Type: New Feature
Reporter: Eyal Allweil


In our 
[guide|http://datafu.incubator.apache.org/blog/2013/09/04/datafu-1-0.html], a 
macro is described for making a three-way left outer join conveniently. We can 
add this macro to DataFu to make it even easier to use.

The macro's code is as follows:


{noformat}
DEFINE left_outer_join(relation1, key1, relation2, key2, relation3, key3) returns joined {
  cogrouped = COGROUP $relation1 BY $key1, $relation2 BY $key2, $relation3 BY $key3;
  $joined = FOREACH cogrouped GENERATE
    FLATTEN($relation1),
    FLATTEN(EmptyBagToNullFields($relation2)),
    FLATTEN(EmptyBagToNullFields($relation3));
}
{noformat}

(We would obviously want to add a test for this, too.)
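
To make the cogroup-then-flatten pattern concrete, here is a Python model of the same three-way left outer join. It is a sketch under stated assumptions, not a port of the macro: rows are plain tuples, keys are given as column positions, and the `width2`/`width3` arguments are hypothetical stand-ins for the schema that tells EmptyBagToNullFields how many null fields to produce.

```python
from collections import defaultdict
from itertools import product


def group_by(relation, key_index):
    """Group rows (tuples) by the value at key_index, like one side of a COGROUP."""
    groups = defaultdict(list)
    for row in relation:
        groups[row[key_index]].append(row)
    return groups


def empty_bag_to_null_fields(bag, width):
    """Replace an empty bag with one all-None tuple so flattening keeps the row."""
    return bag if bag else [(None,) * width]


def left_outer_join(rel1, key1, rel2, key2, width2, rel3, key3, width3):
    g1, g2, g3 = group_by(rel1, key1), group_by(rel2, key2), group_by(rel3, key3)
    joined = []
    # Keys missing from rel1 have an empty first bag, so flattening drops them.
    for key, bag1 in g1.items():
        bag2 = empty_bag_to_null_fields(g2.get(key, []), width2)
        bag3 = empty_bag_to_null_fields(g3.get(key, []), width3)
        for a, b, c in product(bag1, bag2, bag3):
            joined.append(a + b + c)
    return joined


users = [(1, "ann"), (2, "bob")]
orders = [(1, "book")]
visits = [(2, "home"), (2, "cart"), (9, "orphan")]
joined = left_outer_join(users, 0, orders, 0, 2, visits, 0, 2)
```

Every user appears at least once, padded with `None` where a side has no match, and the visit with key 9 is dropped because no user has that key.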







[jira] [Commented] (DATAFU-61) Add TF-IDF Macro to DataFu

2017-09-13 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16164373#comment-16164373
 ] 

Eyal Allweil commented on DATAFU-61:


One last thing - I noticed after I uploaded my patch that it has my email, but 
I think it would be better for it to have your email, [~russell.jurney], since 
all I did was write the test. Is it OK if I replace my email with yours 
before committing this, so we get a (more accurate) "eyal committed with 
russell" type commit?

> Add TF-IDF Macro to DataFu
> --
>
> Key: DATAFU-61
> URL: https://issues.apache.org/jira/browse/DATAFU-61
> Project: DataFu
>  Issue Type: New Feature
>Affects Versions: 1.3.0
>Reporter: Russell Jurney
>  Labels: macro
> Attachments: DATAFU-61-2.patch, DATAFU-61.patch, DATAFU-61.patch
>
>
> The first macro I would like to add is a Term Frequency, Inverse Document 
> Frequency implementation.





[jira] [Updated] (DATAFU-119) New UDF - TupleDiff

2017-09-14 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil updated DATAFU-119:

Attachment: DATAFU-119-2.patch

> New UDF - TupleDiff
> ---
>
> Key: DATAFU-119
> URL: https://issues.apache.org/jira/browse/DATAFU-119
> Project: DataFu
>  Issue Type: New Feature
>Reporter: Eyal Allweil
>Assignee: Eyal Allweil
> Attachments: DATAFU-119-2.patch
>
>





[jira] [Commented] (DATAFU-119) New UDF - TupleDiff

2017-09-14 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16165925#comment-16165925
 ] 

Eyal Allweil commented on DATAFU-119:
-

The documentation can be part of 
[DATAFU-128|https://issues.apache.org/jira/browse/DATAFU-128].

> New UDF - TupleDiff
> ---
>
> Key: DATAFU-119
> URL: https://issues.apache.org/jira/browse/DATAFU-119
> Project: DataFu
>  Issue Type: New Feature
>Reporter: Eyal Allweil
>Assignee: Eyal Allweil
> Attachments: DATAFU-119-2.patch
>
>





[jira] [Commented] (DATAFU-61) Add TF-IDF Macro to DataFu

2017-09-14 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16165991#comment-16165991
 ] 

Eyal Allweil commented on DATAFU-61:


Yes, I'll merge it.

I did respond to an open issue in the review request that I only just noticed, 
something about using COUNT vs. SUM when calculating the IDF part ... as far as 
I can tell, the existing code is OK but it wouldn't hurt if you or Russell want 
to take a look at it.

> Add TF-IDF Macro to DataFu
> --
>
> Key: DATAFU-61
> URL: https://issues.apache.org/jira/browse/DATAFU-61
> Project: DataFu
>  Issue Type: New Feature
>Affects Versions: 1.3.0
>Reporter: Russell Jurney
>  Labels: macro
> Attachments: DATAFU-61-2.patch, DATAFU-61.patch, DATAFU-61.patch
>
>
> The first macro I would like to add is a Term Frequency, Inverse Document 
> Frequency implementation.





[jira] [Resolved] (DATAFU-61) Add TF-IDF Macro to DataFu

2017-09-14 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-61?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil resolved DATAFU-61.

Resolution: Fixed
  Assignee: Eyal Allweil

Merged.

> Add TF-IDF Macro to DataFu
> --
>
> Key: DATAFU-61
> URL: https://issues.apache.org/jira/browse/DATAFU-61
> Project: DataFu
>  Issue Type: New Feature
>Affects Versions: 1.3.0
>Reporter: Russell Jurney
>Assignee: Eyal Allweil
>  Labels: macro
> Attachments: DATAFU-61-2.patch, DATAFU-61.patch, DATAFU-61.patch
>
>
> The first macro I would like to add is a Term Frequency, Inverse Document 
> Frequency implementation.





[jira] [Commented] (DATAFU-130) Add left outer join macro described in the DataFu guide

2017-09-17 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16169225#comment-16169225
 ] 

Eyal Allweil commented on DATAFU-130:
-

I think this is a good Jira issue to put in the [Apache Help Wanted 
site|https://helpwanted.apache.org/]. If there's no objection, I'll add it 
there.

> Add left outer join macro described in the DataFu guide
> ---
>
> Key: DATAFU-130
> URL: https://issues.apache.org/jira/browse/DATAFU-130
> Project: DataFu
>  Issue Type: New Feature
>Reporter: Eyal Allweil
>  Labels: macro, newbie
>
> In our 
> [guide|http://datafu.incubator.apache.org/blog/2013/09/04/datafu-1-0.html], a 
> macro is described for making a three-way left outer join conveniently. We 
> can add this macro to DataFu to make it even easier to use.
> The macro's code is as follows:
> {noformat}
> DEFINE left_outer_join(relation1, key1, relation2, key2, relation3, key3) 
> returns joined {
>   cogrouped = COGROUP $relation1 BY $key1, $relation2 BY $key2, $relation3 BY 
> $key3;
>   $joined = FOREACH cogrouped GENERATE
> FLATTEN($relation1),
> FLATTEN(EmptyBagToNullFields($relation2)),
> FLATTEN(EmptyBagToNullFields($relation3));
> }
> (we would obviously want to add a test for this, too)
> {noformat}





[jira] [Updated] (DATAFU-130) Add left outer join macro described in the DataFu guide

2017-09-19 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil updated DATAFU-130:

Description: 
In our 
[guide|http://datafu.incubator.apache.org/blog/2013/09/04/datafu-1-0.html], a 
macro is described for making a three-way left outer join conveniently. We can 
add this macro to DataFu to make it even easier to use.

The macro's code is as follows:


{noformat}
DEFINE left_outer_join(relation1, key1, relation2, key2, relation3, key3) 
returns joined {
  cogrouped = COGROUP $relation1 BY $key1, $relation2 BY $key2, $relation3 BY 
$key3;
  $joined = FOREACH cogrouped GENERATE
FLATTEN($relation1),
FLATTEN(EmptyBagToNullFields($relation2)),
FLATTEN(EmptyBagToNullFields($relation3));
}

{noformat}

(we would obviously want to add a test for this, too)



  was:
In our 
[guide|http://datafu.incubator.apache.org/blog/2013/09/04/datafu-1-0.html], a 
macro is described for making a three-way left outer join conveniently. We can 
add this macro to DataFu to make it even easier to use.

The macro's code is as follows:


{noformat}
DEFINE left_outer_join(relation1, key1, relation2, key2, relation3, key3) 
returns joined {
  cogrouped = COGROUP $relation1 BY $key1, $relation2 BY $key2, $relation3 BY 
$key3;
  $joined = FOREACH cogrouped GENERATE
FLATTEN($relation1),
FLATTEN(EmptyBagToNullFields($relation2)),
FLATTEN(EmptyBagToNullFields($relation3));
}

(we would obviously want to add a test for this, too)

{noformat}




> Add left outer join macro described in the DataFu guide
> ---
>
> Key: DATAFU-130
> URL: https://issues.apache.org/jira/browse/DATAFU-130
> Project: DataFu
>  Issue Type: New Feature
>Reporter: Eyal Allweil
>  Labels: macro, newbie
>
> In our 
> [guide|http://datafu.incubator.apache.org/blog/2013/09/04/datafu-1-0.html], a 
> macro is described for making a three-way left outer join conveniently. We 
> can add this macro to DataFu to make it even easier to use.
> The macro's code is as follows:
> {noformat}
> DEFINE left_outer_join(relation1, key1, relation2, key2, relation3, key3) 
> returns joined {
>   cogrouped = COGROUP $relation1 BY $key1, $relation2 BY $key2, $relation3 BY 
> $key3;
>   $joined = FOREACH cogrouped GENERATE
> FLATTEN($relation1),
> FLATTEN(EmptyBagToNullFields($relation2)),
> FLATTEN(EmptyBagToNullFields($relation3));
> }
> {noformat}
> (we would obviously want to add a test for this, too)
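
To make the intended semantics concrete, here is a rough Python sketch of what the macro computes: cogroup the three relations by key, replace an empty right-hand bag with a single all-null tuple (the role EmptyBagToNullFields plays), then flatten. This is an illustration under assumptions, not DataFu code: the function name `left_outer_join3`, the key-extractor arguments, and the explicit `width2`/`width3` parameters (needed because Python tuples carry no schema) are all hypothetical, and Pig's special handling of null keys is not modeled.

```python
from collections import defaultdict
from itertools import product


def left_outer_join3(rel1, key1, rel2, key2, rel3, key3, width2, width3):
    """Three-way left outer join of lists of tuples.

    key1/key2/key3 are functions extracting the join key from a tuple.
    width2/width3 are the number of fields in rel2/rel3 tuples, used to
    build the all-null placeholder tuple.
    """
    # COGROUP: one entry per key, holding a bag from each relation.
    groups = defaultdict(lambda: ([], [], []))
    for t in rel1:
        groups[key1(t)][0].append(t)
    for t in rel2:
        groups[key2(t)][1].append(t)
    for t in rel3:
        groups[key3(t)][2].append(t)

    joined = []
    for bag1, bag2, bag3 in groups.values():
        # Empty right-hand bags become one tuple of nulls, so the left
        # rows survive the flatten. An empty bag1 yields no rows at all.
        bag2 = bag2 or [(None,) * width2]
        bag3 = bag3 or [(None,) * width3]
        # FLATTEN of several bags produces their cross product.
        for t1, t2, t3 in product(bag1, bag2, bag3):
            joined.append(t1 + t2 + t3)
    return joined
```

A test for the macro could follow the same shape: one key present in all relations, one present only on the left, and one present only on the right (which should not appear in the output).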





[jira] [Commented] (DATAFU-12) Implement Lead UDF based on version from SQL

2017-10-08 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-12?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16196058#comment-16196058
 ] 

Eyal Allweil commented on DATAFU-12:


[~matterhayes], anyone, what do you think? I wouldn't "waste" our time on 
something that can already be done in Pig via Hive, and I'd like to close 
Jira issues that are no longer relevant.

> Implement Lead UDF based on version from SQL
> 
>
> Key: DATAFU-12
> URL: https://issues.apache.org/jira/browse/DATAFU-12
> Project: DataFu
>  Issue Type: New Feature
>Reporter: Matthew Hayes
>
> Min Zhou has provided this suggestion ([Issue #88 on 
> GitHub|https://github.com/linkedin/datafu/pull/88]):
> Lead is an analytic function like Oracle's Lead function. It provides access 
> to more than one tuple of a bag at the same time without a self join. Given a 
> bag of tuples returned from a query, LEAD provides access to a tuple at a 
> given physical offset beyond that position. Generates pairs of all items in a 
> bag.
> If you do not specify offset, then its default is 1. Null is returned if the 
> offset goes beyond the scope of the bag.
> Example 1:
> {noformat}
>register ba-pig-0.1.jar
>define Lead datafu.pig.bags.Lead('2');
>-- INPUT: ({(1),(2),(3),(4)})
>data = LOAD 'input' AS (data: bag {T: tuple(v:INT)});
>describe data;
>-- OUTPUT:  ({((1),(2),(3)),((2),(3),(4)),((3),(4),),((4),,)})
>-- OUTPUT SCHEMA: data2: {lead_data: {(elem0: (v: int),elem1: (v: 
> int),elem2: (v: int))}}
>data2 = FOREACH data GENERATE Lead(data);
>describe data2;
>DUMP data2;
> {noformat}
> Example 2:
> {noformat}
>register  ba-pig-0.1.jar
>define Lead datafu.pig.bags.Lead();
>-- INPUT: 
> ({(10,{(1),(2),(3)}),(20,{(4),(5),(6)}),(30,{(7),(8)}),(40,{(9),(10),(11)}),(50,{(12),(13),(14),(15)})})
>data = LOAD 'input' AS (data: bag {T: tuple(v1:INT,B: bag{T: 
> tuple(v2:INT)})});
>--describe data;
>-- OUTPUT: 
> ({((10,{(1),(2),(3)}),(20,{(4),(5),(6)})),((20,{(4),(5),(6)}),(30,{(7),(8)})),((30,{(7),(8)}),(40,{(9),(10),(11)})),((40,{(9),(10),(11)}),(50,{(12),(13),(14),(15)})),((50,{(12),(13),(14),(15)}),)})
>data2 = FOREACH data GENERATE Lead(data);
>--describe data2;
>DUMP data2;
> {noformat}
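
For reference while deciding, the behavior described above can be sketched in a few lines of Python. This is only an illustration of the semantics shown in the two examples, not the proposed UDF: the function name `lead` and the use of a plain list for the bag are assumptions, and `None` stands in for Pig's null.

```python
def lead(bag, offset=1):
    """For each position i, return (bag[i], bag[i+1], ..., bag[i+offset]).

    Positions past the end of the bag are filled with None. The default
    offset is 1, which yields pairs of consecutive items.
    """
    n = len(bag)
    return [
        tuple(bag[i + k] if i + k < n else None for k in range(offset + 1))
        for i in range(n)
    ]
```

With an offset of 2 and the four-element input from Example 1, each output tuple has three slots and the last two tuples are padded with nulls, matching the output listed there.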





[jira] [Updated] (DATAFU-48) Upgrade Guava to 17.0

2017-10-08 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-48?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil updated DATAFU-48:
---
Attachment: DATAFU-48-update-gradle-to-20.0.patch

I checked, and Guava 20.0 is the last version that we can update to without 
getting into a Java version conflict. So this is a patch that updates Guava to 
20.0.

The tests all pass (build plugin, hourglass, and pig) and I ran a simple Pig 
script that uses the generated DataFu pig jar to see that it's still valid.

Let's close this ancient ticket!



> Upgrade Guava to 17.0
> -
>
> Key: DATAFU-48
> URL: https://issues.apache.org/jira/browse/DATAFU-48
> Project: DataFu
>  Issue Type: Improvement
>Reporter: Philip (flip) Kromer
>Assignee: Philip (flip) Kromer
>Priority: Minor
>  Labels: build, dependency, guava, version
> Attachments: 0001-DATAFU-48-Upgrade-gradle-to-17.0.patch, 
> DATAFU-48-update-gradle-to-20.0.patch
>
>
> Specifically motivated by the improvements to hashing library, but also 
> because we're six versions behind at the moment.





[jira] [Commented] (DATAFU-87) Edit distance

2017-10-09 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-87?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16197120#comment-16197120
 ] 

Eyal Allweil commented on DATAFU-87:


On second thought, since this UDF is now available in Hive, and since 
Levenshtein distance is a purely local computation, I'm guessing there's no 
need for a specific DataFu implementation. Shall we close this issue?

Here are some links to the Hive UDF.

https://issues.apache.org/jira/browse/HIVE-9556

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-StringFunctions
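
To illustrate why the computation is local (it needs only the two input strings, with no grouping or shared state), here is a minimal sketch of the standard dynamic-programming algorithm. It is neither the Hive UDF nor the attached patch; the function name `levenshtein` is chosen for illustration.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    # prev[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]
```

Only two rows of the table are kept at a time, so memory use is proportional to the length of the second string.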



> Edit distance
> -
>
> Key: DATAFU-87
> URL: https://issues.apache.org/jira/browse/DATAFU-87
> Project: DataFu
>  Issue Type: New Feature
>Affects Versions: 1.3.0
>Reporter: Joydeep Banerjee
> Attachments: DATAFU-87.patch
>
>
> [This is work-in-progress]
> Given 2 strings, provide a measure of dis-similarity (Levenshtein distance) 
> between them.





[jira] [Created] (DATAFU-131) Update DataFu site to meet graduation requirements

2017-10-10 Thread Eyal Allweil (JIRA)
Eyal Allweil created DATAFU-131:
---

 Summary: Update DataFu site to meet graduation requirements
 Key: DATAFU-131
 URL: https://issues.apache.org/jira/browse/DATAFU-131
 Project: DataFu
  Issue Type: Bug
Reporter: Eyal Allweil


The following issues were raised with the [DataFu web 
site|http://datafu.incubator.apache.org] as part of the [graduation discussion 
on the incubator general mailing 
list|https://mail-archives.apache.org/mod_mbox/incubator-general/201710.mbox/%3CCAOGo0VZN4mLw-eS-Oz8W1hVhS70Y_BwwikJjRi9Sx3n0s8sFMg%40mail.gmail.com%3E]

There's no link to the main ASF website.
There's no LICENSE or Thanks link.
There's no download link.
etc.

The quick start guide pages do have download links, but the primary
link is to Maven rather than the ASF, and there are no instructions as
to how to check sigs or hashes, and no link to the KEYS file that I
could find.

The SHA-512 checksum must have the extension .sha512

http://www.apache.org/dev/release-distribution.html#sigs-and-sums

Also the latest release appears to be 1.3.2 (dated Feb 2017) but the
download links point to 1.3.1.

The older releases (1.3.1 and 1.3.0) should have been deleted from the
release/dist directory by now.

There's no Apache feather logo which is often used as the link to the
main ASF site.
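
As a starting point for the missing verification instructions, here is a hedged sketch of checking a downloaded artifact against a SHA-512 checksum file. The function names are hypothetical, and the sketch assumes the first whitespace-separated token in the `.sha512` file is the lowercase or uppercase hex digest. Checksum files in other layouts (for example, hex split into groups) would need different parsing. It does not cover verifying the PGP signature against the KEYS file, which is a separate step.

```python
import hashlib


def sha512_hex(path, chunk_size=1 << 20):
    """Return the SHA-512 hex digest of the file at path, read in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum_file(artifact_path, checksum_path):
    """Compare the artifact's digest with the one stored in checksum_path.

    Assumption: the digest is the first whitespace-separated token of
    the checksum file.
    """
    with open(checksum_path, "r", encoding="utf-8") as fh:
        tokens = fh.read().split()
    expected = tokens[0].lower() if tokens else ""
    return sha512_hex(artifact_path) == expected
```

The artifact and checksum file paths are whatever the download page ends up linking to; none are assumed here.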





[jira] [Commented] (DATAFU-131) Update DataFu site to meet graduation requirements

2017-10-10 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16199209#comment-16199209
 ] 

Eyal Allweil commented on DATAFU-131:
-

Here's a link to the Apache site guidelines:

https://www.apache.org/foundation/marks/pmcs#navigation

> Update DataFu site to meet graduation requirements
> --
>
> Key: DATAFU-131
> URL: https://issues.apache.org/jira/browse/DATAFU-131
> Project: DataFu
>  Issue Type: Bug
>Reporter: Eyal Allweil
>
> The following issues were raised with the [DataFu web 
> site|http://datafu.incubator.apache.org] as part of the [graduation 
> discussion on the incubator general mailing 
> list|https://mail-archives.apache.org/mod_mbox/incubator-general/201710.mbox/%3CCAOGo0VZN4mLw-eS-Oz8W1hVhS70Y_BwwikJjRi9Sx3n0s8sFMg%40mail.gmail.com%3E]
> There's no link to the main ASF website.
> There's no LICENSE or Thanks link.
> There's no download link.
> etc.
> The quick start guide pages do have download links, but the primary
> link is to Maven rather than the ASF, and there are no instructions as
> to how to check sigs or hashes, and no link to the KEYS file that I
> could find.
> The SHA-512 checksum must have the extension .sha512
> http://www.apache.org/dev/release-distribution.html#sigs-and-sums
> Also the latest release appears to be 1.3.2 (dated Feb 2017) but the
> download links point to 1.3.1.
> The older releases (1.3.1 and 1.3.0) should have been deleted from the
> release/dist directory by now.
> There's no Apache feather logo which is often used as the link to the
> main ASF site.





[jira] [Commented] (DATAFU-48) Upgrade Guava to 17.0

2017-10-19 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-48?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16210769#comment-16210769
 ] 

Eyal Allweil commented on DATAFU-48:


None, actually. Hadoop 1 and 2 are using 11.0.2, like us. Hadoop 3 is [using 
21|https://issues.apache.org/jira/browse/HADOOP-10101].

> Upgrade Guava to 17.0
> -
>
> Key: DATAFU-48
> URL: https://issues.apache.org/jira/browse/DATAFU-48
> Project: DataFu
>  Issue Type: Improvement
>Reporter: Philip (flip) Kromer
>Assignee: Philip (flip) Kromer
>Priority: Minor
>  Labels: build, dependency, guava, version
> Attachments: 0001-DATAFU-48-Upgrade-gradle-to-17.0.patch, 
> DATAFU-48-update-gradle-to-20.0.patch
>
>
> Specifically motivated by the improvements to hashing library, but also 
> because we're six versions behind at the moment.





[jira] [Commented] (DATAFU-32) Hourglass concrete jobs should have getters and setters for output name and namespace

2017-10-19 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-32?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16210941#comment-16210941
 ] 

Eyal Allweil commented on DATAFU-32:


Is this still relevant? If so, I'll open a [Help Wanted 
task|https://helpwanted.apache.org/] for it.

> Hourglass concrete jobs should have getters and setters for output name and 
> namespace
> -
>
> Key: DATAFU-32
> URL: https://issues.apache.org/jira/browse/DATAFU-32
> Project: DataFu
>  Issue Type: Improvement
>Reporter: Matthew Hayes
>Assignee: Matthew Hayes
>
> With the abstract versions you can override getOutputSchemaName() and 
> getOutputSchemaNamespace().  But the concrete versions don't expose setters, 
> so you have to extend the class to override the defaults.





[jira] [Commented] (DATAFU-118) Automatically run rat task when running assemble

2017-10-19 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16211505#comment-16211505
 ] 

Eyal Allweil commented on DATAFU-118:
-

(because we have a patch that seems to work on a newer Gradle version linked in 
the review board)

> Automatically run rat task when running assemble
> 
>
> Key: DATAFU-118
> URL: https://issues.apache.org/jira/browse/DATAFU-118
> Project: DataFu
>  Issue Type: Improvement
>Reporter: Matthew Hayes
>Priority: Minor
>
> The rat task checks that our files have the right headers.  We don't 
> automatically run it for assemble so it isn't easy for new contributors to 
> catch issues.





[jira] [Updated] (DATAFU-125) Upgrade Gradle to v4 or later

2017-10-20 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil updated DATAFU-125:

Attachment: DATAFU-125.patch

This updates Gradle to 3.5.1. It seems to me that gradle-autojar doesn't work 
in Gradle 4, which is why I settled for an older version.

While running _gradlew clean assemble_ I get this warning message:

{{:datafu-pig:jarWithDependencies
Manifest.writeTo(Writer) has been deprecated and is scheduled to be removed in 
Gradle 4.0. Please use Manifest.writeTo(Object) instead.
}}

And the build indeed fails in Gradle 4.

However, this update is enough to let us merge DATAFU-118, and it is a far 
newer version. I ran _assemble_ and _test_ on it; I still want to run a Pig 
script using the assembled jar but I'll do that next week.

What other tasks do we want to check to see that this fix is valid? Or do we 
want to try to get Gradle 4 working?

> Upgrade Gradle to v4 or later
> -
>
> Key: DATAFU-125
> URL: https://issues.apache.org/jira/browse/DATAFU-125
> Project: DataFu
>  Issue Type: Task
>Reporter: Matthew Hayes
> Attachments: DATAFU-125.patch
>
>
> We should update to the most recent version of Gradle.  We're currently using 
> 2.4.





[jira] [Commented] (DATAFU-125) Upgrade Gradle to v4 or later

2017-10-21 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16214208#comment-16214208
 ] 

Eyal Allweil commented on DATAFU-125:
-

_check_ and _clean release_ run and return SUCCESS. Are there any special files 
I should check that are the result of the _release_ task?

I also ran a script on the packaged jar (the regular one, not core or the 
jarjar) and it ran fine.

> Upgrade Gradle to v4 or later
> -
>
> Key: DATAFU-125
> URL: https://issues.apache.org/jira/browse/DATAFU-125
> Project: DataFu
>  Issue Type: Task
>Reporter: Matthew Hayes
> Attachments: DATAFU-125.patch
>
>
> We should update to the most recent version of Gradle.  We're currently using 
> 2.4.





[jira] [Commented] (DATAFU-48) Upgrade Guava to 17.0

2017-10-21 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-48?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16214212#comment-16214212
 ] 

Eyal Allweil commented on DATAFU-48:


As an additional check, I ran a Pig script which uses 
_SimpleRandomSampleWithReplacementVote_ (which uses Guava) to see that it still 
runs correctly.

> Upgrade Guava to 17.0
> -
>
> Key: DATAFU-48
> URL: https://issues.apache.org/jira/browse/DATAFU-48
> Project: DataFu
>  Issue Type: Improvement
>Reporter: Philip (flip) Kromer
>Assignee: Philip (flip) Kromer
>Priority: Minor
>  Labels: build, dependency, guava, version
> Attachments: 0001-DATAFU-48-Upgrade-gradle-to-17.0.patch, 
> DATAFU-48-update-gradle-to-20.0.patch
>
>
> Specifically motivated by the improvements to hashing library, but also 
> because we're six versions behind at the moment.





[jira] [Commented] (DATAFU-17) Improve testing of randomized functions

2017-10-22 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-17?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16214398#comment-16214398
 ] 

Eyal Allweil commented on DATAFU-17:


I think we can close this, just as we closed 
[DATAFU-28|https://issues.apache.org/jira/browse/DATAFU-28]. If all the tests 
take less than twenty minutes now I don't think it's worth making an effort to 
minimize the randomized functions.

> Improve testing of randomized functions
> ---
>
> Key: DATAFU-17
> URL: https://issues.apache.org/jira/browse/DATAFU-17
> Project: DataFu
>  Issue Type: Improvement
>Reporter: Will Vaughan
>
> We have a large number of UDFs with a random component that are difficult and 
> often slow to test.  We should improve our testing standards and capabilities 
> for this class of functions.





[jira] [Assigned] (DATAFU-131) Update DataFu site to meet graduation requirements

2017-10-26 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil reassigned DATAFU-131:
---

Assignee: Matthew Hayes

> Update DataFu site to meet graduation requirements
> --
>
> Key: DATAFU-131
> URL: https://issues.apache.org/jira/browse/DATAFU-131
> Project: DataFu
>  Issue Type: Bug
>Reporter: Eyal Allweil
>Assignee: Matthew Hayes
> Attachments: DATAFU-131.patch, Screen Shot 2017-10-25 at 7.21.09 
> PM.png
>
>
> The following issues were raised with the [DataFu web 
> site|http://datafu.incubator.apache.org] as part of the [graduation 
> discussion on the incubator general mailing 
> list|https://mail-archives.apache.org/mod_mbox/incubator-general/201710.mbox/%3CCAOGo0VZN4mLw-eS-Oz8W1hVhS70Y_BwwikJjRi9Sx3n0s8sFMg%40mail.gmail.com%3E]
> There's no link to the main ASF website.
> There's no LICENSE or Thanks link.
> There's no download link.
> etc.
> The quick start guide pages do have download links, but the primary
> link is to Maven rather than the ASF, and there are no instructions as
> to how to check sigs or hashes, and no link to the KEYS file that I
> could find.
> The SHA-512 checksum must have the extension .sha512
> http://www.apache.org/dev/release-distribution.html#sigs-and-sums
> Also the latest release appears to be 1.3.2 (dated Feb 2017) but the
> download links point to 1.3.1.
> The older releases (1.3.1 and 1.3.0) should have been deleted from the
> release/dist directory by now.
> There's no Apache feather logo which is often used as the link to the
> main ASF site.





[jira] [Commented] (DATAFU-125) Upgrade Gradle to v4 or later

2017-10-26 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16220371#comment-16220371
 ] 

Eyal Allweil commented on DATAFU-125:
-

When I build with _./gradlew clean release -Prelease=true_, I don't get a zip 
file. In fact, I don't get jars either - I need to use _assemble_ to make them 
(both on master with and without upgrading Gradle). Am I using the wrong 
command? Does it work for you, [~matterhayes]?

> Upgrade Gradle to v4 or later
> -
>
> Key: DATAFU-125
> URL: https://issues.apache.org/jira/browse/DATAFU-125
> Project: DataFu
>  Issue Type: Task
>Reporter: Matthew Hayes
>Assignee: Eyal Allweil
> Attachments: DATAFU-125.patch
>
>
> We should update to the most recent version of Gradle.  We're currently using 
> 2.4.





[jira] [Updated] (DATAFU-48) Upgrade Guava to 20.0

2017-10-29 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-48?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil updated DATAFU-48:
---
Summary: Upgrade Guava to 20.0  (was: Upgrade Guava to 17.0)

> Upgrade Guava to 20.0
> -
>
> Key: DATAFU-48
> URL: https://issues.apache.org/jira/browse/DATAFU-48
> Project: DataFu
>  Issue Type: Improvement
>Reporter: Philip (flip) Kromer
>Assignee: Philip (flip) Kromer
>Priority: Minor
>  Labels: build, dependency, guava, version
> Attachments: 0001-DATAFU-48-Upgrade-gradle-to-17.0.patch, 
> DATAFU-48-update-gradle-to-20.0.patch
>
>
> Specifically motivated by the improvements to hashing library, but also 
> because we're six versions behind at the moment.





[jira] [Resolved] (DATAFU-48) Upgrade Guava to 20.0

2017-10-29 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-48?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil resolved DATAFU-48.

   Resolution: Fixed
 Assignee: Eyal Allweil  (was: Philip (flip) Kromer)
Fix Version/s: 1.3.3

Merged

> Upgrade Guava to 20.0
> -
>
> Key: DATAFU-48
> URL: https://issues.apache.org/jira/browse/DATAFU-48
> Project: DataFu
>  Issue Type: Improvement
>Reporter: Philip (flip) Kromer
>Assignee: Eyal Allweil
>Priority: Minor
>  Labels: build, dependency, guava, version
> Fix For: 1.3.3
>
> Attachments: 0001-DATAFU-48-Upgrade-gradle-to-17.0.patch, 
> DATAFU-48-update-gradle-to-20.0.patch
>
>
> Specifically motivated by the improvements to hashing library, but also 
> because we're six versions behind at the moment.




