[jira] [Commented] (DATAFU-16) weighted reservoir sampling with exponential jumps UDF

2016-10-18 Thread Matthew Hayes (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-16?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15587387#comment-15587387
 ] 

Matthew Hayes commented on DATAFU-16:
-

I don't think the exponential jump version got added.

> weighted reservoir sampling with exponential jumps UDF
> --
>
> Key: DATAFU-16
> URL: https://issues.apache.org/jira/browse/DATAFU-16
> Project: DataFu
> Issue Type: New Feature
> Environment: Mac, Linux, pig-0.11
> Reporter: jian wang
> Assignee: jian wang
> Priority: Minor
> Attachments: ScoredExpJmpReservoir.java, ScoredReservoir.java,
> WeightedSamplingCorrectnessTests.java
>
>
> Create a weightedReservoirSampleWithExpJump UDF to implement the weighted
> reservoir sampling algorithm with exponential jumps. Investigation is tracked
> in https://github.com/linkedin/datafu/issues/80. This task is part of an
> experiment with different weighted sampling algorithms.
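
For readers unfamiliar with the technique named in this issue, here is a
minimal Python sketch of weighted reservoir sampling with exponential jumps,
along the lines of the A-ExpJ scheme by Efraimidis and Spirakis. It is not the
attached ScoredExpJmpReservoir.java, which is not reproduced in this thread.
The function names, the (value, weight) input shape, and the choice to store
keys as logarithms are all assumptions made for illustration.

```python
import heapq
import math
import random


def _uniform_open(rng):
    """Draw from the open interval (0, 1) so that log() is always defined."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return u


def _next_jump(rng, log_threshold):
    """Total weight to skip before the next reservoir replacement."""
    if log_threshold >= 0.0:  # smallest key is already 1.0; nothing can beat it
        return math.inf
    return math.log(_uniform_open(rng)) / log_threshold


def weighted_sample_exp_jump(items, k, rng=None):
    """Sample up to k values without replacement, favoring larger weights.

    `items` is an iterable of (value, weight) pairs with weight > 0.
    """
    if k <= 0:
        return []
    if rng is None:
        rng = random.Random()
    heap = []  # min-heap of (log_key, arrival_index, value)
    jump = math.inf
    for index, (value, weight) in enumerate(items):
        if weight <= 0:
            raise ValueError("weights must be positive")
        if len(heap) < k:
            # Fill phase: key = u ** (1 / weight), stored as its logarithm.
            log_key = math.log(_uniform_open(rng)) / weight
            heapq.heappush(heap, (log_key, index, value))
            if len(heap) == k:
                jump = _next_jump(rng, heap[0][0])
            continue
        jump -= weight
        if jump <= 0:
            # This item replaces the current minimum. Its key is drawn
            # from (threshold ** weight, 1) before taking the 1/weight power.
            low = math.exp(weight * heap[0][0])
            r2 = low + (1.0 - low) * _uniform_open(rng)
            heapq.heapreplace(heap, (math.log(r2) / weight, index, value))
            jump = _next_jump(rng, heap[0][0])
    return [value for _, _, value in heap]


# Usage: draw 3 of 6 values; "f" carries most of the total weight.
data = [("a", 1.0), ("b", 1.0), ("c", 1.0), ("d", 1.0), ("e", 1.0), ("f", 20.0)]
sample = weighted_sample_exp_jump(data, 3, random.Random(0))
```

The point of the jump is that, once the reservoir is full, random numbers are
drawn only when an item is actually inserted, rather than once per input item.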



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (DATAFU-45) RFE: CartesianProduct

2016-10-18 Thread Matthew Hayes (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-45?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matthew Hayes closed DATAFU-45.
---
Resolution: Won't Fix

I'm going to go ahead and close this then.

> RFE: CartesianProduct
> -
>
> Key: DATAFU-45
> URL: https://issues.apache.org/jira/browse/DATAFU-45
> Project: DataFu
> Issue Type: New Feature
> Reporter: Sam Steingold
>
> Given two bags, produce their [Cartesian 
> product|http://en.wikipedia.org/wiki/Cartesian_product]:
> {code}
> B1: bag{T1}
> B2: bag{T2}
> CartesianProduct(B1,B2): bag{(T1,T2)}
> {code}
> Use case:
> {code}
> toks = TOKENIZE((charray)$0,',');
> kwds = CartesianProduct(toks, {1.0/(double)SIZE(toks)});
> {code}
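
To make the requested semantics concrete, here is a small Python sketch of
the same operation on in-memory lists standing in for Pig bags. The function
name cartesian_product and the sample input are hypothetical; this is not a
DataFu or Pig API. (In Pig itself, using FLATTEN on two bags in a single
FOREACH is commonly described as yielding their cross product, which may be
the "plain Pig" route mentioned elsewhere in this thread.)

```python
from itertools import product


def cartesian_product(bag1, bag2):
    """Return every (t1, t2) pair with t1 from bag1 and t2 from bag2."""
    return list(product(bag1, bag2))


# Mirrors the use case above: pair each token with the weight 1 / SIZE(toks).
toks = "a,b,c,d".split(",")
kwds = cartesian_product(toks, [1.0 / len(toks)])
# kwds == [("a", 0.25), ("b", 0.25), ("c", 0.25), ("d", 0.25)]
```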





[jira] [Commented] (DATAFU-45) RFE: CartesianProduct

2016-10-18 Thread Sam Steingold (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-45?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15587301#comment-15587301
 ] 

Sam Steingold commented on DATAFU-45:
-

Hi Eyal,
This was resolved by moving on to SPARK.
Thanks for your attention.




Re: What's the latest with the release and graduation?

2016-10-18 Thread Matthew Hayes
I've been meaning to sort out what is needed for the NOTICE and LICENSE
files in the binary artifact release (based on Justin's feedback) before
doing another release, but I haven't gotten to it yet.

Basically, datafu-pig packages its dependencies within the JAR [1] so that
the datafu-pig artifact is easy to use (e.g. you don't need to find all the
dependencies and load them).  The namespaces of these dependencies are
changed in case different versions of the same libraries are already loaded
elsewhere.  They are all APLv2, so including them shouldn't be an issue.

However, our binary artifact for datafu-pig does not include NOTICE or
LICENSE files; these apparently should differ from the ones we include in
the source release, because the source release does not include these
libraries.  Justin shared [2] as guidance on what to do.  According to that
page, we need to do some analysis to check whether the libraries do any
bundling of their own, and also check for NOTICE files so we can include
the relevant portions in our NOTICE.  I wanted to tackle all this before
doing another release.  If you'd like to help with the analysis, please let
me know :)

-Matt

[1] Autojarred libraries in datafu-pig/build.gradle:

  autojarred "it.unimi.dsi:fastutil:$fastutilVersion"
  autojarred "org.apache.commons:commons-math:$commonsMathVersion"
  autojarred "com.clearspring.analytics:stream:$streamVersion"
  autojarred "com.google.guava:guava:$guavaVersion"
  autojarred "org.apache.opennlp:opennlp-tools:$openNlpVersion"
  autojarred "org.apache.opennlp:opennlp-uima:$openNlpVersion"
  autojarred "org.apache.opennlp:opennlp-maxent:$openNlpMaxEntVersion"

[2] http://www.apache.org/dev/licensing-howto.html#alv2-dep

On Tue, Oct 18, 2016 at 4:49 PM, Roman Shaposhnik wrote:

> Hi!
>
> I remember a month or so ago there was a push
> for both but then it kind of went quiet. What's the
> latest on that? I'd feel really great if I can help
> kick you guys out from the Incubator into the TLP ;-)
>
> Speaking of which -- anything I can do to help with
> the release?
>
> Thanks,
> Roman.
>


What's the latest with the release and graduation?

2016-10-18 Thread Roman Shaposhnik
Hi!

I remember a month or so ago there was a push
for both but then it kind of went quiet. What's the
latest on that? I'd feel really great if I can help
kick you guys out from the Incubator into the TLP ;-)

Speaking of which -- anything I can do to help with
the release?

Thanks,
Roman.


New features for DataFu - triage

2016-10-18 Thread Eyal Allweil
Looking at DataFu's JIRA, there seem to be quite a few UDFs in various
states of acceptance into the project. I tried to categorize them, skipping
those that seemed problematic for some reason or didn't have any patch
request attached:


*No response from the project:*

New UDF for Histogram / Frequency counting

Edit distance

Create IncrementalAvroStorage UDF for incrementally processing date
partitioned data

simple hash for near duplicate detection

Is there any existing process for deciding how to accept new content? I
don't know if the submitters are still around, but we should probably try
to give some sort of response.


*In the process of review:*

Add datafu.text.ToJson UDF to serialize any relation/field as a JSON String

Add DataFu MR project (obviously not a UDF)

UDF's to handle map type

NCDG

Aho-Corasick

New UDF - TupleDiff

Some of these seem close to being finished. I think I'll take a look at the
first one.

Regards,
Eyal


[jira] [Commented] (DATAFU-16) weighted reservoir sampling with exponential jumps UDF

2016-10-18 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-16?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15586085#comment-15586085
 ] 

Eyal Allweil commented on DATAFU-16:


It looks like this got added - can this issue be closed?



[jira] [Commented] (DATAFU-45) RFE: CartesianProduct

2016-10-18 Thread Eyal Allweil (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-45?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15584898#comment-15584898
 ] 

Eyal Allweil commented on DATAFU-45:


Hi Sam,

Did you ever solve this? I agree with Matthew that this should be doable via 
plain Pig - if not, I'd open a bug there.



[jira] [Updated] (DATAFU-65) Aho-Corasick Pig UDF

2016-10-18 Thread Eyal Allweil (JIRA)

 [ 
https://issues.apache.org/jira/browse/DATAFU-65?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eyal Allweil updated DATAFU-65:
---
Issue Type: New Feature  (was: Bug)

> Aho-Corasick Pig UDF
> 
>
> Key: DATAFU-65
> URL: https://issues.apache.org/jira/browse/DATAFU-65
> Project: DataFu
> Issue Type: New Feature
> Affects Versions: 1.3.0
> Environment: Drought
> Reporter: Russell Jurney
> Attachments: DATAFU-65.diff
>
> Original Estimate: 8h
> Remaining Estimate: 8h
>
> I need to use the Aho-Corasick algorithm for efficient substring matching. A
> Java implementation is available at
> https://github.com/robert-bor/aho-corasick and on Maven Central:
> http://maven-repository.com/artifact/org.arabidopsis.ahocorasick/ahocorasick/2.x
> A Pig UDF would be very helpful to me.
> How do I add a Maven dependency with Gradle?


