Re: Should spark-ec2 get its own repo?

2015-07-21 Thread Shivaram Venkataraman
There is technically no PMC for the spark-ec2 project (I guess we are kind
of establishing one right now). I haven't heard anything from the Spark PMC
on the dev list that might suggest a need for a vote so far. I will send
another round of email notification to the dev list when we have a JIRA /
PR that actually moves the scripts (right now the only thing that changed
is the location of some scripts in mesos/ to amplab/).

Thanks
Shivaram

On Mon, Jul 20, 2015 at 12:55 PM, Mridul Muralidharan mri...@gmail.com
wrote:

 Might be a good idea to get the PMC's of both projects to sign off to
 prevent future issues with apache.

 Regards,
 Mridul

 On Mon, Jul 20, 2015 at 12:01 PM, Shivaram Venkataraman
 shiva...@eecs.berkeley.edu wrote:
  I've created https://github.com/amplab/spark-ec2 and added an initial
 set of
  committers. Note that this is not a fork of the existing
  github.com/mesos/spark-ec2 and users will need to fork from here. This
 is
  mostly to avoid the base-fork in pull requests being set incorrectly etc.
 
  I'll be migrating some PRs / closing them in the old repo and will also
  update the README in that repo.
 
  Thanks
  Shivaram
 
  On Fri, Jul 17, 2015 at 3:00 PM, Sean Owen so...@cloudera.com wrote:
 
  On Fri, Jul 17, 2015 at 6:58 PM, Shivaram Venkataraman
  shiva...@eecs.berkeley.edu wrote:
   I am not sure why the ASF JIRA can be only used to track one set of
   artifacts that are packaged and released together. I agree that
 marking
   a
   fix version as 1.5 for a change in another repo doesn't make a lot of
   sense,
   but we could just not use fix versions for the EC2 issues ?
 
  *shrug* it just seems harder and less natural to use ASF JIRA. What's
  the benefit? I agree it's not a big deal either way but it's a small
  part of the problem we're solving in the first place. I suspect that
  one way or the other, there would be issues filed both places, so this
  probably isn't worth debating.
 
 
   My concerns are less about it being pushed out etc. For better or
 worse
   we
   have had EC2 scripts be a part of the Spark distribution from a very
   early
   stage (from version 0.5.0 if my git history reading is correct).  So
   users
   will assume that any error with EC2 scripts belong to the Spark
 project.
   In
   addition almost all the contributions to the EC2 scripts come from
 Spark
   developers and so keeping the issues in the same mailing list / JIRA
   seems
   natural. This I guess again relates to the question of managing issues
   for
   code that isn't part of the Spark release artifact.
 
  Yeah good question -- Github doesn't give you a mailing list. I think
  dev@ would still be where it's discussed which is ... again 'part of
  the problem' but as you say, probably beneficial. It's a pretty low
  traffic topic anyway.
 
 
   I'll create the amplab/spark-ec2 repo over the next couple of days
   unless
   there are more comments on this thread. This will at least alleviate
   some of
   the naming confusion over using a repository in mesos and I'll give
   Sean,
   Nick, Matthew commit access to it. I am still not convinced about
 moving
   the
   issues over though.
 
  I won't move the issues. Maybe time tells whether one approach is
  better, or that it just doesn't matter.
 
  However it'd be a great opportunity to review and clear stale EC2
 issues.
 
 



Re: countByValue on dataframe with multiple columns

2015-07-21 Thread Ted Malaska
100% I would love to do it.  Who is a good person to review the design with?
All I need is a quick chat about the design and approach and I'll create
the JIRA and push a patch.

Ted Malaska

On Tue, Jul 21, 2015 at 10:19 AM, Olivier Girardot 
o.girar...@lateral-thoughts.com wrote:

 Hi Ted,
 The TopNList would be great to see directly in the Dataframe API and my
 wish would be to be able to apply it on multiple columns at the same time
 and get all these statistics.
 the .describe() function is close to what we want to achieve, maybe we
 could try to enrich its output.
 Anyway, even as a spark-package, if you could package your code for
 Dataframes, that would be great.

 Regards,

 Olivier.

 2015-07-21 15:08 GMT+02:00 Jonathan Winandy jonathan.wina...@gmail.com:

 Ha ok !

 Then generic part would have that signature :

 def countColsByValue(df:Dataframe):Map[String /* colname */,Dataframe]


 +1 for more work (blog / api) for data quality checks.

 Cheers,
 Jonathan


 TopCMSParams and some other monoids from Algebird are really cool for
 that :

 https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/CountMinSketch.scala#L590


 On 21 July 2015 at 13:40, Ted Malaska ted.mala...@cloudera.com wrote:

 I'm guessing you want something like what I put in this blog post.


 http://blog.cloudera.com/blog/2015/07/how-to-do-data-quality-checks-using-apache-spark-dataframes/

 This is a very common use case.  If there is a +1 I would love to add it
 to dataframes.

 Let me know
 Ted Malaska

 On Tue, Jul 21, 2015 at 7:24 AM, Olivier Girardot 
 o.girar...@lateral-thoughts.com wrote:

 Yop,
 actually the generic part does not work, the countByValue on one column
 gives you the count for each value seen in the column.
 I would like a generic (multi-column) countByValue to give me the same
 kind of output for each column, not considering each n-uples of each column
 value as the key (which is what the groupBy is doing by default).

 Regards,

 Olivier

 2015-07-20 14:18 GMT+02:00 Jonathan Winandy jonathan.wina...@gmail.com
 :

 Ahoy !

 Maybe you can get countByValue by using sql.GroupedData :

 // some DF
 val df: DataFrame = sqlContext.createDataFrame(
   sc.parallelize(List("A", "B", "B", "A")).map(Row.apply(_)),
   StructType(List(StructField("n", StringType))))

 df.groupBy("n").count().show()

 // generic
 def countByValueDf(df: DataFrame) = {
   val (h :: r) = df.columns.toList
   df.groupBy(h, r: _*).count()
 }

 countByValueDf(df).show()


 Cheers,
 Jon

 On 20 July 2015 at 11:28, Olivier Girardot 
 o.girar...@lateral-thoughts.com wrote:

 Hi,
 Is there any plan to add the countByValue function to Spark SQL
 Dataframe ?
 Even
 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala#L78
 is using the RDD part right now, but for ML purposes, being able to get 
 the
 most frequent categorical value on multiple columns would be very useful.


 Regards,


 --
 *Olivier Girardot* | Associé
 o.girar...@lateral-thoughts.com
 +33 6 24 09 17 94





 --
 *Olivier Girardot* | Associé
 o.girar...@lateral-thoughts.com
 +33 6 24 09 17 94






 --
 *Olivier Girardot* | Associé
 o.girar...@lateral-thoughts.com
 +33 6 24 09 17 94



Re: countByValue on dataframe with multiple columns

2015-07-21 Thread Ted Malaska
I added the following jira

https://issues.apache.org/jira/browse/SPARK-9237

Please help me get it assigned to me. Thanks.

Ted Malaska

On Tue, Jul 21, 2015 at 7:53 PM, Ted Malaska ted.mala...@cloudera.com
wrote:

 Cool I will make a jira after I check in to my hotel.  And try to get a
 patch early next week.
 On Jul 21, 2015 5:15 PM, Olivier Girardot 
 o.girar...@lateral-thoughts.com wrote:

 yes and freqItems does not give you an ordered count (right ?) + the
 threshold makes it difficult to calibrate it + we noticed some strange
 behaviour when testing it on small datasets.

 2015-07-21 20:30 GMT+02:00 Ted Malaska ted.mala...@cloudera.com:

 Look at the implementation for frequently items.  It is a different from
 true count.
 On Jul 21, 2015 1:19 PM, Reynold Xin r...@databricks.com wrote:

 Is this just frequent items?


 https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L97



 On Tue, Jul 21, 2015 at 7:39 AM, Ted Malaska ted.mala...@cloudera.com
 wrote:

 100% I would love to do it.  Who a good person to review the design
 with.  All I need is a quick chat about the design and approach and I'll
 create the jira and push a patch.

 Ted Malaska

 On Tue, Jul 21, 2015 at 10:19 AM, Olivier Girardot 
 o.girar...@lateral-thoughts.com wrote:

 Hi Ted,
 The TopNList would be great to see directly in the Dataframe API and
 my wish would be to be able to apply it on multiple columns at the same
 time and get all these statistics.
 the .describe() function is close to what we want to achieve, maybe
 we could try to enrich its output.
 Anyway, even as a spark-package, if you could package your code for
 Dataframes, that would be great.

 Regards,

 Olivier.

 2015-07-21 15:08 GMT+02:00 Jonathan Winandy 
 jonathan.wina...@gmail.com:

 Ha ok !

 Then generic part would have that signature :

 def countColsByValue(df:Dataframe):Map[String /* colname
 */,Dataframe]


 +1 for more work (blog / api) for data quality checks.

 Cheers,
 Jonathan


 TopCMSParams and some other monoids from Algebird are really cool
 for that :

 https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/CountMinSketch.scala#L590


 On 21 July 2015 at 13:40, Ted Malaska ted.mala...@cloudera.com
 wrote:

 I'm guessing you want something like what I put in this blog post.


 http://blog.cloudera.com/blog/2015/07/how-to-do-data-quality-checks-using-apache-spark-dataframes/

 This is a very common use case.  If there is a +1 I would love to
 add it to dataframes.

 Let me know
 Ted Malaska

 On Tue, Jul 21, 2015 at 7:24 AM, Olivier Girardot 
 o.girar...@lateral-thoughts.com wrote:

 Yop,
 actually the generic part does not work, the countByValue on one
 column gives you the count for each value seen in the column.
 I would like a generic (multi-column) countByValue to give me the
 same kind of output for each column, not considering each n-uples of 
 each
 column value as the key (which is what the groupBy is doing by 
 default).

 Regards,

 Olivier

 2015-07-20 14:18 GMT+02:00 Jonathan Winandy 
 jonathan.wina...@gmail.com:

 Ahoy !

 Maybe you can get countByValue by using sql.GroupedData :

 // some DF
 val df: DataFrame = sqlContext.createDataFrame(
   sc.parallelize(List("A", "B", "B", "A")).map(Row.apply(_)),
   StructType(List(StructField("n", StringType))))

 df.groupBy("n").count().show()

 // generic
 def countByValueDf(df: DataFrame) = {
   val (h :: r) = df.columns.toList
   df.groupBy(h, r: _*).count()
 }

 countByValueDf(df).show()


 Cheers,
 Jon

 On 20 July 2015 at 11:28, Olivier Girardot 
 o.girar...@lateral-thoughts.com wrote:

 Hi,
 Is there any plan to add the countByValue function to Spark SQL
 Dataframe ?
 Even
 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala#L78
 is using the RDD part right now, but for ML purposes, being able to 
 get the
 most frequent categorical value on multiple columns would be very 
 useful.


 Regards,


 --
 *Olivier Girardot* | Associé
 o.girar...@lateral-thoughts.com
 +33 6 24 09 17 94





 --
 *Olivier Girardot* | Associé
 o.girar...@lateral-thoughts.com
 +33 6 24 09 17 94






 --
 *Olivier Girardot* | Associé
 o.girar...@lateral-thoughts.com
 +33 6 24 09 17 94






 --
 *Olivier Girardot* | Associé
 o.girar...@lateral-thoughts.com
 +33 6 24 09 17 94




Re: Should spark-ec2 get its own repo?

2015-07-21 Thread Mridul Muralidharan
If I am not wrong, since the code was hosted within the mesos project
repo, I assume (at least part of it) is owned by the Mesos project and so
its PMC?

- Mridul

On Tue, Jul 21, 2015 at 9:22 AM, Shivaram Venkataraman
shiva...@eecs.berkeley.edu wrote:
 There is technically no PMC for the spark-ec2 project (I guess we are kind
 of establishing one right now). I haven't heard anything from the Spark PMC
 on the dev list that might suggest a need for a vote so far. I will send
 another round of email notification to the dev list when we have a JIRA / PR
 that actually moves the scripts (right now the only thing that changed is
 the location of some scripts in mesos/ to amplab/).

 Thanks
 Shivaram

 On Mon, Jul 20, 2015 at 12:55 PM, Mridul Muralidharan mri...@gmail.com
 wrote:

 Might be a good idea to get the PMC's of both projects to sign off to
 prevent future issues with apache.

 Regards,
 Mridul

 On Mon, Jul 20, 2015 at 12:01 PM, Shivaram Venkataraman
 shiva...@eecs.berkeley.edu wrote:
  I've created https://github.com/amplab/spark-ec2 and added an initial
  set of
  committers. Note that this is not a fork of the existing
  github.com/mesos/spark-ec2 and users will need to fork from here. This
  is
  mostly to avoid the base-fork in pull requests being set incorrectly
  etc.
 
  I'll be migrating some PRs / closing them in the old repo and will also
  update the README in that repo.
 
  Thanks
  Shivaram
 
  On Fri, Jul 17, 2015 at 3:00 PM, Sean Owen so...@cloudera.com wrote:
 
  On Fri, Jul 17, 2015 at 6:58 PM, Shivaram Venkataraman
  shiva...@eecs.berkeley.edu wrote:
   I am not sure why the ASF JIRA can be only used to track one set of
   artifacts that are packaged and released together. I agree that
   marking
   a
   fix version as 1.5 for a change in another repo doesn't make a lot of
   sense,
   but we could just not use fix versions for the EC2 issues ?
 
  *shrug* it just seems harder and less natural to use ASF JIRA. What's
  the benefit? I agree it's not a big deal either way but it's a small
  part of the problem we're solving in the first place. I suspect that
  one way or the other, there would be issues filed both places, so this
  probably isn't worth debating.
 
 
   My concerns are less about it being pushed out etc. For better or
   worse
   we
   have had EC2 scripts be a part of the Spark distribution from a very
   early
   stage (from version 0.5.0 if my git history reading is correct).  So
   users
   will assume that any error with EC2 scripts belong to the Spark
   project.
   In
   addition almost all the contributions to the EC2 scripts come from
   Spark
   developers and so keeping the issues in the same mailing list / JIRA
   seems
   natural. This I guess again relates to the question of managing
   issues
   for
   code that isn't part of the Spark release artifact.
 
  Yeah good question -- Github doesn't give you a mailing list. I think
  dev@ would still be where it's discussed which is ... again 'part of
  the problem' but as you say, probably beneficial. It's a pretty low
  traffic topic anyway.
 
 
   I'll create the amplab/spark-ec2 repo over the next couple of days
   unless
   there are more comments on this thread. This will at least alleviate
   some of
   the naming confusion over using a repository in mesos and I'll give
   Sean,
   Nick, Matthew commit access to it. I am still not convinced about
   moving
   the
   issues over though.
 
  I won't move the issues. Maybe time tells whether one approach is
  better, or that it just doesn't matter.
 
  However it'd be a great opportunity to review and clear stale EC2
  issues.
 
 






Re: Make off-heap store pluggable

2015-07-21 Thread Alexey Goncharuk
2015-07-20 21:32 GMT-07:00 Prashant Sharma scrapco...@gmail.com:

 +1 Looks like a nice idea(I do not see any harm). Would you like to work
 on the patch to support it ?

 Prashant Sharma


Yes, I would like to contribute to it once we clarify the appropriate path.

--Alexey




 On Tue, Jul 21, 2015 at 2:46 AM, Alexey Goncharuk 
 alexey.goncha...@gmail.com wrote:

 Hello Spark community,

 I was looking through the code in order to understand better how RDD is
 persisted to Tachyon off-heap filesystem. It looks like that the Tachyon
 filesystem is hard-coded and there is no way to switch to another in-memory
 filesystem. I think it would be great if the implementation of the
 BlockManager and BlockStore would be able to plug in another filesystem.

 For example, Apache Ignite also has an implementation of in-memory
 filesystem which can store data in on-heap and off-heap formats. It would
 be great if it could integrate with Spark.

 I have filed a ticket in Jira:
 https://issues.apache.org/jira/browse/SPARK-9203

 If it makes sense, I will be happy to contribute to it.

 Thoughts?

 -Alexey (Apache Ignite PMC)





Re: Should spark-ec2 get its own repo?

2015-07-21 Thread Shivaram Venkataraman
That's part of the confusion we are trying to fix here -- the repository
used to live in the mesos GitHub account but was never a part of the Apache
Mesos project. It was a remnant of Spark from when Spark used to live
at github.com/mesos/spark.

Shivaram

On Tue, Jul 21, 2015 at 11:03 AM, Mridul Muralidharan mri...@gmail.com
wrote:

 If I am not wrong, since the code was hosted within mesos project
 repo, I assume (atleast part of it) is owned by mesos project and so
 its PMC ?

 - Mridul

 On Tue, Jul 21, 2015 at 9:22 AM, Shivaram Venkataraman
 shiva...@eecs.berkeley.edu wrote:
  There is technically no PMC for the spark-ec2 project (I guess we are
 kind
  of establishing one right now). I haven't heard anything from the Spark
 PMC
  on the dev list that might suggest a need for a vote so far. I will send
  another round of email notification to the dev list when we have a JIRA
 / PR
  that actually moves the scripts (right now the only thing that changed is
  the location of some scripts in mesos/ to amplab/).
 
  Thanks
  Shivaram
 
  On Mon, Jul 20, 2015 at 12:55 PM, Mridul Muralidharan mri...@gmail.com
  wrote:
 
  Might be a good idea to get the PMC's of both projects to sign off to
  prevent future issues with apache.
 
  Regards,
  Mridul
 
  On Mon, Jul 20, 2015 at 12:01 PM, Shivaram Venkataraman
  shiva...@eecs.berkeley.edu wrote:
   I've created https://github.com/amplab/spark-ec2 and added an initial
   set of
   committers. Note that this is not a fork of the existing
   github.com/mesos/spark-ec2 and users will need to fork from here.
 This
   is
   mostly to avoid the base-fork in pull requests being set incorrectly
   etc.
  
   I'll be migrating some PRs / closing them in the old repo and will
 also
   update the README in that repo.
  
   Thanks
   Shivaram
  
   On Fri, Jul 17, 2015 at 3:00 PM, Sean Owen so...@cloudera.com
 wrote:
  
   On Fri, Jul 17, 2015 at 6:58 PM, Shivaram Venkataraman
   shiva...@eecs.berkeley.edu wrote:
I am not sure why the ASF JIRA can be only used to track one set of
artifacts that are packaged and released together. I agree that
marking
a
fix version as 1.5 for a change in another repo doesn't make a lot
 of
sense,
but we could just not use fix versions for the EC2 issues ?
  
   *shrug* it just seems harder and less natural to use ASF JIRA. What's
   the benefit? I agree it's not a big deal either way but it's a small
   part of the problem we're solving in the first place. I suspect that
   one way or the other, there would be issues filed both places, so
 this
   probably isn't worth debating.
  
  
My concerns are less about it being pushed out etc. For better or
worse
we
have had EC2 scripts be a part of the Spark distribution from a
 very
early
stage (from version 0.5.0 if my git history reading is correct).
 So
users
will assume that any error with EC2 scripts belong to the Spark
project.
In
addition almost all the contributions to the EC2 scripts come from
Spark
developers and so keeping the issues in the same mailing list /
 JIRA
seems
natural. This I guess again relates to the question of managing
issues
for
code that isn't part of the Spark release artifact.
  
   Yeah good question -- Github doesn't give you a mailing list. I think
   dev@ would still be where it's discussed which is ... again 'part of
   the problem' but as you say, probably beneficial. It's a pretty low
   traffic topic anyway.
  
  
I'll create the amplab/spark-ec2 repo over the next couple of days
unless
there are more comments on this thread. This will at least
 alleviate
some of
the naming confusion over using a repository in mesos and I'll give
Sean,
Nick, Matthew commit access to it. I am still not convinced about
moving
the
issues over though.
  
   I won't move the issues. Maybe time tells whether one approach is
   better, or that it just doesn't matter.
  
   However it'd be a great opportunity to review and clear stale EC2
   issues.
  
  
 
 



Re: Make off-heap store pluggable

2015-07-21 Thread Alexey Goncharuk
2015-07-20 23:29 GMT-07:00 Matei Zaharia matei.zaha...@gmail.com:

 I agree with this -- basically, to build on Reynold's point, you should be
 able to get almost the same performance by implementing either the Hadoop
 FileSystem API or the Spark Data Source API over Ignite in the right way.
 This would let people save data persistently in Ignite in addition to using
 it for caching, and it would provide a global namespace, optionally a
 schema, etc. You can still provide data locality, short-circuit reads, etc
 with these APIs.


Absolutely agree.

In fact, Ignite already provides a shared RDD implementation which is
essentially a view of Ignite cache data. This implementation adheres to the
Spark DataFrame API. More information can be found here:
http://ignite.incubator.apache.org/features/igniterdd.html

Also, the Ignite in-memory filesystem is compliant with the Hadoop filesystem
API and can transparently replace HDFS if needed. Plugging it into Spark should
be fairly easy. More information can be found here:
http://ignite.incubator.apache.org/features/igfs.html
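As a rough illustration (not from this thread), Spark can already read from any
Hadoop-compatible filesystem through its normal path-based APIs, so an
Ignite-backed filesystem would plug in the same way. The igfs:// URI, host and
port below are assumptions for illustration only:

// Hypothetical sketch: reading from an Ignite-backed, Hadoop-compatible
// filesystem via Spark's standard Hadoop path API. The exact URI format
// depends on how the IGFS Hadoop accelerator is configured.
val lines = sc.textFile("igfs://igfs@localhost:10500/data/events.txt")
println(lines.count())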

--Alexey


Re: Make off-heap store pluggable

2015-07-21 Thread Zhan Zhang
Hi Alexey,

SPARK-6479 (https://issues.apache.org/jira/browse/SPARK-6479) is for the plugin
API, and SPARK-6112 (https://issues.apache.org/jira/browse/SPARK-6112) is for the
HDFS plugin.


Thanks.

Zhan Zhang

On Jul 21, 2015, at 10:56 AM, Alexey Goncharuk alexey.goncha...@gmail.com wrote:


2015-07-20 23:29 GMT-07:00 Matei Zaharia matei.zaha...@gmail.com:
I agree with this -- basically, to build on Reynold's point, you should be able 
to get almost the same performance by implementing either the Hadoop FileSystem 
API or the Spark Data Source API over Ignite in the right way. This would let 
people save data persistently in Ignite in addition to using it for caching, 
and it would provide a global namespace, optionally a schema, etc. You can 
still provide data locality, short-circuit reads, etc with these APIs.

Absolutely agree.

In fact, Ignite already provides a shared RDD implementation which is 
essentially a view of Ignite cache data. This implementation adheres to the 
Spark DataFrame API. More information can be found here: 
http://ignite.incubator.apache.org/features/igniterdd.html

Also, Ignite in-memory filesystem is compliant with Hadoop filesystem API and 
can transparently replace HDFS if needed. Plugging it into Spark should be 
fairly easy. More information can be found here: 
http://ignite.incubator.apache.org/features/igfs.html

--Alexey




Re: countByValue on dataframe with multiple columns

2015-07-21 Thread Reynold Xin
Is this just frequent items?

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L97
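For reference, a minimal sketch of calling that frequent-items API (the column
names here are made up for illustration; freqItems returns approximate frequent
items per column, not exact, ordered counts):

import org.apache.spark.sql.DataFrame

// Hypothetical columns "category" and "country"; 1% support threshold.
def frequentExample(df: DataFrame): Unit =
  df.stat.freqItems(Array("category", "country"), 0.01).show()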



On Tue, Jul 21, 2015 at 7:39 AM, Ted Malaska ted.mala...@cloudera.com
wrote:

 100% I would love to do it.  Who a good person to review the design with.
 All I need is a quick chat about the design and approach and I'll create
 the jira and push a patch.

 Ted Malaska

 On Tue, Jul 21, 2015 at 10:19 AM, Olivier Girardot 
 o.girar...@lateral-thoughts.com wrote:

 Hi Ted,
 The TopNList would be great to see directly in the Dataframe API and my
 wish would be to be able to apply it on multiple columns at the same time
 and get all these statistics.
 the .describe() function is close to what we want to achieve, maybe we
 could try to enrich its output.
 Anyway, even as a spark-package, if you could package your code for
 Dataframes, that would be great.

 Regards,

 Olivier.

 2015-07-21 15:08 GMT+02:00 Jonathan Winandy jonathan.wina...@gmail.com:

 Ha ok !

 Then generic part would have that signature :

 def countColsByValue(df:Dataframe):Map[String /* colname */,Dataframe]


 +1 for more work (blog / api) for data quality checks.

 Cheers,
 Jonathan


 TopCMSParams and some other monoids from Algebird are really cool for
 that :

 https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/CountMinSketch.scala#L590


 On 21 July 2015 at 13:40, Ted Malaska ted.mala...@cloudera.com wrote:

 I'm guessing you want something like what I put in this blog post.


 http://blog.cloudera.com/blog/2015/07/how-to-do-data-quality-checks-using-apache-spark-dataframes/

 This is a very common use case.  If there is a +1 I would love to add
 it to dataframes.

 Let me know
 Ted Malaska

 On Tue, Jul 21, 2015 at 7:24 AM, Olivier Girardot 
 o.girar...@lateral-thoughts.com wrote:

 Yop,
 actually the generic part does not work, the countByValue on one
 column gives you the count for each value seen in the column.
 I would like a generic (multi-column) countByValue to give me the same
 kind of output for each column, not considering each n-uples of each 
 column
 value as the key (which is what the groupBy is doing by default).

 Regards,

 Olivier

 2015-07-20 14:18 GMT+02:00 Jonathan Winandy 
 jonathan.wina...@gmail.com:

 Ahoy !

 Maybe you can get countByValue by using sql.GroupedData :

 // some DF
 val df: DataFrame = sqlContext.createDataFrame(
   sc.parallelize(List("A", "B", "B", "A")).map(Row.apply(_)),
   StructType(List(StructField("n", StringType))))

 df.groupBy("n").count().show()

 // generic
 def countByValueDf(df: DataFrame) = {
   val (h :: r) = df.columns.toList
   df.groupBy(h, r: _*).count()
 }

 countByValueDf(df).show()


 Cheers,
 Jon

 On 20 July 2015 at 11:28, Olivier Girardot 
 o.girar...@lateral-thoughts.com wrote:

 Hi,
 Is there any plan to add the countByValue function to Spark SQL
 Dataframe ?
 Even
 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala#L78
 is using the RDD part right now, but for ML purposes, being able to get 
 the
 most frequent categorical value on multiple columns would be very 
 useful.


 Regards,


 --
 *Olivier Girardot* | Associé
 o.girar...@lateral-thoughts.com
 +33 6 24 09 17 94





 --
 *Olivier Girardot* | Associé
 o.girar...@lateral-thoughts.com
 +33 6 24 09 17 94






 --
 *Olivier Girardot* | Associé
 o.girar...@lateral-thoughts.com
 +33 6 24 09 17 94





Re: Make off-heap store pluggable

2015-07-21 Thread Alexey Goncharuk
2015-07-20 21:40 GMT-07:00 Reynold Xin r...@databricks.com:

 I sent it prematurely.

 They are already pluggable, or at least in the process to be more
 pluggable. In 1.4, instead of calling the external system's API directly,
 we added an API for that.  There is a patch to add support for HDFS
 in-memory cache.


Is there a ticket or branch you can show me so I can take a look at the
ongoing work?

--Alexey


-Phive-thriftserver when compiling for use in pyspark and JDBC connections

2015-07-21 Thread Aaron
I compile/make a distribution, with either the 1.4 branch or master,
using -Phive-thriftserver, and attempt a JDBC connection to a
MySQL DB, using the latest connector (5.1.36) jar.

When I set up the pyspark shell with:

bin/pyspark --jars mysql-connection...jar --driver-class-path
mysql-connector..jar

then make a data frame from sqlContext.read.jdbc(jdbc://...)

and perhaps do df.show(),

things seem to work; but if I compile without that
-Phive-thriftserver profile, the same Python code, with the same settings
(spark-env, spark-defaults), just hangs, never to return.

I am curious how the hive-thriftserver module plays into this type of
interaction.

Thanks in advance.

Cheers,
Aaron




What is the difference between SlowSparkPullRequestBuilder and SparkPullRequestBuilder?

2015-07-21 Thread Yu Ishikawa
Hi all, 

When we send a PR, it seems that two test-run requests are sometimes sent to
Jenkins.
What is the difference between SparkPullRequestBuilder and
SlowSparkPullRequestBuilder?

Thanks,
Yu







Re: Should spark-ec2 get its own repo?

2015-07-21 Thread Mridul Muralidharan
That sounds good. Thanks for clarifying !


Regards,
Mridul

On Tue, Jul 21, 2015 at 11:09 AM, Shivaram Venkataraman
shiva...@eecs.berkeley.edu wrote:
 Thats part of the confusion we are trying to fix here -- the repository used
 to live in the mesos github account but was never a part of the Apache Mesos
 project. It was a remnant part of Spark from when Spark used to live at
 github.com/mesos/spark.

 Shivaram

 On Tue, Jul 21, 2015 at 11:03 AM, Mridul Muralidharan mri...@gmail.com
 wrote:

 If I am not wrong, since the code was hosted within mesos project
 repo, I assume (atleast part of it) is owned by mesos project and so
 its PMC ?

 - Mridul

 On Tue, Jul 21, 2015 at 9:22 AM, Shivaram Venkataraman
 shiva...@eecs.berkeley.edu wrote:
  There is technically no PMC for the spark-ec2 project (I guess we are
  kind
  of establishing one right now). I haven't heard anything from the Spark
  PMC
  on the dev list that might suggest a need for a vote so far. I will send
  another round of email notification to the dev list when we have a JIRA
  / PR
  that actually moves the scripts (right now the only thing that changed
  is
  the location of some scripts in mesos/ to amplab/).
 
  Thanks
  Shivaram
 
  On Mon, Jul 20, 2015 at 12:55 PM, Mridul Muralidharan mri...@gmail.com
  wrote:
 
  Might be a good idea to get the PMC's of both projects to sign off to
  prevent future issues with apache.
 
  Regards,
  Mridul
 
  On Mon, Jul 20, 2015 at 12:01 PM, Shivaram Venkataraman
  shiva...@eecs.berkeley.edu wrote:
   I've created https://github.com/amplab/spark-ec2 and added an initial
   set of
   committers. Note that this is not a fork of the existing
   github.com/mesos/spark-ec2 and users will need to fork from here.
   This
   is
   mostly to avoid the base-fork in pull requests being set incorrectly
   etc.
  
   I'll be migrating some PRs / closing them in the old repo and will
   also
   update the README in that repo.
  
   Thanks
   Shivaram
  
   On Fri, Jul 17, 2015 at 3:00 PM, Sean Owen so...@cloudera.com
   wrote:
  
   On Fri, Jul 17, 2015 at 6:58 PM, Shivaram Venkataraman
   shiva...@eecs.berkeley.edu wrote:
I am not sure why the ASF JIRA can be only used to track one set
of
artifacts that are packaged and released together. I agree that
marking
a
fix version as 1.5 for a change in another repo doesn't make a lot
of
sense,
but we could just not use fix versions for the EC2 issues ?
  
   *shrug* it just seems harder and less natural to use ASF JIRA.
   What's
   the benefit? I agree it's not a big deal either way but it's a small
   part of the problem we're solving in the first place. I suspect that
   one way or the other, there would be issues filed both places, so
   this
   probably isn't worth debating.
  
  
My concerns are less about it being pushed out etc. For better or
worse
we
have had EC2 scripts be a part of the Spark distribution from a
very
early
stage (from version 0.5.0 if my git history reading is correct).
So
users
will assume that any error with EC2 scripts belong to the Spark
project.
In
addition almost all the contributions to the EC2 scripts come from
Spark
developers and so keeping the issues in the same mailing list /
JIRA
seems
natural. This I guess again relates to the question of managing
issues
for
code that isn't part of the Spark release artifact.
  
   Yeah good question -- Github doesn't give you a mailing list. I
   think
   dev@ would still be where it's discussed which is ... again 'part of
   the problem' but as you say, probably beneficial. It's a pretty low
   traffic topic anyway.
  
  
I'll create the amplab/spark-ec2 repo over the next couple of days
unless
there are more comments on this thread. This will at least
alleviate
some of
the naming confusion over using a repository in mesos and I'll
give
Sean,
Nick, Matthew commit access to it. I am still not convinced about
moving
the
issues over though.
  
   I won't move the issues. Maybe time tells whether one approach is
   better, or that it just doesn't matter.
  
   However it'd be a great opportunity to review and clear stale EC2
   issues.
  
  
 
 






Re: countByValue on dataframe with multiple columns

2015-07-21 Thread Ted Malaska
Look at the implementation for frequent items.  It is different from a
true count.
On Jul 21, 2015 1:19 PM, Reynold Xin r...@databricks.com wrote:

 Is this just frequent items?


 https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L97



 On Tue, Jul 21, 2015 at 7:39 AM, Ted Malaska ted.mala...@cloudera.com
 wrote:

 100% I would love to do it.  Who a good person to review the design
 with.  All I need is a quick chat about the design and approach and I'll
 create the jira and push a patch.

 Ted Malaska

 On Tue, Jul 21, 2015 at 10:19 AM, Olivier Girardot 
 o.girar...@lateral-thoughts.com wrote:

 Hi Ted,
 The TopNList would be great to see directly in the Dataframe API and my
 wish would be to be able to apply it on multiple columns at the same time
 and get all these statistics.
 the .describe() function is close to what we want to achieve, maybe we
 could try to enrich its output.
 Anyway, even as a spark-package, if you could package your code for
 Dataframes, that would be great.

 Regards,

 Olivier.

 2015-07-21 15:08 GMT+02:00 Jonathan Winandy jonathan.wina...@gmail.com
 :

 Ha ok !

 Then generic part would have that signature :

 def countColsByValue(df:Dataframe):Map[String /* colname */,Dataframe]


 +1 for more work (blog / api) for data quality checks.

 Cheers,
 Jonathan


 TopCMSParams and some other monoids from Algebird are really cool for
 that :

 https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/CountMinSketch.scala#L590


 On 21 July 2015 at 13:40, Ted Malaska ted.mala...@cloudera.com wrote:

 I'm guessing you want something like what I put in this blog post.


 http://blog.cloudera.com/blog/2015/07/how-to-do-data-quality-checks-using-apache-spark-dataframes/

 This is a very common use case.  If there is a +1 I would love to add
 it to dataframes.

 Let me know
 Ted Malaska

 On Tue, Jul 21, 2015 at 7:24 AM, Olivier Girardot 
 o.girar...@lateral-thoughts.com wrote:

 Yop,
 actually the generic part does not work, the countByValue on one
 column gives you the count for each value seen in the column.
 I would like a generic (multi-column) countByValue to give me the
 same kind of output for each column, not considering each n-uples of each
 column value as the key (which is what the groupBy is doing by default).

 Regards,

 Olivier

 2015-07-20 14:18 GMT+02:00 Jonathan Winandy 
 jonathan.wina...@gmail.com:

 Ahoy !

 Maybe you can get countByValue by using sql.GroupedData :

 // some DF
 val df: DataFrame = sqlContext.createDataFrame(
   sc.parallelize(List("A", "B", "B", "A")).map(Row.apply(_)),
   StructType(List(StructField("n", StringType))))

 df.groupBy("n").count().show()

 // generic
 def countByValueDf(df: DataFrame) = {
   val (h :: r) = df.columns.toList
   df.groupBy(h, r: _*).count()
 }

 countByValueDf(df).show()


 Cheers,
 Jon

 On 20 July 2015 at 11:28, Olivier Girardot 
 o.girar...@lateral-thoughts.com wrote:

 Hi,
 Is there any plan to add the countByValue function to Spark SQL
 Dataframe ?
 Even
 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala#L78
 is using the RDD part right now, but for ML purposes, being able to 
 get the
 most frequent categorical value on multiple columns would be very 
 useful.


 Regards,


 --
 *Olivier Girardot* | Associé
 o.girar...@lateral-thoughts.com
 +33 6 24 09 17 94





 --
 *Olivier Girardot* | Associé
 o.girar...@lateral-thoughts.com
 +33 6 24 09 17 94






 --
 *Olivier Girardot* | Associé
 o.girar...@lateral-thoughts.com
 +33 6 24 09 17 94






Re: countByValue on dataframe with multiple columns

2015-07-21 Thread Olivier Girardot
Yes, and freqItems does not give you an ordered count (right?), the
threshold makes it difficult to calibrate, and we noticed some strange
behaviour when testing it on small datasets.

2015-07-21 20:30 GMT+02:00 Ted Malaska ted.mala...@cloudera.com:

 Look at the implementation for frequently items.  It is a different from
 true count.
 On Jul 21, 2015 1:19 PM, Reynold Xin r...@databricks.com wrote:

 Is this just frequent items?


 https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L97



 On Tue, Jul 21, 2015 at 7:39 AM, Ted Malaska ted.mala...@cloudera.com
 wrote:

 100% I would love to do it.  Who a good person to review the design
 with.  All I need is a quick chat about the design and approach and I'll
 create the jira and push a patch.

 Ted Malaska

 On Tue, Jul 21, 2015 at 10:19 AM, Olivier Girardot 
 o.girar...@lateral-thoughts.com wrote:

 Hi Ted,
 The TopNList would be great to see directly in the Dataframe API and my
 wish would be to be able to apply it on multiple columns at the same time
 and get all these statistics.
 the .describe() function is close to what we want to achieve, maybe we
 could try to enrich its output.
 Anyway, even as a spark-package, if you could package your code for
 Dataframes, that would be great.

 Regards,

 Olivier.

 2015-07-21 15:08 GMT+02:00 Jonathan Winandy jonathan.wina...@gmail.com
 :

 Ha ok !

 Then generic part would have that signature :

 def countColsByValue(df:Dataframe):Map[String /* colname */,Dataframe]


 +1 for more work (blog / api) for data quality checks.

 Cheers,
 Jonathan


 TopCMSParams and some other monoids from Algebird are really cool for
 that :

 https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/CountMinSketch.scala#L590


 On 21 July 2015 at 13:40, Ted Malaska ted.mala...@cloudera.com
 wrote:

 I'm guessing you want something like what I put in this blog post.


 http://blog.cloudera.com/blog/2015/07/how-to-do-data-quality-checks-using-apache-spark-dataframes/

 This is a very common use case.  If there is a +1 I would love to add
 it to dataframes.

 Let me know
 Ted Malaska

 On Tue, Jul 21, 2015 at 7:24 AM, Olivier Girardot 
 o.girar...@lateral-thoughts.com wrote:

 Yop,
 actually the generic part does not work, the countByValue on one
 column gives you the count for each value seen in the column.
 I would like a generic (multi-column) countByValue to give me the
 same kind of output for each column, not considering each n-uples of 
 each
 column value as the key (which is what the groupBy is doing by default).

 Regards,

 Olivier

 2015-07-20 14:18 GMT+02:00 Jonathan Winandy 
 jonathan.wina...@gmail.com:

 Ahoy !

 Maybe you can get countByValue by using sql.GroupedData :

 // some DF
 val df: DataFrame = sqlContext.createDataFrame(
   sc.parallelize(List("A", "B", "B", "A")).map(Row.apply(_)),
   StructType(List(StructField("n", StringType))))

 df.groupBy("n").count().show()

 // generic
 def countByValueDf(df: DataFrame) = {
   val (h :: r) = df.columns.toList
   df.groupBy(h, r: _*).count()
 }

 countByValueDf(df).show()


 Cheers,
 Jon

 On 20 July 2015 at 11:28, Olivier Girardot 
 o.girar...@lateral-thoughts.com wrote:

 Hi,
 Is there any plan to add the countByValue function to Spark SQL
 Dataframe ?
 Even
 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala#L78
 is using the RDD part right now, but for ML purposes, being able to 
 get the
 most frequent categorical value on multiple columns would be very 
 useful.


 Regards,


 --
 *Olivier Girardot* | Associé
 o.girar...@lateral-thoughts.com
 +33 6 24 09 17 94





 --
 *Olivier Girardot* | Associé
 o.girar...@lateral-thoughts.com
 +33 6 24 09 17 94






 --
 *Olivier Girardot* | Associé
 o.girar...@lateral-thoughts.com
 +33 6 24 09 17 94






-- 
*Olivier Girardot* | Associé
o.girar...@lateral-thoughts.com
+33 6 24 09 17 94


Re: countByValue on dataframe with multiple columns

2015-07-21 Thread Ted Malaska
Cool, I will make a JIRA after I check in to my hotel, and try to get a
patch early next week.
On Jul 21, 2015 5:15 PM, Olivier Girardot o.girar...@lateral-thoughts.com
wrote:

 yes and freqItems does not give you an ordered count (right ?) + the
 threshold makes it difficult to calibrate it + we noticed some strange
 behaviour when testing it on small datasets.

 2015-07-21 20:30 GMT+02:00 Ted Malaska ted.mala...@cloudera.com:

 Look at the implementation for frequently items.  It is a different from
 true count.
 On Jul 21, 2015 1:19 PM, Reynold Xin r...@databricks.com wrote:

 Is this just frequent items?


 https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L97



 On Tue, Jul 21, 2015 at 7:39 AM, Ted Malaska ted.mala...@cloudera.com
 wrote:

 100% I would love to do it.  Who a good person to review the design
 with.  All I need is a quick chat about the design and approach and I'll
 create the jira and push a patch.

 Ted Malaska

 On Tue, Jul 21, 2015 at 10:19 AM, Olivier Girardot 
 o.girar...@lateral-thoughts.com wrote:

 Hi Ted,
 The TopNList would be great to see directly in the Dataframe API and
 my wish would be to be able to apply it on multiple columns at the same
 time and get all these statistics.
 the .describe() function is close to what we want to achieve, maybe we
 could try to enrich its output.
 Anyway, even as a spark-package, if you could package your code for
 Dataframes, that would be great.

 Regards,

 Olivier.

 2015-07-21 15:08 GMT+02:00 Jonathan Winandy 
 jonathan.wina...@gmail.com:

 Ha ok !

 Then generic part would have that signature :

 def countColsByValue(df:Dataframe):Map[String /* colname */,Dataframe]


 +1 for more work (blog / api) for data quality checks.

 Cheers,
 Jonathan


 TopCMSParams and some other monoids from Algebird are really cool for
 that :

 https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/CountMinSketch.scala#L590


 On 21 July 2015 at 13:40, Ted Malaska ted.mala...@cloudera.com
 wrote:

 I'm guessing you want something like what I put in this blog post.


 http://blog.cloudera.com/blog/2015/07/how-to-do-data-quality-checks-using-apache-spark-dataframes/

 This is a very common use case.  If there is a +1 I would love to
 add it to dataframes.

 Let me know
 Ted Malaska

 On Tue, Jul 21, 2015 at 7:24 AM, Olivier Girardot 
 o.girar...@lateral-thoughts.com wrote:

 Yop,
 actually the generic part does not work, the countByValue on one
 column gives you the count for each value seen in the column.
 I would like a generic (multi-column) countByValue to give me the
 same kind of output for each column, not considering each n-uples of 
 each
 column value as the key (which is what the groupBy is doing by 
 default).

 Regards,

 Olivier

 2015-07-20 14:18 GMT+02:00 Jonathan Winandy 
 jonathan.wina...@gmail.com:

 Ahoy !

 Maybe you can get countByValue by using sql.GroupedData :

 // some DF
 val df: DataFrame = sqlContext.createDataFrame(
   sc.parallelize(List("A", "B", "B", "A")).map(Row.apply(_)),
   StructType(List(StructField("n", StringType))))

 df.groupBy("n").count().show()

 // generic
 def countByValueDf(df: DataFrame) = {
   val (h :: r) = df.columns.toList
   df.groupBy(h, r: _*).count()
 }

 countByValueDf(df).show()


 Cheers,
 Jon

 On 20 July 2015 at 11:28, Olivier Girardot 
 o.girar...@lateral-thoughts.com wrote:

 Hi,
 Is there any plan to add the countByValue function to Spark SQL
 Dataframe ?
 Even
 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala#L78
 is using the RDD part right now, but for ML purposes, being able to 
 get the
 most frequent categorical value on multiple columns would be very 
 useful.


 Regards,


 --
 *Olivier Girardot* | Associé
 o.girar...@lateral-thoughts.com
 +33 6 24 09 17 94





 --
 *Olivier Girardot* | Associé
 o.girar...@lateral-thoughts.com
 +33 6 24 09 17 94






 --
 *Olivier Girardot* | Associé
 o.girar...@lateral-thoughts.com
 +33 6 24 09 17 94






 --
 *Olivier Girardot* | Associé
 o.girar...@lateral-thoughts.com
 +33 6 24 09 17 94



Re: Make off-heap store pluggable

2015-07-21 Thread Matei Zaharia
I agree with this -- basically, to build on Reynold's point, you should be able 
to get almost the same performance by implementing either the Hadoop FileSystem 
API or the Spark Data Source API over Ignite in the right way. This would let 
people save data persistently in Ignite in addition to using it for caching, 
and it would provide a global namespace, optionally a schema, etc. You can 
still provide data locality, short-circuit reads, etc with these APIs.

Matei

 On Jul 20, 2015, at 9:40 PM, Reynold Xin r...@databricks.com wrote:
 
 I sent it prematurely.
 
 They are already pluggable, or at least in the process to be more pluggable. 
 In 1.4, instead of calling the external system's API directly, we added an 
 API for that.  There is a patch to add support for HDFS in-memory cache. 
 
 Somewhat orthogonal to this, longer term, I am not sure whether it makes 
 sense to have the current off heap API, because there is no namespacing and 
 the benefit to end users is actually not very substantial (at least I can 
 think of much simpler ways to achieve exactly the same gains), and yet it 
 introduces quite a bit of complexity to the codebase.
 
 
 
 
 On Mon, Jul 20, 2015 at 9:34 PM, Reynold Xin r...@databricks.com wrote:
 They are already pluggable.
 
 
 On Mon, Jul 20, 2015 at 9:32 PM, Prashant Sharma scrapco...@gmail.com wrote:
 +1 Looks like a nice idea(I do not see any harm). Would you like to work on 
 the patch to support it ?
 
 Prashant Sharma
 
 
 
 On Tue, Jul 21, 2015 at 2:46 AM, Alexey Goncharuk alexey.goncha...@gmail.com wrote:
 Hello Spark community,
 
 I was looking through the code in order to understand better how RDD is 
 persisted to Tachyon off-heap filesystem. It looks like that the Tachyon 
 filesystem is hard-coded and there is no way to switch to another in-memory 
 filesystem. I think it would be great if the implementation of the 
 BlockManager and BlockStore would be able to plug in another filesystem.
 
 For example, Apache Ignite also has an implementation of in-memory filesystem 
 which can store data in on-heap and off-heap formats. It would be great if it 
 could integrate with Spark.
 
 I have filed a ticket in Jira:
 https://issues.apache.org/jira/browse/SPARK-9203
 
 If it makes sense, I will be happy to contribute to it.
 
 Thoughts?
 
 -Alexey (Apache Ignite PMC)
 
 
 



Re: Make off-heap store pluggable

2015-07-21 Thread Sean Owen
(Related, minor comment: it would also be nice to separate out the
Tachyon dependency from core, as it's conceptually pluggable but is still
hard-coded in several places in the code and in a lot of the comments/docs.)

On Tue, Jul 21, 2015 at 5:40 AM, Reynold Xin r...@databricks.com wrote:

 I sent it prematurely.

 They are already pluggable, or at least in the process to be more
 pluggable. In 1.4, instead of calling the external system's API directly,
 we added an API for that.  There is a patch to add support for HDFS
 in-memory cache.

 Somewhat orthogonal to this, longer term, I am not sure whether it makes
 sense to have the current off heap API, because there is no namespacing and
 the benefit to end users is actually not very substantial (at least I can
 think of much simpler ways to achieve exactly the same gains), and yet it
 introduces quite a bit of complexity to the codebase.




 On Mon, Jul 20, 2015 at 9:34 PM, Reynold Xin r...@databricks.com wrote:

 They are already pluggable.


 On Mon, Jul 20, 2015 at 9:32 PM, Prashant Sharma scrapco...@gmail.com
 wrote:

 +1 Looks like a nice idea(I do not see any harm). Would you like to work
 on the patch to support it ?

 Prashant Sharma



 On Tue, Jul 21, 2015 at 2:46 AM, Alexey Goncharuk 
 alexey.goncha...@gmail.com wrote:

 Hello Spark community,

 I was looking through the code in order to understand better how RDD is
 persisted to Tachyon off-heap filesystem. It looks like that the Tachyon
 filesystem is hard-coded and there is no way to switch to another in-memory
 filesystem. I think it would be great if the implementation of the
 BlockManager and BlockStore would be able to plug in another filesystem.

 For example, Apache Ignite also has an implementation of in-memory
 filesystem which can store data in on-heap and off-heap formats. It would
 be great if it could integrate with Spark.

 I have filed a ticket in Jira:
 https://issues.apache.org/jira/browse/SPARK-9203

 If it makes sense, I will be happy to contribute to it.

 Thoughts?

 -Alexey (Apache Ignite PMC)







Re: countByValue on dataframe with multiple columns

2015-07-21 Thread Ted Malaska
I'm guessing you want something like what I put in this blog post.

http://blog.cloudera.com/blog/2015/07/how-to-do-data-quality-checks-using-apache-spark-dataframes/

This is a very common use case.  If there is a +1, I would love to add it to
DataFrames.
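(Not the code from the blog post, just a rough sketch of what a per-column
top-N could look like with the public DataFrame API; df and n are assumed.)

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.desc

// For each column, count occurrences of each value and keep the n most frequent.
def topNByColumn(df: DataFrame, n: Int): Map[String, DataFrame] =
  df.columns.map { c =>
    c -> df.groupBy(c).count().orderBy(desc("count")).limit(n)
  }.toMap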

Let me know
Ted Malaska

On Tue, Jul 21, 2015 at 7:24 AM, Olivier Girardot 
o.girar...@lateral-thoughts.com wrote:

 Yop,
 actually the generic part does not work, the countByValue on one column
 gives you the count for each value seen in the column.
 I would like a generic (multi-column) countByValue to give me the same
 kind of output for each column, not considering each n-uples of each column
 value as the key (which is what the groupBy is doing by default).

 Regards,

 Olivier

 2015-07-20 14:18 GMT+02:00 Jonathan Winandy jonathan.wina...@gmail.com:

 Ahoy !

 Maybe you can get countByValue by using sql.GroupedData :

 // some DF
 val df: DataFrame = sqlContext.createDataFrame(
   sc.parallelize(List("A", "B", "B", "A")).map(Row.apply(_)),
   StructType(List(StructField("n", StringType))))

 df.groupBy("n").count().show()

 // generic
 def countByValueDf(df: DataFrame) = {
   val (h :: r) = df.columns.toList
   df.groupBy(h, r: _*).count()
 }

 countByValueDf(df).show()


 Cheers,
 Jon

 On 20 July 2015 at 11:28, Olivier Girardot 
 o.girar...@lateral-thoughts.com wrote:

 Hi,
 Is there any plan to add the countByValue function to Spark SQL
 Dataframe ?
 Even
 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala#L78
 is using the RDD part right now, but for ML purposes, being able to get the
 most frequent categorical value on multiple columns would be very useful.


 Regards,


 --
 *Olivier Girardot* | Associé
 o.girar...@lateral-thoughts.com
 +33 6 24 09 17 94





 --
 *Olivier Girardot* | Associé
 o.girar...@lateral-thoughts.com
 +33 6 24 09 17 94



Re: countByValue on dataframe with multiple columns

2015-07-21 Thread Olivier Girardot
Yop,
actually the generic part does not work: countByValue on one column
gives you the count for each value seen in that column.
I would like a generic (multi-column) countByValue to give me the same kind
of output for each column, rather than treating each n-tuple of column
values as the key (which is what groupBy does by default).

Regards,

Olivier

2015-07-20 14:18 GMT+02:00 Jonathan Winandy jonathan.wina...@gmail.com:

 Ahoy !

 Maybe you can get countByValue by using sql.GroupedData :

 // some DF
 val df: DataFrame = sqlContext.createDataFrame(
   sc.parallelize(List("A", "B", "B", "A")).map(Row.apply(_)),
   StructType(List(StructField("n", StringType))))

 df.groupBy("n").count().show()

 // generic
 def countByValueDf(df: DataFrame) = {
   val (h :: r) = df.columns.toList
   df.groupBy(h, r: _*).count()
 }

 countByValueDf(df).show()


 Cheers,
 Jon

 On 20 July 2015 at 11:28, Olivier Girardot 
 o.girar...@lateral-thoughts.com wrote:

 Hi,
 Is there any plan to add the countByValue function to Spark SQL Dataframe
 ?
 Even
 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala#L78
 is using the RDD part right now, but for ML purposes, being able to get the
 most frequent categorical value on multiple columns would be very useful.


 Regards,


 --
 *Olivier Girardot* | Associé
 o.girar...@lateral-thoughts.com
 +33 6 24 09 17 94





-- 
*Olivier Girardot* | Associé
o.girar...@lateral-thoughts.com
+33 6 24 09 17 94


Re: countByValue on dataframe with multiple columns

2015-07-21 Thread Jonathan Winandy
Ha ok !

Then the generic part would have this signature:

def countColsByValue(df: DataFrame): Map[String /* colname */, DataFrame]
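
(A minimal sketch of that signature using the existing groupBy/count API, just
as an illustration of the idea rather than a final design:)

import org.apache.spark.sql.DataFrame

// One counts-DataFrame per column, keyed by column name.
def countColsByValue(df: DataFrame): Map[String /* colname */, DataFrame] =
  df.columns.map(c => c -> df.groupBy(c).count()).toMap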


+1 for more work (blog / api) for data quality checks.

Cheers,
Jonathan


TopCMSParams and some other monoids from Algebird are really cool for that :
https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/CountMinSketch.scala#L590


On 21 July 2015 at 13:40, Ted Malaska ted.mala...@cloudera.com wrote:

 I'm guessing you want something like what I put in this blog post.


 http://blog.cloudera.com/blog/2015/07/how-to-do-data-quality-checks-using-apache-spark-dataframes/

 This is a very common use case.  If there is a +1 I would love to add it
 to dataframes.

 Let me know
 Ted Malaska

 On Tue, Jul 21, 2015 at 7:24 AM, Olivier Girardot 
 o.girar...@lateral-thoughts.com wrote:

 Yop,
 actually the generic part does not work, the countByValue on one column
 gives you the count for each value seen in the column.
 I would like a generic (multi-column) countByValue to give me the same
 kind of output for each column, not considering each n-uples of each column
 value as the key (which is what the groupBy is doing by default).

 Regards,

 Olivier

 2015-07-20 14:18 GMT+02:00 Jonathan Winandy jonathan.wina...@gmail.com:

 Ahoy !

 Maybe you can get countByValue by using sql.GroupedData :

 // some DF
 val df: DataFrame = sqlContext.createDataFrame(
   sc.parallelize(List("A", "B", "B", "A")).map(Row.apply(_)),
   StructType(List(StructField("n", StringType))))

 df.groupBy("n").count().show()

 // generic
 def countByValueDf(df: DataFrame) = {
   val (h :: r) = df.columns.toList
   df.groupBy(h, r: _*).count()
 }

 countByValueDf(df).show()


 Cheers,
 Jon

 On 20 July 2015 at 11:28, Olivier Girardot 
 o.girar...@lateral-thoughts.com wrote:

 Hi,
 Is there any plan to add the countByValue function to Spark SQL
 Dataframe ?
 Even
 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala#L78
 is using the RDD part right now, but for ML purposes, being able to get the
 most frequent categorical value on multiple columns would be very useful.


 Regards,


 --
 *Olivier Girardot* | Associé
 o.girar...@lateral-thoughts.com
 +33 6 24 09 17 94





 --
 *Olivier Girardot* | Associé
 o.girar...@lateral-thoughts.com
 +33 6 24 09 17 94





Re: countByValue on dataframe with multiple columns

2015-07-21 Thread Olivier Girardot
Hi Ted,
The TopNList would be great to see directly in the DataFrame API, and my
wish would be to be able to apply it on multiple columns at the same time
and get all these statistics.
The .describe() function is close to what we want to achieve; maybe we
could try to enrich its output.
Anyway, even as a spark-package, if you could package your code for
DataFrames, that would be great.

Regards,

Olivier.

2015-07-21 15:08 GMT+02:00 Jonathan Winandy jonathan.wina...@gmail.com:

 Ha ok !

 Then generic part would have that signature :

 def countColsByValue(df:Dataframe):Map[String /* colname */,Dataframe]


 +1 for more work (blog / api) for data quality checks.

 Cheers,
 Jonathan


 TopCMSParams and some other monoids from Algebird are really cool for that
 :

 https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/CountMinSketch.scala#L590


 On 21 July 2015 at 13:40, Ted Malaska ted.mala...@cloudera.com wrote:

 I'm guessing you want something like what I put in this blog post.


 http://blog.cloudera.com/blog/2015/07/how-to-do-data-quality-checks-using-apache-spark-dataframes/

 This is a very common use case.  If there is a +1 I would love to add it
 to dataframes.

 Let me know
 Ted Malaska

 On Tue, Jul 21, 2015 at 7:24 AM, Olivier Girardot 
 o.girar...@lateral-thoughts.com wrote:

 Yop,
 actually the generic part does not work, the countByValue on one column
 gives you the count for each value seen in the column.
 I would like a generic (multi-column) countByValue to give me the same
 kind of output for each column, not considering each n-uples of each column
 value as the key (which is what the groupBy is doing by default).

 Regards,

 Olivier

 2015-07-20 14:18 GMT+02:00 Jonathan Winandy jonathan.wina...@gmail.com
 :

 Ahoy !

 Maybe you can get countByValue by using sql.GroupedData :

 // some DF
 val df: DataFrame = sqlContext.createDataFrame(
   sc.parallelize(List("A", "B", "B", "A")).map(Row.apply(_)),
   StructType(List(StructField("n", StringType))))

 df.groupBy("n").count().show()

 // generic
 def countByValueDf(df: DataFrame) = {
   val (h :: r) = df.columns.toList
   df.groupBy(h, r: _*).count()
 }

 countByValueDf(df).show()


 Cheers,
 Jon

 On 20 July 2015 at 11:28, Olivier Girardot 
 o.girar...@lateral-thoughts.com wrote:

 Hi,
 Is there any plan to add the countByValue function to Spark SQL
 Dataframe ?
 Even
 https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala#L78
 is using the RDD part right now, but for ML purposes, being able to get 
 the
 most frequent categorical value on multiple columns would be very useful.


 Regards,


 --
 *Olivier Girardot* | Associé
 o.girar...@lateral-thoughts.com
 +33 6 24 09 17 94





 --
 *Olivier Girardot* | Associé
 o.girar...@lateral-thoughts.com
 +33 6 24 09 17 94






-- 
*Olivier Girardot* | Associé
o.girar...@lateral-thoughts.com
+33 6 24 09 17 94