Re: Should spark-ec2 get its own repo?
There is technically no PMC for the spark-ec2 project (I guess we are kind of establishing one right now). I haven't heard anything from the Spark PMC on the dev list that might suggest a need for a vote so far. I will send another round of email notification to the dev list when we have a JIRA / PR that actually moves the scripts (right now the only thing that has changed is the location of some scripts from mesos/ to amplab/). Thanks, Shivaram

On Mon, Jul 20, 2015 at 12:55 PM, Mridul Muralidharan mri...@gmail.com wrote:
Might be a good idea to get the PMCs of both projects to sign off, to prevent future issues with Apache. Regards, Mridul

On Mon, Jul 20, 2015 at 12:01 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote:
I've created https://github.com/amplab/spark-ec2 and added an initial set of committers. Note that this is not a fork of the existing github.com/mesos/spark-ec2, and users will need to fork from here. This is mostly to avoid the base fork in pull requests being set incorrectly, etc. I'll be migrating some PRs / closing them in the old repo and will also update the README in that repo. Thanks, Shivaram

On Fri, Jul 17, 2015 at 3:00 PM, Sean Owen so...@cloudera.com wrote:
On Fri, Jul 17, 2015 at 6:58 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote:
> I am not sure why the ASF JIRA can only be used to track one set of artifacts that are packaged and released together. I agree that marking a fix version as 1.5 for a change in another repo doesn't make a lot of sense, but we could just not use fix versions for the EC2 issues?
*shrug* It just seems harder and less natural to use the ASF JIRA. What's the benefit? I agree it's not a big deal either way, but it's a small part of the problem we're solving in the first place. I suspect that one way or the other, there would be issues filed in both places, so this probably isn't worth debating.
> My concerns are less about it being pushed out, etc. For better or worse, the EC2 scripts have been a part of the Spark distribution from a very early stage (from version 0.5.0, if my reading of the git history is correct), so users will assume that any error with the EC2 scripts belongs to the Spark project. In addition, almost all the contributions to the EC2 scripts come from Spark developers, so keeping the issues in the same mailing list / JIRA seems natural. This, I guess, again relates to the question of managing issues for code that isn't part of the Spark release artifact.
Yeah, good question -- GitHub doesn't give you a mailing list. I think dev@ would still be where it's discussed, which is ... again 'part of the problem', but as you say, probably beneficial. It's a pretty low-traffic topic anyway.
> I'll create the amplab/spark-ec2 repo over the next couple of days unless there are more comments on this thread. This will at least alleviate some of the naming confusion over using a repository in mesos, and I'll give Sean, Nick, Matthew commit access to it. I am still not convinced about moving the issues over, though.
I won't move the issues. Maybe time tells whether one approach is better, or that it just doesn't matter. However, it'd be a great opportunity to review and clear stale EC2 issues.
Re: countByValue on dataframe with multiple columns
100%, I would love to do it. Who would be a good person to review the design with? All I need is a quick chat about the design and approach, and I'll create the JIRA and push a patch. Ted Malaska

On Tue, Jul 21, 2015 at 10:19 AM, Olivier Girardot o.girar...@lateral-thoughts.com wrote:
Hi Ted, the TopNList would be great to see directly in the DataFrame API, and my wish would be to be able to apply it to multiple columns at the same time and get all these statistics. The .describe() function is close to what we want to achieve; maybe we could try to enrich its output. Anyway, even as a spark-package, if you could package your code for DataFrames, that would be great. Regards, Olivier.

2015-07-21 15:08 GMT+02:00 Jonathan Winandy jonathan.wina...@gmail.com:
Ha, OK! Then the generic part would have this signature:

def countColsByValue(df: DataFrame): Map[String /* colname */, DataFrame]

+1 for more work (blog / API) on data quality checks. TopCMSParams and some other monoids from Algebird are really cool for that: https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/CountMinSketch.scala#L590 Cheers, Jonathan

On 21 July 2015 at 13:40, Ted Malaska ted.mala...@cloudera.com wrote:
I'm guessing you want something like what I put in this blog post: http://blog.cloudera.com/blog/2015/07/how-to-do-data-quality-checks-using-apache-spark-dataframes/ This is a very common use case. If there is a +1, I would love to add it to DataFrames. Let me know. Ted Malaska

On Tue, Jul 21, 2015 at 7:24 AM, Olivier Girardot o.girar...@lateral-thoughts.com wrote:
Yop, actually the generic part does not work: countByValue on one column gives you the count for each value seen in that column. I would like a generic (multi-column) countByValue to give me the same kind of output for each column separately, not treating each n-tuple of column values as the key (which is what the groupBy does by default). Regards, Olivier

2015-07-20 14:18 GMT+02:00 Jonathan Winandy jonathan.wina...@gmail.com:
Ahoy! Maybe you can get countByValue by using sql.GroupedData:

// some DF
val df: DataFrame = sqlContext.createDataFrame(
  sc.parallelize(List("A", "B", "B", "A")).map(Row.apply(_)),
  StructType(List(StructField("n", StringType))))
df.groupBy("n").count().show()

// generic
def countByValueDf(df: DataFrame) = {
  val (h :: r) = df.columns.toList
  df.groupBy(h, r: _*).count()
}
countByValueDf(df).show()

Cheers, Jon

On 20 July 2015 at 11:28, Olivier Girardot o.girar...@lateral-thoughts.com wrote:
Hi, is there any plan to add the countByValue function to Spark SQL DataFrames? Even https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala#L78 is using the RDD API right now, but for ML purposes, being able to get the most frequent categorical value on multiple columns would be very useful. Regards,
-- *Olivier Girardot* | Associé o.girar...@lateral-thoughts.com +33 6 24 09 17 94
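In sketch form, the countColsByValue signature Jonathan proposes could be implemented with one aggregation per column, so each column's values are counted independently rather than as tuples. A minimal sketch against the 1.4-era DataFrame API, not a tested implementation:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.desc

// One groupBy/count per column: each resulting DataFrame has two columns,
// the original column and "count", independent of the other columns.
def countColsByValue(df: DataFrame): Map[String /* colname */, DataFrame] =
  df.columns.map { c =>
    c -> df.groupBy(c).count().orderBy(desc("count"))
  }.toMap

Ordering by count descending means that taking the first N rows of each entry gives the per-column TopNList discussed above.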
Re: countByValue on dataframe with multiple columns
I added the following JIRA: https://issues.apache.org/jira/browse/SPARK-9237 Please help me get it assigned to me, thanks. Ted Malaska

On Tue, Jul 21, 2015 at 7:53 PM, Ted Malaska ted.mala...@cloudera.com wrote:
Cool, I will make a JIRA after I check in to my hotel, and try to get a patch early next week.
Re: Should spark-ec2 get its own repo?
If I am not wrong, since the code was hosted within the mesos project repo, I assume (at least part of it) is owned by the mesos project and so its PMC? - Mridul

On Tue, Jul 21, 2015 at 9:22 AM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote:
There is technically no PMC for the spark-ec2 project (I guess we are kind of establishing one right now). I haven't heard anything from the Spark PMC on the dev list that might suggest a need for a vote so far. I will send another round of email notification to the dev list when we have a JIRA / PR that actually moves the scripts (right now the only thing that has changed is the location of some scripts from mesos/ to amplab/). Thanks, Shivaram
Re: Make off-heap store pluggable
2015-07-20 21:32 GMT-07:00 Prashant Sharma scrapco...@gmail.com:
> +1 Looks like a nice idea (I do not see any harm). Would you like to work on the patch to support it? Prashant Sharma

Yes, I would like to contribute to it once we clarify the appropriate path. --Alexey

On Tue, Jul 21, 2015 at 2:46 AM, Alexey Goncharuk alexey.goncha...@gmail.com wrote:
Hello Spark community, I was looking through the code in order to better understand how an RDD is persisted to the Tachyon off-heap filesystem. It looks like the Tachyon filesystem is hard-coded and there is no way to switch to another in-memory filesystem. I think it would be great if the BlockManager and BlockStore implementations were able to plug in another filesystem. For example, Apache Ignite also has an implementation of an in-memory filesystem, which can store data in on-heap and off-heap formats. It would be great if it could integrate with Spark. I have filed a ticket in JIRA: https://issues.apache.org/jira/browse/SPARK-9203 If it makes sense, I will be happy to contribute to it. Thoughts? -Alexey (Apache Ignite PMC)
Re: Should spark-ec2 get its own repo?
That's part of the confusion we are trying to fix here -- the repository used to live in the mesos GitHub account but was never a part of the Apache Mesos project. It is a remnant of Spark from when Spark itself used to live at github.com/mesos/spark. Shivaram

On Tue, Jul 21, 2015 at 11:03 AM, Mridul Muralidharan mri...@gmail.com wrote:
If I am not wrong, since the code was hosted within the mesos project repo, I assume (at least part of it) is owned by the mesos project and so its PMC? - Mridul
Re: Make off-heap store pluggable
2015-07-20 23:29 GMT-07:00 Matei Zaharia matei.zaha...@gmail.com:
> I agree with this -- basically, to build on Reynold's point, you should be able to get almost the same performance by implementing either the Hadoop FileSystem API or the Spark Data Source API over Ignite in the right way. This would let people save data persistently in Ignite in addition to using it for caching, and it would provide a global namespace, optionally a schema, etc. You can still provide data locality, short-circuit reads, etc., with these APIs.

Absolutely agree. In fact, Ignite already provides a shared RDD implementation, which is essentially a view of Ignite cache data; it adheres to the Spark DataFrame API. More information can be found here: http://ignite.incubator.apache.org/features/igniterdd.html Also, the Ignite in-memory filesystem is compliant with the Hadoop filesystem API and can transparently replace HDFS if needed, so plugging it into Spark should be fairly easy. More information can be found here: http://ignite.incubator.apache.org/features/igfs.html --Alexey
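Because IGFS speaks the Hadoop FileSystem API, the Spark-side integration would, in sketch form, just be a filesystem URI. The igfs:// scheme and host layout below are assumptions based on the Ignite docs rather than verified usage, and the Ignite Hadoop accelerator jar would need to be on the classpath:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("igfs-sketch"))

// Spark sees only a Hadoop filesystem URI, so reads and writes go through
// Ignite's in-memory filesystem with no Spark-side code changes.
val lines = sc.textFile("igfs://igfs@127.0.0.1/input/data.txt")
lines.saveAsTextFile("igfs://igfs@127.0.0.1/output")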
Re: Make off-heap store pluggable
Hi Alexey, SPARK-6479 (https://issues.apache.org/jira/browse/SPARK-6479) is for the plugin API, and SPARK-6112 (https://issues.apache.org/jira/browse/SPARK-6112) is for the HDFS plugin. Thanks. Zhan Zhang
Re: countByValue on dataframe with multiple columns
Is this just frequent items? https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L97

On Tue, Jul 21, 2015 at 7:39 AM, Ted Malaska ted.mala...@cloudera.com wrote:
100%, I would love to do it. Who would be a good person to review the design with? All I need is a quick chat about the design and approach, and I'll create the JIRA and push a patch. Ted Malaska
Re: Make off-heap store pluggable
2015-07-20 21:40 GMT-07:00 Reynold Xin r...@databricks.com:
> I sent it prematurely. They are already pluggable, or at least in the process of becoming more pluggable. In 1.4, instead of calling the external system's API directly, we added an API for that. There is a patch to add support for the HDFS in-memory cache.

Is there a ticket or branch you can show me, so I can take a look at the ongoing work? --Alexey
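To make the shape of such a plugin API concrete, it would presumably look something like the trait below. This is a purely hypothetical sketch for illustration, not the actual interface that landed in 1.4:

import java.nio.ByteBuffer

// Hypothetical: the BlockManager would program against this trait instead
// of calling Tachyon's client directly, so a Tachyon-, Ignite-, or
// HDFS-backed implementation could be selected by configuration.
trait OffHeapBlockStore {
  def init(executorId: String): Unit
  def put(blockId: String, data: ByteBuffer): Unit
  def get(blockId: String): Option[ByteBuffer]
  def remove(blockId: String): Boolean
  def shutdown(): Unit
}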
-Phive-thriftserver when compiling for use in pyspark and JDBC connections
I compile/make a distribution, with either the 1.4 branch or master, using the -Phive-thriftserver profile, and attempt a JDBC connection to a MySQL DB using the latest connector (5.1.36) jar. I set up the pyspark shell with: bin/pyspark --jars mysql-connection...jar --driver-class-path mysql-connector..jar When I make a data frame from sqlContext.read.jdbc(jdbc://...) and then do df.show(), things seem to work; but if I compile without -Phive-thriftserver, the same Python, with the same settings (spark-env, spark-defaults), just hangs... never to return. I am curious how the hive-thriftserver module plays into this type of interaction. Thanks in advance. Cheers, Aaron
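For reference, the DataFrameReader call in question looks like this in Scala (the thread uses PySpark, but it delegates to the same JVM code path). The URL, table, and credentials are placeholders, and the connector jar has to be visible to both the driver (--driver-class-path) and the executors (--jars):

import java.util.Properties

// Placeholder connection details; sqlContext is the one the shell provides.
val props = new Properties()
props.setProperty("user", "spark")
props.setProperty("password", "secret")

val df = sqlContext.read.jdbc("jdbc:mysql://localhost:3306/testdb", "some_table", props)
df.show()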
What is the difference between SlowSparkPullRequestBuilder and SparkPullRequestBuilder?
Hi all, when we send a PR, it seems that two requests to run tests are sometimes sent to Jenkins. What is the difference between SparkPullRequestBuilder and SlowSparkPullRequestBuilder? Thanks, Yu -- Yu Ishikawa
Re: Should spark-ec2 get its own repo?
That sounds good. Thanks for clarifying! Regards, Mridul

On Tue, Jul 21, 2015 at 11:09 AM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote:
That's part of the confusion we are trying to fix here -- the repository used to live in the mesos GitHub account but was never a part of the Apache Mesos project. It is a remnant of Spark from when Spark itself used to live at github.com/mesos/spark. Shivaram
Re: countByValue on dataframe with multiple columns
Look at the implementation for frequent items -- it is different from a true count.

On Jul 21, 2015 1:19 PM, Reynold Xin r...@databricks.com wrote:
Is this just frequent items? https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L97
Re: countByValue on dataframe with multiple columns
Yes, and freqItems does not give you an ordered count (right?); plus the threshold makes it difficult to calibrate, and we noticed some strange behaviour when testing it on small datasets.

2015-07-21 20:30 GMT+02:00 Ted Malaska ted.mala...@cloudera.com:
Look at the implementation for frequent items -- it is different from a true count.
-- *Olivier Girardot* | Associé o.girar...@lateral-thoughts.com +33 6 24 09 17 94
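To make the difference concrete: freqItems is a single-pass approximate sketch gated by a support threshold, while an exact ordered count requires a shuffle. A quick sketch assuming the 1.4 DataFrameStatFunctions API and reusing the single-column df from Jonathan's earlier example:

import org.apache.spark.sql.functions.desc

// Approximate: unordered, only items above ~30% support, may include
// false positives; cheap because it avoids a shuffle.
val approx = df.stat.freqItems(Seq("n"), 0.3)

// Exact: true counts that can be ordered, at the cost of a groupBy shuffle.
val exact = df.groupBy("n").count().orderBy(desc("count"))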
Re: countByValue on dataframe with multiple columns
Cool, I will make a JIRA after I check in to my hotel, and try to get a patch early next week.

On Jul 21, 2015 5:15 PM, Olivier Girardot o.girar...@lateral-thoughts.com wrote:
Yes, and freqItems does not give you an ordered count (right?); plus the threshold makes it difficult to calibrate, and we noticed some strange behaviour when testing it on small datasets.
Re: Make off-heap store pluggable
I agree with this -- basically, to build on Reynold's point, you should be able to get almost the same performance by implementing either the Hadoop FileSystem API or the Spark Data Source API over Ignite in the right way. This would let people save data persistently in Ignite in addition to using it for caching, and it would provide a global namespace, optionally a schema, etc. You can still provide data locality, short-circuit reads, etc., with these APIs. Matei

On Jul 20, 2015, at 9:40 PM, Reynold Xin r...@databricks.com wrote:
I sent it prematurely. They are already pluggable, or at least in the process of becoming more pluggable. In 1.4, instead of calling the external system's API directly, we added an API for that. There is a patch to add support for the HDFS in-memory cache. Somewhat orthogonal to this: longer term, I am not sure whether it makes sense to have the current off-heap API, because there is no namespacing, the benefit to end users is actually not very substantial (at least I can think of much simpler ways to achieve exactly the same gains), and yet it introduces quite a bit of complexity to the codebase.
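In sketch form, taking the Data Source API route would mean implementing something like the following on the Ignite side. The traits are from org.apache.spark.sql.sources, but the Ignite-specific parts are stubbed placeholders rather than real Ignite calls:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, RelationProvider, TableScan}
import org.apache.spark.sql.types.StructType

// Stub relation: a real implementation would derive the schema from the
// Ignite cache and build an RDD whose partitions line up with Ignite's
// data distribution (which is where data locality would come from).
class IgniteRelation(cacheName: String, override val sqlContext: SQLContext)
    extends BaseRelation with TableScan {
  override def schema: StructType = StructType(Nil)
  override def buildScan(): RDD[Row] = sqlContext.sparkContext.emptyRDD[Row]
}

class DefaultSource extends RelationProvider {
  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation =
    new IgniteRelation(parameters("cache"), sqlContext)
}

With that in place, sqlContext.read.format(...).option("cache", "myCache").load() would yield a DataFrame backed by Ignite.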
Re: Make off-heap store pluggable
(Related, not important comment: it would also be nice to separate out the Tachyon dependency from core, as it's conceptually pluggable but is still hard-coded in several places in the code, and in a lot of the comments/docs.)

On Tue, Jul 21, 2015 at 5:40 AM, Reynold Xin r...@databricks.com wrote:
I sent it prematurely. They are already pluggable, or at least in the process of becoming more pluggable. In 1.4, instead of calling the external system's API directly, we added an API for that. There is a patch to add support for the HDFS in-memory cache.
Re: countByValue on dataframe with multiple columns
I'm guessing you want something like what I put in this blog post: http://blog.cloudera.com/blog/2015/07/how-to-do-data-quality-checks-using-apache-spark-dataframes/ This is a very common use case. If there is a +1, I would love to add it to DataFrames. Let me know. Ted Malaska

On Tue, Jul 21, 2015 at 7:24 AM, Olivier Girardot o.girar...@lateral-thoughts.com wrote:
Yop, actually the generic part does not work: countByValue on one column gives you the count for each value seen in that column. I would like a generic (multi-column) countByValue to give me the same kind of output for each column separately, not treating each n-tuple of column values as the key (which is what the groupBy does by default). Regards, Olivier
Re: countByValue on dataframe with multiple columns
Yop, actually the generic part does not work: countByValue on one column gives you the count for each value seen in that column. I would like a generic (multi-column) countByValue to give me the same kind of output for each column separately, not treating each n-tuple of column values as the key (which is what the groupBy does by default). Regards, Olivier

2015-07-20 14:18 GMT+02:00 Jonathan Winandy jonathan.wina...@gmail.com:
Ahoy! Maybe you can get countByValue by using sql.GroupedData:

// some DF
val df: DataFrame = sqlContext.createDataFrame(
  sc.parallelize(List("A", "B", "B", "A")).map(Row.apply(_)),
  StructType(List(StructField("n", StringType))))
df.groupBy("n").count().show()

// generic
def countByValueDf(df: DataFrame) = {
  val (h :: r) = df.columns.toList
  df.groupBy(h, r: _*).count()
}
countByValueDf(df).show()

Cheers, Jon
Re: countByValue on dataframe with multiple columns
Ha, OK! Then the generic part would have this signature:

def countColsByValue(df: DataFrame): Map[String /* colname */, DataFrame]

+1 for more work (blog / API) on data quality checks. TopCMSParams and some other monoids from Algebird are really cool for that: https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/CountMinSketch.scala#L590 Cheers, Jonathan

On 21 July 2015 at 13:40, Ted Malaska ted.mala...@cloudera.com wrote:
I'm guessing you want something like what I put in this blog post: http://blog.cloudera.com/blog/2015/07/how-to-do-data-quality-checks-using-apache-spark-dataframes/ This is a very common use case. If there is a +1, I would love to add it to DataFrames. Let me know. Ted Malaska
Re: countByValue on dataframe with multiple columns
Hi Ted, the TopNList would be great to see directly in the DataFrame API, and my wish would be to be able to apply it to multiple columns at the same time and get all these statistics. The .describe() function is close to what we want to achieve; maybe we could try to enrich its output. Anyway, even as a spark-package, if you could package your code for DataFrames, that would be great. Regards, Olivier.

2015-07-21 15:08 GMT+02:00 Jonathan Winandy jonathan.wina...@gmail.com:
Ha, OK! Then the generic part would have this signature:

def countColsByValue(df: DataFrame): Map[String /* colname */, DataFrame]

+1 for more work (blog / API) on data quality checks. Cheers, Jonathan

-- *Olivier Girardot* | Associé o.girar...@lateral-thoughts.com +33 6 24 09 17 94