Suggestion: RDD cache depth
It would be nice if the RDD cache() method incorporated depth information. That is:

void test() {
    JavaRDD<...> rdd = ...;
    rdd.cache();    // to depth 1; actual caching happens
    rdd.cache();    // to depth 2; no-op as long as the storage level is the same, else an exception
    ...
    rdd.uncache();  // to depth 1; no-op
    rdd.uncache();  // to depth 0; actual unpersist happens
}

This can be useful when writing code in a modular way. When a function receives an RDD as an argument, it doesn't necessarily know the cache status of that RDD, but it may want to cache it, since it will use it multiple times. With the current RDD API, however, the function cannot determine whether it should unpersist the RDD afterwards or leave it alone (so that the caller can continue to use it without rebuilding). Thanks.
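(For illustration only: a minimal caller-side sketch, in Scala, of the kind of bookkeeping being proposed. CachedRef is hypothetical and not a Spark API; it simply counts acquire/release pairs, persisting on the first acquire and unpersisting on the last release.)

import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Hypothetical helper, not part of Spark: reference-counted caching.
// A function that doesn't know the caller's cache status can safely
// pair acquire()/release() around its own use of the RDD.
class CachedRef[T](rdd: RDD[T], level: StorageLevel = StorageLevel.MEMORY_ONLY) {
  private var refCount = 0
  def acquire(): RDD[T] = synchronized {
    if (refCount == 0) rdd.persist(level)   // first user triggers the actual caching
    refCount += 1
    rdd
  }
  def release(): Unit = synchronized {
    refCount -= 1
    if (refCount == 0) rdd.unpersist()      // last user triggers the actual unpersist
  }
}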
GraphX triplets on 5-node graph
Shouldn't I be seeing N2 and N4 in the output below? (Spark 0.9.0 REPL) Or am I missing something fundamental?

val nodes = sc.parallelize(Array((1L, "N1"), (2L, "N2"), (3L, "N3"), (4L, "N4"), (5L, "N5")))
val edges = sc.parallelize(Array(Edge(1L, 2L, "E1"), Edge(1L, 3L, "E2"), Edge(2L, 4L, "E3"), Edge(3L, 5L, "E4")))
Graph(nodes, edges).triplets.collect

res1: Array[org.apache.spark.graphx.EdgeTriplet[String,String]] = Array(((1,N1),(3,N3),E2), ((1,N1),(3,N3),E2), ((3,N3),(5,N5),E4), ((3,N3),(5,N5),E4))
Re: Suggestion: RDD cache depth
This is a pretty cool idea — instead of cache depth I’d call it something like reference counting. Would you mind opening a JIRA issue about it? The issue of really composing together libraries that use RDDs nicely isn’t fully explored, but this is certainly one thing that would help with it. I’d love to look at other ones too, e.g. how to allow libraries to share scans over the same dataset. Unfortunately using multiple cache() calls for this is probably not feasible because it would change the current meaning of multiple calls. But we can add a new API, or a parameter to the method. Matei On May 28, 2014, at 11:46 PM, innowireless TaeYun Kim taeyun@innowireless.co.kr wrote: It would be nice if the RDD cache() method incorporate a depth information. That is, void test() { JavaRDD. rdd = .; rdd.cache(); // to depth 1. actual caching happens. rdd.cache(); // to depth 2. Nop as long as the storage level is the same. Else, exception. . rdd.uncache(); // to depth 1. Nop. rdd.uncache(); // to depth 0. Actual unpersist happens. } This can be useful when writing code in modular way. When a function receives an rdd as an argument, it doesn't necessarily know the cache status of the rdd. But it could want to cache the rdd, since it will use the rdd multiple times. But with the current RDD API, it cannot determine whether it should unpersist it or leave it alone (so that caller can continue to use that rdd without rebuilding). Thanks.
Re: GraphX triplets on 5-node graph
Take a look at this one: https://issues.apache.org/jira/browse/SPARK-1188 It was an optimization that added user inconvenience. We got rid of that now in Spark 1.0. On Wed, May 28, 2014 at 11:48 PM, Michael Malak michaelma...@yahoo.comwrote: Shouldn't I be seeing N2 and N4 in the output below? (Spark 0.9.0 REPL) Or am I missing something fundamental? val nodes = sc.parallelize(Array((1L, N1), (2L, N2), (3L, N3), (4L, N4), (5L, N5))) val edges = sc.parallelize(Array(Edge(1L, 2L, E1), Edge(1L, 3L, E2), Edge(2L, 4L, E3), Edge(3L, 5L, E4))) Graph(nodes, edges).triplets.collect res1: Array[org.apache.spark.graphx.EdgeTriplet[String,String]] = Array(((1,N1),(3,N3),E2), ((1,N1),(3,N3),E2), ((3,N3),(5,N5),E4), ((3,N3),(5,N5),E4))
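(If you are stuck on 0.9 in the meantime, one possible workaround - assuming the inconvenience referred to is the triplet-object reuse described in SPARK-1188 - is to copy the fields you need into plain tuples before collecting, instead of collecting the EdgeTriplet objects themselves. A sketch, reusing the nodes/edges from the original post:)

import org.apache.spark.graphx._

// Copy each triplet's fields on the executors, then collect the copies.
val graph = Graph(nodes, edges)
val trips = graph.triplets
  .map(t => (t.srcId, t.srcAttr, t.dstId, t.dstAttr, t.attr))
  .collect()
trips.foreach(println)  // should now show all four edges with their endpoint attributes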
RE: Suggestion: RDD cache depth
Opened a JIRA issue. (https://issues.apache.org/jira/browse/SPARK-1962) Thanks. -Original Message- From: Matei Zaharia [mailto:matei.zaha...@gmail.com] Sent: Thursday, May 29, 2014 3:54 PM To: dev@spark.apache.org Subject: Re: Suggestion: RDD cache depth This is a pretty cool idea - instead of cache depth I'd call it something like reference counting. Would you mind opening a JIRA issue about it? The issue of really composing together libraries that use RDDs nicely isn't fully explored, but this is certainly one thing that would help with it. I'd love to look at other ones too, e.g. how to allow libraries to share scans over the same dataset. Unfortunately using multiple cache() calls for this is probably not feasible because it would change the current meaning of multiple calls. But we can add a new API, or a parameter to the method. Matei On May 28, 2014, at 11:46 PM, innowireless TaeYun Kim taeyun@innowireless.co.kr wrote: It would be nice if the RDD cache() method incorporate a depth information. That is, void test() { JavaRDD. rdd = .; rdd.cache(); // to depth 1. actual caching happens. rdd.cache(); // to depth 2. Nop as long as the storage level is the same. Else, exception. . rdd.uncache(); // to depth 1. Nop. rdd.uncache(); // to depth 0. Actual unpersist happens. } This can be useful when writing code in modular way. When a function receives an rdd as an argument, it doesn't necessarily know the cache status of the rdd. But it could want to cache the rdd, since it will use the rdd multiple times. But with the current RDD API, it cannot determine whether it should unpersist it or leave it alone (so that caller can continue to use that rdd without rebuilding). Thanks.
Re: LogisticRegression: Predicting continuous outcomes
Xiangrui, Christopher, Thanks for responding. I'll go through the code in detail to evaluate if the loss function used is suitable to our dataset. I'll also go through the referred paper since I was unaware of the underlying theory. Thanks again. -Bharath On Thu, May 29, 2014 at 8:16 AM, Christopher Nguyen c...@adatao.com wrote: Bharath, (apologies if you're already familiar with the theory): the proposed approach may or may not be appropriate depending on the overall transfer function in your data. In general, a single logistic regressor cannot approximate arbitrary non-linear functions (of linear combinations of the inputs). You can review works by, e.g., Hornik and Cybenko in the late 80's to see if you need something more, such as a simple, one-hidden-layer neural network. This is a good summary: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.101.2647&rep=rep1&type=pdf -- Christopher T. Nguyen Co-founder & CEO, Adatao http://adatao.com linkedin.com/in/ctnguyen On Wed, May 28, 2014 at 11:18 AM, Bharath Ravi Kumar reachb...@gmail.com wrote: I'm looking to reuse the LogisticRegression model (with SGD) to predict a real-valued outcome variable. (I understand that logistic regression is generally applied to predict binary outcomes, but for various reasons this model suits our needs better than LinearRegression.) Related to that, I have the following questions: 1) Can the current LogisticRegression model be used as is to train based on binary input (i.e. explanatory) features, or is there an assumption that the explanatory features must be continuous? 2) I intend to reuse the current class to train a model on LabeledPoints where the label is a real value (and not 0 / 1). I'd like to know if invoking setValidateData(false) would suffice or if one must override the validator to achieve this. 3) I recall seeing an experimental method on the class ( https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala ) that clears the threshold separating positive and negative predictions. Once the model is trained on real-valued labels, would clearing this flag suffice to predict an outcome that is continuous in nature? Thanks, Bharath P.S: I'm writing to dev@ and not user@ assuming that lib changes might be necessary. Apologies if the mailing list is incorrect.
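(Regarding question 3, a minimal sketch of clearing the threshold, assuming the Spark 1.0 MLlib API where LogisticRegressionModel.clearThreshold() is the experimental method in question, and running in the spark-shell so that sc exists:)

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

// Sketch only: with the threshold cleared, predict() returns the raw score
// in (0, 1) rather than a hard 0/1 class label.
val data = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(0.1, 1.2)),
  LabeledPoint(1.0, Vectors.dense(2.3, 0.4))))
val model = LogisticRegressionWithSGD.train(data, 100)  // 100 SGD iterations
model.clearThreshold()
val score = model.predict(Vectors.dense(1.0, 1.0))      // continuous value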
Please change instruction about Launching Applications Inside the Cluster
The instruction is at http://spark.apache.org/docs/0.9.0/spark-standalone.html#launching-applications-inside-the-cluster or http://spark.apache.org/docs/0.9.1/spark-standalone.html#launching-applications-inside-the-cluster

The original instruction is:

./bin/spark-class org.apache.spark.deploy.Client launch [client-options] \
   cluster-url application-jar-url main-class \
   [application-options]

If I follow this instruction, my program deployed on a Spark standalone cluster does not run properly. Based on the source code, this instruction should be changed to:

./bin/spark-class org.apache.spark.deploy.Client [client-options] launch \
   cluster-url application-jar-url main-class \
   [application-options]

That is to say, [client-options] must be placed ahead of launch.
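(For example, with the corrected argument order - the master URL, jar location, main class, and the --memory client option below are purely illustrative placeholders, not taken from the documentation:)

./bin/spark-class org.apache.spark.deploy.Client --memory 1g launch \
   spark://master-host:7077 hdfs://master-host:9000/jars/my-app.jar com.example.MyApp \
   --my-app-arg value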
Re: Standard preprocessing/scaling
I do see the issue for centering sparse data. Actually, the centering is less important than the scaling by the standard deviation. Not having unit variance causes the convergence issues and long runtimes. Will RowMatrix compute the variance of a column? -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/MLlib-Standard-preprocessing-scaling-tp6826p6849.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
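(A minimal sketch of what that could look like, assuming RowMatrix.computeColumnSummaryStatistics() from Spark 1.0 MLlib is the right entry point; written for the spark-shell, so sc is assumed:)

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Per-column summary statistics; the variance vector could feed a
// unit-variance scaler even when centering sparse data is impractical.
val rows = sc.parallelize(Seq(
  Vectors.dense(1.0, 10.0),
  Vectors.dense(2.0, 20.0),
  Vectors.dense(3.0, 30.0)))
val mat = new RowMatrix(rows)
val summary = mat.computeColumnSummaryStatistics()
println(summary.mean)      // column means
println(summary.variance)  // column variances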
Timestamp support in v1.0
Can anyone verify which rc [SPARK-1360] Add Timestamp Support for SQL #275 https://github.com/apache/spark/pull/275 is included in? I am running rc3, but receiving errors with TIMESTAMP as a datatype in my Hive tables when trying to use them in pyspark.

*The error I get:*
14/05/29 15:44:47 INFO ParseDriver: Parsing command: SELECT COUNT(*) FROM aol
14/05/29 15:44:48 INFO ParseDriver: Parse Completed
14/05/29 15:44:48 INFO metastore: Trying to connect to metastore with URI thrift:
14/05/29 15:44:48 INFO metastore: Waiting 1 seconds before next connection attempt.
14/05/29 15:44:49 INFO metastore: Connected to metastore.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/spark-1.0.0-rc3/python/pyspark/sql.py", line 189, in hql
    return self.hiveql(hqlQuery)
  File "/opt/spark-1.0.0-rc3/python/pyspark/sql.py", line 183, in hiveql
    return SchemaRDD(self._ssql_ctx.hiveql(hqlQuery), self)
  File "/opt/spark-1.0.0-rc3/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py", line 537, in __call__
  File "/opt/spark-1.0.0-rc3/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o14.hiveql.
: java.lang.RuntimeException: Unsupported dataType: timestamp

*The table I loaded:*
DROP TABLE IF EXISTS aol;
CREATE EXTERNAL TABLE aol (
  userid STRING,
  query STRING,
  query_time TIMESTAMP,
  item_rank INT,
  click_url STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/tmp/data/aol';

*The pyspark commands:*
from pyspark.sql import HiveContext
hctx = HiveContext(sc)
results = hctx.hql("SELECT COUNT(*) FROM aol").collect()

-- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Timestamp-support-in-v1-0-tp6850.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
Re: Timestamp support in v1.0
I can confirm that the commit is included in the 1.0.0 release candidates (it was committed before branch-1.0 split off from master), but I can't confirm that it works in PySpark. Generally the Python and Java interfaces lag a little behind the Scala interface to Spark, but we're working to keep that diff much smaller going forward. Can you try the same thing in Scala? On Thu, May 29, 2014 at 8:54 AM, dataginjaninja rickett.stepha...@gmail.com wrote: Can anyone verify which rc [SPARK-1360] Add Timestamp Support for SQL #275 https://github.com/apache/spark/pull/275 is included in? I am running rc3, but receiving errors with TIMESTAMP as a datatype in my Hive tables when trying to use them in pyspark. *The error I get: * 14/05/29 15:44:47 INFO ParseDriver: Parsing command: SELECT COUNT(*) FROM aol 14/05/29 15:44:48 INFO ParseDriver: Parse Completed 14/05/29 15:44:48 INFO metastore: Trying to connect to metastore with URI thrift: 14/05/29 15:44:48 INFO metastore: Waiting 1 seconds before next connection attempt. 14/05/29 15:44:49 INFO metastore: Connected to metastore. Traceback (most recent call last): File stdin, line 1, in module File /opt/spark-1.0.0-rc3/python/pyspark/sql.py, line 189, in hql return self.hiveql(hqlQuery) File /opt/spark-1.0.0-rc3/python/pyspark/sql.py, line 183, in hiveql return SchemaRDD(self._ssql_ctx.hiveql(hqlQuery), self) File /opt/spark-1.0.0-rc3/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py, line 537, in __call__ File /opt/spark-1.0.0-rc3/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py, line 300, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o14.hiveql. : java.lang.RuntimeException: Unsupported dataType: timestamp *The table I loaded:* DROP TABLE IF EXISTS aol; CREATE EXTERNAL TABLE aol ( userid STRING, query STRING, query_time TIMESTAMP, item_rank INT, click_url STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION '/tmp/data/aol'; *The pyspark commands:* from pyspark.sql import HiveContext hctx= HiveContext(sc) results = hctx.hql(SELECT COUNT(*) FROM aol).collect() -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Timestamp-support-in-v1-0-tp6850.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
Re: [VOTE] Release Apache Spark 1.0.0 (RC11)
+1 I spun up a few EC2 clusters and ran my normal audit checks. Tests passing, sigs, CHANGES and NOTICE look good Thanks TD for helping cut this RC! On Wed, May 28, 2014 at 9:38 PM, Kevin Markey kevin.mar...@oracle.com wrote: +1 Built -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 Ran current version of one of my applications on 1-node pseudocluster (sorry, unable to test on full cluster). yarn-cluster mode Ran regression tests. Thanks Kevin On 05/28/2014 09:55 PM, Krishna Sankar wrote: +1 Pulled built on MacOS X, EC2 Amazon Linux Ran test programs on OS X, 5 node c3.4xlarge cluster Cheers k/ On Wed, May 28, 2014 at 7:36 PM, Andy Konwinski andykonwin...@gmail.comwrote: +1 On May 28, 2014 7:05 PM, Xiangrui Meng men...@gmail.com wrote: +1 Tested apps with standalone client mode and yarn cluster and client modes. Xiangrui On Wed, May 28, 2014 at 1:07 PM, Sean McNamara sean.mcnam...@webtrends.com wrote: Pulled down, compiled, and tested examples on OS X and ubuntu. Deployed app we are building on spark and poured data through it. +1 Sean On May 26, 2014, at 8:39 AM, Tathagata Das tathagata.das1...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.0.0! This has a few important bug fixes on top of rc10: SPARK-1900 and SPARK-1918: https://github.com/apache/spark/pull/853 SPARK-1870: https://github.com/apache/spark/pull/848 SPARK-1897: https://github.com/apache/spark/pull/849 The tag to be voted on is v1.0.0-rc11 (commit c69d97cd): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=c69d97cdb42f809cb71113a1db4194c21372242a The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~tdas/spark-1.0.0-rc11/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/tdas.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1019/ The documentation corresponding to this release can be found at: http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/ Please vote on releasing this package as Apache Spark 1.0.0! The vote is open until Thursday, May 29, at 16:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.0.0 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ == API Changes == We welcome users to compile Spark applications against 1.0. There are a few API changes in this release. Here are links to the associated upgrade guides - user facing changes have been kept as small as possible. Changes to ML vector specification: http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/mllib-guide.html#from-09-to-10 Changes to the Java API: http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark Changes to the streaming API: http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x Changes to the GraphX API: http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091 Other changes: coGroup and related functions now return Iterable[T] instead of Seq[T] == Call toSeq on the result to restore the old behavior SparkContext.jarOfClass returns Option[String] instead of Seq[String] == Call toSeq on the result to restore old behavior
Re: Timestamp support in v1.0
Thanks for reporting this! https://issues.apache.org/jira/browse/SPARK-1964 https://github.com/apache/spark/pull/913 If you could test out that PR and see if it fixes your problems I'd really appreciate it! Michael On Thu, May 29, 2014 at 9:09 AM, Andrew Ash and...@andrewash.com wrote: I can confirm that the commit is included in the 1.0.0 release candidates (it was committed before branch-1.0 split off from master), but I can't confirm that it works in PySpark. Generally the Python and Java interfaces lag a little behind the Scala interface to Spark, but we're working to keep that diff much smaller going forward. Can you try the same thing in Scala? On Thu, May 29, 2014 at 8:54 AM, dataginjaninja rickett.stepha...@gmail.com wrote: Can anyone verify which rc [SPARK-1360] Add Timestamp Support for SQL #275 https://github.com/apache/spark/pull/275 is included in? I am running rc3, but receiving errors with TIMESTAMP as a datatype in my Hive tables when trying to use them in pyspark. *The error I get: * 14/05/29 15:44:47 INFO ParseDriver: Parsing command: SELECT COUNT(*) FROM aol 14/05/29 15:44:48 INFO ParseDriver: Parse Completed 14/05/29 15:44:48 INFO metastore: Trying to connect to metastore with URI thrift: 14/05/29 15:44:48 INFO metastore: Waiting 1 seconds before next connection attempt. 14/05/29 15:44:49 INFO metastore: Connected to metastore. Traceback (most recent call last): File stdin, line 1, in module File /opt/spark-1.0.0-rc3/python/pyspark/sql.py, line 189, in hql return self.hiveql(hqlQuery) File /opt/spark-1.0.0-rc3/python/pyspark/sql.py, line 183, in hiveql return SchemaRDD(self._ssql_ctx.hiveql(hqlQuery), self) File /opt/spark-1.0.0-rc3/python/lib/py4j-0.8.1-src.zip/py4j/java_gateway.py, line 537, in __call__ File /opt/spark-1.0.0-rc3/python/lib/py4j-0.8.1-src.zip/py4j/protocol.py, line 300, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o14.hiveql. : java.lang.RuntimeException: Unsupported dataType: timestamp *The table I loaded:* DROP TABLE IF EXISTS aol; CREATE EXTERNAL TABLE aol ( userid STRING, query STRING, query_time TIMESTAMP, item_rank INT, click_url STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION '/tmp/data/aol'; *The pyspark commands:* from pyspark.sql import HiveContext hctx= HiveContext(sc) results = hctx.hql(SELECT COUNT(*) FROM aol).collect() -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Timestamp-support-in-v1-0-tp6850.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
Re: Timestamp support in v1.0
Yes, I get the same error:

scala> val hc = new org.apache.spark.sql.hive.HiveContext(sc)
14/05/29 16:53:40 INFO deprecation: mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive
14/05/29 16:53:40 INFO deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize
14/05/29 16:53:40 INFO deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize
14/05/29 16:53:40 INFO deprecation: mapred.min.split.size.per.rack is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.rack
14/05/29 16:53:40 INFO deprecation: mapred.min.split.size.per.node is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.node
14/05/29 16:53:40 INFO deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
14/05/29 16:53:40 INFO deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative
14/05/29 16:53:41 WARN HiveConf: DEPRECATED: Configuration property hive.metastore.local no longer has any effect. Make sure to provide a valid value for hive.metastore.uris if you are connecting to a remote metastore.
14/05/29 16:53:42 WARN HiveConf: DEPRECATED: Configuration property hive.metastore.local no longer has any effect. Make sure to provide a valid value for hive.metastore.uris if you are connecting to a remote metastore.
hc: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@36482814

scala> val results = hc.hql("SELECT COUNT(*) FROM aol").collect()
14/05/29 16:53:46 INFO ParseDriver: Parsing command: SELECT COUNT(*) FROM aol
14/05/29 16:53:46 INFO ParseDriver: Parse Completed
14/05/29 16:53:47 INFO metastore: Trying to connect to metastore with URI th
14/05/29 16:53:47 INFO metastore: Waiting 1 seconds before next connection attempt.
14/05/29 16:53:48 INFO metastore: Connected to metastore.
java.lang.RuntimeException: Unsupported dataType: timestamp
  at scala.sys.package$.error(package.scala:27)

-- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Timestamp-support-in-v1-0-tp6850p6853.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
Re: Timestamp support in v1.0
Michael, Will I have to rebuild after adding the change? Thanks -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Timestamp-support-in-v1-0-tp6850p6855.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
Re: Timestamp support in v1.0
Darn, I was hoping just to sneak it in that file. I am not the only person working on the cluster; if I rebuild it, that means I have to redeploy everything to all the nodes as well. So I cannot do that ... today. If someone else doesn't beat me to it, I can rebuild at another time. - Cheers, Stephanie -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Timestamp-support-in-v1-0-tp6850p6857.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
Re: Timestamp support in v1.0
Yes, you'll need to download the code from that PR and reassemble Spark (sbt/sbt assembly). On Thu, May 29, 2014 at 10:02 AM, dataginjaninja rickett.stepha...@gmail.com wrote: Michael, Will I have to rebuild after adding the change? Thanks -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Timestamp-support-in-v1-0-tp6850p6855.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
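(One hedged way to do that, assuming your checkout's origin remote points at the apache/spark GitHub mirror; the local branch name is arbitrary:)

git fetch origin pull/913/head:SPARK-1964-timestamp
git checkout SPARK-1964-timestamp
sbt/sbt assembly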
Re: Timestamp support in v1.0
You should be able to get away with only doing it locally. This bug is happening during analysis which only occurs on the driver. On Thu, May 29, 2014 at 10:17 AM, dataginjaninja rickett.stepha...@gmail.com wrote: Darn, I was hoping just to sneak it in that file. I am not the only person working on the cluster; if I rebuild it that means I have to redeploy everything to all the nodes as well. So I cannot do that ... today. If someone else doesn't beat me to it. I can rebuild at another time. - Cheers, Stephanie -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Timestamp-support-in-v1-0-tp6850p6857.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
Re: [VOTE] Release Apache Spark 1.0.0 (rc5)
[tl;dr stable API's are important - sorry, this is slightly meandering]

Hey - just wanted to chime in on this as I was travelling. Sean, you bring up great points here about the velocity and stability of Spark. Many projects have fairly customized semantics around what versions actually mean (HBase is a good, if somewhat hard-to-comprehend, example). What the 1.X label means to Spark is that we are willing to guarantee stability for Spark's core API. This is something that, actually, Spark has been doing for a while already (we've made few or no breaking changes to the Spark core API for several years) and we want to codify this for application developers. In this regard Spark has made a bunch of changes to enforce the integrity of our API's:

- We went through and clearly annotated internal or experimental API's. This was a huge project-wide effort and included Scaladoc and several other components to make it clear to users.
- We implemented automated byte-code verification that proposed pull requests don't break public API's. Pull requests after 1.0 will fail if they break API's that are not explicitly declared private or experimental.

I can't possibly emphasize enough the importance of API stability. What we want to avoid is the Hadoop approach. Candidly, Hadoop does a poor job on this. There really isn't a well-defined stable API for any of the Hadoop components, for a few reasons:

1. Hadoop projects don't do any rigorous checking that new patches don't break API's. Of course, this results in regular API breaks and a poor understanding of what is a public API.
2. In several cases it's not possible to do basic things in Hadoop without using deprecated or private API's.
3. There is significant vendor fragmentation of API's. The main focus of the Hadoop vendors is making consistent cuts of the core projects work together (HDFS/Pig/Hive/etc) - so API breaks are sometimes considered fixed as long as the other projects work around them (see [1]). We also regularly need to do archaeology (see [2]) and directly interact with Hadoop committers to understand which API's are stable and in which versions.

One goal of Spark is to deal with the pain of inter-operating with Hadoop so that application writers don't have to. We'd like to retain the property that if you build an application against the (well-defined, stable) Spark API's right now, you'll be able to run it across many Hadoop vendors and versions for the entire Spark 1.X release cycle. Writing apps against Hadoop can be very difficult... consider how much more engineering effort we spent maintaining YARN support than Mesos support. There are many factors, but one is that Mesos has a single, narrow, stable API. We've never had to make a change in Mesos due to an API change, for several years. YARN, on the other hand, has at least 3 API's that currently exist, which are all binary incompatible. We'd like to offer apps the ability to build against Spark's API and just let us deal with it.

As more vendors package Spark, I'd like to see us put tools in the upstream Spark repo that do validation for vendor packages of Spark, so that we don't end up with fragmentation. Of course, vendors can enhance the API and are encouraged to, but we need a kernel of API's that vendors must maintain (think POSIX) to be considered compliant with Apache Spark. I believe some other projects like OpenStack have done this to avoid fragmentation.
- Patrick

[1] https://issues.apache.org/jira/browse/MAPREDUCE-5830
[2] http://2.bp.blogspot.com/-GO6HF0OAFHw/UOfNEH-4sEI/AD0/dEWFFYTRgYw/s1600/output-file.png

On Sun, May 18, 2014 at 2:13 AM, Mridul Muralidharan mri...@gmail.com wrote: So I think I need to clarify a few things here - particularly since this mail went to the wrong mailing list and a much wider audience than I intended it for :-) Most of the issues I mentioned are internal implementation details of spark core: which means we can enhance them in future without disruption to our userbase (ability to support large numbers of input/output partitions. Note: this is of the order of 100k input and output partitions with uniform spread of keys - very rarely seen outside of some crazy jobs). Some of the issues I mentioned would require DeveloperApi changes - which are not user exposed: they would impact developer use of these api's - which are mostly internally provided by spark. (Like fixing blocks > 2G would require a change to the Serializer api.) A smaller fraction might require interface changes - note, I am referring specifically to configuration changes (removing/deprecating some) and possibly newer options to submit/env, etc - I don't envision any programming api change itself. The only api change we did was from Seq -> Iterable - which was actually to address some of the issues I mentioned (join/cogroup). Remaining are bugs which need to be addressed, or the feature removed/enhanced, like shuffle consolidation. There might be
[RESULT][VOTE] Release Apache Spark 1.0.0 (RC11)
Hello everyone, The vote on Spark 1.0.0 RC11 passes with 13 +1 votes, one 0 vote and no -1 vote. Thanks to everyone who tested the RC and voted. Here are the totals: +1: (13 votes) Matei Zaharia* Mark Hamstra* Holden Karau Nick Pentreath* Will Benton Henry Saputra Sean McNamara* Xiangrui Meng* Andy Konwinski* Krishna Sankar Kevin Markey Patrick Wendell* Tathagata Das* 0: (1 vote) Ankur Dave* -1: (0 vote) Please hold off announcing Spark 1.0.0 until the Apache Software Foundation makes the press release tomorrow. Thank you very much for your cooperation. TD
Re: [VOTE] Release Apache Spark 1.0.0 (RC11)
Let me put in my +1 as well! This voting is now closed, and it successfully passes with 13 +1 votes and one 0 vote. Thanks to everyone who tested the RC and voted. Here are the totals: +1: (13 votes) Matei Zaharia* Mark Hamstra* Holden Karau Nick Pentreath* Will Benton Henry Saputra Sean McNamara* Xiangrui Meng* Andy Konwinski* Krishna Sankar Kevin Markey Patrick Wendell* Tathagata Das* 0: (1 vote) Ankur Dave* -1: (0 vote) * = binding Please hold off announcing Spark 1.0.0 until Apache Software Foundation makes the press release tomorrow. Thank you very much for your cooperation. TD On Thu, May 29, 2014 at 9:14 AM, Patrick Wendell pwend...@gmail.com wrote: +1 I spun up a few EC2 clusters and ran my normal audit checks. Tests passing, sigs, CHANGES and NOTICE look good Thanks TD for helping cut this RC! On Wed, May 28, 2014 at 9:38 PM, Kevin Markey kevin.mar...@oracle.com wrote: +1 Built -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 Ran current version of one of my applications on 1-node pseudocluster (sorry, unable to test on full cluster). yarn-cluster mode Ran regression tests. Thanks Kevin On 05/28/2014 09:55 PM, Krishna Sankar wrote: +1 Pulled built on MacOS X, EC2 Amazon Linux Ran test programs on OS X, 5 node c3.4xlarge cluster Cheers k/ On Wed, May 28, 2014 at 7:36 PM, Andy Konwinski andykonwin...@gmail.comwrote: +1 On May 28, 2014 7:05 PM, Xiangrui Meng men...@gmail.com wrote: +1 Tested apps with standalone client mode and yarn cluster and client modes. Xiangrui On Wed, May 28, 2014 at 1:07 PM, Sean McNamara sean.mcnam...@webtrends.com wrote: Pulled down, compiled, and tested examples on OS X and ubuntu. Deployed app we are building on spark and poured data through it. +1 Sean On May 26, 2014, at 8:39 AM, Tathagata Das tathagata.das1...@gmail.com wrote: Please vote on releasing the following candidate as Apache Spark version 1.0.0! This has a few important bug fixes on top of rc10: SPARK-1900 and SPARK-1918: https://github.com/apache/spark/pull/853 SPARK-1870: https://github.com/apache/spark/pull/848 SPARK-1897: https://github.com/apache/spark/pull/849 The tag to be voted on is v1.0.0-rc11 (commit c69d97cd): https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=c69d97cdb42f809cb71113a1db4194c21372242a The release files, including signatures, digests, etc. can be found at: http://people.apache.org/~tdas/spark-1.0.0-rc11/ Release artifacts are signed with the following key: https://people.apache.org/keys/committer/tdas.asc The staging repository for this release can be found at: https://repository.apache.org/content/repositories/orgapachespark-1019/ The documentation corresponding to this release can be found at: http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/ Please vote on releasing this package as Apache Spark 1.0.0! The vote is open until Thursday, May 29, at 16:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast. [ ] +1 Release this package as Apache Spark 1.0.0 [ ] -1 Do not release this package because ... To learn more about Apache Spark, please see http://spark.apache.org/ == API Changes == We welcome users to compile Spark applications against 1.0. There are a few API changes in this release. Here are links to the associated upgrade guides - user facing changes have been kept as small as possible. 
Changes to ML vector specification: http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/mllib-guide.html#from-09-to-10 Changes to the Java API: http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark Changes to the streaming API: http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/streaming-programming-guide.html#migration-guide-from-091-or-below-to-1x Changes to the GraphX API: http://people.apache.org/~tdas/spark-1.0.0-rc11-docs/graphx-programming-guide.html#upgrade-guide-from-spark-091 Other changes: coGroup and related functions now return Iterable[T] instead of Seq[T] == Call toSeq on the result to restore the old behavior SparkContext.jarOfClass returns Option[String] instead of Seq[String] == Call toSeq on the result to restore old behavior
Re: [RESULT][VOTE] Release Apache Spark 1.0.0 (RC11)
Yup, congrats all. The most impressive thing is the number of contributors to this release — with over 100 contributors, it’s becoming hard to even write the credits. Look forward to the Apache press release tomorrow. Matei On May 29, 2014, at 1:33 PM, Patrick Wendell pwend...@gmail.com wrote: Congrats everyone! This is a huge accomplishment! On Thu, May 29, 2014 at 1:24 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Hello everyone, The vote on Spark 1.0.0 RC11 passes with13 +1 votes, one 0 vote and no -1 vote. Thanks to everyone who tested the RC and voted. Here are the totals: +1: (13 votes) Matei Zaharia* Mark Hamstra* Holden Karau Nick Pentreath* Will Benton Henry Saputra Sean McNamara* Xiangrui Meng* Andy Konwinski* Krishna Sankar Kevin Markey Patrick Wendell* Tathagata Das* 0: (1 vote) Ankur Dave* -1: (0 vote) Please hold off announcing Spark 1.0.0 until Apache Software Foundation makes the press release tomorrow. Thank you very much for your cooperation. TD
Re: [RESULT][VOTE] Release Apache Spark 1.0.0 (RC11)
Yes great work all. Special thanks to Patrick (and TD) for excellent leadership! On May 29, 2014 5:39 PM, Usman Ghani us...@platfora.com wrote: Congrats everyone. Really pumped about this. On Thu, May 29, 2014 at 2:57 PM, Henry Saputra henry.sapu...@gmail.com wrote: Congrats guys! Another milestone for Apache Spark indeed =) - Henry On Thu, May 29, 2014 at 2:08 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Yup, congrats all. The most impressive thing is the number of contributors to this release — with over 100 contributors, it’s becoming hard to even write the credits. Look forward to the Apache press release tomorrow. Matei On May 29, 2014, at 1:33 PM, Patrick Wendell pwend...@gmail.com wrote: Congrats everyone! This is a huge accomplishment! On Thu, May 29, 2014 at 1:24 PM, Tathagata Das tathagata.das1...@gmail.com wrote: Hello everyone, The vote on Spark 1.0.0 RC11 passes with13 +1 votes, one 0 vote and no -1 vote. Thanks to everyone who tested the RC and voted. Here are the totals: +1: (13 votes) Matei Zaharia* Mark Hamstra* Holden Karau Nick Pentreath* Will Benton Henry Saputra Sean McNamara* Xiangrui Meng* Andy Konwinski* Krishna Sankar Kevin Markey Patrick Wendell* Tathagata Das* 0: (1 vote) Ankur Dave* -1: (0 vote) Please hold off announcing Spark 1.0.0 until Apache Software Foundation makes the press release tomorrow. Thank you very much for your cooperation. TD
How does Spark partition data when creating a table with "create table xxx as select * from xxx"
Hi spark developers, I am using Shark/Spark, and I am puzzled by the following questions; I could not find any info on the web, so I am asking you.

1. How does Spark partition data in memory when creating a table with create table a tblproperties("shark.cache"="memory") as select * from table b? In other words, how many RDDs will be created, and how does Spark decide the number of RDDs?
2. How does Spark partition data on Tachyon when creating a table with create table a tblproperties("shark.cache"="tachyon") as select * from table b? In other words, how many files will be created, and how does Spark decide the number of files? I found this Tachyon setting, tachyon.user.default.block.size.byte - what does it mean? Could I set it to control the size of each file?

Thanks for any guidance.