Re: Spark 1.0.0 rc3
I ran SPARK_HADOOP_VERSION=2.3.0 sbt/sbt assembly and copied the generated jar to the lib/ directory of my application, but it seems that sbt cannot find the dependencies in the jar. Everything works, though, with the pre-built jar files downloaded from the link provided by Patrick.

Best,

--
Nan Zhu

On Thursday, May 1, 2014 at 11:16 PM, Madhu wrote:

I'm guessing EC2 support is not there yet? I was able to build using the binary download on both Windows 7 and RHEL 6 without issues. I tried to create an EC2 cluster, but saw this:

  ~/spark-ec2 Initializing spark
  ~ ~/spark-ec2 ERROR: Unknown Spark version
  Initializing shark
  ~ ~/spark-ec2 ~/spark-ec2 ERROR: Unknown Shark version

The spark dir on the EC2 master has only a conf dir, so it didn't deploy properly.
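An alternative to copying the assembly jar into lib/ is to resolve Spark as a managed dependency from the staging repository given in Patrick's announcement. The following build.sbt fragment is only a sketch: the staged version string (1.0.0) and Scala binary version (2.10) are assumptions, and the hadoop-client pin just mirrors the SPARK_HADOOP_VERSION=2.3.0 assembly above.

  // build.sbt (sketch): resolve the RC from the Apache staging repository
  resolvers += "Apache Spark RC staging" at
    "https://repository.apache.org/content/repositories/orgapachespark-1012/"

  libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-core" % "1.0.0",      // assumed staged version
    "org.apache.hadoop" % "hadoop-client" % "2.3.0"    // match the assembly's Hadoop version
  )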
Re: Spark 1.0.0 rc3
Hi,

I tried to build the 1.0.0 rc3 version with Java 8 and got this error:

  java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: GC overhead limit exceeded

I am building on a Core i7 (quad core) Windows laptop with 8 GB RAM. Earlier I had tried to build Spark 0.9.1 with Java 8 and had gotten an error about the Comparator class not being found, which was mentioned today on another thread, so I am not getting that error now. I have successfully built Spark 0.9.0 with Java 1.7.

Thanks,
Manu

On Tue, Apr 29, 2014 at 10:43 PM, Patrick Wendell pwend...@gmail.com wrote:

That suggestion got lost along the way, and IIRC the patch didn't have that. It's a good idea though, if nothing else to provide a simple means for backwards compatibility. I created a JIRA for this. It's very straightforward, so maybe someone can pick it up quickly:

https://issues.apache.org/jira/browse/SPARK-1677

On Tue, Apr 29, 2014 at 2:20 PM, Dean Wampler deanwamp...@gmail.com wrote:

Thanks. I'm fine with the logic change, although I was a bit surprised to see Hadoop used for file I/O.

Anyway, the JIRA issue and pull request discussions mention a flag to enable overwrites. That would be very convenient for a tutorial I'm writing, although I wouldn't recommend it for normal use, of course. However, I can't figure out if this actually exists. I found the spark.files.overwrite property, but that doesn't apply. Does this override flag, method call, or method argument actually exist?

Thanks,
Dean

On Tue, Apr 29, 2014 at 1:54 PM, Patrick Wendell pwend...@gmail.com wrote:

Hi Dean,

We have always used the Hadoop libraries here to read and write local files. In Spark 1.0 we started enforcing the rule that you can't overwrite an existing directory, because it can cause confusing/undefined behavior if multiple jobs output to the same directory (they partially clobber each other's output).

https://issues.apache.org/jira/browse/SPARK-1100
https://github.com/apache/spark/pull/11

In the JIRA I actually proposed slightly deviating from Hadoop semantics and allowing the directory to exist if it is empty, but I think in the end we decided to just go with the exact same semantics as Hadoop (i.e. even empty directories are a problem).

- Patrick

On Tue, Apr 29, 2014 at 9:43 AM, Dean Wampler deanwamp...@gmail.com wrote:

I'm observing one anomalous behavior. With the 1.0.0 libraries, it's using HDFS classes for file I/O, while the same script compiled and running with 0.9.1 uses only the local-mode file I/O. The script is a variation of the Word Count script. Here are the guts:

  object WordCount2 {
    def main(args: Array[String]) = {
      val sc = new SparkContext("local", "Word Count (2)")
      val input = sc.textFile(".../some/local/file").map(line => line.toLowerCase)
      input.cache
      val wc2 = input
        .flatMap(line => line.split("\\W+"))
        .map(word => (word, 1))
        .reduceByKey((count1, count2) => count1 + count2)
      wc2.saveAsTextFile("output/some/directory")
      sc.stop()
    }
  }

It works fine compiled and executed with 0.9.1. If I recompile and run with 1.0.0-RC1, where the same output directory still exists, I get this familiar Hadoop-ish exception:

  [error] (run-main-0) org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory file:/Users/deanwampler/projects/typesafe/activator/activator-spark/output/kjv-wc already exists
  org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory file:/Users/deanwampler/projects/typesafe/activator/activator-spark/output/kjv-wc already exists
    at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:121)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:749)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:662)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:581)
    at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1057)
    at spark.activator.WordCount2$.main(WordCount2.scala:42)
    at spark.activator.WordCount2.main(WordCount2.scala)
    ...

Thoughts?

On Tue, Apr 29, 2014 at 3:05 AM, Patrick Wendell pwend...@gmail.com wrote:

Hey All,

This is not an official vote, but I wanted to cut an RC so that people can test against the Maven artifacts, test building with their configuration, etc. We are still chasing down a few issues and updating docs, etc. If you have issues or bug reports for this release, please send an e-mail to the Spark dev list and/or file a JIRA.

Commit: d636772 (v1.0.0-rc3)
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=d636772ea9f98e449a038567b7975b1a07de3221

Binaries: http://people.apache.org/~pwendell/spark-1.0.0-rc3
Re: Spark 1.0.0 rc3
I'm guessing EC2 support is not there yet? I was able to build using the binary download on both Windows 7 and RHEL 6 without issues. I tried to create an EC2 cluster, but saw this:

  ~/spark-ec2 Initializing spark
  ~ ~/spark-ec2 ERROR: Unknown Spark version
  Initializing shark
  ~ ~/spark-ec2 ~/spark-ec2 ERROR: Unknown Shark version

The spark dir on the EC2 master has only a conf dir, so it didn't deploy properly.
Spark 1.0.0 rc3
Hey All,

This is not an official vote, but I wanted to cut an RC so that people can test against the Maven artifacts, test building with their configuration, etc. We are still chasing down a few issues and updating docs, etc. If you have issues or bug reports for this release, please send an e-mail to the Spark dev list and/or file a JIRA.

Commit: d636772 (v1.0.0-rc3)
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=d636772ea9f98e449a038567b7975b1a07de3221

Binaries: http://people.apache.org/~pwendell/spark-1.0.0-rc3/
Docs: http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/
Repository: https://repository.apache.org/content/repositories/orgapachespark-1012/

== API Changes ==

If you want to test building against Spark there are some minor API changes. We'll get these written up for the final release, but I'm noting a few here (not comprehensive):

- changes to ML vector specification:
  http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/mllib-guide.html#from-09-to-10
- changes to the Java API:
  http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark
- coGroup and related functions now return Iterable[T] instead of Seq[T]
  ==> Call toSeq on the result to restore the old behavior
- SparkContext.jarOfClass returns Option[String] instead of Seq[String]
  ==> Call toSeq on the result to restore the old behavior
- Streaming classes have been renamed: NetworkReceiver -> Receiver
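To make the coGroup and jarOfClass changes above concrete, here is a small Scala sketch of 0.9-style code adapted to 1.0. The RDD names and the class passed to jarOfClass are made up for illustration; the point is the toSeq calls the announcement recommends.

  import org.apache.spark.SparkContext
  import org.apache.spark.SparkContext._

  object ApiMigrationSketch {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext("local", "api-migration-sketch")

      val a = sc.parallelize(Seq((1, "x"), (2, "y")))
      val b = sc.parallelize(Seq((1, "u"), (3, "v")))

      // In 1.0, cogroup values are Iterable[T] rather than Seq[T];
      // call toSeq where downstream code still expects a Seq.
      val grouped: Array[(Int, (Seq[String], Seq[String]))] =
        a.cogroup(b).map { case (k, (xs, ys)) => (k, (xs.toSeq, ys.toSeq)) }.collect()

      // In 1.0, SparkContext.jarOfClass returns Option[String] instead of Seq[String];
      // toSeq restores the old shape.
      val jars: Seq[String] = SparkContext.jarOfClass(getClass).toSeq

      grouped.foreach(println)
      sc.stop()
    }
  }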
Re: Spark 1.0.0 rc3
Hi Patrick,

What are the expectations / guarantees on binary compatibility between 0.9 and 1.0? You mention some API changes, which kinda hint that binary compatibility has already been broken, but I just wanted to point out there are other cases. E.g.:

  Exception in thread "main" java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:236)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:47)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
  Caused by: java.lang.NoSuchMethodError: org.apache.spark.SparkContext$.rddToOrderedRDDFunctions(Lorg/apache/spark/rdd/RDD;Lscala/Function1;Lscala/reflect/ClassTag;Lscala/reflect/ClassTag;)Lorg/apache/spark/rdd/OrderedRDDFunctions;

(Compiled against 0.9, run against 1.0.) Offending code:

  val top10 = counts.sortByKey(false).take(10)

Recompiling fixes the problem.

On Tue, Apr 29, 2014 at 1:05 AM, Patrick Wendell pwend...@gmail.com wrote:

Hey All,

This is not an official vote, but I wanted to cut an RC so that people can test against the Maven artifacts, test building with their configuration, etc. We are still chasing down a few issues and updating docs, etc. If you have issues or bug reports for this release, please send an e-mail to the Spark dev list and/or file a JIRA.

Commit: d636772 (v1.0.0-rc3)
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=d636772ea9f98e449a038567b7975b1a07de3221

Binaries: http://people.apache.org/~pwendell/spark-1.0.0-rc3/
Docs: http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/
Repository: https://repository.apache.org/content/repositories/orgapachespark-1012/

== API Changes ==

If you want to test building against Spark there are some minor API changes. We'll get these written up for the final release, but I'm noting a few here (not comprehensive):

- changes to ML vector specification:
  http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/mllib-guide.html#from-09-to-10
- changes to the Java API:
  http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark
- coGroup and related functions now return Iterable[T] instead of Seq[T]
  ==> Call toSeq on the result to restore the old behavior
- SparkContext.jarOfClass returns Option[String] instead of Seq[String]
  ==> Call toSeq on the result to restore the old behavior
- Streaming classes have been renamed: NetworkReceiver -> Receiver

--
Marcelo
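The NoSuchMethodError above is the classic symptom of running bytecode compiled against one Spark version on another version where an implicit conversion's signature changed; as Marcelo notes, recompiling against the runtime version resolves it. A minimal illustrative sbt change, assuming the application declares Spark as a managed dependency:

  // build.sbt (sketch): compile against the same Spark version you run against
  // before:  libraryDependencies += "org.apache.spark" %% "spark-core" % "0.9.1"
  libraryDependencies += "org.apache.spark" %% "spark-core" % "1.0.0"

  // The offending call itself needs no source change; it just has to be rebuilt
  // so the new rddToOrderedRDDFunctions signature is linked:
  //   val top10 = counts.sortByKey(false).take(10)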
Re: Spark 1.0.0 rc3
What are the expectations / guarantees on binary compatibility between 0.9 and 1.0?

There are no guarantees.
Re: Spark 1.0.0 rc3
Hi Dean,

We have always used the Hadoop libraries here to read and write local files. In Spark 1.0 we started enforcing the rule that you can't overwrite an existing directory, because it can cause confusing/undefined behavior if multiple jobs output to the same directory (they partially clobber each other's output).

https://issues.apache.org/jira/browse/SPARK-1100
https://github.com/apache/spark/pull/11

In the JIRA I actually proposed slightly deviating from Hadoop semantics and allowing the directory to exist if it is empty, but I think in the end we decided to just go with the exact same semantics as Hadoop (i.e. even empty directories are a problem).

- Patrick

On Tue, Apr 29, 2014 at 9:43 AM, Dean Wampler deanwamp...@gmail.com wrote:

I'm observing one anomalous behavior. With the 1.0.0 libraries, it's using HDFS classes for file I/O, while the same script compiled and running with 0.9.1 uses only the local-mode file I/O. The script is a variation of the Word Count script. Here are the guts:

  object WordCount2 {
    def main(args: Array[String]) = {
      val sc = new SparkContext("local", "Word Count (2)")
      val input = sc.textFile(".../some/local/file").map(line => line.toLowerCase)
      input.cache
      val wc2 = input
        .flatMap(line => line.split("\\W+"))
        .map(word => (word, 1))
        .reduceByKey((count1, count2) => count1 + count2)
      wc2.saveAsTextFile("output/some/directory")
      sc.stop()
    }
  }

It works fine compiled and executed with 0.9.1. If I recompile and run with 1.0.0-RC1, where the same output directory still exists, I get this familiar Hadoop-ish exception:

  [error] (run-main-0) org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory file:/Users/deanwampler/projects/typesafe/activator/activator-spark/output/kjv-wc already exists
  org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory file:/Users/deanwampler/projects/typesafe/activator/activator-spark/output/kjv-wc already exists
    at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:121)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:749)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:662)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:581)
    at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1057)
    at spark.activator.WordCount2$.main(WordCount2.scala:42)
    at spark.activator.WordCount2.main(WordCount2.scala)
    ...

Thoughts?

On Tue, Apr 29, 2014 at 3:05 AM, Patrick Wendell pwend...@gmail.com wrote:

Hey All,

This is not an official vote, but I wanted to cut an RC so that people can test against the Maven artifacts, test building with their configuration, etc. We are still chasing down a few issues and updating docs, etc. If you have issues or bug reports for this release, please send an e-mail to the Spark dev list and/or file a JIRA.

Commit: d636772 (v1.0.0-rc3)
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=d636772ea9f98e449a038567b7975b1a07de3221

Binaries: http://people.apache.org/~pwendell/spark-1.0.0-rc3/
Docs: http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/
Repository: https://repository.apache.org/content/repositories/orgapachespark-1012/

== API Changes ==

If you want to test building against Spark there are some minor API changes. We'll get these written up for the final release, but I'm noting a few here (not comprehensive):

- changes to ML vector specification:
  http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/mllib-guide.html#from-09-to-10
- changes to the Java API:
  http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark
- coGroup and related functions now return Iterable[T] instead of Seq[T]
  ==> Call toSeq on the result to restore the old behavior
- SparkContext.jarOfClass returns Option[String] instead of Seq[String]
  ==> Call toSeq on the result to restore the old behavior
- Streaming classes have been renamed: NetworkReceiver -> Receiver

--
Dean Wampler, Ph.D.
Typesafe
@deanwampler
http://typesafe.com
http://polyglotprogramming.com
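Since 1.0 refuses to write into an existing output directory, a common workaround for tutorials and local runs is to delete the old output with the Hadoop FileSystem API before saving. The following is only a sketch of that workaround, not a Spark feature; the path handling assumes a local file: URI like the one in Dean's example.

  import org.apache.hadoop.fs.{FileSystem, Path}
  import org.apache.spark.SparkContext

  // Delete a previous run's output so saveAsTextFile can recreate it.
  def deleteIfExists(sc: SparkContext, dir: String): Unit = {
    val path = new Path(dir)
    val fs = path.getFileSystem(sc.hadoopConfiguration)
    if (fs.exists(path)) {
      fs.delete(path, true)  // recursive delete
    }
  }

  // Usage, following the WordCount2 example above:
  //   deleteIfExists(sc, "output/some/directory")
  //   wc2.saveAsTextFile("output/some/directory")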
Re: Spark 1.0.0 rc3
Thanks. I'm fine with the logic change, although I was a bit surprised to see Hadoop used for file I/O.

Anyway, the JIRA issue and pull request discussions mention a flag to enable overwrites. That would be very convenient for a tutorial I'm writing, although I wouldn't recommend it for normal use, of course. However, I can't figure out if this actually exists. I found the spark.files.overwrite property, but that doesn't apply. Does this override flag, method call, or method argument actually exist?

Thanks,
Dean

On Tue, Apr 29, 2014 at 1:54 PM, Patrick Wendell pwend...@gmail.com wrote:

Hi Dean,

We have always used the Hadoop libraries here to read and write local files. In Spark 1.0 we started enforcing the rule that you can't overwrite an existing directory, because it can cause confusing/undefined behavior if multiple jobs output to the same directory (they partially clobber each other's output).

https://issues.apache.org/jira/browse/SPARK-1100
https://github.com/apache/spark/pull/11

In the JIRA I actually proposed slightly deviating from Hadoop semantics and allowing the directory to exist if it is empty, but I think in the end we decided to just go with the exact same semantics as Hadoop (i.e. even empty directories are a problem).

- Patrick

On Tue, Apr 29, 2014 at 9:43 AM, Dean Wampler deanwamp...@gmail.com wrote:

I'm observing one anomalous behavior. With the 1.0.0 libraries, it's using HDFS classes for file I/O, while the same script compiled and running with 0.9.1 uses only the local-mode file I/O. The script is a variation of the Word Count script. Here are the guts:

  object WordCount2 {
    def main(args: Array[String]) = {
      val sc = new SparkContext("local", "Word Count (2)")
      val input = sc.textFile(".../some/local/file").map(line => line.toLowerCase)
      input.cache
      val wc2 = input
        .flatMap(line => line.split("\\W+"))
        .map(word => (word, 1))
        .reduceByKey((count1, count2) => count1 + count2)
      wc2.saveAsTextFile("output/some/directory")
      sc.stop()
    }
  }

It works fine compiled and executed with 0.9.1. If I recompile and run with 1.0.0-RC1, where the same output directory still exists, I get this familiar Hadoop-ish exception:

  [error] (run-main-0) org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory file:/Users/deanwampler/projects/typesafe/activator/activator-spark/output/kjv-wc already exists
  org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory file:/Users/deanwampler/projects/typesafe/activator/activator-spark/output/kjv-wc already exists
    at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:121)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:749)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:662)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:581)
    at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1057)
    at spark.activator.WordCount2$.main(WordCount2.scala:42)
    at spark.activator.WordCount2.main(WordCount2.scala)
    ...

Thoughts?

On Tue, Apr 29, 2014 at 3:05 AM, Patrick Wendell pwend...@gmail.com wrote:

Hey All,

This is not an official vote, but I wanted to cut an RC so that people can test against the Maven artifacts, test building with their configuration, etc. We are still chasing down a few issues and updating docs, etc. If you have issues or bug reports for this release, please send an e-mail to the Spark dev list and/or file a JIRA.

Commit: d636772 (v1.0.0-rc3)
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=d636772ea9f98e449a038567b7975b1a07de3221

Binaries: http://people.apache.org/~pwendell/spark-1.0.0-rc3/
Docs: http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/
Repository: https://repository.apache.org/content/repositories/orgapachespark-1012/

== API Changes ==

If you want to test building against Spark there are some minor API changes. We'll get these written up for the final release, but I'm noting a few here (not comprehensive):

- changes to ML vector specification:
  http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/mllib-guide.html#from-09-to-10
- changes to the Java API:
  http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark
- coGroup and related functions now return Iterable[T] instead of Seq[T]
  ==> Call toSeq on the result to restore the old behavior
- SparkContext.jarOfClass returns Option[String] instead of Seq[String]
  ==> Call toSeq on the result to restore the old behavior
- Streaming classes have been renamed: NetworkReceiver -> Receiver

--
Dean Wampler, Ph.D.
Typesafe
@deanwampler
http://typesafe.com
http://polyglotprogramming.com

--
Dean Wampler, Ph.D.
Typesafe
@deanwampler
Re: Spark 1.0.0 rc3
That suggestion got lost along the way, and IIRC the patch didn't have that. It's a good idea though, if nothing else to provide a simple means for backwards compatibility. I created a JIRA for this. It's very straightforward, so maybe someone can pick it up quickly:

https://issues.apache.org/jira/browse/SPARK-1677

On Tue, Apr 29, 2014 at 2:20 PM, Dean Wampler deanwamp...@gmail.com wrote:

Thanks. I'm fine with the logic change, although I was a bit surprised to see Hadoop used for file I/O.

Anyway, the JIRA issue and pull request discussions mention a flag to enable overwrites. That would be very convenient for a tutorial I'm writing, although I wouldn't recommend it for normal use, of course. However, I can't figure out if this actually exists. I found the spark.files.overwrite property, but that doesn't apply. Does this override flag, method call, or method argument actually exist?

Thanks,
Dean

On Tue, Apr 29, 2014 at 1:54 PM, Patrick Wendell pwend...@gmail.com wrote:

Hi Dean,

We have always used the Hadoop libraries here to read and write local files. In Spark 1.0 we started enforcing the rule that you can't overwrite an existing directory, because it can cause confusing/undefined behavior if multiple jobs output to the same directory (they partially clobber each other's output).

https://issues.apache.org/jira/browse/SPARK-1100
https://github.com/apache/spark/pull/11

In the JIRA I actually proposed slightly deviating from Hadoop semantics and allowing the directory to exist if it is empty, but I think in the end we decided to just go with the exact same semantics as Hadoop (i.e. even empty directories are a problem).

- Patrick

On Tue, Apr 29, 2014 at 9:43 AM, Dean Wampler deanwamp...@gmail.com wrote:

I'm observing one anomalous behavior. With the 1.0.0 libraries, it's using HDFS classes for file I/O, while the same script compiled and running with 0.9.1 uses only the local-mode file I/O. The script is a variation of the Word Count script. Here are the guts:

  object WordCount2 {
    def main(args: Array[String]) = {
      val sc = new SparkContext("local", "Word Count (2)")
      val input = sc.textFile(".../some/local/file").map(line => line.toLowerCase)
      input.cache
      val wc2 = input
        .flatMap(line => line.split("\\W+"))
        .map(word => (word, 1))
        .reduceByKey((count1, count2) => count1 + count2)
      wc2.saveAsTextFile("output/some/directory")
      sc.stop()
    }
  }

It works fine compiled and executed with 0.9.1. If I recompile and run with 1.0.0-RC1, where the same output directory still exists, I get this familiar Hadoop-ish exception:

  [error] (run-main-0) org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory file:/Users/deanwampler/projects/typesafe/activator/activator-spark/output/kjv-wc already exists
  org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory file:/Users/deanwampler/projects/typesafe/activator/activator-spark/output/kjv-wc already exists
    at org.apache.hadoop.mapred.FileOutputFormat.checkOutputSpecs(FileOutputFormat.java:121)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopDataset(PairRDDFunctions.scala:749)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:662)
    at org.apache.spark.rdd.PairRDDFunctions.saveAsHadoopFile(PairRDDFunctions.scala:581)
    at org.apache.spark.rdd.RDD.saveAsTextFile(RDD.scala:1057)
    at spark.activator.WordCount2$.main(WordCount2.scala:42)
    at spark.activator.WordCount2.main(WordCount2.scala)
    ...

Thoughts?

On Tue, Apr 29, 2014 at 3:05 AM, Patrick Wendell pwend...@gmail.com wrote:

Hey All,

This is not an official vote, but I wanted to cut an RC so that people can test against the Maven artifacts, test building with their configuration, etc. We are still chasing down a few issues and updating docs, etc. If you have issues or bug reports for this release, please send an e-mail to the Spark dev list and/or file a JIRA.

Commit: d636772 (v1.0.0-rc3)
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=d636772ea9f98e449a038567b7975b1a07de3221

Binaries: http://people.apache.org/~pwendell/spark-1.0.0-rc3/
Docs: http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/
Repository: https://repository.apache.org/content/repositories/orgapachespark-1012/

== API Changes ==

If you want to test building against Spark there are some minor API changes. We'll get these written up for the final release, but I'm noting a few here (not comprehensive):

- changes to ML vector specification:
  http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/mllib-guide.html#from-09-to-10
- changes to the Java API:
  http://people.apache.org/~pwendell/spark-1.0.0-rc3-docs/java-programming-guide.html#upgrading-from-pre-10-versions-of-spark
- coGroup and related functions now return Iterable[T] instead of Seq[T]
  ==> Call toSeq