Additional fix for Avro IncompatibleClassChangeError (SPARK-3039)
SPARK-3039 "Spark assembly for new hadoop API (hadoop 2) contains avro-mapred for hadoop 1 API" was marked resolved with the Spark 1.2.0 release. However, when I download the pre-built Spark distro for Hadoop 2.4 and later (spark-1.2.0-bin-hadoop2.4.tgz) and run it against Avro code compiled against Hadoop 2.4/the new Hadoop API, I still get:

java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
  at org.apache.avro.mapreduce.AvroRecordReaderBase.initialize(AvroRecordReaderBase.java:87)
  at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:135)

TaskAttemptContext was a class in the Hadoop 1.x series but became an interface in Hadoop 2.x. That is why there are both avro-mapred-1.7.6.jar and avro-mapred-1.7.6-hadoop2.jar; for Hadoop 2.x, avro-mapred-1.7.6-hadoop2.jar is needed. So it seemed that spark-assembly-1.2.0-hadoop2.4.0.jar still did not contain org.apache.avro.mapreduce.AvroRecordReaderBase from avro-mapred-1.7.6-hadoop2.jar.

I then downloaded the source code and compiled with:

  mvn -Pyarn -Phadoop-2.4 -Phive-0.13.1 -DskipTests clean package

The hadoop-2.4 profile sets:

  <avro.mapred.classifier>hadoop2</avro.mapred.classifier>

which through dependency management should pull in the right hadoop2 version:

  <dependency>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro-mapred</artifactId>
    <version>${avro.version}</version>
    <classifier>${avro.mapred.classifier}</classifier>
    <exclusions>
      ...
    </exclusions>
  </dependency>

However, I hit the same IncompatibleClassChangeError after replacing the assembly jar. I had cleaned my local ~/.m2/repository before the build and found that for avro-mapred both 1.7.5 (no classifier, i.e. hadoop1) and 1.7.6 (hadoop2) had been downloaded. That seemed a likely culprit. After installing the created jar files into my local repo (I had to hand-copy poms/jars for the repl/yarn subprojects) and then running:

  mvn -Pyarn -Phadoop-2.4 -Phive-0.13.1 -DskipTests dependency:tree -Dincludes=org.apache.avro:avro-mapred

the output for "Building Spark Project Hive 1.2.0" showed:

  [INFO] --- maven-dependency-plugin:2.4:tree (default-cli) @ spark-hive_2.10 ---
  [INFO] org.apache.spark:spark-hive_2.10:jar:1.2.0
  [INFO] +- org.spark-project.hive:hive-exec:jar:0.13.1a:compile
  [INFO] |  \- org.apache.avro:avro-mapred:jar:1.7.5:compile
  [INFO] \- org.apache.avro:avro-mapred:jar:hadoop2:1.7.6:compile

i.e. hive-exec brought in avro-mapred-1.7.5.jar (hadoop1).

Fix for Spark 1.2.x, in spark-1.2.0/sql/hive/pom.xml:

  <dependency>
    <groupId>org.spark-project.hive</groupId>
    <artifactId>hive-exec</artifactId>
    <version>${hive.version}</version>
    <exclusions>
      <exclusion>
        <groupId>commons-logging</groupId>
        <artifactId>commons-logging</artifactId>
      </exclusion>
      <exclusion>
        <groupId>com.esotericsoftware.kryo</groupId>
        <artifactId>kryo</artifactId>
      </exclusion>
      <exclusion>
        <groupId>org.apache.avro</groupId>
        <artifactId>avro-mapred</artifactId>
      </exclusion>
    </exclusions>
  </dependency>

Just add the last exclusion for avro-mapred (comparison at https://github.com/medale/spark/compare/apache:v1.2.1-rc2...medale:avro-hadoop2-v1.2.1-rc2). I was able to build and run against that fix with Avro code.

Fix for current master: https://github.com/apache/spark/pull/4315

Any feedback much appreciated,

Markus
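For completeness, a minimal Scala sketch of the read path that triggers this error; the app name and input path are made up for illustration, and it assumes the hadoop2 avro-mapred classes are the ones actually on the classpath:

  import org.apache.avro.generic.GenericRecord
  import org.apache.avro.mapred.AvroKey
  import org.apache.avro.mapreduce.AvroKeyInputFormat
  import org.apache.hadoop.io.NullWritable
  import org.apache.spark.{SparkConf, SparkContext}

  object AvroReadCheck {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("avro-read-check"))
      // Reading Avro through the new (mapreduce) API goes via NewHadoopRDD, which is
      // exactly where the IncompatibleClassChangeError above surfaces when the hadoop1
      // avro-mapred classes end up in the assembly.
      val records = sc.newAPIHadoopFile(
        "hdfs:///tmp/sample.avro", // hypothetical input path
        classOf[AvroKeyInputFormat[GenericRecord]],
        classOf[AvroKey[GenericRecord]],
        classOf[NullWritable])
      println(records.count())
      sc.stop()
    }
  }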
Re: Get size of rdd in memory
It's already fixed in the master branch. Sorry that we forgot to update this before releasing 1.2.0 and caused you trouble...

Cheng

On 2/2/15 2:03 PM, ankits wrote:

Great, thank you very much. I was confused because this is in the docs: https://spark.apache.org/docs/1.2.0/sql-programming-guide.html, and on the branch-1.2 branch, https://github.com/apache/spark/blob/branch-1.2/docs/sql-programming-guide.md: "Note that if you call schemaRDD.cache() rather than sqlContext.cacheTable(...), tables will not be cached using the in-memory columnar format, and therefore sqlContext.cacheTable(...) is strongly recommended for this use case." If this is no longer accurate, I could make a PR to remove it.
Re: Get size of rdd in memory
Actually SchemaRDD.cache() behaves exactly the same as cacheTable since Spark 1.2.0. The reason why your web UI didn't show you the cached table is that both cacheTable and sql("SELECT ...") are lazy :-) Simply add a .collect() after the sql(...) call.

Cheng

On 2/2/15 12:23 PM, ankits wrote:

Thanks for your response. So AFAICT calling parallelize(1 to 1024).map(i => KV(i, i.toString)).toSchemaRDD.cache().count() will allow me to see the size of the SchemaRDD in memory, and parallelize(1 to 1024).map(i => KV(i, i.toString)).cache().count() will show me the size of a regular RDD. But this will not show us the size when using cacheTable(), right? Like if I call

  parallelize(1 to 1024).map(i => KV(i, i.toString)).toSchemaRDD.registerTempTable("test")
  sqc.cacheTable("test")
  sqc.sql("SELECT COUNT(*) FROM test")

the web UI does not show us the size of the cached table.
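Putting the suggestion together with the snippet above, a hedged spark-shell sketch for 1.2.x could look like this (KV and the table name "test" are purely illustrative, and sqc stands for a SQLContext built from the shell's SparkContext):

  val sqc = new org.apache.spark.sql.SQLContext(sc)
  import sqc.createSchemaRDD                       // implicit RDD[Product] -> SchemaRDD conversion
  case class KV(i: Int, s: String)
  val rdd = sc.parallelize(1 to 1024).map(i => KV(i, i.toString))
  rdd.registerTempTable("test")                    // registered via the implicit SchemaRDD conversion
  sqc.cacheTable("test")                           // lazy: nothing is materialized yet
  sqc.sql("SELECT COUNT(*) FROM test").collect()   // forces the scan; the cached table now shows up in the web UI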
Re: Get size of rdd in memory
Great, thank you very much. I was confused because this is in the docs: https://spark.apache.org/docs/1.2.0/sql-programming-guide.html, and on the branch-1.2 branch, https://github.com/apache/spark/blob/branch-1.2/docs/sql-programming-guide.md: "Note that if you call schemaRDD.cache() rather than sqlContext.cacheTable(...), tables will not be cached using the in-memory columnar format, and therefore sqlContext.cacheTable(...) is strongly recommended for this use case." If this is no longer accurate, I could make a PR to remove it.
Performance test for sort shuffle
Is there a recommended performance test for sort-based shuffle? Something similar to terasort on Hadoop. I couldn't find one in the spark-perf code base. https://github.com/databricks/spark-perf -- Kannan
Re: Spark Master Maven with YARN build is broken
It's my fault, I'm sending a hot fix now.

On Mon, Feb 2, 2015 at 1:44 PM, Nicholas Chammas nicholas.cham...@gmail.com wrote:

https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.4,label=centos/

Is this a known issue? It seems to have been broken since last night. Here's a snippet from the build output of one of the builds https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/HADOOP_PROFILE=hadoop-2.4,label=centos/1308/console :

[error] bad symbolic reference. A signature in WebUI.class refers to term eclipse
[error] in package org which is not available.
[error] It may be completely missing from the current classpath, or the version on
[error] the classpath might be incompatible with the version used when compiling WebUI.class.
[error] bad symbolic reference. A signature in WebUI.class refers to term jetty
[error] in value org.eclipse which is not available.
[error] It may be completely missing from the current classpath, or the version on
[error] the classpath might be incompatible with the version used when compiling WebUI.class.
[error]
[error] while compiling: /home/jenkins/workspace/Spark-Master-Maven-with-YARN/HADOOP_PROFILE/hadoop-2.4/label/centos/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala
[error] during phase: erasure
[error] library version: version 2.10.4
[error] compiler version: version 2.10.4

Nick
[spark-sql] JsonRDD
Hey Spark developers,

Is there a good reason for JsonRDD being a Scala object as opposed to a class? Most other RDDs seem to be classes, and can be extended. The reason I'm asking is that there is a problem with Hive interoperability with JSON DataFrames, where jsonFile generates a case-sensitive schema, while Hive expects case-insensitive column names and fails with an exception during saveAsTable if there are two columns with the same name in different case. I'm trying to resolve the problem, but that requires me to extend JsonRDD, which I can't do. Other RDDs are subclass friendly; why is JsonRDD different?

Dan
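For reference, a hedged reproduction of the interop problem described above against the 1.2 API (the field names and table name are made up):

  import org.apache.spark.sql.hive.HiveContext

  val hc = new HiveContext(sc)
  val json = sc.parallelize(Seq("""{"userId": 1, "userid": 2}"""))
  val schemaRDD = hc.jsonRDD(json)
  schemaRDD.printSchema()          // the inferred schema keeps userId and userid as distinct, case-sensitive columns
  schemaRDD.saveAsTable("users")   // Hive treats the names case-insensitively, so the two columns collide here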
Re: Building Spark with Pants
I'm asking from an experimental standpoint; this is not happening anytime soon. Of course, if the experiment turns out very well, Pants would replace both sbt and Maven (like it has at Twitter, for example). Pants also works with IDEs http://pantsbuild.github.io/index.html#using-pants-with. On Mon Feb 02 2015 at 4:33:11 PM Stephen Boesch java...@gmail.com wrote: There is a significant investment in sbt and maven - and they are not at all likely to be going away. A third build tool? Note that there is also the perspective of building within an IDE - which actually works presently for sbt and with a little bit of tweaking with maven as well. 2015-02-02 16:25 GMT-08:00 Nicholas Chammas nicholas.cham...@gmail.com: Does anyone here have experience with Pants http://pantsbuild.github.io/index.html or interest in trying to build Spark with it? Pants has an interesting story. It was born at Twitter to help them build their Scala, Java, and Python projects as several independent components in one monolithic repo. (It was inspired by a similar build tool at Google called blaze.) The mix of languages and sub-projects at Twitter seems similar to the breakdown we have in Spark. Pants has an interesting take on how a build system should work, and Twitter and Foursquare (who use Pants as their primary build tool) claim it helps enforce better build hygiene and maintainability. Some relevant talks: - Building Scala Hygienically with Pants https://www.youtube.com/watch?v=ukqke8iTuH0 - The Pants Build Tool at Twitter https://engineering.twitter.com/university/videos/the-pants-build-tool-at-twitter - Getting Started with the Pants Build System: Why Pants? https://engineering.twitter.com/university/videos/getting-started-with-the-pants-build-system-why-pants At some point I may take a shot at converting Spark to use Pants as an experiment and just see what it’s like. Nick
Re: Building Spark with Pants
There is a significant investment in sbt and maven - and they are not at all likely to be going away. A third build tool? Note that there is also the perspective of building within an IDE - which actually works presently for sbt and with a little bit of tweaking with maven as well. 2015-02-02 16:25 GMT-08:00 Nicholas Chammas nicholas.cham...@gmail.com: Does anyone here have experience with Pants http://pantsbuild.github.io/index.html or interest in trying to build Spark with it? Pants has an interesting story. It was born at Twitter to help them build their Scala, Java, and Python projects as several independent components in one monolithic repo. (It was inspired by a similar build tool at Google called blaze.) The mix of languages and sub-projects at Twitter seems similar to the breakdown we have in Spark. Pants has an interesting take on how a build system should work, and Twitter and Foursquare (who use Pants as their primary build tool) claim it helps enforce better build hygiene and maintainability. Some relevant talks: - Building Scala Hygienically with Pants https://www.youtube.com/watch?v=ukqke8iTuH0 - The Pants Build Tool at Twitter https://engineering.twitter.com/university/videos/the-pants-build-tool-at-twitter - Getting Started with the Pants Build System: Why Pants? https://engineering.twitter.com/university/videos/getting-started-with-the-pants-build-system-why-pants At some point I may take a shot at converting Spark to use Pants as an experiment and just see what it’s like. Nick
Re: [spark-sql] JsonRDD
It's bad naming - JsonRDD is actually not an RDD. It is just a set of util methods. The case sensitivity issues seem orthogonal, and would be great to be able to control that with a flag. On Mon, Feb 2, 2015 at 4:16 PM, Daniil Osipov daniil.osi...@shazam.com wrote: Hey Spark developers, Is there a good reason for JsonRDD being a Scala object as opposed to class? Seems most other RDDs are classes, and can be extended. The reason I'm asking is that there is a problem with Hive interoperability with JSON DataFrames where jsonFile generates case sensitive schema, while Hive expects case insensitive and fails with an exception during saveAsTable if there are two columns with the same name in different case. I'm trying to resolve the problem, but that requires me to extend JsonRDD, which I can't do. Other RDDs are subclass friendly, why is JsonRDD different? Dan
Re: Building Spark with Pants
To reiterate, I'm asking from an experimental perspective. I'm not proposing we change Spark to build with Pants or anything like that. I'm interested in trying Pants out and I'm wondering if anyone else shares my interest or already has experience with Pants that they can share. On Mon Feb 02 2015 at 4:40:45 PM Nicholas Chammas nicholas.cham...@gmail.com wrote: I'm asking from an experimental standpoint; this is not happening anytime soon. Of course, if the experiment turns out very well, Pants would replace both sbt and Maven (like it has at Twitter, for example). Pants also works with IDEs http://pantsbuild.github.io/index.html#using-pants-with. On Mon Feb 02 2015 at 4:33:11 PM Stephen Boesch java...@gmail.com wrote: There is a significant investment in sbt and maven - and they are not at all likely to be going away. A third build tool? Note that there is also the perspective of building within an IDE - which actually works presently for sbt and with a little bit of tweaking with maven as well. 2015-02-02 16:25 GMT-08:00 Nicholas Chammas nicholas.cham...@gmail.com: Does anyone here have experience with Pants http://pantsbuild.github.io/index.html or interest in trying to build Spark with it? Pants has an interesting story. It was born at Twitter to help them build their Scala, Java, and Python projects as several independent components in one monolithic repo. (It was inspired by a similar build tool at Google called blaze.) The mix of languages and sub-projects at Twitter seems similar to the breakdown we have in Spark. Pants has an interesting take on how a build system should work, and Twitter and Foursquare (who use Pants as their primary build tool) claim it helps enforce better build hygiene and maintainability. Some relevant talks: - Building Scala Hygienically with Pants https://www.youtube.com/watch?v=ukqke8iTuH0 - The Pants Build Tool at Twitter https://engineering.twitter.com/university/videos/the-pants-build-tool-at-twitter - Getting Started with the Pants Build System: Why Pants? https://engineering.twitter.com/university/videos/getting-started-with-the-pants-build-system-why-pants At some point I may take a shot at converting Spark to use Pants as an experiment and just see what it's like. Nick
Building Spark with Pants
Does anyone here have experience with Pants http://pantsbuild.github.io/index.html or interest in trying to build Spark with it? Pants has an interesting story. It was born at Twitter to help them build their Scala, Java, and Python projects as several independent components in one monolithic repo. (It was inspired by a similar build tool at Google called blaze.) The mix of languages and sub-projects at Twitter seems similar to the breakdown we have in Spark. Pants has an interesting take on how a build system should work, and Twitter and Foursquare (who use Pants as their primary build tool) claim it helps enforce better build hygiene and maintainability. Some relevant talks: - Building Scala Hygienically with Pants https://www.youtube.com/watch?v=ukqke8iTuH0 - The Pants Build Tool at Twitter https://engineering.twitter.com/university/videos/the-pants-build-tool-at-twitter - Getting Started with the Pants Build System: Why Pants? https://engineering.twitter.com/university/videos/getting-started-with-the-pants-build-system-why-pants At some point I may take a shot at converting Spark to use Pants as an experiment and just see what it’s like. Nick
Temporary jenkins issue
Hey All, I made a change to the Jenkins configuration that caused most builds to fail (attempting to enable a new plugin); I've reverted the change effective about 10 minutes ago. If you've seen recent build failures like below, this was caused by that change. Sorry about that.

ERROR: Publisher com.google.jenkins.flakyTestHandler.plugin.JUnitFlakyResultArchiver aborted due to exception
java.lang.NoSuchMethodError: hudson.model.AbstractBuild.getTestResultAction()Lhudson/tasks/test/AbstractTestResultAction;
  at com.google.jenkins.flakyTestHandler.plugin.FlakyTestResultAction.<init>(FlakyTestResultAction.java:78)
  at com.google.jenkins.flakyTestHandler.plugin.JUnitFlakyResultArchiver.perform(JUnitFlakyResultArchiver.java:89)
  at hudson.tasks.BuildStepMonitor$1.perform(BuildStepMonitor.java:20)
  at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:770)
  at hudson.model.AbstractBuild$AbstractBuildExecution.performAllBuildSteps(AbstractBuild.java:734)
  at hudson.model.Build$BuildExecution.post2(Build.java:183)
  at hudson.model.AbstractBuild$AbstractBuildExecution.post(AbstractBuild.java:683)
  at hudson.model.Run.execute(Run.java:1784)
  at hudson.matrix.MatrixRun.run(MatrixRun.java:146)
  at hudson.model.ResourceController.execute(ResourceController.java:89)
  at hudson.model.Executor.run(Executor.java:240)

- Patrick
Re: [VOTE] Release Apache Spark 1.2.1 (RC3)
+1 (non-binding, of course)

1. Compiled OSX 10.10 (Yosemite) OK
   Total time: 11:13 min
   mvn clean package -Pyarn -Dyarn.version=2.6.0 -Phadoop-2.4 -Dhadoop.version=2.6.0 -Phive -DskipTests -Dscala-2.11
2. Tested pyspark, MLlib - running as well as comparing results with 1.1.x / 1.2.0
   2.1. statistics (min, max, mean, Pearson, Spearman) OK
   2.2. Linear/Ridge/Lasso Regression OK
   2.3. Decision Tree, Naive Bayes OK
   2.4. KMeans OK
        Center And Scale OK
        Fixed: org.apache.spark.SparkException in zip!
   2.5. rdd operations OK
        State of the Union Texts - MapReduce, Filter, sortByKey (word count)
   2.6. Recommendation (Movielens medium dataset ~1 M ratings) OK
        Model evaluation/optimization (rank, numIter, lmbda) with itertools OK
3. Scala - MLlib
   3.1. statistics (min, max, mean, Pearson, Spearman) OK
   3.2. LinearRegressionWithSGD OK
   3.3. Decision Tree OK
   3.4. KMeans OK
   3.5. Recommendation (Movielens medium dataset ~1 M ratings) OK

Cheers
k/

On Mon, Feb 2, 2015 at 8:57 PM, Patrick Wendell pwend...@gmail.com wrote:

Please vote on releasing the following candidate as Apache Spark version 1.2.1!

The tag to be voted on is v1.2.1-rc3 (commit b6eaf77):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b6eaf77d4332bfb0a698849b1f5f917d20d70e97

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-1.2.1-rc3/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1065/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.2.1-rc3-docs/

Changes from rc2: A single patch fixing a windows issue.

Please vote on releasing this package as Apache Spark 1.2.1!

The vote is open until Friday, February 06, at 05:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.2.1
[ ] -1 Do not release this package because ...

For a list of fixes in this release, see http://s.apache.org/Mpn.

To learn more about Apache Spark, please see http://spark.apache.org/
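For reference, one of the checks listed above (items 2.1/3.1, the MLlib statistics smoke test) as a minimal Scala sketch; the data here is synthetic rather than the actual test set:

  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.stat.Statistics

  val obs = sc.parallelize(Seq(
    Vectors.dense(1.0, 10.0),
    Vectors.dense(2.0, 20.0),
    Vectors.dense(3.0, 30.0)))
  val summary = Statistics.colStats(obs)                  // column-wise min, max, mean, ...
  println(s"min=${summary.min} max=${summary.max} mean=${summary.mean}")
  val pearson  = Statistics.corr(obs, "pearson")          // correlation matrices
  val spearman = Statistics.corr(obs, "spearman")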
[RESULT] [VOTE] Release Apache Spark 1.2.1 (RC2)
This vote is cancelled in favor of RC3.

On Mon, Feb 2, 2015 at 8:50 PM, Patrick Wendell pwend...@gmail.com wrote:

The windows issue reported only affects actually running Spark on Windows (not job submission). However, I agree it's worth cutting a new RC. I'm going to cancel this vote and propose RC3 with a single additional patch. Let's try to vote that through so we can ship Spark 1.2.1.

- Patrick

On Sat, Jan 31, 2015 at 7:36 PM, Matei Zaharia matei.zaha...@gmail.com wrote:

This looks like a pretty serious problem, thanks! Glad people are testing on Windows.

Matei

On Jan 31, 2015, at 11:57 AM, MartinWeindel martin.wein...@gmail.com wrote:

FYI: Spark 1.2.1rc2 does not work on Windows! On creating a Spark context you get the following log output on my Windows machine:

INFO org.apache.spark.SparkEnv:59 - Registering BlockManagerMaster
ERROR org.apache.spark.util.Utils:75 - Failed to create local root dir in C:\Users\mweindel\AppData\Local\Temp\. Ignoring this directory.
ERROR org.apache.spark.storage.DiskBlockManager:75 - Failed to create any local dir.

I have already located the cause. A newly added function chmod700() in org.apache.spark.util.Utils uses functionality which only works on a Unix file system. See also pull request [https://github.com/apache/spark/pull/4299] for my suggestion how to resolve the issue.

Best regards,

Martin Weindel
Re: [VOTE] Release Apache Spark 1.2.1 (RC2)
The windows issue reported only affects actually running Spark on Windows (not job submission). However, I agree it's worth cutting a new RC. I'm going to cancel this vote and propose RC3 with a single additional patch. Let's try to vote that through so we can ship Spark 1.2.1.

- Patrick

On Sat, Jan 31, 2015 at 7:36 PM, Matei Zaharia matei.zaha...@gmail.com wrote:

This looks like a pretty serious problem, thanks! Glad people are testing on Windows.

Matei

On Jan 31, 2015, at 11:57 AM, MartinWeindel martin.wein...@gmail.com wrote:

FYI: Spark 1.2.1rc2 does not work on Windows! On creating a Spark context you get the following log output on my Windows machine:

INFO org.apache.spark.SparkEnv:59 - Registering BlockManagerMaster
ERROR org.apache.spark.util.Utils:75 - Failed to create local root dir in C:\Users\mweindel\AppData\Local\Temp\. Ignoring this directory.
ERROR org.apache.spark.storage.DiskBlockManager:75 - Failed to create any local dir.

I have already located the cause. A newly added function chmod700() in org.apache.spark.util.Utils uses functionality which only works on a Unix file system. See also pull request [https://github.com/apache/spark/pull/4299] for my suggestion how to resolve the issue.

Best regards,

Martin Weindel
[VOTE] Release Apache Spark 1.2.1 (RC3)
Please vote on releasing the following candidate as Apache Spark version 1.2.1!

The tag to be voted on is v1.2.1-rc3 (commit b6eaf77):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=b6eaf77d4332bfb0a698849b1f5f917d20d70e97

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-1.2.1-rc3/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1065/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.2.1-rc3-docs/

Changes from rc2: A single patch fixing a windows issue.

Please vote on releasing this package as Apache Spark 1.2.1!

The vote is open until Friday, February 06, at 05:00 UTC and passes if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Spark 1.2.1
[ ] -1 Do not release this package because ...

For a list of fixes in this release, see http://s.apache.org/Mpn.

To learn more about Apache Spark, please see http://spark.apache.org/
IDF for ml pipeline
Hi all,

I am trying the ml pipeline for text classification now. Recently I succeeded in executing the pipeline processing in the ml package, which consists of our original Japanese tokenizer, HashingTF, and LogisticRegression. Then I failed to execute the pipeline with the IDF in the mllib package directly. To use the IDF feature in the ml package, do I have to implement a wrapper for IDF in the ml package, like HashingTF?

best
Masaki Rikitoku
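For reference, here is roughly the pipeline described above, sketched against the 1.2 spark.ml API; the built-in Tokenizer stands in for the custom Japanese tokenizer, and the column names are illustrative:

  import org.apache.spark.ml.Pipeline
  import org.apache.spark.ml.classification.LogisticRegression
  import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

  val tokenizer = new Tokenizer()   // the custom Japanese tokenizer would take this stage's place
    .setInputCol("text")
    .setOutputCol("words")
  val hashingTF = new HashingTF()
    .setInputCol("words")
    .setOutputCol("features")
  val lr = new LogisticRegression()
    .setMaxIter(10)
  val pipeline = new Pipeline()
    .setStages(Array(tokenizer, hashingTF, lr))
  // An IDF stage would have to be a spark.ml stage wrapping mllib.feature.IDF,
  // slotted between hashingTF and lr; the mllib IDF cannot be passed to
  // setStages directly.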
Re: Performance test for sort shuffle
Hi Kannan, I have a branch here: https://github.com/ehiggs/spark/tree/terasort The code is in the examples. I don't do any fancy partitioning so it could be made quicker, I'm sure. But it should be a good baseline. I have a WIP PR for spark-perf but I'm having trouble building it there[1]. I put it on the back burner until someone can get back to me on it. Yours, Ewan Higgs [1] http://apache-spark-developers-list.1001551.n3.nabble.com/SparkSpark-perf-terasort-WIP-branch-tt10105.html On 02/02/15 23:26, Kannan Rajah wrote: Is there a recommended performance test for sort based shuffle? Something similar to terasort on Hadoop. I couldn't find one on the spark-perf code base. https://github.com/databricks/spark-perf -- Kannan
Re: Get size of rdd in memory
Thanks for your response. So AFAICT calling parallelize(1 to 1024).map(i => KV(i, i.toString)).toSchemaRDD.cache().count() will allow me to see the size of the SchemaRDD in memory, and parallelize(1 to 1024).map(i => KV(i, i.toString)).cache().count() will show me the size of a regular RDD. But this will not show us the size when using cacheTable(), right? Like if I call

  parallelize(1 to 1024).map(i => KV(i, i.toString)).toSchemaRDD.registerTempTable("test")
  sqc.cacheTable("test")
  sqc.sql("SELECT COUNT(*) FROM test")

the web UI does not show us the size of the cached table.
Can spark provide an option to start reduce stage early?
In Hadoop MR, there is an option *mapred.reduce.slowstart.completed.maps* which can be used to start the reduce stage when X% of the mappers have completed. By doing this, the data shuffling process is able to run in parallel with the map process. In a large multi-tenancy cluster, this option is usually turned off. But in some cases, turning on the option could accelerate some high-priority jobs. Will Spark provide a similar option?
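For comparison, this is how the Hadoop MR knob mentioned above is set in code (a small sketch; the 0.8 threshold is just an example value):

  import org.apache.hadoop.conf.Configuration

  val conf = new Configuration()
  // Start launching reducers once 80% of the map tasks have completed,
  // so the shuffle overlaps with the tail of the map stage.
  conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.8f)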
Questions about Spark standalone resource scheduler
Hi all,

I have some questions about the future development of Spark's standalone resource scheduler. We've heard that some users require multi-tenant support in standalone mode, such as multi-user management, resource management and isolation, and whitelisting of users. Current Spark standalone does not seem to support such functionality, while resource schedulers like YARN offer this kind of advanced management. I'm not sure what the future target of the standalone resource scheduler is: will it only target a simple implementation and shift advanced usage to YARN, or does it plan to add some simple multi-tenant related functionality?

Thanks a lot for your comments.

BR
Jerry
Re: Questions about Spark standalone resource scheduler
Hey Jerry, I think standalone mode will still add more features over time, but the goal isn't really for it to become equivalent to what Mesos/YARN are today. Or at least, I doubt Spark Standalone will ever attempt to manage _other_ frameworks outside of Spark and become a general purpose resource manager. In terms of having better support for multi tenancy, meaning multiple *Spark* instances, this is something I think could be in scope in the future. For instance, we added H/A to the standalone scheduler a while back, because it let us support H/A streaming apps in a totally native way. It's a trade off of adding new features and keeping the scheduler very simple and easy to use. We've tended to bias towards simplicity as the main goal, since this is something we want to be really easy out of the box. One thing to point out, a lot of people use the standalone mode with some coarser grained scheduler, such as running in a cloud service. In this case they really just want a simple inner cluster manager. This may even be the majority of all Spark installations. This is slightly different than Hadoop environments, where they might just want nice integration into the existing Hadoop stack via something like YARN. - Patrick On Mon, Feb 2, 2015 at 12:24 AM, Shao, Saisai saisai.s...@intel.com wrote: Hi all, I have some questions about the future development of Spark's standalone resource scheduler. We've heard some users have the requirements to have multi-tenant support in standalone mode, like multi-user management, resource management and isolation, whitelist of users. Seems current Spark standalone do not support such kind of functionalities, while resource schedulers like Yarn offers such kind of advanced managements, I'm not sure what's the future target of standalone resource scheduler, will it only target on simple implementation, and for advanced usage shift to YARN? Or will it plan to add some simple multi-tenant related functionalities? Thanks a lot for your comments. BR Jerry
RE: Questions about Spark standalone resource scheduler
Hi Patrick,

Thanks a lot for your detailed explanation. For now we have such requirements: whitelisting the application submitter, per-user resource (CPU, memory) quotas, and resource allocation in Spark standalone mode. These are quite specific production-use requirements; generally the question becomes whether we need to offer a more advanced resource scheduler compared to the current simple FIFO one. I think our aim is not to provide a general resource scheduler like Mesos/YARN (we only support Spark), but we hope to add some Mesos/YARN functionality to make better use of Spark standalone mode. I admit that such a resource scheduler may have some overlap with a cloud manager; whether to offer a powerful scheduler or rely on a cloud manager is really a dilemma. I think we can break this down into some small features to improve the standalone mode. What's your opinion?

Thanks
Jerry

-----Original Message-----
From: Patrick Wendell [mailto:pwend...@gmail.com]
Sent: Monday, February 2, 2015 4:49 PM
To: Shao, Saisai
Cc: dev@spark.apache.org; u...@spark.apache.org
Subject: Re: Questions about Spark standalone resource scheduler

Hey Jerry, I think standalone mode will still add more features over time, but the goal isn't really for it to become equivalent to what Mesos/YARN are today. Or at least, I doubt Spark Standalone will ever attempt to manage _other_ frameworks outside of Spark and become a general purpose resource manager. In terms of having better support for multi tenancy, meaning multiple *Spark* instances, this is something I think could be in scope in the future. For instance, we added H/A to the standalone scheduler a while back, because it let us support H/A streaming apps in a totally native way. It's a trade off of adding new features and keeping the scheduler very simple and easy to use. We've tended to bias towards simplicity as the main goal, since this is something we want to be really easy out of the box. One thing to point out, a lot of people use the standalone mode with some coarser grained scheduler, such as running in a cloud service. In this case they really just want a simple inner cluster manager. This may even be the majority of all Spark installations. This is slightly different than Hadoop environments, where they might just want nice integration into the existing Hadoop stack via something like YARN. - Patrick

On Mon, Feb 2, 2015 at 12:24 AM, Shao, Saisai saisai.s...@intel.com wrote:

Hi all, I have some questions about the future development of Spark's standalone resource scheduler. We've heard some users have the requirements to have multi-tenant support in standalone mode, like multi-user management, resource management and isolation, whitelist of users. Seems current Spark standalone do not support such kind of functionalities, while resource schedulers like Yarn offers such kind of advanced managements, I'm not sure what's the future target of standalone resource scheduler, will it only target on simple implementation, and for advanced usage shift to YARN? Or will it plan to add some simple multi-tenant related functionalities? Thanks a lot for your comments. BR Jerry