Re: Upgrade the scala code using the most updated Spark version
Adding to the advice given by others ... Spark 2.1.0 works with Scala 2.11, so set:

    scalaVersion := "2.11.8"

When you see something like:

    "org.apache.spark" % "spark-core_2.10" % "1.5.2"

that means that the `spark-core` library is compiled against Scala 2.10, so you would have to change that to 2.11:

    "org.apache.spark" % "spark-core_2.11" % "2.1.0"

Better yet, let SBT worry about libraries built against particular Scala versions:

    "org.apache.spark" %% "spark-core" % "2.1.0"

The `%%` instructs SBT to choose the library built for the Scala version set in `scalaVersion`. It may be worth mentioning that `%%` works only with Scala libraries, as they are compiled against a particular Scala version. Java libraries are unaffected (they have nothing to do with Scala), e.g. for `slf4j` one uses a single `%`:

    "org.slf4j" % "slf4j-api" % "1.7.2"

Cheers,
Dinko

On 27 March 2017 at 23:30, Mich Talebzadeh wrote:
> check these versions
>
> function create_build_sbt_file {
>     BUILD_SBT_FILE=${GEN_APPSDIR}/scala/${APPLICATION}/build.sbt
>     [ -f ${BUILD_SBT_FILE} ] && rm -f ${BUILD_SBT_FILE}
>     cat >> $BUILD_SBT_FILE << !
> lazy val root = (project in file(".")).
>   settings(
>     name := "${APPLICATION}",
>     version := "1.0",
>     scalaVersion := "2.11.8",
>     mainClass in Compile := Some("myPackage.${APPLICATION}")
>   )
> libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0" % "provided"
> libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0" % "provided"
> libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.0.0" % "provided"
> libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.0.0" % "provided"
> libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka" % "1.6.1" % "provided"
> libraryDependencies += "com.google.code.gson" % "gson" % "2.6.2"
> libraryDependencies += "org.apache.phoenix" % "phoenix-spark" % "4.6.0-HBase-1.0"
> libraryDependencies += "org.apache.hbase" % "hbase" % "1.2.3"
> libraryDependencies += "org.apache.hbase" % "hbase-client" % "1.2.3"
> libraryDependencies += "org.apache.hbase" % "hbase-common" % "1.2.3"
> libraryDependencies += "org.apache.hbase" % "hbase-server" % "1.2.3"
> // META-INF discarding
> mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>
>   {
>     case PathList("META-INF", xs @ _*) => MergeStrategy.discard
>     case x => MergeStrategy.first
>   }
> }
> !
> }
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
>
> On 27 March 2017 at 21:45, Jörn Franke wrote:
>>
>> Usually you define the dependencies on the Spark libraries as "provided". You also seem to mix different Spark versions, which should be avoided.
>> The Hadoop library seems to be outdated and should also only be "provided".
>>
>> The other dependencies you could assemble in a fat jar.
>>
>> On 27 Mar 2017, at 21:25, Anahita Talebi wrote:
>>
>> Hi friends,
>>
>> I have a code which is written in Scala. Scala version 2.10.4 and Spark version 1.5.2 are used to run the code.
>>
>> I would like to upgrade the code to the most recent version of Spark, meaning 2.1.0.
>>
>> Here is the build.sbt:
>>
>> import AssemblyKeys._
>>
>> assemblySettings
>>
>> name := "proxcocoa"
>>
>> version := "0.1"
>>
>> scalaVersion := "2.10.4"
>>
>> parallelExecution in Test := false
>>
>> {
>>   val excludeHadoop = ExclusionRule(organization = "org.apache.hadoop")
>>   libraryDependencies ++= Seq(
>>     "org.slf4j" % "slf4j-api" % "1.7.2",
>>     "org.slf4j" % "slf4j-log4j12" % "1.7.2",
>>     "org.scalatest" %% "scalatest" % "1.9.1" % "test",
>>     "org.apache.spark" % "spark-core_2.10" % "1.5.2" excludeAll(excludeHadoop),
>>     "org.apache.spark" % "spark-mllib_2.10" % "1.5.2" excludeAll(excludeHadoop),
>>     "org.apache.spark" % "spark-sql_2.10" % "1.5.2" excludeAll(excludeHadoop),
>>     "org.apache.commons" % "commons-compress" % "1.7",
>>     "commons-io" % "commons-io" % "2.4",
>>     "org.scalanlp" % "breeze_2.10" % "0.11.2",
>>     "com.github.fommil.netlib" % "all" % "1.1.2" pomOnly(),
>>     "com.github.scopt" %% "scopt" % "3.3.0"
>>   )
>> }
>>
>> {
>>   val defaultHadoopVersion = "1.0.4"
>>   val hadoopVersion = scala.util.Properties.envOrElse("SPARK_HADOOP_VERSION", defaultHadoopVersion)
>>   libraryDependencies += "org.apache.hadoop" % "hadoop-client" % hadoopVersion
>> }
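[Putting the advice in this thread together, an upgraded build.sbt might look roughly like the sketch below. It is untested: the Spark artifacts move to 2.1.0 with `%%` and a "provided" scope per Jörn's note, the explicit hadoop-client block is dropped on the assumption that the cluster supplies Hadoop at runtime, and the scalatest bump to 2.2.6 (1.9.1 was never published for Scala 2.11) is an assumption worth verifying against the project's tests.]

    import AssemblyKeys._

    assemblySettings

    name := "proxcocoa"

    version := "0.1"

    scalaVersion := "2.11.8"

    parallelExecution in Test := false

    {
      val excludeHadoop = ExclusionRule(organization = "org.apache.hadoop")
      libraryDependencies ++= Seq(
        "org.slf4j" % "slf4j-api" % "1.7.2",
        "org.slf4j" % "slf4j-log4j12" % "1.7.2",
        // assumed bump: 1.9.1 has no _2.11 build
        "org.scalatest" %% "scalatest" % "2.2.6" % "test",
        // %% resolves the _2.11 artifacts; "provided" keeps Spark out of the fat jar
        "org.apache.spark" %% "spark-core" % "2.1.0" % "provided" excludeAll(excludeHadoop),
        "org.apache.spark" %% "spark-mllib" % "2.1.0" % "provided" excludeAll(excludeHadoop),
        "org.apache.spark" %% "spark-sql" % "2.1.0" % "provided" excludeAll(excludeHadoop),
        "org.apache.commons" % "commons-compress" % "1.7",
        "commons-io" % "commons-io" % "2.4",
        "org.scalanlp" %% "breeze" % "0.11.2",
        "com.github.fommil.netlib" % "all" % "1.1.2" pomOnly(),
        "com.github.scopt" %% "scopt" % "3.3.0"
      )
    }

[With `%%` used throughout, a later Scala upgrade only requires changing `scalaVersion`.]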
Re: submit a spark code on google cloud
Getting to the Spark web UI when Spark is running on Dataproc is not that straightforward. Connecting to that web interface is a two-step process:

1. create an SSH tunnel
2. configure the browser to use a SOCKS proxy to connect

The above steps are described here:
https://cloud.google.com/dataproc/docs/concepts/cluster-web-interfaces

Once you have your browser configured and running, go to http://<master>:4040 for the Spark web UI and http://<master>:18080 for Spark's history server, where <master> is the name of the cluster with the "-m" appendage. So, if the cluster name is "mycluster", the master will be called "mycluster-m".

Cheers,
Dinko

On 7 February 2017 at 21:41, Jacek Laskowski wrote:
> Hi,
>
> I know nothing about Spark in GCP so answering this for a pure Spark.
>
> Can you use web UI and Executors tab or a SparkListener?
>
> Jacek
>
> On 7 Feb 2017 5:33 p.m., "Anahita Talebi" wrote:
>
> Hello Friends,
>
> I am trying to run a spark code on multiple machines. To this aim, I submit
> a spark code via the submit job page on the Google Cloud Platform:
> https://cloud.google.com/dataproc/docs/guides/submit-job
>
> I have created a cluster with 6 nodes. Does anyone know how I can tell
> which nodes participate when I run the code on the cluster?
>
> Thanks a lot,
> Anahita
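[As a sketch of those two steps on the command line - the cluster name "mycluster" and zone "us-central1-a" are made-up here, and the linked page has the authoritative, per-browser instructions:]

    # 1. Open an SSH tunnel to the master node; -D 1080 starts a local
    #    SOCKS proxy on port 1080, -N means "do not open a remote shell".
    gcloud compute ssh mycluster-m --zone=us-central1-a -- -D 1080 -N

    # 2. Start a browser that routes its traffic (including DNS lookups)
    #    through that proxy, then browse to http://mycluster-m:4040 or
    #    http://mycluster-m:18080.
    google-chrome \
        --proxy-server="socks5://localhost:1080" \
        --host-resolver-rules="MAP * 0.0.0.0 , EXCLUDE localhost" \
        --user-data-dir=/tmp/mycluster-m-profile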
Re: Running a spark code using submit job in google cloud platform
On 13 January 2017 at 13:55, Anahita Talebi wrote:
> Hi,
>
> Thanks for your answer.
>
> I have chosen "Spark" as the "job type". There isn't any option where we can
> choose the version. How can I choose a different version?

There's a "Preemptible workers, bucket, network, version, initialization, & access options" link just above the "Create" and "Cancel" buttons on the "Create a cluster" page. When you click it, you'll find an "Image version" field where you can enter the image version.

Dataproc versions:
* 1.1 includes Spark 2.0.2
* 1.0 includes Spark 1.6.2

More about versions can be found here:
https://cloud.google.com/dataproc/docs/concepts/dataproc-versions

Cheers,
Dinko

>
> Thanks,
> Anahita
>
>
> On Thu, Jan 12, 2017 at 6:39 PM, A Shaikh wrote:
>>
>> You may have tested this code against a Spark version on your local machine
>> which may be different from the one on Google Cloud.
>> You need to select the appropriate Spark version when you submit your job.
>>
>> On 12 January 2017 at 15:51, Anahita Talebi wrote:
>>>
>>> Dear all,
>>>
>>> I am trying to run a .jar file as a job using submit job in the Google Cloud
>>> console:
>>> https://cloud.google.com/dataproc/docs/guides/submit-job
>>>
>>> I actually ran the spark code on my local computer to generate a .jar
>>> file. Then in the Arguments field, I give the values of the arguments that I
>>> used in the spark code. One of the arguments is the training data set, which I put
>>> in the same bucket where I save my .jar file. In the bucket, I put only the
>>> .jar file, the training dataset and the testing dataset.
>>>
>>> Main class or jar
>>> gs://Anahita/test.jar
>>>
>>> Arguments
>>>
>>> --lambda=.001
>>> --eta=1.0
>>> --trainFile=gs://Anahita/small_train.dat
>>> --testFile=gs://Anahita/small_test.dat
>>>
>>> The problem is that when I run the job I get the following error, and
>>> it actually cannot read my training and testing data sets:
>>>
>>> Exception in thread "main" java.lang.NoSuchMethodError:
>>> org.apache.spark.rdd.RDD.coalesce(IZLscala/math/Ordering;)Lorg/apache/spark/rdd/RDD;
>>>
>>> Can anyone help me solve this problem?
>>>
>>> Thanks,
>>>
>>> Anahita
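[The `NoSuchMethodError` quoted above is the classic symptom of the mismatch A Shaikh describes: the jar was compiled against one Spark version while the cluster runs another, so a method signature the jar expects does not exist at runtime. For clusters created from the command line rather than the console, the image version can be pinned there too; a sketch, with made-up cluster and zone names:]

    # Pin the Dataproc image, and with it the Spark version, at creation
    # time: image 1.1 ships Spark 2.0.2, image 1.0 ships Spark 1.6.2.
    gcloud dataproc clusters create mycluster \
        --zone=us-central1-a \
        --image-version=1.1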
Re: Entering the variables in the Argument part in Submit job section to run a spark code on Google Cloud
Not knowing what the code that handles those arguments looks like, I would, in the "Arguments" field for submitting a Dataproc job, put:

--trainFile=gs://Anahita/small_train.dat
--testFile=gs://Anahita/small_test.dat
--numFeatures=9947
--numRounds=100

... provided you still keep those files in the "Anahita" bucket.

Each line in the "Arguments" field ends up as an element of the `args` argument (an Array) of the method `main`.

Cheers,
Dinko

On 9 January 2017 at 13:43, Anahita Talebi wrote:
> Dear friends,
>
> I am trying to run a spark code on Google Cloud using submit job:
> https://cloud.google.com/dataproc/docs/tutorials/spark-scala
>
> My question is about the "Arguments" part.
> In my spark code, there are some variables whose values are defined in a
> shell file (.sh), as follows:
>
> --trainFile=small_train.dat \
> --testFile=small_test.dat \
> --numFeatures=9947 \
> --numRounds=100 \
>
> I have tried to enter only the values, each value in a separate box, as
> follows, but it is not working:
>
> data/small_train.dat
> data/small_test.dat
> 9947
> 100
>
> I have also tried to give the parameters as below, but that is not working
> either:
>
> trainFile=small_train.dat
> testFile=small_test.dat
> numFeatures=9947
> numRounds=100
>
> I added the files small_train.dat and small_test.dat to the same bucket
> where I saved the .jar file. Let's say my bucket is named "Anahita": I
> added spark.jar, small_train.dat and small_test.dat to the bucket "Anahita".
>
> Does anyone know how I can enter these values in the argument part?
>
> Thanks in advance,
> Anahita
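[To illustrate why the `--key=value` form is the one to try: the job's main class receives those lines verbatim in `args` and must parse them itself. A minimal, hypothetical sketch of such a `main` - the real code may use scopt (which the build.sbt in the earlier thread includes) or another parser instead:]

    object TrainApp {
      def main(args: Array[String]): Unit = {
        // Each line of the "Arguments" field arrives as one element of `args`,
        // e.g. Array("--trainFile=gs://Anahita/small_train.dat", "--numRounds=100")
        val opts: Map[String, String] = args.collect {
          case s if s.startsWith("--") && s.contains("=") =>
            val Array(k, v) = s.stripPrefix("--").split("=", 2)
            k -> v
        }.toMap

        val trainFile = opts("trainFile")        // "gs://Anahita/small_train.dat"
        val numRounds = opts("numRounds").toInt  // 100
        println(s"training on $trainFile for $numRounds rounds")
      }
    }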
Re: (send this email to subscribe)
You can run a Spark app on Dataproc, which is Google's managed Spark and Hadoop service:
https://cloud.google.com/dataproc/docs/

Basically, you:

* assemble a jar
* create a cluster
* submit a job to that cluster (with the jar)
* delete the cluster when the job is done

(a command-line sketch of these steps follows at the end of this message)

Before all that, one has to create a Cloud Platform project and enable billing and the Dataproc API - but all this is explained in the docs.

Cheers,
Dinko

On 4 January 2017 at 17:34, Anahita Talebi wrote:
>
> To whom it may concern,
>
> I have a question about running a spark code on Google Cloud.
>
> Actually, I have a spark code and would like to run it using multiple
> machines on Google Cloud. Unfortunately, I couldn't find good
> documentation about how to do it.
>
> Do you have any hints which could help me solve my problem?
>
> Have a nice day,
>
> Anahita
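[The four steps above map onto gcloud commands roughly as follows. A sketch only: the cluster, bucket and jar names are placeholders, and the docs linked above cover the options in detail.]

    # create a cluster (one master plus five workers)
    gcloud dataproc clusters create mycluster --num-workers=5

    # submit the assembled jar as a Spark job; everything after the bare
    # "--" is passed through to the application's main()
    gcloud dataproc jobs submit spark \
        --cluster=mycluster \
        --jar=gs://mybucket/myapp-assembly.jar \
        -- --trainFile=gs://mybucket/train.dat

    # tear the cluster down once the job is finished, to stop billing
    gcloud dataproc clusters delete mycluster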