Re: Upgrade the scala code using the most updated Spark version

2017-03-28 Thread Dinko Srkoč
Adding to the advice given by others ... Spark 2.1.0 works with Scala 2.11, so set:

  scalaVersion := "2.11.8"

When you see something like:

  "org.apache.spark" % "spark-core_2.10" % "1.5.2"

that means the `spark-core` library is compiled against Scala 2.10, so
you would have to change that suffix to 2.11 (and bump the Spark version):

  "org.apache.spark" % "spark-core_2.11" % "2.1.0"

Better yet, let SBT worry about picking the library built against the
right Scala version:

  "org.apache.spark" %% "spark-core" % "2.1.0"

The `%%` instructs SBT to choose the library variant that matches the
Scala version set in `scalaVersion`.

It may be worth mentioning that `%%` only applies to Scala libraries,
since those are compiled against a particular Scala version. Java
libraries are unaffected (they have nothing to do with Scala), e.g. for
`slf4j` one uses a single `%`:

  "org.slf4j" % "slf4j-api" % "1.7.2"

Cheers,
Dinko

On 27 March 2017 at 23:30, Mich Talebzadeh  wrote:
> check these versions
>
> function create_build_sbt_file {
>   BUILD_SBT_FILE=${GEN_APPSDIR}/scala/${APPLICATION}/build.sbt
>   [ -f ${BUILD_SBT_FILE} ] && rm -f ${BUILD_SBT_FILE}
>   cat >> $BUILD_SBT_FILE << !
> lazy val root = (project in file(".")).
>   settings(
>     name := "${APPLICATION}",
>     version := "1.0",
>     scalaVersion := "2.11.8",
>     mainClass in Compile := Some("myPackage.${APPLICATION}")
>   )
> libraryDependencies += "org.apache.spark" %% "spark-core" % "2.0.0" % "provided"
> libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0" % "provided"
> libraryDependencies += "org.apache.spark" %% "spark-hive" % "2.0.0" % "provided"
> libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.0.0" % "provided"
> libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka" % "1.6.1" % "provided"
> libraryDependencies += "com.google.code.gson" % "gson" % "2.6.2"
> libraryDependencies += "org.apache.phoenix" % "phoenix-spark" % "4.6.0-HBase-1.0"
> libraryDependencies += "org.apache.hbase" % "hbase" % "1.2.3"
> libraryDependencies += "org.apache.hbase" % "hbase-client" % "1.2.3"
> libraryDependencies += "org.apache.hbase" % "hbase-common" % "1.2.3"
> libraryDependencies += "org.apache.hbase" % "hbase-server" % "1.2.3"
> // META-INF discarding
> mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>
>   {
>     case PathList("META-INF", xs @ _*) => MergeStrategy.discard
>     case x => MergeStrategy.first
>   }
> }
> !
> }
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed. The
> author will in no case be liable for any monetary damages arising from such
> loss, damage or destruction.
>
>
>
>
> On 27 March 2017 at 21:45, Jörn Franke  wrote:
>>
>> Usually you declare the dependencies on the Spark libraries as provided. You
>> also seem to mix different Spark versions, which should be avoided.
>> The Hadoop library seems to be outdated and should also only be provided.
>>
>> The other dependencies you could assemble in a fat jar.
>>
>> On 27 Mar 2017, at 21:25, Anahita Talebi 
>> wrote:
>>
>> Hi friends,
>>
>> I have a code which is written in Scala. Scala version 2.10.4 and
>> Spark version 1.5.2 are used to run the code.
>>
>> I would like to upgrade the code to the latest version of Spark,
>> meaning 2.1.0.
>>
>> Here is the build.sbt:
>>
>> import AssemblyKeys._
>>
>> assemblySettings
>>
>> name := "proxcocoa"
>>
>> version := "0.1"
>>
>> scalaVersion := "2.10.4"
>>
>> parallelExecution in Test := false
>>
>> {
>>   val excludeHadoop = ExclusionRule(organization = "org.apache.hadoop")
>>   libraryDependencies ++= Seq(
>>     "org.slf4j" % "slf4j-api" % "1.7.2",
>>     "org.slf4j" % "slf4j-log4j12" % "1.7.2",
>>     "org.scalatest" %% "scalatest" % "1.9.1" % "test",
>>     "org.apache.spark" % "spark-core_2.10" % "1.5.2" excludeAll(excludeHadoop),
>>     "org.apache.spark" % "spark-mllib_2.10" % "1.5.2" excludeAll(excludeHadoop),
>>     "org.apache.spark" % "spark-sql_2.10" % "1.5.2" excludeAll(excludeHadoop),
>>     "org.apache.commons" % "commons-compress" % "1.7",
>>     "commons-io" % "commons-io" % "2.4",
>>     "org.scalanlp" % "breeze_2.10" % "0.11.2",
>>     "com.github.fommil.netlib" % "all" % "1.1.2" pomOnly(),
>>     "com.github.scopt" %% "scopt" % "3.3.0"
>>   )
>> }
>>
>> {
>>   val defaultHadoopVersion = "1.0.4"
>>   val hadoopVersion = scala.util.Properties.envOrElse("SPARK_HADOOP_VERSION", defaultHadoopVersion)
>>   libraryDependencies += "org.apache.hadoop" % "hadoop-client" % hadoopVersion
>> }
>>
>> 

Re: submit a spark code on google cloud

2017-02-07 Thread Dinko Srkoč
Getting to the Spark web UI when Spark is running on Dataproc is not
that straightforward. Connecting to that web interface is a two-step
process:

1. create an SSH tunnel
2. configure the browser to use a SOCKS proxy to connect

The above steps are described here:
https://cloud.google.com/dataproc/docs/concepts/cluster-web-interfaces

Once you have your browser configured and running, go to
http://<master-node-name>:4040 for the Spark web UI and
http://<master-node-name>:18080 for Spark's history server.

<master-node-name> here is the name of the cluster with "-m" appended. So,
if the cluster name is "mycluster", the master will be called
"mycluster-m".

Cheers,
Dinko

On 7 February 2017 at 21:41, Jacek Laskowski  wrote:
> Hi,
>
> I know nothing about Spark in GCP, so I'm answering this for pure Spark.
>
> Can you use web UI and Executors tab or a SparkListener?
>
> Jacek
>
> On 7 Feb 2017 5:33 p.m., "Anahita Talebi"  wrote:
>
> Hello Friends,
>
> I am trying to run a Spark code on multiple machines. To this aim, I submit
> the Spark code as a job on the Google Cloud Platform:
> https://cloud.google.com/dataproc/docs/guides/submit-job
>
> I have created a cluster with 6 nodes. Does anyone know how I can tell
> which nodes participate when I run the code on the cluster?
>
> Thanks a lot,
> Anahita
>
>




Re: Running a spark code using submit job in google cloud platform

2017-01-13 Thread Dinko Srkoč
On 13 January 2017 at 13:55, Anahita Talebi  wrote:
> Hi,
>
> Thanks for your answer.
>
> I have chosen "Spark" as the "Job type". There isn't any option where we can
> choose the version. How can I choose a different version?

There's "Preemptible workers, bucket, network, version,
initialization, & access options" link just above the "Create" and
"Cancel" buttons on the "Create a cluster" page. When you click it,
you'll find "Image version" field where you can enter the image
version.

Dataproc image versions:
* 1.1 includes Spark 2.0.2
* 1.0 includes Spark 1.6.2

More about versions can be found here:
https://cloud.google.com/dataproc/docs/concepts/dataproc-versions
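
Whichever image you choose, it helps to keep the Spark version in your
build.sbt in line with what the cluster actually runs, and to mark the
Spark artifacts as "provided". A minimal sketch, assuming a 1.1 image
(so Spark 2.0.2; the list of artifacts is just an example):

  scalaVersion := "2.11.8"

  val sparkVersion = "2.0.2"  // matches a Dataproc 1.1 image

  libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-core"  % sparkVersion % "provided",
    "org.apache.spark" %% "spark-mllib" % sparkVersion % "provided"
  )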

Cheers,
Dinko

>
> Thanks,
> Anahita
>
>
> On Thu, Jan 12, 2017 at 6:39 PM, A Shaikh  wrote:
>>
>> You may have tested this code against the Spark version on your local
>> machine, which may be different from what's on Google Cloud.
>> You need to select the appropriate Spark version when you submit your job.
>>
>> On 12 January 2017 at 15:51, Anahita Talebi 
>> wrote:
>>>
>>> Dear all,
>>>
>>> I am trying to run a .jar file as a job using submit job in google cloud
>>> console.
>>> https://cloud.google.com/dataproc/docs/guides/submit-job
>>>
>>> I actually ran the Spark code on my local computer to generate a .jar
>>> file. Then, in the Arguments field, I give the values of the arguments that I
>>> used in the Spark code. One of the arguments is the training data set, which I
>>> put in the same bucket where I saved my .jar file. In the bucket, I put only
>>> the .jar file, the training dataset and the testing dataset.
>>>
>>> Main class or jar
>>> gs://Anahita/test.jar
>>>
>>> Arguments
>>>
>>> --lambda=.001
>>> --eta=1.0
>>> --trainFile=gs://Anahita/small_train.dat
>>> --testFile=gs://Anahita/small_test.dat
>>>
>>> The problem is that when I run the job I get the following error, and it
>>> cannot actually read my training and testing data sets.
>>>
>>> Exception in thread "main" java.lang.NoSuchMethodError:
>>> org.apache.spark.rdd.RDD.coalesce(IZLscala/math/Ordering;)Lorg/apache/spark/rdd/RDD;
>>>
>>> Can anyone help me how I can solve this problem?
>>>
>>> Thanks,
>>>
>>> Anahita
>>>
>>>
>>
>




Re: Entering the variables in the Argument part in Submit job section to run a spark code on Google Cloud

2017-01-09 Thread Dinko Srkoč
Not knowing what the code that handles those arguments looks like, I
would put the following in the "Arguments" field when submitting a Dataproc job:

--trainFile=gs://Anahita/small_train.dat
--testFile=gs://Anahita/small_test.dat
--numFeatures=9947
--numRounds=100

... providing you still keep those files in the "Anahita" bucket.

Each line in the "Arguments" field ends up as an element of the `args`
argument (an Array) of the method `main`.
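
For example, a `main` along these lines (the names here are illustrative,
not your actual code) would see the four lines above as four elements of
`args` and could parse them as key/value pairs:

  object TrainJob {
    def main(args: Array[String]): Unit = {
      // each "Arguments" line arrives as one element,
      // e.g. "--trainFile=gs://Anahita/small_train.dat"
      val opts = args.flatMap { arg =>
        arg.stripPrefix("--").split("=", 2) match {
          case Array(k, v) => Some(k -> v)
          case _           => None
        }
      }.toMap

      val trainFile = opts("trainFile")
      val numRounds = opts.getOrElse("numRounds", "100").toInt
      println(s"training on $trainFile for $numRounds rounds")
    }
  }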

Cheers,
Dinko

On 9 January 2017 at 13:43, Anahita Talebi  wrote:
> Dear friends,
>
> I am trying to run a Spark code on Google Cloud using submit job.
> https://cloud.google.com/dataproc/docs/tutorials/spark-scala
>
> My question is about the "Arguments" part.
> In my Spark code, there are some variables whose values are defined in a
> shell file (.sh), as follows:
>
> --trainFile=small_train.dat \
> --testFile=small_test.dat \
> --numFeatures=9947 \
> --numRounds=100 \
>
>
> I have tried to enter only the values, each value in a separate box, as
> follows, but it is not working:
>
> data/small_train.dat
> data/small_test.dat
> 9947
> 100
>
> I have also tried to give the parameters as below, but it is not
> working either:
> trainFile=small_train.dat
> testFile=small_test.dat
> numFeatures=9947
> numRounds=100
>
> I added the files small_train.dat and small_test.dat to the same bucket
> where I saved the .jar file. Let's say my bucket is named Anahita: I added
> spark.jar, small_train.dat and small_test.dat to the bucket "Anahita".
>
>
> Does anyone know how I can enter these values in the Arguments part?
>
> Thanks in advance,
> Anahita
>




Re: (send this email to subscribe)

2017-01-04 Thread Dinko Srkoč
You can run a Spark app on Dataproc, which is Google's managed Spark and
Hadoop service:

https://cloud.google.com/dataproc/docs/

basically, you:

* assemble a jar (see the sbt-assembly sketch after this list)
* create a cluster
* submit a job to that cluster (with the jar)
* delete a cluster when the job is done
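
For the first step, a minimal sbt-assembly setup might look like this (the
plugin version is an assumption; check the sbt-assembly docs for one that
matches your sbt):

  // project/plugins.sbt
  addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")

  // then build the fat jar (dependencies marked "provided" stay out of it):
  //   sbt assembly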

Before all that, one has to create a Cloud Platform project, enable
billing and the Dataproc API - but all this is explained in the docs.

Cheers,
Dinko


On 4 January 2017 at 17:34, Anahita Talebi  wrote:
>
> To whom it might concern,
>
> I have a question about running a Spark code on Google Cloud.
>
> Actually, I have a Spark code and would like to run it using multiple
> machines on Google Cloud. Unfortunately, I couldn't find good
> documentation on how to do it.
>
> Do you have any hints which could help me to solve my problem?
>
> Have a nice day,
>
> Anahita
>
>
