Re: Apache Spark documentation on mllib's Kmeans doesn't jibe.
The train method is on the companion object: https://spark.apache.org/docs/2.1.0/api/scala/index.html#org.apache.spark.mllib.clustering.KMeans$

Here is a decent resource on companion object usage: https://docs.scala-lang.org/tour/singleton-objects.html

On Wed, Dec 13, 2017 at 9:16 AM Michael Segel <msegel_had...@hotmail.com> wrote:
> Hi,
>
> Just came across this while looking at the docs on how to use Spark's KMeans clustering.
>
> Note: This appears to be true in both the 2.1 and 2.2 documentation.
>
> The overview page:
> https://spark.apache.org/docs/2.1.0/mllib-clustering.html#k-means
>
> Here the example contains the following line:
>
> val clusters = KMeans.train(parsedData, numClusters, numIterations)
>
> I was trying to get more information on the train() method, so I checked out the KMeans Scala API:
>
> https://spark.apache.org/docs/2.1.0/api/scala/index.html#org.apache.spark.mllib.clustering.KMeans
>
> The issue is that I couldn't find the train method.
>
> So I thought I was slowly losing my mind.
>
> I checked out the entire API page and could not find any API docs that describe the train() method.
>
> I ended up looking at the source code and found the method in the Scala source.
> (You can see the code here: https://github.com/apache/spark/blob/v2.1.0/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala )
>
> So the method(s) exist, but they are not covered in the Scala API doc for the KMeans class.
>
> How do I raise this as a 'bug'?
>
> Thx
>
> -Mike

--
Scott Reynolds
Principal Engineer
sreyno...@twilio.com
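For anyone else who hits this: scaladoc documents a class and its companion object on separate pages, which is why train() is missing from the KMeans class page but present on the KMeans$ (object) page. A minimal sketch of the pattern (this is illustrative stand-in code, not the real mllib implementation):

```scala
// A class and its companion object share a name; factory-style entry
// points like `train` conventionally live on the object, which scaladoc
// renders on its own page (note the trailing `$` in the KMeans$ URL).
class KMeans(val k: Int) {
  // stand-in for the real iterative fitting logic
  def run(data: Seq[Double]): Int = k
}

object KMeans {
  // companion-object factory, analogous in shape to mllib's KMeans.train
  def train(data: Seq[Double], k: Int, maxIterations: Int): Int =
    new KMeans(k).run(data)
}

// Callers invoke it through the object, not an instance:
val clusters = KMeans.train(Seq(1.0, 2.0, 3.0), 2, 20)
println(clusters) // prints 2: the `k` passed through the companion factory
```

So the docs aren't wrong, just split across two pages: KMeans.train(parsedData, numClusters, numIterations) resolves against the companion object.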
Job Jar files located in s3, driver never starts the job
Following the documentation on spark-submit (http://spark.apache.org/docs/latest/submitting-applications.html#launching-applications-with-spark-submit):

  application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside of your cluster, for instance, an hdfs:// path or a file:// path that is present on all nodes.

I submitted a job with the application-jar specified as s3a://path/to/jar/file/in/s3.jar, and the driver didn't do anything: no logs, and no cores or memory taken, though I had plenty of both. I was able to add hadoop-aws and the aws-sdk to the master's and the workers' classpaths, so those daemons are running with the libraries.

Can someone help me understand how a driver is run on a Spark worker? And how do I get the proper Hadoop libraries onto the driver's classpath so that it is able to download and execute a jar file in s3?
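For reference, here is the shape of the submit command one would expect to need in this situation. This is a sketch, not a confirmed fix: the bucket, class name, and jar paths are placeholders, and it assumes the s3a jars already exist at a local path on every node (there is a chicken-and-egg problem: the driver needs the s3a filesystem classes on its launch classpath *before* it can download an application jar from s3).

```shell
# Hypothetical submit command -- bucket, class, and jar paths are placeholders.
# --jars ships hadoop-aws + the AWS SDK alongside the app; the extraClassPath
# settings put them on the driver/executor classpaths at JVM launch time,
# which is what the driver needs in order to fetch an s3a:// application jar.
spark-submit \
  --master spark://master-host:7077 \
  --deploy-mode cluster \
  --class com.example.MyJob \
  --jars /opt/jars/hadoop-aws-2.7.1.jar,/opt/jars/aws-java-sdk-1.7.4.jar \
  --conf spark.driver.extraClassPath=/opt/jars/hadoop-aws-2.7.1.jar:/opt/jars/aws-java-sdk-1.7.4.jar \
  --conf spark.executor.extraClassPath=/opt/jars/hadoop-aws-2.7.1.jar:/opt/jars/aws-java-sdk-1.7.4.jar \
  s3a://my-bucket/path/to/app.jar
```

spark.driver.extraClassPath and spark.executor.extraClassPath are standard Spark configuration properties; whether they alone resolve the silent-driver symptom described above is not confirmed here.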
Re: s3a file system and spark deployment mode
hmm, I tried using --jars, and that got passed to MasterArguments, which doesn't work :-(

https://github.com/apache/spark/blob/branch-1.5/core/src/main/scala/org/apache/spark/deploy/master/MasterArguments.scala

Same with the Worker:

https://github.com/apache/spark/blob/branch-1.5/core/src/main/scala/org/apache/spark/deploy/worker/WorkerArguments.scala

Both the Master and the Worker have to start with these two jars because:
a.) the Master has to serve the event log in s3
b.) the Worker runs the Driver and has to download the jar from s3

And yes, I am using these deps:

  org.apache.hadoop : hadoop-aws : 2.7.1
  com.amazonaws : aws-java-sdk : 1.7.4

I think I have settled on just modifying the java command line that starts up the worker and master. It just seems easier. Currently I launch them with the spark-class bash script:

  /mnt/services/spark/bin/spark-class org.apache.spark.deploy.master.Master \
    --ip `hostname -i` --port 7077 --webui-port 8080

If all else fails I will update the Spark pom and include the dependencies in the shaded Spark jar.

On Fri, Oct 16, 2015 at 2:25 AM, Steve Loughran <ste...@hortonworks.com> wrote:
> On 15 Oct 2015, at 19:04, Scott Reynolds <sreyno...@twilio.com> wrote:
>
> > List,
> >
> > Right now we build our Spark jobs with the s3a hadoop client. We do this because our machines are only allowed to use IAM access to the s3 store. We can build our jars with the s3a filesystem and the AWS SDK just fine, and these jars run great in *client mode*.
> >
> > We would like to move from client mode to cluster mode, as that will allow us to be more resilient to driver failure. In order to do this, either:
> > 1. the jar file has to be on the worker's local disk, or
> > 2. the jar file is in shared storage (s3a)
> >
> > We would like to put the jar file in s3 storage, but when we give the jar path as s3a://.., the worker node doesn't have the hadoop s3a and AWS SDK in its classpath / uber jar.
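Rather than hand-patching the java command line, one environment variable the launcher scripts do honor is SPARK_DIST_CLASSPATH, which spark-class appends to the classpath of the JVMs it starts. A sketch of the launch above with the s3a jars prepended that way (jar paths are placeholders, and this assumes a Spark build whose launcher reads SPARK_DIST_CLASSPATH, as the 1.5 launcher does):

```shell
# Hypothetical: put the s3a jars on the daemon classpath via the launcher's
# SPARK_DIST_CLASSPATH hook, then start the master exactly as before.
# Jar locations are placeholders -- the jars must exist on every node.
export SPARK_DIST_CLASSPATH="/mnt/jars/hadoop-aws-2.7.1.jar:/mnt/jars/aws-java-sdk-1.7.4.jar"

/mnt/services/spark/bin/spark-class org.apache.spark.deploy.master.Master \
  --ip "$(hostname -i)" --port 7077 --webui-port 8080
```

The same export would go in front of the Worker launch, so the drivers it spawns inherit the s3a classes too.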
> > Other than building Spark with those two dependencies, what other options do I have? We are using 1.5.1, so SPARK_CLASSPATH is no longer a thing.
> >
> > We need to get s3a access in both the master (so that we can log the Spark event log to s3) and the worker processes (driver, executor).
> >
> > Looking for ideas before just adding the dependencies to our Spark build and calling it a day.
>
> You can use --jars to add these, e.g.
>
>   --jars hadoop-aws.jar,aws-java-sdk-s3
>
> As others have warned, you need Hadoop 2.7.1 for s3a to work properly.
Re: s3a file system and spark deployment mode
We do not use EMR. This is deployed on Amazon VMs. We build Spark with Hadoop 2.6.0, but that does not include the s3a filesystem or the Amazon AWS SDK.

On Thu, Oct 15, 2015 at 12:26 PM, Spark Newbie <sparknewbie1...@gmail.com> wrote:
> Are you using EMR?
> You can install Hadoop-2.6.0 along with Spark-1.5.1 in your EMR cluster. That brings the s3a jars to the worker nodes, and they become available to your application.
>
> On Thu, Oct 15, 2015 at 11:04 AM, Scott Reynolds <sreyno...@twilio.com> wrote:
>> List,
>>
>> Right now we build our Spark jobs with the s3a hadoop client. We do this because our machines are only allowed to use IAM access to the s3 store. We can build our jars with the s3a filesystem and the AWS SDK just fine, and these jars run great in *client mode*.
>>
>> We would like to move from client mode to cluster mode, as that will allow us to be more resilient to driver failure. In order to do this, either:
>> 1. the jar file has to be on the worker's local disk, or
>> 2. the jar file is in shared storage (s3a)
>>
>> We would like to put the jar file in s3 storage, but when we give the jar path as s3a://.., the worker node doesn't have the hadoop s3a and AWS SDK in its classpath / uber jar.
>>
>> Other than building Spark with those two dependencies, what other options do I have? We are using 1.5.1, so SPARK_CLASSPATH is no longer a thing.
>>
>> We need to get s3a access in both the master (so that we can log the Spark event log to s3) and the worker processes (driver, executor).
>>
>> Looking for ideas before just adding the dependencies to our Spark build and calling it a day.
s3a file system and spark deployment mode
List,

Right now we build our Spark jobs with the s3a hadoop client. We do this because our machines are only allowed to use IAM access to the s3 store. We can build our jars with the s3a filesystem and the AWS SDK just fine, and these jars run great in *client mode*.

We would like to move from client mode to cluster mode, as that will allow us to be more resilient to driver failure. In order to do this, either:
1. the jar file has to be on the worker's local disk, or
2. the jar file is in shared storage (s3a)

We would like to put the jar file in s3 storage, but when we give the jar path as s3a://.., the worker node doesn't have the hadoop s3a and AWS SDK in its classpath / uber jar.

Other than building Spark with those two dependencies, what other options do I have? We are using 1.5.1, so SPARK_CLASSPATH is no longer a thing.

We need to get s3a access in both the master (so that we can log the Spark event log to s3) and the worker processes (driver, executor).

Looking for ideas before just adding the dependencies to our Spark build and calling it a day.