Get application id when using SparkSubmit.main from java
Hi, I am trying to get the application id after I use SparkSubmit.main for a YARN submission. I am able to make it asynchronous using the spark.yarn.waitForCompletion=false configuration option, but I can't seem to figure out how to get the application id for this job. I have read both SparkSubmit.scala and Client.scala. Any thoughts on how I could do it? I'd prefer not to call Client.run directly, even though it returns the application id, since SparkSubmit does a lot of environment prep. Thanks in advance for any help... Thanks, Ron
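If the Spark version in use ships the org.apache.spark.launcher package, one possible alternative is to go through SparkLauncher instead of SparkSubmit.main and read the application id from the handle. A minimal sketch, with placeholder jar, main class and paths:

import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

object SubmitAndTrack {
  def main(args: Array[String]): Unit = {
    // Launch without blocking and watch state transitions; getAppId is null until YARN accepts the app.
    val handle = new SparkLauncher()
      .setAppResource("/path/to/myapp.jar")   // placeholder application jar
      .setMainClass("com.example.MyApp")      // placeholder main class
      .setMaster("yarn")
      .setDeployMode("cluster")
      .startApplication(new SparkAppHandle.Listener {
        override def stateChanged(h: SparkAppHandle): Unit =
          println(s"state=${h.getState} appId=${h.getAppId}")
        override def infoChanged(h: SparkAppHandle): Unit = ()
      })
  }
}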
Re: Get full RDD lineage for a spark job
Cool thanks. Will give that a try... --Ron On Friday, July 21, 2017 8:09 PM, Keith Chapman <keithgchap...@gmail.com> wrote: You could also enable it with --conf spark.logLineage=true if you do not want to change any code. Regards, Keith. http://keith-chapman.com On Fri, Jul 21, 2017 at 7:57 PM, Keith Chapman <keithgchap...@gmail.com> wrote: Hi Ron, You can try using the toDebugString method on the RDD; it will print the RDD lineage. Regards, Keith. http://keith-chapman.com On Fri, Jul 21, 2017 at 11:24 AM, Ron Gonzalez <zlgonza...@yahoo.com.invalid> wrote: Hi, Can someone point me to a test case or share sample code that is able to extract the RDD graph from a Spark job anywhere during its lifecycle? I understand that Spark has a UI that can show the graph of the execution, so I'm hoping it is built on some API that I could use. I know the RDD is the actual execution graph, so if there is also a more logical abstraction API closer to calls like map, filter, aggregate, etc., that would be even better. Appreciate any help... Thanks, Ron
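A minimal sketch of both suggestions from the thread (the input path and job are placeholders):

import org.apache.spark.{SparkConf, SparkContext}

object LineageDemo {
  def main(args: Array[String]): Unit = {
    // spark.logLineage=true makes Spark print the lineage of an RDD when a job runs on it.
    val conf = new SparkConf().setAppName("LineageDemo").set("spark.logLineage", "true")
    val sc = new SparkContext(conf)
    val counts = sc.textFile("/tmp/input.txt")   // placeholder input path
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
    // toDebugString returns the RDD lineage (the chain of parent RDDs) as a string.
    println(counts.toDebugString)
    sc.stop()
  }
}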
Get full RDD lineage for a spark job
Hi, Can someone point me to a test case or share sample code that is able to extract the RDD graph from a Spark job anywhere during its lifecycle? I understand that Spark has a UI that can show the graph of the execution, so I'm hoping it is built on some API that I could use. I know the RDD is the actual execution graph, so if there is also a more logical abstraction API closer to calls like map, filter, aggregate, etc., that would be even better. Appreciate any help... Thanks, Ron
Losing files in hdfs after creating spark sql table
Hi, After I create a table in Spark SQL and load an HDFS file into it, the file no longer shows up when I do hadoop fs -ls. Is this expected? Thanks, Ron
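If the load was done with a Hive-style LOAD DATA INPATH against a managed table, the usual explanation is that the file is moved, not copied, into the table's warehouse directory, so it disappears from its original location. A hedged sketch with placeholder table and paths:

// Assumes a HiveContext and a managed table named "mytable" (placeholder names).
val hc = new org.apache.spark.sql.hive.HiveContext(sc)
hc.sql("LOAD DATA INPATH '/user/ron/input.csv' INTO TABLE mytable")
// The source path is now empty; the data lives under the table's warehouse directory, e.g.
//   hadoop fs -ls /user/hive/warehouse/mytable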
Question on Spark SQL for a directory
Hi, a question on using Spark SQL. Can someone give an example of creating a table from a directory containing parquet files in HDFS instead of an actual parquet file? Thanks, Ron On 07/21/2015 01:59 PM, Brandon White wrote: A few questions about caching a table in Spark SQL. 1) Is there any difference between caching the dataframe and the table? df.cache() vs sqlContext.cacheTable(tableName) 2) Do you need to warm up the cache before seeing the performance benefits? Is the cache LRU? Do you need to run some queries on the table before it is cached in memory? 3) Is caching the table much faster than .saveAsTable? I am only seeing a 10%-20% performance increase.
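A minimal sketch of the directory case, assuming Spark 1.x, an existing SparkContext sc, and a placeholder HDFS path:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
// Point the reader at the directory; all parquet part files underneath it are picked up.
val df = sqlContext.read.parquet("hdfs:///data/events/")   // a directory, not a single file
df.registerTempTable("events")
sqlContext.sql("SELECT COUNT(*) FROM events").show()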
Re: Classifier for Big Data Mining
I'd use Random Forest. It will give you better generalizability. There are also a number of things you can do with RF that allow you to train on samples of the massive data set and then just average over the resulting models... Thanks, Ron On 07/21/2015 02:17 PM, Olivier Girardot wrote: depends on your data and I guess the time/performance goals you have for both training/prediction, but for a quick answer: yes :) 2015-07-21 11:22 GMT+02:00 Chintan Bhatt chintanbhatt...@charusat.ac.in: Which classifier can be useful for mining massive datasets in Spark? Decision Tree can be a good choice as per scalability? -- CHINTAN BHATT http://in.linkedin.com/pub/chintan-bhatt/22/b31/336/ Assistant Professor, U P U Patel Department of Computer Engineering, Chandubhai S. Patel Institute of Technology, Charotar University of Science And Technology (CHARUSAT), Changa-388421, Gujarat, INDIA. http://www.charusat.ac.in/ Personal Website: https://sites.google.com/a/ecchanga.ac.in/chintan/
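A hedged sketch of the sample-and-train idea using MLlib's RandomForest (the data path, sampling fraction and hyperparameters are placeholders); repeating this on different samples and averaging the models' predictions is the ensembling idea mentioned above:

import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.util.MLUtils

// Train one forest on a 10% sample of a LIBSVM-formatted dataset.
val data = MLUtils.loadLibSVMFile(sc, "hdfs:///data/train.libsvm")
  .sample(withReplacement = false, fraction = 0.1)
val model = RandomForest.trainClassifier(
  data,
  numClasses = 2,
  categoricalFeaturesInfo = Map[Int, Int](),
  numTrees = 100,
  featureSubsetStrategy = "auto",
  impurity = "gini",
  maxDepth = 10,
  maxBins = 32)
println(model.toDebugString)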
Re: Basic Spark SQL question
Cool thanks. Will take a look... Sent from my iPhone On Jul 13, 2015, at 6:40 PM, Michael Armbrust mich...@databricks.com wrote: I'd look at the JDBC server (a long running YARN job you can submit queries to) https://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodbc-server On Mon, Jul 13, 2015 at 6:31 PM, Jerrick Hoang jerrickho...@gmail.com wrote: Well, for ad hoc queries you can use the CLI On Mon, Jul 13, 2015 at 5:34 PM, Ron Gonzalez zlgonza...@yahoo.com.invalid wrote: Hi, I have a question for Spark SQL. Is there a way to use Spark SQL on YARN without having to submit a job? The bottom line here is that I want to reduce the latency of running queries as a job. I know that the Spark SQL default submission is like a job, but I was wondering if it's possible to run queries like one would with a regular db like MySQL or Oracle. Thanks, Ron
Basic Spark SQL question
Hi, I have a question for Spark SQL. Is there a way to use Spark SQL on YARN without having to submit a job? The bottom line here is that I want to reduce the latency of running queries as a job. I know that the Spark SQL default submission is like a job, but I was wondering if it's possible to run queries like one would with a regular db like MySQL or Oracle. Thanks, Ron
Re: error with pyspark
If you're running on Ubuntu, run ulimit -n, which shows the maximum number of open files allowed. You will have to raise that value in /etc/security/limits.conf, then log out and log back in. Thanks, Ron Sent from my iPad On Aug 10, 2014, at 10:19 PM, Davies Liu dav...@databricks.com wrote: On Fri, Aug 8, 2014 at 9:12 AM, Baoqiang Cao bqcaom...@gmail.com wrote: Hi there, I ran into a problem and can’t find a solution. I was running bin/pyspark ../python/wordcount.py

you could use bin/spark-submit ../python/wordcount.py

The wordcount.py is here:

import sys
from operator import add
from pyspark import SparkContext

datafile = '/mnt/data/m1.txt'
sc = SparkContext()
outfile = datafile + '.freq'
lines = sc.textFile(datafile, 1)
counts = lines.flatMap(lambda x: x.split(' ')) \
    .map(lambda x: (x, 1)) \
    .reduceByKey(add)
output = counts.collect()
outf = open(outfile, 'w')
for (word, count) in output:
    outf.write(word.encode('utf-8') + '\t' + str(count) + '\n')
outf.close()

The error message is here:

14/08/08 16:01:59 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 0)
java.io.FileNotFoundException: /tmp/spark-local-20140808160150-d36b/12/shuffle_0_0_468 (Too many open files)

This message means that Spark (the JVM) had reached the max number of open files; there is an fd leak somewhere, but unfortunately I cannot reproduce this problem. What is the version of Spark?

    at java.io.FileOutputStream.open(Native Method)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
    at org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:107)
    at org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:175)
    at org.apache.spark.shuffle.hash.HashShuffleWriter$$anonfun$write$1.apply(HashShuffleWriter.scala:67)
    at org.apache.spark.shuffle.hash.HashShuffleWriter$$anonfun$write$1.apply(HashShuffleWriter.scala:65)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:65)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
    at org.apache.spark.scheduler.Task.run(Task.scala:54)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:744)

The m1.txt is about 4G, and I have 120GB RAM and used -Xmx120GB. It is on Ubuntu. Any help please? Best, Baoqiang Cao Blog: http://baoqiang.org Email: bqcaom...@gmail.com
Re: Save an RDD to a SQL Database
Hi Vida, It's possible to save an RDD as a Hadoop file using Hadoop output formats. It might be worthwhile to investigate DBOutputFormat and see if that will work for you. I haven't personally written to a db, but I'd imagine this would be one way to do it. Thanks, Ron Sent from my iPhone On Aug 5, 2014, at 8:29 PM, Vida Ha vid...@gmail.com wrote: Hi, I would like to save an RDD to a SQL database. It seems like this would be a common enough use case. Are there any built-in libraries to do it? Otherwise, I'm just planning on mapping my RDD, and having that call a method to write to the database. Given that a lot of records are going to be written, the code would need to be smart and do a batch insert after enough records have collected. Does that sound like a reasonable approach? -Vida
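A hedged sketch of the per-partition batch-insert approach Vida describes (the JDBC URL, credentials, table and record type are placeholders), assuming rdd is an RDD[(String, Int)]:

import java.sql.DriverManager

// One connection per partition; rows are accumulated into a JDBC batch and flushed once.
rdd.foreachPartition { records =>
  val conn = DriverManager.getConnection("jdbc:mysql://dbhost:3306/mydb", "user", "password")
  conn.setAutoCommit(false)
  val stmt = conn.prepareStatement("INSERT INTO word_counts (word, count) VALUES (?, ?)")
  try {
    records.foreach { case (word, count) =>
      stmt.setString(1, word)
      stmt.setInt(2, count)
      stmt.addBatch()
    }
    stmt.executeBatch()
    conn.commit()
  } finally {
    stmt.close()
    conn.close()
  }
}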
Re: Computing mean and standard deviation by key
Cool thanks! On Monday, August 4, 2014 8:58 AM, kriskalish k...@kalish.net wrote: Hey Ron, It was pretty much exactly as Sean had depicted. I just needed to provide count an anonymous function to tell it which elements to count. Since I wanted to count them all, the function is simply true.

val grouped = rdd.groupByKey().mapValues { mcs =>
  val values = mcs.map(_.foo.toDouble)
  val n = values.count(x => true)
  val sum = values.sum
  val sumSquares = values.map(x => x * x).sum
  val stddev = math.sqrt(n * sumSquares - sum * sum) / n
  print("stddev: " + stddev)
  stddev
}

I hope that helps
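For the reduceByKey-style variant mentioned later in the thread, a hedged sketch that avoids groupByKey by folding values into Spark's StatCounter (assumes the same pair RDD as above, pair-RDD implicits in scope, and a Spark version with aggregateByKey, i.e. 1.1+):

import org.apache.spark.util.StatCounter

// Build per-key running statistics without materializing every value for a key.
val stats = rdd
  .mapValues(_.foo.toDouble)
  .aggregateByKey(new StatCounter())(
    (acc, v) => acc.merge(v),
    (a, b) => a.merge(b))
stats.mapValues(s => (s.mean, s.stdev)).collect().foreach(println)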
Re: Issues with HDP 2.4.0.2.1.3.0-563
One key thing I forgot to mention is that I changed the Avro version to 1.7.7 to get AVRO-1476. I took a closer look at the jars, and what I noticed is that the assembly jars that work do not have the org.apache.avro.mapreduce package packaged into the assembly. For spark-1.0.1, org.apache.avro.mapreduce is always found. When creating an assembly from an older download of Spark 1.0.0, this package doesn't exist. In a recent download of Spark 1.0.0, the generated assembly with any HDP version also has org.apache.avro.mapreduce. I recompiled against the new download, and it also has the same problems even with an older version of HDP. So I think the bottom-line issue here is that the generated assemblies that include org.apache.avro.mapreduce seem to cause this issue. If I use the older Spark 1.0.0 version, I am able to create assemblies that work. I noticed that assemblies generated from the newer versions are indeed bigger, so it seems a bug was perhaps fixed to ensure that all dependencies are pulled into the final assembly, but that is now causing the symptom I have reported... Thanks, Ron On Monday, August 4, 2014 10:39 AM, Steve Nunez snu...@hortonworks.com wrote: Hmm. Fair enough. I hadn't given that answer much thought and on reflection think you're right in that a profile would just be a bad hack. On 8/4/14, 10:35, Sean Owen so...@cloudera.com wrote: What would such a profile do though? In general building for a specific vendor version means setting hadoop.version and/or yarn.version. Any hard-coded value is unlikely to match what a particular user needs. Setting protobuf versions and so on is already done by the generic profiles. In a similar vein, I am not clear on why there's a mapr profile in the build. Its versions are about to be out of date and won't work with upcoming HBase changes, for example. (Elsewhere in the build I think it wouldn't hurt to clear out cloudera-specific profiles and releases too -- they're not in the pom but are in the distribution script. It's the vendor's problem.) This isn't any argument about being purist, but just that I am not sure these are things the project can meaningfully bother with. It makes sense to set vendor repos in the pom for convenience, and it makes sense to run smoke tests in Jenkins against particular versions. $0.02 Sean On Mon, Aug 4, 2014 at 6:21 PM, Steve Nunez snu...@hortonworks.com wrote: I don't think there is an hwx profile, but there probably should be.
Re: Is there a way to write spark RDD to Avro files
You have to import org.apache.spark.rdd._, which automatically makes this method available. Thanks, Ron Sent from my iPhone On Aug 1, 2014, at 3:26 PM, touchdown yut...@gmail.com wrote: Hi, I am facing a similar dilemma. I am trying to aggregate a bunch of small avro files into one avro file. I read it in with: sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable, AvroKeyInputFormat[GenericRecord]](path) but I can't find saveAsHadoopFile or saveAsNewAPIHadoopFile. Can you please tell us how it worked for you? Thanks!
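A hedged sketch of writing the merged records back out with saveAsNewAPIHadoopFile (the schema, output path and the shape of the pair RDD are assumptions, and the pair-RDD implicits from the import above must be in scope):

import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.{AvroJob, AvroKeyOutputFormat}
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapreduce.Job

val job = new Job(sc.hadoopConfiguration)
AvroJob.setOutputKeySchema(job, schema)       // schema: the GenericRecord schema of the inputs
records                                       // assumed RDD[(AvroKey[GenericRecord], NullWritable)]
  .coalesce(1)                                // single output file, since the goal is to merge small files
  .saveAsNewAPIHadoopFile(
    "/output/merged-avro",                    // placeholder output directory
    classOf[AvroKey[GenericRecord]],
    classOf[NullWritable],
    classOf[AvroKeyOutputFormat[GenericRecord]],
    job.getConfiguration)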
Re: Computing mean and standard deviation by key
Can you share the mapValues approach you did? Thanks, Ron Sent from my iPhone On Aug 1, 2014, at 3:00 PM, kriskalish k...@kalish.net wrote: Thanks for the help everyone. I got the mapValues approach working. I will experiment with the reduceByKey approach later. -Kris
NotSerializableException
Hi, I took avro 1.7.7 and recompiled my distribution to be able to fix the issue when dealing with avro GenericRecord. The issue I got was resolved. I'm referring to AVRO-1476. I also enabled kryo registration in SparkConf. That said, I am still seeing a NotSerializableException for Schema$RecordSchema. Do I need to do anything else? Thanks, Ron Sent from my iPad
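Since Avro's Schema (the Schema$RecordSchema in the exception) is not Serializable and Kryo registration alone may not cover it, one hedged workaround sketch is to keep the Schema object out of the closure and re-parse it from its JSON form inside the task (names are placeholders):

import org.apache.avro.Schema

val schemaJson = schema.toString   // render the driver-side Schema as JSON
val result = rdd.mapPartitions { iter =>
  // Rebuild the Schema per partition instead of capturing the non-serializable instance.
  val localSchema = new Schema.Parser().parse(schemaJson)
  iter.map { record =>
    // ... use localSchema with each GenericRecord here ...
    record
  }
}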
Re: cache changes precision
Cool, I'll take a look and give it a try! Thanks, Ron Sent from my iPad On Jul 24, 2014, at 10:35 PM, Andrew Ash and...@andrewash.com wrote: Hi Ron, I think you're encountering the issue where caching data from Hadoop ends up with many duplicate values instead of what you expect. Try adding a .clone() to the datum() call. The issue is that Hadoop returns the same object many times but with its contents changed. This is an optimization to prevent allocating and GC'ing an object for every row in Hadoop. This works fine in Hadoop MapReduce because it's single-threaded and there is no caching of the objects. Spark though saves a reference to each object it gets back from Hadoop. So by the end of the partition, Spark ends up with a bunch of references all to the same object! I think it's just by chance that this ends up changing your average to be rounded. Can you try cloning the records in the map call? Also look at the contents and see if they're actually changed, or if the resulting RDD after a cache is just the last record smeared across all the others. Cheers, Andrew On Thu, Jul 24, 2014 at 2:41 PM, Ron Gonzalez zlgonza...@yahoo.com wrote: Hi, I'm doing the following:

def main(args: Array[String]) = {
  val sparkConf = new SparkConf().setAppName("AvroTest").setMaster("local[2]")
  val sc = new SparkContext(sparkConf)
  val conf = new Configuration()
  val job = new Job(conf)
  val path = new Path("/tmp/a.avro")
  val schema = AvroUtils.getSchema(conf, path)
  AvroJob.setInputKeySchema(job, schema)
  val rdd = sc.newAPIHadoopFile(
    path.toString(),
    classOf[AvroKeyInputFormat[GenericRecord]],
    classOf[AvroKey[GenericRecord]],
    classOf[NullWritable],
    conf).map(x => x._1.datum())
  val sum = rdd.map(p => p.get("SEPAL_WIDTH").asInstanceOf[Float]).reduce(_ + _)
  val avg = sum / rdd.count()
  println(s"Sum = $sum")
  println(s"Avg = $avg")
}

If I run this, it works as expected. When I add .cache() to

  val rdd = sc.newAPIHadoopFile(
    path.toString(),
    classOf[AvroKeyInputFormat[GenericRecord]],
    classOf[AvroKey[GenericRecord]],
    classOf[NullWritable],
    conf).map(x => x._1.datum()).cache()

then the average gets rounded up. Any idea why this works this way? Any tips on how to fix this? Thanks, Ron
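A hedged sketch of the copy-before-cache idea for Avro records; GenericData.Record has no public clone(), so this uses Avro's GenericData.deepCopy instead (same job setup as the quoted code, with a defensive copy added in the map):

import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable

val rdd = sc.newAPIHadoopFile(
    path.toString(),
    classOf[AvroKeyInputFormat[GenericRecord]],
    classOf[AvroKey[GenericRecord]],
    classOf[NullWritable],
    conf)
  .map { case (key, _) =>
    val datum = key.datum()
    // Copy the record out of Hadoop's reused container before it is mutated for the next row.
    GenericData.get().deepCopy(datum.getSchema, datum)
  }
  .cache()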
Issue submitting spark job to yarn
Folks, I've been able to submit simple jobs to yarn thus far. However, when I did something more complicated that added 194 dependency jars using --addJars, the job fails in YARN with no logs. What ends up happening is that no container logs get created (app master or executor). If I add just a couple of dependencies, it works, so this is clearly a case of too many dependencies passed into the invocation. Not sure if this means that no container was created at all, but bottom line is that I get no logs that can help me determine what's wrong. Because of the large number of jars, I figured it might have been a permgen issue so I added these options. However, that didn't help. It seems as if the actual submission wasn't even spawned since no container was created or no log was found. Any ideas for this? Thanks, Ron
cache changes precision
Hi, I'm doing the following:

def main(args: Array[String]) = {
  val sparkConf = new SparkConf().setAppName("AvroTest").setMaster("local[2]")
  val sc = new SparkContext(sparkConf)
  val conf = new Configuration()
  val job = new Job(conf)
  val path = new Path("/tmp/a.avro")
  val schema = AvroUtils.getSchema(conf, path)
  AvroJob.setInputKeySchema(job, schema)
  val rdd = sc.newAPIHadoopFile(
    path.toString(),
    classOf[AvroKeyInputFormat[GenericRecord]],
    classOf[AvroKey[GenericRecord]],
    classOf[NullWritable],
    conf).map(x => x._1.datum())
  val sum = rdd.map(p => p.get("SEPAL_WIDTH").asInstanceOf[Float]).reduce(_ + _)
  val avg = sum / rdd.count()
  println(s"Sum = $sum")
  println(s"Avg = $avg")
}

If I run this, it works as expected. When I add .cache() to

  val rdd = sc.newAPIHadoopFile(
    path.toString(),
    classOf[AvroKeyInputFormat[GenericRecord]],
    classOf[AvroKey[GenericRecord]],
    classOf[NullWritable],
    conf).map(x => x._1.datum()).cache()

then the average gets rounded up. Any idea why this works this way? Any tips on how to fix this? Thanks, Ron
Possible bug in ClientBase.scala?
Hi, I was doing programmatic submission of Spark yarn jobs and I saw code in ClientBase.getDefaultYarnApplicationClasspath():

val field = classOf[MRJobConfig].getField("DEFAULT_YARN_APPLICATION_CLASSPATH")

MRJobConfig doesn't have this field, so the created launch env is incomplete. The workaround is to set yarn.application.classpath with the value from YarnConfiguration.DEFAULT_YARN_APPLICATION_CLASSPATH. This results in having the spark job hang if the submission config is different from the default config. For example, if my resource manager port is 8050 instead of 8030, then the spark app is not able to register itself and stays in ACCEPTED state. I can easily fix this by changing this to YarnConfiguration instead of MRJobConfig but was wondering what the steps are for submitting a fix. Thanks, Ron Sent from my iPhone
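A hedged sketch of the workaround described above, populating yarn.application.classpath explicitly from YarnConfiguration's default so the reflective MRJobConfig lookup no longer matters (adjust the entries if the cluster lays its jars out differently from the Hadoop defaults):

import org.apache.hadoop.yarn.conf.YarnConfiguration

val yarnConf = new YarnConfiguration()
// Fill in yarn.application.classpath explicitly for the launch environment.
yarnConf.setStrings(
  YarnConfiguration.YARN_APPLICATION_CLASSPATH,
  YarnConfiguration.DEFAULT_YARN_APPLICATION_CLASSPATH: _*)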
Re: Spark on Yarn: Connecting to Existing Instance
The idea behind YARN is that you can run different application types like MapReduce, Storm and Spark. I would recommend that you build your Spark jobs in the main method without specifying how you deploy them. Then you can use spark-submit to tell Spark how to deploy them, using yarn-cluster as the master. The key point here is that once you have YARN set up, the Spark client connects to it using the $HADOOP_CONF_DIR that contains the resource manager address. In particular, this needs to be accessible from the classpath of the submitter since it implicitly uses this when it instantiates a YarnConfiguration instance. If you want more details, read org.apache.spark.deploy.yarn.Client.scala. You should be able to download a standalone YARN cluster from any of the Hadoop providers like Cloudera or Hortonworks. Once you have that, the Spark programming guide describes what I mention above in sufficient detail for you to proceed. Thanks, Ron Sent from my iPad On Jul 9, 2014, at 8:31 AM, John Omernik j...@omernik.com wrote: I am trying to get my head around using Spark on YARN from the perspective of a cluster. I can start a Spark Shell with no issues in YARN. Works easily. This is done in yarn-client mode and it all works well. In multiple examples, I see instances where people have set up Spark clusters in standalone mode, and then in the examples they connect to this cluster in standalone mode. This is often done using the spark:// string for connection. Cool. But what I don't understand is how do I set up a YARN instance that I can connect to? I.e. I tried running Spark Shell in yarn-cluster mode and it gave me an error, telling me to use yarn-client. I see information on using spark-class or spark-submit. But what I'd really like is an instance I can connect a spark-shell to, and have the instance stay up. I'd like to be able to run other things on that instance etc. Is that possible with YARN? I know there may be long running job challenges with YARN, but I am just testing, I am just curious if I am looking at something completely bonkers here, or just missing something simple. Thanks!
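A minimal sketch of the deployment-agnostic main suggested above (the names and the input-path argument are placeholders); the master and deploy mode then come from spark-submit, e.g. --master yarn-cluster:

import org.apache.spark.{SparkConf, SparkContext}

object MyJob {
  def main(args: Array[String]): Unit = {
    // No setMaster here: spark-submit decides where and how this runs.
    val conf = new SparkConf().setAppName("MyJob")
    val sc = new SparkContext(conf)
    val lineCount = sc.textFile(args(0)).count()   // placeholder work on an input path argument
    println(s"lines = $lineCount")
    sc.stop()
  }
}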
Re: Purpose of spark-submit?
Koert, Yeah I had the same problems trying to do programmatic submission of spark jobs to my Yarn cluster. I was ultimately able to resolve it by reviewing the classpath and debugging through all the different things that the Spark Yarn client (Client.scala) did for submitting to Yarn (like env setup, local resources, etc), and I compared it to what spark-submit had done. I have to admit though that it was far from trivial to get it working out of the box, and perhaps some work could be done in that regards. In my case, it had boiled down to the launch environment not having the HADOOP_CONF_DIR set, which prevented the app master from registering itself with the Resource Manager. Thanks, Ron Sent from my iPad On Jul 9, 2014, at 9:25 AM, Jerry Lam chiling...@gmail.com wrote: Sandy, I experienced the similar behavior as Koert just mentioned. I don't understand why there is a difference between using spark-submit and programmatic execution. Maybe there is something else we need to add to the spark conf/spark context in order to launch spark jobs programmatically that are not needed before? On Wed, Jul 9, 2014 at 12:14 PM, Koert Kuipers ko...@tresata.com wrote: sandy, that makes sense. however i had trouble doing programmatic execution on yarn in client mode as well. the application-master in yarn came up but then bombed because it was looking for jars that dont exist (it was looking in the original file paths on the driver side, which are not available on the yarn node). my guess is that spark-submit is changing some settings (perhaps preparing the distributed cache and modifying settings accordingly), which makes it harder to run things programmatically. i could be wrong however. i gave up debugging and resorted to using spark-submit for now. On Wed, Jul 9, 2014 at 12:05 PM, Sandy Ryza sandy.r...@cloudera.com wrote: Spark still supports the ability to submit jobs programmatically without shell scripts. Koert, The main reason that the unification can't be a part of SparkContext is that YARN and standalone support deploy modes where the driver runs in a managed process on the cluster. In this case, the SparkContext is created on a remote node well after the application is launched. On Wed, Jul 9, 2014 at 8:34 AM, Andrei faithlessfri...@gmail.com wrote: One another +1. For me it's a question of embedding. With SparkConf/SparkContext I can easily create larger projects with Spark as a separate service (just like MySQL and JDBC, for example). With spark-submit I'm bound to Spark as a main framework that defines how my application should look like. In my humble opinion, using Spark as embeddable library rather than main framework and runtime is much easier. On Wed, Jul 9, 2014 at 5:14 PM, Jerry Lam chiling...@gmail.com wrote: +1 as well for being able to submit jobs programmatically without using shell script. we also experience issues of submitting jobs programmatically without using spark-submit. In fact, even in the Hadoop World, I rarely used hadoop jar to submit jobs in shell. On Wed, Jul 9, 2014 at 9:47 AM, Robert James srobertja...@gmail.com wrote: +1 to be able to do anything via SparkConf/SparkContext. Our app worked fine in Spark 0.9, but, after several days of wrestling with uber jars and spark-submit, and so far failing to get Spark 1.0 working, we'd like to go back to doing it ourself with SparkConf. As the previous poster said, a few scripts should be able to give us the classpath and any other params we need, and be a lot more transparent and debuggable.
On 7/9/14, Surendranauth Hiraman suren.hira...@velos.io wrote: Are there any gaps beyond convenience and code/config separation in using spark-submit versus SparkConf/SparkContext if you are willing to set your own config? If there are any gaps, +1 on having parity within SparkConf/SparkContext where possible. In my use case, we launch our jobs programmatically. In theory, we could shell out to spark-submit but it's not the best option for us. So far, we are only using Standalone Cluster mode, so I'm not knowledgeable on the complexities of other modes, though. -Suren On Wed, Jul 9, 2014 at 8:20 AM, Koert Kuipers ko...@tresata.com wrote: not sure I understand why unifying how you submit app for different platforms and dynamic configuration cannot be part of SparkConf and SparkContext? for classpath a simple script similar to hadoop classpath that shows what needs to be added should be sufficient. on spark standalone I can launch a program just fine with just SparkConf and SparkContext. not on yarn, so the spark-launch script must be doing a few things extra there I am missing... which makes things more difficult because I am not sure its realistic to expect every application that needs to run something on spark to be launched using
Re: Purpose of spark-submit?
I am able to use Client.scala or LauncherExecutor.scala as my programmatic entry point for Yarn. Thanks, Ron Sent from my iPad On Jul 9, 2014, at 7:14 AM, Jerry Lam chiling...@gmail.com wrote: +1 as well for being able to submit jobs programmatically without using shell script. we also experience issues of submitting jobs programmatically without using spark-submit. In fact, even in the Hadoop World, I rarely used hadoop jar to submit jobs in shell. On Wed, Jul 9, 2014 at 9:47 AM, Robert James srobertja...@gmail.com wrote: +1 to be able to do anything via SparkConf/SparkContext. Our app worked fine in Spark 0.9, but, after several days of wrestling with uber jars and spark-submit, and so far failing to get Spark 1.0 working, we'd like to go back to doing it ourself with SparkConf. As the previous poster said, a few scripts should be able to give us the classpath and any other params we need, and be a lot more transparent and debuggable. On 7/9/14, Surendranauth Hiraman suren.hira...@velos.io wrote: Are there any gaps beyond convenience and code/config separation in using spark-submit versus SparkConf/SparkContext if you are willing to set your own config? If there are any gaps, +1 on having parity within SparkConf/SparkContext where possible. In my use case, we launch our jobs programmatically. In theory, we could shell out to spark-submit but it's not the best option for us. So far, we are only using Standalone Cluster mode, so I'm not knowledgeable on the complexities of other modes, though. -Suren On Wed, Jul 9, 2014 at 8:20 AM, Koert Kuipers ko...@tresata.com wrote: not sure I understand why unifying how you submit app for different platforms and dynamic configuration cannot be part of SparkConf and SparkContext? for classpath a simple script similar to hadoop classpath that shows what needs to be added should be sufficient. on spark standalone I can launch a program just fine with just SparkConf and SparkContext. not on yarn, so the spark-launch script must be doing a few things extra there I am missing... which makes things more difficult because I am not sure its realistic to expect every application that needs to run something on spark to be launched using spark-submit. On Jul 9, 2014 3:45 AM, Patrick Wendell pwend...@gmail.com wrote: It fulfills a few different functions. The main one is giving users a way to inject Spark as a runtime dependency separately from their program and make sure they get exactly the right version of Spark. So a user can bundle an application and then use spark-submit to send it to different types of clusters (or using different versions of Spark). It also unifies the way you bundle and submit an app for Yarn, Mesos, etc... this was something that became very fragmented over time before this was added. Another feature is allowing users to set configuration values dynamically rather than compile them inside of their program. That's the one you mention here. You can choose to use this feature or not. If you know your configs are not going to change, then you don't need to set them with spark-submit. On Wed, Jul 9, 2014 at 10:22 AM, Robert James srobertja...@gmail.com wrote: What is the purpose of spark-submit? Does it do anything outside of the standard val conf = new SparkConf ... val sc = new SparkContext ... ? -- SUREN HIRAMAN, VP TECHNOLOGY Velos Accelerating Machine Learning 440 NINTH AVENUE, 11TH FLOOR NEW YORK, NY 10001 O: (917) 525-2466 ext. 105 F: 646.349.4063 E: suren.hiraman@v suren.hira...@sociocast.comelos.io W: www.velos.io
Re: Setting queue for spark job on yarn
Btw, I'm on 0.9.1. Will setting a queue programmatically be available in 1.0? Thanks, Ron Sent from my iPad On May 20, 2014, at 6:27 PM, Ron Gonzalez zlgonza...@yahoo.com wrote: Hi Sandy, Is there a programmatic way? We're building a platform as a service and need to assign it to different queues that can provide different scheduler approaches. Thanks, Ron Sent from my iPhone On May 20, 2014, at 1:30 PM, Sandy Ryza sandy.r...@cloudera.com wrote: Hi Ron, What version are you using? For 0.9, you need to set it outside your code with the SPARK_YARN_QUEUE environment variable. -Sandy On Mon, May 19, 2014 at 9:29 PM, Ron Gonzalez zlgonza...@yahoo.com wrote: Hi, How does one submit a spark job to yarn and specify a queue? The code that successfully submits to yarn is:

val conf = new SparkConf()
val sc = new SparkContext("yarn-client", "Simple App", conf)

Where do I need to specify the queue? Thanks in advance for any help on this... Thanks, Ron
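For later releases, a hedged sketch of the programmatic route, assuming a Spark version that honors the spark.yarn.queue property (the queue name is a placeholder):

import org.apache.spark.{SparkConf, SparkContext}

// Ask YARN to place the application in a specific scheduler queue.
val conf = new SparkConf()
  .setAppName("Simple App")
  .set("spark.yarn.queue", "analytics")   // placeholder queue name
val sc = new SparkContext("yarn-client", "Simple App", conf)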
Setting queue for spark job on yarn
Hi, How does one submit a spark job to yarn and specify a queue? The code that successfully submits to yarn is:

val conf = new SparkConf()
val sc = new SparkContext("yarn-client", "Simple App", conf)

Where do I need to specify the queue? Thanks in advance for any help on this... Thanks, Ron
Re: Job initialization performance of Spark standalone mode vs YARN
Hi, Can you explain a little more what's going on? Which one submits a job to the yarn cluster that creates an application master and spawns containers for the local jobs? I tried yarn-client and submitted to our yarn cluster and it seems to work that way. Shouldn't Client.scala be running within the AppMaster instance in this run mode? How exactly does yarn-standalone work? Thanks, Ron Sent from my iPhone On Apr 3, 2014, at 11:19 AM, Kevin Markey kevin.mar...@oracle.com wrote: We are now testing precisely what you ask about in our environment. But Sandy's questions are relevant. The bigger issue is not Spark vs. Yarn but client vs. standalone and where the client is located on the network relative to the cluster. The client options that locate the client/master remote from the cluster, while useful for interactive queries, suffer from considerable network traffic overhead as the master schedules and transfers data with the worker nodes on the cluster. The standalone options locate the master/client on the cluster. In yarn-standalone, the master is a thread contained by the Yarn Resource Manager. Lots less traffic, as the master is co-located with the worker nodes on the cluster and its scheduling/data communication has less latency. In my comparisons between yarn-client and yarn-standalone (so as not to conflate yarn vs Spark), yarn-client computation time is at least double yarn-standalone! At least for a job with lots of stages and lots of client/worker communication, although rather few collect actions, so it's mainly scheduling that's relevant here. I'll be posting more information as I have it available. Kevin On 03/03/2014 03:48 PM, Sandy Ryza wrote: Are you running in yarn-standalone mode or yarn-client mode? Also, what YARN scheduler and what NodeManager heartbeat? On Sun, Mar 2, 2014 at 9:41 PM, polkosity polkos...@gmail.com wrote: Thanks for the advice Mayur. I thought I'd report back on the performance difference... Spark standalone mode has executors processing at capacity in under a second :)
Re: Avro serialization
Thanks, will take a look... Sent from my iPad On Apr 3, 2014, at 7:49 AM, FRANK AUSTIN NOTHAFT fnoth...@berkeley.edu wrote: We use avro objects in our project, and have a Kryo serializer for generic Avro SpecificRecords. Take a look at: https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/edu/berkeley/cs/amplab/adam/serialization/ADAMKryoRegistrator.scala Also, Matt Massie has a good blog post about this at http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/. Frank Austin Nothaft fnoth...@berkeley.edu fnoth...@eecs.berkeley.edu 202-340-0466 On Thu, Apr 3, 2014 at 7:16 AM, Ian O'Connell i...@ianoconnell.com wrote: Objects being transformed need to be one of these in flight. Source data can just use the mapreduce input formats, so anything you can do with mapred. For doing an avro one for this you probably want one of: https://github.com/kevinweil/elephant-bird/blob/master/core/src/main/java/com/twitter/elephantbird/mapreduce/input/*ProtoBuf* or just whatever you're using at the moment to open them in a MR job could probably be re-purposed On Thu, Apr 3, 2014 at 7:11 AM, Ron Gonzalez zlgonza...@yahoo.com wrote: Hi, I know that sources need to either be java serializable or use kryo serialization. Does anyone have sample code that reads, transforms and writes avro files in spark? Thanks, Ron
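A hedged sketch of wiring a custom Kryo registrator into a job (the registrator's package and the record classes to register are placeholders; the ADAMKryoRegistrator linked above shows a fuller Avro-specific serializer):

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

class AvroRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    // Register the record classes the job shuffles; replace/extend with your generated
    // SpecificRecord classes (and a dedicated serializer if needed).
    kryo.register(classOf[org.apache.avro.generic.GenericData.Record])
  }
}

// In the driver, point Spark at the registrator through the usual serializer settings.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "com.example.AvroRegistrator")   // placeholder fully qualified name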
Submitting to yarn cluster
Hi, I have a small program but I cannot seem to make it connect to the right properties of the cluster. I have the SPARK_YARN_APP_JAR, SPARK_JAR and SPARK_HOME set properly. If I run this scala file, I am seeing that this is never using the yarn.resourcemanager.address property that I set on the SparkConf instance. Any advice? Thanks, Ron

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.deploy.yarn.Client
import java.lang.System
import org.apache.spark.SparkConf

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "/home/rgonzalez/app/spark-0.9.0-incubating-bin-hadoop2/README.md"
    val conf = new SparkConf()
    conf.set("yarn.resourcemanager.address", "localhost:8050")
    val sc = new SparkContext("yarn-client", "Simple App", conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}