Get application id when using SparkSubmit.main from java

2018-04-20 Thread Ron Gonzalez
Hi,  I am trying to get the application id after I use SparkSubmit.main for a 
YARN submission. I am able to make it asynchronous with the 
spark.yarn.submit.waitAppCompletion=false configuration option, but I can't 
figure out how to get the application id for the job. I have read both 
SparkSubmit.scala and Client.scala. Any thoughts on how I could do it? I'd 
prefer not to use Client.run directly, even though it returns the application 
id, since there's a lot of environment prep that SparkSubmit actually does. 
Thanks in advance for any help...
Thanks,
Ron
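One route that avoids going through SparkSubmit.main at all is Spark's launcher API (SparkLauncher/SparkAppHandle, available since Spark 1.6), which exposes the application id on the handle. A minimal sketch, with the jar path and main class as hypothetical placeholders:

    import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

    // Launch asynchronously and poll the handle for the YARN application id.
    val handle: SparkAppHandle = new SparkLauncher()
      .setAppResource("/path/to/my-app.jar")        // hypothetical jar
      .setMainClass("com.example.MyJob")            // hypothetical main class
      .setMaster("yarn")
      .setConf("spark.yarn.submit.waitAppCompletion", "false")
      .startApplication()

    // The id becomes available once YARN accepts the submission.
    while (handle.getAppId == null && !handle.getState.isFinal) Thread.sleep(500)
    println(s"applicationId = ${handle.getAppId}")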

Re: Get full RDD lineage for a spark job

2017-07-23 Thread Ron Gonzalez
Cool thanks. Will give that a try...
--Ron 

On Friday, July 21, 2017 8:09 PM, Keith Chapman <keithgchap...@gmail.com> 
wrote:
 

 You could also enable it with --conf spark.logLineage=true if you do not want 
to change any code.

Regards, Keith.
http://keith-chapman.com

On Fri, Jul 21, 2017 at 7:57 PM, Keith Chapman <keithgchap...@gmail.com> wrote:

Hi Ron,
You can try the toDebugString method on the RDD; it will print the RDD 
lineage.
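A minimal sketch, assuming an existing SparkContext named sc and a hypothetical input path:

    // Build a small pipeline and print its lineage, one line per RDD.
    val counts = sc.textFile("/tmp/input.txt")
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)
    println(counts.toDebugString)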
Regards, Keith.
http://keith-chapman.com

On Fri, Jul 21, 2017 at 11:24 AM, Ron Gonzalez <zlgonza...@yahoo.com.invalid> 
wrote:

Hi,  Can someone point me to a test case or share sample code that can 
extract the RDD graph from a Spark job at any point during its lifecycle? I 
understand that Spark has a UI that can show the execution graph, so I'm 
hoping it uses some API that I could use as well. I know the RDD graph is the 
actual execution graph, so if there is also a more logical abstraction closer 
to calls like map, filter, aggregate, etc., that would be even better. 
Appreciate any help...
Thanks,
Ron





   

Get full RDD lineage for a spark job

2017-07-21 Thread Ron Gonzalez
Hi,  Can someone point me to a test case or share sample code that can 
extract the RDD graph from a Spark job at any point during its lifecycle? I 
understand that Spark has a UI that can show the execution graph, so I'm 
hoping it uses some API that I could use as well. I know the RDD graph is the 
actual execution graph, so if there is also a more logical abstraction closer 
to calls like map, filter, aggregate, etc., that would be even better. 
Appreciate any help...
Thanks,
Ron

Losing files in hdfs after creating spark sql table

2015-07-30 Thread Ron Gonzalez

Hi,
  After I create a table in Spark SQL and load an HDFS file into it, the 
file is no longer visible when I do hadoop fs -ls.

  Is this expected?

Thanks,
Ron
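For reference, a minimal sketch of the kind of load in question (assuming a HiveContext; the table name and path are hypothetical). Hive-style LOAD DATA INPATH moves the source file into the table's warehouse directory rather than copying it, which would explain the file disappearing from its original HDFS location:

    // LOAD DATA INPATH moves (not copies) the HDFS file into the table location.
    hiveContext.sql("CREATE TABLE events (id INT, name STRING)")
    hiveContext.sql("LOAD DATA INPATH '/user/me/events.txt' INTO TABLE events")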




Question on Spark SQL for a directory

2015-07-21 Thread Ron Gonzalez

Hi,
  Question on using spark sql.
  Can someone give an example for creating table from a directory 
containing parquet files in HDFS instead of an actual parquet file?


Thanks,
Ron
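A minimal sketch of one way to do this (the path is hypothetical), assuming Spark 1.4+ where the DataFrame reader accepts a directory of parquet files:

    // Point the reader at the directory, not an individual file.
    val df = sqlContext.read.parquet("hdfs:///data/events_parquet/")
    df.registerTempTable("events")            // queryable as a table
    sqlContext.sql("SELECT count(*) FROM events").show()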

On 07/21/2015 01:59 PM, Brandon White wrote:

A few questions about caching a table in Spark SQL.

1) Is there any difference between caching the dataframe and the table?

df.cache() vs sqlContext.cacheTable(tableName)

2) Do you need to warm up the cache before seeing the performance 
benefits? Is the cache LRU? Do you need to run some queries on the 
table before it is cached in memory?


3) Is caching the table much faster than .saveAsTable? I am only 
seeing a 10%-20% performance increase.






Re: Classifier for Big Data Mining

2015-07-21 Thread Ron Gonzalez
I'd use Random Forest. It will give you better generalizability. There 
are also a number of things you can do with RF that allow you to train on 
samples of the massive data set and then average over the resulting 
models...


Thanks,
Ron
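A minimal MLlib sketch of training a Random Forest classifier; the training RDD and all parameter values here are illustrative assumptions, not a recommendation:

    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.tree.RandomForest
    import org.apache.spark.rdd.RDD

    // training is assumed to already exist as an RDD[LabeledPoint].
    def train(training: RDD[LabeledPoint]) =
      RandomForest.trainClassifier(
        training,
        2,                   // numClasses
        Map[Int, Int](),     // categoricalFeaturesInfo (none)
        100,                 // numTrees; more trees generally generalize better
        "auto",              // featureSubsetStrategy
        "gini",              // impurity
        5,                   // maxDepth
        32)                  // maxBins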

On 07/21/2015 02:17 PM, Olivier Girardot wrote:
It depends on your data and, I guess, the time/performance goals you have 
for both training and prediction, but for a quick answer: yes :)


2015-07-21 11:22 GMT+02:00 Chintan Bhatt 
chintanbhatt...@charusat.ac.in:


Which classifier would be useful for mining massive datasets in Spark?
Is a Decision Tree a good choice in terms of scalability?

-- 
CHINTAN BHATT http://in.linkedin.com/pub/chintan-bhatt/22/b31/336/

Assistant Professor,
U  P U Patel Department of Computer Engineering,
Chandubhai S. Patel Institute of Technology,
Charotar University of Science And Technology (CHARUSAT),
Changa-388421, Gujarat, INDIA.
http://www.charusat.ac.in/
_Personal Website_: https://sites.google.com/a/ecchanga.ac.in/chintan/






Re: Basic Spark SQL question

2015-07-14 Thread Ron Gonzalez
Cool thanks. Will take a look...

Sent from my iPhone

 On Jul 13, 2015, at 6:40 PM, Michael Armbrust mich...@databricks.com wrote:
 
 I'd look at the JDBC server (a long-running YARN job you can submit queries 
 to)
 
 https://spark.apache.org/docs/latest/sql-programming-guide.html#running-the-thrift-jdbcodbc-server
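For reference, a minimal sketch of querying the Thrift server from code once it has been started with sbin/start-thriftserver.sh; the host, port, and table name are hypothetical, and the Hive JDBC driver is assumed to be on the classpath:

    import java.sql.DriverManager

    // Older drivers may need an explicit Class.forName("org.apache.hive.jdbc.HiveDriver").
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "", "")
    val rs = conn.createStatement().executeQuery("SELECT count(*) FROM my_table")
    while (rs.next()) println(rs.getLong(1))
    conn.close()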
 
 On Mon, Jul 13, 2015 at 6:31 PM, Jerrick Hoang jerrickho...@gmail.com 
 wrote:
 Well for adhoc queries you can use the CLI
 
 On Mon, Jul 13, 2015 at 5:34 PM, Ron Gonzalez 
 zlgonza...@yahoo.com.invalid wrote:
 Hi,
   I have a question about Spark SQL. Is there a way to use Spark 
 SQL on YARN without having to submit a job each time?
   The bottom line is that I want to reduce the latency of running 
 queries. I know the default Spark SQL submission works like a job, but I was 
 wondering whether it's possible to run queries the way one would against a 
 regular db like MySQL or Oracle.
 
 Thanks,
 Ron
 
 
 


Basic Spark SQL question

2015-07-13 Thread Ron Gonzalez

Hi,
  I have a question about Spark SQL. Is there a way to use 
Spark SQL on YARN without having to submit a job each time?
  The bottom line is that I want to reduce the latency of 
running queries. I know the default Spark SQL submission works 
like a job, but I was wondering whether it's possible to run queries the way 
one would against a regular db like MySQL or Oracle.


Thanks,
Ron





Re: error with pyspark

2014-08-11 Thread Ron Gonzalez
If you're running on Ubuntu, run ulimit -n, which gives the max number of 
allowed open files. You will have to raise the value in 
/etc/security/limits.conf to something like 1, then log out and log back in.

Thanks,
Ron

Sent from my iPad

 On Aug 10, 2014, at 10:19 PM, Davies Liu dav...@databricks.com wrote:
 
 On Fri, Aug 8, 2014 at 9:12 AM, Baoqiang Cao bqcaom...@gmail.com wrote:
 Hi There
 
 I ran into a problem and can’t find a solution.
 
 I was running bin/pyspark  ../python/wordcount.py
 
 you could use bin/spark-submit  ../python/wordcount.py
 
 The wordcount.py is here:
 
 
 import sys
 from operator import add
 
 from pyspark import SparkContext
 
 datafile = '/mnt/data/m1.txt'
 
 sc = SparkContext()
 outfile = datafile + '.freq'
 lines = sc.textFile(datafile, 1)
 counts = lines.flatMap(lambda x: x.split(' ')) \
.map(lambda x: (x, 1)) \
.reduceByKey(add)
 output = counts.collect()
 
 outf = open(outfile, 'w')
 
 for (word, count) in output:
   outf.write(word.encode('utf-8') + '\t' + str(count) + '\n')
 outf.close()
 
 
 
 The error message is here:
 
 14/08/08 16:01:59 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 0)
 java.io.FileNotFoundException:
 /tmp/spark-local-20140808160150-d36b/12/shuffle_0_0_468 (Too many open
 files)
 
 This message means that Spark (the JVM) had reached the max number of open 
 files; there is an fd leak somewhere, but unfortunately I cannot reproduce 
 this problem. What is the version of Spark?
 
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.init(FileOutputStream.java:221)
at
 org.apache.spark.storage.DiskBlockObjectWriter.open(BlockObjectWriter.scala:107)
at
 org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:175)
at
 org.apache.spark.shuffle.hash.HashShuffleWriter$$anonfun$write$1.apply(HashShuffleWriter.scala:67)
at
 org.apache.spark.shuffle.hash.HashShuffleWriter$$anonfun$write$1.apply(HashShuffleWriter.scala:65)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at
 org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:65)
at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at
 org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
at
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
 
 
 The m1.txt is about 4G, and I have 120GB Ram and used -Xmx120GB. It is on
 Ubuntu. Any help please?
 
 Best
 Baoqiang Cao
 Blog: http://baoqiang.org
 Email: bqcaom...@gmail.com
 
 




Re: Save an RDD to a SQL Database

2014-08-06 Thread Ron Gonzalez
Hi Vida,
  It's possible to save an RDD as a Hadoop file using Hadoop output formats. It 
might be worthwhile to investigate DBOutputFormat and see if it will work 
for you.
  I haven't personally written to a db, but I'd imagine this would be one way 
to do it.

Thanks,
Ron

Sent from my iPhone

 On Aug 5, 2014, at 8:29 PM, Vida Ha vid...@gmail.com wrote:
 
 
 Hi,
 
 I would like to save an RDD to a SQL database. It seems like this would be a 
 common enough use case. Are there any built-in libraries to do it?
 
 Otherwise, I'm just planning on mapping my RDD and having that call a method 
 to write to the database. Given that a lot of records are going to be 
 written, the code would need to be smart and do a batch insert after enough 
 records have accumulated. Does that sound like a reasonable approach?
 
 
 -Vida
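A minimal sketch of the batch-insert approach Vida describes, assuming an RDD of (Int, String) pairs; the JDBC URL, credentials, and table layout are hypothetical, and the JDBC driver is assumed to be on the executor classpath:

    import java.sql.DriverManager

    rdd.foreachPartition { rows =>
      // One connection and one batch per partition.
      val conn = DriverManager.getConnection("jdbc:mysql://dbhost/mydb", "user", "pass")
      val stmt = conn.prepareStatement("INSERT INTO events (id, name) VALUES (?, ?)")
      try {
        rows.foreach { case (id, name) =>
          stmt.setInt(1, id)
          stmt.setString(2, name)
          stmt.addBatch()
        }
        stmt.executeBatch()   // for very large partitions, flush periodically instead
      } finally {
        stmt.close()
        conn.close()
      }
    }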
 




Re: Computing mean and standard deviation by key

2014-08-04 Thread Ron Gonzalez
Cool thanks! 


On Monday, August 4, 2014 8:58 AM, kriskalish k...@kalish.net wrote:
 


Hey Ron, 

It was pretty much exactly as Sean had depicted. I just needed to provide
count with an anonymous function to tell it which elements to count. Since I
wanted to count them all, the function is simply x => true.

        val grouped = rdd.groupByKey().mapValues { mcs =>
          val values = mcs.map(_.foo.toDouble)
          val n = values.count(x => true)
          val sum = values.sum
          val sumSquares = values.map(x => x * x).sum
          val stddev = math.sqrt(n * sumSquares - sum * sum) / n
          print("stddev: " + stddev)
          stddev
        }


I hope that helps
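A sketch of the reduceByKey-style alternative mentioned in the thread, using Spark's StatCounter with aggregateByKey (available in Spark 1.1+); it assumes an rdd of (key, Double) pairs:

    import org.apache.spark.util.StatCounter

    // One StatCounter per key: merge values within partitions, then merge counters.
    val statsByKey = rdd.aggregateByKey(new StatCounter())(
      (acc, value) => acc.merge(value),
      (acc1, acc2) => acc1.merge(acc2))
    val meanAndStddev = statsByKey.mapValues(s => (s.mean, s.stdev))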





Re: Issues with HDP 2.4.0.2.1.3.0-563

2014-08-04 Thread Ron Gonzalez
One key thing I forgot to mention is that I changed the avro version to 1.7.7 
to get AVRO-1476.

I took a closer look at the jars, and what I noticed is that the assembly jars 
that work do not have the org.apache.avro.mapreduce package packaged into the 
assembly. For spark-1.0.1, org.apache.avro.mapreduce is always found. When 
creating an assembly from an older download of Spark 1.0.0, this package 
doesn't exist. In a recent download of Spark 1.0.0, the generated assembly with 
any HDP version also has org.apache.avro.mapreduce. I recompiled against the 
new download, and it also has the same problems even with an older version of 
HDP.

So I think the bottom-line issue here is that the generated assemblies that 
include org.apache.avro.mapreduce seem to cause this problem. If I use the older 
Spark 1.0.0 version, I am able to create assemblies that work. I noticed that 
assemblies generated from the newer versions are indeed bigger, so it seems a 
bug was fixed to ensure that all dependencies are pulled into the final 
assembly, but that fix now causes the symptom I have reported...

Thanks,
Ron


On Monday, August 4, 2014 10:39 AM, Steve Nunez snu...@hortonworks.com wrote:
 


Hmm. Fair enough. I hadn't given that answer much thought and on
reflection think you're right in that a profile would just be a bad hack.




On 8/4/14, 10:35, Sean Owen so...@cloudera.com wrote:

What would such a profile do though? In general, building for a
specific vendor version means setting hadoop.version and/or
yarn.version. Any hard-coded value is unlikely to match what a
particular user needs. Setting protobuf versions and so on is already
done by the generic profiles.

In a similar vein, I am not clear on why there's a mapr profile in the
build. Its versions are about to be out of date and won't work with
upcoming Hbase changes for example.

(Elsewhere in the build I think it wouldn't hurt to clear out
cloudera-specific profiles and releases too -- they're not in the pom
but are in the distribution script. It's the vendor's problem.)

This isn't any argument about being purist but just that I am not sure
these are things that the project can meaningfully bother with.

It makes sense to set vendor repos in the pom for convenience, and
makes sense to run smoke tests in Jenkins against particular versions.

$0.02
Sean

On Mon, Aug 4, 2014 at 6:21 PM, Steve Nunez snu...@hortonworks.com
wrote:
 I don't think there is an hwx profile, but there probably should be.






Re: Is there a way to write spark RDD to Avro files

2014-08-01 Thread Ron Gonzalez
You have to import org.apache.spark.rdd._, which will automatically make 
this method available.
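For reference, a sketch of writing an RDD[GenericRecord] back out as Avro with saveAsNewAPIHadoopFile; records and schema are assumed to already exist, the output path is hypothetical, and Job.getInstance assumes Hadoop 2:

    import org.apache.avro.Schema
    import org.apache.avro.generic.GenericRecord
    import org.apache.avro.mapred.AvroKey
    import org.apache.avro.mapreduce.{AvroJob, AvroKeyOutputFormat}
    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapreduce.Job
    import org.apache.spark.SparkContext._   // pair-RDD implicits (needed before Spark 1.3)
    import org.apache.spark.rdd.RDD

    def writeAvro(records: RDD[GenericRecord], schema: Schema): Unit = {
      val job = Job.getInstance()
      AvroJob.setOutputKeySchema(job, schema)
      records
        .map(r => (new AvroKey[GenericRecord](r), NullWritable.get()))
        .saveAsNewAPIHadoopFile(
          "/tmp/out.avro",                              // hypothetical output directory
          classOf[AvroKey[GenericRecord]],
          classOf[NullWritable],
          classOf[AvroKeyOutputFormat[GenericRecord]],
          job.getConfiguration)
    }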

Thanks,
Ron

Sent from my iPhone

 On Aug 1, 2014, at 3:26 PM, touchdown yut...@gmail.com wrote:
 
 Hi, I am facing a similar dilemma. I am trying to aggregate a bunch of small
 avro files into one avro file. I read it in with:
 sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable,
 AvroKeyInputFormat[GenericRecord]](path)
 
 but I can't find saveAsHadoopFile or saveAsNewAPIHadoopFile. Can you please
 tell us how it worked for you? Thanks!
 
 
 


Re: Computing mean and standard deviation by key

2014-08-01 Thread Ron Gonzalez
Can you share the mapValues approach you used? 

Thanks,
Ron

Sent from my iPhone

 On Aug 1, 2014, at 3:00 PM, kriskalish k...@kalish.net wrote:
 
 Thanks for the help everyone. I got the mapValues approach working. I will
 experiment with the reduceByKey approach later.
 
 
 -Kris
 
 
 
 


NotSerializableException

2014-07-30 Thread Ron Gonzalez
Hi,
  I took Avro 1.7.7 and recompiled my distribution to fix the issue I was 
hitting with Avro GenericRecord (AVRO-1476); that issue is now resolved.
  I also enabled Kryo registration in SparkConf.
  That said, I am still seeing a NotSerializableException for 
Schema$RecordSchema. Do I need to do anything else?
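One possible workaround, sketched here under the assumption that a Schema instance is being captured in a closure and has to go through Kryo (the class names are hypothetical; avoiding the capture, or using a library such as chill-avro, are alternatives), is to register a serializer that round-trips the schema through its JSON form:

    import com.esotericsoftware.kryo.{Kryo, Serializer}
    import com.esotericsoftware.kryo.io.{Input, Output}
    import org.apache.avro.Schema
    import org.apache.spark.serializer.KryoRegistrator

    // Serializes any Schema subclass (including Schema$RecordSchema) as its JSON string.
    class AvroSchemaSerializer extends Serializer[Schema] {
      override def write(kryo: Kryo, output: Output, schema: Schema): Unit =
        output.writeString(schema.toString)
      override def read(kryo: Kryo, input: Input, cls: Class[Schema]): Schema =
        new Schema.Parser().parse(input.readString())
    }

    class MyAvroRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo): Unit =
        kryo.addDefaultSerializer(classOf[Schema], new AvroSchemaSerializer)
    }

    // In the driver configuration:
    //   spark.serializer       = org.apache.spark.serializer.KryoSerializer
    //   spark.kryo.registrator = <fully-qualified name of MyAvroRegistrator>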

Thanks,
Ron

Sent from my iPad

Re: cache changes precision

2014-07-25 Thread Ron Gonzalez
Cool I'll take a look and give it a try!

Thanks,
Ron

Sent from my iPad

 On Jul 24, 2014, at 10:35 PM, Andrew Ash and...@andrewash.com wrote:
 
 Hi Ron,
 
 I think you're encountering the issue where caching data from Hadoop ends up 
 with many duplicate values instead of what you expect.  Try adding a .clone() 
 to the datum() call.
 
 The issue is that Hadoop returns the same object many times but with its 
 contents changed.  This is an optimization to prevent allocating and GC'ing 
 an object for every row in Hadoop.  This works fine in Hadoop MapReduce 
 because it's single-threaded and does no caching of the objects.
 
 Spark though saves a reference to each object it gets back from Hadoop.  So 
 by the end of the partition, Spark ends up with a bunch of references all to 
 the same object!  I think it's just by chance that this ends up changing your 
 average to be rounded.
 
 Can you try with cloning the records in the map call?  Also look at the 
 contents and see if they're actually changed, or if the resulting RDD after a 
 cache is just the last record smeared across all the others.
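A sketch of the copy Andrew suggests, applied to the newAPIHadoopFile call quoted below (GenericRecord has no public clone(), so Avro's deepCopy is used here instead; the imports, path, and conf values are taken from that snippet):

    import org.apache.avro.generic.{GenericData, GenericRecord}

    // Copy each reused Hadoop record before caching it.
    val rdd = sc.newAPIHadoopFile(
        path.toString(),
        classOf[AvroKeyInputFormat[GenericRecord]],
        classOf[AvroKey[GenericRecord]],
        classOf[NullWritable], conf)
      .map { x =>
        val datum = x._1.datum()
        GenericData.get().deepCopy(datum.getSchema, datum): GenericRecord
      }
      .cache()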
 
 Cheers,
 Andrew
 
 
 On Thu, Jul 24, 2014 at 2:41 PM, Ron Gonzalez zlgonza...@yahoo.com wrote:
 Hi,
   I'm doing the following:
 
    def main(args: Array[String]) = {
      val sparkConf = new SparkConf().setAppName("AvroTest").setMaster("local[2]")
      val sc = new SparkContext(sparkConf)
      val conf = new Configuration()
      val job = new Job(conf)
      val path = new Path("/tmp/a.avro")
      val schema = AvroUtils.getSchema(conf, path)
  
      AvroJob.setInputKeySchema(job, schema)
  
      val rdd = sc.newAPIHadoopFile(
         path.toString(),
         classOf[AvroKeyInputFormat[GenericRecord]],
         classOf[AvroKey[GenericRecord]],
         classOf[NullWritable], conf).map(x => x._1.datum())
      val sum = rdd.map(p => p.get("SEPAL_WIDTH").asInstanceOf[Float]).reduce(_ + _)
      val avg = sum / rdd.count()
      println(s"Sum = $sum")
      println(s"Avg = $avg")
    }
  
  If I run this, it works as expected. When I add .cache() to
  
  val rdd = sc.newAPIHadoopFile(
         path.toString(),
         classOf[AvroKeyInputFormat[GenericRecord]],
         classOf[AvroKey[GenericRecord]],
         classOf[NullWritable], conf).map(x => x._1.datum()).cache()
 
 then the command rounds up the average.
 
 Any idea why this works this way? Any tips on how to fix this?
 
 Thanks,
 Ron
 


Issue submitting spark job to yarn

2014-07-25 Thread Ron Gonzalez
Folks,
  I've been able to submit simple jobs to yarn thus far. However, when I did 
something more complicated that added 194 dependency jars using --addJars, the 
job failed in YARN with no logs. What ends up happening is that no container 
logs get created (app master or executor). If I add just a couple of 
dependencies, it works, so this is clearly a case of too many dependencies 
being passed into the invocation.

  Not sure if this means that no container was created at all, but bottom line 
is that I get no logs that can help me determine what's wrong. Because of the 
large number of jars, I figured it might have been a permgen issue so I added 
these options. However, that didn't help. It seems as if the actual submission 
wasn't even spawned since no container was created or no log was found.

  Any ideas for this?

Thanks,
Ron

cache changes precision

2014-07-24 Thread Ron Gonzalez
Hi,
  I'm doing the following:

  def main(args: Array[String]) = {
    val sparkConf = new SparkConf().setAppName("AvroTest").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)
    val conf = new Configuration()
    val job = new Job(conf)
    val path = new Path("/tmp/a.avro")
    val schema = AvroUtils.getSchema(conf, path)

    AvroJob.setInputKeySchema(job, schema)

    val rdd = sc.newAPIHadoopFile(
       path.toString(),
       classOf[AvroKeyInputFormat[GenericRecord]],
       classOf[AvroKey[GenericRecord]],
       classOf[NullWritable], conf).map(x => x._1.datum())
    val sum = rdd.map(p => p.get("SEPAL_WIDTH").asInstanceOf[Float]).reduce(_ + _)
    val avg = sum / rdd.count()
    println(s"Sum = $sum")
    println(s"Avg = $avg")
  }

If I run this, it works as expected. When I add .cache() to

val rdd = sc.newAPIHadoopFile(
       path.toString(),
       classOf[AvroKeyInputFormat[GenericRecord]],
       classOf[AvroKey[GenericRecord]],
       classOf[NullWritable], conf).map(x => x._1.datum()).cache()

then the command rounds up the average.

Any idea why this works this way? Any tips on how to fix this?

Thanks,
Ron


Possible bug in ClientBase.scala?

2014-07-13 Thread Ron Gonzalez
Hi,
  I was doing programmatic submission of Spark yarn jobs and I saw this code in 
ClientBase.getDefaultYarnApplicationClasspath():

val field = classOf[MRJobConfig].getField("DEFAULT_YARN_APPLICATION_CLASSPATH")

MRJobConfig doesn't have this field, so the created launch env is incomplete. 
A workaround is to set yarn.application.classpath to the value of 
YarnConfiguration.DEFAULT_YARN_APPLICATION_CLASSPATH.
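A sketch of that workaround in code, assuming the Hadoop 2 YarnConfiguration constants:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.yarn.conf.YarnConfiguration

    // Populate yarn.application.classpath explicitly with YARN's default value.
    val conf = new Configuration()
    conf.setStrings(YarnConfiguration.YARN_APPLICATION_CLASSPATH,
      YarnConfiguration.DEFAULT_YARN_APPLICATION_CLASSPATH: _*)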

This causes the Spark job to hang if the submission config is different 
from the default config. For example, if my resource manager port is 8050 
instead of 8030, then the spark app is not able to register itself and stays in 
ACCEPTED state.

I can easily fix this by changing this to YarnConfiguration instead of 
MRJobConfig but was wondering what the steps are for submitting a fix.

Thanks,
Ron

Sent from my iPhone

Re: Spark on Yarn: Connecting to Existing Instance

2014-07-09 Thread Ron Gonzalez
The idea behind YARN is that you can run different application types like 
MapReduce, Storm and Spark.

I would recommend that you build your Spark jobs in the main method without 
specifying how they are deployed. Then you can use spark-submit to tell Spark how 
to deploy them, using yarn-cluster as the master. The key point 
here is that once you have YARN set up, the Spark client connects to it using 
the $HADOOP_CONF_DIR that contains the resource manager address. In particular, 
this needs to be accessible from the classpath of the submitter, since it is 
implicitly used when instantiating a YarnConfiguration instance. If you 
want more details, read org.apache.spark.deploy.yarn.Client.scala.
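A minimal sketch of such a deploy-agnostic job (the object name and argument handling are hypothetical); no setMaster is called, so the same jar can be launched with spark-submit against yarn-cluster or any other master:

    import org.apache.spark.{SparkConf, SparkContext}

    object MyJob {
      def main(args: Array[String]): Unit = {
        // The master comes from spark-submit, not from the code.
        val sc = new SparkContext(new SparkConf().setAppName("MyJob"))
        println(sc.textFile(args(0)).count())
        sc.stop()
      }
    }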

You should be able to download a standalone YARN cluster from any of the Hadoop 
providers like Cloudera or Hortonworks. Once you have that, the spark 
programming guide describes what I mention above in sufficient detail for you 
to proceed.

Thanks,
Ron

Sent from my iPad

 On Jul 9, 2014, at 8:31 AM, John Omernik j...@omernik.com wrote:
 
 I am trying to get my head around using Spark on YARN from the perspective of a 
 cluster. I can start a Spark shell in YARN with no issues. It works easily; this 
 is done in yarn-client mode and it all works well. 
 
 In multiple examples, I see instances where people have set up Spark clusters 
 in standalone mode, and then in the examples they connect to this cluster 
 in standalone mode. This is often done using the spark:// string for the 
 connection. Cool. 
 But what I don't understand is how to set up a YARN instance that I can 
 connect to. I.e., I tried running the Spark shell in yarn-cluster mode and it 
 gave me an error telling me to use yarn-client. I see information on using 
 spark-class or spark-submit. But what I'd really like is an instance I can 
 connect a spark-shell to, and have the instance stay up. I'd like to be able 
 to run other things on that instance, etc. Is that possible with YARN? I know 
 there may be long-running-job challenges with YARN, but I am just testing; I 
 am just curious if I am looking at something completely bonkers here, or just 
 missing something simple. 
 
 Thanks!
 
 


Re: Purpose of spark-submit?

2014-07-09 Thread Ron Gonzalez
Koert,
Yeah I had the same problems trying to do programmatic submission of spark jobs 
to my Yarn cluster. I was ultimately able to resolve it by reviewing the 
classpath and debugging through all the different things that the Spark Yarn 
client (Client.scala) did for submitting to Yarn (like env setup, local 
resources, etc), and I compared it to what spark-submit had done.
I have to admit though that it was far from trivial to get it working out of 
the box, and perhaps some work could be done in that regard. In my case, it 
had boiled down to the launch environment not having the HADOOP_CONF_DIR set, 
which prevented the app master from registering itself with the Resource 
Manager.

Thanks,
Ron

Sent from my iPad

 On Jul 9, 2014, at 9:25 AM, Jerry Lam chiling...@gmail.com wrote:
 
 Sandy, I experienced the similar behavior as Koert just mentioned. I don't 
 understand why there is a difference between using spark-submit and 
 programmatic execution. Maybe there is something else we need to add to the 
 spark conf/spark context in order to launch spark jobs programmatically that 
 are not needed before?
 
 
 
 On Wed, Jul 9, 2014 at 12:14 PM, Koert Kuipers ko...@tresata.com wrote:
 sandy, that makes sense. however i had trouble doing programmatic execution 
 on yarn in client mode as well. the application-master in yarn came up but 
 then bombed because it was looking for jars that dont exist (it was looking 
 in the original file paths on the driver side, which are not available on 
 the yarn node). my guess is that spark-submit is changing some settings 
 (perhaps preparing the distributed cache and modifying settings 
 accordingly), which makes it harder to run things programmatically. i could 
 be wrong however. i gave up debugging and resorted to using spark-submit for 
 now.
 
 
 
 On Wed, Jul 9, 2014 at 12:05 PM, Sandy Ryza sandy.r...@cloudera.com wrote:
 Spark still supports the ability to submit jobs programmatically without 
 shell scripts.
 
 Koert,
 The main reason that the unification can't be a part of SparkContext is 
 that YARN and standalone support deploy modes where the driver runs in a 
 managed process on the cluster.  In this case, the SparkContext is created 
 on a remote node well after the application is launched.
 
 
 On Wed, Jul 9, 2014 at 8:34 AM, Andrei faithlessfri...@gmail.com wrote:
 One another +1. For me it's a question of embedding. With 
 SparkConf/SparkContext I can easily create larger projects with Spark as a 
 separate service (just like MySQL and JDBC, for example). With 
 spark-submit I'm bound to Spark as a main framework that defines how my 
 application should look like. In my humble opinion, using Spark as 
 embeddable library rather than main framework and runtime is much easier. 
 
 
 
 
 On Wed, Jul 9, 2014 at 5:14 PM, Jerry Lam chiling...@gmail.com wrote:
 +1 as well for being able to submit jobs programmatically without using 
 shell script.
 
 we also experience issues of submitting jobs programmatically without 
 using spark-submit. In fact, even in the Hadoop World, I rarely used 
 hadoop jar to submit jobs in shell. 
 
 
 
 On Wed, Jul 9, 2014 at 9:47 AM, Robert James srobertja...@gmail.com 
 wrote:
 +1 to be able to do anything via SparkConf/SparkContext.  Our app
 worked fine in Spark 0.9, but, after several days of wrestling with
 uber jars and spark-submit, and so far failing to get Spark 1.0
 working, we'd like to go back to doing it ourself with SparkConf.
 
 As the previous poster said, a few scripts should be able to give us
 the classpath and any other params we need, and be a lot more
 transparent and debuggable.
 
 On 7/9/14, Surendranauth Hiraman suren.hira...@velos.io wrote:
  Are there any gaps beyond convenience and code/config separation in 
  using
  spark-submit versus SparkConf/SparkContext if you are willing to set 
  your
  own config?
 
  If there are any gaps, +1 on having parity within 
  SparkConf/SparkContext
  where possible. In my use case, we launch our jobs programmatically. In
  theory, we could shell out to spark-submit but it's not the best 
  option for
  us.
 
  So far, we are only using Standalone Cluster mode, so I'm not 
  knowledgeable
  on the complexities of other modes, though.
 
  -Suren
 
 
 
  On Wed, Jul 9, 2014 at 8:20 AM, Koert Kuipers ko...@tresata.com 
  wrote:
 
  not sure I understand why unifying how you submit app for different
  platforms and dynamic configuration cannot be part of SparkConf and
  SparkContext?
 
  for classpath a simple script similar to hadoop classpath that shows
  what needs to be added should be sufficient.
 
  on spark standalone I can launch a program just fine with just 
  SparkConf
  and SparkContext. not on yarn, so the spark-launch script must be 
  doing a
  few things extra there I am missing... which makes things more 
  difficult
  because I am not sure its realistic to expect every application that
  needs
  to run something on spark to be launched using 

Re: Purpose of spark-submit?

2014-07-09 Thread Ron Gonzalez
I am able to use Client.scala or LauncherExecutor.scala as my programmatic 
entry point for Yarn.

Thanks,
Ron

Sent from my iPad

 On Jul 9, 2014, at 7:14 AM, Jerry Lam chiling...@gmail.com wrote:
 
 +1 as well for being able to submit jobs programmatically without using shell 
 script.
 
 we also experience issues of submitting jobs programmatically without using 
 spark-submit. In fact, even in the Hadoop World, I rarely used hadoop jar 
 to submit jobs in shell. 
 
 
 
 On Wed, Jul 9, 2014 at 9:47 AM, Robert James srobertja...@gmail.com wrote:
 +1 to be able to do anything via SparkConf/SparkContext.  Our app
 worked fine in Spark 0.9, but, after several days of wrestling with
 uber jars and spark-submit, and so far failing to get Spark 1.0
 working, we'd like to go back to doing it ourself with SparkConf.
 
 As the previous poster said, a few scripts should be able to give us
 the classpath and any other params we need, and be a lot more
 transparent and debuggable.
 
 On 7/9/14, Surendranauth Hiraman suren.hira...@velos.io wrote:
  Are there any gaps beyond convenience and code/config separation in using
  spark-submit versus SparkConf/SparkContext if you are willing to set your
  own config?
 
  If there are any gaps, +1 on having parity within SparkConf/SparkContext
  where possible. In my use case, we launch our jobs programmatically. In
  theory, we could shell out to spark-submit but it's not the best option for
  us.
 
  So far, we are only using Standalone Cluster mode, so I'm not knowledgeable
  on the complexities of other modes, though.
 
  -Suren
 
 
 
  On Wed, Jul 9, 2014 at 8:20 AM, Koert Kuipers ko...@tresata.com wrote:
 
  not sure I understand why unifying how you submit app for different
  platforms and dynamic configuration cannot be part of SparkConf and
  SparkContext?
 
  for classpath a simple script similar to hadoop classpath that shows
  what needs to be added should be sufficient.
 
  on spark standalone I can launch a program just fine with just SparkConf
  and SparkContext. not on yarn, so the spark-launch script must be doing a
  few things extra there I am missing... which makes things more difficult
  because I am not sure its realistic to expect every application that
  needs
  to run something on spark to be launched using spark-submit.
   On Jul 9, 2014 3:45 AM, Patrick Wendell pwend...@gmail.com wrote:
 
  It fulfills a few different functions. The main one is giving users a
  way to inject Spark as a runtime dependency separately from their
  program and make sure they get exactly the right version of Spark. So
  a user can bundle an application and then use spark-submit to send it
  to different types of clusters (or using different versions of Spark).
 
  It also unifies the way you bundle and submit an app for Yarn, Mesos,
  etc... this was something that became very fragmented over time before
  this was added.
 
  Another feature is allowing users to set configuration values
  dynamically rather than compile them inside of their program. That's
  the one you mention here. You can choose to use this feature or not.
  If you know your configs are not going to change, then you don't need
  to set them with spark-submit.
 
 
  On Wed, Jul 9, 2014 at 10:22 AM, Robert James srobertja...@gmail.com
  wrote:
   What is the purpose of spark-submit? Does it do anything outside of
   the standard val conf = new SparkConf ... val sc = new SparkContext
   ... ?
 
 
 
 
  --
 
  SUREN HIRAMAN, VP TECHNOLOGY
  Velos
  Accelerating Machine Learning
 
  440 NINTH AVENUE, 11TH FLOOR
  NEW YORK, NY 10001
  O: (917) 525-2466 ext. 105
  F: 646.349.4063
  E: suren.hira...@velos.io
  W: www.velos.io
 
 


Re: Setting queue for spark job on yarn

2014-05-21 Thread Ron Gonzalez
Btw, I'm on 0.9.1. Will setting a queue programmatically be available in 1.0?
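For reference, a sketch of setting the queue programmatically where it is supported as a configuration property (spark.yarn.queue; the queue name here is hypothetical, and spark-submit's --queue option is the non-programmatic equivalent):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().set("spark.yarn.queue", "myQueue")   // hypothetical queue name
    val sc = new SparkContext("yarn-client", "Simple App", conf)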

Thanks,
Ron

Sent from my iPad

 On May 20, 2014, at 6:27 PM, Ron Gonzalez zlgonza...@yahoo.com wrote:
 
 Hi Sandy,
   Is there a programmatic way? We're building a platform as a service and 
 need to assign it to different queues that can provide different scheduler 
 approaches.
 
 Thanks,
 Ron
 
 Sent from my iPhone
 
 On May 20, 2014, at 1:30 PM, Sandy Ryza sandy.r...@cloudera.com wrote:
 
 Hi Ron,
 
 What version are you using?  For 0.9, you need to set it outside your code 
 with the SPARK_YARN_QUEUE environment variable.
 
 -Sandy
 
 
 On Mon, May 19, 2014 at 9:29 PM, Ron Gonzalez zlgonza...@yahoo.com wrote:
 Hi,
   How does one submit a spark job to yarn and specify a queue?
   The code that successfully submits to yarn is:
 
val conf = new SparkConf()
	val sc = new SparkContext("yarn-client", "Simple App", conf)
 
Where do I need to specify the queue?
 
   Thanks in advance for any help on this...
 
 Thanks,
 Ron
 


Setting queue for spark job on yarn

2014-05-19 Thread Ron Gonzalez
Hi,
  How does one submit a spark job to yarn and specify a queue?
  The code that successfully submits to yarn is:

   val conf = new SparkConf()
   val sc = new SparkContext("yarn-client", "Simple App", conf)

   Where do I need to specify the queue?

  Thanks in advance for any help on this...

Thanks,
Ron

Re: Job initialization performance of Spark standalone mode vs YARN

2014-04-04 Thread Ron Gonzalez
Hi,
  Can you explain a little more what's going on? Which mode submits a job to the 
yarn cluster that creates an application master and spawns containers for the 
local jobs? I tried yarn-client, submitted to our yarn cluster, and it seems 
to work that way. Shouldn't Client.scala be running within the AppMaster 
instance in this run mode?
  How exactly does yarn-standalone work?

Thanks,
Ron

Sent from my iPhone

 On Apr 3, 2014, at 11:19 AM, Kevin Markey kevin.mar...@oracle.com wrote:
 
 We are now testing precisely what you ask about in our environment.  But 
 Sandy's questions are relevant.  The bigger issue is not Spark vs. Yarn but 
 client vs. standalone and where the client is located on the network 
 relative to the cluster.
 
 The client options that locate the client/master remote from the cluster, 
 while useful for interactive queries, suffer from considerable network 
 traffic overhead as the master schedules and transfers data with the worker 
 nodes on the cluster.  The standalone options locate the master/client on 
 the cluster.  In yarn-standalone, the master is a thread contained by the 
 Yarn Resource Manager.  Lots less traffic, as the master is co-located with 
 the worker nodes on the cluster and its scheduling/data communication has 
 less latency.
 
 In my comparisons between yarn-client and yarn-standalone (so as not to 
 conflate yarn vs Spark), yarn-client computation time is at least double 
 yarn-standalone!  At least for a job with lots of stages and lots of 
 client/worker communication, although rather few collect actions, so it's 
 mainly scheduling that's relevant here.
 
 I'll be posting more information as I have it available.
 
 Kevin
 
 
 On 03/03/2014 03:48 PM, Sandy Ryza wrote:
 Are you running in yarn-standalone mode or yarn-client mode?  Also, what 
 YARN scheduler and what NodeManager heartbeat?  
 
 
 On Sun, Mar 2, 2014 at 9:41 PM, polkosity polkos...@gmail.com wrote:
 Thanks for the advice Mayur.
 
 I thought I'd report back on the performance difference...  Spark standalone
 mode has executors processing at capacity in under a second :)
 
 
 
 


Re: Avro serialization

2014-04-04 Thread Ron Gonzalez
Thanks will take a look...

Sent from my iPad

 On Apr 3, 2014, at 7:49 AM, FRANK AUSTIN NOTHAFT fnoth...@berkeley.edu 
 wrote:
 
 We use avro objects in our project, and have a Kryo serializer for generic 
 Avro SpecificRecords. Take a look at:
 
 https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/edu/berkeley/cs/amplab/adam/serialization/ADAMKryoRegistrator.scala
 
 Also, Matt Massie has a good blog post about this at 
 http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/.
 
 Frank Austin Nothaft
 fnoth...@berkeley.edu
 fnoth...@eecs.berkeley.edu
 202-340-0466
 
 
 On Thu, Apr 3, 2014 at 7:16 AM, Ian O'Connell i...@ianoconnell.com wrote:
 Objects being transformed need to be one of these in flight. Source data can 
 just use the MapReduce input formats, so anything you can do with mapred works. 
 For doing an Avro one you probably want one of:
 https://github.com/kevinweil/elephant-bird/blob/master/core/src/main/java/com/twitter/elephantbird/mapreduce/input/*ProtoBuf*
 
 or whatever you're using at the moment to open them in an MR job could 
 probably be repurposed
 
 
 On Thu, Apr 3, 2014 at 7:11 AM, Ron Gonzalez zlgonza...@yahoo.com wrote:
 
 Hi,
   I know that sources need to be either Java serializable or use Kryo 
 serialization.
   Does anyone have sample code that reads, transforms, and writes Avro files 
 in Spark?
 
 Thanks,
 Ron
 


Submitting to yarn cluster

2014-04-02 Thread Ron Gonzalez
Hi,
  I have a small program, but I cannot seem to make it pick up the right 
properties for the cluster.
  I have SPARK_YARN_APP_JAR, SPARK_JAR, and SPARK_HOME set properly.
  If I run this Scala file, I see that it never uses the 
yarn.resourcemanager.address property that I set on the SparkConf instance.
  Any advice?

Thanks,
Ron

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.deploy.yarn.Client
import java.lang.System
import org.apache.spark.SparkConf


object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "/home/rgonzalez/app/spark-0.9.0-incubating-bin-hadoop2/README.md"
    val conf = new SparkConf()
    conf.set("yarn.resourcemanager.address", "localhost:8050")
    val sc = new SparkContext("yarn-client", "Simple App", conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()

    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}