Get application id when using SparkSubmit.main from java

2018-04-20 Thread Ron Gonzalez
Hi, I am trying to get the application id after I use SparkSubmit.main for a YARN submission. I am able to make it asynchronous using the spark.yarn.submit.waitAppCompletion=false configuration option, but I can't seem to figure out how I can get the application id for this job. I read both
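
One way to get at the id without parsing logs is the SparkLauncher API (added in Spark 1.6), which hands back a SparkAppHandle instead of going through SparkSubmit.main. A minimal sketch; the jar path and main class are illustrative placeholders:

    import org.apache.spark.launcher.SparkLauncher

    val handle = new SparkLauncher()
      .setAppResource("/path/to/app.jar")    // illustrative path
      .setMainClass("com.example.MyApp")     // illustrative class
      .setMaster("yarn")
      .startApplication()

    // getAppId() returns null until YARN has accepted the application
    while (handle.getAppId == null && !handle.getState.isFinal) Thread.sleep(500)
    println(s"application id: ${handle.getAppId}")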

Re: Get full RDD lineage for a spark job

2017-07-23 Thread Ron Gonzalez
, 2017 at 7:57 PM, Keith Chapman <keithgchap...@gmail.com> wrote: Hi Ron, You can try using the toDebugString method on the RDD; this will print the RDD lineage. Regards, Keith. http://keith-chapman.com On Fri, Jul 21, 2017 at 11:24 AM, Ron Gonzalez <zlgonza...@yahoo.com.invalid>
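
A minimal sketch of the toDebugString suggestion; the RDD built here is just an example:

    val rdd = sc.textFile("/tmp/input.txt")   // illustrative input
      .flatMap(_.split(" "))
      .map((_, 1))
      .reduceByKey(_ + _)

    // prints the lineage, one indented line per ancestor RDD
    println(rdd.toDebugString)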

Get full RDD lineage for a spark job

2017-07-21 Thread Ron Gonzalez
Hi, Can someone point me to a test case or share sample code that can extract the RDD graph from a Spark job at any point during its lifecycle? I understand that the Spark UI can show the graph of the execution, so I'm hoping it uses some API somewhere that I could use. I know
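
For programmatic access (rather than the printable string from toDebugString), each RDD exposes its parents through the public dependencies field, so the graph can be walked directly. A sketch, not tied to any particular job:

    import org.apache.spark.rdd.RDD

    // recursively visit an RDD and everything it was derived from
    def walkLineage(rdd: RDD[_], depth: Int = 0): Unit = {
      println(("  " * depth) + rdd)
      rdd.dependencies.foreach(dep => walkLineage(dep.rdd, depth + 1))
    }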

Losing files in hdfs after creating spark sql table

2015-07-30 Thread Ron Gonzalez
Hi, After I create a table in Spark SQL and load an HDFS file into it with LOAD DATA INPATH, the file no longer shows up under hadoop fs -ls. Is this expected? Thanks, Ron
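
For context: this is expected behavior for a managed table, because LOAD DATA INPATH moves the file (it does not copy it) into the table's warehouse directory. A sketch of the statement in question, with illustrative names:

    // after this, /user/me/data.txt is gone from its original location;
    // it now lives under the warehouse directory for table t
    sqlContext.sql("LOAD DATA INPATH '/user/me/data.txt' INTO TABLE t")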

Question on Spark SQL for a directory

2015-07-21 Thread Ron Gonzalez
Hi, A question on using Spark SQL: can someone give an example of creating a table from a directory containing Parquet files in HDFS, instead of a single Parquet file? Thanks, Ron On 07/21/2015 01:59 PM, Brandon White wrote: A few questions about caching a table in Spark SQL. 1) Is there
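
The Parquet reader accepts a directory path as well as a single file, so no extra step is needed. A short sketch with an illustrative path and table name:

    // Spark 1.3+: point the reader at the directory, not a file
    val df = sqlContext.read.parquet("hdfs:///data/events/")
    df.registerTempTable("events")
    sqlContext.sql("SELECT count(*) FROM events").show()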

Re: Classifier for Big Data Mining

2015-07-21 Thread Ron Gonzalez
I'd use Random Forest. It will give you better generalizability. There are also a number of things you can do with RF that allow you to train on samples of the massive data set and then just average over the resulting models... Thanks, Ron On 07/21/2015 02:17 PM, Olivier Girardot wrote: depends
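
A minimal sketch of the MLlib Random Forest API being recommended; the data path and hyperparameters are illustrative, not tuned values:

    import org.apache.spark.mllib.tree.RandomForest
    import org.apache.spark.mllib.util.MLUtils

    val data = MLUtils.loadLibSVMFile(sc, "hdfs:///data/train.libsvm")
    val Array(train, test) = data.randomSplit(Array(0.7, 0.3))

    val model = RandomForest.trainClassifier(
      train,
      numClasses = 2,
      categoricalFeaturesInfo = Map[Int, Int](),
      numTrees = 100,
      featureSubsetStrategy = "auto",
      impurity = "gini",
      maxDepth = 5,
      maxBins = 32)

    // simple holdout accuracy check
    val accuracy = test.filter(p => model.predict(p.features) == p.label)
      .count().toDouble / test.count()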

Re: Basic Spark SQL question

2015-07-14 Thread Ron Gonzalez
-the-thrift-jdbcodbc-server On Mon, Jul 13, 2015 at 6:31 PM, Jerrick Hoang jerrickho...@gmail.com wrote: Well for adhoc queries you can use the CLI On Mon, Jul 13, 2015 at 5:34 PM, Ron Gonzalez zlgonza...@yahoo.com.invalid wrote: Hi, I have a question for Spark SQL. Is there a way
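
The Thrift server referenced above keeps a long-lived SparkContext on YARN, so ad hoc queries go over JDBC instead of through a new job submission each time. A sketch of the client side, assuming the server was started with sbin/start-thriftserver.sh and the hive-jdbc client is on the classpath (host, port, and table are illustrative):

    import java.sql.DriverManager

    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "user", "")
    val rs = conn.createStatement().executeQuery("SELECT count(*) FROM t")
    while (rs.next()) println(rs.getLong(1))
    conn.close()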

Basic Spark SQL question

2015-07-13 Thread Ron Gonzalez
Hi, I have a question for Spark SQL. Is there a way to use Spark SQL on YARN without having to submit a job? Bottom line here is I want to reduce the latency of running queries as a job. I know that the spark sql default submission is like a job, but was wondering if

Re: error with pyspark

2014-08-11 Thread Ron Gonzalez
If you're running on Ubuntu, run ulimit -n, which shows the maximum number of open files allowed. You will have to raise that value in /etc/security/limits.conf, then log out and log back in. Thanks, Ron Sent from my iPad On Aug 10, 2014, at 10:19 PM, Davies Liu
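
A sketch of the limits.conf entries in question; the 10000 here is only an illustrative value, not a recommendation:

    # /etc/security/limits.conf
    # <domain>  <type>  <item>   <value>
    *           soft    nofile   10000
    *           hard    nofile   10000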

Re: Save an RDD to a SQL Database

2014-08-06 Thread Ron Gonzalez
Hi Vida, It's possible to save an RDD as a Hadoop file using Hadoop output formats. It might be worthwhile to investigate DBOutputFormat and see if it will work for you. I haven't personally written to a db, but I'd imagine this would be one way to do it. Thanks, Ron Sent from my
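
A rough sketch of the DBOutputFormat idea, hedged since the thread itself is speculative: the key type implements Hadoop's DBWritable, and the table, columns, JDBC settings, and the hypothetical rdd of (String, Int) pairs are all illustrative:

    import java.sql.{PreparedStatement, ResultSet}
    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapred.JobConf
    import org.apache.hadoop.mapred.lib.db.{DBConfiguration, DBOutputFormat, DBWritable}

    // hypothetical row type mapping to a two-column table
    class UserRow(var name: String, var age: Int) extends DBWritable {
      def this() = this("", 0)
      override def write(ps: PreparedStatement): Unit = { ps.setString(1, name); ps.setInt(2, age) }
      override def readFields(rs: ResultSet): Unit = { name = rs.getString(1); age = rs.getInt(2) }
    }

    val jobConf = new JobConf(sc.hadoopConfiguration)
    DBConfiguration.configureDB(jobConf, "com.mysql.jdbc.Driver",
      "jdbc:mysql://dbhost/mydb", "user", "password")           // illustrative connection
    DBOutputFormat.setOutput(jobConf, "users", "name", "age")   // also sets the output format

    rdd.map { case (name, age) => (new UserRow(name, age), NullWritable.get()) }
       .saveAsHadoopDataset(jobConf)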

Re: Computing mean and standard deviation by key

2014-08-04 Thread Ron Gonzalez
Cool, thanks! On Monday, August 4, 2014 8:58 AM, kriskalish k...@kalish.net wrote: Hey Ron, It was pretty much exactly as Sean had depicted. I just needed to provide count with an anonymous function to tell it which elements to count. Since I wanted to count them all, the function is simply
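
One pass over the data can carry (count, sum, sum of squares) per key and derive both statistics at the end. A sketch of that combineByKey approach for an RDD of (K, Double) pairs called pairs:

    val stats = pairs.combineByKey(
      (v: Double) => (1L, v, v * v),
      (acc: (Long, Double, Double), v: Double) =>
        (acc._1 + 1, acc._2 + v, acc._3 + v * v),
      (a: (Long, Double, Double), b: (Long, Double, Double)) =>
        (a._1 + b._1, a._2 + b._2, a._3 + b._3)
    ).mapValues { case (n, sum, sumSq) =>
      val mean = sum / n
      val stddev = math.sqrt(sumSq / n - mean * mean)   // population std dev
      (mean, stddev)
    }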

Re: Issues with HDP 2.4.0.2.1.3.0-563

2014-08-04 Thread Ron Gonzalez
One key thing I forgot to mention is that I changed the avro version to 1.7.7 to get AVRO-1476. I took a closer look at the jars, and what I noticed is that the assembly jars that work do not have the org.apache.avro.mapreduce package packaged into the assembly. For spark-1.0.1,

Re: Is there a way to write spark RDD to Avro files

2014-08-01 Thread Ron Gonzalez
You have to import org.apache.spark.rdd._, which will automatically make this method available. Thanks, Ron Sent from my iPhone On Aug 1, 2014, at 3:26 PM, touchdown yut...@gmail.com wrote: Hi, I am facing a similar dilemma. I am trying to aggregate a bunch of small avro files into one
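
A sketch of writing GenericRecords with the Avro mapreduce output format; this assumes avro-mapred on the classpath, and the schema, RDD, and output path are illustrative:

    import org.apache.avro.Schema
    import org.apache.avro.generic.GenericRecord
    import org.apache.avro.mapred.AvroKey
    import org.apache.avro.mapreduce.{AvroJob, AvroKeyOutputFormat}
    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.mapreduce.Job

    val schema: Schema = ???             // your Avro schema
    val job = Job.getInstance(sc.hadoopConfiguration)
    AvroJob.setOutputKeySchema(job, schema)

    records                              // records: RDD[GenericRecord]
      .map(r => (new AvroKey[GenericRecord](r), NullWritable.get()))
      .saveAsNewAPIHadoopFile(
        "/out/avro",
        classOf[AvroKey[GenericRecord]], classOf[NullWritable],
        classOf[AvroKeyOutputFormat[GenericRecord]], job.getConfiguration)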

Re: Computing mean and standard deviation by key

2014-08-01 Thread Ron Gonzalez
Can you share the mapValues approach you used? Thanks, Ron Sent from my iPhone On Aug 1, 2014, at 3:00 PM, kriskalish k...@kalish.net wrote: Thanks for the help everyone. I got the mapValues approach working. I will experiment with the reduceByKey approach later. -Kris --

NotSerializableException

2014-07-30 Thread Ron Gonzalez
Hi, I took Avro 1.7.7 and recompiled my distribution to fix the issue with Avro GenericRecord (AVRO-1476), and that issue was resolved. I also enabled Kryo registration in SparkConf. That said, I am still seeing a NotSerializableException for
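
For reference, the Kryo registration described here presumably looks something like the following for that Spark era (the registrator class name is illustrative, and the class must be on the executor classpath):

    import com.esotericsoftware.kryo.Kryo
    import org.apache.avro.generic.GenericData
    import org.apache.spark.SparkConf
    import org.apache.spark.serializer.KryoRegistrator

    class AvroRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo): Unit = {
        // GenericData.Record is not Java-serializable, so Kryo must handle it
        kryo.register(classOf[GenericData.Record])
      }
    }

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", "com.example.AvroRegistrator")   // illustrative FQCN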

Re: cache changes precision

2014-07-25 Thread Ron Gonzalez
Can you try cloning the records in the map call? Also look at the contents and see if they're actually changed, or if the resulting RDD after a cache is just the last record smeared across all the others. Cheers, Andrew On Thu, Jul 24, 2014 at 2:41 PM, Ron Gonzalez zlgonza
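
Background on the cloning suggestion: Hadoop record readers reuse one object for every record, so caching without copying can leave an RDD full of references to the same mutated instance. A sketch for Avro records, assuming raw is an RDD[GenericRecord] with a known schema:

    import org.apache.avro.generic.{GenericData, GenericRecord}

    // deepCopy allocates a fresh record per element, so the cache
    // holds distinct objects instead of one reused buffer
    val safe = raw.map(rec => GenericData.get().deepCopy(schema, rec))
    safe.cache()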

Issue submitting spark job to yarn

2014-07-25 Thread Ron Gonzalez
Folks, I've been able to submit simple jobs to YARN thus far. However, when I did something more complicated that added 194 dependency jars using --addJars, the job failed in YARN with no logs. What ends up happening is that no container logs get created (app master or executor). If I add just

cache changes precision

2014-07-24 Thread Ron Gonzalez
Hi, I'm doing the following: def main(args: Array[String]) = { val sparkConf = new SparkConf().setAppName("AvroTest").setMaster("local[2]") val sc = new SparkContext(sparkConf) val conf = new Configuration() val job = new Job(conf) val path = new Path("/tmp/a.avro"); val

Possible bug in ClientBase.scala?

2014-07-13 Thread Ron Gonzalez
Hi, I was doing programmatic submission of Spark yarn jobs and I saw code in ClientBase.getDefaultYarnApplicationClasspath(): val field = classOf[MRJobConfig].getField("DEFAULT_YARN_APPLICATION_CLASSPATH") MRJobConfig doesn't have this field so the created launch env is incomplete. Workaround
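
A sketch of the kind of defensive lookup that works around this: try the MRJobConfig field reflectively and fall back to the YARN default when it is absent. This is the shape of the workaround, not the exact Spark patch:

    import org.apache.hadoop.mapreduce.MRJobConfig
    import org.apache.hadoop.yarn.conf.YarnConfiguration

    val defaultClasspath: Array[String] =
      try {
        classOf[MRJobConfig]
          .getField("DEFAULT_YARN_APPLICATION_CLASSPATH")
          .get(null).asInstanceOf[Array[String]]
      } catch {
        // field missing in this Hadoop version: use the YARN default instead
        case _: NoSuchFieldException =>
          YarnConfiguration.DEFAULT_YARN_APPLICATION_CLASSPATH
      }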

Re: Spark on Yarn: Connecting to Existing Instance

2014-07-09 Thread Ron Gonzalez
The idea behind YARN is that you can run different application types like MapReduce, Storm, and Spark. I would recommend that you build your Spark jobs in the main method without specifying how you deploy them. Then you can use spark-submit to tell Spark how you want to deploy them, using
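
A sketch of the separation being recommended: the job's main method leaves the master unset, and spark-submit supplies the deployment choice at launch time (class and jar names illustrative):

    import org.apache.spark.{SparkConf, SparkContext}

    // no setMaster here; spark-submit decides yarn/standalone/local
    val conf = new SparkConf().setAppName("MyJob")
    val sc = new SparkContext(conf)

    // launched with, e.g.:
    //   spark-submit --master yarn-client --class com.example.MyJob my-job.jar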

Re: Purpose of spark-submit?

2014-07-09 Thread Ron Gonzalez
Koert, Yeah I had the same problems trying to do programmatic submission of spark jobs to my Yarn cluster. I was ultimately able to resolve it by reviewing the classpath and debugging through all the different things that the Spark Yarn client (Client.scala) did for submitting to Yarn (like env

Re: Purpose of spark-submit?

2014-07-09 Thread Ron Gonzalez
I am able to use Client.scala or LauncherExecutor.scala as my programmatic entry point for Yarn. Thanks, Ron Sent from my iPad On Jul 9, 2014, at 7:14 AM, Jerry Lam chiling...@gmail.com wrote: +1 as well for being able to submit jobs programmatically without using shell script. we

Re: Setting queue for spark job on yarn

2014-05-21 Thread Ron Gonzalez
Btw, I'm on 0.9.1. Will setting a queue programmatically be available in 1.0? Thanks, Ron Sent from my iPad On May 20, 2014, at 6:27 PM, Ron Gonzalez zlgonza...@yahoo.com wrote: Hi Sandy, Is there a programmatic way? We're building a platform as a service and need to assign

Setting queue for spark job on yarn

2014-05-19 Thread Ron Gonzalez
Hi, How does one submit a spark job to yarn and specify a queue? The code that successfully submits to yarn is: val conf = new SparkConf() val sc = new SparkContext("yarn-client", "Simple App", conf) Where do I need to specify the queue? Thanks in advance for any help on this...
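
One way that works from Spark 1.0 on is the spark.yarn.queue property on SparkConf (the queue name below is illustrative):

    val conf = new SparkConf()
      .setMaster("yarn-client")
      .setAppName("Simple App")
      .set("spark.yarn.queue", "myQueue")   // illustrative queue name
    val sc = new SparkContext(conf)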

Re: Job initialization performance of Spark standalone mode vs YARN

2014-04-04 Thread Ron Gonzalez
Hi, Can you explain a little more about what's going on? Which one submits a job to the yarn cluster, creating an application master and spawning containers for the local jobs? I tried yarn-client, submitted to our yarn cluster, and it seems to work that way. Shouldn't Client.scala be running

Re: Avro serialization

2014-04-04 Thread Ron Gonzalez
://github.com/kevinweil/elephant-bird/blob/master/core/src/main/java/com/twitter/elephantbird/mapreduce/input/*ProtoBuf* or just whatever your using at the moment to open them in a MR job probably could be re-purposed On Thu, Apr 3, 2014 at 7:11 AM, Ron Gonzalez zlgonza...@yahoo.com wrote

Submitting to yarn cluster

2014-04-02 Thread Ron Gonzalez
Hi, I have a small program, but I cannot seem to make it pick up the right cluster properties. I have SPARK_YARN_APP_JAR, SPARK_JAR, and SPARK_HOME set properly. If I run this Scala file, I see that it never uses the yarn.resourcemanager.address property that I set
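
In yarn-client mode the resource manager address is read from the Hadoop/YARN configuration on the client, not from SparkConf, so the usual fix is to point the process at a directory containing yarn-site.xml. A sketch, with an illustrative path:

    // shell environment before launching:
    //   export HADOOP_CONF_DIR=/etc/hadoop/conf   # must contain yarn-site.xml

    val conf = new SparkConf().setMaster("yarn-client").setAppName("Simple App")
    val sc = new SparkContext(conf)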