Re: is there any way to submit spark application from outside of spark cluster

2016-03-25 Thread vetal king
Prateek, it's possible to submit a Spark application from an outside application. If you are using Java, then use ProcessBuilder and execute spark-submit. There are two other options which I have not used: there is a Spark submit server, and Spark also provides a REST API to submit jobs, but I don't have
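A minimal sketch of the ProcessBuilder approach mentioned above, assuming spark-submit is on the PATH; the master URL, class name and jar path are placeholders, not taken from the thread:

    // Launch spark-submit from an external JVM application and wait for it to finish.
    val pb = new ProcessBuilder(
      "spark-submit",
      "--master", "spark://master-host:7077",   // placeholder master URL
      "--class", "com.example.MyApp",           // placeholder main class
      "/path/to/my-app-assembly.jar")           // placeholder assembly jar
    pb.inheritIO()                               // forward spark-submit's stdout/stderr
    val exitCode = pb.start().waitFor()
    println(s"spark-submit exited with code $exitCode")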

Re: Testing spark with AWS spot instances

2016-03-25 Thread Sven Krasser
When a spot instance terminates, you lose all data (RDD partitions) stored in the executors that ran on that instance. Spark can recreate the partitions from input data, but if that requires going through multiple preceding shuffles a good chunk of the job will need to be redone. -Sven On Thu,

Re: Limit pyspark.daemon threads

2016-03-25 Thread Sven Krasser
Hey Ken, I also frequently see more pyspark daemons than the configured concurrency, often by a low multiple. (There was an issue pre-1.3.0 that caused this to be quite a bit higher, so make sure you at least have a recent version; see SPARK-5395.) Each pyspark daemon tries to stay below the

Re: Hive table created by Spark seems to end up in default

2016-03-25 Thread Ted Yu
Session management has improved in 1.6.x (see SPARK-10810) Mind giving 1.6.1 a try ? Thanks On Fri, Mar 25, 2016 at 3:48 PM, Mich Talebzadeh wrote: > I have noticed that the only sure way to specify a Hive table from Spark > is to prefix it with database (DBName)

Someone explain the /mnt and /mnt2 folders on spark-ec2 slaves

2016-03-25 Thread Dillian Murphey
I have a 40GB ephemeral disk on /mnt and another one on /mnt2. The person who set this up has left. I'm aware of having maybe one EBS disk, but I guess this was launched with two EBS volumes using the --ebsxyz option? Or are those two instance stores part of the AMI? Thanks

Re: This simple UDF is not working!

2016-03-25 Thread Mich Talebzadeh
Hi Ted, I decided to take a shortcut here. I created the map leaving the date as it is (p(1)), as below: def CleanupCurrency (word : String) : Double = { return word.toString.substring(1).replace(",", "").toDouble } sqlContext.udf.register("CleanupCurrency", CleanupCurrency(_:String)) val a =
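For readers skimming the thread, a cleaned-up sketch of the currency UDF quoted above, assuming the value always starts with a one-character currency symbol as in the original:

    // Strip the leading currency symbol and thousands separators, then parse.
    def cleanupCurrency(word: String): Double =
      word.substring(1).replace(",", "").toDouble

    sqlContext.udf.register("CleanupCurrency", cleanupCurrency _)
    // Illustrative usage: sql("SELECT CleanupCurrency(total) FROM tmp")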

Re: SparkSQL and multiple roots in 1.6

2016-03-25 Thread Spencer Uresk
Ok, that helped a lot - and I understand the feature/change better now. Thank you! On Fri, Mar 25, 2016 at 4:32 PM, Michael Armbrust wrote: > Oh, I'm sorry I didn't fully understand what you were trying to do. If > you don't need partitioning, you can set >

Hive table created by Spark seems to end up in default

2016-03-25 Thread Mich Talebzadeh
I have noticed that the only sure way to specify a Hive table from Spark is to prefix it with the database name (DBName); otherwise it seems to be created in default even if you use sql("use DBName")? Thanks Dr Mich Talebzadeh LinkedIn *
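A short sketch of the workaround Mich describes, using the database and table names that appear later in this digest (test.t14); column types are illustrative:

    // Qualifying the table with its database keeps it out of "default",
    // regardless of any earlier sql("use test") call.
    sql("use test")
    sql("CREATE TABLE test.t14 (invoicenumber INT, net DECIMAL(20,2))")
    sql("SELECT * FROM test.t14").show()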

Re: Finding out the time a table was created

2016-03-25 Thread Mich Talebzadeh
mine is version 1.5.2 Ted Thanks Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw * http://talebzadehmich.wordpress.com On 25 March 2016 at

Re: Finding out the time a table was created

2016-03-25 Thread Ted Yu
Strange: the JIRAs below were marked Fixed in 1.5.0 On Fri, Mar 25, 2016 at 3:43 PM, Mich Talebzadeh wrote: > Is this 1.6 Ted? > > Dr Mich Talebzadeh > > > > LinkedIn * > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >

Re: Finding out the time a table was created

2016-03-25 Thread Mich Talebzadeh
Is this 1.6 Ted? Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw * http://talebzadehmich.wordpress.com On 25 March 2016 at 22:40, Ted Yu

Re: Finding out the time a table was created

2016-03-25 Thread Mich Talebzadeh
Hive gets it OK. That is what you saw as well, I guess, Ashok. Mine is 1.5.2 as well. hive> describe formatted test.t14; OK # col_name data_type comment invoicenumber int paymentdate date net decimal(20,2) vat

Re: Finding out the time a table was created

2016-03-25 Thread Ted Yu
Looks like database support was fixed by: [SPARK-7943] [SPARK-8105] [SPARK-8435] [SPARK-8714] [SPARK-8561] Fixes multi-database support On Fri, Mar 25, 2016 at 3:35 PM, Ashok Kumar wrote: > 1.5.2 Ted. > > Those two lines I don't know where they come. It finds and gets the

Re: Finding out the time a table was created

2016-03-25 Thread Ashok Kumar
1.5.2, Ted. I don't know where those two lines come from. It finds and gets the table info OK. HTH On Friday, 25 March 2016, 22:32, Ted Yu wrote: Which release of Spark do you use, Mich ? In master branch, the message is more accurate

Re: Finding out the time a table was created

2016-03-25 Thread Ted Yu
Which release of Spark do you use, Mich ? In master branch, the message is more accurate (sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/NoSuchItemException.scala): override def getMessage: String = s"Table $table not found in database $db" On Fri, Mar 25, 2016 at 3:21

Re: SparkSQL and multiple roots in 1.6

2016-03-25 Thread Michael Armbrust
Oh, I'm sorry I didn't fully understand what you were trying to do. If you don't need partitioning, you can set "spark.sql.sources.partitionDiscovery.enabled=false". Otherwise, I think you need to use the unioning approach. On Fri, Mar 25, 2016 at 1:35 PM, Spencer Uresk
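A minimal sketch of the setting Michael refers to, on a Spark 1.6.x SQLContext; the read path is borrowed from elsewhere in this thread:

    // Disable partition discovery when the input directories are not laid out
    // as key=value partitions, then read across multiple roots directly.
    sqlContext.setConf("spark.sql.sources.partitionDiscovery.enabled", "false")
    val events = sqlContext.read.json("hdfs://user/hdfs/analytics/*/PAGEVIEW/*/*")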

Re: Finding out the time a table was created

2016-03-25 Thread Mich Talebzadeh
You can use DESCRIBE FORMATTED <database>.<table> to get that info. This is based on the same command in Hive; however, it throws two erroneous error lines as shown below (I don't see them in Hive DESCRIBE ...). Example: scala> sql("describe formatted test.t14").collect.foreach(println) 16/03/25 22:32:38 ERROR Hive:

Finding out the time a table was created

2016-03-25 Thread Ashok Kumar
Experts, I would like to know how to find out when a table was created in a Hive database, using the Spark shell. Thanks

Re: SparkSQL and multiple roots in 1.6

2016-03-25 Thread Spencer Uresk
Thanks for the suggestion - I didn't try it at first because it seems like I have multiple roots and not necessarily partitioned data. Is this the correct way to do that? sqlContext.read.option("basePath", "hdfs://user/hdfs/analytics/").json("hdfs://user/hdfs/analytics/*/PAGEVIEW/*/*") If so, it

Re: [SQL] Two columns in output vs one when joining DataFrames?

2016-03-25 Thread Sean Owen
Although you understand the two are semantically equivalent, the second case involves an arbitrary condition not a join on a column per se. In the general case there is not even a shared column between the two being joined, so all of both are included. On Fri, Mar 25, 2016 at 9:19 PM, Jacek
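A small illustration of the distinction Sean draws, using hypothetical DataFrames df1 and df2 that both carry an id column:

    // Join on a column name: Spark treats the two id columns as one join key,
    // so the output has a single id column.
    val byName = df1.join(df2, "id")

    // Join on an arbitrary condition: the condition merely compares the two
    // columns, so both id columns survive into the output.
    val byExpr = df1.join(df2, df1("id") === df2("id"))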

Re: DataFrameWriter.save fails job with one executor failure

2016-03-25 Thread Surendra , Manchikanti
Hi Vinoth, As per the documentation, DirectParquetOutputCommitter is better suited to S3. https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/DirectParquetOutputCommitter.scala Regards, Surendra M -- Surendra Manchikanti On Fri, Mar

Re: DataFrameWriter.save fails job with one executor failure

2016-03-25 Thread Michael Armbrust
I would not recommend using the direct output committer with HDFS. It's intended only as an optimization for S3. On Fri, Mar 25, 2016 at 4:03 AM, Vinoth Chandar wrote: > Hi, > > We are doing the following to save a dataframe in parquet (using > DirectParquetOutputCommitter) as

Re: SparkSQL and multiple roots in 1.6

2016-03-25 Thread Michael Armbrust
Have you tried setting a base path for partition discovery? Starting from Spark 1.6.0, partition discovery only finds partitions under > the given paths by default. For the above example, if users pass > path/to/table/gender=male to either SQLContext.read.parquet or > SQLContext.read.load, gender

Re: is there any way to submit spark application from outside of spark cluster

2016-03-25 Thread sunil m
Hi Prateek! You might want to have a look at spark job server: https://github.com/spark-jobserver/spark-jobserver Warm regards, Sunil Manikani. On 25 March 2016 at 23:34, Ted Yu wrote: > Do you run YARN in your production environment (and plan to run Spark jobs > on

Re: is there any way to submit spark application from outside of spark cluster

2016-03-25 Thread Ted Yu
Do you run YARN in your production environment (and plan to run Spark jobs on YARN) ? If that is the case, hadoop configuration is needed. Cheers On Fri, Mar 25, 2016 at 11:01 AM, prateek arora wrote: > Hi > > Thanks for the information . it will definitely solve

Re: is there any way to submit spark application from outside of spark cluster

2016-03-25 Thread prateek arora
Hi, Thanks for the information. It will definitely solve my problem. I have one more question: if I want to launch a Spark application in a production environment, is there any other way for multiple users to submit their jobs without having the Hadoop configuration? Regards Prateek On Fri,

Re: is there any way to submit spark application from outside of spark cluster

2016-03-25 Thread Ted Yu
See this thread: http://search-hadoop.com/m/q3RTtAvwgE7dEI02 On Fri, Mar 25, 2016 at 10:39 AM, prateek arora wrote: > Hi > > I want to submit spark application from outside of spark clusters . so > please help me to provide a information regarding this. > >

Re: Limit pyspark.daemon threads

2016-03-25 Thread Carlile, Ken
Further data on this. I'm watching another job right now where there are 16 pyspark.daemon threads, all of which are trying to get a full core (remember, this is a 16-core machine). Unfortunately, the Java process actually running the Spark worker is trying to take several cores of its

is there any way to submit spark application from outside of spark cluster

2016-03-25 Thread prateek arora
Hi, I want to submit a Spark application from outside of the Spark cluster, so please help me with information regarding this. Regards Prateek -- View this message in context:

Re: SparkSQL and multiple roots in 1.6

2016-03-25 Thread Ted Yu
This is the original subject of the JIRA: Partition discovery fail if there is a _SUCCESS file in the table's root dir If I remember correctly, there were discussions on how (traditional) partition discovery slowed down Spark jobs. Cheers On Fri, Mar 25, 2016 at 10:15 AM, suresk

SparkSQL and multiple roots in 1.6

2016-03-25 Thread suresk
In previous versions of Spark, this would work: val events = sqlContext.jsonFile("hdfs://user/hdfs/analytics/*/PAGEVIEW/*/*") Where the first wildcard corresponds to an application directory, the second to a partition directory, and the third matched all the files in the partition directory. The

Re: Spark Metrics Framework?

2016-03-25 Thread Silvio Fiorito
Hi Mike, Sorry got swamped with work and didn’t get a chance to reply. I misunderstood what you were trying to do. I thought you were just looking to create custom metrics vs looking for the existing Hadoop Output Format counters. I’m not familiar enough with the Hadoop APIs but I think it

Re: Best way to determine # of workers

2016-03-25 Thread Aaron Jackson
I think the SparkListener is about as close as it gets. That way I can start up the instance (AWS, OpenStack, VMware, etc.) and simply wait until the SparkListener indicates that the executors are online before starting. Thanks for the advice. Aaron On Fri, Mar 25, 2016 at 10:54 AM, Jacek

Re: Best way to determine # of workers

2016-03-25 Thread Jacek Laskowski
Hi, You may want to use SparkListener [1] (as the web UI does) and listen for SparkListenerExecutorAdded and SparkListenerExecutorRemoved. [1] http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.scheduler.SparkListener Regards, Jacek Laskowski
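A minimal sketch of that listener, assuming a Spark 1.5+/1.6 SparkContext sc; the counter is illustrative:

    import java.util.concurrent.atomic.AtomicInteger
    import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorAdded, SparkListenerExecutorRemoved}

    val liveExecutors = new AtomicInteger(0)

    sc.addSparkListener(new SparkListener {
      override def onExecutorAdded(added: SparkListenerExecutorAdded): Unit =
        liveExecutors.incrementAndGet()
      override def onExecutorRemoved(removed: SparkListenerExecutorRemoved): Unit =
        liveExecutors.decrementAndGet()
    })

    // Poll liveExecutors.get() and start submitting work once the expected
    // number of executors has registered.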

Re: Problem using saveAsNewAPIHadoopFile API

2016-03-25 Thread vetal king
Hi Sebastian, Yes... my mistake... you are right. Every partition will create a different file. Shridhar On Fri, Mar 25, 2016 at 6:58 PM, Sebastian Piu wrote: > I dont understand about the race condition comment you mention. > Have you seen this somewhere? That

Re: This simple UDF is not working!

2016-03-25 Thread Mich Talebzadeh
This works with SQL: sqltext = """ INSERT INTO TABLE t14 SELECT INVOICENUMBER , TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(paymentdate,'dd/MM/yyyy'),'yyyy-MM-dd')) AS paymentdate , NET , VAT , TOTAL FROM tmp """ sql(sqltext) but not in a UDF. I want to convert

Re: This simple UDF is not working!

2016-03-25 Thread Ted Yu
Do you mind showing body of TO_DATE() ? Thanks On Fri, Mar 25, 2016 at 7:38 AM, Ted Yu wrote: > Looks like you forgot an import for Date. > > FYI > > On Fri, Mar 25, 2016 at 7:36 AM, Mich Talebzadeh < > mich.talebza...@gmail.com> wrote: > >> >> >> Hi, >> >> writing a UDF

Re: Spark Metrics Framework?

2016-03-25 Thread Mike Sukmanowsky
Pinging again - any thoughts? On Wed, 23 Mar 2016 at 09:17 Mike Sukmanowsky wrote: > Thanks Ted and Silvio. I think I'll need a bit more hand holding here, > sorry. The way we use ES Hadoop is in pyspark via > org.elasticsearch.hadoop.mr.EsOutputFormat in a

Re: This simple UDF is not working!

2016-03-25 Thread Ted Yu
Looks like you forgot an import for Date. FYI On Fri, Mar 25, 2016 at 7:36 AM, Mich Talebzadeh wrote: > > > Hi, > > writing a UDF to convert a string into Date > > def ChangeDate(word : String) : Date = { > | return >

This simple UDF is not working!

2016-03-25 Thread Mich Talebzadeh
Hi, I am writing a UDF to convert a string into a Date: def ChangeDate(word : String) : Date = { | return TO_DATE(FROM_UNIXTIME(UNIX_TIMESTAMP(word),"dd/MM/yyyy"),"yyyy-MM-dd") | } <console>:19: error: not found: type Date That to_date code works OK in SQL but not here. It is complaining about
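A sketch of the same conversion done inside a Scala UDF rather than via Hive SQL functions; it assumes the input strings are in dd/MM/yyyy form, as in the SQL version, and returns java.sql.Date so Spark SQL sees a proper date type:

    import java.sql.Date
    import java.text.SimpleDateFormat

    // Parse a dd/MM/yyyy string and return a java.sql.Date (DateType in Spark SQL).
    def changeDate(word: String): Date = {
      val parsed = new SimpleDateFormat("dd/MM/yyyy").parse(word)
      new Date(parsed.getTime)
    }

    sqlContext.udf.register("ChangeDate", changeDate _)
    // Illustrative usage: sql("SELECT ChangeDate(paymentdate) FROM tmp")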

Re: Best way to determine # of workers

2016-03-25 Thread Ted Yu
Here is the doc for defaultParallelism : /** Default level of parallelism to use when not given by user (e.g. parallelize and makeRDD). */ def defaultParallelism: Int = { What if the user changes parallelism ? Cheers On Fri, Mar 25, 2016 at 5:33 AM, manasdebashiskar

Re: Problem using saveAsNewAPIHadoopFile API

2016-03-25 Thread Sebastian Piu
I don't understand the race condition comment you mention. Have you seen this somewhere? That timestamp will be the same on each worker for that RDD, and each worker is handling a different partition, which will be reflected in the filename, so no data will be overwritten. In fact this is

Re: Spark and DB connection pool

2016-03-25 Thread Marco Colombo
Thanks, I get that I can handle a pool on my own when dealing with foreachPartition, etc. My question is mainly related to what happens in such a scenario. . val df: DataFrame = hiveSqlContext.read.format("jdbc").options(options).load(); df.registerTempTable("V_RELATIONS"); . I can

Re: Spark and DB connection pool

2016-03-25 Thread manasdebashiskar
Yes there is. You can use the default dbcp or your own preferred connection pool manager. Then when you ask for a connection you get one from the pool. Take a look at this https://github.com/manasdebashiskar/kafka-exactly-once It is forked from Cody's repo. ..Manas -- View this message in
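A rough sketch of the per-partition pattern being described, using commons-dbcp2 as one possible pool implementation; the JDBC URL, credentials and RDD are placeholders:

    import org.apache.commons.dbcp2.BasicDataSource

    // One pool per executor JVM: the lazily initialised singleton is created at
    // most once in each executor, never per record.
    object ConnectionPool {
      lazy val dataSource: BasicDataSource = {
        val ds = new BasicDataSource()
        ds.setUrl("jdbc:postgresql://db-host:5432/mydb")  // placeholder URL
        ds.setUsername("user")                            // placeholder credentials
        ds.setPassword("secret")
        ds
      }
    }

    rdd.foreachPartition { partition =>
      val conn = ConnectionPool.dataSource.getConnection
      try {
        partition.foreach { record =>
          // issue statements for this record; statement handling omitted
        }
      } finally {
        conn.close()  // returns the connection to the pool
      }
    }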

Re: Serialization issue with Spark

2016-03-25 Thread manasdebashiskar
You have not mentioned which task is not serializable. Including the stack trace is usually a good idea when asking this question. Usually Spark will tell you which class it is not able to serialize. If it is one of your own classes, then try making it serializable, or make it transient so that it only gets
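A tiny sketch of the "make it transient" suggestion, with a made-up stand-in for the non-serializable dependency:

    // Stand-in for any non-serializable dependency (HTTP client, DB handle, ...).
    class LegacyClient {
      def send(record: String): String = record.toUpperCase
    }

    class RecordProcessor extends Serializable {
      // @transient lazy val keeps the field out of the serialized closure;
      // each executor recreates the client on first use instead.
      @transient lazy val client = new LegacyClient
      def process(record: String): String = client.send(record)
    }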

Re: Best way to determine # of workers

2016-03-25 Thread manasdebashiskar
There is an sc.defaultParallelism parameter that I use to dynamically maintain elasticity in my application. Depending upon your scenario this might be enough. -- View this message in context:

Re: Create one DB connection per executor

2016-03-25 Thread manasdebashiskar
You are on the right track. The only thing you will have to take care of is when two of your partitions try to access the same connection at the same time. -- View this message in context:
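A sketch of that per-executor pattern with the synchronisation caveat made explicit; the JDBC URL, credentials and RDD are placeholders:

    import java.sql.{Connection, DriverManager}

    // One shared connection per executor JVM. Several partitions may run
    // concurrently in the same executor, so access is synchronised.
    object ExecutorConnection {
      lazy val conn: Connection =
        DriverManager.getConnection("jdbc:postgresql://db-host:5432/mydb", "user", "secret")
      def withConnection[T](f: Connection => T): T = conn.synchronized(f(conn))
    }

    rdd.foreachPartition { partition =>
      partition.foreach { record =>
        ExecutorConnection.withConnection { c =>
          // issue statements for this record; error handling omitted
        }
      }
    }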

Re: Problem using saveAsNewAPIHadoopFile API

2016-03-25 Thread vetal king
Hi Surendra, Thanks for your suggestion. I tried MultipleOutputFormat and MultipleTextOutputFormat, but the result was the same. The folder would always contain a single file, part-r-0, and this file gets overwritten every time. This is how I am invoking the API
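Since the actual call is cut off above, here is an illustrative invocation of the API for an RDD of strings (rdd, classes and paths below are placeholders), writing each run into its own timestamped directory so earlier output is not overwritten:

    import org.apache.hadoop.io.{NullWritable, Text}
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

    // A fresh directory per run; each partition writes its own part-r-* file
    // inside it, so nothing is clobbered between runs or between partitions.
    val outputDir = s"hdfs:///user/example/output-${System.currentTimeMillis}"

    rdd.map(line => (NullWritable.get(), new Text(line)))
       .saveAsNewAPIHadoopFile(
         outputDir,
         classOf[NullWritable],
         classOf[Text],
         classOf[TextOutputFormat[NullWritable, Text]])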

Re: Problem using saveAsNewAPIHadoopFile API

2016-03-25 Thread vetal king
Hi Sebastian, Thanks for your reply. I think using rdd.timestamp may cause one issue. If the parallelized RDD is executed on more than one worker, there may be a race condition if rdd.timestamp is used. It may also result in overwriting the file. Shridhar On Wed, Mar 23, 2016 at 12:32 AM,

DataFrameWriter.save fails job with one executor failure

2016-03-25 Thread Vinoth Chandar
Hi, We are doing the following to save a dataframe in Parquet (using DirectParquetOutputCommitter): dfWriter.format("parquet") .mode(SaveMode.Overwrite) .save(outputPath) The problem is that even if an executor fails once while writing a file (say some transient HDFS issue), when its
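For readers wondering how the committer gets wired in: a sketch of the usual setting (assuming Spark 1.6.x, where the class lives in the package shown in the GitHub link elsewhere in this thread), bearing in mind the advice above that it is intended for object stores like S3, not HDFS:

    // Assumption: Spark 1.6.x and the spark.sql.parquet.output.committer.class
    // setting; verify the exact key and package against your Spark version.
    sqlContext.setConf("spark.sql.parquet.output.committer.class",
      "org.apache.spark.sql.execution.datasources.parquet.DirectParquetOutputCommitter")

    // The write itself is unchanged from the message above:
    // dfWriter.format("parquet").mode(SaveMode.Overwrite).save(outputPath)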

Re: Does SparkSql has official jdbc/odbc driver?

2016-03-25 Thread Raymond Honderdors
Recommended drivers for Spark / Thrift are the ones from Databricks (Simba). My experience is that the Databricks driver works perfectly on Windows and Linux. On Windows you can get the Microsoft driver. Both are ODBC. Not yet tried the JDBC drivers. Sent from Outlook Mobile

Re: Does SparkSql has official jdbc/odbc driver?

2016-03-25 Thread Mich Talebzadeh
JDBC drivers are specific to the databases you are accessing; they are produced by the database vendors. For example, the Oracle one is called ojdbc6.jar and the Sybase one is called jconn4.jar. Hive has got its own drivers. There are companies that produce JDBC or ODBC drivers for various databases like

Re: Forcing data from disk to memory

2016-03-25 Thread Ravindra
Yup, cache is lazy, like a transformation; you need to run an action to get the data into it. http://apache-spark-user-list.1001560.n3.nabble.com/How-to-enforce-RDD-to-be-cached-td20230.html On Fri, Mar 25, 2016 at 2:32 PM Jörn Franke wrote: > I am not 100% sure of the
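In other words (yourRdd is a placeholder):

    yourRdd.cache()   // only marks the RDD for caching; nothing is computed yet
    yourRdd.count()   // an action materialises the partitions and populates the cache
    // later actions on yourRdd are served from memory, space permitting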

Re: Forcing data from disk to memory

2016-03-25 Thread Jörn Franke
I am not 100% sure of the root cause, but if you need rdd caching then look at Apache Ignite or similar. > On 24 Mar 2016, at 16:22, Daniel Imberman wrote: > > Hi Takeshi, > > Thank you for getting back to me. If this is not possible then perhaps you > can help me

Re: Forcing data from disk to memory

2016-03-25 Thread Takeshi Yamamuro
I'm not 100% sure what you want to do though; how about caching the whole dataset and then querying it? yourRdd.cache.foreach(_) On Fri, Mar 25, 2016 at 12:22 AM, Daniel Imberman wrote: > Hi Takeshi, > > Thank you for getting back to me. If this is not possible then perhaps you

Re: Does SparkSql has official jdbc/odbc driver?

2016-03-25 Thread Takeshi Yamamuro
Hi, No, you need to use third-party ones. // maropu On Fri, Mar 25, 2016 at 3:33 PM, sage wrote: > Hi all, > Does SparkSql has official jdbc/odbc driver? I only saw third-party's > jdbc/odbc driver. > > > > -- > View this message in context: >

Does SparkSql has official jdbc/odbc driver?

2016-03-25 Thread sage
Hi all, Does SparkSql have an official JDBC/ODBC driver? I only saw third-party JDBC/ODBC drivers. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Does-SparkSql-has-official-jdbc-odbc-driver-tp26591.html Sent from the Apache Spark User List mailing list