Re: Could any one please tell me why this takes forever to finish?

2017-05-01 Thread Yan Facai
Hi, 10.x.x.x is a private network address; see https://en.wikipedia.org/wiki/IP_address. You should use the public IP of your AWS instance. On Sat, Apr 29, 2017 at 6:35 AM, Yuan Fang wrote: > > object SparkPi { > private val logger = Logger(this.getClass) > > val sparkConf = new
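
To check whether an address falls in a private range like the one mentioned above, the JDK alone is enough. This is a minimal sketch, assuming IPv4 literals as input; the `PrivateIpCheck` object and the `isPrivate` name are illustrative and not from the thread.

```scala
import java.net.InetAddress

object PrivateIpCheck {
  // True for site-local (RFC 1918) IPv4 addresses:
  // 10.0.0.0/8, 172.16.0.0/12 and 192.168.0.0/16.
  // Pass IP literals only: a hostname would trigger a DNS lookup.
  def isPrivate(ip: String): Boolean =
    InetAddress.getByName(ip).isSiteLocalAddress

  def main(args: Array[String]): Unit = {
    // Sample addresses chosen for illustration.
    Seq("10.0.0.12", "172.16.5.4", "192.168.1.1", "8.8.8.8").foreach { ip =>
      println(s"$ip private=${isPrivate(ip)}")
    }
  }
}
```

An address for which `isPrivate` returns true is not reachable from outside its own network, which is why a driver or master bound to it cannot be contacted from a remote machine.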

Re: Initialize Gaussian Mixture Model using Spark ML dataframe API

2017-05-01 Thread Yanbo Liang
Hi Tim, the Spark ML API doesn't currently support setting an initial model for GMM. I hope we can get this feature into Spark 2.3. Thanks, Yanbo On Fri, Apr 28, 2017 at 1:46 AM, Tim Smith wrote: > Hi, > > I am trying to figure out the API to initialize a gaussian mixture model > using

RE: Spark-SQL Query Optimization: overlapping ranges

2017-05-01 Thread Lavelle, Shawn
Jacek, Thanks for your help. I didn’t want to write a bug/enhancement unless warranted. ~ Shawn From: Jacek Laskowski [mailto:ja...@japila.pl] Sent: Thursday, April 27, 2017 8:39 AM To: Lavelle, Shawn Cc: user Subject: Re: Spark-SQL Query

Loading postgresql table to spark SyntaxError

2017-05-01 Thread Saulo Ricci
Hi, the following code is reading a table from my postgresql database, and I'm following the directives I've read on the internet: val txs = spark.read.format("jdbc").options(Map( ("driver" -> "org.postgresql.Driver"), ("url" -> "jdbc:postgresql://host/dbname"), ("dbtable" ->

Re: Schema Evolution for nested Dataset[T]

2017-05-01 Thread Michael Armbrust
Oh, and if you want a default other than null: import org.apache.spark.sql.functions._ df.withColumn("address", coalesce($"address", lit(<default value>))) On Mon, May 1, 2017 at 10:29 AM, Michael Armbrust wrote: > The following should work: > > val schema =

Re: Schema Evolution for nested Dataset[T]

2017-05-01 Thread Michael Armbrust
The following should work: val schema = implicitly[org.apache.spark.sql.Encoder[Course]].schema spark.read.schema(schema).parquet("data.parquet").as[Course] Note this will only work for nullable fields (i.e. if you add a primitive like Int you need to make it an Option[Int]) On Sun, Apr 30, 2017
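
The approach above can be tried end to end with a small local job. This is a sketch under stated assumptions, not the poster's actual data model: `CourseV1`, the `credits` field, the temporary path, and the `readEvolved` helper are all hypothetical, and a local Spark 2.x session is assumed.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{Encoder, SparkSession}

object SchemaEvolutionSketch {
  // Hypothetical old and new versions of the record type.
  case class CourseV1(name: String)
  // The added field is an Option so that it can be null in old files.
  case class Course(name: String, credits: Option[Int])

  // Writes a file with the old schema, then reads it back with the new one.
  def readEvolved(spark: SparkSession): Seq[Course] = {
    import spark.implicits._

    val path = Files.createTempDirectory("courses").resolve("data.parquet").toString
    Seq(CourseV1("algebra")).toDS().write.parquet(path)

    // Derive the read schema from the new case class, as in the reply above.
    val schema = implicitly[Encoder[Course]].schema
    spark.read.schema(schema).parquet(path).as[Course].collect().toSeq
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder().master("local[*]").appName("schema-evolution-sketch").getOrCreate()
    try readEvolved(spark).foreach(println)
    finally spark.stop()
  }
}
```

The column missing from the old file is read as null, which is why the new field has to be an `Option[Int]` rather than a plain `Int`.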

Re: Reading table from sql database to apache spark dataframe/RDD

2017-05-01 Thread vincent gromakowski
Use cache or persist. The dataframe will be materialized when the first action is called and then be reused from memory for each subsequent use. On May 1, 2017 4:51 PM, "Saulo Ricci" wrote: > Hi, > > > I have the following code that is reading a table to an Apache Spark >
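
The caching advice above can be sketched as follows. This is an illustration under assumptions: `spark.range` stands in for the JDBC-backed DataFrame from the question, and the `CacheSketch` object and `countTwice` helper are hypothetical names.

```scala
import org.apache.spark.sql.SparkSession

object CacheSketch {
  // Runs two actions against one cached DataFrame and returns both counts.
  def countTwice(spark: SparkSession): (Long, Long) = {
    // Stand-in for the DataFrame loaded over JDBC in the question.
    val df = spark.range(0, 1000).toDF("id")

    // cache() is lazy: it only marks the DataFrame for storage.
    df.cache()

    // The first action computes the data and stores it.
    val total = df.count()

    // Later actions reuse the stored partitions instead of going back to the source.
    val evens = df.filter("id % 2 = 0").count()

    // Release the storage once the DataFrame is no longer needed.
    df.unpersist()
    (total, evens)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder().master("local[*]").appName("cache-sketch").getOrCreate()
    try println(countTwice(spark))
    finally spark.stop()
  }
}
```

Without the `cache()` call, each action would re-evaluate the full plan, which for a JDBC source means querying the database again.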

Reading table from sql database to apache spark dataframe/RDD

2017-05-01 Thread Saulo Ricci
Hi, I have the following code that is reading a table to an Apache Spark DataFrame: val df = spark.read.format("jdbc") .option("url","jdbc:postgresql://host/database") .option("dbtable","tablename").option("user","username") .option("password", "password") .load() When I

Re: removing columns from file

2017-05-01 Thread Steve Loughran
On 28 Apr 2017, at 16:10, Anubhav Agarwal wrote: > Are you using Spark's textFiles method? If so, go through this blog: http://tech.kinja.com/how-not-to-pull-from-s3-using-apache-spark-1704509219 That is an old, dated blog post. If you get the Hadoop 2.8