Re: if conditions

2016-11-28 Thread Stuart White
> applicable. I am doing this in Java instead of Scala. Note: I am using Spark version 1.6.1.
>
> -Original Message-
> From: Stuart White [mailto:stuart.whi...@gmail.com]
> Sent: Monday, November 28, 2016 10:26 AM
> To: Hitesh Goyal
> Cc: user@spark.apache.org

Re: if conditions

2016-11-27 Thread Stuart White
Use the when() and otherwise() functions. For example:

import org.apache.spark.sql.functions._

val rows = Seq(("bob", 1), ("lucy", 2), ("pat", 3)).toDF("name", "genderCode")
rows.show
+----+----------+
|name|genderCode|
+----+----------+
| bob|         1|
|lucy|         2|
| pat|         3|
+----+----------+
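The archive cuts the example off after the input data; a minimal sketch of how the when()/otherwise() mapping itself might look, assuming the spark-shell implicits are in scope for the 'symbol column syntax (the labels "male", "female", and "unknown" are illustrative assumptions, not necessarily the thread's actual values):

rows
  .select(
    'name,
    when('genderCode === 1, "male")
      .when('genderCode === 2, "female")
      .otherwise("unknown") as "gender")   // otherwise() supplies the "else" branch
  .show
+----+-------+
|name| gender|
+----+-------+
| bob|   male|
|lucy| female|
| pat|unknown|
+----+-------+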

Re: Create a Column expression from a String

2016-11-21 Thread Stuart White
Yes, that's what I was looking for. Thanks!

On Mon, Nov 21, 2016 at 6:56 PM, Michael Armbrust <mich...@databricks.com> wrote:
> You are looking for org.apache.spark.sql.functions.expr()
>
> On Sat, Nov 19, 2016 at 6:12 PM, Stuart White <stuart.whi...@gmail.com> wrote:
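A minimal sketch of the expr() approach Michael points to, applied to the zip-code case described in the original question below (the column name zipCode and the expression string are assumptions for illustration):

import org.apache.spark.sql.functions.expr

// Build a Column from a SQL expression string supplied at runtime,
// e.g. read from application configuration. Note SQL substring is 1-based.
val zipExpr = "substring(zipCode, 1, 5)"

val df = Seq("123456789", "987654321").toDF("zipCode")
df.select(expr(zipExpr) as "zipCode5").show
+--------+
|zipCode5|
+--------+
|   12345|
|   98765|
+--------+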

Create a Column expression from a String

2016-11-19 Thread Stuart White
I'd like to allow for runtime-configured Column expressions in my Spark SQL application. For example, if my application needs a 5-digit zip code, but the file I'm processing contains a 9-digit zip code, I'd like to be able to configure my application with the expression "substring('zipCode, 0,

Re: sort descending with multiple columns

2016-11-18 Thread Stuart White
Is this what you're looking for?

import org.apache.spark.sql.functions.col

val df = Seq(
  (1, "A"), (1, "B"), (1, "C"), (2, "D"), (3, "E")
).toDF("foo", "bar")

val colList = Seq("foo", "bar")
df.sort(colList.map(col(_).desc): _*).show
+---+---+
|foo|bar|
+---+---+
|  3|  E|
|  2|  D|
|  1|  C|
|  1|  B|
|  1|  A|
+---+---+

Re: Best practice for preprocessing feature with DataFrame

2016-11-17 Thread Stuart White
...show
+---+-------+
|age| gender|
+---+-------+
| 90|   male|
| 80| female|
| 80|unknown|
+---+-------+

On Thu, Nov 17, 2016 at 8:57 AM, Stuart White <stuart.whi...@gmail.com> wrote:
> import org.apache.spark.sql.functions._
>
> val rows = Seq(("90s", 1), ("80s"

Re: Best practice for preprocessing feature with DataFrame

2016-11-17 Thread Stuart White
import org.apache.spark.sql.functions._

val rows = Seq(("90s", 1), ("80s", 2), ("80s", 3)).toDF("age", "gender")
rows.show
+---+------+
|age|gender|
+---+------+
|90s|     1|
|80s|     2|
|80s|     3|
+---+------+

val modifiedRows = rows
  .select(
    substring('age, 0, 2) as "age",
    when('gender
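The archive truncates the select; a minimal sketch of how the full transformation might continue, consistent with the output shown in the later reply above (the labels are inferred from that output; the rest is an assumption, not the thread's exact code):

val modifiedRows = rows
  .select(
    substring('age, 0, 2) as "age",                 // "90s" -> "90", "80s" -> "80"
    when('gender === 1, "male")
      .when('gender === 2, "female")
      .otherwise("unknown") as "gender")

modifiedRows.show
+---+-------+
|age| gender|
+---+-------+
| 90|   male|
| 80| female|
| 80|unknown|
+---+-------+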

Re: Joining to a large, pre-sorted file

2016-11-15 Thread Stuart White
> Thanks,
> Silvio
>
> From: Stuart White <stuart.whi...@gmail.com>
> Sent: Saturday, November 12, 2016 11:20:28 AM
> To: Silvio Fiorito
> Cc: user@spark.apache.org
> Subject: Re: Joining to a large, pre-sorted file
>
> Hi Silvio,

Re: Joining to a large, pre-sorted file

2016-11-12 Thread Stuart White
Thanks for the reply. I understand that I need to use bucketBy() to write my master file, but I still can't seem to make it work as expected. Here's a code example for how I'm writing my master file:

Range(0, 100)
  .map(i => (i, s"master_$i"))
  .toDF("key", "value")
  .write
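The snippet is cut off after .write; a minimal sketch of how the bucketed, sorted write and a later join might look (the bucket count, table name, and the transactions DataFrame are illustrative assumptions, not the thread's actual code):

// bucketBy()/sortBy() only apply when saving as a table, not as a plain file.
Range(0, 100)
  .map(i => (i, s"master_$i"))
  .toDF("key", "value")
  .write
  .bucketBy(8, "key")
  .sortBy("key")
  .saveAsTable("master")

// In a later job, read the bucketed table back and join against it.
// If the layouts line up, Spark can avoid re-shuffling and re-sorting
// the master side of the sort-merge join.
val master = spark.table("master")
val joined = transactions.join(master, Seq("key"))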

Re: Joining to a large, pre-sorted file

2016-11-10 Thread Stuart White
It seems like this functionality is pretty new so there aren't a lot of examples available.

On Thu, Nov 10, 2016 at 7:33 PM, Jörn Franke <jornfra...@gmail.com> wrote:
> Can you split the files beforehand into several files (e.g. by the column
> you do the join on)?
>
> On 10 Nov

Joining to a large, pre-sorted file

2016-11-10 Thread Stuart White
I have a large "master" file (~700m records) that I frequently join smaller "transaction" files to. (The transaction files have tens of millions of records, so they're too large for a broadcast join.) I would like to pre-sort the master file, write it to disk, and then, in subsequent jobs, read the file
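One way to check, in those subsequent jobs, whether a pre-sorted/bucketed layout is actually being picked up is to inspect the physical plan; a minimal sketch, assuming the master data was written with bucketBy()/sortBy()/saveAsTable() as in the reply above (table name, path, and column names are illustrative assumptions):

val master = spark.table("master")
val transactions = spark.read.parquet("/data/transactions")  // hypothetical path

transactions.join(master, Seq("key")).explain()
// In the plan, the master side of the SortMergeJoin should show no Exchange
// (shuffle) and, if it was also sorted within buckets, no extra Sort step.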