Re: how to create all possible combinations from an array? how to join and explode row array?

2018-03-30 Thread Andy Davidson
NICE! Thanks Brandon. Andy. From: Brandon Geise Date: Friday, March 30, 2018 at 6:15 PM To: Andrew Davidson , "user @spark" Subject: Re: how to create all possible combinations from an array? how to join and

Re: how to create all possible combinations from an array? how to join and explode row array?

2018-03-30 Thread Yong Zhang
What's wrong with just using a UDF doing a for loop in Scala? You can change the for-loop logic for whatever combination you want. scala> spark.version res4: String = 2.2.1 scala> aggDS.printSchema root |-- name: string (nullable = true) |-- colors: array (nullable = true) ||-- element: string
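The loop body such a UDF would run over each collected array can be sketched in plain Python (the function name and the dedup-then-sort step are my own choices, not from the thread; a Spark version would wrap the same logic in a Scala or Python UDF):

```python
from itertools import combinations

def pair_combinations(values):
    """All unordered pairs from one row's collected array,
    deduplicated and sorted for a stable output order."""
    return list(combinations(sorted(set(values)), 2))

print(pair_combinations(["red", "blue", "red"]))  # → [('blue', 'red')]
```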

Re: how to create all possible combinations from an array? how to join and explode row array?

2018-03-30 Thread Brandon Geise
Possibly instead of doing the initial grouping, just do a full outer join on zyzy.  This is in scala but should be easily convertible to python. val data = Array(("john", "red"), ("john", "blue"), ("john", "red"), ("bill", "blue"), ("bill", "red"), ("sam", "green"))     val distData:
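What that self-join on the key column produces can be sketched in plain Python (variable names are mine; a real Spark version would be something like `df.join(df, "name")` before deduplication):

```python
data = [("john", "red"), ("john", "blue"), ("john", "red"),
        ("bill", "blue"), ("bill", "red"), ("sam", "green")]

# Join the table with itself on the key: every (color, color) pair
# per name, with exact duplicates collapsed by the set.
pairs = sorted({(n1, c1, c2)
                for n1, c1 in data
                for n2, c2 in data
                if n1 == n2})
print(pairs)
```

For "john" (red, blue, red) this yields the four distinct ordered color pairs, which is the raw material a later crosstab or pair count needs.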

Re: how to create all possible combinations from an array? how to join and explode row array?

2018-03-30 Thread Andy Davidson
I was a little sloppy when I created the sample output. It's missing a few pairs. Assume for a given row I have [a, b, c]; I want to create something like the cartesian join. From: Andrew Davidson Date: Friday, March 30, 2018 at 5:54 PM To: "user @spark"

how to create all possible combinations from an array? how to join and explode row array?

2018-03-30 Thread Andy Davidson
I have a dataframe and execute df.groupBy("xyzy").agg(collect_list("abc")). This produces a column of type array. Now for each row I want to create multiple pairs/tuples from the array so that I can create a contingency table. Any idea how I can transform my data so that I can call crosstab()? The
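The "cartesian join" of one row's array with itself (the corrected goal from the follow-up message) can be sketched in plain Python; the list literal is illustrative:

```python
from itertools import product

# For one row whose collected array is [a, b, c], emit every ordered
# pair; exploding these pairs gives crosstab() its two columns.
colors = ["a", "b", "c"]
pairs = [(x, y) for x, y in product(colors, repeat=2)]
print(pairs)  # 9 ordered pairs, including (a, a) and both (a, c)/(c, a)
```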

Resource manage inside map function

2018-03-30 Thread Huiliang Zhang
Hi, I have a spark job which needs to access HBase inside a mapToPair function. The question is that I do not want to connect to HBase and close connection each time. As I understand, PairFunction is not designed to manage resources with setup() and close(), like Hadoop reader and writer. Does
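The usual workaround is mapPartitions (mapPartitionsToPair in the Java API), which hands your function a whole iterator so setup and teardown run once per partition rather than once per record. A minimal sketch, with a fake connection class standing in for a real HBase client:

```python
class FakeConnection:
    """Stand-in for an HBase client; counts instantiations so the
    one-connection-per-partition behaviour is visible."""
    opened = 0

    def __init__(self):
        FakeConnection.opened += 1

    def lookup(self, key):
        return (key, "value")

    def close(self):
        pass

def process_partition(rows):
    # In Spark this would be rdd.mapPartitions(process_partition):
    # one connection per partition, reused for every row.
    conn = FakeConnection()
    try:
        for row in rows:
            yield conn.lookup(row)
    finally:
        conn.close()

print(list(process_partition(["k1", "k2"])))
print("connections opened:", FakeConnection.opened)  # → 1
```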

Re: all spark settings end up being system properties

2018-03-30 Thread Koert Kuipers
thanks i will check our SparkSubmit class On Fri, Mar 30, 2018 at 2:46 PM, Marcelo Vanzin wrote: > Why: it's part historical, part "how else would you do it". > > SparkConf needs to read properties read from the command line, but > SparkConf is something that user code

Re: all spark settings end up being system properties

2018-03-30 Thread Marcelo Vanzin
Why: it's part historical, part "how else would you do it". SparkConf needs to read properties read from the command line, but SparkConf is something that user code instantiates, so we can't easily make it read data from arbitrary locations. You could use thread locals and other tricks, but user
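The precedence this describes can be sketched with a lookup chain (key names and values below are illustrative): values set explicitly on SparkConf win, then JVM system properties — which is where spark-submit puts `--conf` values — then defaults.

```python
from collections import ChainMap

system_properties = {"spark.foo": "bar"}          # written by spark-submit
explicit_conf     = {"spark.master": "local[2]"}  # set() in user code
defaults          = {"spark.master": "local", "spark.ui.enabled": "true"}

# First map that contains the key wins, mirroring the lookup order.
conf = ChainMap(explicit_conf, system_properties, defaults)
print(conf["spark.foo"], conf["spark.master"])  # → bar local[2]
```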

all spark settings end up being system properties

2018-03-30 Thread Koert Kuipers
does anyone know why all spark settings end up being system properties, and where this is done? for example when i pass "--conf spark.foo=bar" into spark-submit then System.getProperty("spark.foo") will be equal to "bar" i grepped the spark codebase for System.setProperty or System.setProperties

[Structured Streaming] HDFSBackedStateStoreProvider OutOfMemoryError

2018-03-30 Thread ahmed alobaidi
Hi All, I'm working on a simple structured streaming query that uses flatMapGroupsWithState to maintain a relatively large state. After running the application for a few minutes on my local machine, it starts to slow down and then crashes with OutOfMemoryError. Tracking the code led me to
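One knob worth checking (a hedged suggestion, not a confirmed fix for this report): HDFSBackedStateStoreProvider retains recent state-store versions, and `spark.sql.streaming.minBatchesToRetain` controls how many are kept; expiring stale groups via a GroupStateTimeout also bounds state growth. The jar name below is a placeholder.

```shell
spark-submit \
  --conf spark.sql.streaming.minBatchesToRetain=2 \
  your-streaming-app.jar
```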

Why fetchSize should be bigger than 0 in JDBCOptions.scala?

2018-03-30 Thread Young
My executor will OOM when using spark-sql to read data from MySQL. In sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala, I see the following lines. I'm wondering why JDBC_BATCH_FETCH_SIZE should be bigger than 0? val fetchSize = { val size =
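For the OOM itself, the usual remedy is to set a positive `fetchsize` and partition the JDBC read so no single task pulls the whole table. A hedged pyspark configuration sketch (the option names are standard JDBC source options; the URL, table, and bounds are placeholders, and this fragment assumes an existing `spark` session, so it is not runnable standalone):

```python
df = (spark.read.format("jdbc")
      .option("url", "jdbc:mysql://host/db")   # hypothetical URL
      .option("dbtable", "big_table")          # hypothetical table
      .option("fetchsize", "10000")            # rows per driver round-trip
      .option("partitionColumn", "id")         # split the scan across tasks
      .option("lowerBound", "1")
      .option("upperBound", "1000000")
      .option("numPartitions", "8")
      .load())
```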