Implementing a Spark metrics Source and Sink for custom application metrics

2018-04-18 Thread AnilKumar B
Hi All, What is the best way to instrument metrics of a Spark application from both the Driver and Executor? I am trying to send my Spark application metrics to Kafka. I found two approaches. *Approach 1:* Implement a custom Source and Sink and use the Source for instrumenting from both Driver and
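A minimal sketch of the Source side of Approach 1 (my illustration, not the poster's code; assumes Spark 2.x and the Dropwizard/Codahale metrics library Spark bundles). Both the Source trait and MetricsSystem are private[spark], so a common workaround is to put this code under an org.apache.spark package; Spark ships no Kafka sink, so a custom Sink would be wired up via metrics.properties:

package org.apache.spark.metrics.source

import com.codahale.metrics.{Counter, MetricRegistry}

// Hypothetical application-level source exposing a single counter.
class MyAppSource extends Source {
  override val sourceName: String = "myApp"
  override val metricRegistry: MetricRegistry = new MetricRegistry
  // Counter incremented from driver or executor code.
  val recordsProcessed: Counter = metricRegistry.counter("recordsProcessed")
}

Registration and use, from code that also lives under an org.apache.spark package (again a sketch):

val source = new MyAppSource
org.apache.spark.SparkEnv.get.metricsSystem.registerSource(source)
source.recordsProcessed.inc()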

distributed Cholesky on Spark?

2018-04-18 Thread qifan
I'm wondering if anyone has done a distributed Cholesky decomposition on Spark. I need to do it on a large matrix (200k x 200k) which would not fit on one machine. So far I've found:

Re: can we use mapGroupsWithState in raw sql?

2018-04-18 Thread Arun Mahadevan
I assume it's going to compare by the first column and, if equal, compare the second column, and so on.
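A quick spark-shell check of that ordering claim (column names invented for the example):

import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq((1, 9), (2, 1)).toDF("a", "b")
// struct(2, 1) beats struct(1, 9): comparison is field-by-field,
// left to right, so the first field decides unless it ties.
df.select(max(struct($"a", $"b"))).show()  // the max struct is [2, 1]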

Re: can we use mapGroupsWithState in raw sql?

2018-04-18 Thread kant kodali
This is cool! Looks to me this works too: select data.* from (SELECT max(struct(my_timestamp, *)) as data from view1 group by id). But I have a naive question again: what does max of a struct mean? Does it always take the max of the first column and ignore the rest of the fields in the struct?

Re: can we use mapGroupsWithState in raw sql?

2018-04-18 Thread Jungtaek Lim
Thanks Arun, I modified it a bit to try my best to avoid enumerating fields:

val query = socketDF
  .selectExpr("CAST(value AS STRING) as value")
  .as[String]
  .select(from_json($"value", schema = schema).as("data"))
  .select($"data.*")
  .groupBy($"ID")
  .agg(max(struct($"AMOUNT",
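The digest truncates the aggregation mid-expression; a plausible completion along the lines of Arun's expression below (the continuation is my assumption, not Jungtaek's verbatim code):

  .agg(max(struct($"AMOUNT", $"MY_TIMESTAMP")).as("data"))
  .select($"ID", $"data.*")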

Re: can we use mapGroupsWithState in raw sql?

2018-04-18 Thread Arun Mahadevan
The below expr might work:

df.groupBy($"id")
  .agg(max(struct($"amount", $"my_timestamp")).as("data"))
  .select($"id", $"data.*")

Thanks, Arun
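One caveat worth flagging (my note, not Arun's): struct comparison runs left to right, so with $"amount" first this returns the row with the largest amount per id. For the max-timestamp-per-group semantics requested in this thread, the timestamp field should come first:

df.groupBy($"id")
  .agg(max(struct($"my_timestamp", $"amount")).as("data"))
  .select($"id", $"data.*")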

Re: can we use mapGroupsWithState in raw sql?

2018-04-18 Thread Jungtaek Lim
Thanks Michael for providing a great solution. Great to remove the UDAF and any need to provide fields manually. Btw, your code has a compilation error: ')' is missing, and after I fix it, it complains again with another issue: :66: error: overloaded method value max with alternatives: (columnName:

Re: can we use mapGroupsWithState in raw sql?

2018-04-18 Thread Michael Armbrust
You can calculate argmax using a struct: df.groupBy($"id").agg(max($"my_timestamp", struct($"*").as("data")).getField("data").select($"data.*") You could transcode this to SQL; it'll just be complicated nested queries.
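Jungtaek's reply above notes this snippet doesn't compile as written (an unbalanced ')', and max applied to two arguments). For the SQL transcoding Michael mentions, a hedged sketch of the nested-query form, reusing the view and column names from elsewhere in the thread:

spark.sql("""
  SELECT data.*
  FROM (
    SELECT max(struct(my_timestamp, id, amount)) AS data
    FROM view1
    GROUP BY id
  )
""")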

Re: can we use mapGroupsWithState in raw sql?

2018-04-18 Thread kant kodali
Hi Arun, I want to select the entire row with the max timestamp for each group. I have modified my data set below to avoid any confusion.

*Input:*
id | amount | my_timestamp
---
1  | 5      | 2018-04-01T01:00:00.000Z
1  | 10     |

Re: can we use mapGroupsWithState in raw sql?

2018-04-18 Thread Arun Mahadevan
Can't the "max" function be used here? Something like: stream.groupBy($"id").max("amount").writeStream.outputMode("complete"/"update")... Unless the "stream" is already a grouped stream, in which case the above would not work since the support for multiple aggregate operations is not there yet.
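A runnable-shaped sketch of that suggestion (sink format and names assumed; note it yields only the max amount per id, not the whole row):

val query = stream
  .groupBy($"id")
  .max("amount")
  .writeStream
  .outputMode("complete")
  .format("console")
  .start()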

[Spark 2.3] GLM Poisson issue

2018-04-18 Thread svattig
Has anyone run a Poisson GLM model and successfully gotten the GeneralizedLinearRegressionTrainingSummary object (to access p-values, t-values, deviances, AIC, etc.)? I have tried to fit two datasets to compare Spark vs R outputs; both models ran fine in Spark and I was able to get the coefficients back.
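For reference, a minimal sketch of the fit described (assuming the spark.ml API on a DataFrame with "label"/"features" columns; not the poster's actual code):

import org.apache.spark.ml.regression.GeneralizedLinearRegression

val glr = new GeneralizedLinearRegression()
  .setFamily("poisson")
  .setLink("log")

val model = glr.fit(training)   // training: DataFrame with "label", "features"
val summary = model.summary     // GeneralizedLinearRegressionTrainingSummary
println(s"AIC: ${summary.aic}")
println(s"Deviance: ${summary.deviance}")
println(s"p-values: ${summary.pValues.mkString(", ")}")
println(s"t-values: ${summary.tValues.mkString(", ")}")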

Re: Why doesn't spark use broadcast join?

2018-04-18 Thread Kurt Fehlhauer
Try running AnalyzeTableCommand on both tables first.
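AnalyzeTableCommand is what the ANALYZE TABLE statement runs under the hood, so an equivalent sketch (table names taken from the quoted query below) is:

spark.sql("ANALYZE TABLE T1 COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE T2 COMPUTE STATISTICS")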

Re: Why doesn't spark use broadcast join?

2018-04-18 Thread Matteo Cossu
Can you check the value for spark.sql.autoBroadcastJoinThreshold?

On 29 March 2018 at 14:41, Vitaliy Pisarev wrote:
> I am looking at the physical plan for the following query:
>
> SELECT f1, f2, f3, ...
> FROM T1
> LEFT ANTI JOIN T2 ON T1.id = T2.id
> WHERE f1 =
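A sketch of inspecting and raising that threshold, or forcing the broadcast with a hint (t1/t2 stand in for the DataFrames behind T1/T2; the default threshold is 10 MB, and for a LEFT ANTI JOIN it is the right side that can be broadcast):

spark.conf.get("spark.sql.autoBroadcastJoinThreshold")   // "10485760" by default
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 100L * 1024 * 1024)

import org.apache.spark.sql.functions.broadcast
t1.join(broadcast(t2), t1("id") === t2("id"), "left_anti")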

"not in" sql spend a lot of time

2018-04-18 Thread 崔苗
Hi, when I execute SQL like this: "select * from onlineDevice where deviceId not in (select deviceId from historyDevice)", I found the task spends a lot of time (over 40 min). I stopped the task, but I couldn't find the reason in the Spark history UI. The historyDevice and onlineDevice only contain
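One common cause (a general observation, not from this thread): NOT IN must be null-aware, which can force Spark into a slow BroadcastNestedLoopJoin. When deviceId is known to be non-null on both sides, an equivalent and usually much faster rewrite is a left anti join:

spark.sql("""
  SELECT o.*
  FROM onlineDevice o
  LEFT ANTI JOIN historyDevice h
    ON o.deviceId = h.deviceId
""")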