Hi All,
What is the best way to instrument metrics of a Spark application from both
the driver and the executors?
I am trying to send my Spark application metrics to Kafka. I found two
approaches.
*Approach 1:* Implement a custom Source and Sink, and use the Source for
instrumenting from both the driver and the executors.
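For context, custom sinks are wired up through conf/metrics.properties; a
sketch, where org.myorg.metrics.KafkaSink is a hypothetical class you would
implement yourself, deployed on the classpath of both driver and executors:

    # applies to all instances (master, worker, driver, executor)
    *.sink.kafka.class=org.myorg.metrics.KafkaSink
    # any extra keys under this prefix are passed to the sink as Properties
    *.sink.kafka.broker=localhost:9092
    *.sink.kafka.topic=spark-metrics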
I'm wondering if anyone has done distributed Cholesky decomposition on Spark.
I need to do it on a large matrix (200k x 200k), which would not fit on one
machine.
So far I've found:
I assume it's going to compare by the first column and, if equal, compare the
second column, and so on.
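A quick way to check that assumption in the shell (a throwaway example; data
and column names are made up):

    import org.apache.spark.sql.functions.{max, struct}
    import spark.implicits._

    // Structs compare field by field, left to right: x decides first,
    // y only breaks ties.
    Seq((1, "b"), (2, "a"), (2, "c")).toDF("x", "y")
      .agg(max(struct($"x", $"y")))
      .show(false)
    // prints [2, c] -- the row with the greatest x, ties broken by y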
From: kant kodali
Date: Wednesday, April 18, 2018 at 6:26 PM
To: Jungtaek Lim
Cc: Arun Iyer, Michael Armbrust
This is cool! It looks to me like this works too:

select data.* from (SELECT max(struct(my_timestamp, *)) as data from view1
group by id)

But I have a naive question again: what does max of a struct mean? Does it
always take the max of the first column and ignore the rest of the fields
in the struct?
Thanks Arun, I modified it a bit to try my best to avoid enumerating fields:

val query = socketDF
  .selectExpr("CAST(value AS STRING) as value")
  .as[String]
  .select(from_json($"value", schema = schema).as("data"))
  .select($"data.*")
  .groupBy($"ID")
  .agg(max(struct($"AMOUNT",
The below expr might work:

df.groupBy($"id")
  .agg(max(struct($"amount", $"my_timestamp")).as("data"))
  .select($"id", $"data.*")
Thanks,
Arun
From: Jungtaek Lim
Date: Wednesday, April 18, 2018 at 4:54 PM
To: Michael Armbrust
Cc: kant kodali
Thanks Michael for providing a great solution. It's great to remove the UDAF
and any need to provide fields manually.
Btw, your code has a compilation error: a ')' is missing, and after I fix
that, it complains about another issue:

:66: error: overloaded method value max with alternatives:
(columnName:
You can calculate argmax using a struct.
df.groupBy($"id").agg(max($"my_timestamp",
struct($"*").as("data")).getField("data").select($"data.*")
You could transcode this to SQL, it'll just be complicated nested queries.
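For reference, a corrected sketch of the same argmax-via-struct idea (not
necessarily what Michael intended to write):

    df.groupBy($"id")
      .agg(max(struct($"my_timestamp", $"*")).as("data"))
      .select($"data.*")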
On Wed, Apr 18, 2018 at 3:40 PM, kant kodali wrote:
Hi Arun,
I want to select the entire row with the max timestamp for each group. I
have modified my data set below to avoid any confusion.

*Input:*

id | amount | my_timestamp
---|--------|--------------------------
 1 |      5 | 2018-04-01T01:00:00.000Z
 1 |     10 |
Can't the "max" function be used here? Something like:

stream.groupBy($"id").max("amount").writeStream.outputMode("complete"/"update")...

Unless "stream" is already a grouped stream, in which case the above would
not work, since support for multiple aggregate operations is not there yet.
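A minimal runnable sketch of that suggestion, assuming `stream` is a
streaming DataFrame with id and amount columns:

    // Note: this returns only the max amount per id, not the entire row.
    val query = stream
      .groupBy($"id")
      .max("amount")
      .writeStream
      .outputMode("update")
      .format("console")
      .start()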
Has anyone run a Poisson GLM model and gotten the
GeneralizedLinearRegressionTrainingSummary object (to access p-values,
t-values, deviances, AIC, etc.) successfully?
I have tried fitting two datasets to compare Spark vs. R outputs; both models
ran fine in Spark and I was able to get the coefficients back.
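For reference, the pattern I'd expect to work -- a sketch, assuming a
DataFrame `training` with "label" and "features" columns:

    import org.apache.spark.ml.regression.GeneralizedLinearRegression

    val glr = new GeneralizedLinearRegression()
      .setFamily("poisson")
      .setLink("log")
    val model = glr.fit(training)
    // GeneralizedLinearRegressionTrainingSummary
    val summary = model.summary
    println(summary.aic)
    println(summary.deviance)
    println(summary.pValues.mkString(", "))  // with the default IRLS solver
    println(summary.tValues.mkString(", "))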
Try running AnalyzeTableCommand on both tables first.
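In SQL terms, something like this (a sketch; table names taken from the
quoted query below):

    // Collect table-level statistics so the optimizer can see table sizes
    // when deciding whether one side is small enough to broadcast.
    spark.sql("ANALYZE TABLE T1 COMPUTE STATISTICS")
    spark.sql("ANALYZE TABLE T2 COMPUTE STATISTICS")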
Can you check the value for spark.sql.autoBroadcastJoinThreshold?
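For reference, a quick way to inspect and adjust it from the shell (the
100 MB value is just an example):

    // Current threshold in bytes; -1 disables automatic broadcast joins.
    spark.conf.get("spark.sql.autoBroadcastJoinThreshold")
    // E.g. raise it so a larger dimension table still gets broadcast.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold",
      (100L * 1024 * 1024).toString)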
On 29 March 2018 at 14:41, Vitaliy Pisarev wrote:
> I am looking at the physical plan for the following query:
>
> SELECT f1,f2,f3,...
> FROM T1
> LEFT ANTI JOIN T2 ON T1.id = T2.id
> WHERE f1 =
Hi,
When I execute SQL like this:

"select * from onlineDevice where deviceId not in (select deviceId from
historyDevice)"

I found the task spends a lot of time (over 40 minutes). I stopped the task,
but I can't find the reason in the Spark history UI.
The historyDevice and onlineDevice tables only contain
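One thing worth trying -- a sketch, assuming deviceId is non-nullable on both
sides: rewrite the NOT IN subquery as a left anti join, which Spark usually
plans far more efficiently (NOT IN's null semantics can force a slow
nested-loop join):

    spark.sql("""
      SELECT o.*
      FROM onlineDevice o
      LEFT ANTI JOIN historyDevice h
        ON o.deviceId = h.deviceId
    """)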