Hi,
I have a raw source DataFrame with two columns, as below:

timestamp:   2019-11-29 9:30:45
message_log: <123>NOV 29 10:20:35 ips01 sfids: connection: tcp,bytes:104,user:unknown,url:unknown,host:127.0.0.1

How do we break each of the key:value pairs above into separate columns using
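One way this could be done (a sketch in Spark's Scala DataFrame API, not taken from this thread; it assumes each field of interest always appears as a comma-delimited key:value token inside message_log, and the frame/column names are just stand-ins):

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions._

  val spark = SparkSession.builder().appName("kv-parse").getOrCreate()
  import spark.implicits._

  // Stand-in for the raw source frame described above.
  val raw = Seq(
    ("2019-11-29 9:30:45",
     "<123>NOV 29 10:20:35 ips01 sfids: connection: tcp,bytes:104,user:unknown,url:unknown,host:127.0.0.1")
  ).toDF("timestamp", "message_log")

  // Pull each field out with a regex; ([^,]+) captures everything up to the next comma.
  val parsed = raw.select(
    $"timestamp",
    regexp_extract($"message_log", "connection:\\s*([^,]+)", 1).as("protocol"),
    regexp_extract($"message_log", "bytes:([^,]+)", 1).as("bytes"),
    regexp_extract($"message_log", "user:([^,]+)", 1).as("user"),
    regexp_extract($"message_log", "url:([^,]+)", 1).as("url"),
    regexp_extract($"message_log", "host:([^,]+)", 1).as("host"))

  parsed.show(truncate = false)

regexp_extract yields an empty string when a key is absent, so missing fields simply come out blank rather than failing the parse.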
This seems like a suboptimal situation for a join. How can Spark know in
advance that all the fields are present and the tables have the same number
of rows? I suppose you could just sort the two frames by id and concatenate
them, but I'm not sure what join optimization is available here.
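As a quick check on what optimization actually applies, the physical plan can be printed with explain(). A minimal sketch, using toy stand-ins shaped like the A and B frames from the question later in this digest:

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("join-plan").getOrCreate()
  import spark.implicits._

  // Toy stand-ins shaped like A [unique_id, field1] and B [unique_id, field2].
  val A = Seq((1L, "a1"), (2L, "a2")).toDF("unique_id", "field1")
  val B = Seq((1L, "b1"), (2L, "b2")).toDF("unique_id", "field2")

  // explain() prints which strategy the planner chose for this query
  // (e.g. SortMergeJoin or BroadcastHashJoin).
  A.join(B, Seq("unique_id")).select($"unique_id", $"field1").explain()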
Hello,
In ranking_metrics_example.py
(https://github.com/apache/spark/blob/master/examples/src/main/python/mllib/ranking_metrics_example.py)
there is this comment:
# Several of the methods available in scala are currently missing from pyspark
I just wanted to know which methods are missing.
I have two DataFrames, let's call them A and B.
A is made up of [unique_id, field1]
B is made up of [unique_id, field2]
They have exactly the same number of rows, and every id in A is also present in B.
If I execute a join like this: A.join(B, Seq("unique_id")).select($"unique_id", $"field1")