Flatten log data using PySpark

2019-11-29 Thread anbutech
Hi, I have a raw source data frame with two columns, as below:

timestamp: 2019-11-29 9:30:45
message_log: <123>NOV 29 10:20:35 ips01 sfids: connection: tcp,bytes:104,user:unknown,url:unknown,host:127.0.0.1

How do we break each key/value above out into a separate column using PySpark?
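One way to approach this is to write the per-row parsing logic first and then wrap it in a PySpark UDF (or express it with split/regexp_extract) so each key becomes its own column. Below is a minimal plain-Python sketch of that parsing step, using only the single sample line from the question; the assumption that the key/value payload always follows the "sfids: " tag is taken from that sample and may not hold for other log formats.

```python
def parse_message_log(message_log):
    """Return a dict of key/value pairs from the comma-separated payload."""
    # Keep only the part after the process tag ("sfids: ") -- an assumption
    # based on the one sample line in the question.
    payload = message_log.split("sfids: ", 1)[-1]
    pairs = {}
    for item in payload.split(","):
        # Split on the first ':' only, so values like "127.0.0.1" survive.
        key, _, value = item.partition(":")
        pairs[key.strip()] = value.strip()
    return pairs

sample = ("<123>NOV 29 10:20:35 ips01 sfids: "
          "connection: tcp,bytes:104,user:unknown,url:unknown,host:127.0.0.1")
print(parse_message_log(sample))
# {'connection': 'tcp', 'bytes': '104', 'user': 'unknown',
#  'url': 'unknown', 'host': '127.0.0.1'}
```

In Spark itself, one would typically register this as a UDF returning a MapType and then select each key as a column, or avoid the UDF entirely with the built-in split and regexp_extract functions.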

Re: Any way to make catalyst optimise away join

2019-11-29 Thread Jerry Vinokurov
This seems like a suboptimal situation for a join. How can Spark know in advance that all the fields are present and that the tables have the same number of rows? I suppose you could just sort the two frames by id and concatenate them, but I'm not sure what join optimization is available here.
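The "sort both frames by id and concatenate" idea from the reply can be sketched in plain Python; the data and column names below are made up for illustration. Note that this only produces correct rows under the guarantees stated in the original question: both sides have the same number of rows and every id appears on both sides.

```python
# Two small "frames" as lists of row dicts, deliberately out of order.
a = [{"unique_id": 2, "field1": "x"}, {"unique_id": 1, "field1": "y"}]
b = [{"unique_id": 1, "field2": 10}, {"unique_id": 2, "field2": 20}]

# Sort both sides by the key, then zip row-for-row instead of joining.
a_sorted = sorted(a, key=lambda r: r["unique_id"])
b_sorted = sorted(b, key=lambda r: r["unique_id"])
combined = [{**ra, **rb} for ra, rb in zip(a_sorted, b_sorted)]

print(combined)
# [{'unique_id': 1, 'field1': 'y', 'field2': 10},
#  {'unique_id': 2, 'field1': 'x', 'field2': 20}]
```

If either guarantee fails (a missing or duplicated id), the zip silently pairs the wrong rows, which is exactly why an optimizer cannot make this substitution without knowing those invariants.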

[SPARK MLlib Beginner] What are the ranking metrics methods available in scala that are missing in python?

2019-11-29 Thread Mohd Shukri Hasan
Hello, in ranking_metrics_example.py (https://github.com/apache/spark/blob/master/examples/src/main/python/mllib/ranking_metrics_example.py) there is this comment: "# Several of the methods available in scala are currently missing from pyspark". I just wanted to know what the missing methods are.
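For a sense of what the comment is referring to, here is a minimal plain-Python sketch of one metric the Scala RankingMetrics class exposes, precision at k: the fraction of the top-k predicted items that are relevant. The example data is made up.

```python
def precision_at_k(predicted, relevant, k):
    """Fraction of the top-k predicted items that appear in the relevant set."""
    relevant_set = set(relevant)
    top_k = predicted[:k]
    hits = sum(1 for item in top_k if item in relevant_set)
    return hits / k

# Made-up ranking: items 1 and 2 are relevant among the top 5 predictions.
predicted = [1, 6, 2, 7, 8, 3, 9, 10, 4, 5]
relevant = [1, 2, 3, 4, 5]
print(precision_at_k(predicted, relevant, 5))  # 0.4
```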

Any way to make catalyst optimise away join

2019-11-29 Thread jelmer
I have 2 dataframes, let's call them A and B. A is made up of [unique_id, field1] and B is made up of [unique_id, field2]. They have the exact same number of rows, and every id in A is also present in B. If I execute a join like this: A.join(B, Seq("unique_id")).select($"unique_id", $"field1")
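To make the question concrete, here is a plain-Python sketch of the join being described (a hash join on unique_id), with made-up data. Because only field1 is selected, B's role reduces to a membership filter, and under the stated guarantee that every id in A is also present in B, that filter never removes anything; this is the intuition behind hoping the optimizer could prune the join entirely.

```python
a = [{"unique_id": 1, "field1": "x"}, {"unique_id": 2, "field1": "y"}]
b = [{"unique_id": 1, "field2": 10}, {"unique_id": 2, "field2": 20}]

# Build side of the hash join: index B by the join key.
b_index = {row["unique_id"]: row for row in b}

# The equivalent of A.join(B, "unique_id").select("unique_id", "field1"):
# B's columns are never read, only its keys.
joined = [
    {"unique_id": ra["unique_id"], "field1": ra["field1"]}
    for ra in a
    if ra["unique_id"] in b_index  # never filters anything under the guarantee
]

print(joined)
# [{'unique_id': 1, 'field1': 'x'}, {'unique_id': 2, 'field1': 'y'}]
```

The catch is that the optimizer cannot see the "every id in A exists exactly once in B" invariant from the schemas alone; without a uniqueness and foreign-key guarantee, dropping the join could change both the row count and the result.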