Re: PySpark - Expand rows into dataframes via function

2017-10-03 Thread Sathish Kumaran Vairavelu
Flatmap works too. The explode function is the SQL/DataFrame way of doing a one-to-many operation. Both should work. Thanks
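For illustration, a minimal sketch of the explode alternative mentioned here, assuming a DataFrame df that already holds the expanded ranges in an array column named subnets next to a registry_id column (both names are hypothetical, not from the thread):

    from pyspark.sql import functions as F

    # One output row per element of the 'subnets' array column.
    exploded = df.select('registry_id',
                         F.explode('subnets').alias('subnet'))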

Re: PySpark - Expand rows into dataframes via function

2017-10-03 Thread Patrick McCarthy
Thanks Sathish.

Before you responded, I came up with this solution:

    # A function to take in one row and return the expanded ranges:
    def processRow(x):
        ...
        return zip(list_of_ip_ranges, list_of_registry_ids)

    # and then in spark,
    processed_rdds = spark_df_of_input_data.rdd.flatMap(lambda x: processRow(x))
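A runnable sketch of this pattern, filling in a hypothetical schema and a processRow built on the standard-library ipaddress module (the column names and the /24 arithmetic are assumptions, not from the thread):

    import ipaddress
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical schema mirroring the ARIN rows from the original post.
    df = spark.createDataFrame(
        [('arin', 'US', 'ipv4', '23.239.160.0', 8192, 'key1')],
        ['registry', 'cc', 'type', 'start_ip', 'num_ips', 'registry_id'])

    def processRow(row):
        # Each /24 holds 256 addresses, so an 8192-address block
        # expands into 32 consecutive /24 subnets.
        start = int(ipaddress.ip_address(row['start_ip']))
        ranges = [str(ipaddress.ip_network((start + i * 256, 24)))
                  for i in range(row['num_ips'] // 256)]
        return zip(ranges, [row['registry_id']] * len(ranges))

    processed_rdds = df.rdd.flatMap(lambda x: processRow(x))
    result = processed_rdds.toDF(['ip_range', 'registry_id'])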

Re: PySpark - Expand rows into dataframes via function

2017-10-02 Thread Sathish Kumaran Vairavelu
It's possible with the array function combined with the struct construct. Below is a SQL example:

    select array(struct(ip1, hashkey), struct(ip2, hashkey))
    from (select substr(col1, 1, 2) as ip1,
                 substr(col1, 3, 3) as ip2,
                 etc,
                 hashkey
          from object) a

If you want dynamic IP ranges, you need to dynamically
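To then get one output row per IP, the array of structs can be flattened with explode. A sketch under the same hypothetical table and column names, using named_struct so the struct fields carry explicit labels:

    spark.sql("""
        SELECT item.ip, item.hashkey
        FROM (SELECT array(named_struct('ip', ip1, 'hashkey', hashkey),
                           named_struct('ip', ip2, 'hashkey', hashkey)) AS ips
              FROM (SELECT substr(col1, 1, 2) AS ip1,
                           substr(col1, 3, 3) AS ip2,
                           hashkey
                    FROM object) a) b
        LATERAL VIEW explode(ips) t AS item
    """)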

PySpark - Expand rows into dataframes via function

2017-10-02 Thread Patrick McCarthy
Hello,

I'm trying to map ARIN registry files into more explicit IP ranges. They provide a number of IPs in the range (here it's 8192) and a starting IP, and I'm trying to map it into all the included /24 subnets. For example,

Input:

    array(['arin', 'US', 'ipv4', '23.239.160.0', 8192, 20131104.0,
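For reference, the per-row expansion being asked for can be sketched with the standard-library ipaddress module, assuming the 8192-address block is aligned (8192 addresses is one /19, i.e. 32 /24 subnets):

    import ipaddress

    start = ipaddress.ip_address('23.239.160.0')
    end = start + 8192 - 1    # last address in the block
    for net in ipaddress.summarize_address_range(start, end):
        print([str(s) for s in net.subnets(new_prefix=24)])
    # ['23.239.160.0/24', '23.239.161.0/24', ..., '23.239.191.0/24']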