For some reason my pasted screenshots were removed when I sent the email
(at least that's how it appeared on my end). Repasting as text below.

The sequence you are referring to represents the list of column names to
fill. I am asking about filling a column which is of type list with an
empty list.

Here is a quick example of what I am doing:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.collect_list

// Case class assumed for the example rows (definition omitted in the original snippet).
case class IntPair(key: String, value: Int)

val spark =
  SparkSession.builder().master("local[*]").appName("test").getOrCreate()
import spark.implicits._

val list = List(IntPair(key = "a", value = 1),
                IntPair(key = "a", value = 2),
                IntPair(key = "b", value = 2))
val df = spark.createDataset(list).toDF
df.show
val collectList = df.groupBy($"key").agg(collect_list("value") as "listylist")
collectList.show
collectList.printSchema()
collectList.na.fill(Array(), Seq("listylist")) // this is the line that doesn't compile


The output of the show and printSchema for the collectList df:

+---+---------+
|key|listylist|
+---+---------+
|  b|      [2]|
|  a|   [1, 2]|
+---+---------+

root
 |-- key: string (nullable = true)
 |-- listylist: array (nullable = true)
 |    |-- element: integer (containsNull = true)


So, the last line, which doesn't compile, is what I would want to do (after
an outer join, of course; it's only needed in that particular case, where a
null could be populated in that field).
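
In the meantime, the workaround I'm leaning towards (an untested sketch; I'm
assuming the Column API behaves the same way in 2.0.2) is to replace the
nulls explicitly rather than going through na.fill, either with
when/otherwise and an empty array literal, or with a small UDF:

import org.apache.spark.sql.functions.{array, udf, when}

// Option 1: substitute an empty array (cast to the column's type) when the value is null.
val filledViaWhen = collectList.withColumn(
  "listylist",
  when($"listylist".isNull, array().cast("array<int>")).otherwise($"listylist"))

// Option 2: a small UDF that maps a null column value to an empty Seq.
val emptyIfNull = udf((xs: Seq[Int]) => Option(xs).getOrElse(Seq.empty[Int]))
val filledViaUdf = collectList.withColumn("listylist", emptyIfNull($"listylist"))

Both are just sketches of what I have in mind; I haven't verified them
against 2.0.2 yet.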

Thanks,
Sumona



On Tue, Apr 11, 2017 at 9:50 AM Sumona Routh <sumos...@gmail.com> wrote:

> The sequence you are referring to represents the list of column names to
> fill. I am asking about filling a column which is of type list with an
> empty list.
>
> Here is a quick example of what I am doing:
>
>
> The output of the show and printSchema for the collectList df:
>
>
>
> So, the last line which doesn't compile is what I would want to do (after
> outer joining of course, it's not necessary except in that particular case
> where a null could be populated in that field).
>
> Thanks,
> Sumona
>
> On Tue, Apr 11, 2017 at 2:02 AM Didac Gil <didacgil9...@gmail.com> wrote:
>
> It does support it, at least in 2.0.2, which I am running.
>
> Here is one example:
>
> val parsedLines = stream_of_logs
>   .map(line => p.parseRecord_viaCSVParser(line))
>   .join(appsCateg, $"Application" === $"name", "left_outer")
>   .drop("id")
>   .na.fill(0, Seq("numeric_field1", "numeric_field2"))
>   .na.fill("", Seq("text_field1", "text_field2", "text_field3"))
>
>
> Notice that you have to differentiate the fields that are meant to be
> filled with an int from those that require a different value (an empty
> string in my case).
>
> On 11 Apr 2017, at 03:18, Sumona Routh <sumos...@gmail.com> wrote:
>
> Hi there,
> I have two dataframes that each have some columns which are of list type
> (array<int> generated by the collect_list function actually).
>
> I need to outer join these two dfs; however, by the nature of an outer
> join I am sometimes left with null values. Normally I would use
> df.na.fill(...), but it appears the fill function doesn't support this
> data type.
>
> Can anyone recommend an alternative? I have also been playing around with
> coalesce in a SQL expression, but I'm not having any luck there either.
>
> Obviously, I can do a null check on the fields downstream, but it is not
> in the spirit of Scala to pass around nulls, so I wanted to see if I was
> missing another approach first.
>
> Thanks,
> Sumona
>
> I am using Spark 2.0.2
>
> Didac Gil de la Iglesia
> PhD in Computer Science
> didacg...@gmail.com
> Spain:     +34 696 285 544
> Sweden: +46 (0)730229737
> Skype: didac.gil.de.la.iglesia
