Re: Add column value in the dataset on the basis of a condition

Shahab Yunus Tue, 18 Dec 2018 06:58:41 -0800

Sorry Devender, I hit send too soon by mistake. I meant to add
more info.


What I meant to say is that you can use withColumn with when/otherwise
to add the column conditionally in a single step. See an example here:
https://stackoverflow.com/questions/34908448/spark-add-column-to-dataframe-conditionally
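
Applied to the question quoted below, that suggestion might look roughly like the following. This is a minimal sketch, not a tested drop-in: the class name AgeNullFlag and the helpers withAgeNullFlag and sampleEmployees are made up for illustration, and the sample rows are built with an explicit schema because EmployeeBean's accessors and populateEmployees were not shown in full.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.when;

// Hypothetical class name, for illustration only.
public class AgeNullFlag {

    // Adds a boolean "is_age_null" column without filtering twice and unioning.
    static Dataset<Row> withAgeNullFlag(Dataset<?> ds) {
        return ds.withColumn("is_age_null",
                when(col("age").isNull(), true).otherwise(false));
    }

    // Hypothetical stand-in for populateEmployees(...): five rows shaped like
    // the ds1 output in the question (age is nullable).
    static Dataset<Row> sampleEmployees(SparkSession spark) {
        StructType schema = new StructType()
                .add("age", DataTypes.IntegerType, true)
                .add("id", DataTypes.LongType, false)
                .add("name", DataTypes.StringType, false)
                .add("salary", DataTypes.LongType, false);
        List<Row> rows = Arrays.asList(
                RowFactory.create(null, 1L, "dev1", 11000L),
                RowFactory.create(2, 2L, "dev2", 12000L),
                RowFactory.create(null, 3L, "dev3", 13000L),
                RowFactory.create(4, 4L, "dev4", 14000L),
                RowFactory.create(null, 5L, "dev5", 15000L));
        return spark.createDataFrame(rows, schema);
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .master("local[2]")
                .appName("age-null-flag")
                .getOrCreate();
        withAgeNullFlag(sampleEmployees(spark)).show();
        spark.stop();
    }
}
```

Since col("age").isNull() already evaluates to a boolean column, ds.withColumn("is_age_null", col("age").isNull()) should give the same flag; when/otherwise is more useful when the two branches need values other than true and false.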

On Tue, Dec 18, 2018 at 9:55 AM Shahab Yunus <shahab.yu...@gmail.com> wrote:

> Have you tried using withColumn? You can add a boolean column based on
> whether the age exists or not and then drop the older age column. You
> wouldn't need a union of DataFrames then.
>
> On Tue, Dec 18, 2018 at 8:58 AM Devender Yadav <
> devender.ya...@impetus.co.in> wrote:
>
>> Hi All,
>>
>>
>> Useful code:
>>
>> public class EmployeeBean implements Serializable {
>>
>>     private Long id;
>>
>>     private String name;
>>
>>     private Long salary;
>>
>>     private Integer age;
>>
>>     // getters and setters
>>
>> }
>>
>>
>> Relevant Spark code:
>>
>> SparkSession spark = SparkSession.builder()
>>         .master("local[2]")
>>         .appName("play-with-spark")
>>         .getOrCreate();
>> List<EmployeeBean> employees1 = populateEmployees(1, 10);
>>
>> Dataset<EmployeeBean> ds1 =
>>         spark.createDataset(employees1, Encoders.bean(EmployeeBean.class));
>> ds1.show();
>> ds1.printSchema();
>>
>> Dataset<Row> ds2 = ds1.where("age is null")
>>         .withColumn("is_age_null", lit(true));
>> Dataset<Row> ds3 = ds1.where("age is not null")
>>         .withColumn("is_age_null", lit(false));
>>
>> Dataset<Row> ds4 = ds2.union(ds3);
>> ds4.show();
>>
>>
>> Relevant Output:
>>
>>
>> ds1
>>
>> +----+---+----+------+
>> | age| id|name|salary|
>> +----+---+----+------+
>> |null|  1|dev1| 11000|
>> |   2|  2|dev2| 12000|
>> |null|  3|dev3| 13000|
>> |   4|  4|dev4| 14000|
>> |null|  5|dev5| 15000|
>> +----+---+----+------+
>>
>>
>> ds4
>>
>> +----+---+----+------+-----------+
>> | age| id|name|salary|is_age_null|
>> +----+---+----+------+-----------+
>> |null|  1|dev1| 11000|       true|
>> |null|  3|dev3| 13000|       true|
>> |null|  5|dev5| 15000|       true|
>> |   2|  2|dev2| 12000|      false|
>> |   4|  4|dev4| 14000|      false|
>> +----+---+----+------+-----------+
>>
>>
>> Is there any better solution to add this column in the dataset rather
>> than creating two datasets and performing union?
>>
>> <https://stackoverflow.com/questions/53834286/add-column-value-in-spark-dataset-on-the-basis-of-the-condition>
>>
>>
>>
>> Regards,
>> Devender
>>
