Select Columns from Dataframe in Java

2023-12-29 Thread PRASHANT L
Team
I am using Java and want to select columns from a DataFrame; the columns are
stored in a List. I am looking for the equivalent of the Scala code below:

  array_df = array_df.select(fields: _*)


When I try array_df = array_df.select(fields), I get an error about casting
to Column.

I am using Spark 3.4


Re: Select Columns from Dataframe in Java

2023-12-30 Thread PRASHANT L
Hi Grisha
This is great :) It worked, thanks a lot.

I have another requirement: I will be running my Spark application on EMR
and want to build custom logging that writes logs to S3. Any idea what I
should do? In general, if I create a custom log (with my application name),
where will the logs be generated when running in cluster mode, since in
cluster mode the jobs are executed across different machines?

On Sat, Dec 30, 2023 at 1:56 PM Grisha Weintraub 
wrote:

> In Java, select expects an array of Columns, so you can simply convert
> your list to an array:
>
> array_df.select(fields.toArray(new Column[0]))
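For comparison, PySpark sidesteps the issue with argument unpacking (`df.select(*fields)`). The varargs mechanics behind Scala's `: _*` and the Java array conversion can be sketched in plain Python; the `select` function below is a stand-in for illustration, not the Spark API:

```python
# Stand-in for a varargs API like DataFrame.select(Column... cols) in Java.
def select(*cols):
    # Receives each column as a separate argument, like Java varargs
    # or Scala's Column* parameter.
    return list(cols)

fields = ["name", "dept", "doj"]

# Passing the list directly yields ONE argument of type list --
# the Python analogue of the cast-to-Column error in Java.
wrong = select(fields)    # [["name", "dept", "doj"]]

# Unpacking spreads the list into separate arguments, like
# fields.toArray(new Column[0]) in Java or fields: _* in Scala.
right = select(*fields)   # ["name", "dept", "doj"]
```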


Structured Streaming Process Each Records Individually

2024-01-10 Thread PRASHANT L
Hi
I have a use case where I need to process JSON payloads coming from Kafka
using Structured Streaming. The catch is that the JSON can have different
formats; the schema is not fixed. Each JSON document has a @type tag, so
based on that tag the JSON has to be parsed and loaded into a table named
after the tag, and if a JSON document has nested sub-tags, those should go
to different tables. So I need to process each JSON record individually and
determine the destination tables. What would be the best approach?


> {
>   "os": "andriod",
>   "type": "mobile",
>   "device": {
>     "warrenty": "3 years",
>     "replace": "yes"
>   },
>   "zones": [
>     {
>       "city": "Bangalore",
>       "state": "KA",
>       "pin": "577401"
>     },
>     {
>       "city": "Mumbai",
>       "state": "MH",
>       "pin": "576003"
>     }
>   ],
>   "@table": "product"
> }


So for the above JSON, three tables are created:
1. product (named from the @table tag), the parent table
2. product_zones and product_devices, child tables
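One possible approach (a sketch, not the only option) is to parse each record, read its tag, and split nested objects and arrays into child-table rows; in Structured Streaming this logic could run inside a foreachBatch sink. The routing itself can be sketched in plain Python, with table names derived from the key names in the example payload above:

```python
import json

def route_record(raw: str) -> dict:
    """Split one JSON payload into per-table row lists, keyed by table name."""
    doc = json.loads(raw)
    parent_table = doc.pop("@table")  # e.g. "product"
    rows = {}
    parent_row = {}
    for key, value in doc.items():
        if isinstance(value, dict):
            # Nested object -> one row in a child table, e.g. product_device.
            rows[f"{parent_table}_{key}"] = [value]
        elif isinstance(value, list):
            # Nested array -> one row per element, e.g. product_zones.
            rows[f"{parent_table}_{key}"] = list(value)
        else:
            # Scalar fields stay in the parent table's row.
            parent_row[key] = value
    rows[parent_table] = [parent_row]
    return rows

payload = ('{"os": "andriod", "type": "mobile", '
           '"device": {"warrenty": "3 years"}, '
           '"zones": [{"city": "Bangalore"}, {"city": "Mumbai"}], '
           '"@table": "product"}')
tables = route_record(payload)
# tables maps "product", "product_device", and "product_zones" to row lists.
```

Inside foreachBatch, each `rows[table]` list would then be appended to the matching table; the exact write call depends on the sink and catalog in use.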


Create Custom Logs

2024-01-31 Thread PRASHANT L
Hi
I just wanted to check whether there is a way to create a custom log in
Spark. I want to write selective/custom log messages to S3, running
spark-submit on EMR.
I do not want all the Spark-generated logs; I only need the log messages
that are logged as part of my Spark application.
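One way to keep application messages separate from Spark's own output is a dedicated named logger writing to its own file. A minimal Python sketch with the stdlib `logging` module (the name `MyApp` and the `/tmp` path are placeholders, not anything Spark-specific):

```python
import logging

APP_NAME = "MyApp"  # hypothetical application name, substitute your own

# A dedicated named logger; Spark's log4j output never passes through it,
# so the file holds only this application's messages.
logger = logging.getLogger(APP_NAME)
logger.setLevel(logging.INFO)
logger.propagate = False  # keep app messages out of the root logger

handler = logging.FileHandler(f"/tmp/{APP_NAME}.log", mode="w")
handler.setFormatter(logging.Formatter(
    "%(asctime)s %(name)s %(levelname)s %(message)s"))
logger.addHandler(handler)

logger.info("loaded batch of 42 records")  # lands in /tmp/MyApp.log
logging.getLogger("other.noise").warning("not written to our file")
```

In cluster mode this file is written on whichever node runs the driver (and on each executor, if executors log), so it has to be collected: EMR already copies container logs to the S3 location configured as the cluster's log URI, or the application can upload the file itself at shutdown (for example with boto3, an extra dependency not shown here).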


Error While Running Merge Statement With Iceberg

2024-07-30 Thread PRASHANT L
Hi
I am trying to run a MERGE statement with Iceberg on Spark, and I am
getting the error below.


This is the statement:


MERGE INTO glue.euc_dp_prd.emp_merge_test AS t
USING source AS s
ON t.emp_id = s.emp_id
WHEN MATCHED THEN UPDATE SET t.name = s.name, t.dept = s.dept, t.doj = s.doj

pyspark.errors.exceptions.captured.AnalysisException:
[INVALID_NON_DETERMINISTIC_EXPRESSIONS] The operator expects a
deterministic expression, but the actual expression is "(emp_id =
emp_id)", "exists(emp_id)".; line 1 pos 0;