Spark using iceberg

2023-06-15 Thread Gaurav Agarwal
Hi,

I am using Spark with Iceberg and updating a table that has 1700 columns.
We are loading 0.6 million rows from Parquet files (in the future it will be
16 million rows) and trying to update the data in a table that has 16
buckets. We use Spark's default partitioner and do not repartition the
dataset on the bucketing column. When we use Iceberg's MERGE INTO strategy,
one of the executors fails with an OOME, recovers, and then fails again:

MERGE INTO target USING (SELECT * FROM source) ON target.id = source.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *

But when we do a blind append instead, it works.

Questions:

1. How do we find out what the issue is? We are running Spark on an EKS
cluster; when an executor hits an OOME the pod dies and its logs are gone
with it, so we are unable to see them.

2. Do we need to repartition the dataset on the bucketing column, and if
so, at load time or after the data is loaded?

Any help understanding this would be appreciated.
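For context on what "repartitioning on the bucketing column" means here: Iceberg's bucket transform assigns each row to one of N buckets by hashing the key (Iceberg uses a Murmur3-based hash; the sketch below uses a plain modulo purely for illustration, and all names are illustrative). If the Spark dataset is not distributed the same way, rows destined for the same bucket are scattered across tasks, and a MERGE over 1700 columns can pile up a lot of state in a few executors. A minimal sketch of the grouping idea:

```python
# Illustrative only: Iceberg's real bucket(n, col) transform hashes the key
# with Murmur3; here a simple deterministic hash shows the grouping idea.
from collections import defaultdict

NUM_BUCKETS = 16

def bucket(key: int, n: int = NUM_BUCKETS) -> int:
    """Assign a key to one of n buckets (stand-in for Iceberg's bucket(n, col))."""
    return hash(key) % n

# Group sample row ids by bucket: rows sharing a bucket should ideally be
# handled by the same Spark task, so each task writes to few output files.
rows_by_bucket = defaultdict(list)
for row_id in range(100):
    rows_by_bucket[bucket(row_id)].append(row_id)

# Every row lands in exactly one of the 16 buckets.
assert all(0 <= b < NUM_BUCKETS for b in rows_by_bucket)
```

With default Spark partitioning, each task may hold rows for many of the 16 buckets at once, which is one plausible source of the memory pressure described above.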



Fwd: iceberg queries

2023-06-15 Thread Gaurav Agarwal
Hi Team,

Sample Merge query:

df.createOrReplaceTempView("source")

MERGE INTO iceberg_hive_cat.iceberg_poc_db.iceberg_tab target
USING (SELECT * FROM source)
ON target.col1 = source.col1  -- col1 is the bucket column
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
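One thing worth trying (a sketch, assuming Spark 3.x, where the REPARTITION hint accepts column arguments): repartition the source on the bucket column inside the USING clause, so rows with the same key land in the same task. Note that Spark's hash partitioning on col1 is not guaranteed to line up one-to-one with Iceberg's bucket(16, col1) transform (which uses a Murmur3-based hash), but it does collocate rows per key:

```sql
MERGE INTO iceberg_hive_cat.iceberg_poc_db.iceberg_tab target
USING (SELECT /*+ REPARTITION(16, col1) */ * FROM source) source
ON target.col1 = source.col1
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
```

The subquery is aliased as `source` here so that `source.col1` in the ON clause resolves against it.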

The source dataset is a temporary view containing 1.5 million records (in
the future it may be 20 million rows), keyed by ids that map to 16 buckets.
The target Iceberg table has 16 buckets. The MERGE updates rows whose ids
match and inserts those that do not.

I have 1700 columns in my table.

The Spark dataset uses default partitioning; do we need to repartition the
Spark dataset on the bucket column as well?
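Rather than repartitioning by hand, Iceberg can also be asked to arrange the distribution itself: the table property `write.distribution-mode = hash` requests that the writer hash-distribute rows by the table's partition spec before writing, and there is a merge-specific `write.merge.distribution-mode` as well. A sketch, assuming a recent Iceberg release where these properties are supported:

```sql
ALTER TABLE iceberg_hive_cat.iceberg_poc_db.iceberg_tab SET TBLPROPERTIES (
  'write.distribution-mode' = 'hash',
  'write.merge.distribution-mode' = 'hash'
);
```

With this set, the shuffle that aligns rows to buckets is planned by Iceberg's writer rather than depending on the incoming dataset's partitioning.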

Let me know if you need any further details.

The MERGE fails with an OOME.

Regards
Gaurav