Hello Community,

I have checked this issue on various platforms but could not find a satisfactory answer.

I am using Spark with Java on a large data cluster. My application makes more than 10 API calls, and each call returns a Java list. Every list item has the same structure (i.e. the same Java class). I want to write each list to the same Avro directory, for example `/data/network_response/`.

What I am doing is: after each API call, I convert the list to an `org.apache.spark.sql.Dataset` object using this snippet:

`sqlContext.createDataFrame(results, classOfListEntry)`

and then write the data to disk using this snippet:

`records.write().mode(saveMode).format("com.databricks.spark.avro").save(folderPath)`

Here, `saveMode` is set to `Overwrite` for the first API call's write. For each subsequent call, I set `saveMode` to `Append`. The goal is to write all API responses to the same directory in Avro format.

My questions are:

1. Is this an efficient way to do it?
2. When I append to the Avro directory, is the existing content overwritten each time?
3. Where is the documentation specifying what algorithm Spark uses to append Avro content written this way? What exactly happens when a Dataset is appended to an existing Avro folder?
4. Should I collect all the API Datasets into one union Dataset and write the Avro content once, rather than writing to the directory after each network call?

Regards,
Rushikesh.
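For reference, here is a minimal sketch of the flow described above. `callApi` and the `ResponseEntry` bean are placeholders for my real API call and row class; the directory path and number of calls are just examples:

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class AvroAppendSketch {

    // Placeholder bean for one list entry; the real class has the same shape for every call.
    public static class ResponseEntry {
        private String id;
        public String getId() { return id; }
        public void setId(String id) { this.id = id; }
    }

    // Placeholder for the real network call.
    static List<ResponseEntry> callApi(int call) {
        List<ResponseEntry> page = new ArrayList<>();
        ResponseEntry e = new ResponseEntry();
        e.setId("entry-" + call);
        page.add(e);
        return page;
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("avro-append").getOrCreate();
        String folderPath = "/data/network_response/";

        for (int call = 0; call < 10; call++) {
            List<ResponseEntry> results = callApi(call);
            Dataset<Row> records = spark.createDataFrame(results, ResponseEntry.class);

            // First write replaces any old output; later writes use Append.
            SaveMode mode = (call == 0) ? SaveMode.Overwrite : SaveMode.Append;
            records.write().mode(mode).format("com.databricks.spark.avro").save(folderPath);
        }
        spark.stop();
    }
}
```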
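To make the alternative in my last question concrete: instead of one write per call, I could accumulate every response into a single list and convert/write only once at the end. A plain-Java sketch of just the collection step (`fetchPage` is a hypothetical stand-in for one API call):

```java
import java.util.ArrayList;
import java.util.List;

public class CollectThenWrite {

    // Hypothetical stand-in for one API call returning a page of results.
    static List<String> fetchPage(int page) {
        List<String> out = new ArrayList<>();
        out.add("record-" + page);
        return out;
    }

    // Accumulate every API response into one list, so a single
    // createDataFrame + write would suffice afterwards.
    static List<String> collectAll(int calls) {
        List<String> all = new ArrayList<>();
        for (int page = 0; page < calls; page++) {
            all.addAll(fetchPage(page));
        }
        return all;
    }

    public static void main(String[] args) {
        System.out.println(collectAll(10).size()); // prints 10
    }
}
```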
I am using spark java. I am having large data cluster. My application is making more than 10 API calls. Each calls returns a java list. Each list item is of same structure (i.e. same java class) I want to write each list to same avro directory. For example /data/network_response/ What I am trying is after each API call, I convert list to org.apache.spark.sql.Dataset object. Using this snippet: sqlcontext.createdataframe(results, class_of_list_entry) And then write the data to disk Using below spinnet records.write().mode(savemode).format(com.databricks.spark.avro).save(folder_path) Here, savemode is set to overwrite for first API call writing. For subsequent network call, I set savemode to append. I want to effectively write all API responses to same directory in avro format. Question is : Is this efficient way? If I try to append the content of avro, does existing content overwritten each time? Where is the documentation for specifications of what is algorithm spark use to append the content of avro through the way I gave above? What exactly happens when we append the dataset to existing avro folder? Shoudl I collect all API datasets to one union dataset and then write once the avro content rather than writing avro content to directory after each network call? Regards, Rushikesh.