Hello Community,

I have checked this issue on various platforms but could not find a satisfactory answer.

I am using Spark with Java on a large data cluster. My application makes more than 10 API calls, and each call returns a Java list. Every list item has the same structure (i.e. the same Java class). I want to write each list to the same Avro directory, for example `/data/network_response/`.

What I am doing is: after each API call, I convert the list to an `org.apache.spark.sql.Dataset` object using this snippet:

`sqlContext.createDataFrame(results, classOfListEntry)`

and then write the data to disk using this snippet:

`records.write().mode(saveMode).format("com.databricks.spark.avro").save(folderPath)`

Here, `saveMode` is set to `Overwrite` for the first API call's write. For each subsequent call, I set `saveMode` to `Append`. The goal is to write all API responses to the same directory in Avro format.

My questions are:

1. Is this an efficient way to do it?
2. When I append to the Avro directory, is the existing content overwritten each time?
3. Where is the documentation specifying what algorithm Spark uses to append Avro content written this way? What exactly happens when a Dataset is appended to an existing Avro folder?
4. Should I collect all the API Datasets into one union Dataset and write the Avro content once, rather than writing to the directory after each network call?

Regards,
Rushikesh.
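For reference, here is a minimal sketch of the flow described above. `callApi` and the `ResponseEntry` bean are placeholders for my real API call and row class; the directory path and number of calls are just examples:

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class AvroAppendSketch {

    // Placeholder bean for one list entry; the real class has the same shape for every call.
    public static class ResponseEntry {
        private String id;
        public String getId() { return id; }
        public void setId(String id) { this.id = id; }
    }

    // Placeholder for the real network call.
    static List<ResponseEntry> callApi(int call) {
        List<ResponseEntry> page = new ArrayList<>();
        ResponseEntry e = new ResponseEntry();
        e.setId("entry-" + call);
        page.add(e);
        return page;
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("avro-append").getOrCreate();
        String folderPath = "/data/network_response/";

        for (int call = 0; call < 10; call++) {
            List<ResponseEntry> results = callApi(call);
            Dataset<Row> records = spark.createDataFrame(results, ResponseEntry.class);

            // First write replaces any old output; later writes use Append.
            SaveMode mode = (call == 0) ? SaveMode.Overwrite : SaveMode.Append;
            records.write().mode(mode).format("com.databricks.spark.avro").save(folderPath);
        }
        spark.stop();
    }
}
```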
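To make the alternative in my last question concrete: instead of one write per call, I could accumulate every response into a single list and convert/write only once at the end. A plain-Java sketch of just the collection step (`fetchPage` is a hypothetical stand-in for one API call):

```java
import java.util.ArrayList;
import java.util.List;

public class CollectThenWrite {

    // Hypothetical stand-in for one API call returning a page of results.
    static List<String> fetchPage(int page) {
        List<String> out = new ArrayList<>();
        out.add("record-" + page);
        return out;
    }

    // Accumulate every API response into one list, so a single
    // createDataFrame + write would suffice afterwards.
    static List<String> collectAll(int calls) {
        List<String> all = new ArrayList<>();
        for (int page = 0; page < calls; page++) {
            all.addAll(fetchPage(page));
        }
        return all;
    }

    public static void main(String[] args) {
        System.out.println(collectAll(10).size()); // prints 10
    }
}
```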
I am using spark java. I am having large data cluster. My application is making more than 10 API calls. Each calls returns a java list. Each list item is of same structure (i.e. same java class) I want to write each list to same avro directory. For example /data/network_response/ What I am trying is after each API call, I convert list to org.apache.spark.sql.Dataset object. Using this snippet: sqlcontext.createdataframe(results, class_of_list_entry) And then write the data to disk Using below spinnet records.write().mode(savemode).format(com.databricks.spark.avro).save(folder_path) Here, savemode is set to overwrite for first API call writing. For subsequent network call, I set savemode to append. I want to effectively write all API responses to same directory in avro format. Question is : Is this efficient way? If I try to append the content of avro, does existing content overwritten each time? Where is the documentation for specifications of what is algorithm spark use to append the content of avro through the way I gave above? What exactly happens when we append the dataset to existing avro folder? Shoudl I collect all API datasets to one union dataset and then write once the avro content rather than writing avro content to directory after each network call? Regards, Rushikesh.