Re: Naming files while saving a Dataframe

2021-08-12 Thread Eric Beabes
This doesn't work as given there (https://stackoverflow.com/questions/36107581/change-output-filename-prefix-for-dataframe-write), but the answer suggests using the FileOutputFormat class. Will try that. Thanks. Regards.
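For reference, one way a custom FileOutputFormat subclass is commonly used to control output file names is by going through the RDD API with a MultipleTextOutputFormat override. This is only a sketch of that general technique, not necessarily what the linked answer does: the class name, prefix, key/value types, pairRdd and path are assumptions, and it writes text rather than parquet, so it is not a drop-in for the pipeline discussed in this thread.

  import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

  // Hypothetical output format that prepends a job-specific prefix to each part file name.
  class PrefixedOutputFormat extends MultipleTextOutputFormat[Any, Any] {
    override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
      "jobA-" + name   // "name" is the default part-NNNNN file name
  }

  // Requires a key/value (pair) RDD; String keys and values are assumed here.
  pairRdd.saveAsHadoopFile(
    "/data/output/somedir",
    classOf[String],
    classOf[String],
    classOf[PrefixedOutputFormat])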

Re: Naming files while saving a Dataframe

2021-07-18 Thread Jörn Franke
Spark heavily depends on Hadoop for writing files. You can try to set the Hadoop property mapreduce.output.basename via the SparkContext Hadoop configuration: https://spark.apache.org/docs/latest/api/java/org/apache/spark/SparkContext.html#hadoopConfiguration--
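A minimal sketch of this suggestion, assuming a SparkSession named spark, an existing DataFrame df and a hypothetical output path (as the follow-up at the top of this thread notes, it did not work for the DataFrame writer in Eric's tests):

  // Set the Hadoop output base name on the shared Hadoop configuration before writing.
  spark.sparkContext.hadoopConfiguration
    .set("mapreduce.output.basename", "jobA")

  df.write
    .mode("overwrite")
    .format("parquet")
    .save("/data/output/somedir")   // hypothetical path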

Re: Naming files while saving a Dataframe

2021-07-17 Thread Eric Beabes
Mich - You're suggesting changing the "Path". The problem is that we've an EXTERNAL table created on top of this path, so the "Path" CANNOT change. If we could change it, this would be easy to solve. My question is about changing the "Filename". As Ayan pointed out, Spark doesn't seem to allow that.
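To illustrate the constraint, a hypothetical external table definition of the kind described here (database, table and column names are made up, and Hive support is assumed to be enabled); the table's LOCATION is pinned to the directory the jobs write into, so only the file names inside it could vary:

  // Hypothetical external table bound to the output directory.
  spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS mydb.events (id STRING, value DOUBLE)
    PARTITIONED BY (day STRING)
    STORED AS PARQUET
    LOCATION '/data/output/somedir'
  """)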

Re: Naming files while saving a Dataframe

2021-07-17 Thread Mich Talebzadeh
Using this:

  df.write.mode("overwrite").format("parquet").saveAsTable("test.ABCD")

That will create a parquet table in the database test, which is essentially a Hive partition in the format /user/hive/warehouse/test.db/abcd/00_0

Re: Naming files while saving a Dataframe

2021-07-17 Thread ayan guha
Hi Eric - yes, that may be the best way to resolve this. I have not seen any specific way to define the names of the actual files written by Spark. Finally, make sure you optimize the number of files written.
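On that last point, the number of output files is typically controlled by coalescing or repartitioning before the write; a minimal sketch, with the target count and path as assumptions:

  val numFiles = 8   // hypothetical target file count
  df.coalesce(numFiles)
    .write
    .mode("overwrite")
    .format("parquet")
    .save("/data/output/somedir")   // hypothetical path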

Re: Naming files while saving a Dataframe

2021-07-17 Thread Eric Beabes
I am not sure if you've understood the question. Here's how we're saving the DataFrame:

  df
    .coalesce(numFiles)
    .write
    .partitionBy(partitionDate)
    .mode("overwrite")
    .format("parquet")
    .save(someDirectory)

Now where would I add a 'prefix' in this one?

Re: Naming files while saving a Dataframe

2021-07-17 Thread Mich Talebzadeh
Jobs have names in Spark. You can prefix it to the file name when writing to the directory, I guess:

  val sparkConf = new SparkConf().setAppName(sparkAppName)
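A rough sketch of this idea, folding the app name into the output location; the app name and path are hypothetical, an existing DataFrame df is assumed, and (as Eric's reply earlier in the thread points out) this really changes the directory rather than the individual file names:

  import org.apache.spark.SparkConf
  import org.apache.spark.sql.SparkSession

  val sparkAppName = "hourly-ingest-jobA"   // hypothetical job name
  val sparkConf = new SparkConf().setAppName(sparkAppName)
  val spark = SparkSession.builder().config(sparkConf).getOrCreate()

  // Fold the app name into the output location so each job's files are identifiable.
  df.write
    .mode("overwrite")
    .format("parquet")
    .save(s"/data/output/somedir/${spark.sparkContext.appName}")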

Re: Naming files while saving a Dataframe

2021-07-17 Thread Eric Beabes
The reason we've two jobs writing to the same directory is that the data is partitioned by 'day' (yyyyMMdd) but the job runs hourly. Maybe the only way to do this is to create an hourly partition (/yyyyMMdd/hh). Is that the only way to solve this?
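A sketch of that hourly-partition alternative, assuming the DataFrame carries (or can derive) day and hour columns; the column names and path are assumptions:

  df.write
    .partitionBy("day", "hour")     // yields .../day=20210716/hour=05/part-... directories
    .mode("overwrite")
    .format("parquet")
    .save("/data/output/somedir")   // hypothetical path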

Re: Naming files while saving a Dataframe

2021-07-16 Thread ayan guha
IMHO this is a bad idea, especially in failure scenarios. How about creating a subfolder for each of the jobs?
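A minimal sketch of the subfolder-per-job suggestion, with the base path and job name as assumptions; as noted further up the thread, an external table is defined on the base directory, so changing the layout underneath it has its own implications:

  val baseDir = "/data/output/somedir"   // hypothetical base path
  val jobName = "jobA"                   // hypothetical job identifier

  df.write
    .mode("overwrite")
    .format("parquet")
    .save(s"$baseDir/$jobName")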

Naming files while saving a Dataframe

2021-07-16 Thread Eric Beabes
We've two (or more) jobs that write data into the same directory via a Dataframe.save method. We need to be able to figure out which job wrote which file, perhaps by providing a 'prefix' to the file names. I was wondering if there's any 'option' that allows us to do this. Googling didn't come up with anything.