Re: Naming files while saving a Dataframe

Mich Talebzadeh Sat, 17 Jul 2021 13:58:36 -0700

Using this

df.write.mode("overwrite").format("parquet").saveAsTable("test.ABCD")


That will create a Parquet table in the database test, which is essentially
a Hive table directory in the format

/user/hive/warehouse/test.db/abcd/000000_0
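
For the plain save(...) case asked about below: as far as I know,
DataFrameWriter has no option for a file-name prefix, so one workaround is to
rename the part files once the write has finished. A minimal sketch in Python,
assuming the output directory sits on a local or mounted filesystem (on HDFS or
S3 the renames would have to go through the Hadoop FileSystem API instead). The
function name and the "<prefix>_<name>" format are hypothetical, not something
Spark provides:

```python
from pathlib import Path
from typing import List


def add_prefix_to_part_files(output_dir: str, prefix: str) -> List[str]:
    """Rename each 'part-*' file under output_dir to '<prefix>_<original name>'.

    Hypothetical helper: it only touches files whose names start with 'part-',
    so markers such as _SUCCESS and files that were already prefixed are left
    alone, which makes it safe to run again.
    """
    renamed = []
    # Materialise the listing before renaming so the walk is not disturbed.
    for path in sorted(Path(output_dir).rglob("part-*")):
        if not path.is_file():
            continue
        target = path.with_name(f"{prefix}_{path.name}")
        path.rename(target)
        renamed.append(str(target))
    return renamed
```

Each job would call something like add_prefix_to_part_files(someDirectory,
"jobA") right after its own write completes. Keep the prefix from starting with
"_" or ".", since Spark treats such files as hidden when reading the directory
back. This does not help if two jobs write to the same directory at the same
time, and mode("overwrite") from one job can still remove the other job's files.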


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Sat, 17 Jul 2021 at 20:45, Eric Beabes <mailinglist...@gmail.com> wrote:

> I am not sure if you've understood the question. Here's how we're saving
> the DataFrame:
>
> df
>   .coalesce(numFiles)
>   .write
>   .partitionBy(partitionDate)
>   .mode("overwrite")
>   .format("parquet")
>   .save(someDirectory)
>
>
> Now where would I add a 'prefix' in this one?
>
>
> On Sat, Jul 17, 2021 at 10:54 AM Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Try this and see if it works:
>>
>> fullyQualifiedTableName = appName+'_'+tableName
>>
>> On Sat, 17 Jul 2021 at 18:02, Eric Beabes <mailinglist...@gmail.com>
>> wrote:
>>
>>> I don't think Spark allows adding a 'prefix' to the file name, does it?
>>> If it does, please tell me how. Thanks.
>>>
>>> On Sat, Jul 17, 2021 at 9:47 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> Jobs have names in Spark. You can prefix the job name to the file name
>>>> when writing to the directory, I guess.
>>>>
>>>>  import org.apache.spark.SparkConf
>>>>  val sparkConf = new SparkConf().setAppName(sparkAppName)
>>>>
>>>> On Sat, 17 Jul 2021 at 17:40, Eric Beabes <mailinglist...@gmail.com>
>>>> wrote:
>>>>
>>>>> The reason we have two jobs writing to the same directory is that the
>>>>> data is partitioned by 'day' (yyyymmdd) but the job runs hourly. Maybe the
>>>>> only way to do this is to create an hourly partition (/yyyymmdd/hh). Is
>>>>> that the only way to solve this?
>>>>>
>>>>> On Fri, Jul 16, 2021 at 5:45 PM ayan guha <guha.a...@gmail.com> wrote:
>>>>>
>>>>>> IMHO this is a bad idea, especially in failure scenarios.
>>>>>>
>>>>>> How about creating a separate subfolder for each job?
>>>>>>
>>>>>> On Sat, 17 Jul 2021 at 9:11 am, Eric Beabes <mailinglist...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> We have two (or more) jobs that write data into the same directory via
>>>>>>> the DataFrame save method. We need to be able to figure out which job
>>>>>>> wrote which file, maybe by adding a 'prefix' to the file names. I was
>>>>>>> wondering if there's any 'option' that allows us to do this. Googling
>>>>>>> didn't come up with any solution, so I thought of asking the Spark
>>>>>>> experts on this mailing list.
>>>>>>>
>>>>>>> Thanks in advance.
>>>>>>>
>>>>>> --
>>>>>> Best Regards,
>>>>>> Ayan Guha
>>>>>>
>>>>>
