By default Spark writes one file per partition. Spark SQL defaults to 200
partitions after a shuffle (spark.sql.shuffle.partitions). If you want to
reduce the number of files written out, repartition your DataFrame with
repartition() and the desired number of partitions:

originalDF.repartition(10).write.avro("masterNew.avro")
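
If you are only reducing the partition count, coalesce does the same job
without the full shuffle that repartition triggers, at the cost of writing
with fewer parallel tasks:

originalDF.coalesce(10).write.avro("masterNew.avro")

For your append use case, note that unionAll does not shuffle, so the
output inherits the partitions of both inputs; reduce them just before the
write. A minimal sketch (10 is only an illustration, tune it to your data
volume):

originalDF.unionAll(stageDF).coalesce(10).write.avro("masterNew.avro")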

Deenar



On 7 December 2015 at 21:21, Ruslan Dautkhanov <dautkha...@gmail.com> wrote:

> How many reducers you had that created those avro files?
> Each reducer very likely creates its own avro part- file.
>
> We normally use Parquet, but it should be the same for Avro, so this
> might be relevant:
>
> http://stackoverflow.com/questions/34026764/how-to-limit-parquet-file-dimension-for-a-parquet-table-in-hive/34059289#34059289
>
>
>
>
> --
> Ruslan Dautkhanov
>
> On Mon, Dec 7, 2015 at 11:27 AM, Test One <t...@cksworks.com> wrote:
>
>> I'm using spark-avro with SparkSQL to process and output avro files. My
>> data has the following schema:
>>
>> root
>>  |-- memberUuid: string (nullable = true)
>>  |-- communityUuid: string (nullable = true)
>>  |-- email: string (nullable = true)
>>  |-- firstName: string (nullable = true)
>>  |-- lastName: string (nullable = true)
>>  |-- username: string (nullable = true)
>>  |-- profiles: map (nullable = true)
>>  |    |-- key: string
>>  |    |-- value: string (valueContainsNull = true)
>>
>>
>> When I write the file output as such with:
>> originalDF.write.avro("masterNew.avro")
>>
>> The output location is a folder named masterNew.avro containing many,
>> many files like these:
>> -rw-r--r--   1 kcsham  access_bpf     8 Dec  2 11:37 ._SUCCESS.crc
>> -rw-r--r--   1 kcsham  access_bpf    44 Dec  2 11:37
>> .part-r-00000-0c834f3e-9c15-4470-ad35-02f061826263.avro.crc
>> -rw-r--r--   1 kcsham  access_bpf    44 Dec  2 11:37
>> .part-r-00001-0c834f3e-9c15-4470-ad35-02f061826263.avro.crc
>> -rw-r--r--   1 kcsham  access_bpf    44 Dec  2 11:37
>> .part-r-00002-0c834f3e-9c15-4470-ad35-02f061826263.avro.crc
>> -rw-r--r--   1 kcsham  access_bpf     0 Dec  2 11:37 _SUCCESS
>> -rw-r--r--   1 kcsham  access_bpf  4261 Dec  2 11:37
>> part-r-00000-0c834f3e-9c15-4470-ad35-02f061826263.avro
>> -rw-r--r--   1 kcsham  access_bpf  4261 Dec  2 11:37
>> part-r-00001-0c834f3e-9c15-4470-ad35-02f061826263.avro
>> -rw-r--r--   1 kcsham  access_bpf  4261 Dec  2 11:37
>> part-r-00002-0c834f3e-9c15-4470-ad35-02f061826263.avro
>>
>>
>> For ~100000 records, there are ~28000 files in that folder. When I
>> simply want to copy the same dataset to a new location as an exercise
>> from a local master, it takes a very long time and produces errors like
>> these as well.
>>
>> 22:01:44.247 [Executor task launch worker-21] WARN
>>  org.apache.spark.storage.MemoryStore - Not enough space to cache
>> rdd_112058_10705 in memory! (computed 496.0 B so far)
>> 22:01:44.247 [Executor task launch worker-21] WARN
>>  org.apache.spark.CacheManager - Persisting partition rdd_112058_10705 to
>> disk instead.
>> [Stage 0:===================>                               (10706 + 1) /
>> 28014]22:01:44.574 [Executor task launch worker-21] WARN
>>  org.apache.spark.storage.MemoryStore - Failed to reserve initial memory
>> threshold of 1024.0 KB for computing block rdd_112058_10706 in memory.
>>
>>
>> I'm attributing this to there being way too many files to manipulate.
>> The questions:
>>
>> 1. Is this the expected format for avro files written by spark-avro,
>> with each 'part-' file no more than about 4k?
>> 2. My use case is to append new records to the existing dataset using:
>> originalDF.unionAll(stageDF).write.avro(masterNew)
>>     Are there any sqlconf or sparkconf settings I should set to allow
>> this to work?
>>
>>
>> Thanks,
>> kc
>>
>>
>>
>
