By default Spark writes one file per partition, and Spark SQL defaults to 200 shuffle partitions (spark.sql.shuffle.partitions). If you want to reduce the number of files written out, repartition your DataFrame with repartition(n), giving it the desired number of partitions:
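A hedged end-to-end sketch of the advice above, written against the Spark 1.x / spark-avro 2.x APIs in use in this thread; the paths and the target count of 10 are illustrative, not from the original:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import com.databricks.spark.avro._  // adds .avro(...) to DataFrame readers and writers

val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("compact"))
val sqlContext = new SQLContext(sc)

// Illustrative input path
val originalDF = sqlContext.read.avro("master.avro")

// repartition(10) does a full shuffle and balances the 10 output files evenly;
// coalesce(10) merges existing partitions without a shuffle, which is cheaper
// but can leave the files unevenly sized.
originalDF.coalesce(10).write.avro("masterCompact.avro")
```

Either call works before the write; coalesce is usually preferred when you are only shrinking the partition count.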
originalDF.repartition(10).write.avro("masterNew.avro")

Deenar

On 7 December 2015 at 21:21, Ruslan Dautkhanov <dautkha...@gmail.com> wrote:

> How many reducers did you have that created those avro files?
> Each reducer very likely creates its own avro part- file.
>
> We normally use Parquet, but it should be the same for Avro, so this might
> be relevant:
> http://stackoverflow.com/questions/34026764/how-to-limit-parquet-file-dimension-for-a-parquet-table-in-hive/34059289#34059289
>
> --
> Ruslan Dautkhanov
>
> On Mon, Dec 7, 2015 at 11:27 AM, Test One <t...@cksworks.com> wrote:
>
>> I'm using spark-avro with Spark SQL to process and output avro files. My
>> data has the following schema:
>>
>> root
>>  |-- memberUuid: string (nullable = true)
>>  |-- communityUuid: string (nullable = true)
>>  |-- email: string (nullable = true)
>>  |-- firstName: string (nullable = true)
>>  |-- lastName: string (nullable = true)
>>  |-- username: string (nullable = true)
>>  |-- profiles: map (nullable = true)
>>  |    |-- key: string
>>  |    |-- value: string (valueContainsNull = true)
>>
>> When I write the output with:
>>
>> originalDF.write.avro("masterNew.avro")
>>
>> the output location is a folder named masterNew.avro containing many
>> files like these:
>>
>> -rw-r--r--  1 kcsham  access_bpf     8 Dec  2 11:37 ._SUCCESS.crc
>> -rw-r--r--  1 kcsham  access_bpf    44 Dec  2 11:37 .part-r-00000-0c834f3e-9c15-4470-ad35-02f061826263.avro.crc
>> -rw-r--r--  1 kcsham  access_bpf    44 Dec  2 11:37 .part-r-00001-0c834f3e-9c15-4470-ad35-02f061826263.avro.crc
>> -rw-r--r--  1 kcsham  access_bpf    44 Dec  2 11:37 .part-r-00002-0c834f3e-9c15-4470-ad35-02f061826263.avro.crc
>> -rw-r--r--  1 kcsham  access_bpf     0 Dec  2 11:37 _SUCCESS
>> -rw-r--r--  1 kcsham  access_bpf  4261 Dec  2 11:37 part-r-00000-0c834f3e-9c15-4470-ad35-02f061826263.avro
>> -rw-r--r--  1 kcsham  access_bpf  4261 Dec  2 11:37 part-r-00001-0c834f3e-9c15-4470-ad35-02f061826263.avro
>> -rw-r--r--  1 kcsham  access_bpf  4261 Dec  2 11:37 part-r-00002-0c834f3e-9c15-4470-ad35-02f061826263.avro
>>
>> With ~100,000 records, there are ~28,000 files in that folder. When I
>> simply want to copy the same dataset to a new location as an exercise
>> from a local master, it takes a very long time and produces errors like
>> these as well:
>>
>> 22:01:44.247 [Executor task launch worker-21] WARN  org.apache.spark.storage.MemoryStore - Not enough space to cache rdd_112058_10705 in memory! (computed 496.0 B so far)
>> 22:01:44.247 [Executor task launch worker-21] WARN  org.apache.spark.CacheManager - Persisting partition rdd_112058_10705 to disk instead.
>> [Stage 0:===================> (10706 + 1) / 28014]
>> 22:01:44.574 [Executor task launch worker-21] WARN  org.apache.spark.storage.MemoryStore - Failed to reserve initial memory threshold of 1024.0 KB for computing block rdd_112058_10706 in memory.
>>
>> I attribute this to there being way too many files to manipulate. The
>> questions:
>>
>> 1. Is this the expected format of the avro output written by spark-avro,
>> with each 'part-' file no more than ~4 KB?
>> 2. My use case is to append new records to the existing dataset using:
>> originalDF.unionAll(stageDF).write.avro(masterNew)
>> Any SQLConf or SparkConf settings I should set to allow this to work?
>>
>> Thanks,
>> kc
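For the append use case in question 2, one hedged sketch (Spark 1.x / spark-avro 2.x era APIs; paths and the target partition count are illustrative): union the two DataFrames, compact, and write to a fresh directory, then swap it in for the old one, since writing over a directory that Spark is still reading from is unsafe.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import com.databricks.spark.avro._  // adds .avro(...) to DataFrame readers and writers

val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("append"))
val sqlContext = new SQLContext(sc)

// Illustrative input paths
val originalDF = sqlContext.read.avro("master.avro")
val stageDF    = sqlContext.read.avro("stage.avro")

// unionAll keeps every partition from both inputs, so without compaction the
// file count only grows with each append. Coalesce before writing, write to a
// new directory, and replace the old directory with it afterwards.
originalDF.unionAll(stageDF)
  .coalesce(10)
  .write
  .avro("masterNew.avro")
```

The per-append coalesce keeps the part-file count bounded instead of letting it accumulate across appends, which is what produces tens of thousands of ~4 KB files.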