How many reducers did you have creating those avro files? Each reducer very likely writes its own avro part- file.
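If that's the cause, coalescing to a smaller number of partitions before the write should cut the file count down. A rough, untested sketch (the target count of 16 is just a placeholder; pick one based on your data volume):

    // Assumes the databricks spark-avro package, as in your snippet.
    // coalesce() merges existing partitions without a full shuffle,
    // so fewer part- files get written out.
    import com.databricks.spark.avro._

    originalDF
      .coalesce(16)   // placeholder partition count
      .write
      .avro("masterNew.avro")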
We normally use Parquet, but it should be the same for Avro, so this might be relevant:
http://stackoverflow.com/questions/34026764/how-to-limit-parquet-file-dimension-for-a-parquet-table-in-hive/34059289#34059289

--
Ruslan Dautkhanov

On Mon, Dec 7, 2015 at 11:27 AM, Test One <t...@cksworks.com> wrote:

> I'm using spark-avro with SparkSQL to process and output avro files. My
> data has the following schema:
>
> root
>  |-- memberUuid: string (nullable = true)
>  |-- communityUuid: string (nullable = true)
>  |-- email: string (nullable = true)
>  |-- firstName: string (nullable = true)
>  |-- lastName: string (nullable = true)
>  |-- username: string (nullable = true)
>  |-- profiles: map (nullable = true)
>  |    |-- key: string
>  |    |-- value: string (valueContainsNull = true)
>
> When I write the output with:
>
>     originalDF.write.avro("masterNew.avro")
>
> the output location is a folder named masterNew.avro containing many,
> many files like these:
>
> -rw-r--r--  1 kcsham  access_bpf     8 Dec  2 11:37 ._SUCCESS.crc
> -rw-r--r--  1 kcsham  access_bpf    44 Dec  2 11:37 .part-r-00000-0c834f3e-9c15-4470-ad35-02f061826263.avro.crc
> -rw-r--r--  1 kcsham  access_bpf    44 Dec  2 11:37 .part-r-00001-0c834f3e-9c15-4470-ad35-02f061826263.avro.crc
> -rw-r--r--  1 kcsham  access_bpf    44 Dec  2 11:37 .part-r-00002-0c834f3e-9c15-4470-ad35-02f061826263.avro.crc
> -rw-r--r--  1 kcsham  access_bpf     0 Dec  2 11:37 _SUCCESS
> -rw-r--r--  1 kcsham  access_bpf  4261 Dec  2 11:37 part-r-00000-0c834f3e-9c15-4470-ad35-02f061826263.avro
> -rw-r--r--  1 kcsham  access_bpf  4261 Dec  2 11:37 part-r-00001-0c834f3e-9c15-4470-ad35-02f061826263.avro
> -rw-r--r--  1 kcsham  access_bpf  4261 Dec  2 11:37 part-r-00002-0c834f3e-9c15-4470-ad35-02f061826263.avro
>
> Where there are ~100000 records, there are ~28000 files in that folder.
> When I simply want to copy the same dataset to a new location as an
> exercise from a local master, it takes a very long time and produces
> errors like these as well:
>
> 22:01:44.247 [Executor task launch worker-21] WARN
> org.apache.spark.storage.MemoryStore - Not enough space to cache
> rdd_112058_10705 in memory! (computed 496.0 B so far)
> 22:01:44.247 [Executor task launch worker-21] WARN
> org.apache.spark.CacheManager - Persisting partition rdd_112058_10705 to
> disk instead.
> [Stage 0:===================> (10706 + 1) / 28014]
> 22:01:44.574 [Executor task launch worker-21] WARN
> org.apache.spark.storage.MemoryStore - Failed to reserve initial memory
> threshold of 1024.0 KB for computing block rdd_112058_10706 in memory.
>
> I'm attributing this to there being way too many files to manipulate.
> The questions:
>
> 1. Is this the expected format of the avro output written by spark-avro,
> with each part- file no more than ~4 KB?
> 2. My use case is to append new records to the existing dataset using:
>
>     originalDF.unionAll(stageDF).write.avro(masterNew)
>
> Any sqlconf/sparkconf that I should set to allow this to work?
>
> Thanks,
> kc
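Re: question 2 above -- the same coalesce trick should help there, but as far as I know Spark cannot safely write into the same directory it is reading from within one job, so write the union to a fresh path and swap it in afterwards. Another untested sketch (masterCompacted.avro is a hypothetical new location, not your existing master path):

    import com.databricks.spark.avro._

    // Union the staged records with the master, compact the partitions,
    // and write to a NEW directory -- not the one originalDF reads from.
    originalDF
      .unionAll(stageDF)   // DataFrame.unionAll in Spark 1.x
      .coalesce(16)        // placeholder partition count
      .write
      .avro("masterCompacted.avro")  // hypothetical new path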