How many reducers did you have creating those avro files? Each reducer very likely writes its own avro part- file.
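If that's the cause, coalescing to a smaller number of partitions before the write should cut the file count down. A rough, untested sketch (the target count of 16 is just a placeholder; pick one based on your data volume):

    // Assumes the databricks spark-avro package, as in your snippet.
    // coalesce() merges existing partitions without a full shuffle,
    // so fewer part- files get written out.
    import com.databricks.spark.avro._

    originalDF
      .coalesce(16)   // placeholder partition count
      .write
      .avro("masterNew.avro")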
We normally use Parquet, but it should be the same for Avro, so this might be relevant:
http://stackoverflow.com/questions/34026764/how-to-limit-parquet-file-dimension-for-a-parquet-table-in-hive/34059289#34059289

--
Ruslan Dautkhanov

On Mon, Dec 7, 2015 at 11:27 AM, Test One <t...@cksworks.com> wrote:

> I'm using spark-avro with SparkSQL to process and output avro files. My
> data has the following schema:
>
> root
>  |-- memberUuid: string (nullable = true)
>  |-- communityUuid: string (nullable = true)
>  |-- email: string (nullable = true)
>  |-- firstName: string (nullable = true)
>  |-- lastName: string (nullable = true)
>  |-- username: string (nullable = true)
>  |-- profiles: map (nullable = true)
>  |    |-- key: string
>  |    |-- value: string (valueContainsNull = true)
>
> When I write the output with:
>
>     originalDF.write.avro("masterNew.avro")
>
> the output location is a folder named masterNew.avro containing many,
> many files like these:
>
> -rw-r--r--  1 kcsham  access_bpf     8 Dec  2 11:37 ._SUCCESS.crc
> -rw-r--r--  1 kcsham  access_bpf    44 Dec  2 11:37 .part-r-00000-0c834f3e-9c15-4470-ad35-02f061826263.avro.crc
> -rw-r--r--  1 kcsham  access_bpf    44 Dec  2 11:37 .part-r-00001-0c834f3e-9c15-4470-ad35-02f061826263.avro.crc
> -rw-r--r--  1 kcsham  access_bpf    44 Dec  2 11:37 .part-r-00002-0c834f3e-9c15-4470-ad35-02f061826263.avro.crc
> -rw-r--r--  1 kcsham  access_bpf     0 Dec  2 11:37 _SUCCESS
> -rw-r--r--  1 kcsham  access_bpf  4261 Dec  2 11:37 part-r-00000-0c834f3e-9c15-4470-ad35-02f061826263.avro
> -rw-r--r--  1 kcsham  access_bpf  4261 Dec  2 11:37 part-r-00001-0c834f3e-9c15-4470-ad35-02f061826263.avro
> -rw-r--r--  1 kcsham  access_bpf  4261 Dec  2 11:37 part-r-00002-0c834f3e-9c15-4470-ad35-02f061826263.avro
>
> Where there are ~100000 records, there are ~28000 files in that folder.
> When I simply want to copy the same dataset to a new location as an
> exercise from a local master, it takes a very long time and produces
> errors like these as well:
>
> 22:01:44.247 [Executor task launch worker-21] WARN
> org.apache.spark.storage.MemoryStore - Not enough space to cache
> rdd_112058_10705 in memory! (computed 496.0 B so far)
> 22:01:44.247 [Executor task launch worker-21] WARN
> org.apache.spark.CacheManager - Persisting partition rdd_112058_10705 to
> disk instead.
> [Stage 0:===================> (10706 + 1) / 28014]
> 22:01:44.574 [Executor task launch worker-21] WARN
> org.apache.spark.storage.MemoryStore - Failed to reserve initial memory
> threshold of 1024.0 KB for computing block rdd_112058_10706 in memory.
>
> I'm attributing this to there being way too many files to manipulate.
> The questions:
>
> 1. Is this the expected format of the avro output written by spark-avro,
> with each part- file no more than ~4 KB?
> 2. My use case is to append new records to the existing dataset using:
>
>     originalDF.unionAll(stageDF).write.avro(masterNew)
>
> Any sqlconf/sparkconf that I should set to allow this to work?
>
> Thanks,
> kc
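Re: question 2 above -- the same coalesce trick should help there, but as far as I know Spark cannot safely write into the same directory it is reading from within one job, so write the union to a fresh path and swap it in afterwards. Another untested sketch (masterCompacted.avro is a hypothetical new location, not your existing master path):

    import com.databricks.spark.avro._

    // Union the staged records with the master, compact the partitions,
    // and write to a NEW directory -- not the one originalDF reads from.
    originalDF
      .unionAll(stageDF)   // DataFrame.unionAll in Spark 1.x
      .coalesce(16)        // placeholder partition count
      .write
      .avro("masterCompacted.avro")  // hypothetical new path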