I'm using spark-avro with SparkSQL to process and output avro files. My
data has the following schema:
root
 |-- memberUuid: string (nullable = true)
 |-- communityUuid: string (nullable = true)
 |-- email: string (nullable = true)
 |-- firstName: string (nullable = true)
 |-- lastName: string (nullable = true)
 |-- username: string (nullable = true)
 |-- profiles: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)
When I write the output with:
originalDF.write.avro("masterNew.avro")
the output location is a folder named masterNew.avro containing many files like these:
-rw-r--r--  1 kcsham  access_bpf     8 Dec  2 11:37 ._SUCCESS.crc
-rw-r--r--  1 kcsham  access_bpf    44 Dec  2 11:37 .part-r-0-0c834f3e-9c15-4470-ad35-02f061826263.avro.crc
-rw-r--r--  1 kcsham  access_bpf    44 Dec  2 11:37 .part-r-1-0c834f3e-9c15-4470-ad35-02f061826263.avro.crc
-rw-r--r--  1 kcsham  access_bpf    44 Dec  2 11:37 .part-r-2-0c834f3e-9c15-4470-ad35-02f061826263.avro.crc
-rw-r--r--  1 kcsham  access_bpf     0 Dec  2 11:37 _SUCCESS
-rw-r--r--  1 kcsham  access_bpf  4261 Dec  2 11:37 part-r-0-0c834f3e-9c15-4470-ad35-02f061826263.avro
-rw-r--r--  1 kcsham  access_bpf  4261 Dec  2 11:37 part-r-1-0c834f3e-9c15-4470-ad35-02f061826263.avro
-rw-r--r--  1 kcsham  access_bpf  4261 Dec  2 11:37 part-r-2-0c834f3e-9c15-4470-ad35-02f061826263.avro
Although the dataset has only ~10 records, there are ~28,000 files in that folder. When I simply try to copy the same dataset to a new location, as an exercise from a local master, it takes a very long time and also produces errors like these:
22:01:44.247 [Executor task launch worker-21] WARN org.apache.spark.storage.MemoryStore - Not enough space to cache rdd_112058_10705 in memory! (computed 496.0 B so far)
22:01:44.247 [Executor task launch worker-21] WARN org.apache.spark.CacheManager - Persisting partition rdd_112058_10705 to disk instead.
[Stage 0:===> (10706 + 1) / 28014]
22:01:44.574 [Executor task launch worker-21] WARN org.apache.spark.storage.MemoryStore - Failed to reserve initial memory threshold of 1024.0 KB for computing block rdd_112058_10706 in memory.
I attribute the slowness to there being far too many files to manipulate. My questions:
1. Is this the expected format of Avro output written by spark-avro, with each 'part-' file no larger than ~4 KB?
2. My use case is to append new records to the existing dataset using:
originalDF.unionAll(stageDF).write.avro(masterNew)
Are there any SQLConf or SparkConf settings I should set to make this work?
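In case it clarifies what I'm after: my understanding is that each 'part-' file corresponds to one partition of the DataFrame, so coalescing to fewer partitions before the write should shrink the file count. Below is a sketch of the direction I'm considering; the helper name `targetPartitions` and the 64 MB target size are my own made-up assumptions, not anything from spark-avro:

```scala
// Hypothetical helper (my own naming): pick a partition count so each
// output part file lands near `targetBytes` instead of a ~4 KB fragment.
def targetPartitions(totalBytes: Long,
                     targetBytes: Long = 64L * 1024 * 1024): Int =
  math.max(1, math.ceil(totalBytes.toDouble / targetBytes).toInt)

// Intended usage with spark-avro (assumes import com.databricks.spark.avro._):
//   val est = 10L * 4261  // rough size estimate from the part files above
//   originalDF.unionAll(stageDF)
//     .coalesce(targetPartitions(est))
//     .write.avro("masterNew.avro")
```

With a dataset this small, `targetPartitions` would return 1, i.e. a single part file, but I don't know whether coalescing is the recommended fix or whether a configuration setting exists for it.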
Thanks,
kc