It's something like the schema shown below (with several additional levels and
sublevels):

root
 |-- sentAt: long (nullable = true)
 |-- sharing: string (nullable = true)
 |-- receivedAt: long (nullable = true)
 |-- ip: string (nullable = true)
 |-- story: struct (nullable = true)
 |    |-- super: string (nullable = true)
 |    |-- lang: string (nullable = true)
 |    |-- setting: string (nullable = true)
 |    |-- myapp: struct (nullable = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- ver: string (nullable = true)
 |    |    |-- build: string (nullable = true)
 |    |-- comp: struct (nullable = true)
 |    |    |-- notes: string (nullable = true)
 |    |    |-- source: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- content: string (nullable = true)
 |    |    |-- sub: string (nullable = true)
 |    |-- loc: struct (nullable = true)
 |    |    |-- city: string (nullable = true)
 |    |    |-- country: string (nullable = true)
 |    |    |-- lat: double (nullable = true)
 |    |    |-- long: double (nullable = true)
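
For reference, here is a minimal sketch of how this output can be produced in
the spark-shell (assuming Spark 2.0.1 with the spark-avro 3.0.1 package; the
path is just a placeholder):

  // read the Avro file through the databricks spark-avro data source
  val df = spark.read.format("com.databricks.spark.avro").load("/path/to/data.avro")

  // print the schema as inferred by Spark (the output shown above)
  df.printSchema()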

On Wed, Nov 16, 2016 at 2:08 PM, Takeshi Yamamuro <linguin....@gmail.com>
wrote:

> Hi,
>
> What's the schema as interpreted by Spark?
> The compression logic used by Spark's caching depends on the column types.
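>
> (For what it's worth, a quick sketch of how to check the in-memory columnar
> cache settings from the spark-shell; both config keys exist in Spark 2.0, and
> compression of the cached columnar batches is on by default:)
>
>   // whether cached columnar batches are compressed (default: true)
>   spark.conf.get("spark.sql.inMemoryColumnarStorage.compressed")
>
>   // how many rows go into each cached columnar batch (default: 10000)
>   spark.conf.get("spark.sql.inMemoryColumnarStorage.batchSize")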
>
> // maropu
>
>
> On Wed, Nov 16, 2016 at 5:26 PM, Prithish <prith...@gmail.com> wrote:
>
>> Thanks for your response.
>>
>> I did some more tests, and I am seeing that when my Avro data has a flatter
>> structure, the cache memory use is close to that of the CSV. But when I use
>> a few levels of nesting, the cache memory usage blows up. This is really
>> critical for planning the cluster we will be using. To avoid a larger
>> cluster, it looks like we will have to keep the structure as flat as
>> possible.
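>>
>> As a rough sketch (not the actual code), flattening a few nested fields
>> before caching could look like this in the spark-shell; the field names are
>> taken from the schema above, and the column aliases are just an assumption:
>>
>>   import org.apache.spark.sql.functions.col
>>
>>   // assuming the spark-shell with spark-avro 3.0.1; the path is a placeholder
>>   val df = spark.read.format("com.databricks.spark.avro").load("/path/to/data.avro")
>>
>>   // pull selected fields of the nested "story" struct up to top-level columns
>>   val flatDf = df.select(
>>     col("sentAt"), col("sharing"), col("receivedAt"), col("ip"),
>>     col("story.lang").as("story_lang"),
>>     col("story.myapp.id").as("story_myapp_id"),
>>     col("story.comp.name").as("story_comp_name"),
>>     col("story.loc.lat").as("story_loc_lat"),
>>     col("story.loc.long").as("story_loc_long"))
>>
>>   // cache the flattened DataFrame instead of the nested one
>>   flatDf.cache()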
>>
>> On Wed, Nov 16, 2016 at 1:18 PM, Shreya Agarwal <shrey...@microsoft.com>
>> wrote:
>>
>>> (Adding user@spark back to the discussion)
>>>
>>>
>>>
>>> Well, the CSV vs. Avro difference might be simpler to explain. CSV has a
>>> lot of scope for compression. On the other hand, Avro and Parquet are
>>> already compressed and just store extra schema info, afaik. Avro and
>>> Parquet both make your data smaller: Parquet through compressed columnar
>>> storage, and Avro through its binary data format.
>>>
>>>
>>>
>>> Next, about the 62kb becoming 1224kb: I actually do not see such a massive
>>> blow-up. The Avro file you shared is 28kb on my system, and it becomes
>>> 53.7kb when cached in memory deserialized and 52.9kb when cached in memory
>>> serialized. I get the exact same numbers with Parquet as well. This is
>>> expected behavior, if I am not wrong.
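>>>
>>> (For reference, a minimal sketch of how those two numbers can be produced
>>> in the spark-shell, assuming a DataFrame named df loaded from the shared
>>> Avro file:)
>>>
>>>   import org.apache.spark.storage.StorageLevel
>>>
>>>   // cache deserialized; the Storage tab shows this as "Memory Deserialized"
>>>   df.persist(StorageLevel.MEMORY_ONLY).count()
>>>
>>>   // drop that cache, then cache serialized for comparison
>>>   df.unpersist(blocking = true)
>>>   df.persist(StorageLevel.MEMORY_ONLY_SER).count()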
>>>
>>>
>>>
>>> In fact, now that I think about it, even larger blow-ups might be valid,
>>> since your data must have been deserialized from the compressed Avro
>>> format, making it bigger. The order of magnitude of the difference in size
>>> would depend on the type of data you have and how well it compresses.
>>>
>>>
>>>
>>> The purpose of these formats is to store data in persistent storage in a
>>> way that is fast to read, not to reduce cache-memory usage.
>>>
>>>
>>>
>>> Maybe others here have more info to share.
>>>
>>>
>>>
>>> Regards,
>>>
>>> Shreya
>>>
>>>
>>>
>>> Sent from my Windows 10 phone
>>>
>>>
>>>
>>> *From: *Prithish <prith...@gmail.com>
>>> *Sent: *Tuesday, November 15, 2016 11:04 PM
>>> *To: *Shreya Agarwal <shrey...@microsoft.com>
>>> *Subject: *Re: AVRO File size when caching in-memory
>>>
>>>
>>> I did another test and am noting my observations here. These were done
>>> with the same data in Avro and CSV formats.
>>>
>>> In Avro, the file size on disk was 62kb, and after caching, the in-memory
>>> size was 1224kb.
>>> In CSV, the file size on disk was 690kb, and after caching, the in-memory
>>> size was 290kb.
>>>
>>> I'm guessing that Spark's caching is not able to compress when the source
>>> is Avro, but that may be a premature conclusion on my part. Waiting to
>>> hear your observations.
>>>
>>> On Wed, Nov 16, 2016 at 12:14 PM, Prithish <prith...@gmail.com> wrote:
>>>
>>>> Thanks for your response.
>>>>
>>>> I have attached the code (which I ran in the spark-shell) as well as a
>>>> sample Avro file. After you run this code, the data is cached in memory,
>>>> and you can go to the "Storage" tab in the Spark UI (localhost:4040) to
>>>> see the size it uses. In this example the size is small, but in my actual
>>>> scenario the source file is 30GB and the in-memory size comes to around
>>>> 800GB. I am trying to understand whether or not this is expected when
>>>> using Avro.
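>>>>
>>>> (The attachment itself is not reproduced here; a rough sketch of that
>>>> kind of code, with a placeholder path, would be:)
>>>>
>>>>   // spark-shell with the spark-avro 3.0.1 package on the classpath
>>>>   val df = spark.read.format("com.databricks.spark.avro").load("/path/to/sample.avro")
>>>>   df.createOrReplaceTempView("avro_data")
>>>>
>>>>   // cache the table and force materialization so it appears in the Storage tab
>>>>   spark.catalog.cacheTable("avro_data")
>>>>   spark.table("avro_data").count()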
>>>>
>>>> On Wed, Nov 16, 2016 at 10:37 AM, Shreya Agarwal <
>>>> shrey...@microsoft.com> wrote:
>>>>
>>>>> I haven't ever used Avro. But if you can send over a quick code sample,
>>>>> I can run it, see if I can repro the issue, and maybe debug.
>>>>>
>>>>>
>>>>>
>>>>> *From:* Prithish [mailto:prith...@gmail.com]
>>>>> *Sent:* Tuesday, November 15, 2016 8:44 PM
>>>>> *To:* Jörn Franke <jornfra...@gmail.com>
>>>>> *Cc:* User <user@spark.apache.org>
>>>>> *Subject:* Re: AVRO File size when caching in-memory
>>>>>
>>>>>
>>>>>
>>>>> Anyone?
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Nov 15, 2016 at 10:45 AM, Prithish <prith...@gmail.com> wrote:
>>>>>
>>>>> I am using 2.0.1 and databricks avro library 3.0.1. I am running this
>>>>> on the latest AWS EMR release.
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Nov 14, 2016 at 3:06 PM, Jörn Franke <jornfra...@gmail.com>
>>>>> wrote:
>>>>>
>>>>> Which Spark version? Are you using Tungsten?
>>>>>
>>>>>
>>>>> > On 14 Nov 2016, at 10:05, Prithish <prith...@gmail.com> wrote:
>>>>> >
>>>>> > Can someone please explain why this happens?
>>>>> >
>>>>> > When I read a 600kb Avro file and cache it in memory (using
>>>>> > cacheTable), it shows up as 11mb (Storage tab in the Spark UI). I have
>>>>> > tried this with different file sizes, and the in-memory size is always
>>>>> > proportionately larger. I thought Spark compressed the data when using
>>>>> > cacheTable.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>
>
> --
> ---
> Takeshi Yamamuro
>
