It's something like the schema shown below (with several additional levels/sublevels):
root
 |-- sentAt: long (nullable = true)
 |-- sharing: string (nullable = true)
 |-- receivedAt: long (nullable = true)
 |-- ip: string (nullable = true)
 |-- story: struct (nullable = true)
 |    |-- super: string (nullable = true)
 |    |-- lang: string (nullable = true)
 |    |-- setting: string (nullable = true)
 |    |-- myapp: struct (nullable = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- ver: string (nullable = true)
 |    |    |-- build: string (nullable = true)
 |    |-- comp: struct (nullable = true)
 |    |    |-- notes: string (nullable = true)
 |    |    |-- source: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- content: string (nullable = true)
 |    |    |-- sub: string (nullable = true)
 |    |-- loc: struct (nullable = true)
 |    |    |-- city: string (nullable = true)
 |    |    |-- country: string (nullable = true)
 |    |    |-- lat: double (nullable = true)
 |    |    |-- long: double (nullable = true)
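(For reference, a minimal Spark-shell sketch of the kind of caching test discussed in this thread could look like the following; it is not the actual attachment mentioned further down. The file path and view name are placeholders, and it assumes Spark 2.0.x with the Databricks spark-avro 3.0.1 package referenced later in the thread.)

  // Started with: spark-shell --packages com.databricks:spark-avro_2.11:3.0.1
  val df = spark.read.format("com.databricks.spark.avro").load("/tmp/sample.avro")  // placeholder path
  df.printSchema()                      // prints the nested schema shown above
  df.createOrReplaceTempView("events")  // "events" is an arbitrary view name
  spark.catalog.cacheTable("events")    // caching is lazy at this point
  spark.table("events").count()         // forces the cache to materialize
  // The in-memory size then appears under the "Storage" tab of the Spark UI (localhost:4040).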
On Wed, Nov 16, 2016 at 2:08 PM, Takeshi Yamamuro <linguin....@gmail.com> wrote:

> Hi,
>
> What's the schema interpreted by Spark?
> The compression logic of Spark caching depends on column types.
>
> // maropu
>
> On Wed, Nov 16, 2016 at 5:26 PM, Prithish <prith...@gmail.com> wrote:
>
>> Thanks for your response.
>>
>> I did some more tests and I am seeing that when I have a flatter
>> structure for my AVRO, the cache memory use is close to the CSV. But when
>> I use a few levels of nesting, the cache memory usage blows up. This is
>> really critical for planning the cluster we will be using. To avoid using a
>> larger cluster, it looks like we will have to keep the structure as flat
>> as possible.
>>
>> On Wed, Nov 16, 2016 at 1:18 PM, Shreya Agarwal <shrey...@microsoft.com> wrote:
>>
>>> (Adding user@spark back to the discussion)
>>>
>>> Well, the CSV vs AVRO difference might be simpler to explain. CSV has a lot
>>> of scope for compression. On the other hand, Avro and Parquet are already
>>> compressed and just store extra schema info, afaik. Avro and Parquet are
>>> both going to make your data smaller, Parquet through compressed columnar
>>> storage, and Avro through its binary data format.
>>>
>>> Next, talking about the 62kb becoming 1224kb: I actually do not see such
>>> a massive blow-up. The avro you shared is 28kb on my system and becomes
>>> 53.7kb when cached in memory deserialized and 52.9kb when cached in memory
>>> serialized. Exact same numbers with Parquet as well. This is expected
>>> behavior, if I am not wrong.
>>>
>>> In fact, now that I think about it, even larger blow-ups might be valid,
>>> since your data must have been deserialized from the compressed avro
>>> format, making it bigger. The order of magnitude of difference in size
>>> would depend on the type of data you have and how well it was compressible.
>>>
>>> The purpose of these formats is to store data to persistent storage in a
>>> way that's faster to read from, not to reduce cache-memory usage.
>>>
>>> Maybe others here have more info to share.
>>>
>>> Regards,
>>> Shreya
>>>
>>> Sent from my Windows 10 phone
>>>
>>> *From:* Prithish <prith...@gmail.com>
>>> *Sent:* Tuesday, November 15, 2016 11:04 PM
>>> *To:* Shreya Agarwal <shrey...@microsoft.com>
>>> *Subject:* Re: AVRO File size when caching in-memory
>>>
>>> I did another test and am noting my observations here. These were done
>>> with the same data in avro and csv formats.
>>>
>>> In AVRO, the file size on disk was 62kb and after caching, the in-memory
>>> size is 1224kb.
>>> In CSV, the file size on disk was 690kb and after caching, the in-memory
>>> size is 290kb.
>>>
>>> I'm guessing that Spark caching is not able to compress when the source
>>> is avro. Not sure if this is just my immature conclusion. Waiting to hear
>>> your observation.
>>>
>>> On Wed, Nov 16, 2016 at 12:14 PM, Prithish <prith...@gmail.com> wrote:
>>>
>>>> Thanks for your response.
>>>>
>>>> I have attached the code (that I ran using the Spark-shell) as well as
>>>> a sample avro file. After you run this code, the data is cached in memory
>>>> and you can go to the "Storage" tab on the Spark UI (localhost:4040) and
>>>> see the size it uses. In this example the size is small, but in my actual
>>>> scenario the source file size is 30GB and the in-memory size comes to
>>>> around 800GB. I am trying to understand if this is expected when using
>>>> avro or not.
>>>>
>>>> On Wed, Nov 16, 2016 at 10:37 AM, Shreya Agarwal <shrey...@microsoft.com> wrote:
>>>>
>>>>> I haven't used Avro ever. But if you can send over a quick sample
>>>>> code, I can run and see if I repro it and maybe debug.
>>>>>
>>>>> *From:* Prithish [mailto:prith...@gmail.com]
>>>>> *Sent:* Tuesday, November 15, 2016 8:44 PM
>>>>> *To:* Jörn Franke <jornfra...@gmail.com>
>>>>> *Cc:* User <user@spark.apache.org>
>>>>> *Subject:* Re: AVRO File size when caching in-memory
>>>>>
>>>>> Anyone?
>>>>>
>>>>> On Tue, Nov 15, 2016 at 10:45 AM, Prithish <prith...@gmail.com> wrote:
>>>>>
>>>>> I am using 2.0.1 and the Databricks avro library 3.0.1. I am running
>>>>> this on the latest AWS EMR release.
>>>>>
>>>>> On Mon, Nov 14, 2016 at 3:06 PM, Jörn Franke <jornfra...@gmail.com> wrote:
>>>>>
>>>>> Spark version? Are you using Tungsten?
>>>>>
>>>>> > On 14 Nov 2016, at 10:05, Prithish <prith...@gmail.com> wrote:
>>>>> >
>>>>> > Can someone please explain why this happens?
>>>>> >
>>>>> > When I read a 600kb AVRO file and cache this in memory (using
>>>>> cacheTable), it shows up as 11mb (storage tab in Spark UI). I have tried
>>>>> this with different file sizes, and the size in-memory is always
>>>>> proportionate. I thought Spark compresses when using cacheTable.
>
> --
> ---
> Takeshi Yamamuro
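(Footnote: a small sketch of how the flat-vs-nested comparison discussed above could be set up in the same Spark-shell session. The selected columns are only illustrative and would need to match the real schema; the path, view names, and spark-avro package are the same assumptions as in the earlier sketch.)

  import org.apache.spark.sql.functions.col

  // Cache the nested DataFrame exactly as read from avro.
  val nested = spark.read.format("com.databricks.spark.avro").load("/tmp/sample.avro")
  nested.createOrReplaceTempView("nested_events")
  spark.catalog.cacheTable("nested_events")
  spark.table("nested_events").count()

  // Cache a flattened projection that pulls struct fields up to top-level columns.
  val flat = nested.select(
    col("sentAt"),
    col("ip"),
    col("story.lang").as("story_lang"),
    col("story.myapp.id").as("app_id"),
    col("story.loc.country").as("country"))
  flat.createOrReplaceTempView("flat_events")
  spark.catalog.cacheTable("flat_events")
  spark.table("flat_events").count()
  // Compare the "nested_events" and "flat_events" entries in the Storage tab
  // to see how nesting affects the cached size.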