Hi, I am trying to understand how Spark types are kept in memory and accessed. I tried to look at the code at the definitions of MapType and ArrayType, for example, but I can't seem to find the relevant code for their actual implementation.
I am trying to figure out how these two types are implemented so I can understand how they match my needs.

First, size. A map appears to take the same space as two arrays, which is about double a naïve array layout: if I have 1000 rows, each with a map from 10K integers to 10K integers, caching the dataframe shows a total of ~150MB, whereas the naïve layout of two int arrays would cost 1000*10000*(4+4) bytes, or ~80MB in total. I see the same ~150MB if I use two arrays instead of a map.

Second, what is the performance of updating the map/arrays, given that they are immutable (i.e. some copying is required)?

The reason I am asking is that I want to write an aggregate function which calculates a variation of a histogram. The most naïve solution would be a map from bin to count. But since we are talking about an immutable map, wouldn't every update cost a lot more? A further optimization would be a mutable array where the key and value are combined into a single value (key and value are both int in my case). Assuming the maximum number of bins is small (e.g. fewer than 10), it is often cheaper to just search the array linearly for the right key, and the data is expected to be significantly smaller than a map. In my case, most of the time (90%) there are fewer than 3 elements in the bin, and if I end up with more than 10 bins I basically combine bins to reduce the number.

For few elements, a map becomes very inefficient: if I create 10M rows with one map from int to int each, I get ~380MB overall, i.e. ~38 bytes per element instead of just 8. An array is again too large (229MB, i.e. ~23 bytes per element).

Is there a way to implement a simple mutable array type to use in the aggregation buffer? And where is the portion of the code that handles the actual type handling?

Thanks,
Assaf.
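To make the packed-array idea concrete, here is a minimal sketch in plain Python (not Spark code; all names here are my own, and it assumes non-negative 32-bit keys and values). It packs each (key, count) pair into one 64-bit int, updates counts via linear search, and applies one simple reduction policy when the bin limit is exceeded:

```python
def pack(key, value):
    # Pack two 32-bit ints into one 64-bit int, key in the high bits.
    # Assumes non-negative values; negative ints would need sign handling.
    return ((key & 0xFFFFFFFF) << 32) | (value & 0xFFFFFFFF)

def unpack(packed):
    # Inverse of pack: recover the (key, value) pair.
    return (packed >> 32) & 0xFFFFFFFF, packed & 0xFFFFFFFF

def add_to_histogram(buffer, key, count=1, max_bins=10):
    # Linear search: with fewer than ~10 bins this is cheap, and the
    # buffer stays a flat list of 8-byte entries with no per-entry overhead.
    for i, packed in enumerate(buffer):
        k, v = unpack(packed)
        if k == key:
            buffer[i] = pack(k, v + count)  # mutate in place
            return buffer
    if len(buffer) < max_bins:
        buffer.append(pack(key, count))
    else:
        # Over the bin limit: fold the new count into the bin with the
        # nearest key (one simple combination policy; the exact scheme
        # is left open above).
        nearest = min(range(len(buffer)),
                      key=lambda i: abs(unpack(buffer[i])[0] - key))
        k, v = unpack(buffer[nearest])
        buffer[nearest] = pack(k, v + count)
    return buffer
```

The same layout translates to a Scala `Array[Long]` in an aggregation buffer; the point is only that update-in-place plus linear search avoids both map-copy costs and per-entry object overhead for small bin counts.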
-- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Memory-usage-for-spark-types-tp18984.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com.