I think it does mean more memory usage but consider how big your arrays are. Think about your use case requirements and whether it makes sense to use arrays. Also it may be preferable to explode if the arrays are very large. I'd say exploding arrays will make the data more splittable, having the array has benefit of avoiding a join and colocation of the children items but does imply more memory pressure on each executor to read every record in the array, requiring denser nodes.
I hope that helps. On Sun, 19 Jan 2020, 7:50 am V0lleyBallJunki3, <venkatda...@gmail.com> wrote: > I am using a dataframe and has structure like this : > > root > |-- orders: array (nullable = true) > | |-- element: struct (containsNull = true) > | | |-- amount: double (nullable = true) > | | |-- id: string (nullable = true) > |-- user: string (nullable = true) > |-- language: string (nullable = true) > > Each user has multiple orders. Now if I explode orders like this: > > df.select($"user", explode($"orders").as("order")) . Each order element > will > become a row with a duplicated user and language. Was wondering if spark > actually converts each order element into a single row in memory or it just > logical. Because if a single user has 1000 orders then wouldn't it lead to > a lot more memory consumption since it is duplicating user and language a > 1000 times (once for each order) in memory? > > > > -- > Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ > > --------------------------------------------------------------------- > To unsubscribe e-mail: user-unsubscr...@spark.apache.org > >