It depends on the use case: if you would otherwise have to join, having the
orders already in an array saves you a join and a shuffle.
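
For instance, a minimal sketch of the two layouts (usersDf, usersTable,
ordersTable and the column names are illustrative, not from this thread;
assumes a spark-shell style session with spark.implicits in scope):

import org.apache.spark.sql.functions._
import spark.implicits._

// Nested layout: each user's orders are already colocated, so a per-user
// total needs no join or shuffle (aggregate() is a Spark 2.4+ SQL function)
val nestedTotals = usersDf.select(
  $"user",
  expr("aggregate(orders, 0D, (acc, o) -> acc + o.amount)").as("total"))

// Normalized layout: the same answer costs a join, which generally
// shuffles both sides unless they are already co-partitioned
val joinedTotals = usersTable.join(ordersTable, "user")
  .groupBy("user")
  .agg(sum($"amount").as("total"))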

If you explode, at least sort within partitions so that you get predicate
pushdown the next time you read the data.
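
Something like this, as a sketch (the output path and filter column are made
up): sorting within partitions keeps each Parquet row group's min/max
statistics tight on that column, so a later selective read can skip row
groups entirely.

// Explode and flatten, then sort within partitions on the column you
// filter by most often
df.select($"user", $"language", explode($"orders").as("order"))
  .select($"user", $"language",
    $"order.id".as("order_id"), $"order.amount".as("amount"))
  .sortWithinPartitions($"order_id")
  .write.parquet("/tmp/orders_exploded")   // illustrative path

// A later selective read can then skip whole row groups via the stats
spark.read.parquet("/tmp/orders_exploded")
  .filter($"order_id" === "some-id")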

On Sun, 19 Jan 2020, 1:19 pm Jörn Franke, <jornfra...@gmail.com> wrote:

> Why not use two tables that you can then join? That would be the standard
> way. It depends on what your full use case is, what volumes / order counts
> you expect on average, and what your aggregations and filters look like. The
> example below suggests that you do a select-all on the table.
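>
> A minimal sketch of that two-table layout (table and column names are
> illustrative, not from this thread):
>
> import spark.implicits._
>
> // One table per entity instead of a nested array of orders per user
> val users  = Seq(("u1", "en")).toDF("user", "language")
> val orders = Seq(("u1", "o1", 10.0), ("u1", "o2", 5.5))
>   .toDF("user", "id", "amount")
>
> // Join back on demand; filters and aggregations can then run on either
> // table independently before the join
> val joined = users.join(orders, "user")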
>
> > Am 19.01.2020 um 01:50 schrieb V0lleyBallJunki3 <venkatda...@gmail.com>:
> >
> > I am using a dataframe that has a structure like this:
> >
> > root
> > |-- orders: array (nullable = true)
> > |    |-- element: struct (containsNull = true)
> > |    |    |-- amount: double (nullable = true)
> > |    |    |-- id: string (nullable = true)
> > |-- user: string (nullable = true)
> > |-- language: string (nullable = true)
> >
> > Each user has multiple orders. Now if I explode orders like this:
> >
> > df.select($"user", explode($"orders").as("order"))
> >
> > each order element will become a row with a duplicated user and language. I
> > was wondering whether Spark actually converts each order element into a
> > separate row in memory, or whether that is just logical. If a single user
> > has 1000 orders, wouldn't it lead to a lot more memory consumption, since
> > user and language are duplicated 1000 times (once per order) in memory?
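> >
> > For example (a small check, counting rows rather than bytes):
> >
> > df.count()                                       // one row per user
> > df.select($"user", explode($"orders")).count()   // one row per order element
> > df.select($"user", explode($"orders")).explain() // inspect the physical plan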
> >
> >
> >
