All:

If this question has already been discussed, please let me know and I can
look through the archive.

Data Characteristics:
    entity_id  date  fact_1  fact_2  ...  fact_N  derived_1  derived_2  ...  derived_X

a) There are thousands of such entities in the system.
b) Each entity has various Fact attributes, currently one row per day. In
the future, we want to support multiple entries per day.
c) The goal is to calculate various Derived attributes; some of them are
window functions, such as average, moving average, etc.
d) The total number of rows per entity may be skewed, i.e. not equally
distributed across entities.

Question:
1) What is the best way to partition the data for better performance? Is
there anything specific to consider given point d) above (the skewed row
counts per entity)?
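
For example, I was not sure whether an explicit repartition on entity_id
before applying the window helps at all, since the window specification
already shuffles by entity_id. A minimal sketch of what I mean (the
repartition call below is just an idea I am asking about, not something I
have benchmarked):

    # Colocate all rows for an entity in the same partition before the
    # window runs; the window itself also partitions by entity_id, so
    # this extra shuffle may be redundant -- that is part of my question.
    df_repart = df.repartition('entity_id')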

Sample code:
The following code seems to work fine on a smaller sample size:
    from pyspark.sql import Window
    from pyspark.sql.functions import mean

    # Per-entity window: the current row plus the 30 preceding rows
    window = Window.partitionBy('entity_id').orderBy('date').rowsBetween(-30, 0)
    moving_avg = mean(df['fact_1']).over(window)
    df2 = df.withColumn('derived_moving_avg', moving_avg)
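
Related to point b): once there are multiple entries per day,
rowsBetween(-30, 0) would count rows rather than days. I was thinking of
switching to a range-based window in that case; a rough sketch, where
day_number is a helper column I would add (it is not in the current
schema):

    from pyspark.sql import Window
    from pyspark.sql import functions as F

    # Turn the date into an integer day count so rangeBetween can be used
    df_days = df.withColumn('day_number',
                            F.datediff('date', F.lit('1970-01-01')))

    # Window covering the current day and the preceding 30 days per entity,
    # independent of how many rows fall on each day
    window_30d = (Window.partitionBy('entity_id')
                        .orderBy('day_number')
                        .rangeBetween(-30, 0))
    df3 = df_days.withColumn('derived_moving_avg_30d',
                             F.mean('fact_1').over(window_30d))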

Please advise on any aspects that need to be considered to make this run
efficiently at larger data sizes (on an N-node Spark cluster).
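
To quantify point d) before deciding on a partitioning scheme, I was
planning to look at the per-entity row counts with something like this
(purely a diagnostic, column names as above):

    from pyspark.sql import functions as F

    # Rough skew check: how many rows each entity contributes
    row_counts = df.groupBy('entity_id').agg(F.count('*').alias('n_rows'))
    row_counts.orderBy(F.desc('n_rows')).show(20)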

Thanks in advance,
Vasu.
