Hi,

I'm trying to implement a folding function in Spark. It takes an input k
and a DataFrame of ids and dates. For k=1 the result is just the
DataFrame; for k=2 it contains the min and max date for each id once and
every other row twice; for k=3 it contains the min and max once, min+1
and max-1 twice, and the rest three times; and so on.
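
For example, for a single id with dates d1 < d2 < d3 < d4 < d5 and k=3,
the output should contain d1 and d5 once, d2 and d4 twice, and d3 three
times (nine rows in total).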

My code in Scala, with variable names changed:
  // Get the min and max date per id in one pass, then bring the small
  // per-id result back to the driver.
  val acctDates = df.groupBy("id")
    .agg(min("thedate"), max("thedate"))
    .collect()

  // Build one big disjunction matching every row strictly between its
  // id's min and max date. The quotes assume thedate is a date or string
  // column (and id numeric); adjust if the types differ.
  val filterString = acctDates.map { aDate =>
    s"(id = ${aDate(0)} and thedate > '${aDate(1)}' and thedate < '${aDate(2)}')"
  }.mkString(" or ")

  // Union the interior rows back in k - 1 more times. The filter is the
  // same on every pass, so every non-endpoint row ends up k times, which
  // only matches the spec above for k <= 2.
  var result = df
  val interior = df.where(filterString)
  for (i <- 1 to k - 1) {
    result = result.unionAll(interior)
  }
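
With many distinct ids the generated where clause gets very large, and
on 1.5.2 this fails with a StackOverflowError.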

The pandas/Python code I'm trying to translate:

df = pd.concat([df.groupby('id').apply(
    lambda x: pd.concat([x.iloc[i: i + k] for i in range(len(x.index) - k + 1)]))])
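
As I understand the pandas line, it concatenates, per id, every window
of k consecutive rows, so a row at position p in a group of n rows
appears min(p, n - k + 1) - max(1, p - k + 1) + 1 times. Here is a
sketch of the same fold in Spark without any generated where clause,
using window functions; the copies UDF and the column names "p" and "n"
are my own, k is assumed in scope, and this likely needs a newer Spark
than 1.5:

  import org.apache.spark.sql.expressions.Window
  import org.apache.spark.sql.functions._

  // Row position within each id (ordered by date) and the group size.
  val byId = Window.partitionBy("id")
  val numbered = df
    .withColumn("p", row_number().over(byId.orderBy("thedate")))
    .withColumn("n", count(lit(1)).over(byId))

  // One array element per length-k window that contains row p of n; the
  // array is empty (row dropped) when a group has fewer than k rows,
  // matching the pandas behaviour.
  val copies = udf { (p: Int, n: Int) =>
    Array.range(0, math.min(p, n - k + 1) - math.max(1, p - k + 1) + 1)
  }

  val folded = numbered
    .withColumn("copy", explode(copies(col("p"), col("n").cast("int"))))
    .drop("p", "n", "copy")

This keeps everything in a single plan instead of a filter string that
grows with the number of ids.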



