Hi,
I'm trying to implement a folding function in Spark. It takes an input k and
a data frame of ids and dates. For k=1 the result is just the data frame;
for k=2 it contains the min and max date for each id once and every other
date twice; for k=3 it contains the min and max once, min+1 and max-1 twice,
and the rest three times; and so on.
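Concretely, each date ends up repeated once per length-k sliding window that contains it. A toy sketch in plain Python (no Spark; dates assumed sorted and distinct per id) of just the multiplicities:

```python
def fold_counts(dates, k):
    """How many times each date appears after the fold: the windows are
    dates[i:i+k] for each valid start i, so the min and max fall in one
    window, the next-innermost dates in two, and so on (capped at k)."""
    n = len(dates)
    folded = [d for i in range(n - k + 1) for d in dates[i:i + k]]
    return {d: folded.count(d) for d in dates}

# fold_counts([1, 2, 3, 4, 5], 3) -> {1: 1, 2: 2, 3: 3, 4: 2, 5: 1}
```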
My attempt in Scala, with variable names changed:
val acctMinDates = df.groupBy("id").agg(min("thedate"))
val acctMaxDates = df.groupBy("id").agg(max("thedate"))
// each Row is (id, min(thedate), max(thedate)) after the join
val acctDates = acctMinDates.join(acctMaxDates, "id").collect()

var filterString = ""
for (i <- 1 to k - 1) {
  if (i == 1) {
    for (aDate <- acctDates) {
      // quote the date literals so the generated SQL parses
      filterString = filterString + "(id = " + aDate(0) +
        " and thedate > '" + aDate(1) + "' and thedate < '" + aDate(2) + "') or "
    }
    filterString = filterString.substring(0, filterString.size - 4) // drop trailing " or "
  }
  // note: the bounds never narrow, so every pass past the first re-adds the
  // same interior rows; df has to be a var for this reassignment to compile
  df = df.unionAll(df.where(filterString))
}
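With many ids this builds one enormous OR chain, which is presumably what triggers the StackOverflow in the subject line. The predicate itself is only a per-id bounds check, though; a plain-Python sketch of that logic (separate from any Spark API — in Spark the equivalent would be a join against the bounds plus column comparisons rather than a generated SQL string):

```python
def interior_rows(rows, bounds):
    """rows: (id, date) pairs; bounds: id -> (min_date, max_date).
    Keep rows strictly inside their id's date range -- the same predicate
    the generated "(id = ... and thedate > ... and thedate < ...) or ..."
    string encodes, without building one giant OR clause."""
    return [(i, d) for i, d in rows
            if i in bounds and bounds[i][0] < d < bounds[i][1]]

# interior_rows([('a', 1), ('a', 2), ('a', 3), ('a', 4)], {'a': (1, 4)})
# -> [('a', 2), ('a', 3)]
```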
The pandas/Python code I'm trying to translate:

df = pd.concat([df.groupby('id').apply(lambda x: pd.concat(
    [x.iloc[i:i + k] for i in range(len(x.index) - k + 1)]))])
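For reference, that one-liner is, per id, the concatenation of every length-k slice. A plain-Python equivalent of the whole fold (hypothetical helper name, no pandas; assumes each id's dates are distinct and should be taken in sorted order):

```python
from collections import defaultdict

def fold_rows(rows, k):
    """rows: list of (id, date) pairs. Returns each row repeated once per
    length-k sliding window (per id) that contains it, mirroring the
    pandas groupby/concat one-liner."""
    by_id = defaultdict(list)
    for id_, date in rows:
        by_id[id_].append(date)
    out = []
    for id_, dates in by_id.items():
        dates.sort()
        n = len(dates)
        for i in range(n - k + 1):          # one slice per window start
            out.extend((id_, d) for d in dates[i:i + k])
    return out

# fold_rows([('a', 1), ('a', 2), ('a', 3)], 2)
# -> [('a', 1), ('a', 2), ('a', 2), ('a', 3)]
```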
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Large-where-clause-StackOverflow-1-5-2-tp27544.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.