Raghavendra,
Thanks for the quick reply! I don’t think I included enough information in my
question. I am hoping to get fields that are not directly part of the
aggregation. Imagine a dataframe representing website views with a userID,
datetime, and a webpage address. How could I find the most recently viewed
webpage for each user?
Did you try sorting it by datetime and doing a groupBy on the userID?
On Aug 21, 2015 12:47 PM, Nathan Skone nat...@skone.org wrote:
Impact,
You can group the data by key, then sort it by timestamp and take the minimum
timestamp to select the oldest value.
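In plain Python, that reduce step amounts to keeping the record with the smallest timestamp per key. A minimal sketch (the keys, timestamps, and page values below are invented for illustration):

```python
# Hypothetical (key, (timestamp, value)) pairs, mimicking the shape
# reduceByKey sees after keying the data. All names/values are made up.
rows = [
    ("u1", (30, "pageC")),
    ("u1", (10, "pageA")),
    ("u2", (20, "pageB")),
    ("u1", (25, "pageD")),
]

def keep_oldest(a, b):
    # Reduce function: keep the record with the smaller timestamp.
    return a if a[0] <= b[0] else b

# Plain-Python stand-in for rdd.reduceByKey(keep_oldest)
oldest = {}
for key, value in rows:
    oldest[key] = keep_oldest(oldest[key], value) if key in oldest else value

print(oldest)  # {'u1': (10, 'pageA'), 'u2': (20, 'pageB')}
```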
On Aug 21, 2015 11:15 PM, Impact nat...@skone.org wrote:
I am also looking for a way to achieve the reduceByKey functionality on data
frames. In my case I need to select one particular row (the oldest, based on
a timestamp column value) by key.
Nathan,
I achieve this using rowNumber over a window. Here is a Python DataFrame
example:

from pyspark.sql.window import Window
from pyspark.sql.functions import col, desc, rowNumber

# Number the rows within each userID, newest datetime first,
# then keep only the first row of each partition.
windowSpec = Window.partitionBy("userID").orderBy(desc("datetime"))
yourOutputDF = (
    yourInputDF
    .withColumn("first", rowNumber().over(windowSpec))
    .where(col("first") == 1)
    .drop("first")
)
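The same keep-the-first-row-per-partition idea can be sketched in plain Python, which may help check the logic; the sample records below are invented, with datetimes simplified to integers:

```python
from itertools import groupby

# Hypothetical (userID, datetime, page) records.
views = [
    ("u1", 3, "pageC"),
    ("u1", 1, "pageA"),
    ("u2", 2, "pageB"),
]

# Partition by userID, order by datetime descending within each partition,
# then keep row number 1 of every partition.
ordered = sorted(views, key=lambda r: (r[0], -r[1]))
latest = [next(iter(g)) for _, g in groupby(ordered, key=lambda r: r[0])]

print(latest)  # [('u1', 3, 'pageC'), ('u2', 2, 'pageB')]
```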
Hello,
I'm migrating some RDD-based code to using DataFrames. We've seen massive
speedups so far!
One of the operations in the old code creates an array of the values for
each key, as follows:
val collatedRDD =
  valuesRDD.mapValues(value => Array(value))
           .reduceByKey((array1, array2) => array1 ++ array2)
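For reference, the collation that mapValues/reduceByKey performs here can be sketched in plain Python; the keys and values below are made up for illustration:

```python
# Hypothetical (key, value) pairs; the goal is one array of values per key,
# matching mapValues(Array(_)) followed by reduceByKey(_ ++ _).
pairs = [("a", 1), ("b", 2), ("a", 3), ("a", 4)]

collated = {}
for key, value in pairs:
    # mapValues wraps each value in a one-element array;
    # reduceByKey concatenates the arrays for matching keys.
    collated.setdefault(key, []).append(value)

print(collated)  # {'a': [1, 3, 4], 'b': [2]}
```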