Re: Aggregate to array (or 'slice by key') with DataFrames

2015-08-21 Thread Nathan Skone
Raghavendra, Thanks for the quick reply! I don’t think I included enough information in my question. I am hoping to get fields that are not directly part of the aggregation. Imagine a dataframe representing website views with a userID, datetime, and a webpage address. How could I find the

Re: Aggregate to array (or 'slice by key') with DataFrames

2015-08-21 Thread Akhil Das
Did you try sorting it by datetime and doing a groupBy on the userID? On Aug 21, 2015 12:47 PM, Nathan Skone nat...@skone.org wrote: Raghavendra, Thanks for the quick reply! I don’t think I included enough information in my question. I am hoping to get fields that are not directly part of the

Re: Aggregate to array (or 'slice by key') with DataFrames

2015-08-21 Thread Raghavendra Pandey
Impact, You can group the data by key, sort each group by timestamp, and take the min to select the oldest value. On Aug 21, 2015 11:15 PM, Impact nat...@skone.org wrote: I am also looking for a way to achieve the reduceByKey functionality on data frames. In my case I need to select one particular row

Re: Aggregate to array (or 'slice by key') with DataFrames

2015-08-21 Thread Impact
I am also looking for a way to achieve the reduceByKey functionality on data frames. In my case I need to select one particular row (the oldest, based on a timestamp column value) by key.
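
The request above (keep the entire oldest row for each key, not just the aggregated timestamp) can be illustrated in plain Python. This is only a sketch of the semantics, not Spark code: the helper name `oldest_row_per_key` and the sample rows are hypothetical, and the column names are borrowed from the website-views example mentioned elsewhere in the thread.

```python
from datetime import datetime


def oldest_row_per_key(rows, key, ts):
    """Return {key value: row}, keeping the row with the smallest timestamp.

    Hypothetical helper: mimics a reduceByKey whose reduce function keeps
    whichever of two rows is older, so non-aggregated fields survive.
    """
    oldest = {}
    for row in rows:
        k = row[key]
        # Keep the first row seen for a key, or replace it with an older one.
        if k not in oldest or row[ts] < oldest[k][ts]:
            oldest[k] = row
    return oldest


# Made-up sample data for illustration only.
views = [
    {"userID": "u1", "datetime": datetime(2015, 8, 21, 9, 30), "page": "/pricing"},
    {"userID": "u1", "datetime": datetime(2015, 8, 20, 8, 0), "page": "/home"},
    {"userID": "u2", "datetime": datetime(2015, 8, 21, 12, 47), "page": "/docs"},
]

first_views = oldest_row_per_key(views, key="userID", ts="datetime")
for user, row in sorted(first_views.items()):
    print(user, row["datetime"].isoformat(), row["page"])
```

Because the comparison carries the whole row along, fields such as the page address stay attached to the selected timestamp.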

Re: Aggregate to array (or 'slice by key') with DataFrames

2015-08-21 Thread Dan LaBar
Nathan, I achieve this using rowNumber. Here is a Python DataFrame example:

    from pyspark.sql.window import Window
    from pyspark.sql.functions import desc, rowNumber

    yourOutputDF = (
        yourInputDF
        .withColumn(first, rowNumber()
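
Since the snippet above is cut off, here is a plain-Python sketch of the idea a row-number window expresses: partition the rows by key, order each partition, number the rows from 1, and keep row 1. It is not the PySpark API and does not reconstruct the truncated code. The helper `number_rows`, the `row_number` field, and the sample data are assumptions made for illustration. (Later Spark releases name the function `row_number` rather than `rowNumber`.)

```python
from datetime import datetime
from itertools import groupby
from operator import itemgetter


def number_rows(rows, partition_key, order_key, descending=True):
    """Add a 1-based 'row_number' to each row within its partition.

    Hypothetical helper imitating a row-number window function.
    """
    # Order by the sort column first, then stable-sort by partition so the
    # ordering inside each partition is preserved.
    ordered = sorted(rows, key=itemgetter(order_key), reverse=descending)
    ordered.sort(key=itemgetter(partition_key))
    numbered = []
    for _, group in groupby(ordered, key=itemgetter(partition_key)):
        for n, row in enumerate(group, start=1):
            numbered.append(dict(row, row_number=n))
    return numbered


# Made-up sample data for illustration only.
page_views = [
    {"userID": "u1", "datetime": datetime(2015, 8, 20, 8, 0), "page": "/home"},
    {"userID": "u2", "datetime": datetime(2015, 8, 21, 12, 47), "page": "/docs"},
    {"userID": "u1", "datetime": datetime(2015, 8, 21, 9, 30), "page": "/pricing"},
]

numbered = number_rows(page_views, "userID", "datetime", descending=True)
# Row 1 of each partition is the most recent view when ordering is descending.
latest = [row for row in numbered if row["row_number"] == 1]
for row in latest:
    print(row["userID"], row["page"])
```

Passing `descending=False` would instead select the oldest row per key, which is the case asked about earlier in the thread.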

Aggregate to array (or 'slice by key') with DataFrames

2015-07-05 Thread Alex Beatson
Hello, I'm migrating some RDD-based code to use DataFrames. We've seen massive speedups so far! One of the operations in the old code creates an array of the values for each key, as follows: val collatedRDD = valuesRDD.mapValues(value => Array(value)).reduceByKey((array1, array2) =
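
The pattern described above (wrap each value in a one-element array, then merge arrays per key) can be sketched in plain Python. The reduce function in the original message is truncated, so the concatenation used here is an assumption, and `map_values`, `reduce_by_key`, and the sample pairs are hypothetical stand-ins rather than Spark calls.

```python
def map_values(pairs, f):
    """Apply f to the value of every (key, value) pair, leaving keys as-is."""
    return [(k, f(v)) for k, v in pairs]


def reduce_by_key(pairs, f):
    """Merge all values that share a key using the binary function f."""
    merged = {}
    for k, v in pairs:
        merged[k] = f(merged[k], v) if k in merged else v
    return merged


# Made-up sample data for illustration only.
values = [("a", 1), ("b", 2), ("a", 3), ("a", 4), ("b", 5)]

# Assumption: the truncated reduce function concatenates the two arrays.
collated = reduce_by_key(
    map_values(values, lambda value: [value]),
    lambda array1, array2: array1 + array2,
)
print(collated)
```

In this single-process sketch the values keep their input order; a distributed reduce would not necessarily guarantee that.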