Thanks Ted. It looks like I cannot use row_number then. I tried to run a sample window function and got the error below:

org.apache.spark.sql.AnalysisException: Could not resolve window function 'avg'. Note that, using window functions currently requires a HiveContext;
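[Editorial sketch, not from the thread: the error above is what a plain SQLContext raises for any window function on Spark 1.x; the same query resolves when the DataFrame comes from a HiveContext. A minimal sketch, assuming Spark 1.5 and a `spark-shell` with `sc` in scope; `lines.txt` is a hypothetical input path:]

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rowNumber
import org.apache.spark.sql.hive.HiveContext

// Window functions on Spark 1.5 resolve only through a HiveContext,
// not a plain SQLContext.
val hiveContext = new HiveContext(sc)
import hiveContext.implicits._

// "lines.txt" is a hypothetical input file for illustration.
val df = sc.textFile("lines.txt").toDF("line")

// No partitionBy gives a single global window; note this pulls all
// rows into one partition, so it does not scale to large data.
val ws = Window.orderBy("line")
val withRow = df.withColumn("row", rowNumber().over(ws))
```

[On Spark 1.5 the Scala function is still `rowNumber`; `row_number` is the 1.6 rename Ted quotes below.]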
On Wed, Nov 25, 2015 at 8:28 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> Vishnu:
> rowNumber (deprecated, replaced with row_number) is a window function.
>
>    * Window function: returns a sequential number starting at 1 within a
>    * window partition.
>    *
>    * @group window_funcs
>    * @since 1.6.0
>    */
>   def row_number(): Column = withExpr { UnresolvedWindowFunction("row_number", Nil) }
>
> Sample usage:
>
> df = sqlContext.range(1 << 20)
> df2 = df.select((df.id % 1000).alias("A"), (df.id / 1000).alias("B"))
> ws = Window.partitionBy(df2.A).orderBy(df2.B)
> df3 = df2.select("A", "B", rowNumber().over(ws).alias("rn")).filter("rn <= 3")
>
> Cheers
>
> On Wed, Nov 25, 2015 at 5:08 PM, Vishnu Viswanath <vishnu.viswanat...@gmail.com> wrote:
>
>> Thanks Jeff,
>>
>> rowNumber is a function in org.apache.spark.sql.functions, see
>> <https://spark.apache.org/docs/1.4.0/api/scala/index.html#org.apache.spark.sql.functions$>
>>
>> I will try to use monotonicallyIncreasingId and see if it works.
>>
>> "You'd better use a join to correlate 2 data frames": Yes, that is why I
>> thought of adding a row number to both DataFrames and joining them on the
>> row number. Is there a better way of doing this? Both DataFrames will
>> always have the same number of rows, but are not related by any column I
>> could join on.
>>
>> Thanks and Regards,
>> Vishnu Viswanath
>>
>> On Wed, Nov 25, 2015 at 6:43 PM, Jeff Zhang <zjf...@gmail.com> wrote:
>>
>>> "I tried to use df.withColumn but I am getting below exception."
>>>
>>> What is rowNumber here? A UDF? You can use monotonicallyIncreasingId
>>> for generating an id.
>>>
>>> "Also, is it possible to add a column from one dataframe to another?"
>>>
>>> You can't, because how could you add a column from one dataframe to
>>> another when they have different numbers of rows? You'd better use a
>>> join to correlate the two data frames.
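[Editorial sketch, not from the thread, addressing Vishnu's "is there a better way" question: monotonicallyIncreasingId is only guaranteed to be increasing, not consecutive — its values depend on partitioning, so they cannot safely align two different DataFrames. RDD.zipWithIndex does assign consecutive 0-based indices, which can serve as a join key when both DataFrames always have the same number of rows. A sketch, assuming a Spark 1.5 shell with `sqlContext`, and two DataFrames `df1` and `df2` standing in for Vishnu's pair:]

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType

// Attach a consecutive positional index to each row of both DataFrames.
val indexed1 = df1.rdd.zipWithIndex.map { case (row, idx) => (idx, row) }
val indexed2 = df2.rdd.zipWithIndex.map { case (row, idx) => (idx, row) }

// Join on the positional index and concatenate the matched rows.
val joined = indexed1.join(indexed2).values.map {
  case (r1, r2) => Row.fromSeq(r1.toSeq ++ r2.toSeq)
}

// Rebuild a DataFrame with the combined schema.
val combined = sqlContext.createDataFrame(
  joined, StructType(df1.schema.fields ++ df2.schema.fields))
```

[The join preserves pairing by position regardless of how the rows are partitioned, which is exactly what a shared row-number column would give.]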
>>> On Thu, Nov 26, 2015 at 6:39 AM, Vishnu Viswanath <vishnu.viswanat...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am trying to add the row number to a Spark DataFrame.
>>>> This is my dataframe:
>>>>
>>>> scala> df.printSchema
>>>> root
>>>>  |-- line: string (nullable = true)
>>>>
>>>> I tried to use df.withColumn but I am getting the exception below.
>>>>
>>>> scala> df.withColumn("row", rowNumber)
>>>> org.apache.spark.sql.AnalysisException: unresolved operator 'Project [line#2326,'row_number() AS row#2327];
>>>>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>>>>   at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>>>>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:174)
>>>>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
>>>>
>>>> Also, is it possible to add a column from one dataframe to another? Something like:
>>>>
>>>> scala> df.withColumn("line2", df2("line"))
>>>>
>>>> org.apache.spark.sql.AnalysisException: resolved attribute(s) line#2330 missing from line#2326 in operator !Project [line#2326,line#2330 AS line2#2331];
>>>>
>>>> Thanks and Regards,
>>>> Vishnu Viswanath
>>>> www.vishnuviswanath.com <http://www.vishnuviswanath.com>
>>>
>>> --
>>> Best Regards
>>>
>>> Jeff Zhang
>>
>> --
>> Thanks and Regards,
>> Vishnu Viswanath
>> +1 309 550 2311
>> www.vishnuviswanath.com <http://www.vishnuviswanath.com>

--
Thanks and Regards,
Vishnu Viswanath
www.vishnuviswanath.com <http://www.vishnuviswanath.com>