Re: Adding new column to Dataframe

Ted Yu Wed, 25 Nov 2015 18:29:03 -0800

Vishnu:
rowNumber (deprecated, replaced with row_number) is a window function.


  /**
   * Window function: returns a sequential number starting at 1 within a
   * window partition.
   *
   * @group window_funcs
   * @since 1.6.0
   */
  def row_number(): Column = withExpr {
    UnresolvedWindowFunction("row_number", Nil)
  }
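
To make "sequential number starting at 1 within a window partition" concrete, here is a minimal plain-Python model of the semantics. It is not Spark code: the helper name and its parameters are hypothetical, chosen only for illustration, and the partitions are emitted in sorted key order purely to keep the output deterministic.

```python
from collections import defaultdict


def row_number(rows, partition_key, order_key):
    """Plain-Python model of row_number() OVER (PARTITION BY ... ORDER BY ...).

    Hypothetical helper for illustration only; it is not part of any Spark API.
    Returns a list of (row, rn) pairs, where rn restarts at 1 in each partition.
    """
    # Group the rows by their partition key.
    partitions = defaultdict(list)
    for row in rows:
        partitions[partition_key(row)].append(row)

    numbered = []
    for key in sorted(partitions):
        # Order the rows inside the partition, then number them from 1.
        ordered = sorted(partitions[key], key=order_key)
        for rn, row in enumerate(ordered, start=1):
            numbered.append((row, rn))
    return numbered


rows = [("a", 30), ("b", 5), ("a", 10), ("b", 7), ("a", 20)]
result = row_number(rows, partition_key=lambda r: r[0], order_key=lambda r: r[1])
```

The numbering restarts for every partition, which is why the function needs a window specification and cannot be used as a bare column expression.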

Sample usage:

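    from pyspark.sql.functions import rowNumber  # row_number from 1.6.0 on
    from pyspark.sql.window import Window

    df = sqlContext.range(1 << 20)
    df2 = df.select((df.id % 1000).alias("A"), (df.id / 1000).alias("B"))
    # Number the rows within each A partition, ordered by B.
    ws = Window.partitionBy(df2.A).orderBy(df2.B)
    # df2 only has columns A and B. Row numbers start at 1, so "rn < 10"
    # (an illustrative threshold) keeps the first nine rows per partition.
    df3 = df2.select("A", "B", rowNumber().over(ws).alias("rn")).filter("rn < 10")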
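
On the second question in the thread (pairing rows of two DataFrames that have the same number of rows but no common column), the idea of numbering both sides and joining on that number can be sketched in plain Python as below. This is only a model of the approach, not Spark code, and the function names are hypothetical. In Spark the positional index would have to come from a deterministic ordering on both sides; ids from monotonicallyIncreasingId depend on partitioning and are not consecutive, so they may not line up across two DataFrames.

```python
def with_row_index(rows):
    """Pair each row with its 0-based position: [(0, row0), (1, row1), ...]."""
    return list(enumerate(rows))


def join_by_position(left_rows, right_rows):
    """Join two equally long row lists on their positional index.

    Hypothetical helper for illustration; it models "add a row number to both
    sides, then join on it" without using Spark.
    """
    if len(left_rows) != len(right_rows):
        raise ValueError("both inputs must have the same number of rows")
    # Index the right side by position so each left row can look up its partner.
    right_by_index = dict(with_row_index(right_rows))
    return [(left, right_by_index[i]) for i, left in with_row_index(left_rows)]


lines = ["first", "second", "third"]
lines2 = ["uno", "dos", "tres"]
joined = join_by_position(lines, lines2)
```

The length check mirrors the point raised in the quoted reply: a positional join is only meaningful when both sides have the same number of rows.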

Cheers

On Wed, Nov 25, 2015 at 5:08 PM, Vishnu Viswanath <
vishnu.viswanat...@gmail.com> wrote:

> Thanks Jeff,
>
> rowNumber is a function in org.apache.spark.sql.functions link
> <https://spark.apache.org/docs/1.4.0/api/scala/index.html#org.apache.spark.sql.functions$>
>
> I will try to use monotonicallyIncreasingId and see if it works.
>
> "You'd better use join to correlate 2 data frames": Yes, that's why I
> thought of adding a row number to both DataFrames and joining them on the
> row number. Is there a better way of doing this? Both DataFrames will
> always have the same number of rows, but they share no column to join
> on.
>
> Thanks and Regards,
> Vishnu Viswanath
> 
>
> On Wed, Nov 25, 2015 at 6:43 PM, Jeff Zhang <zjf...@gmail.com> wrote:
>
>> >>> I tried to use df.withColumn but I am getting below exception.
>>
>> What is rowNumber here? A UDF? You can use monotonicallyIncreasingId
>> to generate an id.
>>
>> >>> Also, is it possible to add a column from one dataframe to another?
>>
>> You can't: how would you add a column from one DataFrame to another if
>> they have a different number of rows? You had better use a join to
>> correlate the two DataFrames.
>>
>> On Thu, Nov 26, 2015 at 6:39 AM, Vishnu Viswanath <
>> vishnu.viswanat...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I am trying to add the row number to a spark dataframe.
>>> This is my dataframe:
>>>
>>> scala> df.printSchema
>>> root
>>> |-- line: string (nullable = true)
>>>
>>> I tried to use df.withColumn but I am getting below exception.
>>>
>>> scala> df.withColumn("row",rowNumber)
>>> org.apache.spark.sql.AnalysisException: unresolved operator 'Project [line#2326,'row_number() AS row#2327];
>>> at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>>> at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>>> at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:174)
>>> at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
>>>
>>> Also, is it possible to add a column from one dataframe to another?
>>> something like
>>>
>>> scala> df.withColumn("line2",df2("line"))
>>>
>>> org.apache.spark.sql.AnalysisException: resolved attribute(s) line#2330 missing from line#2326 in operator !Project [line#2326,line#2330 AS line2#2331];
>>>
>>> 
>>>
>>> Thanks and Regards,
>>> Vishnu Viswanath
>>> *www.vishnuviswanath.com <http://www.vishnuviswanath.com>*
>>>
>>
>>
>>
>> --
>> Best Regards
>>
>> Jeff Zhang
>>
>
>
>
> --
> Thanks and Regards,
> Vishnu Viswanath
> +1 309 550 2311
> *www.vishnuviswanath.com <http://www.vishnuviswanath.com>*
>
