Thanks Ted. It looks like I cannot use row_number then. I tried to run a sample window function and got the error below:

org.apache.spark.sql.AnalysisException: Could not resolve window function 'avg'. Note that, using window functions currently requires a HiveContext;
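[Editorial sketch, not from the thread: the error above is what a plain SQLContext raises for any window function on Spark 1.x; the same query resolves when the DataFrame comes from a HiveContext. A minimal sketch, assuming Spark 1.5 and a `spark-shell` with `sc` in scope; `lines.txt` is a hypothetical input path:]

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rowNumber
import org.apache.spark.sql.hive.HiveContext

// Window functions on Spark 1.5 resolve only through a HiveContext,
// not a plain SQLContext.
val hiveContext = new HiveContext(sc)
import hiveContext.implicits._

// "lines.txt" is a hypothetical input file for illustration.
val df = sc.textFile("lines.txt").toDF("line")

// No partitionBy gives a single global window; note this pulls all
// rows into one partition, so it does not scale to large data.
val ws = Window.orderBy("line")
val withRow = df.withColumn("row", rowNumber().over(ws))
```

[On Spark 1.5 the Scala function is still `rowNumber`; `row_number` is the 1.6 rename Ted quotes below.]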
On Wed, Nov 25, 2015 at 8:28 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> Vishnu:
> rowNumber (deprecated, replaced with row_number) is a window function.
>
>    * Window function: returns a sequential number starting at 1 within a
>    * window partition.
>    *
>    * @group window_funcs
>    * @since 1.6.0
>    */
>   def row_number(): Column = withExpr { UnresolvedWindowFunction("row_number", Nil) }
>
> Sample usage:
>
> df = sqlContext.range(1 << 20)
> df2 = df.select((df.id % 1000).alias("A"), (df.id / 1000).alias("B"))
> ws = Window.partitionBy(df2.A).orderBy(df2.B)
> df3 = df2.select("A", "B", rowNumber().over(ws).alias("rn")).filter("rn <= 3")
>
> Cheers
>
> On Wed, Nov 25, 2015 at 5:08 PM, Vishnu Viswanath <vishnu.viswanat...@gmail.com> wrote:
>
>> Thanks Jeff,
>>
>> rowNumber is a function in org.apache.spark.sql.functions, see
>> <https://spark.apache.org/docs/1.4.0/api/scala/index.html#org.apache.spark.sql.functions$>
>>
>> I will try to use monotonicallyIncreasingId and see if it works.
>>
>> "You'd better use a join to correlate 2 data frames": Yes, that is why I
>> thought of adding a row number to both DataFrames and joining them on the
>> row number. Is there a better way of doing this? Both DataFrames will
>> always have the same number of rows, but are not related by any column I
>> could join on.
>>
>> Thanks and Regards,
>> Vishnu Viswanath
>>
>> On Wed, Nov 25, 2015 at 6:43 PM, Jeff Zhang <zjf...@gmail.com> wrote:
>>
>>> "I tried to use df.withColumn but I am getting below exception."
>>>
>>> What is rowNumber here? A UDF? You can use monotonicallyIncreasingId
>>> for generating an id.
>>>
>>> "Also, is it possible to add a column from one dataframe to another?"
>>>
>>> You can't, because how could you add a column from one dataframe to
>>> another when they have different numbers of rows? You'd better use a
>>> join to correlate the two data frames.
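[Editorial sketch, not from the thread, addressing Vishnu's "is there a better way" question: monotonicallyIncreasingId is only guaranteed to be increasing, not consecutive — its values depend on partitioning, so they cannot safely align two different DataFrames. RDD.zipWithIndex does assign consecutive 0-based indices, which can serve as a join key when both DataFrames always have the same number of rows. A sketch, assuming a Spark 1.5 shell with `sqlContext`, and two DataFrames `df1` and `df2` standing in for Vishnu's pair:]

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructType

// Attach a consecutive positional index to each row of both DataFrames.
val indexed1 = df1.rdd.zipWithIndex.map { case (row, idx) => (idx, row) }
val indexed2 = df2.rdd.zipWithIndex.map { case (row, idx) => (idx, row) }

// Join on the positional index and concatenate the matched rows.
val joined = indexed1.join(indexed2).values.map {
  case (r1, r2) => Row.fromSeq(r1.toSeq ++ r2.toSeq)
}

// Rebuild a DataFrame with the combined schema.
val combined = sqlContext.createDataFrame(
  joined, StructType(df1.schema.fields ++ df2.schema.fields))
```

[The join preserves pairing by position regardless of how the rows are partitioned, which is exactly what a shared row-number column would give.]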
>>> On Thu, Nov 26, 2015 at 6:39 AM, Vishnu Viswanath <vishnu.viswanat...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am trying to add the row number to a Spark DataFrame.
>>>> This is my dataframe:
>>>>
>>>> scala> df.printSchema
>>>> root
>>>>  |-- line: string (nullable = true)
>>>>
>>>> I tried to use df.withColumn but I am getting the exception below.
>>>>
>>>> scala> df.withColumn("row", rowNumber)
>>>> org.apache.spark.sql.AnalysisException: unresolved operator 'Project [line#2326,'row_number() AS row#2327];
>>>>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
>>>>   at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
>>>>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:174)
>>>>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
>>>>
>>>> Also, is it possible to add a column from one dataframe to another? Something like:
>>>>
>>>> scala> df.withColumn("line2", df2("line"))
>>>>
>>>> org.apache.spark.sql.AnalysisException: resolved attribute(s) line#2330 missing from line#2326 in operator !Project [line#2326,line#2330 AS line2#2331];
>>>>
>>>> Thanks and Regards,
>>>> Vishnu Viswanath
>>>> www.vishnuviswanath.com <http://www.vishnuviswanath.com>
>>>
>>> --
>>> Best Regards
>>>
>>> Jeff Zhang
>>
>> --
>> Thanks and Regards,
>> Vishnu Viswanath
>> +1 309 550 2311
>> www.vishnuviswanath.com <http://www.vishnuviswanath.com>

--
Thanks and Regards,
Vishnu Viswanath
www.vishnuviswanath.com <http://www.vishnuviswanath.com>