Re: How to get recent value in spark dataframe

2016-12-20 Thread Divya Gehlot
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-windows.html
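
That page covers Spark SQL window functions; a minimal sketch of that approach for this thread's example (untested; assumes Spark 2.x, and that "recent" means the most recent flag=1 price at a strictly earlier date within the same id):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

object RecentFlagValue {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("recent-flag").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq(
      ("a", 0, 100, 2015),
      ("a", 0, 50, 2015),
      ("a", 1, 200, 2014),
      ("a", 1, 300, 2013),
      ("a", 0, 400, 2012)
    ).toDF("id", "flag", "price", "date")

    // For each flag=0 row, scan the strictly later rows in descending-date
    // order (i.e. rows with earlier dates) and take the first flag=1 price
    // encountered -- the most recent one before this row.
    val w = Window.partitionBy("id")
      .orderBy(col("date").desc)
      .rowsBetween(1, Window.unboundedFollowing)

    val result = df.withColumn("new_column",
      when(col("flag") === 0,
        first(when(col("flag") === 1, col("price")), ignoreNulls = true).over(w)))

    result.show()
    spark.stop()
  }
}
```

With the example data this should yield 200 for the two 2015 rows and null elsewhere; note that the ordering between rows sharing the same date is not deterministic, so a tie-breaking column may be needed in practice.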

Hope this helps


Thanks,
Divya



Re: How to get recent value in spark dataframe

2016-12-19 Thread ayan guha
You have 2 parts to it:

1. Do a subquery where, for each primary key, you derive the latest flag=1
record. Ensure you get exactly one record per primary key value. Here you
can use rank() over (partition by primary key order by year desc).

2. Join your original dataset with the above on the primary key. If the year
is higher than that of the latest flag=1 record, take its value; otherwise
mark it null.

Primary keys with no flag=1 records won't show up in set 1 above, so if you
still want them in the result, adjust step 1 accordingly.
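
A sketch of those two steps in Scala (untested; assumes `df` is the frame from the question, with columns id, flag, price, date):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Step 1: latest flag=1 record per id -- exactly one row per key.
val latest = df.filter(col("flag") === 1)
  .withColumn("rk", rank().over(Window.partitionBy("id").orderBy(col("date").desc)))
  .filter(col("rk") === 1)
  .select(col("id"), col("price").as("latest_price"), col("date").as("latest_date"))

// Step 2: left join back (so ids with no flag=1 rows survive), then take the
// latest price only where this row's year is higher than that record's year.
val result = df.join(latest, Seq("id"), "left")
  .withColumn("new_column",
    when(col("flag") === 0 && col("date") > col("latest_date"), col("latest_price")))
  .drop("latest_price", "latest_date")
```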

Best
Ayan


Re: How to get recent value in spark dataframe

2016-12-18 Thread Richard Xin
I am not sure I understood your logic, but it seems to me that you could take
a look at Hive's Lead/Lag functions.
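
In Spark SQL these are exposed as the lag/lead window functions; a minimal illustration only (it does not by itself handle the flag=0/flag=1 conditional logic, and assumes the question's column names):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// lag looks at the previous row and lead at the next row,
// within the window's partition and ordering.
val byDate = Window.partitionBy("id").orderBy(col("date"))

val withNeighbours = df
  .withColumn("prev_price", lag(col("price"), 1).over(byDate))
  .withColumn("next_price", lead(col("price"), 1).over(byDate))
```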


Re: How to get recent value in spark dataframe

2016-12-18 Thread Milin korath
Thanks, I tried with a left outer join. My dataset has around 400M records
and a lot of shuffling is happening. Is there any other workaround apart from
a join? I tried to use a window function but am not getting a proper solution.


Thanks



How to get recent value in spark dataframe

2016-12-18 Thread milinkorath
I have a spark data frame with the following structure:

  id  flag  price  date
  a   0     100    2015
  a   0     50     2015
  a   1     200    2014
  a   1     300    2013
  a   0     400    2012

I need to create a data frame where each flag 0 row is updated with the most
recent flag 1 price:

  id  flag  price  date  new_column
  a   0     100    2015  200
  a   0     50     2015  200
  a   1     200    2014  null
  a   1     300    2013  null
  a   0     400    2012  null

We have 2 rows with flag=0. For the first flag=0 row (2015) there are 2
earlier flag=1 values (200 from 2014 and 300 from 2013), and I take the most
recent one, 200. The last row (2012) has no earlier flag=1 row, so it is
updated with null.

I found a solution with a left join, but my dataset has around 400M records
and the join causes a lot of shuffling. Is there any better way to find the
recent value?


Looking for a solution using scala. Any help would be appreciated.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-get-recent-value-in-spark-dataframe-tp28230.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: How to get recent value in spark dataframe

2016-12-16 Thread Michael Armbrust
Oh and to get the null for missing years, you'd need to do an outer join
with a table containing all of the years you are interested in.



Re: How to get recent value in spark dataframe

2016-12-16 Thread Michael Armbrust
Are you looking for argmax? Here is an example.
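
The linked example is not preserved in the archive, but the usual DataFrame argmax pattern is a max over a struct, which orders by the struct's first field (a sketch using the question's column names):

```scala
import org.apache.spark.sql.functions._

// max over a struct compares by the first field (date), so this returns
// the price belonging to the most recent flag=1 row for each id.
val latestFlag1 = df.filter(col("flag") === 1)
  .groupBy("id")
  .agg(max(struct(col("date"), col("price"))).as("latest"))
  .select(col("id"), col("latest.date").as("date"), col("latest.price").as("price"))
```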



Re: How to get recent value in spark dataframe

2016-12-16 Thread vaquar khan
Not sure about your 0 and 1 flag logic, but you can orderBy the data by date
and take the first value.

Regards,
Vaquar khan
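
A sketch of that suggestion (note it yields a single most-recent row, not the per-row new_column the question asks for):

```scala
import org.apache.spark.sql.functions._

// Most recent flag=1 record overall: order by date descending, take the first row.
val mostRecent = df.filter(col("flag") === 1).orderBy(col("date").desc).first()
```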




-- 
Regards,
Vaquar Khan
+1 -224-436-0783

IT Architect / Lead Consultant
Greater Chicago


How to get recent value in spark dataframe

2016-12-14 Thread Milin korath
Hi

I have a spark data frame with the following structure:

  id  flag  price  date
  a   0     100    2015
  a   0     50     2015
  a   1     200    2014
  a   1     300    2013
  a   0     400    2012

I need to create a data frame where each flag 0 row is updated with the most
recent flag 1 price:

  id  flag  price  date  new_column
  a   0     100    2015  200
  a   0     50     2015  200
  a   1     200    2014  null
  a   1     300    2013  null
  a   0     400    2012  null

We have 2 rows with flag=0. For the first flag=0 row (2015) there are 2
earlier flag=1 values (200 from 2014 and 300 from 2013), and I take the most
recent one, 200. The last row (2012) has no earlier flag=1 row, so it is
updated with null.

Looking for a solution using scala. Any help would be appreciated.

Thanks
Milin