Re: How to deal with context dependent computing?

2018-08-27 Thread devjyoti patra
Hi Junfeng,

You should be able to do this with the window aggregation functions lead or
lag:
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-functions.html#lead
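A minimal sketch of that approach (spark-shell style, assuming Spark 2.2+ for
to_timestamp; the "event"/"ts" column names and the timestamp format are
assumptions based on the sample data quoted below):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

// Sample data from the thread; column names are only for illustration.
val events = Seq(
  ("login",  "2018/8/27 10:00"), ("logout", "2018/8/27 10:05"),
  ("login",  "2018/8/27 10:08"), ("logout", "2018/8/27 10:15"),
  ("login",  "2018/8/27 11:08"), ("logout", "2018/8/27 11:32")
).toDF("event", "ts")
  .withColumn("ts", to_timestamp($"ts", "yyyy/M/d HH:mm"))

// A single time-ordered window; with real data you would normally add
// .partitionBy(<user id>) so the work is spread across executors.
val w = Window.orderBy("ts")

val durations = events
  .withColumn("prev_event", lag($"event", 1).over(w))
  .withColumn("prev_ts", lag($"ts", 1).over(w))
  // keep only logout rows whose previous row is a login
  .filter($"event" === "logout" && $"prev_event" === "login")
  .withColumn("duration_min",
    (unix_timestamp($"ts") - unix_timestamp($"prev_ts")) / 60)

durations.select("prev_ts", "ts", "duration_min").show()
// Expected for the sample: 5.0, 7.0 and 24.0 (minutes)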

Thanks,
Dev

On Mon, Aug 27, 2018 at 7:08 AM JF Chen  wrote:

> Thanks Sonal.
> For example, I have data as following:
> login 2018/8/27 10:00
> logout 2018/8/27 10:05
> login 2018/8/27 10:08
> logout 2018/8/27 10:15
> login 2018/8/27 11:08
> logout 2018/8/27 11:32
>
> Now I want to calculate the time between each login and logout. For
> example, I should get 5 min, 7 min, and 24 min from the above sample data.
> I know I can calculate it with foreach, but then it seems all the data runs
> on the Spark driver node rather than on multiple executors.
> Is there a good way to solve this problem? Thanks!
>
> Regards,
> Junfeng Chen
>
>
> On Thu, Aug 23, 2018 at 6:15 PM Sonal Goyal  wrote:
>
>> Hi Junfeng,
>>
>> Can you please show by means of an example what you are trying to
>> achieve?
>>
>> Thanks,
>> Sonal
>> Nube Technologies 
>>
>> 
>>
>>
>>
>> On Thu, Aug 23, 2018 at 8:22 AM, JF Chen  wrote:
>>
>>> For example, I have some data with timestamps marked as category A or B,
>>> ordered by time. Now I want to calculate each duration from A to B. In a
>>> normal program, I can use a flag to record whether the previous record is
>>> A or B, and then calculate the duration. But how do I do this with a Spark
>>> DataFrame?
>>>
>>> Thanks!
>>>
>>> Regards,
>>> Junfeng Chen
>>>
>>
>>

-- 
To achieve, you need thought. You have to know what you are doing and
that's real power.


Re: Fastest way to drop useless columns

2018-05-31 Thread devjyoti patra
One thing that we do on our datasets is:
1. Take 'n' random samples of equal size.
2. Check whether the distribution is heavily skewed for one key in your
samples. The way we define "heavy skewness" is: the mean is more than one
standard deviation away from the median.

In your case, you can drop this column.
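
Not the skewness check itself, but for the single-distinct-value case a rough
sketch along the lines of the approxCountDistinct idea in the quoted mail below,
combined with sampling (assumes Spark 2.1+, where it is exposed as
approx_count_distinct; the 1% sample fraction and the names are illustrative):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{approx_count_distinct, col}

// Sketch: one aggregation pass that estimates the distinct count of every
// column, then drops the columns that appear to hold a single value.
def dropConstantColumns(df: DataFrame): DataFrame = {
  val counts = df
    .select(df.columns.map(c => approx_count_distinct(col(c)).alias(c)): _*)
    .head()
  val constantCols = df.columns.filter(c => counts.getAs[Long](c) <= 1L)
  df.drop(constantCols: _*)
}

// Cheaper variant in the spirit of the sampling idea above: run the same
// check on a small random sample first, and verify only the surviving
// candidate columns against the full data. The 1% fraction is arbitrary.
def candidateConstantColumns(df: DataFrame, fraction: Double = 0.01): Array[String] = {
  val sample = df.sample(withReplacement = false, fraction)
  val counts = sample
    .select(sample.columns.map(c => approx_count_distinct(col(c)).alias(c)): _*)
    .head()
  sample.columns.filter(c => counts.getAs[Long](c) <= 1L)
}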

On Thu, 31 May 2018, 14:55 ,  wrote:

> I believe this only works when we need to drop duplicate ROWS.
>
> Here I want to drop columns which contain only one unique value.
>
>
> On 2018-05-31 11:16, Divya Gehlot wrote:
> > you can try the dropDuplicates function
> >
> >
> https://github.com/spirom/LearningSpark/blob/master/src/main/scala/dataframe/DropDuplicates.scala
> >
> > On 31 May 2018 at 16:34,  wrote:
> >
> >> Hi there!
> >>
> >> I have a potentially large dataset (in terms of number of rows and
> >> columns), and I want to find the fastest way to drop some columns that
> >> are useless to me, i.e. columns containing only a single unique value.
> >>
> >> I want to know what you think I could do to accomplish this as fast
> >> as possible using Spark.
> >>
> >> I already have a solution using distinct().count() or
> >> approxCountDistinct(), but they may not be the best choice, as they
> >> require going through all the data, even if the first 2 tested values
> >> for a column are already different (in which case I know that I can
> >> keep the column).
> >>
> >> Thanks for your ideas!
> >>
> >> Julien
> >>
> >>


Re: optimize hive query to move a subset of data from one partition table to another table

2018-02-12 Thread devjyoti patra
Can you try running your query with a static literal for the date filter
(join_date >= SOME 2 MONTH OLD DATE)? I cannot think of any reason why this
query should create more than 60 tasks.
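
For illustration only, a sketch of what that could look like from Spark, with
literal values on the partition columns (year, month, day are assumed from the
thread to be numeric partition columns; the CREATE TABLE ... AS SELECT shape is
kept from the original mail below):

// Sketch: filter on the partition columns with literal values so that
// partition pruning kicks in, instead of filtering on join_date, which
// forces a scan of the whole table.
val cutoff = java.time.LocalDate.now().minusMonths(2)

spark.sql(s"""
  CREATE TABLE emp AS
  SELECT * FROM emp_full
  WHERE  year  > ${cutoff.getYear}
     OR (year  = ${cutoff.getYear} AND month >  ${cutoff.getMonthValue})
     OR (year  = ${cutoff.getYear} AND month =  ${cutoff.getMonthValue}
                                   AND day   >= ${cutoff.getDayOfMonth})
""")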




On 12 Feb 2018 6:26 am, "amit kumar singh"  wrote:

Hi

create table emp as select * from emp_full where join_date >= date_sub(join_date, 2)

I am trying to select from one table and insert into another table.

I need a way to select the last 2 months of data every time.

The table is partitioned on year, month, day.

On Sun, Feb 11, 2018 at 4:30 PM, Richard Qiao 
wrote:

> Would you mind share your code with us to analyze?
>
> > On Feb 10, 2018, at 10:18 AM, amit kumar singh 
> wrote:
> >
> > Hi Team,
> >
> > We have a Hive external table which has 50 TB of data partitioned on
> > year, month, day.
> >
> > I want to move the last 2 months of data into another table.
> >
> > When I try to do this through Spark, more than 120k tasks are getting
> > created.
> >
> > What is the best way to do this?
> >
> > thanks
> > Rohit
>
>


Re: How do I deal with ever growing application log

2017-03-05 Thread devjyoti patra
Timothy, why are you writing application logs to HDFS? In case you want to
analyze these logs later, you can write to local storage on your slave
nodes and later rotate those files to a suitable location. If they are only
going to be useful for debugging the application, you can always remove them
periodically.
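
If the goal is simply to keep driver/executor logs bounded on local disk, one
option is a rolling appender in Spark's log4j configuration (Spark at this
point ships with log4j 1.2 and conf/log4j.properties). A minimal sketch; the
path, file size and backup count are placeholders:

# Route application logging to a size-capped rolling file on local disk
# instead of letting a single log grow without bound.
log4j.rootCategory=INFO, rolling

log4j.appender.rolling=org.apache.log4j.RollingFileAppender
log4j.appender.rolling.File=/var/log/spark/myapp/app.log
log4j.appender.rolling.MaxFileSize=128MB
log4j.appender.rolling.MaxBackupIndex=10
log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
log4j.appender.rolling.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n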
Thanks,
Dev

On Mar 6, 2017 9:48 AM, "Timothy Chan"  wrote:

> I'm running a single worker EMR cluster for a Structured Streaming job.
> How do I deal with my application log filling up HDFS?
>
> /var/log/spark/apps/application_1487823545416_0021_1.inprogress
>
> is currently 21.8 GB
>
>