Re: Need help with SparkSQL Query

2018-12-17 Thread Ramandeep Singh Nanda
You can use analytic (window) functions in Spark SQL.

Something like: select * from (select *, row_number() over (partition by
id order by timestamp) as rn from root where isValid) t where rn = 1
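
As a fuller, untested PySpark sketch (assuming the DataFrame from the
question is called 'record' and gets registered as a temporary view):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
record.createOrReplaceTempView("root")

earliest = spark.sql("""
    SELECT * FROM (
        SELECT *,
               row_number() OVER (PARTITION BY id ORDER BY timestamp) AS rn
        FROM root
        WHERE isValid
    ) ranked
    WHERE rn = 1
""").drop("rn")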

On Mon, Dec 17, 2018 at 4:03 PM Nikhil Goyal  wrote:

> Hi guys,
>
> I have a dataframe of type Record (id: Long, timestamp: Long, isValid:
> Boolean,  other metrics)
>
> Schema looks like this:
> root
>  |-- id: long (nullable = true)
>  |-- timestamp: long (nullable = true)
>  |-- isValid: boolean (nullable = true)
> .
>
> I need to find the earliest valid record per id. In the RDD world I can
> groupBy 'id' and find the earliest one, but I am not sure how to do it in
> SQL. Since I am doing this in PySpark I cannot really use the Dataset API
> for this.
>
> One thing I can do is groupBy 'id', find the earliest timestamp available,
> and then join with the original dataframe to get the right record (all the
> metrics).
>
> Or I can create a single column with all the records and then implement a
> UDAF in Scala and use it in PySpark.
>
> Neither solution seems straightforward. Is there a simpler solution?
>
> Thanks
> Nikhil
>


-- 
Regards,
Ramandeep Singh
http://orastack.com
+13474792296
ramannan...@gmail.com


Re: Need help with SparkSQL Query

2018-12-17 Thread Patrick McCarthy
Untested, but something like the below should work:

from pyspark.sql import functions as F
from pyspark.sql import window as W

(record
 .withColumn('ts_rank',
             F.dense_rank().over(W.Window.partitionBy('id').orderBy('timestamp')))
 .filter(F.col('ts_rank') == 1)
 .drop('ts_rank')
)
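
Note that dense_rank() keeps all rows that tie on the earliest timestamp,
while row_number() guarantees exactly one row per id. Since the question
asks for the earliest *valid* record, a variant with an isValid filter
(equally untested) could be:

from pyspark.sql import functions as F
from pyspark.sql import Window

w = Window.partitionBy('id').orderBy('timestamp')

earliest = (record
    .filter(F.col('isValid'))                  # keep only valid records
    .withColumn('rn', F.row_number().over(w))  # rn == 1 marks the earliest per id
    .filter(F.col('rn') == 1)
    .drop('rn'))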


On Mon, Dec 17, 2018 at 4:04 PM Nikhil Goyal  wrote:

> Hi guys,
>
> I have a dataframe of type Record (id: Long, timestamp: Long, isValid:
> Boolean,  other metrics)
>
> Schema looks like this:
> root
>  |-- id: long (nullable = true)
>  |-- timestamp: long (nullable = true)
>  |-- isValid: boolean (nullable = true)
> .
>
> I need to find the earliest valid record per id. In the RDD world I can
> groupBy 'id' and find the earliest one, but I am not sure how to do it in
> SQL. Since I am doing this in PySpark I cannot really use the Dataset API
> for this.
>
> One thing I can do is groupBy 'id', find the earliest timestamp available,
> and then join with the original dataframe to get the right record (all the
> metrics).
>
> Or I can create a single column with all the records and then implement a
> UDAF in Scala and use it in PySpark.
>
> Neither solution seems straightforward. Is there a simpler solution?
>
> Thanks
> Nikhil
>


Need help with SparkSQL Query

2018-12-17 Thread Nikhil Goyal
Hi guys,

I have a dataframe of type Record (id: Long, timestamp: Long, isValid:
Boolean,  other metrics)

Schema looks like this:
root
 |-- id: long (nullable = true)
 |-- timestamp: long (nullable = true)
 |-- isValid: boolean (nullable = true)
.

I need to find the earliest valid record per id. In the RDD world I can
groupBy 'id' and find the earliest one, but I am not sure how to do it in
SQL. Since I am doing this in PySpark I cannot really use the Dataset API
for this.

One thing I can do is groupBy 'id', find the earliest timestamp available,
and then join with the original dataframe to get the right record (all the
metrics).
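
A minimal, untested sketch of that join-based approach (with the
DataFrame called 'record'):

from pyspark.sql import functions as F

valid = record.filter(F.col('isValid'))
earliest_ts = valid.groupBy('id').agg(F.min('timestamp').alias('timestamp'))
result = valid.join(earliest_ts, on=['id', 'timestamp'])  # keeps all metric columns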

Or I can create a single column with all the records and then implement a
UDAF in Scala and use it in PySpark.

Neither solution seems straightforward. Is there a simpler solution?

Thanks
Nikhil


Need help in SparkSQL

2015-07-22 Thread Jeetendra Gangele
Hi all,

I have data in MongoDB (a few TBs) which I want to migrate to HDFS to do
complex query analysis on. The queries are things like AND queries
involving multiple fields.

So my question is: in which format should I store the data in HDFS so that
processing will be fast for such queries?


Regards
Jeetendra


Re: Need help in SparkSQL

2015-07-22 Thread Jörn Franke
Can you provide an example of an AND query? If you only do look-ups, you
should try HBase/Phoenix; otherwise you can try ORC with storage indexes
and/or compression, but it depends on what your queries look like.
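
For example, writing the data out as ORC from Spark might look like this
(untested sketch; the path is hypothetical, and 'df' is assumed to be the
DataFrame holding the migrated data):

df.write.format("orc").save("hdfs:///data/listings_orc")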

On Wed, Jul 22, 2015 at 14:48, Jeetendra Gangele gangele...@gmail.com
wrote:

 Hi all,

 I have data in MongoDB (a few TBs) which I want to migrate to HDFS to do
 complex query analysis on. The queries are things like AND queries
 involving multiple fields.

 So my question is: in which format should I store the data in HDFS so
 that processing will be fast for such queries?


 Regards
 Jeetendra




Re: Need help in SparkSQL

2015-07-22 Thread Jörn Franke
I do not think you can put all your queries into the row key without
duplicating the data for each query. However, that would be more of a last
resort.

Have you checked out Phoenix for HBase? It might suit your needs. It makes
things much simpler because it provides SQL on top of HBase.

Nevertheless, Hive could also be a viable alternative, depending on how
often you run queries, etc.

On Thu, Jul 23, 2015 at 7:14, Jeetendra Gangele gangele...@gmail.com
wrote:

 The queries will be something like this:

 1. How many users visited a 1 BHK flat in the last hour in a given area?
 2. How many visitors were there for flats in a given area?
 3. List all users who bought a given property in the last 30 days.

 It may get more complex, involving multiple parameters in a query.

 The problem with HBase is designing a row key that serves this data
 efficiently.

 Since I have multiple fields to query on, HBase may not be a good choice?

 I don't want to iterate over the result set that HBase returns to build
 the result, because that will kill performance.

 On 23 July 2015 at 01:02, Jörn Franke jornfra...@gmail.com wrote:

 Can you provide an example of an AND query? If you only do look-ups, you
 should try HBase/Phoenix; otherwise you can try ORC with storage indexes
 and/or compression, but it depends on what your queries look like.

 On Wed, Jul 22, 2015 at 14:48, Jeetendra Gangele gangele...@gmail.com
 wrote:

 Hi all,

 I have data in MongoDB (a few TBs) which I want to migrate to HDFS to do
 complex query analysis on. The queries are things like AND queries
 involving multiple fields.

 So my question is: in which format should I store the data in HDFS so
 that processing will be fast for such queries?


 Regards
 Jeetendra




 --
 Hi,

 Please find my resume attached. I have around 7 years of total work
 experience. I worked for Amazon and Expedia in previous assignments, and
 I am currently working with a start-up technology company called
 InsideView in Hyderabad.

 Regards
 Jeetendra



Re: Need help in SparkSQL

2015-07-22 Thread Jeetendra Gangele
The queries will be something like this:

1. How many users visited a 1 BHK flat in the last hour in a given area?
2. How many visitors were there for flats in a given area?
3. List all users who bought a given property in the last 30 days.

It may get more complex, involving multiple parameters in a query.

The problem with HBase is designing a row key that serves this data
efficiently.

Since I have multiple fields to query on, HBase may not be a good choice?

I don't want to iterate over the result set that HBase returns to build
the result, because that will kill performance.
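
For illustration, one of those AND queries in Spark SQL might look like
this (table and column names are hypothetical):

SELECT COUNT(DISTINCT user_id)
FROM visits                               -- hypothetical table
WHERE flat_type = '1 BHK'
  AND area = 'some-area'
  AND visit_ts >= unix_timestamp() - 3600 -- last hour, epoch seconds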

On 23 July 2015 at 01:02, Jörn Franke jornfra...@gmail.com wrote:

 Can you provide an example of an AND query? If you only do look-ups, you
 should try HBase/Phoenix; otherwise you can try ORC with storage indexes
 and/or compression, but it depends on what your queries look like.

 On Wed, Jul 22, 2015 at 14:48, Jeetendra Gangele gangele...@gmail.com
 wrote:

 Hi all,

 I have data in MongoDB (a few TBs) which I want to migrate to HDFS to do
 complex query analysis on. The queries are things like AND queries
 involving multiple fields.

 So my question is: in which format should I store the data in HDFS so
 that processing will be fast for such queries?


 Regards
 Jeetendra




-- 
Hi,

Please find my resume attached. I have around 7 years of total work
experience. I worked for Amazon and Expedia in previous assignments, and I
am currently working with a start-up technology company called InsideView
in Hyderabad.

Regards
Jeetendra


RE: Need help in SparkSQL

2015-07-22 Thread Mohammed Guller
Parquet
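
For example (untested sketch; assumes the MongoDB data has already been
loaded into a DataFrame 'df', and the path and column names are
hypothetical):

df.write.parquet("hdfs:///data/listings")

listings = sqlContext.read.parquet("hdfs:///data/listings")
listings.registerTempTable("listings")
sqlContext.sql("SELECT COUNT(*) FROM listings "
               "WHERE area = 'some-area' AND flat_type = '1 BHK'")

Parquet's columnar layout and predicate pushdown make multi-field AND
filters like this fast.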

Mohammed

From: Jeetendra Gangele [mailto:gangele...@gmail.com]
Sent: Wednesday, July 22, 2015 5:48 AM
To: user
Subject: Need help in SparkSQL

Hi all,

I have data in MongoDB (a few TBs) which I want to migrate to HDFS to do
complex query analysis on. The queries are things like AND queries
involving multiple fields.

So my question is: in which format should I store the data in HDFS so that
processing will be fast for such queries?


Regards
Jeetendra