You can use analytic functions in Spark SQL.
Something like: select * from (select id, row_number() over (partition by id
order by timestamp) as rn from root) where rn = 1
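The same query shape can be tried locally. This is only a sketch on made-up toy data: SQLite (3.25+) supports the same window-function syntax, so it stands in here for the actual Spark table, which I don't have.

```python
# Toy table mirroring the thread's "root" schema (id, timestamp).
# SQLite is used only to illustrate the shape of the Spark SQL query.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE root (id INTEGER, timestamp INTEGER)")
conn.executemany(
    "INSERT INTO root VALUES (?, ?)",
    [(1, 100), (1, 50), (2, 200), (2, 150)],
)

# Rank rows within each id by timestamp and keep only rank 1,
# i.e. the earliest record per id.
rows = conn.execute("""
    SELECT id, timestamp FROM (
        SELECT id, timestamp,
               row_number() OVER (PARTITION BY id ORDER BY timestamp) AS rn
        FROM root
    ) WHERE rn = 1
""").fetchall()
print(sorted(rows))  # [(1, 50), (2, 150)]
```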
On Mon, Dec 17, 2018 at 4:03 PM Nikhil Goyal wrote:
> Hi guys,
>
> I have a dataframe of type Record (id: Long, timestamp:
Untested, but something like the below should work:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

(record
 .withColumn('ts_rank',
             F.dense_rank().over(Window.partitionBy('id').orderBy('timestamp')))
 .filter(F.col('ts_rank') == 1)
 .drop('ts_rank'))
On Mon, Dec 17,
Hi guys,
I have a dataframe of type Record (id: Long, timestamp: Long, isValid:
Boolean, other metrics)
Schema looks like this:
root
|-- id: long (nullable = true)
|-- timestamp: long (nullable = true)
|-- isValid: boolean (nullable = true)
I need to find the earliest valid record.
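For reference, the intended semantics can be sketched in plain Python on toy data. This assumes "earliest valid record" means, per id, the row with the smallest timestamp among rows where isValid is true (the thread's replies all partition by id):

```python
# Toy records mirroring the schema: (id, timestamp, isValid).
# Assumption: "earliest valid" = smallest timestamp per id where isValid.
records = [
    (1, 100, True),
    (1, 50, False),
    (1, 75, True),
    (2, 200, True),
]

earliest = {}
for rec_id, ts, is_valid in records:
    if not is_valid:
        continue  # skip invalid records entirely
    # Keep this row if it is the first, or earlier than the current best.
    if rec_id not in earliest or ts < earliest[rec_id][1]:
        earliest[rec_id] = (rec_id, ts, is_valid)

print(sorted(earliest.values()))  # [(1, 75, True), (2, 200, True)]
```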
Hi All,
I have data in MongoDB (a few TBs) which I want to migrate to HDFS to do
complex query analysis on it. The queries are AND queries involving
multiple fields.
So my question is: in which format should I store the data in HDFS so
that processing will be fast for such kinds of queries?
Can you provide an example of an AND query? If you do just look-ups you
should try HBase/Phoenix; otherwise you can try ORC with storage indexes
and/or compression, but this depends on what your queries look like.
On Wed, Jul 22, 2015 at 14:48, Jeetendra Gangele gangele...@gmail.com
wrote:
Hi
I do not think you can put all your queries into the row key without
duplicating the data for each query. However, that would be more of a last
resort.
Have you checked out Phoenix for HBase? It might suit your needs. It
makes things much simpler, because it provides SQL on top of HBase.
Nevertheless,
The queries will be something like this:
1. how many users visited a 1 BHK flat in the last hour in a given area
2. how many visitors for flats in a given area
3. list all users who bought a given property in the last 30 days
Further, it may get more complex, involving multiple parameters in my query.
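As an untested sketch of query 1, using SQLite only to show the shape of such an AND query; the table and column names (visits, user_id, flat_type, area, visit_ts) are made up for illustration and are not from the original thread:

```python
# Hypothetical visits table; all names here are illustrative assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE visits (user_id INTEGER, flat_type TEXT, area TEXT, visit_ts INTEGER)"
)
now = 1_000_000  # fixed "current time" so the example is deterministic
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?, ?)",
    [
        (1, "1BHK", "area51", now - 600),   # within the last hour
        (2, "1BHK", "area51", now - 7200),  # too old
        (3, "2BHK", "area51", now - 300),   # wrong flat type
        (1, "1BHK", "area51", now - 120),   # same user again
    ],
)

# Query 1: distinct users who visited a 1 BHK flat in a given area in the
# last hour -- an AND query over three fields.
count = conn.execute(
    """
    SELECT COUNT(DISTINCT user_id) FROM visits
    WHERE flat_type = '1BHK' AND area = 'area51' AND visit_ts >= ?
    """,
    (now - 3600,),
).fetchone()[0]
print(count)  # 1
```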
The
Parquet.

Mohammed
From: Jeetendra Gangele [mailto:gangele...@gmail.com]
Sent: Wednesday, July 22, 2015 5:48 AM
To: user
Subject: Need help in SparkSQL