Re: Need help with SparkSQL Query
You can use analytic window functions in Spark SQL. Something like:

select *
from (
  select *, row_number() over (partition by id order by timestamp) as rn
  from root
  where isValid
)
where rn = 1

On Mon, Dec 17, 2018 at 4:03 PM Nikhil Goyal wrote:
> [quoted question trimmed]

--
Regards,
Ramandeep Singh
http://orastack.com
+13474792296
ramannan...@gmail.com
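For anyone skimming the thread, the semantics of that window query can be sketched in plain Python; the field names follow the schema in the question, while `records` and the sample rows are hypothetical stand-ins for the `root` table:

```python
# Plain-Python sketch of the window query's semantics: among valid rows,
# order by timestamp and keep only the first row seen per id (rn = 1).
def earliest_valid_per_id(records):
    seen = {}
    for rec in sorted(records, key=lambda r: r["timestamp"]):
        if rec["isValid"]:
            seen.setdefault(rec["id"], rec)  # first valid hit per id wins
    return list(seen.values())

rows = [
    {"id": 1, "timestamp": 30, "isValid": True},
    {"id": 1, "timestamp": 10, "isValid": False},  # earliest, but invalid
    {"id": 1, "timestamp": 20, "isValid": True},   # earliest *valid* for id 1
    {"id": 2, "timestamp": 40, "isValid": True},
]
print(earliest_valid_per_id(rows))
```

Note that `where isValid` must sit inside the subquery: filtering validity after ranking could discard an id entirely when its earliest row happens to be invalid.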
Re: Need help with SparkSQL Query
Untested, but something like the below should work:

from pyspark.sql import functions as F
from pyspark.sql import window as W

(record
 .withColumn('ts_rank',
             F.dense_rank().over(W.Window.partitionBy('id').orderBy('timestamp')))
 .filter(F.col('ts_rank') == 1)
 .drop('ts_rank')
)

On Mon, Dec 17, 2018 at 4:04 PM Nikhil Goyal wrote:
> [quoted question trimmed]
Need help with SparkSQL Query
Hi guys,

I have a dataframe of type Record (id: Long, timestamp: Long, isValid: Boolean, other metrics)

Schema looks like this:

root
 |-- id: long (nullable = true)
 |-- timestamp: long (nullable = true)
 |-- isValid: boolean (nullable = true)
 ...

I need to find the earliest valid record per id. In the RDD world I can do groupBy 'id' and find the earliest one, but I am not sure how I can do it in SQL. Since I am doing this in PySpark I cannot really use the Dataset API for this.

One thing I can do is groupBy 'id', find the earliest timestamp available, and then join with the original dataframe to get the right record (all the metrics).

Or I can create a single column with all the records and then implement a UDAF in Scala and use it in PySpark.

Neither solution seems straightforward. Is there a simpler solution to this?

Thanks
Nikhil
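The groupBy-then-join approach described above can be sketched in plain Python to show that it does produce the full records; `records` and the sample rows are hypothetical, and this is only the logic, not Spark code:

```python
# Sketch of "groupBy id, find min timestamp among valid rows, join back".
def earliest_valid_via_join(records):
    valid = [r for r in records if r["isValid"]]
    # step 1: group by id and find the earliest valid timestamp per id
    min_ts = {}
    for r in valid:
        min_ts[r["id"]] = min(min_ts.get(r["id"], r["timestamp"]), r["timestamp"])
    # step 2: "join" back against the valid rows to recover all the metrics
    return [r for r in valid if r["timestamp"] == min_ts[r["id"]]]

rows = [
    {"id": 1, "timestamp": 30, "isValid": True,  "metric": "a"},
    {"id": 1, "timestamp": 10, "isValid": False, "metric": "b"},
    {"id": 1, "timestamp": 20, "isValid": True,  "metric": "c"},
]
print(earliest_valid_via_join(rows))
```

One caveat of the join approach: if two valid rows share the same id and the same minimal timestamp, both survive the join, whereas a row_number-based dedup keeps exactly one.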
Need help in SparkSQL
Hi All,

I have data in MongoDB (a few TBs) which I want to migrate to HDFS so I can run complex analytical queries on it, e.g. AND queries involving multiple fields.

So my question is: in which format should I store the data on HDFS so that processing will be fast for such queries?

Regards
Jeetendra
Re: Need help in SparkSQL
Can you provide an example of an AND query? If you only do look-ups you should try HBase/Phoenix; otherwise you can try ORC with storage indexes and/or compression, but this depends on what your queries look like.

On Wed, 22 Jul 2015 at 14:48, Jeetendra Gangele gangele...@gmail.com wrote:
> [quoted question trimmed]
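To illustrate the "storage index" idea mentioned above: ORC keeps min/max statistics per stripe, so a filter can skip any stripe whose range cannot contain the value. A toy single-column sketch, with made-up data and stripe size:

```python
# Toy model of ORC-style stripe statistics: per-stripe min/max lets a
# filter skip whole stripes without reading their rows.
def build_stripes(values, stripe_size=3):
    stripes = [values[i:i + stripe_size] for i in range(0, len(values), stripe_size)]
    stats = [(min(s), max(s)) for s in stripes]
    return stripes, stats

def filter_eq(stripes, stats, value):
    hits, rows_scanned = [], 0
    for stripe, (lo, hi) in zip(stripes, stats):
        if lo <= value <= hi:            # stripe may contain the value
            rows_scanned += len(stripe)
            hits += [v for v in stripe if v == value]
        # otherwise the stripe is skipped entirely
    return hits, rows_scanned

stripes, stats = build_stripes([1, 2, 3, 10, 11, 12, 20, 21, 22])
print(filter_eq(stripes, stats, 11))  # only the middle stripe is scanned
```

As in real ORC, the skipping only pays off when the data is sorted or clustered on the filtered column; on randomly ordered data every stripe's min/max range tends to overlap the predicate.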
Re: Need help in SparkSQL
I do not think you can put all your queries into the row key without duplicating the data for each query; that would be more of a last resort. Have you checked out Phoenix for HBase? It might suit your needs: it makes things much simpler because it provides SQL on top of HBase. Nevertheless, Hive could also be a viable alternative, depending on how often you run queries, etc.

On Thu, 23 Jul 2015 at 07:14, Jeetendra Gangele gangele...@gmail.com wrote:
> [quoted message trimmed]
Re: Need help in SparkSQL
The queries will be something like:

1. How many users visited a 1 BHK flat in the last hour in a given area?
2. How many visitors were there for flats in a given area?
3. List all users who bought a given property in the last 30 days.

It may get more complex, involving multiple parameters. The problem with HBase is designing a row key that retrieves this data efficiently. Since I have multiple fields to query on, HBase may not be a good choice. I don't want to iterate over the result set that HBase returns and filter it myself, because that will kill performance.

On 23 July 2015 at 01:02, Jörn Franke jornfra...@gmail.com wrote:
> [quoted message trimmed]

Regards
Jeetendra
RE: Need help in SparkSQL
Parquet.

Mohammed

From: Jeetendra Gangele [mailto:gangele...@gmail.com]
Sent: Wednesday, July 22, 2015 5:48 AM
To: user
Subject: Need help in SparkSQL

> [quoted question trimmed]
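A toy illustration of why a columnar format such as Parquet suits multi-field AND queries: each field is stored contiguously, so a filter touches only the columns it references instead of deserializing whole rows. The data and field names below are made up:

```python
# Row layout vs. columnar layout for an AND query over two fields.
rows = [
    {"area": "A", "bhk": 1, "user": "u1"},
    {"area": "B", "bhk": 2, "user": "u2"},
    {"area": "A", "bhk": 2, "user": "u3"},
]

# Columnar layout: one list per field, as in a column-oriented file.
cols = {k: [r[k] for r in rows] for k in rows[0]}

def and_query(cols, f1, v1, f2, v2):
    # Only the two filtered columns are read, no matter how many other
    # fields each record carries (e.g. the "user" column stays untouched).
    return [i for i, (a, b) in enumerate(zip(cols[f1], cols[f2]))
            if a == v1 and b == v2]

print(and_query(cols, "area", "A", "bhk", 2))  # indices of matching rows
```

Real Parquet adds per-row-group min/max statistics and dictionary encoding on top of this, so a Spark filter can also skip entire row groups, much like the ORC storage index discussed earlier in the thread.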