Re: hive query to select top 10 product of each subcategory and select most recent product info

Mohit Durgapal Fri, 11 Apr 2014 01:23:11 -0700

Hi Nitin,

The ddl is as follows:

CREATE EXTERNAL TABLE user_logs(
users_iduuid    string,
siteid  int,
site_catid      int,
stext   string,
catg    int,   // CATEGORY
scatg   int, // SUBCATEGORY
catgname        string,
scatgname       string,
brand   string,    // PRODUCT BRAND NAME
prrange string,
curr    int,
pname   string, // product name
pid     int,  // product ID
price   string,  //Product Price
prodnbr int,
mrp     string,  //MRP
prURL string, //Product url
prIMGURL string, //Product Image URL
opr     string,
oid     string,
txsucc  string,
last_updated    string //timestamp
)

ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','

I am looking for an output where I have top 10 products from each
subcategory(on the basis of count) with all their information like product
name, price, url, imgurl. Any there will be multiple entries for the same
products (pids) within the same subcategory, In that case I have to pick
the product info that is latest(by last_updated field).

I have written a query but it is considering a multiple entries of  product
as  different products If price or any other info changes for that product.

select siteid,site_catid,catg,scatg,COLLECT_SET(PRODDESC) from
(
  select
PRODDESC,displays,siteid,site_catid,catg,scatg,rank(siteid,site_catid,catg,scatg)
as row_number from
  (
  select count(*) as
displays,siteid,site_catid,catg,scatg,CONCAT('{','pname:',pname,',price:',price,',','mrp:',mrp,',curr:',curr,',pid:',pid,'}')
as PRODDESC from
  user_logs group by siteid,site_catid,catg,scatg,pid,pname,price,mrp,curr
order by siteid,site_catid,catg,scatg,displays desc
  ) A
  ) B
WHERE row_number < 10
group by siteid,site_catid,catg,scatg
order by siteid,site_catid,catg,scatg desc;

The rank() method simply helps in fetching top 10 within a subcategory.
Every time it encounters the same combination of
siteid,site_catid,catg,scatg it increments row_number goes till 10.

The problem above is that I am forced to put product info such as
"pname,price,mrp,...." in the group by clause otherwise I will not be able
to get  that information  in "select". Therefore, even if someone changes
just the price a product(this happens very frequently) it is considered a
different product by the above query. And that is something I don't want.

I hope I have made it a little more clear?  Thanks for your reply :)

On Fri, Apr 11, 2014 at 12:45 PM, Nitin Pawar <[email protected]>wrote:

> may be you can share your table ddl, your query and what output r u
> looking for
>
>
> On Fri, Apr 11, 2014 at 12:26 PM, Mohit Durgapal 
> <[email protected]>wrote:
>
>> I have a hive table partitioned by dates. It contains ecomm data in the
>> format siteid,sitecatid,catid,subcatgid,pid,pname,pprice,pmrp,pdesc....
>>
>>
>>
>> What I need to do is to run a query on table above in hive for top 10
>> products(count wise) in each sub category. What adds a bit more complexity
>> is that I need all the information of the product. Now when I do group by
>> with only subcatg,pid, I can only select the same fields. But I want all
>> the data for that product coming in the same row as subcatg & prodid like
>> prodname, proddesc,price, mrp,imageurl. And since some information like
>> price & proddesc of a product keep on changing I want to pick the latest
>> column values(according to a date field) for a pid if we are able to do a
>> group by on subcatg,pid.
>>
>>
>> I am not able to find a solution to my problem in hive. Any help would be
>> much appreciated.
>>
>>
>> Regards
>> Mohit
>>
>
>
>
> --
> Nitin Pawar
>

Re: hive query to select top 10 product of each subcategory and select most recent product info

Reply via email to