Hi Nitin,
The ddl is as follows:
CREATE EXTERNAL TABLE user_logs(
users_iduuid string,
siteid int,
site_catid int,
stext string,
catg int, // CATEGORY
scatg int, // SUBCATEGORY
catgname string,
scatgname string,
brand string, // PRODUCT BRAND NAME
prrange string,
curr int,
pname string, // product name
pid int, // product ID
price string, //Product Price
prodnbr int,
mrp string, //MRP
prURL string, //Product url
prIMGURL string, //Product Image URL
opr string,
oid string,
txsucc string,
last_updated string //timestamp
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
I am looking for an output where I have top 10 products from each
subcategory(on the basis of count) with all their information like product
name, price, url, imgurl. Any there will be multiple entries for the same
products (pids) within the same subcategory, In that case I have to pick
the product info that is latest(by last_updated field).
I have written a query but it is considering a multiple entries of product
as different products If price or any other info changes for that product.
select siteid,site_catid,catg,scatg,COLLECT_SET(PRODDESC) from
(
select
PRODDESC,displays,siteid,site_catid,catg,scatg,rank(siteid,site_catid,catg,scatg)
as row_number from
(
select count(*) as
displays,siteid,site_catid,catg,scatg,CONCAT('{','pname:',pname,',price:',price,',','mrp:',mrp,',curr:',curr,',pid:',pid,'}')
as PRODDESC from
user_logs group by siteid,site_catid,catg,scatg,pid,pname,price,mrp,curr
order by siteid,site_catid,catg,scatg,displays desc
) A
) B
WHERE row_number < 10
group by siteid,site_catid,catg,scatg
order by siteid,site_catid,catg,scatg desc;
The rank() method simply helps in fetching top 10 within a subcategory.
Every time it encounters the same combination of
siteid,site_catid,catg,scatg it increments row_number goes till 10.
The problem above is that I am forced to put product info such as
"pname,price,mrp,...." in the group by clause otherwise I will not be able
to get that information in "select". Therefore, even if someone changes
just the price a product(this happens very frequently) it is considered a
different product by the above query. And that is something I don't want.
I hope I have made it a little more clear? Thanks for your reply :)
On Fri, Apr 11, 2014 at 12:45 PM, Nitin Pawar <[email protected]>wrote:
> may be you can share your table ddl, your query and what output r u
> looking for
>
>
> On Fri, Apr 11, 2014 at 12:26 PM, Mohit Durgapal
> <[email protected]>wrote:
>
>> I have a hive table partitioned by dates. It contains ecomm data in the
>> format siteid,sitecatid,catid,subcatgid,pid,pname,pprice,pmrp,pdesc....
>>
>>
>>
>> What I need to do is to run a query on table above in hive for top 10
>> products(count wise) in each sub category. What adds a bit more complexity
>> is that I need all the information of the product. Now when I do group by
>> with only subcatg,pid, I can only select the same fields. But I want all
>> the data for that product coming in the same row as subcatg & prodid like
>> prodname, proddesc,price, mrp,imageurl. And since some information like
>> price & proddesc of a product keep on changing I want to pick the latest
>> column values(according to a date field) for a pid if we are able to do a
>> group by on subcatg,pid.
>>
>>
>> I am not able to find a solution to my problem in hive. Any help would be
>> much appreciated.
>>
>>
>> Regards
>> Mohit
>>
>
>
>
> --
> Nitin Pawar
>