how to feed sample of data to each mapper

2014-02-26 Thread qiaoresearcher
Assume there is one large data set of size 100 GB on HDFS. How can I ensure that the data sent to each mapper is around 10 GB and that this 10 GB is randomly sampled from the 100 GB data set? Is there any Mahout sample code that does this? Any comments will be appreciated. Regards,
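One possible way to think about the sampling part of this question, sketched in plain Python rather than as Hadoop or Mahout code: keep each record independently with probability 0.1, so that the expected sample is about one tenth of the input (roughly 10 GB out of 100 GB if records are similar in size). Giving each mapper its own seed yields a different random sample per mapper. The names `sample_for_mapper`, `fraction`, and `seed`, and the toy records, are assumptions made for illustration and do not come from the thread.

```python
import random


def sample_for_mapper(records, fraction, seed):
    """Yield each record independently with probability `fraction`.

    `seed` is a hypothetical per-mapper value; different seeds give
    different random samples of the same input.
    """
    rng = random.Random(seed)
    for record in records:
        if rng.random() < fraction:
            yield record


# Toy stand-in for the large data set (hypothetical records).
records = [f"record-{i}" for i in range(100_000)]

# One independent ~10% sample per (hypothetical) mapper id.
samples = {
    mapper_id: list(sample_for_mapper(records, 0.10, seed=mapper_id))
    for mapper_id in range(3)
}

for mapper_id, sample in samples.items():
    print(mapper_id, len(sample))
```

The sample size is only approximately 10% of the input; a fixed-size sample would need a different technique, such as reservoir sampling.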

how to write hive query to solve this problem?

2013-08-30 Thread qiaoresearcher
I have three tables:
Table 1: records when and whether each user visited a gas station. It contains all the users of interest; call the set of all these users A.
date | user name | visited gas station?
2013-09-01 | tom | yes
2013

question about machine learning on Hive

2013-01-17 Thread qiaoresearcher
How can I run machine learning algorithms (any ML algorithms) directly in Hive? Assume the input and output are already stored as Hive tables. PS: I know Mahout is available, but I would prefer to run machine learning algorithms directly in Hive. Many thanks,

Re: how to obtain the latest record for each user in a hive table?

2012-11-19 Thread qiaoresearcher
row_number, right? I am not a Hadoop administrator; can I run the rank function in Hive? Thanks again!
On Mon, Nov 19, 2012 at 4:55 PM, Edward Capriolo wrote:
> On Mon, Nov 19, 2012 at 4:02 PM, qiaoresearcher wrote:
> > The table format is something like:
> >
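The idea behind using a row-number or rank function for "latest record per user" can be illustrated with a small Python sketch. This is not Hive code: it only mimics numbering rows within each user from newest to oldest and keeping the row numbered 1. The sample rows, the numeric timestamps, and the function name `with_row_number` are assumptions made for illustration.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical rows: (user_id, visiting_time, visiting_web_page)
rows = [
    ("user1", 3, "page_c"),
    ("user1", 1, "page_a"),
    ("user2", 2, "page_x"),
    ("user2", 5, "page_y"),
]


def with_row_number(rows):
    """Append a per-user row number, counting from the newest record."""
    # Sort newest first, then stably by user so each user's rows
    # stay in newest-first order.
    ordered = sorted(rows, key=itemgetter(1), reverse=True)
    ordered.sort(key=itemgetter(0))
    numbered = []
    for _user, group in groupby(ordered, key=itemgetter(0)):
        for n, row in enumerate(group, start=1):
            numbered.append(row + (n,))
    return numbered


# Keep only the row numbered 1 for each user, i.e. the latest record.
latest = [row[:3] for row in with_row_number(rows) if row[3] == 1]
print(latest)
```

Whether an equivalent function can be used in a given Hive installation depends on the Hive version in use.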

need help on writing hive query

2012-11-02 Thread qiaoresearcher
The table format is something like:
user_id | visiting_time | visiting_web_page
user1 | time11 | page_string_11
user1 | time12 | page_string_12 with keyword 'abc'
user1 | time13 | page_string_13
user1 | time14 | page_strin

need help on writing hive query

2012-10-31 Thread qiaoresearcher
Hi all, here is the question. Assume we have a table like:
user_id || user_visiting_time || user_current_web_page || user_previous_web_page
user 1