Assume there is one large data set of size 100 GB on HDFS. How can I control it so that the data sent to each mapper is around 10 GB, and so that each 10 GB is randomly sampled from the 100 GB data set? Is there any Mahout sample code that does this?
Any comments would be appreciated.
Regards,
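One common approach (a sketch, not something stated in the thread) is to tag each record with a uniformly random bucket id during the map phase, so that each of ten buckets ends up holding a random ~10 GB tenth of the data. A minimal stand-alone Python sketch of that partitioning idea, with all names illustrative:

```python
import random

def random_partition(records, num_parts=10, seed=42):
    """Assign each record to one of num_parts buckets uniformly at
    random, so each bucket is a roughly equal-size random sample of
    the input (analogous to emitting a random reducer key per row)."""
    rng = random.Random(seed)
    parts = [[] for _ in range(num_parts)]
    for rec in records:
        parts[rng.randrange(num_parts)].append(rec)
    return parts

# Simulate 100,000 records split into 10 random buckets.
parts = random_partition(range(100_000), num_parts=10)
sizes = [len(p) for p in parts]
print(sum(sizes))   # every record lands in exactly one bucket
```

With uniform assignment, each bucket size concentrates tightly around one tenth of the input, which is what makes each bucket both a random sample and approximately 10 GB of a 100 GB input.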
I have three tables:
Table 1: records when each user visited a gas station (or not); this table contains all the users of interest. Call the set of all users A.
date | user_name | visited_gas_station?
2013-09-01 | tom | yes
2013
How can I run machine learning algorithms (any ML algorithm) directly in Hive? Assume the input and output are already stored as Hive tables.
PS: I know Mahout is available, but I would prefer to run the machine learning algorithms directly in Hive.
many thanks,
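One observation worth noting (a sketch under assumptions, not a general answer): some ML fits reduce entirely to plain aggregates, and anything expressible as aggregates can run as a single Hive SELECT. For example, ordinary least-squares with one feature needs only AVG(x), AVG(y), AVG(x*y), and AVG(x*x). The Python below mirrors what such an aggregate-only query would compute; all names are illustrative.

```python
def ols_via_aggregates(xs, ys):
    """Fit y = slope*x + intercept using only the four averages a
    SQL engine can compute in one aggregation pass."""
    n = len(xs)
    avg_x = sum(xs) / n
    avg_y = sum(ys) / n
    avg_xy = sum(x * y for x, y in zip(xs, ys)) / n
    avg_xx = sum(x * x for x in xs) / n
    slope = (avg_xy - avg_x * avg_y) / (avg_xx - avg_x ** 2)
    intercept = avg_y - slope * avg_x
    return slope, intercept

# Data generated exactly on the line y = 2x + 1, so the fit recovers it.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = ols_via_aggregates(xs, ys)
print(slope, intercept)  # → 2.0 1.0
```

Iterative algorithms (e.g. gradient descent) do not map this cleanly onto a single query, which is part of why tools like Mahout exist.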
row_number, right? I am not a Hadoop administrator; can I run the rank function in Hive?
thanks again!
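If the installed Hive build lacks a row_number/rank function, one workaround (a sketch, assuming rows arrive sorted by key, as a CLUSTER BY would arrange before a TRANSFORM script) is to stream the rows through a small script that counts rows per key. The field names below are illustrative:

```python
def add_row_numbers(rows):
    """rows: (user_id, visiting_time) pairs, already sorted by
    user_id then visiting_time. Yields (user_id, visiting_time, n)
    where the counter n restarts at 1 for each user."""
    prev_user, n = None, 0
    for user, ts in rows:
        n = n + 1 if user == prev_user else 1
        prev_user = user
        yield user, ts, n

rows = [("user1", "t1"), ("user1", "t2"), ("user2", "t1")]
numbered = list(add_row_numbers(rows))
print(numbered)  # → [('user1', 't1', 1), ('user1', 't2', 2), ('user2', 't1', 1)]
```

No administrator privileges are needed for this style of workaround, since it only relies on running a query-side script rather than installing anything cluster-wide.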
On Mon, Nov 19, 2012 at 4:55 PM, Edward Capriolo wrote:
> On Mon, Nov 19, 2012 at 4:02 PM, qiaoresearcher wrote:
> > The table format is something like:
The table format is something like:
user_id | visiting_time | visiting_web_page
user1 | time11 | page_string_11
user1 | time12 | page_string_12 with keyword 'abc'
user1 | time13 | page_string_13
user1 | time14 | page_strin
Hi all,
Here is the question. Assume we have a table like:
--
user_id || user_visiting_time || user_current_web_page || user_previous_web_page
user 1