Re: Split files into 80% and 20% for building model and prediction

2014-12-20 Thread Peyman Mohajerian
you don't have to copy the data to local to do a count. %hdfs dfs -cat file1 | wc -l will do the job On Fri, Dec 12, 2014 at 1:58 AM, Susheel Kumar Gadalay wrote: > > Simple solution.. > > Copy the HDFS file to local and use OS commands to count no of lines > > cat file1 | wc -l > > and cut it ba…

Re: Split files into 80% and 20% for building model and prediction

2014-12-12 Thread Wilm Schumacher
…Susheel Kumar Gadalay <skgada...@gmail.com> > Sent: 12/12/2014 12:00 > To: user@hadoop.apache.org > Subject: Re: Split files into 80% and 20% for building model and > prediction > > Simple solution.. > > Copy the HDFS file to…

Re: Split files into 80% and 20% for building model and prediction

2014-12-12 Thread Andre Kelpe
Try Cascading multitool: http://docs.cascading.org/multitool/2.6/ - André On Fri, Dec 12, 2014 at 10:30 AM, unmesha sreeveni wrote: > I am trying to divide my HDFS file into 2 parts/files > 80% and 20% for classification algorithm (80% for modelling and 20% for > prediction) > Please provide sug…

Re: Split files into 80% and 20% for building model and prediction

2014-12-12 Thread Chris Mawata
How about doing something along the lines of bucketing: pick a field that is unique for each record and, if the hash of the field mod 10 is 8 or less, it goes in one bin, otherwise into the other one. Cheers Chris On Dec 12, 2014 1:32 AM, "unmesha sreeveni" wrote: > I am trying to divide my HDFS file into…
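The bucketing idea above can be sketched as a small shell pipeline. This is a hedged sketch, not the poster's code: the sample data, the key in column 1, and the use of cksum as the hash are all assumptions; note also that an exact 80/20 split needs 8 of the 10 buckets in the training bin, i.e. hash mod 10 of 7 or less.

```shell
# Hash-based 80/20 split (sketch): bucket each record by a hash of a key field.
# Sample data: one record per line, unique id in column 1 (assumption).
seq 1 100 | sed 's/^/id/' > input.txt

> train.txt
> test.txt
while IFS= read -r line; do
    key=${line%% *}                      # first whitespace-separated field
    # cksum gives a stable CRC of the key; mod 10 forms 10 buckets
    bucket=$(( $(printf '%s' "$key" | cksum | cut -d' ' -f1) % 10 ))
    if [ "$bucket" -le 7 ]; then         # buckets 0-7 (8 of 10) -> ~80% train
        printf '%s\n' "$line" >> train.txt
    else                                 # buckets 8-9 (2 of 10) -> ~20% test
        printf '%s\n' "$line" >> test.txt
    fi
done < input.txt
```

Because the bucket depends only on the key, the same record always lands in the same file, so the split is reproducible without counting lines first.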

RE: Split files into 80% and 20% for building model and prediction

2014-12-12 Thread Mikael Sitruk
Hi Unmesha, With the random approach you don't need to write the MR job for counting. Mikael.S -----Original Message----- From: "Hitarth" Sent: 12/12/2014 15:20 To: "user@hadoop.apache.org" Subject: Re: Split files into 80% and 20% for building model and predicti…

Re: Split files into 80% and 20% for building model and prediction

2014-12-12 Thread Hitarth
…t() then if the number generated by random is below 0.8 >> then the row would go to key for training otherwise go to key for the test. >> Mikael.S >> From: Susheel Kumar Gadalay >> Sent: 12/12/2014 12:00 >> To: user@hadoop.apache.org >> Subject: Re: Split f…

Re: Split files into 80% and 20% for building model and prediction

2014-12-12 Thread unmesha sreeveni
…generated by random is below 0.8 > then the row would go to key for training otherwise go to key for the test. > Mikael.S > ------------------------------ > From: Susheel Kumar Gadalay > Sent: 12/12/2014 12:00 > To: user@hadoop.apache.org > Subject: Re: Split files into 80%…
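The random-draw approach described in this thread (emit each row to the training set when a uniform draw is below 0.8, otherwise to the test set, with no counting pass) can be sketched in a single awk pass. The file names and the sample data are assumptions for illustration:

```shell
# Random 80/20 split in one pass -- no line count needed (sketch).
seq 1 1000 > input.txt                   # sample data (assumption)
awk 'BEGIN { srand(42) }                 # fixed seed so the split is repeatable
     { if (rand() < 0.8) print > "train.txt"
       else               print > "test.txt" }' input.txt
wc -l train.txt test.txt                 # roughly 800 / 200 lines
```

The trade-off versus the count-then-cut approach: the split ratio is only approximate (each row is an independent 80% coin flip), but every record is read exactly once, which is what makes it a good fit for a map-only job over HDFS data.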

RE: Split files into 80% and 20% for building model and prediction

2014-12-12 Thread Mikael Sitruk
…From: "Susheel Kumar Gadalay" Sent: 12/12/2014 12:00 To: "user@hadoop.apache.org" Subject: Re: Split files into 80% and 20% for building model and prediction Simple solution.. Copy the HDFS file to local and use OS commands to count no of lines cat file1 | wc -l and cut it based on line number.

Re: Split files into 80% and 20% for building model and prediction

2014-12-12 Thread Susheel Kumar Gadalay
Simple solution.. Copy the HDFS file to local and use OS commands to count no of lines cat file1 | wc -l and cut it based on line number. On 12/12/14, unmesha sreeveni wrote: > I am trying to divide my HDFS file into 2 parts/files > 80% and 20% for classification algorithm (80% for modelling a…
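The count-then-cut recipe above can be sketched with head and tail on a local copy of the file. This is a sketch under stated assumptions: the sample file is invented, and in practice the count could also come from hdfs dfs -cat file1 | wc -l without copying, as noted elsewhere in the thread:

```shell
# Count the lines, then cut the first 80% into one file and the rest into another.
seq 1 50 > file1                          # sample local file (assumption)
total=$(wc -l < file1)
train=$(( total * 80 / 100 ))             # integer 80% of the line count
head -n "$train" file1 > train.txt        # lines 1..train
tail -n +"$(( train + 1 ))" file1 > test.txt   # lines train+1..total
wc -l train.txt test.txt                  # 40 and 10 lines here
```

Unlike the hash or random approaches, this gives an exact 80/20 cut, but it takes the first 80% of rows in file order, so it only yields a fair train/test split if the records are not ordered by class or time.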

Split files into 80% and 20% for building model and prediction

2014-12-12 Thread unmesha sreeveni
I am trying to divide my HDFS file into 2 parts/files, 80% and 20%, for a classification algorithm (80% for modelling and 20% for prediction). Please provide suggestions for the same. To take 80% and 20% into 2 separate files we need to know the exact number of records in the data set, and it is only known if…