Yes, it would be better if I do it at the time of insertion. I just have to add one more column. Thanks again.
Regards,
Mohammad Tariq

On Tue, May 22, 2012 at 2:36 PM, Abhinav Neelam <abhinavroc...@gmail.com> wrote:
> Doing it in the Pig script is not feasible because Pig doesn't have any
> notion of sequentiality - to maintain it, you'd need access to state
> that's shared globally by all the mappers and reducers. One way I can
> think of doing this is to have a UDF that maintains state - perhaps it
> can maintain a file that's NFS-mounted or in HDFS so that it's available
> on all the task nodes; then any call to the UDF can update that file
> (atomically) and return a 'row number' that you could associate with your
> current tuple. Something like:
> B = FOREACH A GENERATE $0, $1, $2, $3, MyUDFs.GETROWNUM() as rownum;
>
> However, AFAIK, you'd be better off doing it in HBase - perhaps at the
> time of record insert, you could also add a 'row number' into the record?
>
> On 22 May 2012 12:43, Mohammad Tariq <donta...@gmail.com> wrote:
>
>> Hi Abhinav,
>>
>> Thanks a lot for the valuable response. Actually I was thinking of
>> doing the same thing, but being new to Pig I thought of asking it on
>> the mailing list first. As far as the data is concerned, the second
>> column will always be in ascending order, but I don't think that will
>> be of any help. I think what you have suggested here would be the
>> appropriate solution. Although I would like to ask you one thing: is
>> it feasible to add that first column holding the count in my Pig
>> script, or do I have to change the data in my HBase table itself? If
>> yes, then how can I achieve it in my script? Many thanks.
>>
>> Regards,
>> Mohammad Tariq
>>
>>
>> On Tue, May 22, 2012 at 1:16 AM, Abhinav Neelam <abhinavroc...@gmail.com>
>> wrote:
>> > Hey Mohammad,
>> >
>> > You need to have sorting requirements when you say 'top 5' records.
>> > Because relations/bags in Pig are unordered, it's natural to ask:
>> > 'top 5 by what parameter?'
>> > I'm unfamiliar with HBase, but if your data in HBase has an implicit
>> > ordering - say an auto-increment primary key - or an explicit one, you
>> > could include that field in your input to Pig and then apply TOP on
>> > that field.
>> >
>> > Having said that, if I understand your problem correctly, you don't
>> > need TOP at all - you just want to process your input in groups of 5
>> > tuples at a time. Again, I can't think of a way of doing this without
>> > modifying your input. For example, if your input included an extra
>> > field like this:
>> > 1 18.98 2000 1.21 193.46 2.64 58.17
>> > 1 52.49 2000.5 4.32 947.11 2.74 64.45
>> > 1 115.24 2001 16.8 878.58 2.66 94.49
>> > 1 55.55 2001.5 33.03 656.56 2.82 60.76
>> > 1 156.14 2002 35.52 83.75 2.6 59.57
>> > 2 138.77 2002.5 21.51 105.76 2.62 85.89
>> > 2 71.89 2003 27.79 709.01 2.63 85.44
>> > 2 59.84 2003.5 32.1 444.82 2.72 70.8
>> > 2 103.18 2004 4.09 413.15 2.8 54.37
>> >
>> > you could do a group on that field and proceed. Even if you had a
>> > field like 'line number' or 'record number' in your input, you could
>> > still manipulate that field (say through integer division by 5) to use
>> > it for grouping. In any case, you need something to let Pig bring
>> > together your 5-tuple groups.
>> >
>> > B = group A by $0;
>> > C = FOREACH B { <do some processing on your 5-tuple bag A> ...
>> >
>> > Thanks,
>> > Abhinav
>> >
>> > On 21 May 2012 23:03, Mohammad Tariq <donta...@gmail.com> wrote:
>> >
>> >> Hi Ruslan,
>> >>
>> >> Thanks for the response. I think I have made a mistake. Actually I
>> >> just want the top 5 records each time. I don't have any sorting
>> >> requirements.
>> >>
>> >> Regards,
>> >> Mohammad Tariq
>> >>
>> >>
>> >> On Mon, May 21, 2012 at 9:31 PM, Ruslan Al-fakikh
>> >> <ruslan.al-fak...@jalent.ru> wrote:
>> >> > Hey Mohammad,
>> >> >
>> >> > Here
>> >> >   c = TOP(5,3,a);
>> >> > you say: take 5 records out of a that have the biggest values in
>> >> > the third column.
>> >> > Do you really need that sorting by the third column?
>> >> >
>> >> > -----Original Message-----
>> >> > From: Mohammad Tariq [mailto:donta...@gmail.com]
>> >> > Sent: Monday, May 21, 2012 3:54 PM
>> >> > To: user@pig.apache.org
>> >> > Subject: How to use TOP?
>> >> >
>> >> > Hello list,
>> >> >
>> >> > I have an HDFS file with 6 columns that contain some data stored in
>> >> > an HBase table. The data looks like this -
>> >> >
>> >> > 18.98 2000 1.21 193.46 2.64 58.17
>> >> > 52.49 2000.5 4.32 947.11 2.74 64.45
>> >> > 115.24 2001 16.8 878.58 2.66 94.49
>> >> > 55.55 2001.5 33.03 656.56 2.82 60.76
>> >> > 156.14 2002 35.52 83.75 2.6 59.57
>> >> > 138.77 2002.5 21.51 105.76 2.62 85.89
>> >> > 71.89 2003 27.79 709.01 2.63 85.44
>> >> > 59.84 2003.5 32.1 444.82 2.72 70.8
>> >> > 103.18 2004 4.09 413.15 2.8 54.37
>> >> >
>> >> > Now I have to take each record along with its next 4 records and do
>> >> > some processing (for example, in the first shot I have to take
>> >> > records 1-5, in the next shot I have to take 2-6, and so on). I am
>> >> > trying to use TOP for this, but I am getting the following error -
>> >> >
>> >> > 2012-05-21 17:04:30,328 [main] ERROR org.apache.pig.tools.grunt.Grunt
>> >> > - ERROR 1200: Pig script failed to parse:
>> >> > <line 6, column 37> Invalid scalar projection: parameters : A column
>> >> > needs to be projected from a relation for it to be used as a scalar
>> >> > Details at logfile: /home/mohammad/pig-0.9.2/logs/pig_1337599211281.log
>> >> >
>> >> > I am using the following commands -
>> >> >
>> >> > grunt> a = load 'hbase://logdata'
>> >> >>> using org.apache.pig.backend.hadoop.hbase.HBaseStorage(
>> >> >>> 'cf:DGR cf:HD cf:POR cf:RES cf:RHOB cf:SON', '-loadKey true') as (id,
>> >> >>> DGR, HD, POR, RES, RHOB, SON);
>> >> > grunt> b = foreach a { c = TOP(5,3,a);
>> >> >>> generate flatten(c);
>> >> >>> }
>> >> >
>> >> > Could anyone tell me how to achieve that? Many thanks.
>> >> >
>> >> > Regards,
>> >> > Mohammad Tariq
>> >> >
>> >>
>> >
>> >
>> >
>> > --
>> > Hacking is, and always has been, the Holy
>> > Grail of computer science.
>> >
>
>
>
> --
> Hacking is, and always has been, the Holy
> Grail of computer science.
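For anyone finding this thread later, the two suggestions made above - grouping before calling TOP (TOP operates on a bag, which is why referencing the relation `a` inside its own FOREACH raised the scalar-projection error), and windowing records via integer division on a row number - can be sketched roughly as follows. This is an untested sketch: the `cf:ROWNUM` column, the typed field names, and the example aggregates are assumptions layered on top of the schema discussed in the thread, not verified code.

```
-- Sketch only: assumes a 'rownum' column was added at HBase insertion
-- time, as agreed at the top of the thread. Field names are illustrative.
a = LOAD 'hbase://logdata'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'cf:ROWNUM cf:DGR cf:HD cf:POR cf:RES cf:RHOB cf:SON',
        '-loadKey true')
    AS (id, rownum:long, DGR:double, HD:double, POR:double,
        RES:double, RHOB:double, SON:double);

-- (1) Correct TOP usage: group first so TOP receives a bag.
--     TOP(n, column-index, bag) keeps the n tuples with the largest
--     values in the 0-based column given (index 2 here is DGR).
g    = GROUP a ALL;
top5 = FOREACH g GENERATE FLATTEN(TOP(5, 2, a));

-- (2) Processing in groups of 5: integer division of the row number
--     maps rows 0-4 to group 0, rows 5-9 to group 1, and so on.
b = FOREACH a GENERATE rownum / 5 AS grp, DGR, HD, POR, RES, RHOB, SON;
c = GROUP b BY grp;
-- Each bag 'b' below is one 5-tuple window; the aggregates are just
-- placeholders for whatever per-window processing is needed.
d = FOREACH c GENERATE group AS grp, COUNT(b) AS n, AVG(b.DGR) AS avg_dgr;
```

Note this gives non-overlapping windows (1-5, 6-10, ...); the overlapping 1-5, 2-6 variant asked about originally would need the stateful-UDF route described earlier in the thread.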