Yes, it would be better if I do it at the time of insertion. I just have
to add one more column. Thanks again.

Regards,
    Mohammad Tariq


On Tue, May 22, 2012 at 2:36 PM, Abhinav Neelam <abhinavroc...@gmail.com> wrote:
> Doing it in the Pig script is not feasible because Pig doesn't have any
> notion of sequentiality - to maintain it, you'd need to have access to
> state that's shared globally by all the mappers and reducers. One way I can
> think of doing this is to have a UDF that maintains state - perhaps it can
> maintain a file that's NFS-mounted or stored in HDFS so that it's available on all
> the task nodes; then any call to the UDF can update that file (atomically)
> and return a 'row number' that you could associate with your current tuple.
> Something like:
> B = FOREACH A GENERATE $0, $1, $2, $3, MyUDFs.GETROWNUM() as rownum;
>
> However, AFAIK, you'd be better off doing it in HBase - perhaps at the time
> of record insert, you could also add a 'row number' into the record?
>
> On 22 May 2012 12:43, Mohammad Tariq <donta...@gmail.com> wrote:
>
>> Hi Abhinav,
>>
>>   Thanks a lot for the valuable response. Actually, I was thinking of
>> doing the same thing, but being new to Pig I thought I'd ask on the
>> mailing list first. As far as the data is concerned, the second column
>> will always be in ascending order, but I don't think that will be of any
>> help. I think what you have suggested here would be the appropriate
>> solution. I would like to ask you one thing, though: is it feasible to
>> add that first column holding the count in my Pig script, or do I have
>> to change the data in my HBase table itself? If it is feasible, how can
>> I achieve it in my script? Many thanks.
>>
>> Regards,
>>     Mohammad Tariq
>>
>>
>> On Tue, May 22, 2012 at 1:16 AM, Abhinav Neelam <abhinavroc...@gmail.com>
>> wrote:
>> > Hey Mohammad,
>> >
>> > You need some sorting requirement when you say 'top 5' records. Because
>> > relations/bags in Pig are unordered, it's natural to ask: 'top 5 by what
>> > parameter?' I'm unfamiliar with HBase, but if your data in HBase has an
>> > implicit ordering (say an auto-increment primary key) or an explicit
>> > one, you could include that field in your input to Pig and then apply
>> > TOP on that field.
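>> >
>> > For example, something along these lines might work (an untested
>> > sketch; it assumes the row key loaded with '-loadKey true' is column 0
>> > and provides the ordering, and the relation names are just illustrative):
>> >
>> > b = GROUP a ALL;
>> > c = FOREACH b GENERATE FLATTEN(TOP(5, 0, a));  -- top 5 by column 0 (the key)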
>> >
>> > Having said that, if I understand your problem correctly, you don't need
>> > TOP at all - you just want to process your input in groups of 5 tuples
>> > at a time. Again, I can't think of a way of doing this without modifying
>> > your input. For example, if your input included an extra field like this:
>> > 1  18.98   2000    1.21   193.46  2.64  58.17
>> > 1  52.49   2000.5  4.32   947.11  2.74  64.45
>> > 1  115.24  2001    16.8   878.58  2.66  94.49
>> > 1  55.55   2001.5  33.03  656.56  2.82  60.76
>> > 1  156.14  2002    35.52  83.75   2.6   59.57
>> > 2  138.77  2002.5  21.51  105.76  2.62  85.89
>> > 2  71.89   2003    27.79  709.01  2.63  85.44
>> > 2  59.84   2003.5  32.1   444.82  2.72  70.8
>> > 2  103.18  2004    4.09   413.15  2.8   54.37
>> >
>> > you could do a group on that field and proceed. Even if you had a field
>> > like 'line number' or 'record number' in your input, you could still
>> > manipulate that field (say through integer division by 5) to use it for
>> > grouping; see the sketch below. In any case, you need something to let
>> > Pig bring your 5-tuple groups together.
>> >
>> > B = group A by $0;
>> > C = FOREACH B { <do some processing on your 5-tuple bag A> ... };
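>> >
>> > A rough, untested sketch of the integer-division idea (relation names
>> > are illustrative, and it assumes a 0-based integer record number in $0):
>> >
>> > G = GROUP A BY (int)($0 / 5);            -- records 0-4, 5-9, ... land in one group
>> > D = FOREACH G GENERATE group, COUNT(A);  -- stand-in for the real processing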
>> >
>> > Thanks,
>> > Abhinav
>> >
>> > On 21 May 2012 23:03, Mohammad Tariq <donta...@gmail.com> wrote:
>> >
>> >> Hi Ruslan,
>> >>
>> >>    Thanks for the response. I think I have made a mistake. Actually, I
>> >> just want the top 5 records each time; I don't have any sorting
>> >> requirements.
>> >>
>> >> Regards,
>> >>     Mohammad Tariq
>> >>
>> >>
>> >> On Mon, May 21, 2012 at 9:31 PM, Ruslan Al-fakikh
>> >> <ruslan.al-fak...@jalent.ru> wrote:
>> >> > Hey Mohammad,
>> >> >
>> >> > Here
>> >> > c = TOP(5,3,a);
>> >> > you say: take 5 records out of a that have the biggest values in the
>> >> third
>> >> > column. Do you really need that sorting by the third column?
>> >> >
>> >> > -----Original Message-----
>> >> > From: Mohammad Tariq [mailto:donta...@gmail.com]
>> >> > Sent: Monday, May 21, 2012 3:54 PM
>> >> > To: user@pig.apache.org
>> >> > Subject: How to use TOP?
>> >> >
>> >> > Hello list,
>> >> >
>> >> >  I have an HDFS file with 6 columns containing some data stored in
>> >> > an HBase table. The data looks like this -
>> >> >
>> >> > 18.98   2000    1.21   193.46  2.64  58.17
>> >> > 52.49   2000.5  4.32   947.11  2.74  64.45
>> >> > 115.24  2001    16.8   878.58  2.66  94.49
>> >> > 55.55   2001.5  33.03  656.56  2.82  60.76
>> >> > 156.14  2002    35.52  83.75   2.6   59.57
>> >> > 138.77  2002.5  21.51  105.76  2.62  85.89
>> >> > 71.89   2003    27.79  709.01  2.63  85.44
>> >> > 59.84   2003.5  32.1   444.82  2.72  70.8
>> >> > 103.18  2004    4.09   413.15  2.8   54.37
>> >> >
>> >> > Now I have to take each record along with its next 4 records and do
>> >> > some processing (for example, in the first pass I have to take records
>> >> > 1-5, in the next pass records 2-6, and so on). I am trying to use TOP
>> >> > for this, but I am getting the following error -
>> >> > but getting the following error -
>> >> >
>> >> > 2012-05-21 17:04:30,328 [main] ERROR org.apache.pig.tools.grunt.Grunt
>> >> > - ERROR 1200: Pig script failed to parse:
>> >> > <line 6, column 37> Invalid scalar projection: parameters : A column
>> >> > needs to be projected from a relation for it to be used as a scalar
>> >> > Details at logfile: /home/mohammad/pig-0.9.2/logs/pig_1337599211281.log
>> >> >
>> >> > I am using the following commands -
>> >> >
>> >> > grunt> a = load 'hbase://logdata'
>> >> >>> using org.apache.pig.backend.hadoop.hbase.HBaseStorage(
>> >> >>> 'cf:DGR cf:HD cf:POR cf:RES cf:RHOB cf:SON', '-loadKey true') as
>> >> >>> (id, DGR, HD, POR, RES, RHOB, SON);
>> >> > grunt> b = foreach a { c = TOP(5,3,a);
>> >> >>> generate flatten(c);
>> >> >>> }
>> >> >
>> >> > Could anyone tell me how to achieve that? Many thanks.
>> >> >
>> >> > Regards,
>> >> >     Mohammad Tariq
>> >> >
>> >>
>> >
>> >
>> >
>> > --
>> > Hacking is, and always has been, the Holy
>> > Grail of computer science.
>>
>
>
>
> --
> Hacking is, and always has been, the Holy
> Grail of computer science.
