On Thu, Jan 21, 2010 at 1:16 PM, Sriram Muthuswamy Chittathoor
<[email protected]> wrote:
> I noticed one thing while running my sample mapreduce job -- it creates a
> lot of java processes on the slave nodes. Even when I have the "reuse.tasks"
> property set, why does it not use a single JVM? Sometimes I see almost 20
> JVMs running on a single box. What property can I use to keep it from
> spawning this huge number of JVMs?
>

Do a long listing to see what they all are. Each daemon in hadoop/hbase runs
in its own JVM. You can configure mapreduce to reuse a JVM; that saves some
on startup costs but doesn't decrease the total number running. How many
tasks have you configured to run concurrently? You could turn that down.
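
To be concrete about the knobs (the property names below are the 0.20-era
ones, so double-check them against your version): JVM reuse is the per-job
mapred.job.reuse.jvm.num.tasks setting, while the number of tasks running at
once on a node comes from the tasktracker slot settings,
mapred.tasktracker.map.tasks.maximum and
mapred.tasktracker.reduce.tasks.maximum, in mapred-site.xml on the slaves.
Roughly, for the per-job part:

  import org.apache.hadoop.mapred.JobConf;

  public class JvmReusePerJob {
    public static void main(String[] args) {
      JobConf conf = new JobConf();
      // -1 means reuse the JVM for an unlimited number of this job's tasks
      // on a node; the default of 1 spawns a fresh JVM per task.
      conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);
      // The per-node slot counts (mapred.tasktracker.map.tasks.maximum and
      // mapred.tasktracker.reduce.tasks.maximum) are tasktracker settings;
      // they go in mapred-site.xml on the slaves, not in the job config.
    }
  }

Reuse only helps with JVM startup overhead; if you want fewer JVMs alive at
the same time, it is the slot counts you have to lower.
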
St.Ack

>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of stack
> Sent: Friday, January 22, 2010 2:24 AM
> To: [email protected]
> Subject: Re: HBase bulk load
>
> On Thu, Jan 21, 2010 at 12:35 PM, Sriram Muthuswamy Chittathoor
> <[email protected]> wrote:
>>
>> The output of job 1, the one that parses the 4k files and outputs
>> user+day/value -- if it is ordered by user+day, then you can take the
>> outputs of this first job and feed them to the second job one at a time.
>> HFiles will be written for some subset of all users, but for this subset,
>> all of their activity over the 8 years will be processed. You'll then
>> move on to the next set of users....
>>
>> --- I am assuming here I will mark that a certain set of users' data
>> (for all 8 years) goes into a certain hfile, and this hfile will just
>> keep getting appended to for this same set of users as I progress
>> through different years' data for the same set of users.
>>
>
> No. Not if you are feeding all user data for all years to this first MR
> job.
>
> Because the key is user+date, the reducer will get everything for a
> particular user ordered by date. This will be filled into an hfile.
> A user might not all fit in a single hfile. That's fine too. You
> should not have to keep track of anything. The framework does this
> for you.
>
>> --- I will have to completely process this set of users' data a year at
>> a time, in order (2000, 2001, etc.)
>>
>
> If you go this route, you'll have to have separate tables for each
> year, I believe.
>
>> Maybe write this back to hdfs as sequencefiles rather than as hfiles and
>> then take the output of this job's reducer and feed these to your
>> hfileoutputformat job one at a time if you want to piecemeal the
>> creation of hfiles.
>>
>> --- In light of the above, where do these sequence files fit in?
>>
>
> They'll be input for a second mapreduce job whose output is hfiles.
>
> St.Ack
>
>> -----Original Message-----
>> From: [email protected] [mailto:[email protected]] On Behalf Of
>> stack
>> Sent: Saturday, January 16, 2010 5:58 AM
>> To: [email protected]
>> Subject: Re: HBase bulk load
>>
>> On Thu, Jan 14, 2010 at 11:05 PM, Sriram Muthuswamy Chittathoor <
>> [email protected]> wrote:
>>
>>> --- We need to bulk load 8 years worth of data from our archives. That
>>> will be 8 * 12 months of data.
>>>
>>
>> Have you run mapreduce jobs over this archive in the past? I ask because
>> if you have, you may have an input on how long it'll take to do the big
>> import, or part of it.
>>
>>> What's your original key made of?
>>> -- Each data file is 4K of text which has 6 players' data on average.
>>> We will parse it and extract per userid/day data (so there will be many
>>> of these, each < .5K).
>>>
>>
>> Is your archive in HDFS now? Are the 4k files concatenated into some
>> kind of archive format? Gzip or something? Is it accessible with http?
>>
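
For the parse step, the mapper I have in mind is roughly the below. This is
an untested sketch: the tab-separated layout, the field positions, and the
class name are all made up, so substitute whatever your 4k format actually
needs. The only real point is that the map output key is userid plus day,
so the shuffle sorts by user and then by date:

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  public class ParseArchiveMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      // Hypothetical record layout: userId <tab> yyyyMMdd <tab> stats...
      String[] fields = line.toString().split("\t", 3);
      if (fields.length < 3) {
        return; // skip corrupt records rather than failing the whole task
      }
      // Composite key "userId/yyyyMMdd" gives per-user, per-day ordering.
      context.write(new Text(fields[0] + "/" + fields[1]), new Text(fields[2]));
    }
  }
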
>>> Would you do this step in multiple stages or feed this mapreduce job
>>> all 10 years of data?
>>>
>>> Either way I can do. Since I have 8 years worth of archived data I need
>>> to get them onto the system as a one-time effort. If I proceed in this
>>> year order, will it be fine -- 2000, 2001, 2002? The only requirement
>>> is that at the end these individual years' data (in hfiles) needs to be
>>> loaded into HBase.
>>>
>>
>> If I were to guess, the data in the year 2000 is < 2001 and so on?
>>
>> Is a table per year going to cut it for you? Don't you want to see the
>> user data over the whole 8 years? It'll be a pain doing 8 different
>> queries and aggregating instead of doing one query against a single
>> table.
>>
>>> -- Can you give me some link to doing this? If I am getting you right,
>>> is this the sequence:
>>>
>>> 1. Start with, say, year 2000 (1 billion 4k files to be processed and
>>> loaded)
>>> 2. Divide it into splits initially based on just filename ranges
>>> (user/day data is hidden inside the file)
>>> 3. Each mapper gets a bunch of files (if it is 20 mappers then each one
>>> will have to process 50 million 4k files -- seems too much even for a
>>> single year?? Should I go to processing a single month at a time??)
>>>
>>
>> For 50 million 4k files, you'd want more than 20 mappers. You might have
>> 20 'slots' for tasks on your cluster; each time a mapper runs, it might
>> process N files. One file only would probably be too little work to
>> justify the firing up of the JVM to run the task. So, you should feed
>> each map task 100 or 1000 4k files? If 1k files per map task, that's 50k
>> map tasks.
>>
>>> 4. Each mapper parses the file and extracts the user/day records
>>> 5. The custom partitioner sends a range of users/days to a particular
>>> reducer
>>>
>>
>> Yes. Your custom partitioner guarantees that a particular user only goes
>> to one reducer. How many users do you have, do you think? Maybe this job
>> is too big to do all in one go. You need to come up with a process with
>> more steps. A MR job that runs for weeks will fail (smile). Someone will
>> for sure pull the plug on the namenode just as the job is coming to an
>> end.
>>
>>> 6. Reducers in parallel will generate sequence files -- there will be
>>> multiple of them
>>>
>>> My question here is that in each year there will be sequence files
>>> containing a range of users' data. Do I need to identify these and put
>>> them together in one hfile, as the user/day records for all the 10
>>> years should be together in the final hfile?
>>
>> No. As stuff flows into the hfiles, it just needs to be guaranteed
>> ordered. A particular user may span multiple hfiles.
>>
>>> So some manual stuff is required here, taking related sequence files
>>> (those containing the same range of users/day data) and feeding them to
>>> the hfileoutputformat job?
>>>
>>
>> The output of job 1, the one that parses the 4k files and outputs
>> user+day/value -- if it is ordered by user+day, then you can take the
>> outputs of this first job and feed them to the second job one at a time.
>> HFiles will be written for some subset of all users, but for this
>> subset, all of their activity over the 8 years will be processed. You'll
>> then move on to the next set of users.... Eventually you will have many
>> hfiles to upload into an hbase instance. You'll probably need to modify
>> loadtable.rb some (one modification you should make, I thought, is NOT
>> to load an hfile whose length is 0 bytes).
>>
>>> --- Could you also give some links to this multiput technique??
>>>
>>
>> Ryan should put up the patch soon (
>> https://issues.apache.org/jira/browse/HBASE-2066).
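
On the custom partitioner, the general shape would be something like the
below. Again, just a sketch: it assumes numeric user ids with a known upper
bound and map output keys of the form "userId/yyyyMMdd". Each reducer then
owns one contiguous slice of the user id space, so the reducer outputs taken
in partition order stay globally ordered:

  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Partitioner;

  public class UserRangePartitioner extends Partitioner<Text, Text> {
    // Assumed upper bound on user ids; set this to your real id space.
    private static final long MAX_USER_ID = 100000000L;

    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
      // Key is assumed to look like "<userId>/<yyyyMMdd>".
      String k = key.toString();
      long userId = Long.parseLong(k.substring(0, k.indexOf('/')));
      long bucket = userId * numPartitions / MAX_USER_ID;
      return (int) Math.min(bucket, numPartitions - 1);
    }
  }
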
>> This seems like a pretty big job. My guess is that it's going to take a
>> bit of time getting it all working. Given your scale, my guess is that
>> you'll run into some interesting issues. For example, how many of those
>> 4k files have corruption in them, and how will your map tasks deal with
>> the corruption?
>>
>> You also need to figure out some things like how long each step is going
>> to take, how big the resultant data is going to be, and so on, so you can
>> gauge things like the amount of hardware you are going to need to get the
>> job done.
>>
>> The best way to get answers on the above is to start in with running a
>> few mapreduce jobs passing subsets of the data to see how things work
>> out.
>>
>> Yours,
>> St.Ack
>>
>>>
>>> St.Ack
>>>
>>> >
>>> > -----Original Message-----
>>> > From: [email protected] [mailto:[email protected]] On Behalf Of
>>> > stack
>>> > Sent: Thursday, January 14, 2010 11:33 AM
>>> > To: [email protected]
>>> > Subject: Re: HBase bulk load
>>> >
>>> > On Wed, Jan 13, 2010 at 9:49 PM, Sriram Muthuswamy Chittathoor <
>>> > [email protected]> wrote:
>>> >
>>> > > I am trying to use this technique to bulk load, say, 20 billion
>>> > > rows. I tried it on a smaller set of 20 million rows. One of the
>>> > > things I had to take care of was writing custom partitioning logic
>>> > > so that a range of keys only goes to a particular reducer, since
>>> > > there was some mention of global ordering.
>>> > > For example, Users (1 -- 1mill) ---> Reducer 1, and so on.
>>> > >
>>> >
>>> > Good.
>>> >
>>> > > My questions are:
>>> > > 1. Can I divide the bulk loading into multiple runs -- the existing
>>> > > bulk load bails out if it finds an HDFS output directory with the
>>> > > same name
>>> > >
>>> >
>>> > No. It's not currently written to do that, but especially if your
>>> > keys are ordered, it probably wouldn't take much to make the above
>>> > work (first job does the first set of keys, and so on).
>>> >
>>> > > 2. What I want to do is make multiple runs of 10 billion and then
>>> > > combine the output before running loadtable.rb -- is this possible?
>>> > > I am thinking this may be required in case my MR bulk loading fails
>>> > > in between and I need to start from where I crashed.
>>> > >
>>> >
>>> > Well, MR does retries but, yeah, you could run into some issue at the
>>> > 10B mark and want to then start over from there rather than start
>>> > from the beginning.
>>> >
>>> > One thing that the current setup does not do is remove the task hfile
>>> > on failure. We should add this. It would fix the case where, when
>>> > speculative execution is enabled and the speculative tasks are
>>> > killed, we don't leave around half-made hfiles (currently I believe
>>> > they show up as zero-length files).
>>> >
>>> > St.Ack
>>> >
>>> > > Any tips from huge bulk loading experience?
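
On those zero-length hfiles: until the remove-on-failure bit is added, you
could sweep them out of the output directory yourself before pointing
loadtable.rb at it. A rough sketch, assuming the usual
<outputdir>/<columnfamily>/<hfile> layout that the bulk-load output job
writes (check your own layout before deleting anything):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class ZeroLengthHFileSweep {
    public static void main(String[] args) throws Exception {
      FileSystem fs = FileSystem.get(new Configuration());
      Path outputDir = new Path(args[0]); // the job's hfile output directory
      for (FileStatus family : fs.listStatus(outputDir)) {
        if (!family.isDir()) continue;    // column-family subdirectories
        for (FileStatus hfile : fs.listStatus(family.getPath())) {
          if (hfile.getLen() == 0) {
            System.out.println("Removing empty hfile " + hfile.getPath());
            fs.delete(hfile.getPath(), false);
          }
        }
      }
    }
  }
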
>>> > >
>>> > > -----Original Message-----
>>> > > From: [email protected] [mailto:[email protected]] On Behalf Of
>>> > > stack
>>> > > Sent: Thursday, January 14, 2010 6:19 AM
>>> > > To: [email protected]
>>> > > Subject: Re: HBase bulk load
>>> > >
>>> > > See
>>> > > http://hadoop.apache.org/hbase/docs/r0.20.2/api/org/apache/hadoop/hbase/mapreduce/package-summary.html#bulk
>>> > > St.Ack
>>> > >
>>> > > On Wed, Jan 13, 2010 at 4:30 PM, Ted Yu <[email protected]> wrote:
>>> > >
>>> > > > Jonathan:
>>> > > > Since you implemented
>>> > > > https://issues.apache.org/jira/si/jira.issueviews:issue-html/HBASE-48/HBASE-48.html,
>>> > > > maybe you can point me to some documentation on how bulk load is
>>> > > > used?
>>> > > > I found bin/loadtable.rb and assume that can be used to import
>>> > > > data back into HBase.
>>> > > >
>>> > > > Thanks
>>> > > >
