Does this help:
http://hadoop.apache.org/hbase/docs/r0.20.2/api/org/apache/hadoop/hbase/mapreduce/package-summary.html#package_description

St.Ack
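For reference, a minimal sketch of the pattern that package description covers, assuming a hypothetical table named "pages" with a "content" column family: TableMapReduceUtil wires a Scan into TableInputFormat, which creates one input split, and therefore one map task, per region of the table, so a table that is split across many regions automatically gets many mappers.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class ReadPagesJob {

  // Each map task is fed the rows of one region of the table.
  static class PageMapper extends TableMapper<Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(ImmutableBytesWritable row, Result values, Context context)
        throws IOException, InterruptedException {
      // Emit one record per row; real per-page processing would go here.
      context.write(new Text(row.get()), ONE);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new HBaseConfiguration();
    Job job = new Job(conf, "read pages");
    job.setJarByClass(ReadPagesJob.class);

    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("content"));  // hypothetical column family

    // One map task per region of the (hypothetical) "pages" table.
    TableMapReduceUtil.initTableMapperJob("pages", scan, PageMapper.class,
        Text.class, IntWritable.class, job);
    job.setNumReduceTasks(0);
    job.setOutputFormatClass(NullOutputFormat.class);
    job.waitForCompletion(true);
  }
}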
On Sat, Dec 26, 2009 at 7:50 AM, Aram Mkhitaryan <[email protected]> wrote:

> Hi All,
>
> I'm new to Hadoop and I would be grateful if you could explain how to
> set things up so that your data is read from HBase tables in map-reduce
> tasks, and moreover how to tell the system to use more than one task.
> Is there an article that covers that kind of thing?
>
> Thank you very much,
> Merry Christmas and Happy New Year,
> Aram
>
> 2009/12/25 Eason.Lee <[email protected]>:
> > I think he means
> > http://jobtracker_ip:50030/jobtracker.jsp
> >
> > 2009/12/25 Xin Jing <[email protected]>
> >
> > > Good point, we will check the map-reduce number for the performance
> > > issue.
> > >
> > > Could you point me to a location where I can learn the usage of the
> > > job tracker?
> > >
> > > Thanks
> > > - Xin
> > > ________________________________________
> > > From: Jeff Zhang [[email protected]]
> > > Sent: December 25, 2009, 4:32 PM
> > > To: [email protected]
> > > Subject: Re: Looking for a better design
> > >
> > > You can look at the job tracker web UI to get the number of your
> > > mappers. And how many nodes are in your cluster? I do not think it
> > > should take several hours to transfer 2 million pages; I suspect you
> > > have only one mapper processing all 2 million pages.
> > >
> > > Jeff Zhang
> > >
> > > 2009/12/25 Xin Jing <[email protected]>
> > >
> > > > I am not quite sure how many mapper tasks there are during the
> > > > map-reduce job. We are using the default partition function, using
> > > > the URL as the row key. The mappers are set up in the default
> > > > manner. It takes several hours to finish the job; we have only run
> > > > it once, found the performance issue, and are now asking for a
> > > > better solution if there is one. We will get more experimental
> > > > numbers later...
> > > >
> > > > Thanks
> > > > - Xin
> > > > ________________________________________
> > > > From: Jeff Zhang [[email protected]]
> > > > Sent: December 25, 2009, 3:59 PM
> > > > To: [email protected]
> > > > Subject: Re: Looking for a better design
> > > >
> > > > Hi Xin,
> > > >
> > > > How many mapper tasks do you get when you transfer the 2 million
> > > > web pages? And what is the job time?
> > > >
> > > > Jeff Zhang
> > > >
> > > > 2009/12/24 Xin Jing <[email protected]>
> > > >
> > > > > Yes, we have the date of the crawled data, and we can use a
> > > > > filter to select just the pages from a specific day. But the date
> > > > > is not the row key, so applying the filter means scanning the
> > > > > whole table. The performance should be worse than saving the new
> > > > > data into a temp table, right?
> > > > >
> > > > > We are using map-reduce to transfer the processed data in the
> > > > > temp table into the whole table. The map-reduce job is simple: it
> > > > > selects the data in the map phase and imports it into the whole
> > > > > table in the reduce phase. Since the table definition of the temp
> > > > > table and the whole table is exactly the same, I am wondering if
> > > > > there is a trick to switch the data in the temp table into the
> > > > > whole table directly, just like partitioned tables in the
> > > > > database world.
> > > > >
> > > > > Thanks
> > > > > - Xin
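For what it's worth, a minimal sketch of the copy step described above, assuming hypothetical table names "temp_pages" and "pages" with identical schemas. It is written as a map-only job whose Puts go straight to TableOutputFormat, one way to avoid the shuffle and reduce phase that a plain copy does not really need; the thread's version does the write in a reduce phase instead.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.IdentityTableReducer;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;

public class CopyTempToWhole {

  static class CopyMapper extends TableMapper<ImmutableBytesWritable, Put> {
    @Override
    protected void map(ImmutableBytesWritable row, Result values, Context context)
        throws IOException, InterruptedException {
      // Rebuild the row as a Put; both tables share the same schema.
      Put put = new Put(row.get());
      for (KeyValue kv : values.raw()) {
        put.add(kv);
      }
      context.write(row, put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new HBaseConfiguration();
    Job job = new Job(conf, "copy temp_pages to pages");
    job.setJarByClass(CopyTempToWhole.class);

    // Full scan of the temp table; one map task per region.
    Scan scan = new Scan();
    TableMapReduceUtil.initTableMapperJob("temp_pages", scan, CopyMapper.class,
        ImmutableBytesWritable.class, Put.class, job);

    // Points TableOutputFormat at the destination table; with zero reduce
    // tasks the mappers' Puts are written to it directly.
    TableMapReduceUtil.initTableReducerJob("pages", IdentityTableReducer.class, job);
    job.setNumReduceTasks(0);
    job.waitForCompletion(true);
  }
}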
> > > > > ________________________________________
> > > > > From: [email protected] [[email protected]] on behalf
> > > > > of Jean-Daniel Cryans [[email protected]]
> > > > > Sent: December 25, 2009, 3:47 PM
> > > > > To: [email protected]
> > > > > Subject: Re: Looking for a better design
> > > > >
> > > > > If you have the date of the crawl stored in the table, you could
> > > > > set a filter on the Scan object to only scan the rows for a
> > > > > certain day.
> > > > >
> > > > > Also, just to be sure, are you using MapReduce to process the
> > > > > tables?
> > > > >
> > > > > J-D
> > > > >
> > > > > 2009/12/24 Xin Jing <[email protected]>:
> > > > > > The reason to save the new data into a temp table is that we
> > > > > > provide the processed data in an incremental manner, providing
> > > > > > new data every day. But we may process the whole data again
> > > > > > some day on demand. If we save the new data into the whole
> > > > > > table, it is hard for us to tell which pages are new. We can,
> > > > > > of course, use a flag to mark the status of the data, but I am
> > > > > > afraid performance may suffer when scanning that data out of a
> > > > > > big table.
> > > > > >
> > > > > > Thanks
> > > > > > - Xin
> > > > > > ________________________________________
> > > > > > From: [email protected] [[email protected]] on
> > > > > > behalf of Jean-Daniel Cryans [[email protected]]
> > > > > > Sent: December 25, 2009, 3:39 PM
> > > > > > To: [email protected]
> > > > > > Subject: Re: Looking for a better design
> > > > > >
> > > > > > What's the reason for first importing into a temp table and not
> > > > > > directly into the whole table?
> > > > > >
> > > > > > Also, to improve performance I recommend reading
> > > > > > http://wiki.apache.org/hadoop/PerformanceTuning
> > > > > >
> > > > > > J-D
> > > > > >
> > > > > > 2009/12/24 Xin Jing <[email protected]>:
> > > > > > > Hi All,
> > > > > > >
> > > > > > > We are processing a large number of web pages, crawling about
> > > > > > > 2 million pages from the internet every day. After processing
> > > > > > > the new data, we save it all.
> > > > > > >
> > > > > > > Our current design is:
> > > > > > > 1. create a temp table and a whole table; the table structure
> > > > > > >    is exactly the same
> > > > > > > 2. import the new data into the temp table, and process it
> > > > > > > 3. dump all the data from the temp table into the whole table
> > > > > > > 4. clean the temp table
> > > > > > >
> > > > > > > It works, but the performance is not good: step 3 takes a
> > > > > > > very long time. We use map-reduce to transfer the data from
> > > > > > > the temp table into the whole table, but its performance is
> > > > > > > too slow. We think there might be something wrong in our
> > > > > > > design, so I am looking for a better design for this task, or
> > > > > > > some hints on the processing.
> > > > > > >
> > > > > > > Thanks
> > > > > > > - Xin
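To illustrate J-D's suggestion about filtering the Scan, here is a minimal sketch that restricts a scan to one day's rows, assuming a hypothetical "meta:crawl_date" column that stores the crawl date as a string. The filter runs on the region servers, so the table is still read there, but only matching rows come back to the mappers.

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class DailyScan {

  // Build a Scan that only returns rows whose (hypothetical) meta:crawl_date
  // column equals the given day, e.g. "2009-12-24". Hand the result to
  // TableMapReduceUtil.initTableMapperJob(...) in place of a full scan.
  public static Scan forDay(String day) {
    SingleColumnValueFilter onDay = new SingleColumnValueFilter(
        Bytes.toBytes("meta"), Bytes.toBytes("crawl_date"),
        CompareOp.EQUAL, Bytes.toBytes(day));
    onDay.setFilterIfMissing(true);  // skip rows that have no crawl_date
    Scan scan = new Scan();
    scan.setFilter(onDay);
    return scan;
  }
}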
