I think he means http://jobtracker_ip:50030/jobtracker.jsp
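For what it's worth, the same information is also available from the command line on a 0.20-era Hadoop install (the job id below is only a placeholder):

    hadoop job -list                              # list running jobs and their ids
    hadoop job -status job_200912251032_0001      # map/reduce completion and counters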
2009/12/25 Xin Jing <[email protected]>
> Good point, we will check the map-reduce numbers for the performance issue.
>
> Could you point me to a location where I can learn how to use the job tracker?
>
> Thanks
> - Xin
> ________________________________________
> From: Jeff Zhang [[email protected]]
> Sent: December 25, 2009 4:32 PM
> To: [email protected]
> Subject: Re: Looking for a better design
>
> You can look at the job tracker web UI to get the number of your mappers.
> And how many nodes are in your cluster? I do not think it should take
> several hours to transfer 2 million pages; I doubt you have only one mapper
> processing all 2 million pages.
>
> Jeff Zhang
>
> 2009/12/25 Xin Jing <[email protected]>
>
> > I am not quite sure how many mapper tasks there are during the map-reduce
> > job. We are using the default partition function, with the URL as the row
> > key, and the default mapper setup. It takes several hours to finish the
> > job. We have only run it once, found the performance issue, and are now
> > asking whether there is a better solution. We will get more experimental
> > numbers later...
> >
> > Thanks
> > - Xin
> >
> > _______________________________________
> > From: Jeff Zhang [[email protected]]
> > Sent: December 25, 2009 3:59 PM
> > To: [email protected]
> > Subject: Re: Looking for a better design
> >
> > Hi Xin,
> >
> > How many mapper tasks do you get when you transfer the 2 million web
> > pages? And what is the job time?
> >
> > Jeff Zhang
> >
> > 2009/12/24 Xin Jing <[email protected]>
> >
> > > Yes, we have the date of the crawled data, and we can use a filter to
> > > select just the pages from a specific day. But it is not the row key,
> > > so applying the filter means scanning the whole table. The performance
> > > should be worse than saving the new data into a temp table, right?
> > >
> > > We are using map-reduce to transfer the processed data from the temp
> > > table into the whole table. The map-reduce job is simple: it selects
> > > the data in the map phase and imports it into the whole table in the
> > > reduce phase. Since the table definitions of the temp table and the
> > > whole table are exactly the same, I am wondering if there is a trick to
> > > switch the data in the temp table into the whole table directly, like
> > > partition tables in the database world.
> > >
> > > Thanks
> > > - Xin
> > > ________________________________________
> > > From: [email protected] [[email protected]] on behalf of Jean-Daniel Cryans [[email protected]]
> > > Sent: December 25, 2009 3:47 PM
> > > To: [email protected]
> > > Subject: Re: Looking for a better design
> > >
> > > If you have the date of the crawl stored in the table, you could set a
> > > filter on the Scan object to only scan the rows for a certain day.
> > >
> > > Also, just to be sure, are you using MapReduce to process the tables?
> > >
> > > J-D
> > >
> > > 2009/12/24 Xin Jing <[email protected]>:
> > > > The reason to save the new data into a temp table is that we provide
> > > > the processed data in an incremental manner, delivering new data
> > > > every day, but we may also reprocess the whole data set on demand.
> > > > If we save the new data directly into the whole table, it is hard for
> > > > us to tell which pages are new. We can, of course, use a flag to mark
> > > > the status of the data, but I am afraid performance would suffer when
> > > > scanning for it in a big table.
> > > >
> > > > Thanks
> > > > - Xin
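As a reference for the filter idea discussed above, a minimal sketch against the 0.20-era HBase client API might look like the following; the "meta" family and "crawl_date" qualifier are made-up names standing in for wherever the crawl day is actually stored:

    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.CompareFilter;
    import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class DailyScan {
      // Build a Scan restricted to rows whose (hypothetical) meta:crawl_date
      // cell equals the given day, e.g. "2009-12-24".
      public static Scan forDay(String day) {
        Scan scan = new Scan();
        SingleColumnValueFilter sameDay = new SingleColumnValueFilter(
            Bytes.toBytes("meta"),           // column family holding the crawl date (assumed)
            Bytes.toBytes("crawl_date"),     // qualifier storing the fetch day (assumed)
            CompareFilter.CompareOp.EQUAL,
            Bytes.toBytes(day));
        sameDay.setFilterIfMissing(true);    // skip rows that have no crawl_date cell at all
        scan.setFilter(sameDay);
        return scan;
      }
    }

Note that, as Xin points out, the filter is applied server-side while the scan still reads every row, so it mainly saves network transfer rather than disk I/O; putting the date in a row-key prefix is what would turn this into a true range scan.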
> > > > ________________________________________
> > > > From: [email protected] [[email protected]] on behalf of Jean-Daniel Cryans [[email protected]]
> > > > Sent: December 25, 2009 3:39 PM
> > > > To: [email protected]
> > > > Subject: Re: Looking for a better design
> > > >
> > > > What's the reason for first importing into a temp table and not
> > > > directly into the whole table?
> > > >
> > > > Also, to improve performance I recommend reading
> > > > http://wiki.apache.org/hadoop/PerformanceTuning
> > > >
> > > > J-D
> > > >
> > > > 2009/12/24 Xin Jing <[email protected]>:
> > > >> Hi All,
> > > >>
> > > >> We are processing a large number of web pages, crawling about 2
> > > >> million pages from the internet every day. After processing the new
> > > >> data, we save all of it.
> > > >>
> > > >> Our current design is:
> > > >> 1. Create a temp table and a whole table; the table structures are exactly the same.
> > > >> 2. Import the new data into the temp table and process it.
> > > >> 3. Dump all the data from the temp table into the whole table.
> > > >> 4. Clean the temp table.
> > > >>
> > > >> It works, but the performance is not good; step 3 takes a very long
> > > >> time. We use map-reduce to transfer the data from the temp table
> > > >> into the whole table, but it is too slow. We think there might be
> > > >> something wrong in our design, so I am looking for a better design
> > > >> for this task, or some hints on the processing.
> > > >>
> > > >> Thanks
> > > >> - Xin
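For step 3 above, a minimal sketch of this kind of table-to-table copy, written against the 0.20-era org.apache.hadoop.hbase.mapreduce API with "temp_table" and "whole_table" as placeholder names, could be a map-only job that rewrites every row as a Put against the destination table:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.mapreduce.Job;

    public class CopyTempToWhole {

      // Turns every row read from the source table into an identical Put
      // against the destination table.
      static class CopyMapper extends TableMapper<ImmutableBytesWritable, Put> {
        @Override
        protected void map(ImmutableBytesWritable row, Result columns, Context context)
            throws IOException, InterruptedException {
          Put put = new Put(row.get());
          for (KeyValue kv : columns.raw()) {
            put.add(kv);                        // copy each cell unchanged
          }
          context.write(row, put);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new HBaseConfiguration();
        Job job = new Job(conf, "copy temp_table into whole_table");
        job.setJarByClass(CopyTempToWhole.class);

        Scan scan = new Scan();
        scan.setCaching(500);                   // fetch rows in bigger batches per RPC

        TableMapReduceUtil.initTableMapperJob("temp_table", scan,
            CopyMapper.class, ImmutableBytesWritable.class, Put.class, job);
        TableMapReduceUtil.initTableReducerJob("whole_table", null, job);
        job.setNumReduceTasks(0);               // map-only: no shuffle, Puts go straight out

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Because TableInputFormat creates one map task per region of the source table, a copy like this only parallelizes as far as the temp table has regions; a temp table holding a single day of fresh data may have only a handful of regions, which would explain a small mapper count regardless of cluster size.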
