Good point, we will check the map-reduce numbers for the performance issue. Could you point me to a location where I can learn the usage of the job tracker?
Thanks
- Xin
________________________________________
From: Jeff Zhang [[email protected]]
Sent: December 25, 2009 4:32 PM
To: [email protected]
Subject: Re: Looking for a better design

You can look at the job tracker web UI to get the number of your mappers. And how many nodes are in your cluster? I do not think it should take several hours to transfer 2 million pages; I doubt you have only one mapper processing all 2 million pages.

Jeff Zhang

2009/12/25 Xin Jing <[email protected]>
> I am not quite sure how many mapper tasks run during the map-reduce job. We are using the default partition function, with the url as the row key, and the default mapper setup. It takes several hours to finish the job; we only ran it once, found the performance issue, and asked whether there is a better solution. We will get more experiment numbers later...
>
> Thanks
> - Xin
>
> ________________________________________
> From: Jeff Zhang [[email protected]]
> Sent: December 25, 2009 3:59 PM
> To: [email protected]
> Subject: Re: Looking for a better design
>
> Hi Xin,
>
> How many mapper tasks do you get when you transfer the 2 million web pages? And what is the job time?
>
> Jeff Zhang
>
> 2009/12/24 Xin Jing <[email protected]>
> > Yes, we have the date of the crawled data, and we can use a filter to select just those on a specific day. But it is not the row key, so applying the filter means scanning the whole table. The performance should be worse than saving the new data into a temp table, right?
> >
> > We are using map-reduce to transfer the processed data in the temp table into the whole table. The map-reduce job is simple: it selects the data in the map phase and imports the data into the whole table in the reduce phase. Since the table definition of the temp table and the whole table is exactly the same, I am wondering if there is a trick to switch the data in the temp table into the whole table directly, just like the partition table manner in the DB area.
> >
> > Thanks
> > - Xin
> > ________________________________________
> > From: [email protected] [[email protected]] on behalf of Jean-Daniel Cryans [[email protected]]
> > Sent: December 25, 2009 3:47 PM
> > To: [email protected]
> > Subject: Re: Looking for a better design
> >
> > If you have the date of the crawl stored in the table, you could set a filter on the Scan object to only scan the rows for a certain day.
> >
> > Also just to be sure, are you using MapReduce to process the tables?
> >
> > J-D
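(For reference, a day-restricted scan along the lines J-D suggests would look roughly like the sketch below. The "meta:crawl_date" column, the date format, and the "pages" table name are only placeholders for wherever the crawl date actually lives; this is an illustration, not our real code.)

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanOneDay {
  public static void main(String[] args) throws Exception {
    // Placeholder table and column names -- substitute the real schema.
    HTable table = new HTable(new HBaseConfiguration(), "pages");

    // Keep only rows whose meta:crawl_date cell equals the requested day.
    SingleColumnValueFilter byDay = new SingleColumnValueFilter(
        Bytes.toBytes("meta"), Bytes.toBytes("crawl_date"),
        CompareFilter.CompareOp.EQUAL, Bytes.toBytes("2009-12-24"));
    byDay.setFilterIfMissing(true);  // drop rows that carry no crawl_date at all

    Scan scan = new Scan();
    scan.setFilter(byDay);
    scan.setCaching(500);            // fetch rows in batches to cut down RPC round trips

    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result row : scanner) {
        // process one crawled page for that day
      }
    } finally {
      scanner.close();
    }
  }
}

(The filter is evaluated on the region servers, so it avoids shipping unwanted rows to the client, but as noted above it still has to read every row of the table because the date is not part of the row key.)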
> >
> > 2009/12/24 Xin Jing <[email protected]>:
> > > The reason to save the new data into a temp table is that we provide the processed data in an incremental manner, delivering new data every day, but we may process the whole data again some day on demand. If we save the new data directly into the whole table, it is hard for us to tell which pages are new. We can, of course, use a flag to mark the status of the data, but I am afraid performance may suffer if we have to scan a big table to find that data.
> > >
> > > Thanks
> > > - Xin
> > > ________________________________________
> > > From: [email protected] [[email protected]] on behalf of Jean-Daniel Cryans [[email protected]]
> > > Sent: December 25, 2009 3:39 PM
> > > To: [email protected]
> > > Subject: Re: Looking for a better design
> > >
> > > What's the reason for first importing into a temp table and not directly into the whole table?
> > >
> > > Also to improve performance I recommend reading
> > > http://wiki.apache.org/hadoop/PerformanceTuning
> > >
> > > J-D
> > >
> > > 2009/12/24 Xin Jing <[email protected]>:
> > >> Hi All,
> > >>
> > >> We are processing a big number of web pages, crawling about 2 million pages from the internet every day. After processing the new data, we save it all.
> > >>
> > >> Our current design is:
> > >> 1. create a temp table and a whole table, with exactly the same table structure
> > >> 2. import the new data into the temp table, and process it
> > >> 3. dump all the data from the temp table into the whole table
> > >> 4. clean the temp table
> > >>
> > >> It works, but the performance is not good: step 3 takes a very long time. We use map-reduce to transfer the data from the temp table into the whole table, but its performance is too slow. We think there might be something wrong in our design, so I am looking for a better design for this task, or some hints on the processing.
> > >>
> > >> Thanks
> > >> - Xin
> > >>
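To make step 3 concrete, here is a stripped-down version of the kind of copy job we run: the map phase reads every row of the temp table and the reduce phase writes it into the whole table as a Put. The table names temp_pages and whole_pages are placeholders for our real ones, and the sketch only illustrates the shape of the job, not the exact code.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.mapreduce.Job;

public class CopyTempToWhole {

  // Map phase: read every row of the temp table and pass it through unchanged.
  static class PageMapper extends TableMapper<ImmutableBytesWritable, Result> {
    protected void map(ImmutableBytesWritable rowKey, Result row, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(rowKey, row);
    }
  }

  // Reduce phase: turn each row into a Put against the whole table.
  static class PageReducer
      extends TableReducer<ImmutableBytesWritable, Result, ImmutableBytesWritable> {
    protected void reduce(ImmutableBytesWritable rowKey, Iterable<Result> rows, Context ctx)
        throws IOException, InterruptedException {
      for (Result row : rows) {
        Put put = new Put(rowKey.get());
        for (KeyValue kv : row.raw()) {
          put.add(kv);                       // copy every cell as-is
        }
        ctx.write(rowKey, put);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new HBaseConfiguration();
    Job job = new Job(conf, "copy temp_pages into whole_pages");
    job.setJarByClass(CopyTempToWhole.class);

    Scan scan = new Scan();
    scan.setCaching(500);                    // read the temp table in larger batches

    TableMapReduceUtil.initTableMapperJob("temp_pages", scan, PageMapper.class,
        ImmutableBytesWritable.class, Result.class, job);
    TableMapReduceUtil.initTableReducerJob("whole_pages", PageReducer.class, job);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

A job like this normally gets one map task per region of the temp table, and the job's page on the JobTracker web UI (by default on port 50030 of the JobTracker machine) shows how many map and reduce tasks actually ran, which is the number Jeff is asking about.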
