Hi Xin,

How many mapper tasks do you get when you transfer the 2 million web pages? And what is the job time?
Jeff Zhang

2009/12/24 Xin Jing <[email protected]>

> Yes, we have the date of the crawled data, and we can use a filter to just
> select those on a specific day. But it is not the row key, and applying the
> filter means scanning the whole table. The performance should be worse than
> saving the new data into a temp table, right?
>
> We are using MapReduce to transfer the processed data in the temp table
> into the whole table. The MapReduce job is simple: it selects the data in
> the map phase and imports it into the whole table in the reduce phase.
> Since the table definition of the temp table and the whole table is exactly
> the same, I am wondering if there is a trick to switch the data in the temp
> table into the whole table directly, just like partitioned tables in the
> database world.
>
> Thanks
> - Xin
> ________________________________________
> From: [email protected] [[email protected]] on behalf of Jean-Daniel Cryans [[email protected]]
> Sent: December 25, 2009 3:47 PM
> To: [email protected]
> Subject: Re: Looking for a better design
>
> If you have the date of the crawl stored in the table, you could set a
> filter on the Scan object to only scan the rows for a certain day.
>
> Also, just to be sure, are you using MapReduce to process the tables?
>
> J-D
>
> 2009/12/24 Xin Jing <[email protected]>:
> > The reason to save the new data into a temp table is that we provide the
> > processed data incrementally, adding new data every day, but we may need
> > to reprocess the whole data set on demand some day. If we saved the new
> > data directly into the whole table, it would be hard for us to tell which
> > pages are new. We could, of course, use a flag to mark the status of the
> > data, but I am afraid performance would suffer when scanning for that
> > flag across a big table.
> >
> > Thanks
> > - Xin
> > ________________________________________
> > From: [email protected] [[email protected]] on behalf of Jean-Daniel Cryans [[email protected]]
> > Sent: December 25, 2009 3:39 PM
> > To: [email protected]
> > Subject: Re: Looking for a better design
> >
> > What's the reason for first importing into a temp table and not
> > directly into the whole table?
> >
> > Also, to improve performance I recommend reading
> > http://wiki.apache.org/hadoop/PerformanceTuning
> >
> > J-D
> >
> > 2009/12/24 Xin Jing <[email protected]>:
> >> Hi All,
> >>
> >> We are processing a large number of web pages, crawling about 2 million
> >> pages from the internet every day. After processing the new data, we
> >> save all of it.
> >>
> >> Our current design is:
> >> 1. Create a temp table and a whole table; the table structures are
> >> exactly the same.
> >> 2. Import the new data into the temp table and process it there.
> >> 3. Dump all the data from the temp table into the whole table.
> >> 4. Clean the temp table.
> >>
> >> It works, but the performance is not good: step 3 takes a very long
> >> time. We use MapReduce to transfer the data from the temp table into
> >> the whole table, but it is too slow. We think there might be something
> >> wrong with our design, so I am looking for a better design for this
> >> task, or some hints on the processing.
> >>
> >> Thanks
> >> - Xin
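For reference, here is roughly what J-D's filter suggestion looks like against the HBase 0.20 client API. This is only a sketch: the table name ("pages"), the column family and qualifier ("meta:crawl_date"), and the date format are assumptions for illustration, not details taken from the thread.

import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanOneDay {
  public static void main(String[] args) throws IOException {
    // Assumed table and column names; substitute your own schema.
    HTable table = new HTable(new HBaseConfiguration(), "pages");

    // Keep only rows whose crawl_date column equals the requested day.
    Scan scan = new Scan();
    SingleColumnValueFilter filter = new SingleColumnValueFilter(
        Bytes.toBytes("meta"), Bytes.toBytes("crawl_date"),
        CompareOp.EQUAL, Bytes.toBytes("2009-12-24"));
    filter.setFilterIfMissing(true); // skip rows with no crawl_date cell
    scan.setFilter(filter);

    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result row : scanner) {
        // process one page crawled on 2009-12-24
      }
    } finally {
      scanner.close();
    }
  }
}

Note that the filter still reads every row on the region servers; it only avoids shipping non-matching rows to the client, so Xin is right that it cannot match a row-key lookup. If the crawl date were also used as the cell timestamp, Scan.setTimeRange() would give the same selection without a filter.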
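And here is a rough sketch of the temp-table-to-whole-table copy job the thread describes, again with assumed table names ("temp_pages" and "pages"). One detail worth calling out: the copy itself does not need a reduce phase. Running the job map-only, with the Puts written straight to the destination table, skips the shuffle and sort, which is often where such jobs spend most of their time.

import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;

public class CopyTempToWhole {

  // Re-emit every row of the source table as a Put for the destination table.
  static class CopyMapper extends TableMapper<ImmutableBytesWritable, Put> {
    @Override
    public void map(ImmutableBytesWritable row, Result values, Context context)
        throws IOException, InterruptedException {
      Put put = new Put(row.get());
      for (KeyValue kv : values.raw()) {
        put.add(kv); // copy each cell unchanged
      }
      context.write(row, put);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new HBaseConfiguration(), "copy temp_pages into pages");
    job.setJarByClass(CopyTempToWhole.class);

    Scan scan = new Scan();
    scan.setCaching(500);       // fetch rows from the scanner in bigger batches
    scan.setCacheBlocks(false); // don't evict hot data from the block cache

    TableMapReduceUtil.initTableMapperJob("temp_pages", scan,
        CopyMapper.class, ImmutableBytesWritable.class, Put.class, job);
    // Null reducer plus zero reduce tasks: the mappers' Puts are written
    // straight to "pages", skipping the shuffle/sort entirely.
    TableMapReduceUtil.initTableReducerJob("pages", null, job);
    job.setNumReduceTasks(0);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}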
