If you have the date of the crawl stored in the table, you could set a filter on the Scan object to only scan the rows for a certain day.
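A minimal sketch of that suggestion against the classic HTable client API, assuming (hypothetically) that the crawl date is kept in a meta:crawl_date column of a table named "pages" as a yyyy-MM-dd string:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.CompareFilter;
    import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ScanOneDay {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "pages");      // hypothetical table name

        Scan scan = new Scan();
        scan.addFamily(Bytes.toBytes("content"));      // hypothetical family with the page data

        // Only return rows whose crawl-date column matches the requested day.
        SingleColumnValueFilter byDay = new SingleColumnValueFilter(
            Bytes.toBytes("meta"),                     // hypothetical family holding the date
            Bytes.toBytes("crawl_date"),               // hypothetical qualifier
            CompareFilter.CompareOp.EQUAL,
            Bytes.toBytes("2009-12-24"));
        byDay.setFilterIfMissing(true);                // skip rows that have no crawl date at all
        scan.setFilter(byDay);

        ResultScanner scanner = table.getScanner(scan);
        try {
          for (Result r : scanner) {
            // process one newly crawled page here
          }
        } finally {
          scanner.close();
          table.close();
        }
      }
    }

If the crawl date were instead encoded in the row key prefix or in the cell timestamps, a start/stop row or Scan.setTimeRange() would avoid touching most of the table at all.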
Also, just to be sure, are you using MapReduce to process the tables?

J-D

2009/12/24 Xin Jing <[email protected]>:
> The reason to save the new data into a temp table is that we provide the
> processed data incrementally, delivering new data every day. But we may
> need to process the whole data set again some day, on demand. If we save
> the new data directly into the whole table, it is hard for us to tell which
> pages are new. We could, of course, use a flag to mark the status of the
> data, but I am afraid performance would suffer if we had to scan for it
> across a large table.
>
> Thanks
> - Xin
> ________________________________________
> From: [email protected] [[email protected]] on behalf of
> Jean-Daniel Cryans [[email protected]]
> Sent: December 25, 2009, 3:39 PM
> To: [email protected]
> Subject: Re: Looking for a better design
>
> What's the reason for first importing into a temp table and not
> directly into the whole table?
>
> Also, to improve performance I recommend reading
> http://wiki.apache.org/hadoop/PerformanceTuning
>
> J-D
>
> 2009/12/24 Xin Jing <[email protected]>:
>> Hi All,
>>
>> We are processing a large number of web pages, crawling about 2 million
>> pages from the internet every day. After processing the new data, we save
>> all of it.
>>
>> Our current design is:
>> 1. Create a temp table and a whole table with exactly the same structure.
>> 2. Import the new data into the temp table and process it.
>> 3. Dump all the data from the temp table into the whole table.
>> 4. Clean the temp table.
>>
>> It works, but the performance is not good: step 3 takes a very long time.
>> We use MapReduce to transfer the data from the temp table into the whole
>> table, but it is too slow. We think there might be something wrong with
>> our design, so I am looking for a better design for this task, or some
>> hints on the processing.
>>
>> Thanks
>> - Xin
>>
>
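For reference, the step-3 transfer described in the original post is usually written as a table-to-table MapReduce job. Below is a minimal sketch against the HBase 0.90-era mapreduce API, with hypothetical table names "temp_pages" and "pages":

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.mapreduce.Job;

    public class CopyTempToWhole {

      // Mapper: re-emit every cell of a temp-table row as a Put for the whole table.
      static class CopyMapper extends TableMapper<ImmutableBytesWritable, Put> {
        @Override
        protected void map(ImmutableBytesWritable rowKey, Result row, Context context)
            throws IOException, InterruptedException {
          Put put = new Put(rowKey.get());
          for (KeyValue kv : row.raw()) {
            put.add(kv);                   // keep family, qualifier, timestamp, value
          }
          context.write(rowKey, put);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "copy temp_pages -> pages");  // hypothetical table names
        job.setJarByClass(CopyTempToWhole.class);

        Scan scan = new Scan();
        scan.setCaching(500);              // fetch more rows per RPC
        scan.setCacheBlocks(false);        // don't pollute the block cache during the copy

        TableMapReduceUtil.initTableMapperJob(
            "temp_pages", scan, CopyMapper.class,
            ImmutableBytesWritable.class, Put.class, job);
        // Map-only job: writes go straight to the destination table, no reduce phase.
        TableMapReduceUtil.initTableReducerJob("pages", null, job);
        job.setNumReduceTasks(0);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Newer HBase releases ship an org.apache.hadoop.hbase.mapreduce.CopyTable tool built on the same pattern. Either way, the cost of step 3 is re-writing every cell of the day's crawl, which is why importing directly into the whole table and filtering by crawl date, as suggested above, avoids the copy entirely.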
