If you have the date of the crawl stored in the table, you could set a
filter on the Scan object to only scan the rows for a certain day.
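
Something like this would do it (a rough sketch against the Java client
API of that era; I'm assuming a "pages" table with the crawl date stored
as a string in a "meta:crawl_date" column -- adjust the names to wherever
you actually keep the date):

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanNewPages {
  public static void main(String[] args) throws IOException {
    HTable table = new HTable(new HBaseConfiguration(), "pages");

    // Only keep rows whose crawl date matches the day we care about.
    SingleColumnValueFilter filter = new SingleColumnValueFilter(
        Bytes.toBytes("meta"), Bytes.toBytes("crawl_date"),
        CompareOp.EQUAL, Bytes.toBytes("2009-12-24"));
    // Also skip rows that don't have the column at all.
    filter.setFilterIfMissing(true);

    Scan scan = new Scan();
    scan.setFilter(filter);

    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result result : scanner) {
        // process only that day's pages here
      }
    } finally {
      scanner.close();
    }
  }
}

If you process with MapReduce, you can hand that same Scan to
TableMapReduceUtil.initTableMapperJob so the job only reads that day's
rows instead of the whole table.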

Also just to be sure, are you using MapReduce to process the tables?

J-D

2009/12/24 Xin Jing <[email protected]>:
> The reason to save the new data into a temp table is that we provide the
> processed data in an incremental manner, delivering new data every day. But
> we may need to process the whole data set again some day on demand. If we
> save the new data directly into the whole table, it is hard for us to tell
> which pages are new. We could, of course, use a flag to mark the status of
> the data, but I am afraid performance may suffer if we have to scan for
> that flag across a very big table.
>
> Thanks
> - Xin
> ________________________________________
> From: [email protected] [[email protected]] on behalf of Jean-Daniel Cryans
> [[email protected]]
> Sent: December 25, 2009 3:39 PM
> To: [email protected]
> Subject: Re: Looking for a better design
>
> What's the reason for first importing into a temp table and not
> directly into the whole table?
>
> Also to improve performance I recommend reading
> http://wiki.apache.org/hadoop/PerformanceTuning
>
> J-D
>
> 2009/12/24 Xin Jing <[email protected]>:
>> Hi All,
>>
>> We are processing a large number of web pages, crawling about 2 million
>> pages from the internet every day. After processing the new data, we save
>> it all.
>>
>> Our current design is:
>> 1. create a temp table and a whole table; the table structures are exactly
>> the same
>> 2. import the new data into the temp table, and process it
>> 3. dump all the data from temp table into the whole table
>> 4. clean the temp table
>>
>> It works, but the performance is not good: step 3 takes a very long time.
>> We use MapReduce to transfer the data from the temp table into the whole
>> table, but it is too slow. We think there might be something wrong in our
>> design, so I am looking for a better design for this task, or some hints
>> on the processing.
>>
>> Thanks
>> - Xin
>>
>
>
