The reason for saving the new data into a temp table is that we deliver the
processed data incrementally, adding new data every day, but we may also have
to reprocess the whole data set on demand some day. If we saved the new data
directly into the whole table, it would be hard to tell which pages are new.
We could, of course, use a flag to mark the status of each row, but I am
afraid performance would suffer if we had to scan a big table for that flag.
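
If we used a status flag instead, reading back only the new rows would mean a
filtered scan, roughly like the sketch below (just a sketch against the 0.20
client API; the table name "pages" and the "meta:status" column are made up
for this mail):

    import java.io.IOException;

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
    import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ScanNewPages {
      public static void main(String[] args) throws IOException {
        HTable table = new HTable(new HBaseConfiguration(), "pages");

        // Only hand back rows whose meta:status column equals "new".
        Scan scan = new Scan();
        scan.setFilter(new SingleColumnValueFilter(
            Bytes.toBytes("meta"), Bytes.toBytes("status"),
            CompareOp.EQUAL, Bytes.toBytes("new")));

        ResultScanner scanner = table.getScanner(scan);
        try {
          for (Result row : scanner) {
            // process the new page here ...
          }
        } finally {
          scanner.close();
        }
      }
    }

Even then, the filter only trims what is returned to the client; the region
servers still have to read every row, and that full scan over a big table is
exactly the cost I am worried about.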

Thanks
- Xin
________________________________________
发件人: [email protected] [[email protected]] 代表 Jean-Daniel Cryans 
[[email protected]]
发送时间: 2009年12月25日 3:39 下午
收件人: [email protected]
主题: Re: Looking for a better design

What's the reason for first importing into a temp table and not
directly into the whole table?

Also, to improve performance, I recommend reading
http://wiki.apache.org/hadoop/PerformanceTuning

J-D

2009/12/24 Xin Jing <[email protected]>:
> Hi All,
>
> We are processing a large number of web pages, crawling about 2 million pages 
> from the internet every day. After processing the new data, we save all of it.
>
> Our current design is:
> 1. create a temp table and a whole table; the two tables have exactly the same structure.
> 2. import the new data into the temp table, and process it there
> 3. dump all the data from the temp table into the whole table (a sketch of this copy job is below)
> 4. clean out the temp table
>
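> For reference, here is a rough sketch of how such a copy job can be written
> against the HBase 0.20 MapReduce API (the table names "temp_pages" /
> "whole_pages" and the class names are made up for this mail):
>
>     import java.io.IOException;
>
>     import org.apache.hadoop.hbase.HBaseConfiguration;
>     import org.apache.hadoop.hbase.KeyValue;
>     import org.apache.hadoop.hbase.client.Put;
>     import org.apache.hadoop.hbase.client.Result;
>     import org.apache.hadoop.hbase.client.Scan;
>     import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
>     import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
>     import org.apache.hadoop.hbase.mapreduce.TableMapper;
>     import org.apache.hadoop.mapreduce.Job;
>
>     public class CopyTempToWhole {
>
>       // Re-wraps every cell of a source row into a Put for the target table.
>       static class CopyMapper extends TableMapper<ImmutableBytesWritable, Put> {
>         protected void map(ImmutableBytesWritable row, Result value,
>             Context context) throws IOException, InterruptedException {
>           Put put = new Put(row.get());
>           for (KeyValue kv : value.raw()) {
>             put.add(kv);
>           }
>           context.write(row, put);
>         }
>       }
>
>       public static void main(String[] args) throws Exception {
>         Job job = new Job(new HBaseConfiguration(), "copy temp -> whole");
>         job.setJarByClass(CopyTempToWhole.class);
>
>         Scan scan = new Scan();
>         scan.setCaching(500);        // fetch more rows per RPC
>         scan.setCacheBlocks(false);  // don't churn the block cache from MR
>
>         TableMapReduceUtil.initTableMapperJob("temp_pages", scan,
>             CopyMapper.class, ImmutableBytesWritable.class, Put.class, job);
>         // A null reducer plus zero reduce tasks writes the mapper's Puts
>         // straight into the target table.
>         TableMapReduceUtil.initTableReducerJob("whole_pages", null, job);
>         job.setNumReduceTasks(0);
>
>         System.exit(job.waitForCompletion(true) ? 0 : 1);
>       }
>     }
>
> With zero reduce tasks the job is map-only, so each mapper writes its Puts
> directly to the target table instead of shuffling them through a reduce
> phase first.
>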
> It works, but the performance is not good: step 3 takes a very long time. 
> We use MapReduce to transfer the data from the temp table into the whole 
> table, but it is too slow. We suspect there is something wrong with our 
> design, so I am looking for a better design for this task, or some hints 
> on the processing.
>
> Thanks
> - Xin
>
