Hi Xin,

How many mapper tasks do you get when you transfer the 2 million web pages? And what is the job time?
Jeff Zhang

2009/12/24 Xin Jing <[email protected]>

> Yes, we have the date of the crawled data, and we can use a filter to just
> select those on a specific day. But it is not the row key, and applying the
> filter means scanning the whole table. The performance should be worse than
> saving the new data into a temp table, right?
>
> We are using MapReduce to transfer the processed data in the temp table
> into the whole table. The MapReduce job is simple: it selects the data in
> the map phase and imports it into the whole table in the reduce phase.
> Since the table definition of the temp table and the whole table is exactly
> the same, I am wondering if there is a trick to switch the data in the temp
> table into the whole table directly, just like partitioned tables in the
> database world.
>
> Thanks
> - Xin
> ________________________________________
> From: [email protected] [[email protected]] on behalf of Jean-Daniel Cryans [[email protected]]
> Sent: December 25, 2009 3:47 PM
> To: [email protected]
> Subject: Re: Looking for a better design
>
> If you have the date of the crawl stored in the table, you could set a
> filter on the Scan object to only scan the rows for a certain day.
>
> Also, just to be sure, are you using MapReduce to process the tables?
>
> J-D
>
> 2009/12/24 Xin Jing <[email protected]>:
> > The reason to save the new data into a temp table is that we provide the
> > processed data incrementally, adding new data every day, but we may need
> > to reprocess the whole data set on demand some day. If we saved the new
> > data directly into the whole table, it would be hard for us to tell which
> > pages are new. We could, of course, use a flag to mark the status of the
> > data, but I am afraid performance would suffer when scanning for that
> > flag across a big table.
> >
> > Thanks
> > - Xin
> > ________________________________________
> > From: [email protected] [[email protected]] on behalf of Jean-Daniel Cryans [[email protected]]
> > Sent: December 25, 2009 3:39 PM
> > To: [email protected]
> > Subject: Re: Looking for a better design
> >
> > What's the reason for first importing into a temp table and not
> > directly into the whole table?
> >
> > Also, to improve performance I recommend reading
> > http://wiki.apache.org/hadoop/PerformanceTuning
> >
> > J-D
> >
> > 2009/12/24 Xin Jing <[email protected]>:
> >> Hi All,
> >>
> >> We are processing a large number of web pages, crawling about 2 million
> >> pages from the internet every day. After processing the new data, we
> >> save all of it.
> >>
> >> Our current design is:
> >> 1. Create a temp table and a whole table; the table structures are
> >> exactly the same.
> >> 2. Import the new data into the temp table and process it there.
> >> 3. Dump all the data from the temp table into the whole table.
> >> 4. Clean the temp table.
> >>
> >> It works, but the performance is not good: step 3 takes a very long
> >> time. We use MapReduce to transfer the data from the temp table into
> >> the whole table, but it is too slow. We think there might be something
> >> wrong with our design, so I am looking for a better design for this
> >> task, or some hints on the processing.
> >>
> >> Thanks
> >> - Xin
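For reference, here is roughly what J-D's filter suggestion looks like against the HBase 0.20 client API. This is only a sketch: the table name ("pages"), the column family and qualifier ("meta:crawl_date"), and the date format are assumptions for illustration, not details taken from the thread.

import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanOneDay {
  public static void main(String[] args) throws IOException {
    // Assumed table and column names; substitute your own schema.
    HTable table = new HTable(new HBaseConfiguration(), "pages");

    // Keep only rows whose crawl_date column equals the requested day.
    Scan scan = new Scan();
    SingleColumnValueFilter filter = new SingleColumnValueFilter(
        Bytes.toBytes("meta"), Bytes.toBytes("crawl_date"),
        CompareOp.EQUAL, Bytes.toBytes("2009-12-24"));
    filter.setFilterIfMissing(true); // skip rows with no crawl_date cell
    scan.setFilter(filter);

    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result row : scanner) {
        // process one page crawled on 2009-12-24
      }
    } finally {
      scanner.close();
    }
  }
}

Note that the filter still reads every row on the region servers; it only avoids shipping non-matching rows to the client, so Xin is right that it cannot match a row-key lookup. If the crawl date were also used as the cell timestamp, Scan.setTimeRange() would give the same selection without a filter.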
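And here is a rough sketch of the temp-table-to-whole-table copy job the thread describes, again with assumed table names ("temp_pages" and "pages"). One detail worth calling out: the copy itself does not need a reduce phase. Running the job map-only, with the Puts written straight to the destination table, skips the shuffle and sort, which is often where such jobs spend most of their time.

import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;

public class CopyTempToWhole {

  // Re-emit every row of the source table as a Put for the destination table.
  static class CopyMapper extends TableMapper<ImmutableBytesWritable, Put> {
    @Override
    public void map(ImmutableBytesWritable row, Result values, Context context)
        throws IOException, InterruptedException {
      Put put = new Put(row.get());
      for (KeyValue kv : values.raw()) {
        put.add(kv); // copy each cell unchanged
      }
      context.write(row, put);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new HBaseConfiguration(), "copy temp_pages into pages");
    job.setJarByClass(CopyTempToWhole.class);

    Scan scan = new Scan();
    scan.setCaching(500);       // fetch rows from the scanner in bigger batches
    scan.setCacheBlocks(false); // don't evict hot data from the block cache

    TableMapReduceUtil.initTableMapperJob("temp_pages", scan,
        CopyMapper.class, ImmutableBytesWritable.class, Put.class, job);
    // Null reducer plus zero reduce tasks: the mappers' Puts are written
    // straight to "pages", skipping the shuffle/sort entirely.
    TableMapReduceUtil.initTableReducerJob("pages", null, job);
    job.setNumReduceTasks(0);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}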
