Good point, we will check the map-reduce numbers for the performance issue. Could you point me to a location where I can learn the usage of the job tracker?
Thanks
- Xin
________________________________________
From: Jeff Zhang [[email protected]]
Sent: December 25, 2009 4:32 PM
To: [email protected]
Subject: Re: Looking for a better design

You can look at the job tracker web UI to get the number of your mappers. And how many nodes are in your cluster? I do not think it should take several hours to transfer 2 million pages; I doubt you have only one mapper processing all 2 million pages.

Jeff Zhang

2009/12/25 Xin Jing <[email protected]>
> I am not quite sure how many mapper tasks run during the map-reduce job. We are using the default partition function, with the url as the row key, and the default mapper setup. It takes several hours to finish the job; we only ran it once, found the performance issue, and asked whether there is a better solution. We will get more experiment numbers later...
>
> Thanks
> - Xin
>
> ________________________________________
> From: Jeff Zhang [[email protected]]
> Sent: December 25, 2009 3:59 PM
> To: [email protected]
> Subject: Re: Looking for a better design
>
> Hi Xin,
>
> How many mapper tasks do you get when you transfer the 2 million web pages? And what is the job time?
>
> Jeff Zhang
>
> 2009/12/24 Xin Jing <[email protected]>
> > Yes, we have the date of the crawled data, and we can use a filter to select just those on a specific day. But it is not the row key, so applying the filter means scanning the whole table. The performance should be worse than saving the new data into a temp table, right?
> >
> > We are using map-reduce to transfer the processed data in the temp table into the whole table. The map-reduce job is simple: it selects the data in the map phase and imports the data into the whole table in the reduce phase. Since the table definition of the temp table and the whole table is exactly the same, I am wondering if there is a trick to switch the data in the temp table into the whole table directly, just like the partition table manner in the DB area.
> >
> > Thanks
> > - Xin
> > ________________________________________
> > From: [email protected] [[email protected]] on behalf of Jean-Daniel Cryans [[email protected]]
> > Sent: December 25, 2009 3:47 PM
> > To: [email protected]
> > Subject: Re: Looking for a better design
> >
> > If you have the date of the crawl stored in the table, you could set a filter on the Scan object to only scan the rows for a certain day.
> >
> > Also just to be sure, are you using MapReduce to process the tables?
> >
> > J-D
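(For reference, a day-restricted scan along the lines J-D suggests would look roughly like the sketch below. The "meta:crawl_date" column, the date format, and the "pages" table name are only placeholders for wherever the crawl date actually lives; this is an illustration, not our real code.)

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanOneDay {
  public static void main(String[] args) throws Exception {
    // Placeholder table and column names -- substitute the real schema.
    HTable table = new HTable(new HBaseConfiguration(), "pages");

    // Keep only rows whose meta:crawl_date cell equals the requested day.
    SingleColumnValueFilter byDay = new SingleColumnValueFilter(
        Bytes.toBytes("meta"), Bytes.toBytes("crawl_date"),
        CompareFilter.CompareOp.EQUAL, Bytes.toBytes("2009-12-24"));
    byDay.setFilterIfMissing(true);  // drop rows that carry no crawl_date at all

    Scan scan = new Scan();
    scan.setFilter(byDay);
    scan.setCaching(500);            // fetch rows in batches to cut down RPC round trips

    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result row : scanner) {
        // process one crawled page for that day
      }
    } finally {
      scanner.close();
    }
  }
}

(The filter is evaluated on the region servers, so it avoids shipping unwanted rows to the client, but as noted above it still has to read every row of the table because the date is not part of the row key.)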
> >
> > 2009/12/24 Xin Jing <[email protected]>:
> > > The reason to save the new data into a temp table is that we provide the processed data in an incremental manner, delivering new data every day, but we may process the whole data again some day on demand. If we save the new data directly into the whole table, it is hard for us to tell which pages are new. We can, of course, use a flag to mark the status of the data, but I am afraid performance may suffer if we have to scan a big table to find that data.
> > >
> > > Thanks
> > > - Xin
> > > ________________________________________
> > > From: [email protected] [[email protected]] on behalf of Jean-Daniel Cryans [[email protected]]
> > > Sent: December 25, 2009 3:39 PM
> > > To: [email protected]
> > > Subject: Re: Looking for a better design
> > >
> > > What's the reason for first importing into a temp table and not directly into the whole table?
> > >
> > > Also to improve performance I recommend reading
> > > http://wiki.apache.org/hadoop/PerformanceTuning
> > >
> > > J-D
> > >
> > > 2009/12/24 Xin Jing <[email protected]>:
> > >> Hi All,
> > >>
> > >> We are processing a big number of web pages, crawling about 2 million pages from the internet every day. After processing the new data, we save it all.
> > >>
> > >> Our current design is:
> > >> 1. create a temp table and a whole table, with exactly the same table structure
> > >> 2. import the new data into the temp table, and process it
> > >> 3. dump all the data from the temp table into the whole table
> > >> 4. clean the temp table
> > >>
> > >> It works, but the performance is not good: step 3 takes a very long time. We use map-reduce to transfer the data from the temp table into the whole table, but its performance is too slow. We think there might be something wrong in our design, so I am looking for a better design for this task, or some hints on the processing.
> > >>
> > >> Thanks
> > >> - Xin
> > >>
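To make step 3 concrete, here is a stripped-down version of the kind of copy job we run: the map phase reads every row of the temp table and the reduce phase writes it into the whole table as a Put. The table names temp_pages and whole_pages are placeholders for our real ones, and the sketch only illustrates the shape of the job, not the exact code.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.mapreduce.TableReducer;
import org.apache.hadoop.mapreduce.Job;

public class CopyTempToWhole {

  // Map phase: read every row of the temp table and pass it through unchanged.
  static class PageMapper extends TableMapper<ImmutableBytesWritable, Result> {
    protected void map(ImmutableBytesWritable rowKey, Result row, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(rowKey, row);
    }
  }

  // Reduce phase: turn each row into a Put against the whole table.
  static class PageReducer
      extends TableReducer<ImmutableBytesWritable, Result, ImmutableBytesWritable> {
    protected void reduce(ImmutableBytesWritable rowKey, Iterable<Result> rows, Context ctx)
        throws IOException, InterruptedException {
      for (Result row : rows) {
        Put put = new Put(rowKey.get());
        for (KeyValue kv : row.raw()) {
          put.add(kv);                       // copy every cell as-is
        }
        ctx.write(rowKey, put);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new HBaseConfiguration();
    Job job = new Job(conf, "copy temp_pages into whole_pages");
    job.setJarByClass(CopyTempToWhole.class);

    Scan scan = new Scan();
    scan.setCaching(500);                    // read the temp table in larger batches

    TableMapReduceUtil.initTableMapperJob("temp_pages", scan, PageMapper.class,
        ImmutableBytesWritable.class, Result.class, job);
    TableMapReduceUtil.initTableReducerJob("whole_pages", PageReducer.class, job);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

A job like this normally gets one map task per region of the temp table, and the job's page on the JobTracker web UI (by default on port 50030 of the JobTracker machine) shows how many map and reduce tasks actually ran, which is the number Jeff is asking about.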
