RE: How to fit the index into the memory for the web search benchmark

Volos Stavros Tue, 23 Oct 2012 01:22:00 -0700

Hi Hailong,

Try setting log4j.logger.org.apache.hadoop=INFO.


The class ipc.RPC is part of the hadoop distribution; hence, the above flag 
affects the logging level of this class.
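
For reference, a sketch of how the two logger lines from the attached log4j.properties would look after the change. It assumes the file is the conf/log4j.properties used by the backend search instance; the server presumably needs a restart to pick up the new level.

```properties
# conf/log4j.properties on the backend node (assumed location)
log4j.logger.org.apache.nutch=INFO
# Was WARN, which filters out INFO-level messages such as the ipc.RPC call traces
log4j.logger.org.apache.hadoop=INFO
```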

The Faban client sends its queries to the web interface; therefore, there 
shouldn't be any disparity.

I expect that you will see all the ipc.RPC calls as soon as you set the 
logger.hadoop flag to INFO.

Please let me know the outcome.

Regards,
-Stavros.
________________________________________
From: Hailong Yang [[email protected]]
Sent: Tuesday, October 23, 2012 1:12 AM
To: Volos Stavros
Cc: [email protected]; Lingjia Tang; Jason Mars
Subject: Re: How to fit the index into the memory for the web search benchmark

Yes, it is part of hadoop.log. I have attached the log4j.properties file. 
I noticed the following two lines, which look suspicious:
log4j.logger.org.apache.nutch=INFO
log4j.logger.org.apache.hadoop=WARN
So which one sets the log level of the backend during the search? If it is 
the latter, maybe we can change the level to INFO so that we can get more 
information.
BTW, are you using the built-in web portal to issue the queries? I looked 
through the source code of the front page (search.jsp) and found the 
following snippet.

   Hits hits;
   try {
     query.getParams().initFrom(start + hitsToRetrieve, hitsPerSite, "site",
                                sort, reverse);
     hits = bean.search(query);
   } catch (IOException e) {
     hits = new Hits(0, new Hit[0]);
   }
   int end = (int)Math.min(hits.getLength(), start + hitsPerPage);
   int length = end-start;
   int realEnd = (int)Math.min(hits.getLength(), start + hitsToRetrieve);

   Hit[] show = hits.getHits(start, realEnd-start);
   HitDetails[] details = bean.getDetails(show);
   Summary[] summaries = bean.getSummary(details, query);
   bean.LOG.info("total hits: " + hits.getTotal());

So that means if you are using the web portal to perform the search, you will 
invoke the getSummary method. However, I am only using the search driver 
provided by the Faban client, which contains no logic to access the raw data of 
the indexes. If that is the case, then it explains the disparity.

Best

Hailong

On Mon, Oct 22, 2012 at 3:30 PM, Volos Stavros 
<[email protected]> wrote:
Dear Hailong,

Is this a part of the hadoop.log content?

It looks like the logging parameters in the log4j.properties file are not 
properly set. Can you please attach the conf/log4j.properties file of the Nutch 
distribution you are using for search (the instance that is running on the 
backend node)?

This is what my hadoop.log looks like. As you can see, ipc.RPC logs all the 
calls (search, getDetails, getSummary):

2012-05-25 20:05:26,614 INFO  ipc.RPC - Call: search(complete)
2012-05-25 20:05:26,614 INFO  searcher.NutchBean - searching for 20 raw hits
2012-05-25 20:05:26,614 INFO  searcher.NutchBean - re-searching for 40 raw hits, query: complete -site:"ajc.stats.com"
2012-05-25 20:05:26,615 INFO  searcher.NutchBean - found 14155 raw hits
2012-05-25 20:05:26,615 DEBUG ipc.Server - Served: search queueTime= 186 procesingTime= 1
2012-05-25 20:05:26,615 INFO  ipc.RPC - Return: org.apache.nutch.searcher.Hits@1c210f97
2012-05-25 20:05:26,615 DEBUG ipc.Server - IPC Server Responder: responding to #462940 from 192.168.9.135:51920
2012-05-25 20:05:26,615 DEBUG ipc.Server - IPC Server Responder: responding to #462940 from 192.168.9.135:51920 Wrote 409 bytes.
2012-05-25 20:05:26,615 DEBUG ipc.Server - IPC Server handler 2 on 8890: has #462941 from 192.168.9.135:44300
2012-05-25 20:05:26,615 INFO  ipc.RPC - Call: getDetails([Lorg.apache.nutch.searcher.Hit;@3e869...
2012-05-25 20:05:26,615 DEBUG ipc.Server -  got #463094
2012-05-25 20:05:26,615 DEBUG ipc.Server - Served: getDetails queueTime= 187 procesingTime= 0
2012-05-25 20:05:26,615 DEBUG ipc.Server - Served: getSummary queueTime= 187 procesingTime= 4
2012-05-25 20:05:26,615 INFO  ipc.RPC - Return: [Lorg.apache.nutch.searcher.HitDetails;@7495195...
2012-05-25 20:05:26,615 DEBUG ipc.Server - IPC Server Responder: responding to #462941 from 192.168.9.135:44300
2012-05-25 20:05:26,615 DEBUG ipc.Server - IPC Server Responder: responding to #462941 from 192.168.9.135:44300 Wrote 3379 bytes.
2012-05-25 20:05:26,615 DEBUG ipc.Server - IPC Server handler 2 on 8890: has #462942 from 192.168.9.135:51921
2012-05-25 20:05:26,615 INFO  ipc.RPC - Call: getSummary([Lorg.apache.nutch.searcher.HitDetails...

Regards,
-Stavros.

On Oct 22, 2012, at 8:14 PM, Hailong Yang wrote:

Dear Stavros,

I could not verify what you said on the backend. What I saw on the backend was 
almost the same as on the frontend. There was no indication in the log file that 
the getSummary method was invoked. I also looked into the source code of the 
driver: for each query, it sends an HTTP request to the backend. However, the 
response may not necessarily include the contents in its message body. Could 
you point me to the logs where you see the detailed contents for each query?

2012-10-22 10:03:47,995 INFO  searcher.NutchBean - found 1 raw hits
2012-10-22 10:03:47,996 INFO  searcher.NutchBean - re-searching for 40 raw hits, query: personal "at a"
2012-10-22 10:03:47,997 INFO  searcher.NutchBean - found 1 raw hits
2012-10-22 10:03:47,997 INFO  searcher.NutchBean - re-searching for 80 raw hits, query: personal "at a"
2012-10-22 10:03:47,997 INFO  searcher.NutchBean - found 1 raw hits
2012-10-22 10:03:47,997 INFO  searcher.NutchBean - re-searching for 160 raw hits, query: personal "at a"
2012-10-22 10:03:47,998 INFO  searcher.NutchBean - found 1 raw hits
2012-10-22 10:03:48,037 INFO  searcher.NutchBean - searching for 20 raw hits
2012-10-22 10:03:48,038 INFO  searcher.NutchBean - re-searching for 40 raw hits, query: personal "at a"
2012-10-22 10:03:48,038 INFO  searcher.NutchBean - found 1 raw hits
2012-10-22 10:03:48,038 INFO  searcher.NutchBean - re-searching for 80 raw hits, query: personal "at a"
2012-10-22 10:03:48,039 INFO  searcher.NutchBean - found 1 raw hits
2012-10-22 10:03:48,039 INFO  searcher.NutchBean - re-searching for 160 raw hits, query: personal "at a"
2012-10-22 10:03:48,040 INFO  searcher.NutchBean - found 1 raw hits
2012-10-22 10:03:48,041 INFO  searcher.NutchBean - searching for 20 raw hits
2012-10-22 10:03:48,042 INFO  searcher.NutchBean - re-searching for 40 raw hits, query: personal "at a"
2012-10-22 10:03:48,042 INFO  searcher.NutchBean - found 1 raw hits
2012-10-22 10:03:48,042 INFO  searcher.NutchBean - re-searching for 80 raw hits, query: personal "at a"
2012-10-22 10:03:48,043 INFO  searcher.NutchBean - found 1 raw hits
2012-10-22 10:03:48,043 INFO  searcher.NutchBean - re-searching for 160 raw hits, query: personal "at a"
2012-10-22 10:03:48,044 INFO  searcher.NutchBean - found 1 raw hits

Best

Hailong

On Mon, Oct 22, 2012 at 9:43 AM, Hailong Yang 
<[email protected]> wrote:
Dear Stavros,

Thank you very much for your reply. I will go through that log right away.

Best

Hailong


On Mon, Oct 22, 2012 at 4:07 AM, Volos Stavros 
<[email protected]> wrote:
Dear Hailong,

The frontend will ask for the summary of the top documents. A backend node will 
receive a getSummary request for every top document it owns. You can go through 
the logs of the backend node and verify that the node does receive getSummary 
requests.
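
One way to do this check without reading the log by eye is to count the ipc.RPC call lines per method name. The sketch below is only an illustration: it assumes hadoop.log contains lines of the form "INFO  ipc.RPC - Call: getSummary(..." (which requires the Hadoop logger to be at INFO or lower), and the class and method names (RpcCallCounter, countCalls) are made up for this example.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch or Hadoop.
class RpcCallCounter {
    // Matches e.g. "... INFO  ipc.RPC - Call: search(complete)" and captures "search".
    private static final Pattern CALL =
        Pattern.compile("ipc\\.RPC - Call:\\s*(\\w+)\\(");

    // Returns how many times each RPC method name appears in the given log lines.
    static Map<String, Integer> countCalls(List<String> logLines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : logLines) {
            Matcher m = CALL.matcher(line);
            if (m.find()) {
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

Feeding it the lines of the backend's hadoop.log (for example via java.nio.file.Files.readAllLines) and looking for a getSummary entry in the result would show whether the node received any getSummary requests. An empty result may only mean that the call traces are not being logged at the current level.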

Regards,
-Stavros.
________________________________________
From: Hailong Yang [[email protected]]
Sent: Monday, October 22, 2012 10:38 AM
To: Volos Stavros
Cc: [email protected]; Lingjia Tang; Jason Mars
Subject: Re: How to fit the index into the memory for the web search benchmark

Dear Stavros,

I am confused about why we need to bring the segments into memory. I examined 
the log file on the frontend server, which records the queries sent to and the 
responses received from the Nutch server. According to that log, the Nutch 
server only reported how many hits were found in the crawled dataset; it was 
never asked for the details of the page contents. So when orchestrating the 
search, NutchBean never needs to call the getSummary method, which accesses the 
segments to retrieve the page contents. In other words, we don't need to care 
whether the segments fit into memory for this specific web search workload in 
CloudSuite, right? Please correct me if I am wrong.

Best

Hailong


On Sun, Oct 21, 2012 at 9:04 AM, Volos Stavros 
<[email protected]> wrote:
Dear Hailong,

You see I/O activity because the segments don't fit into memory.

I would recommend reducing the size of your index so that indexes+segments 
occupy roughly 16GB.

This is relatively easy to do if you used multiple reducer tasks (during the 
crawling phase) to create multiple partitions.

(See the Notes at http://parsa.epfl.ch/cloudsuite/search.html: the 
mapred.reduce.tasks property determines how many index and segment partitions 
will be created.)
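
As a rough illustration of the sizing arithmetic, the sketch below estimates how many partitions could be kept within a memory budget. It assumes the partitions are of roughly equal size, which may not hold for a real crawl. The partition count of 8 in the usage note is purely hypothetical, and the class and method names are made up for this example.

```java
// Hypothetical helper, not part of Nutch or CloudSuite.
class PartitionBudget {
    // Number of equally sized partitions (indexes + segments) that fit in budgetGb,
    // given that all `partitions` partitions together occupy totalGb.
    static int partitionsToKeep(double totalGb, int partitions, double budgetGb) {
        if (totalGb <= 0 || partitions <= 0 || budgetGb < 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        int fit = (int) Math.floor(budgetGb * partitions / totalGb);
        return Math.min(partitions, fit);
    }
}
```

For example, with the 30GB total (indexes plus segments) reported in this thread, a hypothetical 8 partitions, and a 16GB budget, partitionsToKeep(30.0, 8, 16.0) gives 4 partitions, i.e. about 15GB.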

Regards,
-Stavros.
________________________________________
From: Hailong Yang [[email protected]]
Sent: Friday, October 19, 2012 8:03 PM
To: Volos Stavros
Cc: [email protected]; Lingjia Tang; Jason Mars
Subject: Re: How to fit the index into the memory for the web search benchmark

Dear Stavros,

Thank you for your reply. I understand the data structures required during the 
search. The 6GB is only the size of the actual index (the indexes directory). 
The whole data set, including the segments, amounts to 30GB.

Best

Hailong

On Fri, Oct 19, 2012 at 9:03 AM, Volos Stavros 
<[email protected]> wrote:
Dear Hailong,

There are two components that are used when performing a query against the 
index serving node:
(a) the actual index (under indexes)
(b) segments (under segments)

What exactly is 6GB? Are you including the segments as well?

Regards,
-Stavros.


________________________________________
From: Hailong Yang [[email protected]]
Sent: Wednesday, October 17, 2012 4:51 AM
To: [email protected]
Cc: Lingjia Tang; Jason Mars
Subject: How to fit the index into the memory for the web search benchmark

Hi CloudSuite,

I am experimenting with the web search benchmark, and I am wondering how to fit 
the index into memory in order to avoid unnecessary disk accesses. I have a 6GB 
index crawled from Wikipedia, and the machine has 16GB of RAM. During the 
workload execution, I noticed periodic 2% increases in I/O utilization, and the 
memory used by the Nutch server was always below 500MB. So I guess the whole 
index is not brought into memory by default before serving the search queries, 
right? Could you tell me how to do that, exactly as you did in the Clearing the 
Clouds paper? Thanks!


Best

Hailong
